In maintaining a job listing platform, data cleanliness is key to providing a consistent and professional user experience.
One recurring issue on our platform involved job descriptions (`Job.rewrite`) filled with unnecessary anchor (`<a>`) tags and external URLs—often copied directly from third-party sources.
This post walks through the Django management command we used to clean up these descriptions. It also reflects on what went well and what could be improved.
🛠 The Problem
Some job descriptions included:
- `<a>` tags linking to external sites (sometimes broken).
- Raw URLs pasted directly in the content.
- Malformed or missing data in `Job.rewrite`.
We needed a simple, repeatable way to sanitize these fields without manually editing thousands of records.
🔄 The Approach
We built a custom Django management command that:
- Loops over all jobs with `status=0`.
- Removes all `<a>` tags using BeautifulSoup.
- Strips out URLs using a regular expression.
- Handles edge cases like `None` or empty fields.
- Logs the full traceback if something goes wrong.
Here’s the command script we used:
```python
import re
import traceback

from bs4 import BeautifulSoup
from django.core.management.base import BaseCommand

from job.models import Job


class Command(BaseCommand):
    help = "Cleans job descriptions by removing <a> tags and URLs."

    def handle(self, *args, **options):
        jobs = Job.objects.filter(status__exact=0)
        for job in jobs:
            try:
                self.clean_job_description(job)
            except Exception:
                print(f'Error processing job {job.id} - {job.title}')
                traceback.print_exc()

    def clean_job_description(self, job):
        if not job.rewrite:
            print(f"Skipping job {job.id} - empty rewrite field")
            return

        try:
            text_without_tags = self.remove_tags(job.rewrite)
        except Exception:
            print(f"Error in remove_tags() for job {job.id}")
            traceback.print_exc()
            return

        try:
            text_without_urls = self.remove_urls(text_without_tags)
        except Exception:
            print(f"Error in remove_urls() for job {job.id}")
            traceback.print_exc()
            return

        job.rewrite = text_without_urls
        job.save()
        print(f"Cleaned job ID {job.id}")

    def remove_tags(self, html):
        if html is None:
            raise ValueError("HTML content is None")
        soup = BeautifulSoup(html, 'html.parser')
        # decompose() deletes the tag and everything inside it,
        # including the anchor text.
        for a_tag in soup.find_all('a'):
            a_tag.decompose()
        return str(soup)

    def remove_urls(self, text):
        # Matches http(s) URLs and bare www. links.
        url_pattern = r'https?://\S+|www\.\S+'
        return re.sub(url_pattern, '', text)
```
Alternatively, `remove_urls()` can use a broader, commented pattern that also catches bare domains and FTP links:

```python
def remove_urls(self, text):
    # This pattern matches:
    # - http://, https://, ftp://
    # - www.example.com
    # - example.com or sub.example.co.uk
    # - domains with paths, query strings, fragments, or ports
    url_pattern = r"""(?xi)
        \b                                    # Word boundary
        (                                     # Capture group
            (?:http|https|ftp)://             # Match http, https, or ftp
            [\w.-]+                           # Domain or IP
            (?::\d+)?                         # Optional port
            (?:/[^\s]*)?                      # Optional path
            |
            www\.[\w.-]+(?:/[^\s]*)?          # www.example.com with optional path
            |
            [\w.-]+\.(?:[a-z]{2,})(?:/[^\s]*)?  # example.com, sub.example.org/path
        )
    """
    return re.sub(url_pattern, '', text)
```
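For Django to discover the command, the script has to live in a `management/commands/` package inside the app. Assuming it is saved as `job/management/commands/clean_job_descriptions.py` (the filename is our choice), it runs with `python manage.py clean_job_descriptions`.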
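As a quick sanity check outside Django, here is the same two-step cleanup re-implemented as standalone functions on a sample string (the sample HTML is made up):

```python
import re

from bs4 import BeautifulSoup


def remove_tags(html):
    soup = BeautifulSoup(html, 'html.parser')
    for a_tag in soup.find_all('a'):
        a_tag.decompose()  # removes the tag and its anchor text
    return str(soup)


def remove_urls(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)


sample = '<p>Apply at <a href="https://example.com/jobs">our site</a> or visit www.example.com today.</p>'
print(remove_urls(remove_tags(sample)))
# -> <p>Apply at  or visit  today.</p>
```

Note that `decompose()` drops the anchor text along with the tag; if the link text should survive, `unwrap()` would keep it and strip only the markup.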
✅ What Went Well
- Separation of logic: Each task (tag removal, URL removal) lives in its own function.
- Resilience: We added `try/except` blocks and `traceback.print_exc()` for detailed error logs.
- Simplicity: The command is easy to run on demand without extra setup.
🔍 What Could Be Better
- Logging: Use the `logging` module instead of `print` for production readiness (see the sketch after this list).
- Unit Testing: Add tests for `remove_tags()` and `remove_urls()` (see the test sketch below).
- Performance: For large datasets, consider batch processing (sketched below) or asynchronous execution with Celery.
- Better Filtering: Exclude empty or null descriptions directly in the query: `Job.objects.filter(status=0).exclude(rewrite__isnull=True).exclude(rewrite__exact='')`.
- HTML Tidy-up: Add post-cleanup formatting like whitespace normalization (sketched below) or HTML validation.
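A minimal sketch of the logging swap, assuming a module-level logger and Django's default logging configuration; only the error-handling lines change:

```python
import logging

from django.core.management.base import BaseCommand

from job.models import Job

logger = logging.getLogger(__name__)


class Command(BaseCommand):
    help = "Cleans job descriptions by removing <a> tags and URLs."

    def handle(self, *args, **options):
        # clean_job_description() is the helper shown earlier.
        for job in Job.objects.filter(status__exact=0):
            try:
                self.clean_job_description(job)
            except Exception:
                # logger.exception() logs at ERROR level and appends the
                # current traceback, replacing print() + traceback.print_exc().
                logger.exception("Error processing job %s - %s", job.id, job.title)
```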
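The two helpers are pure functions of their input, so they can be tested without touching the database. A sketch using `SimpleTestCase`; the import path assumes the command file is named `clean_job_descriptions.py`:

```python
from django.test import SimpleTestCase

# Assumed location/name of the command module shown above.
from job.management.commands.clean_job_descriptions import Command


class CleaningHelperTests(SimpleTestCase):
    def setUp(self):
        self.cmd = Command()

    def test_remove_tags_strips_anchors(self):
        html = '<p>Apply <a href="https://x.test">here</a> now</p>'
        self.assertEqual(self.cmd.remove_tags(html), '<p>Apply  now</p>')

    def test_remove_tags_rejects_none(self):
        with self.assertRaises(ValueError):
            self.cmd.remove_tags(None)

    def test_remove_urls_strips_links(self):
        text = 'See https://example.com/jobs or www.example.com today'
        self.assertEqual(self.cmd.remove_urls(text), 'See  or  today')
```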
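For the performance point, `iterator()` streams rows instead of loading the whole queryset into memory, and `bulk_update()` replaces one UPDATE per row with one per batch. A sketch of a batched `handle()`, reusing the helpers above (the batch size of 500 is arbitrary):

```python
# Drop-in replacement for handle() inside the Command class above.
def handle(self, *args, **options):
    batch = []
    # iterator() streams results from the database cursor in chunks.
    for job in Job.objects.filter(status__exact=0).iterator(chunk_size=500):
        if not job.rewrite:
            continue
        job.rewrite = self.remove_urls(self.remove_tags(job.rewrite))
        batch.append(job)
        if len(batch) >= 500:
            Job.objects.bulk_update(batch, ['rewrite'])  # one query per batch
            batch.clear()
    if batch:  # flush the final partial batch
        Job.objects.bulk_update(batch, ['rewrite'])
```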
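And for the tidy-up point, a small hypothetical helper could collapse the double spaces left behind where tags and URLs were removed:

```python
import re

def normalize_whitespace(text):
    # Collapse runs of spaces/tabs left behind by the removals.
    text = re.sub(r'[ \t]{2,}', ' ', text)
    # Drop spaces that now sit directly before punctuation.
    text = re.sub(r' +([.,;:!?])', r'\1', text)
    return text.strip()
```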
📎 Final Thoughts
Management commands in Django are perfect for these types of maintenance tasks. If you work with user-generated content, don’t wait for a manual cleanup—automate it, track it, and iterate. A clean database makes everyone’s job easier—from search indexing to frontend rendering.