In maintaining a job listing platform, data cleanliness is key to providing a consistent and professional user experience.
One recurring issue on our platform involved job descriptions (`Job.rewrite`) filled with unnecessary anchor (`<a>`) tags and external URLs—often copied directly from third-party sources.
This post walks through the Django management command we used to clean up these descriptions. It also reflects on what went well and what could be improved.
🛠 The Problem
Some job descriptions included:
- `<a>` tags linking to external sites (sometimes broken).
- Raw URLs pasted directly in the content.
- Malformed or missing data in `Job.rewrite`.
We needed a simple, repeatable way to sanitize these fields without manually editing thousands of records.
🔄 The Approach
We built a custom Django management command that:
- Loops over all jobs with `status=0`.
- Removes all `<a>` tags using BeautifulSoup.
- Strips out URLs using a regular expression.
- Handles edge cases like `None` or empty fields.
- Logs the full traceback if something goes wrong.
Here’s the command script we used:
```python
import re
import traceback

from bs4 import BeautifulSoup
from django.core.management.base import BaseCommand

from job.models import Job


class Command(BaseCommand):
    help = "Cleans job descriptions by removing <a> tags and URLs."

    def handle(self, *args, **options):
        jobs = Job.objects.filter(status__exact=0)
        for job in jobs:
            try:
                self.clean_job_description(job)
            except Exception:
                print(f'Error processing job {job.id} - {job.title}')
                traceback.print_exc()

    def clean_job_description(self, job):
        if not job.rewrite:
            print(f"Skipping job {job.id} - empty rewrite field")
            return

        try:
            text_without_tags = self.remove_tags(job.rewrite)
        except Exception:
            print(f"Error in remove_tags() for job {job.id}")
            traceback.print_exc()
            return

        try:
            text_without_urls = self.remove_urls(text_without_tags)
        except Exception:
            print(f"Error in remove_urls() for job {job.id}")
            traceback.print_exc()
            return

        job.rewrite = text_without_urls
        job.save()
        print(f"Cleaned job ID {job.id}")

    def remove_tags(self, html):
        if html is None:
            raise ValueError("HTML content is None")
        soup = BeautifulSoup(html, 'html.parser')
        # decompose() deletes the tag and everything inside it,
        # including the anchor text.
        for a_tag in soup.find_all('a'):
            a_tag.decompose()
        return str(soup)

    def remove_urls(self, text):
        # Matches http(s) URLs and bare www. links.
        url_pattern = r'https?://\S+|www\.\S+'
        return re.sub(url_pattern, '', text)
```
Alternatively, `remove_urls()` can use a broader, commented pattern that also catches bare domains and FTP links:

```python
def remove_urls(self, text):
    # This pattern matches:
    # - http://, https://, ftp://
    # - www.example.com
    # - example.com or sub.example.co.uk
    # - domains with paths, query strings, fragments, or ports
    url_pattern = r"""(?xi)
        \b                                    # Word boundary
        (                                     # Capture group
            (?:http|https|ftp)://             # Match http, https, or ftp
            [\w.-]+                           # Domain or IP
            (?::\d+)?                         # Optional port
            (?:/[^\s]*)?                      # Optional path
            |
            www\.[\w.-]+(?:/[^\s]*)?          # www.example.com with optional path
            |
            [\w.-]+\.(?:[a-z]{2,})(?:/[^\s]*)?  # example.com, sub.example.org/path
        )
    """
    return re.sub(url_pattern, '', text)
```
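For Django to discover the command, the script has to live in a `management/commands/` package inside the app. Assuming it is saved as `job/management/commands/clean_job_descriptions.py` (the filename is our choice), it runs with `python manage.py clean_job_descriptions`.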
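As a quick sanity check outside Django, here is the same two-step cleanup re-implemented as standalone functions on a sample string (the sample HTML is made up):

```python
import re

from bs4 import BeautifulSoup


def remove_tags(html):
    soup = BeautifulSoup(html, 'html.parser')
    for a_tag in soup.find_all('a'):
        a_tag.decompose()  # removes the tag and its anchor text
    return str(soup)


def remove_urls(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)


sample = '<p>Apply at <a href="https://example.com/jobs">our site</a> or visit www.example.com today.</p>'
print(remove_urls(remove_tags(sample)))
# -> <p>Apply at  or visit  today.</p>
```

Note that `decompose()` drops the anchor text along with the tag; if the link text should survive, `unwrap()` would keep it and strip only the markup.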
✅ What Went Well
- Separation of logic: Each task (tag removal, URL removal) lives in its own function.
- Resilience: We added `try/except` blocks and `traceback.print_exc()` for detailed error logs.
- Simplicity: The command is easy to run on demand without extra setup.
🔍 What Could Be Better
- Logging: Use the `logging` module instead of `print` for production readiness (see the sketch after this list).
- Unit Testing: Add tests for `remove_tags()` and `remove_urls()` (see the test sketch below).
- Performance: For large datasets, consider batch processing (sketched below) or asynchronous execution with Celery.
- Better Filtering: Exclude empty or null descriptions directly in the query: `Job.objects.filter(status=0).exclude(rewrite__isnull=True).exclude(rewrite__exact='')`.
- HTML Tidy-up: Add post-cleanup formatting like whitespace normalization (sketched below) or HTML validation.
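A minimal sketch of the logging swap, assuming a module-level logger and Django's default logging configuration; only the error-handling lines change:

```python
import logging

from django.core.management.base import BaseCommand

from job.models import Job

logger = logging.getLogger(__name__)


class Command(BaseCommand):
    help = "Cleans job descriptions by removing <a> tags and URLs."

    def handle(self, *args, **options):
        # clean_job_description() is the helper shown earlier.
        for job in Job.objects.filter(status__exact=0):
            try:
                self.clean_job_description(job)
            except Exception:
                # logger.exception() logs at ERROR level and appends the
                # current traceback, replacing print() + traceback.print_exc().
                logger.exception("Error processing job %s - %s", job.id, job.title)
```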
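The two helpers are pure functions of their input, so they can be tested without touching the database. A sketch using `SimpleTestCase`; the import path assumes the command file is named `clean_job_descriptions.py`:

```python
from django.test import SimpleTestCase

# Assumed location/name of the command module shown above.
from job.management.commands.clean_job_descriptions import Command


class CleaningHelperTests(SimpleTestCase):
    def setUp(self):
        self.cmd = Command()

    def test_remove_tags_strips_anchors(self):
        html = '<p>Apply <a href="https://x.test">here</a> now</p>'
        self.assertEqual(self.cmd.remove_tags(html), '<p>Apply  now</p>')

    def test_remove_tags_rejects_none(self):
        with self.assertRaises(ValueError):
            self.cmd.remove_tags(None)

    def test_remove_urls_strips_links(self):
        text = 'See https://example.com/jobs or www.example.com today'
        self.assertEqual(self.cmd.remove_urls(text), 'See  or  today')
```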
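For the performance point, `iterator()` streams rows instead of loading the whole queryset into memory, and `bulk_update()` replaces one UPDATE per row with one per batch. A sketch of a batched `handle()`, reusing the helpers above (the batch size of 500 is arbitrary):

```python
# Drop-in replacement for handle() inside the Command class above.
def handle(self, *args, **options):
    batch = []
    # iterator() streams results from the database cursor in chunks.
    for job in Job.objects.filter(status__exact=0).iterator(chunk_size=500):
        if not job.rewrite:
            continue
        job.rewrite = self.remove_urls(self.remove_tags(job.rewrite))
        batch.append(job)
        if len(batch) >= 500:
            Job.objects.bulk_update(batch, ['rewrite'])  # one query per batch
            batch.clear()
    if batch:  # flush the final partial batch
        Job.objects.bulk_update(batch, ['rewrite'])
```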
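And for the tidy-up point, a small hypothetical helper could collapse the double spaces left behind where tags and URLs were removed:

```python
import re

def normalize_whitespace(text):
    # Collapse runs of spaces/tabs left behind by the removals.
    text = re.sub(r'[ \t]{2,}', ' ', text)
    # Drop spaces that now sit directly before punctuation.
    text = re.sub(r' +([.,;:!?])', r'\1', text)
    return text.strip()
```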
📎 Final Thoughts
Management commands in Django are perfect for these types of maintenance tasks. If you work with user-generated content, don’t wait for a manual cleanup—automate it, track it, and iterate. A clean database makes everyone’s job easier—from search indexing to frontend rendering.