From Static Sitemaps to Smart Crawlers: A Real Estate Crawling Upgrade

Web crawlers that depend on static XML sitemaps often run into stale or incomplete data. That was exactly the case with our existing system, which relied on a sitemap to fetch real estate listings for a platform. Once the sitemap stopped updating, only old listings were being crawled and new ones were missed entirely.

In this post, we’ll walk through the process of modernizing that crawler by:

  • Switching from sitemap crawling to dynamic HTML search pages
  • Filtering pages by postal code per customer
  • Respecting pagination with smart stopping logic
  • Making the entire solution Django 1.8 + Python 2.7 compatible

Why Move Away From Sitemaps?

Real estate platforms don’t always keep their sitemaps up to date, and sitemaps rarely offer any filtering. Instead of fetching thousands of irrelevant listings, we now crawl filtered search result pages directly, such as:

https://example.com/search/house/for-sale/city/2620?page=1&orderBy=newest
  

This format lets us target exactly what the customer wants: listings for a specific postal code, filtered by property type and transaction, and sorted by newest first.
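Under the hood, building one of these URLs is just string assembly per postcode. Here's a minimal sketch; the helper name and the property_type argument are ours, purely for illustration, and the real management command shown later simply uses a hard-coded format string:

# Illustrative only -- build_search_url() is not part of the real crawler.
from urllib import urlencode  # Python 2.7; on Python 3 this lives in urllib.parse


def build_search_url(postcode, property_type="house", page=1):
    # Keep the query parameters in a fixed order so the output is predictable.
    query = urlencode([("page", page), ("orderBy", "newest")])
    return "https://example.com/search/{0}/for-sale/city/{1}?{2}".format(
        property_type, postcode, query)


print(build_search_url(2620))
# -> https://example.com/search/house/for-sale/city/2620?page=1&orderBy=newest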

How Our Crawler Works Now

  1. Start with customer configurations that define postcodes and contact preferences.
  2. For each postcode, generate a dynamic search URL and save it (if new) to the DataUrl table.
  3. Crawl the first few pages of each URL, stopping if no results are found.
  4. Queue each listing URL into a Queue model for further detail scraping (a rough sketch tying these steps together follows right after this list).
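Tying these steps together, the main crawl pass boils down to a loop over the saved start URLs. The snippet below is only an illustration: crawl_search_pages is a stand-in for the paginated loop covered later in this post.

# Illustration only -- crawl_search_pages() stands in for the pagination loop shown further down.
from crawler.models import DataUrl, Source

source = Source.objects.get(name='immowebnew')
for data_url in DataUrl.objects.filter(source=source):
    crawl_search_pages(data_url.start_url)  # walks page 1..N and queues every listing URL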

Core Model Structure

from django.db import models


class CustomerConfiguration(models.Model):
    """Per-customer crawl settings: which postcodes to watch and how to deliver results."""
    name = models.CharField(max_length=255, unique=True)
    api_base_url = models.URLField(max_length=500)
    api_token = models.CharField(max_length=255)
    post_codes = models.TextField(help_text="Comma-separated list of postcodes")
    contact_types = models.TextField()  # contact preferences for this customer
    is_active = models.BooleanField(default=True)
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)
  
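The DataUrl and Queue models referenced in the steps above aren't shown in this post. As a rough idea of their shape, here's a minimal sketch: the source, start_url and new fields on DataUrl are the ones the management command below relies on; everything else, including the whole Queue model, is an assumption.

# Sketch only -- apart from source/start_url/new on DataUrl, these fields are assumptions.
from django.db import models


class DataUrl(models.Model):
    source = models.ForeignKey('crawler.Source')           # which platform this start URL belongs to
    start_url = models.URLField(max_length=500)            # filtered search page (page=1)
    new = models.BooleanField(default=True)                # not yet picked up by the crawler
    created_at = models.DateTimeField(auto_now_add=True)


class Queue(models.Model):
    listing_url = models.URLField(max_length=500, unique=True)  # detail page still to be scraped
    processed = models.BooleanField(default=False)
    created_at = models.DateTimeField(auto_now_add=True)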

Generating URLs From Customers

The following Django management command loops through all active customers, reads their postcodes, and generates one search URL per postcode, skipping any URL that already exists:

# crawler/management/commands/add_immoweb_urls_all_customers.py
from __future__ import print_function, unicode_literals
import sys
from django.core.management.base import BaseCommand
from crawler.models import DataUrl, Source
from offer_sync.models import CustomerConfiguration

class Command(BaseCommand):
    help = 'Generate listing start URLs for each customer based on postcodes'

    def handle(self, *args, **options):
        # The crawler source must exist before we can attach URLs to it.
        try:
            source = Source.objects.get(name='immowebnew')
        except Source.DoesNotExist:
            print("ERROR: Source 'immowebnew' not found.")
            sys.exit(1)

        customers = CustomerConfiguration.objects.filter(is_active=True)
        if not customers.exists():
            print("No active customers.")
            sys.exit(0)

        for customer in customers:
            raw_postcodes = customer.post_codes.strip()
            if not raw_postcodes:
                print("⚠ No postcodes for customer: %s" % customer.name)
                continue

            print("📦 Customer: %s" % customer.name)

            # post_codes is a comma-separated string; keep only numeric entries.
            postcodes = [p.strip() for p in raw_postcodes.split(",") if p.strip().isdigit()]
            for postcode in postcodes:
                url = u"https://example.com/search/house/for-sale/city/{0}?countries=BE&page=1&orderBy=newest".format(postcode)

                try:
                    # Only register each start URL once per source.
                    if not DataUrl.objects.filter(start_url=url, source=source).exists():
                        DataUrl.objects.create(
                            source=source,
                            start_url=url,
                            new=True
                        )
                        print("  ✅ Added: %s" % url)
                    else:
                        print("  ⏩ Already exists: %s" % url)
                except Exception as e:
                    print("  ❌ Error on %s: %s" % (url, e))

Smart Pagination in the Crawler

To avoid endless paging and platform throttling, our crawler includes a loop like this:

from time import sleep

# get_response, build_url, extract_listings and add_to_queue are project helpers:
# HTTP fetch, URL with the right page number, HTML parsing, and Queue insertion.
page = 1
max_pages = 10  # hard cap so a broken empty-page check can never loop forever
while page <= max_pages:
    response = get_response(build_url(base_url, page))
    listings = extract_listings(response.text)

    if not listings:
        print("No more listings at page %d" % page)
        break

    for url in listings:
        add_to_queue(url)

    page += 1
    sleep(3)  # Be polite: pause between pages to avoid hammering the platform
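The four helpers used in that loop live elsewhere in our crawler and aren't shown here. For readers who want a starting point, here is a minimal sketch assuming requests for HTTP and BeautifulSoup for parsing; the CSS selector and the Queue usage are placeholders, not the platform's real markup or our real schema.

# Minimal sketch only -- the selector and Queue usage are assumptions, not the production code.
import re
import requests
from bs4 import BeautifulSoup
from crawler.models import Queue


def build_url(base_url, page):
    # Swap whatever page=N the saved start URL carries for the requested page number.
    return re.sub(r"page=\d+", "page=%d" % page, base_url)


def get_response(url):
    # Plain GET with a browser-like User-Agent and a timeout.
    return requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)


def extract_listings(html):
    soup = BeautifulSoup(html, "html.parser")
    # Placeholder selector: inspect the real search results markup to find the right one.
    return [a["href"] for a in soup.select("a.card__title-link") if a.get("href")]


def add_to_queue(url):
    # get_or_create keeps re-crawls idempotent (assumes a listing_url field on Queue).
    Queue.objects.get_or_create(listing_url=url)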

Final Thoughts

Migrating away from sitemaps and toward dynamic, filtered, customer-aware crawling has made our system faster, more relevant, and easier to maintain. Customers get results they care about, and we avoid wasting bandwidth and processing on irrelevant data.

This pattern can be applied to any structured platform where listings are filterable by region, category, or date. By combining Django models, smart URL generation, and crawl queues, we built a modern, scalable solution.
