Web crawlers that depend on static XML sitemaps often run into stale or incomplete data. That was exactly the case with our existing system, which relied on a sitemap to fetch real estate listings from a platform. Once the sitemap stopped updating, only old listings were being crawled.
In this post, we’ll walk through the process of modernizing that crawler by:
- Switching from sitemap crawling to dynamic HTML search pages
- Filtering pages by postal code per customer
- Respecting pagination with smart stopping logic
- Making the entire solution Django 1.8 + Python 2.7 compatible
Why Move Away From Sitemaps?
Real estate platforms don’t always keep sitemaps updated. Plus, sitemaps rarely offer filtered data. Instead of fetching thousands of unnecessary listings, we use filtered search result pages directly, such as:
https://example.com/search/house/for-sale/city/2620?page=1&orderBy=newest
This format lets us target exactly what the customer wants: specific postal codes, filtered by type and location.
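If you want to see the moving parts, they are just the postcode in the path plus a couple of query parameters. A minimal illustration (not the production code, which comes later in this post):

from urllib import urlencode  # Python 2.7; urllib.parse.urlencode on Python 3

postcode = '2620'
params = urlencode([('page', 1), ('orderBy', 'newest')])
url = 'https://example.com/search/house/for-sale/city/%s?%s' % (postcode, params)
# -> https://example.com/search/house/for-sale/city/2620?page=1&orderBy=newest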
How Our Crawler Works Now
- Start with customer configurations that define postcodes and contact preferences.
- For each postcode, generate a dynamic search URL and save it (if new) to the DataUrl table.
- Crawl the first few pages of each URL, stopping if no results are found.
- Queue each listing URL into a Queue model for further detail scraping.
Core Model Structure
class CustomerConfiguration(models.Model):
    name = models.CharField(max_length=255, unique=True)
    api_base_url = models.URLField(max_length=500)
    api_token = models.CharField(max_length=255)
    post_codes = models.TextField(help_text="Comma-separated list of postcodes")
    contact_types = models.TextField()
    is_active = models.BooleanField(default=True)
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)
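The command below also touches a Source, a DataUrl, and a Queue model. Their exact definitions aren't shown in this post; a minimal sketch consistent with how the command uses them might look like this (field names beyond start_url, source, and new are assumptions):

class Source(models.Model):
    # One row per crawled platform, looked up by name (e.g. 'immowebnew').
    name = models.CharField(max_length=100, unique=True)

class DataUrl(models.Model):
    # A filtered search start URL generated for a customer postcode.
    source = models.ForeignKey(Source)
    start_url = models.URLField(max_length=500)
    new = models.BooleanField(default=True)

class Queue(models.Model):
    # Individual listing URLs waiting for the detail scraper.
    url = models.URLField(max_length=500, unique=True)
    created_at = models.DateTimeField(auto_now_add=True)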
Generating URLs From Customers
The following Django command loops through all active customers, reads their postcodes, and generates appropriate URLs for each:
# -*- coding: utf-8 -*-
# crawler/management/commands/add_immoweb_urls_all_customers.py
from __future__ import unicode_literals

import sys

from django.core.management.base import BaseCommand

from crawler.models import DataUrl, Source
from offer_sync.models import CustomerConfiguration


class Command(BaseCommand):
    help = 'Generate listing start URLs for each customer based on postcodes'

    def handle(self, *args, **options):
        try:
            source = Source.objects.get(name='immowebnew')
        except Source.DoesNotExist:
            print("ERROR: Source 'immowebnew' not found.")
            sys.exit(1)

        customers = CustomerConfiguration.objects.filter(is_active=True)
        if not customers.exists():
            print("No active customers.")
            sys.exit(0)

        for customer in customers:
            raw_postcodes = customer.post_codes.strip()
            if not raw_postcodes:
                print("⚠ No postcodes for customer: %s" % customer.name)
                continue

            print("📦 Customer: %s" % customer.name)
            # Keep only numeric postcodes; ignore empty entries and typos.
            postcodes = [p.strip() for p in raw_postcodes.split(",") if p.strip().isdigit()]

            for postcode in postcodes:
                url = u"https://example.com/search/house/for-sale/city/{0}?countries=BE&page=1&orderBy=newest".format(postcode)
                try:
                    if not DataUrl.objects.filter(start_url=url, source=source).exists():
                        DataUrl.objects.create(
                            source=source,
                            start_url=url,
                            new=True
                        )
                        print("  ✅ Added: %s" % url)
                    else:
                        print("  ⏩ Already exists: %s" % url)
                except Exception as e:
                    print("  ❌ Error on %s: %s" % (url, e))
Smart Pagination in the Crawler
To avoid endless paging and platform throttling, our crawler includes a loop like this:
from time import sleep

page = 1
max_pages = 10  # hard cap so a buggy page structure can't keep us paging forever

while page <= max_pages:
    response = get_response(build_url(base_url, page))
    listings = extract_listings(response.text)

    if not listings:
        print("No more listings at page %d" % page)
        break

    for url in listings:
        add_to_queue(url)

    page += 1
    sleep(3)  # Be polite
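The helpers referenced above (build_url, get_response, extract_listings, add_to_queue) aren't shown in the post. Here is one minimal way they could look, assuming requests for HTTP and a regex-based link extractor; the pattern is a placeholder that would need adapting to the platform's real markup:

import re
import requests

from crawler.models import Queue  # the same Queue model the listings are queued into

def build_url(base_url, page):
    # Swap the page number in the filtered search URL.
    return re.sub(r'page=\d+', 'page=%d' % page, base_url)

def get_response(url):
    # Plain GET with a browser-like User-Agent and a timeout.
    return requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)

def extract_listings(html):
    # Naive example: pull absolute links that look like listing detail pages.
    return re.findall(r'href="(https://example\.com/[^"]*/for-sale/[^"]+)"', html)

def add_to_queue(url):
    # Store the detail URL for the second-stage scraper, skipping duplicates.
    Queue.objects.get_or_create(url=url)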
Final Thoughts
Migrating away from sitemaps and toward dynamic, filtered, customer-aware crawling has made our system faster, more relevant, and easier to maintain. Customers get results they care about, and we avoid wasting bandwidth and processing on irrelevant data.
This pattern can be applied to any structured platform where listings are filterable by region, category, or date. By combining Django models, smart URL generation, and crawl queues, we built a modern, scalable solution.