How We Extracted and Verified Job IDs from URLs Using Python and PostgreSQL

Recently, we needed to verify which job listings from a list of URLs were marked as payed = TRUE in our PostgreSQL database. This was part of a quality control check to ensure our archive was up to date. Here’s how we approached it using a Python script.

šŸ” The Problem

We had a text file (urls.txt) with lines like these:

https://www.domain.com/job/viewjob/20734193/...
https://www.domain.com/job/viewjob/25975411/...

Each URL contains a unique job ID, and we wanted to check which of those job IDs were already marked as paid in our PostgreSQL table called job_jobarchive.

šŸ The Python Script

Here’s the final script we used:

import re
import psycopg2

def extract_ids_from_file(file_path):
    pattern = re.compile(r'/viewjob/(\\d+)/')
    with open(file_path, 'r', encoding='utf-8') as file:
        return [match.group(1) for line in file if (match := pattern.search(line))]

def check_payed_status(job_ids, db_config):
    query = """
        SELECT job_id 
        FROM job_jobarchive 
        WHERE job_id = %s AND payed = TRUE;
    """
    paid_jobs = set()

    try:
        with psycopg2.connect(**db_config) as conn:
            with conn.cursor() as cur:
                for job_id in job_ids:
                    cur.execute(query, (job_id,))
                    result = cur.fetchone()
                    if result:
                        paid_jobs.add(result[0])
    except Exception as e:
        print("Database error:", e)
    
    return paid_jobs

if __name__ == "__main__":
    file_path = 'urls.txt'
    db_config = {
        'dbname': '',
        'user': '',
        'password': '',
        'host': 'localhost',
        'port': '5432'
    }

    job_ids = extract_ids_from_file(file_path)
    paid_jobs = check_payed_status(job_ids, db_config)

    total = len(job_ids)
    paid = len(paid_jobs)
    unpaid_ids = [jid for jid in job_ids if int(jid) not in paid_jobs]

    print(f"\\nšŸ“Š Summary Report:")
    print(f"---------------------")
    print(f"Total URLs           : {total}")
    print(f"✅ Paid job IDs       : {paid}")
    print(f"❌ Unpaid job IDs     : {len(unpaid_ids)}")

    if unpaid_ids:
        print(f"\\nList of unpaid job IDs:")
        for uid in unpaid_ids:
            print(uid)

šŸ’” The Bug We Discovered

At first, our script was showing something odd:

  • Total job IDs: 1000
  • Paid job IDs: 1000
  • Unpaid job IDs: 1000

This was clearly wrong. The mistake was due to a type mismatch. The job IDs from the URLs were strings, but the job IDs from the database were integers. When comparing them, the match failed silently.

Fix: We converted the string IDs to integers before comparing:

unpaid_ids = [jid for jid in job_ids if int(jid) not in paid_jobs]

✅ The Result

After fixing the type mismatch, the script correctly showed how many jobs were paid and unpaid out of the list. This made it much easier to verify our data integrity.

šŸ“¦ Requirements

To run this script, you’ll need:

  • Python 3
  • psycopg2 installed via pip3 install psycopg2-binary
  • A local urls.txt file
  • Valid PostgreSQL credentials

šŸ‘Ø‍šŸ’» Conclusion

This simple tool saved us hours of manual checking. It also shows how easily subtle bugs (like type mismatches) can throw off your results. We hope this helps someone facing a similar challenge.

Comments