Recently, we needed to verify which job listings from a list of URLs were marked as payed = TRUE
in our PostgreSQL database. This was part of a quality control check to ensure our archive was up to date. Here’s how we approached it using a Python script.
š The Problem
We had a text file (urls.txt
) with lines like these:
https://www.domain.com/job/viewjob/20734193/...
https://www.domain.com/job/viewjob/25975411/...
Each URL contains a unique job ID, and we wanted to check which of those job IDs were already marked as paid in our PostgreSQL table called job_jobarchive
.
š The Python Script
Here’s the final script we used:
import re
import psycopg2
def extract_ids_from_file(file_path):
pattern = re.compile(r'/viewjob/(\\d+)/')
with open(file_path, 'r', encoding='utf-8') as file:
return [match.group(1) for line in file if (match := pattern.search(line))]
def check_payed_status(job_ids, db_config):
query = """
SELECT job_id
FROM job_jobarchive
WHERE job_id = %s AND payed = TRUE;
"""
paid_jobs = set()
try:
with psycopg2.connect(**db_config) as conn:
with conn.cursor() as cur:
for job_id in job_ids:
cur.execute(query, (job_id,))
result = cur.fetchone()
if result:
paid_jobs.add(result[0])
except Exception as e:
print("Database error:", e)
return paid_jobs
if __name__ == "__main__":
file_path = 'urls.txt'
db_config = {
'dbname': '',
'user': '',
'password': '',
'host': 'localhost',
'port': '5432'
}
job_ids = extract_ids_from_file(file_path)
paid_jobs = check_payed_status(job_ids, db_config)
total = len(job_ids)
paid = len(paid_jobs)
unpaid_ids = [jid for jid in job_ids if int(jid) not in paid_jobs]
print(f"\\nš Summary Report:")
print(f"---------------------")
print(f"Total URLs : {total}")
print(f"✅ Paid job IDs : {paid}")
print(f"❌ Unpaid job IDs : {len(unpaid_ids)}")
if unpaid_ids:
print(f"\\nList of unpaid job IDs:")
for uid in unpaid_ids:
print(uid)
š” The Bug We Discovered
At first, our script was showing something odd:
- Total job IDs: 1000
- Paid job IDs: 1000
- Unpaid job IDs: 1000
This was clearly wrong. The mistake was due to a type mismatch. The job IDs from the URLs were strings
, but the job IDs from the database were integers
. When comparing them, the match failed silently.
Fix: We converted the string IDs to integers before comparing:
unpaid_ids = [jid for jid in job_ids if int(jid) not in paid_jobs]
✅ The Result
After fixing the type mismatch, the script correctly showed how many jobs were paid and unpaid out of the list. This made it much easier to verify our data integrity.
š¦ Requirements
To run this script, you’ll need:
- Python 3
psycopg2
installed viapip3 install psycopg2-binary
- A local
urls.txt
file - Valid PostgreSQL credentials
šØš» Conclusion
This simple tool saved us hours of manual checking. It also shows how easily subtle bugs (like type mismatches) can throw off your results. We hope this helps someone facing a similar challenge.
Comments
Post a Comment