When you’re working with thousands or even millions of job descriptions, it’s essential to extract accurate, relevant keywords that allow you to:
- Improve job search functionality (e.g., Algolia, Elasticsearch)
- Cluster similar job types
- Enable smarter filtering in your Django or backend application
🤔 Why Not Gensim or Traditional NLP?
While Gensim is great for traditional NLP tasks like TF-IDF and topic modeling (LDA), it struggles with contextual understanding. For instance, it doesn’t know that “Django” and “Python web framework” are closely related. It also cannot extract meaningful multi-word phrases without significant tweaking.
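For contrast, here is roughly what keyword scoring looks like with Gensim's TF-IDF: purely statistical token weighting, with no notion of meaning. A minimal sketch over toy documents:

```python
from gensim import corpora, models

# Two toy "job descriptions", pre-tokenized
docs = [
    "senior python developer with django and rest apis".split(),
    "java engineer with spring and sql experience".split(),
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
tfidf = models.TfidfModel(corpus)

# Top-weighted single tokens for the first document: frequency statistics only,
# so "django" and "python web framework" are unrelated as far as TF-IDF knows
for token_id, score in sorted(tfidf[corpus[0]], key=lambda x: -x[1])[:5]:
    print(dictionary[token_id], round(score, 2))
```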
✅ Meet KeyBERT
KeyBERT is a simple yet powerful tool built on top of BERT (or any transformer model) to extract keywords and keyphrases that actually make semantic sense.
✨ Benefits of KeyBERT:
- Context-aware (understands meaning)
- Extracts multi-word keyphrases
- No training required – works out of the box
- Highly customizable with domain-specific transformer models
⚙️ Installation
```bash
pip install keybert
pip install sentence-transformers
```
💻 Example: Extracting Keywords from a Job Description
Here’s how to use KeyBERT to extract keywords from a job description:
```python
from keybert import KeyBERT

# Load a compact and fast model
kw_model = KeyBERT(model='all-MiniLM-L6-v2')

# Example job description
job_desc = "We are looking for a senior Python developer with strong Django skills and experience with REST APIs and PostgreSQL."

# Extract keywords
keywords = kw_model.extract_keywords(
    job_desc,
    keyphrase_ngram_range=(1, 3),
    stop_words='english',
    top_n=10
)

# Output
for kw, score in keywords:
    print(f"{kw} ({score:.2f})")
```
🧾 Output Example
This might give you output like:
- python developer (0.84)
- django skills (0.79)
- rest apis (0.77)
- senior python developer (0.75)
- postgresql (0.72)
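Because KeyBERT wraps any sentence-transformers encoder, swapping in a domain-specific or multilingual model is a one-line change. A sketch; the model name below is only an example:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Any sentence-transformers model works; this multilingual one is just an example
st_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
kw_model = KeyBERT(model=st_model)

keywords = kw_model.extract_keywords(job_desc, keyphrase_ngram_range=(1, 3), top_n=10)
```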
💾 Saving to Django
If you’re using Django, you can save the keywords to a model field like this:
```python
# models.py
from django.db import models

class Job(models.Model):
    title = models.CharField(max_length=255)
    description = models.TextField()
    keywords = models.TextField(blank=True)
```

```python
# After keyword extraction
job.keywords = ', '.join([kw for kw, _ in keywords])
job.save()
```
📊 KeyBERT vs Gensim vs JACE
| Tool | Context-Aware | Multi-word | Ease of Use | Quality | Best For |
|------|---------------|------------|-------------|---------|----------|
| KeyBERT | ✅ Yes | ✅ Yes | ✅ Easy | ✅ High | Production-ready keyword extraction |
| Gensim (TF-IDF / LDA) | ❌ No | ❌ Rarely | ✅ Easy | ⚠️ Medium | Topic modeling, low-resource systems |
| JACE / Jaseci NLP | ✅ Partial | ❓ Unknown | ⚠️ Medium | ❓ Experimental | Complex NLP pipelines, action logic |
📝 Final Thoughts
If your goal is to extract clean, meaningful, and high-quality keywords from job descriptions — KeyBERT is one of the best solutions on the market. It’s easy to implement, works well with transformer models, and is flexible enough to use at scale.
Whether you’re using this in a Django backend, for improving Algolia/Elasticsearch search, or clustering similar jobs, it’ll give you a solid, modern foundation.
🧠 Pro Tip: For huge datasets, process jobs in batches with Celery or FastAPI and cache the results in your DB.
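A minimal sketch of that batching idea with Celery is shown below; the app path and task wiring are assumptions, not part of the original pipeline:

```python
# tasks.py -- Celery batching sketch; assumes a configured Celery app
# and the Job model from above (the `myapp` path is hypothetical).
from celery import shared_task
from keybert import KeyBERT

# Load the model once per worker process, not once per task
kw_model = KeyBERT(model='all-MiniLM-L6-v2')

@shared_task
def extract_keywords_batch(job_ids):
    from myapp.models import Job  # hypothetical app path
    for job in Job.objects.filter(id__in=job_ids):
        results = kw_model.extract_keywords(
            job.description, keyphrase_ngram_range=(1, 3), top_n=10
        )
        job.keywords = ', '.join(kw for kw, _ in results)
        job.save(update_fields=['keywords'])

# Dispatch in chunks of 100, e.g. from a shell or a scheduler:
# ids = list(Job.objects.values_list('id', flat=True))
# for i in range(0, len(ids), 100):
#     extract_keywords_batch.delay(ids[i:i + 100])
```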
🚀 Improving Keyword Extraction for Job Descriptions with KeyBERT
While working on extracting keywords from Dutch job descriptions using KeyBERT, I faced a common issue: the model was returning noisy, irrelevant, or overly generic keywords such as:
```python
[
    ('mobiele verpleegkundige', 0.39),
    ('ouder samenleving', 0.20),
    ('bijdragend positief', 0.13),
    ('wij zoek', 0.07),
    ('werkuren maaltijdcheques', -0.003),
    ('vragen diploma', -0.09)
]
```
Many of these phrases are too generic or grammatically meaningless. Let's walk through how to clean this up and produce only relevant, high-quality keywords.
🧐 What's Causing This?
- KeyBERT sometimes includes low-score phrases just to meet `top_n`.
- Default Dutch stopwords may not apply unless explicitly downloaded.
- Job descriptions often contain repeated filler phrases like "Wij zijn op zoek..." ("We are looking for...").
- `use_mmr=True` can over-diversify keywords and introduce odd ones.
✅ The Fix (Step-by-Step)
1. Install NLTK and Download Dutch Stopwords
```bash
pip install nltk
python3 -c "import nltk; nltk.download('stopwords')"
```
2. Convert Stopwords to List (KeyBERT requires a list)
```python
from nltk.corpus import stopwords

dutch_stopwords = list(stopwords.words('dutch'))
```
3. Extract Keywords Using KeyBERT
```python
from keybert import KeyBERT

kw_model = KeyBERT('distiluse-base-multilingual-cased-v2')

keywords = kw_model.extract_keywords(
    text,
    keyphrase_ngram_range=(1, 2),
    stop_words=dutch_stopwords,
    use_mmr=False,
    top_n=20
)
```
4. Filter by Relevance Score
```python
# Only keep keywords above a minimum similarity score
MIN_SCORE = 0.25
keywords = [(kw, score) for kw, score in keywords if score >= MIN_SCORE]
```
5. (Optional) Pre-clean the Job Text
If many job descriptions start with filler, strip that before processing:
if "Wij zijn op zoek" in text:
text = text.split("Wij zijn op zoek", 1)[-1]
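If several recurring openers show up in your data, a small helper generalizes this. A sketch; the filler phrases listed are just examples and should come from your own corpus:

```python
import re

# Example filler openers seen in Dutch job ads; extend from your own data
FILLER_OPENERS = [
    "Wij zijn op zoek",
    "Ben jij de",
]

def strip_boilerplate(text: str) -> str:
    """Drop everything up to and including the first filler opener found."""
    for opener in FILLER_OPENERS:
        match = re.search(re.escape(opener), text, flags=re.IGNORECASE)
        if match:
            return text[match.end():].strip()
    return text
```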
🔍 Optional Debugging Tips
- Check what `keywords` contains with `print(keywords)`
- If it's empty, try removing `stop_words=...` to test first
- Use `use_mmr=False` for better relevance, and `use_mmr=True` for more diversity
- Try reducing the input to `text[:512]` to avoid overly long content
🧪 Final Keyword Extraction Block
```python
from keybert import KeyBERT
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')

kw_model = KeyBERT('distiluse-base-multilingual-cased-v2')
dutch_stopwords = list(stopwords.words('dutch'))

text = job.description.strip()

raw_keywords = kw_model.extract_keywords(
    text,
    keyphrase_ngram_range=(1, 2),
    stop_words=dutch_stopwords,
    use_mmr=False,
    top_n=20
)

# Filter to remove weak or irrelevant phrases
MIN_SCORE = 0.25
keywords = [(kw, score) for kw, score in raw_keywords if score >= MIN_SCORE]
```
📋 Summary
- Install and use NLTK Dutch stopwords
- Filter out low-score keywords (`score < 0.25`)
- Disable `use_mmr` for better top results, or use it with `diversity=0.3` if needed
- Optionally trim generic boilerplate from job descriptions before processing
This will drastically improve keyword quality for better tagging, filtering, and search indexing in applications like Django, Elasticsearch, or Algolia.
Author’s note: This post was generated after debugging a real production pipeline that extracted keywords from Dutch healthcare job descriptions. Have a similar issue? Drop me a message!
💡 Extracting and Enriching Job Keywords Using OpenAI GPT-4 Turbo
In large-scale recruitment platforms, having structured, relevant keywords for every job posting is critical for:
- Improving search and recommendation engines (e.g. Elasticsearch)
- Auto-classifying and clustering similar jobs
- Creating rich filtering interfaces for users
While tools like KeyBERT do a decent job for basic extraction, we’ve recently switched to using OpenAI’s GPT-4-turbo model for much higher quality and semantic understanding.
🤖 Why GPT-4 Turbo?
- It understands job-related content in natural language — even in Dutch, German, or French
- It generates not just keywords, but also extended descriptions
- Perfect for deeply enriching the semantics of each job
🔐 API Key Setup (Secure)
We recommend storing your OpenAI key securely in an environment variable:
```bash
# .env or environment
OPENAI_API_KEY=sk-...your-secret...
```
And in Django, read it like this:
```python
import os

api_key = os.getenv("OPENAI_API_KEY")
```
Never hardcode the API key into scripts or Git repositories.
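If the key lives in a `.env` file rather than the shell environment, a loader such as `python-dotenv` can populate `os.environ` at startup (a sketch, assuming `pip install python-dotenv`):

```python
# settings.py -- load .env before anything reads os.environ
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root
api_key = os.getenv("OPENAI_API_KEY")
```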
🧠 Prompt to Extract Keywords from Job Description
```python
prompt = (
    "Extract the most relevant, job-specific Dutch keywords or short phrases "
    "(max 10) from the following job description. "
    "Only include real job functions, required skills, and tools. "
    "Return them as a comma-separated list.\n\n"
    f"{job.description.strip()}"
)
```
This prompt is passed to GPT-4 Turbo for accurate, semantically aware extraction.
💻 Django Command to Generate GPT-Based Keywords
```python
from django.core.management.base import BaseCommand
from permanentjob.models import PermanentJob, JobKeyword
import openai
import os

class Command(BaseCommand):
    help = "Use GPT-4 Turbo to extract job keywords"

    def handle(self, *args, **options):
        openai.api_key = os.getenv("OPENAI_API_KEY")
        jobs = PermanentJob.objects.exclude(description__isnull=True).exclude(description__exact="")
        for job in jobs:
            prompt = (
                "Extract the most relevant, job-specific Dutch keywords (max 10) "
                "from the following job description. Only include short, real skill terms. "
                "Return a comma-separated list.\n\n"
                f"{job.description}"
            )
            try:
                # Note: this uses the pre-1.0 openai client API
                response = openai.ChatCompletion.create(
                    model="gpt-4-turbo",
                    messages=[
                        {"role": "system", "content": "You are a professional HR assistant."},
                        {"role": "user", "content": prompt}
                    ],
                    temperature=0.3,
                    max_tokens=200
                )
                keyword_list = response.choices[0].message.content.strip().split(",")
                keyword_list = [kw.strip().lower() for kw in keyword_list if kw.strip()]

                # Replace existing keywords with the fresh GPT output
                job.keywords.clear()
                for kw in keyword_list:
                    obj, _ = JobKeyword.objects.get_or_create(name=kw, defaults={'source': 'openai'})
                    job.keywords.add(obj)
                    obj.count = obj.permanentjob_set.count()
                    obj.save()

                self.stdout.write(self.style.SUCCESS(f"Saved keywords for: {job.title}"))
            except Exception as e:
                self.stderr.write(f"❌ Error for job {job.id}: {str(e)}")
```
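One practical addition: at scale you will hit rate limits, so it helps to wrap the API call in a small retry with exponential backoff. A minimal sketch, using the same pre-1.0 `openai` client as the command above:

```python
import time
import openai

def chat_with_retry(messages, retries=3, backoff=2.0):
    """Call the chat API, retrying with exponential backoff on rate limits."""
    for attempt in range(retries):
        try:
            return openai.ChatCompletion.create(
                model="gpt-4-turbo",
                messages=messages,
                temperature=0.3,
                max_tokens=200,
            )
        except openai.error.RateLimitError:
            # Sleep 2s, 4s, 8s, ... before the next attempt
            time.sleep(backoff * (2 ** attempt))
    raise RuntimeError("OpenAI API kept rate-limiting after retries")
```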
📌 Model Used: GPT-4 Turbo
We are using `gpt-4-turbo`, one of OpenAI's most capable models available via API:
- Model: `gpt-4-turbo`
- Pricing: approximately $0.01–$0.03 per 1K tokens
- Tokens per request: ~400–1,000 per job description
💰 At those rates, 1,000 jobs cost roughly $4–$30 depending on description length, but the premium-level enrichment is worth it.
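A quick back-of-the-envelope check of that range, using the figures above:

```python
# Rough cost range per 1,000 jobs, using the approximate figures above
for tokens_per_job, price_per_1k in [(400, 0.01), (1000, 0.03)]:
    cost = 1000 * tokens_per_job / 1000 * price_per_1k
    print(f"{tokens_per_job} tokens at ${price_per_1k}/1K -> ${cost:.2f} per 1,000 jobs")
# 400 tokens at $0.01/1K -> $4.00 per 1,000 jobs
# 1000 tokens at $0.03/1K -> $30.00 per 1,000 jobs
```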
💬 Output Example
For a nursing job in Dutch, the result might be:
```python
['verpleegkundige', 'ouderenzorg', 'palliatieve zorg', 'diabetesbegeleiding', 'wondverzorging']
```
All relevant and correctly scoped terms — no generic noise like "we zoeken".
✅ Final Notes
- Keywords are stored in `JobKeyword` with `source = "openai"`
- Descriptions can be added using a second GPT pass (500+ words in HTML)
- You can combine this with manual or KeyBERT-based keywords
This approach gives you rich, semantically meaningful job tagging — ready for advanced filtering, search, and SEO.
🚀 Optimizing Job Keyword Extraction and Matching via Elasticsearch
Keyword extraction and matching play a central role in modern job platforms — from improving search results to generating tags, filters, and recommendations.
We recently implemented a robust system for:
- Creating a dedicated Elasticsearch index for permanent jobs
- Automatically matching extracted keywords to jobs
- Updating keyword usage with accurate counts
- And — most importantly — avoiding false positives caused by vague or generic terms
🏗️ 1. Creating a Separate Elasticsearch Index for Permanent Jobs
To keep our search system clean and fast, we separated out different job types using `HAYSTACK_CONNECTIONS` in Django:
```python
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'elasticstack.backends.ConfigurableElasticSearchEngine',
        'URL': 'http://localhost:9200/',
        'INDEX_NAME': 'jobs_vacaturestoday',
        'EXCLUDED_INDEXES': ['permanentjob.search_indexes.PermanentJobIndex'],
    },
    'permanent_vacaturestoday': {
        'ENGINE': 'elasticstack.backends.ConfigurableElasticSearchEngine',
        'URL': 'http://localhost:9200/',
        'INDEX_NAME': 'permanent_vacaturestoday',
        'EXCLUDED_INDEXES': ['job.search_indexes.JobIndex'],
    },
}
```
This allowed us to run:
```bash
python manage.py rebuild_index --using=permanent_vacaturestoday
```
…without affecting any other indexes.
🏷️ 2. Keyword Matching via Elasticsearch
We then created a command that uses full-text search via Haystack to match keywords to jobs:
```python
from haystack.query import SearchQuerySet

sqs = SearchQuerySet(using='permanent_vacaturestoday').models(PermanentJob).filter(content=keyword.name)
job_ids = [result.pk for result in sqs]
```
Each matching job is updated:
```python
for job in PermanentJob.objects.filter(id__in=job_ids):
    job.keywords.add(keyword)
```
This creates an accurate ManyToMany relationship.
⚠️ 3. Problem: Too Many Generic Matches
We quickly found a flaw: keywords like `werk` ("work"), `functie` ("role"), and `ervaring` ("experience") matched almost every job 😱
So we implemented filtering:
- Stopword filtering (Dutch job language)
- Minimum keyword length (e.g. > 3 characters)
- Max match limit (e.g. ignore keywords matched in > 500 jobs)
```python
STOPWORDS = {'werk', 'ervaring', 'vacature', 'functie', 'bij', 'ons', 'team'}

if keyword.name in STOPWORDS or len(keyword.name) < 4:
    continue

if len(job_ids) > 500:
    keyword.is_active = False
    keyword.save()
    continue
```
This eliminated nearly all noisy matches.
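Putting steps 2 and 3 together, the matching command looks roughly like this (a sketch; the command name is illustrative, while the model and field names follow the snippets above):

```python
# management/commands/match_keywords.py -- illustrative sketch combining
# the Haystack lookup with the noise filters above.
from django.core.management.base import BaseCommand
from haystack.query import SearchQuerySet
from permanentjob.models import PermanentJob, JobKeyword

STOPWORDS = {'werk', 'ervaring', 'vacature', 'functie', 'bij', 'ons', 'team'}
MAX_MATCHES = 500

class Command(BaseCommand):
    help = "Match active keywords to permanent jobs via Elasticsearch"

    def handle(self, *args, **options):
        for keyword in JobKeyword.objects.filter(is_active=True):
            # Skip generic or very short terms up front
            if keyword.name in STOPWORDS or len(keyword.name) < 4:
                continue
            sqs = (
                SearchQuerySet(using='permanent_vacaturestoday')
                .models(PermanentJob)
                .filter(content=keyword.name)
            )
            job_ids = [result.pk for result in sqs]
            if len(job_ids) > MAX_MATCHES:
                # Over-matching keyword: deactivate it instead of linking
                keyword.is_active = False
                keyword.save()
                continue
            for job in PermanentJob.objects.filter(id__in=job_ids):
                job.keywords.add(keyword)
```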
🔄 4. Count Synchronization
We maintain two fields on each keyword:
```python
count_current = models.PositiveIntegerField(default=0)
count_permanent = models.PositiveIntegerField(default=0)
```
We update these using Elasticsearch full-text match counts:
```python
keyword.count_current = SearchQuerySet(using='default').models(Job).filter(content=keyword.name).count()
keyword.count_permanent = SearchQuerySet(using='permanent_vacaturestoday').models(PermanentJob).filter(content=keyword.name).count()
keyword.save()
```
Alternatively, if the keyword is already attached to the job via M2M, we can let the model calculate it via:
```python
PermanentJob.objects.filter(keywords=keyword).count()
```
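A periodic sync is then just a loop over the active keywords (a sketch following the snippets above; the `job` app import path is an assumption based on the index names):

```python
from haystack.query import SearchQuerySet
from job.models import Job  # assumed app path, matching 'job.search_indexes.JobIndex'
from permanentjob.models import PermanentJob, JobKeyword

def sync_keyword_counts():
    """Refresh both counters for every active keyword."""
    for keyword in JobKeyword.objects.filter(is_active=True):
        keyword.count_current = (
            SearchQuerySet(using='default')
            .models(Job)
            .filter(content=keyword.name)
            .count()
        )
        keyword.count_permanent = (
            SearchQuerySet(using='permanent_vacaturestoday')
            .models(PermanentJob)
            .filter(content=keyword.name)
            .count()
        )
        keyword.save(update_fields=['count_current', 'count_permanent'])
```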
✅ 5. Best Practices We Learned
- Isolate indexes per model (faster, safer)
- Always clean your keyword list (stopwords, token length)
- Don't just count matches — create real DB relations
- Log and review top-used keywords to catch over-matching
- Use `is_active` and `source` fields to manage keyword quality
💡 Conclusion
Using OpenAI, KeyBERT, or manually curated keywords is powerful — but only if matched carefully. With full-text Elasticsearch + smart filtering, we built a system that produces clean, relevant, and meaningful keyword-to-job mapping that can scale across millions of listings.
🎯 The result? Better search UX, more accurate filtering, and fully automated enrichment of job data.