When you’re working with thousands or even millions of job descriptions, it’s essential to extract accurate, relevant keywords that allow you to:
- Improve job search functionality (e.g., Algolia, Elasticsearch)
- Cluster similar job types
- Enable smarter filtering in your Django or backend application
🤔 Why Not Gensim or Traditional NLP?
While Gensim is great for traditional NLP tasks like TF-IDF and topic modeling (LDA), it struggles with contextual understanding. For instance, it doesn’t know that “Django” and “Python web framework” are closely related. It also cannot extract meaningful multi-word phrases without significant tweaking.
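For contrast, here is roughly what keyword scoring looks like with Gensim's TF-IDF: purely statistical token weighting, with no notion of meaning. A minimal sketch over toy documents:

```python
from gensim import corpora, models

# Two toy "job descriptions", pre-tokenized
docs = [
    "senior python developer with django and rest apis".split(),
    "java engineer with spring and sql experience".split(),
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
tfidf = models.TfidfModel(corpus)

# Top-weighted single tokens for the first document: frequency statistics only,
# so "django" and "python web framework" are unrelated as far as TF-IDF knows
for token_id, score in sorted(tfidf[corpus[0]], key=lambda x: -x[1])[:5]:
    print(dictionary[token_id], round(score, 2))
```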
✅ Meet KeyBERT
KeyBERT is a simple yet powerful tool built on top of BERT (or any transformer model) to extract keywords and keyphrases that actually make semantic sense.
✨ Benefits of KeyBERT:
- Context-aware (understands meaning)
- Extracts multi-word keyphrases
- No training required – works out of the box
- Highly customizable with domain-specific transformer models
⚙️ Installation
```bash
pip install keybert
pip install sentence-transformers
```
💻 Example: Extracting Keywords from a Job Description
Here’s how to use KeyBERT to extract keywords from a job description:
```python
from keybert import KeyBERT

# Load a compact and fast model
kw_model = KeyBERT(model='all-MiniLM-L6-v2')

# Example job description
job_desc = "We are looking for a senior Python developer with strong Django skills and experience with REST APIs and PostgreSQL."

# Extract keywords
keywords = kw_model.extract_keywords(
    job_desc,
    keyphrase_ngram_range=(1, 3),
    stop_words='english',
    top_n=10
)

# Output
for kw, score in keywords:
    print(f"{kw} ({score:.2f})")
```
🧾 Output Example
This might give you output like:
- python developer (0.84)
- django skills (0.79)
- rest apis (0.77)
- senior python developer (0.75)
- postgresql (0.72)
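Because KeyBERT wraps any sentence-transformers encoder, swapping in a domain-specific or multilingual model is a one-line change. A sketch; the model name below is only an example:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Any sentence-transformers model works; this multilingual one is just an example
st_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
kw_model = KeyBERT(model=st_model)

keywords = kw_model.extract_keywords(job_desc, keyphrase_ngram_range=(1, 3), top_n=10)
```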
💾 Saving to Django
If you’re using Django, you can save the keywords to a model field like this:
```python
# models.py
from django.db import models

class Job(models.Model):
    title = models.CharField(max_length=255)
    description = models.TextField()
    keywords = models.TextField(blank=True)
```

```python
# After keyword extraction
job.keywords = ', '.join([kw for kw, _ in keywords])
job.save()
```
📊 KeyBERT vs Gensim vs JACE
| Tool | Context-Aware | Multi-word | Ease of Use | Quality | Best For |
|------|---------------|------------|-------------|---------|----------|
| KeyBERT | ✅ Yes | ✅ Yes | ✅ Easy | ✅ High | Production-ready keyword extraction |
| Gensim (TF-IDF / LDA) | ❌ No | ❌ Rarely | ✅ Easy | ⚠️ Medium | Topic modeling, low-resource systems |
| JACE / Jaseci NLP | ✅ Partial | ❓ Unknown | ⚠️ Medium | ❓ Experimental | Complex NLP pipelines, action logic |
📝 Final Thoughts
If your goal is to extract clean, meaningful, and high-quality keywords from job descriptions — KeyBERT is one of the best solutions on the market. It’s easy to implement, works well with transformer models, and is flexible enough to use at scale.
Whether you’re using this in a Django backend, for improving Algolia/Elasticsearch search, or clustering similar jobs, it’ll give you a solid, modern foundation.
🧠 Pro Tip: For huge datasets, process jobs in batches with Celery or FastAPI and cache the results in your DB.
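A minimal sketch of that batching idea with Celery is shown below; the app path and task wiring are assumptions, not part of the original pipeline:

```python
# tasks.py -- Celery batching sketch; assumes a configured Celery app
# and the Job model from above (the `myapp` path is hypothetical).
from celery import shared_task
from keybert import KeyBERT

# Load the model once per worker process, not once per task
kw_model = KeyBERT(model='all-MiniLM-L6-v2')

@shared_task
def extract_keywords_batch(job_ids):
    from myapp.models import Job  # hypothetical app path
    for job in Job.objects.filter(id__in=job_ids):
        results = kw_model.extract_keywords(
            job.description, keyphrase_ngram_range=(1, 3), top_n=10
        )
        job.keywords = ', '.join(kw for kw, _ in results)
        job.save(update_fields=['keywords'])

# Dispatch in chunks of 100, e.g. from a shell or a scheduler:
# ids = list(Job.objects.values_list('id', flat=True))
# for i in range(0, len(ids), 100):
#     extract_keywords_batch.delay(ids[i:i + 100])
```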
🚀 Improving Keyword Extraction for Job Descriptions with KeyBERT
While working on extracting keywords from Dutch job descriptions using KeyBERT, I faced a common issue: the model was returning noisy, irrelevant, or overly generic keywords such as:
```python
[
    ('mobiele verpleegkundige', 0.39),
    ('ouder samenleving', 0.20),
    ('bijdragend positief', 0.13),
    ('wij zoek', 0.07),
    ('werkuren maaltijdcheques', -0.003),
    ('vragen diploma', -0.09)
]
```
Many of these phrases are too generic or grammatically meaningless. Let's walk through how to clean this up and produce only relevant, high-quality keywords.
🧐 What's Causing This?
- KeyBERT sometimes includes low-score phrases just to meet `top_n`.
- Default Dutch stopwords may not apply unless explicitly downloaded.
- Job descriptions often contain repeated filler phrases like "Wij zijn op zoek..." ("We are looking for...").
- `use_mmr=True` can over-diversify keywords and introduce odd ones.
✅ The Fix (Step-by-Step)
1. Install NLTK and Download Dutch Stopwords
```bash
pip install nltk
python3 -c "import nltk; nltk.download('stopwords')"
```
2. Convert Stopwords to List (KeyBERT requires a list)
```python
from nltk.corpus import stopwords

dutch_stopwords = list(stopwords.words('dutch'))
```
3. Extract Keywords Using KeyBERT
```python
from keybert import KeyBERT

kw_model = KeyBERT('distiluse-base-multilingual-cased-v2')

keywords = kw_model.extract_keywords(
    text,
    keyphrase_ngram_range=(1, 2),
    stop_words=dutch_stopwords,
    use_mmr=False,
    top_n=20
)
```
4. Filter by Relevance Score
```python
# Only keep keywords above a minimum similarity score
MIN_SCORE = 0.25
keywords = [(kw, score) for kw, score in keywords if score >= MIN_SCORE]
```
5. (Optional) Pre-clean the Job Text
If many job descriptions start with filler, strip that before processing:
if "Wij zijn op zoek" in text:
text = text.split("Wij zijn op zoek", 1)[-1]
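If several recurring openers show up in your data, a small helper generalizes this. A sketch; the filler phrases listed are just examples and should come from your own corpus:

```python
import re

# Example filler openers seen in Dutch job ads; extend from your own data
FILLER_OPENERS = [
    "Wij zijn op zoek",
    "Ben jij de",
]

def strip_boilerplate(text: str) -> str:
    """Drop everything up to and including the first filler opener found."""
    for opener in FILLER_OPENERS:
        match = re.search(re.escape(opener), text, flags=re.IGNORECASE)
        if match:
            return text[match.end():].strip()
    return text
```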
🔍 Optional Debugging Tips
- Check what `keywords` contains with `print(keywords)`
- If it's empty, try removing `stop_words=...` to test first
- Use `use_mmr=False` for better relevance, and `use_mmr=True` for more diversity
- Try reducing the input to `text[:512]` to avoid overly long content
🧪 Final Keyword Extraction Block
```python
from keybert import KeyBERT
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')

kw_model = KeyBERT('distiluse-base-multilingual-cased-v2')
dutch_stopwords = list(stopwords.words('dutch'))

text = job.description.strip()

raw_keywords = kw_model.extract_keywords(
    text,
    keyphrase_ngram_range=(1, 2),
    stop_words=dutch_stopwords,
    use_mmr=False,
    top_n=20
)

# Filter to remove weak or irrelevant phrases
MIN_SCORE = 0.25
keywords = [(kw, score) for kw, score in raw_keywords if score >= MIN_SCORE]
```
📋 Summary
- Install and use NLTK Dutch stopwords
- Filter out low-score keywords (`score < 0.25`)
- Disable `use_mmr` for better top results, or use it with `diversity=0.3` if needed
- Optionally trim generic boilerplate from job descriptions before processing
This will drastically improve keyword quality for better tagging, filtering, and search indexing in applications like Django, Elasticsearch, or Algolia.
Author’s note: This post was generated after debugging a real production pipeline that extracted keywords from Dutch healthcare job descriptions. Have a similar issue? Drop me a message!
💡 Extracting and Enriching Job Keywords Using OpenAI GPT-4 Turbo
In large-scale recruitment platforms, having structured, relevant keywords for every job posting is critical for:
- Improving search and recommendation engines (e.g. Elasticsearch)
- Auto-classifying and clustering similar jobs
- Creating rich filtering interfaces for users
While tools like KeyBERT do a decent job for basic extraction, we’ve recently switched to using OpenAI’s GPT-4-turbo model for much higher quality and semantic understanding.
🤖 Why GPT-4 Turbo?
- It understands job-related content in natural language — even in Dutch, German, or French
- It generates not just keywords, but also extended descriptions
- Perfect for deeply enriching the semantics of each job
🔐 API Key Setup (Secure)
We recommend storing your OpenAI key securely in an environment variable:
```bash
# .env or environment
OPENAI_API_KEY=sk-...your-secret...
```
And in Django, read it like this:
```python
import os

api_key = os.getenv("OPENAI_API_KEY")
```
Never hardcode the API key into scripts or Git repositories.
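If the key lives in a `.env` file rather than the shell environment, a loader such as `python-dotenv` can populate `os.environ` at startup (a sketch, assuming `pip install python-dotenv`):

```python
# settings.py -- load .env before anything reads os.environ
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root
api_key = os.getenv("OPENAI_API_KEY")
```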
🧠 Prompt to Extract Keywords from Job Description
```python
prompt = (
    "Extract the most relevant, job-specific Dutch keywords or short phrases "
    "(max 10) from the following job description. "
    "Only include real job functions, required skills, and tools. "
    "Return them as a comma-separated list.\n\n"
    f"{job.description.strip()}"
)
```
This prompt is passed to GPT-4 Turbo for accurate, semantically aware extraction.
💻 Django Command to Generate GPT-Based Keywords
```python
from django.core.management.base import BaseCommand
from permanentjob.models import PermanentJob, JobKeyword
import openai
import os

class Command(BaseCommand):
    help = "Use GPT-4 Turbo to extract job keywords"

    def handle(self, *args, **options):
        openai.api_key = os.getenv("OPENAI_API_KEY")
        jobs = PermanentJob.objects.exclude(description__isnull=True).exclude(description__exact="")
        for job in jobs:
            prompt = (
                "Extract the most relevant, job-specific Dutch keywords (max 10) "
                "from the following job description. Only include short, real skill terms. "
                "Return a comma-separated list.\n\n"
                f"{job.description}"
            )
            try:
                # Note: this uses the pre-1.0 openai client API
                response = openai.ChatCompletion.create(
                    model="gpt-4-turbo",
                    messages=[
                        {"role": "system", "content": "You are a professional HR assistant."},
                        {"role": "user", "content": prompt}
                    ],
                    temperature=0.3,
                    max_tokens=200
                )
                keyword_list = response.choices[0].message.content.strip().split(",")
                keyword_list = [kw.strip().lower() for kw in keyword_list if kw.strip()]

                # Replace existing keywords with the fresh GPT output
                job.keywords.clear()
                for kw in keyword_list:
                    obj, _ = JobKeyword.objects.get_or_create(name=kw, defaults={'source': 'openai'})
                    job.keywords.add(obj)
                    obj.count = obj.permanentjob_set.count()
                    obj.save()

                self.stdout.write(self.style.SUCCESS(f"Saved keywords for: {job.title}"))
            except Exception as e:
                self.stderr.write(f"❌ Error for job {job.id}: {str(e)}")
```
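One practical addition: at scale you will hit rate limits, so it helps to wrap the API call in a small retry with exponential backoff. A minimal sketch, using the same pre-1.0 `openai` client as the command above:

```python
import time
import openai

def chat_with_retry(messages, retries=3, backoff=2.0):
    """Call the chat API, retrying with exponential backoff on rate limits."""
    for attempt in range(retries):
        try:
            return openai.ChatCompletion.create(
                model="gpt-4-turbo",
                messages=messages,
                temperature=0.3,
                max_tokens=200,
            )
        except openai.error.RateLimitError:
            # Sleep 2s, 4s, 8s, ... before the next attempt
            time.sleep(backoff * (2 ** attempt))
    raise RuntimeError("OpenAI API kept rate-limiting after retries")
```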
📌 Model Used: GPT-4 Turbo
We are using `gpt-4-turbo`, one of OpenAI's most capable models available via API:
- Model: `gpt-4-turbo`
- Pricing: approximately $0.01–$0.03 per 1K tokens
- Tokens per request: ~400–1,000 per job description
💰 At those rates, 1,000 jobs cost roughly $4–$30 depending on description length, but the premium-level enrichment is worth it.
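A quick back-of-the-envelope check of that range, using the figures above:

```python
# Rough cost range per 1,000 jobs, using the approximate figures above
for tokens_per_job, price_per_1k in [(400, 0.01), (1000, 0.03)]:
    cost = 1000 * tokens_per_job / 1000 * price_per_1k
    print(f"{tokens_per_job} tokens at ${price_per_1k}/1K -> ${cost:.2f} per 1,000 jobs")
# 400 tokens at $0.01/1K -> $4.00 per 1,000 jobs
# 1000 tokens at $0.03/1K -> $30.00 per 1,000 jobs
```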
💬 Output Example
For a nursing job in Dutch, the result might be:
```python
['verpleegkundige', 'ouderenzorg', 'palliatieve zorg', 'diabetesbegeleiding', 'wondverzorging']
```
All relevant and correctly scoped terms — no generic noise like "we zoeken".
✅ Final Notes
- Keywords are stored in `JobKeyword` with `source = "openai"`
- Descriptions can be added using a second GPT pass (500+ words in HTML)
- You can combine this with manual or KeyBERT-based keywords
This approach gives you rich, semantically meaningful job tagging — ready for advanced filtering, search, and SEO.
🚀 Optimizing Job Keyword Extraction and Matching via Elasticsearch
Keyword extraction and matching play a central role in modern job platforms — from improving search results to generating tags, filters, and recommendations.
We recently implemented a robust system for:
- Creating a dedicated Elasticsearch index for permanent jobs
- Automatically matching extracted keywords to jobs
- Updating keyword usage with accurate counts
- And — most importantly — avoiding false positives caused by vague or generic terms
🏗️ 1. Creating a Separate Elasticsearch Index for Permanent Jobs
To keep our search system clean and fast, we separated out different job types using `HAYSTACK_CONNECTIONS` in Django:
```python
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'elasticstack.backends.ConfigurableElasticSearchEngine',
        'URL': 'http://localhost:9200/',
        'INDEX_NAME': 'jobs_vacaturestoday',
        'EXCLUDED_INDEXES': ['permanentjob.search_indexes.PermanentJobIndex'],
    },
    'permanent_vacaturestoday': {
        'ENGINE': 'elasticstack.backends.ConfigurableElasticSearchEngine',
        'URL': 'http://localhost:9200/',
        'INDEX_NAME': 'permanent_vacaturestoday',
        'EXCLUDED_INDEXES': ['job.search_indexes.JobIndex'],
    },
}
```
This allowed us to run:
```bash
python manage.py rebuild_index --using=permanent_vacaturestoday
```
…without affecting any other indexes.
🏷️ 2. Keyword Matching via Elasticsearch
We then created a command that uses full-text search via Haystack to match keywords to jobs:
```python
from haystack.query import SearchQuerySet

sqs = SearchQuerySet(using='permanent_vacaturestoday').models(PermanentJob).filter(content=keyword.name)
job_ids = [result.pk for result in sqs]
```
Each matching job is updated:
```python
for job in PermanentJob.objects.filter(id__in=job_ids):
    job.keywords.add(keyword)
```
This creates an accurate ManyToMany relationship.
⚠️ 3. Problem: Too Many Generic Matches
We quickly found a flaw: keywords like `werk` ("work"), `functie` ("role"), and `ervaring` ("experience") matched almost every job 😱
So we implemented filtering:
- Stopword filtering (Dutch job language)
- Minimum keyword length (e.g. > 3 characters)
- Max match limit (e.g. ignore keywords matched in > 500 jobs)
```python
STOPWORDS = {'werk', 'ervaring', 'vacature', 'functie', 'bij', 'ons', 'team'}

if keyword.name in STOPWORDS or len(keyword.name) < 4:
    continue

if len(job_ids) > 500:
    keyword.is_active = False
    keyword.save()
    continue
```
This eliminated nearly all noisy matches.
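Putting steps 2 and 3 together, the matching command looks roughly like this (a sketch; the command name is illustrative, while the model and field names follow the snippets above):

```python
# management/commands/match_keywords.py -- illustrative sketch combining
# the Haystack lookup with the noise filters above.
from django.core.management.base import BaseCommand
from haystack.query import SearchQuerySet
from permanentjob.models import PermanentJob, JobKeyword

STOPWORDS = {'werk', 'ervaring', 'vacature', 'functie', 'bij', 'ons', 'team'}
MAX_MATCHES = 500

class Command(BaseCommand):
    help = "Match active keywords to permanent jobs via Elasticsearch"

    def handle(self, *args, **options):
        for keyword in JobKeyword.objects.filter(is_active=True):
            # Skip generic or very short terms up front
            if keyword.name in STOPWORDS or len(keyword.name) < 4:
                continue
            sqs = (
                SearchQuerySet(using='permanent_vacaturestoday')
                .models(PermanentJob)
                .filter(content=keyword.name)
            )
            job_ids = [result.pk for result in sqs]
            if len(job_ids) > MAX_MATCHES:
                # Over-matching keyword: deactivate it instead of linking
                keyword.is_active = False
                keyword.save()
                continue
            for job in PermanentJob.objects.filter(id__in=job_ids):
                job.keywords.add(keyword)
```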
🔄 4. Count Synchronization
We maintain two fields on each keyword:
```python
count_current = models.PositiveIntegerField(default=0)
count_permanent = models.PositiveIntegerField(default=0)
```
We update these using Elasticsearch full-text match counts:
```python
keyword.count_current = SearchQuerySet(using='default').models(Job).filter(content=keyword.name).count()
keyword.count_permanent = SearchQuerySet(using='permanent_vacaturestoday').models(PermanentJob).filter(content=keyword.name).count()
keyword.save()
```
Alternatively, if the keyword is already attached to the job via M2M, we can let the model calculate it via:
```python
PermanentJob.objects.filter(keywords=keyword).count()
```
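A periodic sync is then just a loop over the active keywords (a sketch following the snippets above; the `job` app import path is an assumption based on the index names):

```python
from haystack.query import SearchQuerySet
from job.models import Job  # assumed app path, matching 'job.search_indexes.JobIndex'
from permanentjob.models import PermanentJob, JobKeyword

def sync_keyword_counts():
    """Refresh both counters for every active keyword."""
    for keyword in JobKeyword.objects.filter(is_active=True):
        keyword.count_current = (
            SearchQuerySet(using='default')
            .models(Job)
            .filter(content=keyword.name)
            .count()
        )
        keyword.count_permanent = (
            SearchQuerySet(using='permanent_vacaturestoday')
            .models(PermanentJob)
            .filter(content=keyword.name)
            .count()
        )
        keyword.save(update_fields=['count_current', 'count_permanent'])
```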
✅ 5. Best Practices We Learned
- Isolate indexes per model (faster, safer)
- Always clean your keyword list (stopwords, token length)
- Don't just count matches — create real DB relations
- Log and review top-used keywords to catch over-matching
- Use `is_active` and `source` fields to manage keyword quality
💡 Conclusion
Using OpenAI, KeyBERT, or manually curated keywords is powerful — but only if matched carefully. With full-text Elasticsearch + smart filtering, we built a system that produces clean, relevant, and meaningful keyword-to-job mapping that can scale across millions of listings.
🎯 The result? Better search UX, more accurate filtering, and fully automated enrichment of job data.