Extracting Keywords from Job Descriptions Using KeyBERT, NLTK, and OpenAI

When you’re working with thousands or even millions of job descriptions, it’s essential to extract accurate, relevant keywords that allow you to:

  • Improve job search functionality (e.g., Algolia, Elasticsearch)
  • Cluster similar job types
  • Enable smarter filtering in your Django or backend application

πŸ” Why Not Gensim or Traditional NLP?

While Gensim is great for traditional NLP tasks like TF-IDF and topic modeling (LDA), it struggles with contextual understanding. For instance, it doesn’t know that “Django” and “Python web framework” are closely related. It also cannot extract meaningful multi-word phrases without significant tweaking.
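
To make that concrete, here is a quick sketch using sentence-transformers (the library KeyBERT builds on); the model name matches the compact model used later in this post:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Django", "Python web framework"], convert_to_tensor=True)

# Noticeable cosine similarity despite zero word overlap;
# a TF-IDF representation would score this pair at 0.
print(util.cos_sim(embeddings[0], embeddings[1]).item())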

✅ Meet KeyBERT

KeyBERT is a simple yet powerful tool built on top of BERT (or any transformer model) to extract keywords and keyphrases that actually make semantic sense.

✨ Benefits of KeyBERT:

  • Context-aware (understands meaning)
  • Extracts multi-word keyphrases
  • No training required – works out of the box
  • Highly customizable with domain-specific transformer models

⚙️ Installation

pip install keybert
pip install sentence-transformers

💻 Example: Extracting Keywords from a Job Description

Here’s how to use KeyBERT to extract keywords from a job description:

from keybert import KeyBERT

# Load a compact and fast model
kw_model = KeyBERT(model='all-MiniLM-L6-v2')

# Example job description
job_desc = "We are looking for a senior Python developer with strong Django skills and experience with REST APIs and PostgreSQL."

# Extract keywords
keywords = kw_model.extract_keywords(
    job_desc,
    keyphrase_ngram_range=(1, 3),
    stop_words='english',
    top_n=10
)

# Output
for kw, score in keywords:
    print(f"{kw} ({score:.2f})")

🧠 Output Example

This might give you output like:

  • python developer (0.84)
  • django skills (0.79)
  • rest apis (0.77)
  • senior python developer (0.75)
  • postgresql (0.72)

🛠 Saving to Django

If you’re using Django, you can save the keywords to a model field like this:

# models.py
from django.db import models

class Job(models.Model):
    title = models.CharField(max_length=255)
    description = models.TextField()
    keywords = models.TextField(blank=True)

# After keyword extraction (e.g. in a view or management command)
job.keywords = ', '.join(kw for kw, _ in keywords)
job.save()

📊 KeyBERT vs Gensim vs JACE

Tool                  | Context Aware | Multi-word | Ease of Use | Quality         | Best For
KeyBERT               | ✅ Yes        | ✅ Yes     | ✅ Easy     | ✅ High         | Production-ready keyword extraction
Gensim (TF-IDF / LDA) | ❌ No         | ❌ Rarely  | ✅ Easy     | ⚠️ Medium       | Topic modeling, low-resource systems
JACE / Jaseci NLP     | ✅ Partial    | ❓ Unknown | ⚠️ Medium   | ❓ Experimental | Complex NLP pipelines, action logic

🚀 Final Thoughts

If your goal is to extract clean, meaningful, and high-quality keywords from job descriptions — KeyBERT is one of the best solutions on the market. It’s easy to implement, works well with transformer models, and is flexible enough to use at scale.

Whether you’re using this in a Django backend, for improving Algolia/Elasticsearch search, or clustering similar jobs, it’ll give you a solid, modern foundation.


🧠 Pro Tip: For huge datasets, process jobs in batches with a task queue like Celery (or FastAPI background tasks) and cache the results in your DB.
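
A minimal sketch of that batching pattern with Celery (the jobs app path and batch size are illustrative; Job is the model from the snippet above):

# tasks.py
from celery import shared_task
from keybert import KeyBERT

from jobs.models import Job  # hypothetical app path

kw_model = KeyBERT(model='all-MiniLM-L6-v2')  # loaded once per worker process

@shared_task
def extract_keywords_batch(job_ids):
    for job in Job.objects.filter(id__in=job_ids):
        keywords = kw_model.extract_keywords(
            job.description,
            keyphrase_ngram_range=(1, 3),
            stop_words='english',
            top_n=10,
        )
        job.keywords = ', '.join(kw for kw, _ in keywords)
        job.save(update_fields=['keywords'])

# Enqueue in chunks of e.g. 100 ids:
# ids = list(Job.objects.values_list('id', flat=True))
# for i in range(0, len(ids), 100):
#     extract_keywords_batch.delay(ids[i:i + 100])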

🚀 Improving Keyword Extraction for Job Descriptions with KeyBERT

While working on extracting keywords from Dutch job descriptions using KeyBERT, I faced a common issue: the model was returning noisy, irrelevant, or overly generic keywords such as:

[
  ('mobiele verpleegkundige', 0.39),
  ('ouder samenleving', 0.20),
  ('bijdragend positief', 0.13),
  ('wij zoek', 0.07),
  ('werkuren maaltijdcheques', -0.003),
  ('vragen diploma', -0.09)
]

Many of these phrases are too generic or grammatically meaningless. Let's walk through how to clean this up and produce only relevant, high-quality keywords.


🧠 What’s Causing This?

  • KeyBERT sometimes includes low-score phrases to meet top_n.
  • Default Dutch stopwords may not apply unless explicitly downloaded.
  • Job descriptions often contain repeated filler phrases like “Wij zijn op zoek...”
  • use_mmr=True can over-diversify keywords and introduce weird ones.

✅ The Fix (Step-by-Step)

1. Install NLTK and Download Dutch Stopwords

pip install nltk
python3 -c "import nltk; nltk.download('stopwords')"

2. Convert Stopwords to List (KeyBERT requires a list)

from nltk.corpus import stopwords
dutch_stopwords = list(stopwords.words('dutch'))

3. Extract Keywords Using KeyBERT

from keybert import KeyBERT
kw_model = KeyBERT('distiluse-base-multilingual-cased-v2')

keywords = kw_model.extract_keywords(
    text,
    keyphrase_ngram_range=(1, 2),
    stop_words=dutch_stopwords,
    use_mmr=False,
    top_n=20
)

4. Filter by Relevance Score

# Only keep keywords above a minimum similarity score
MIN_SCORE = 0.25
keywords = [(kw, score) for kw, score in keywords if score >= MIN_SCORE]

5. (Optional) Pre-clean the Job Text

If many job descriptions start with filler, strip that before processing:

if "Wij zijn op zoek" in text:
    text = text.split("Wij zijn op zoek", 1)[-1]

🛠 Optional Debugging Tips

  • Check what keywords contains with print(keywords)
  • If it’s empty, test without stop_words=... first to isolate the cause
  • Use use_mmr=False for better relevance, and use_mmr=True for more diversity (compared in the sketch below)
  • Truncate overly long input, e.g. text[:512]
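
A quick way to compare the two MMR settings side by side (text and dutch_stopwords as defined in the steps above):

# Relevance-first (use_mmr=False) vs. diversity-first (use_mmr=True)
for mmr in (False, True):
    kws = kw_model.extract_keywords(
        text[:512],                     # truncate overly long input
        keyphrase_ngram_range=(1, 2),
        stop_words=dutch_stopwords,
        use_mmr=mmr,
        diversity=0.3,                  # only applied when use_mmr=True
        top_n=10,
    )
    print(f"use_mmr={mmr}: {kws}")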

🧪 Final Keyword Extraction Block

from keybert import KeyBERT
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

kw_model = KeyBERT('distiluse-base-multilingual-cased-v2')
dutch_stopwords = list(stopwords.words('dutch'))

text = job.description.strip()

raw_keywords = kw_model.extract_keywords(
    text,
    keyphrase_ngram_range=(1, 2),
    stop_words=dutch_stopwords,
    use_mmr=False,
    top_n=20
)

# Filter to remove weak or irrelevant phrases
MIN_SCORE = 0.25
keywords = [(kw, score) for kw, score in raw_keywords if score >= MIN_SCORE]

📌 Summary

  • Install and use NLTK Dutch stopwords
  • Filter out low-score keywords (score < 0.25)
  • Disable use_mmr for better top results, or use with diversity=0.3 if needed
  • Optionally trim generic boilerplate from job descriptions before processing

This will drastically improve keyword quality for better tagging, filtering, and search indexing in applications like Django, Elasticsearch, or Algolia.


Author’s note: This post was generated after debugging a real production pipeline that extracted keywords from Dutch healthcare job descriptions. Have a similar issue? Drop me a message!

💡 Extracting and Enriching Job Keywords Using OpenAI GPT-4 Turbo

In large-scale recruitment platforms, having structured, relevant keywords for every job posting is critical for:

  • Improving search and recommendation engines (e.g. Elasticsearch)
  • Auto-classifying and clustering similar jobs
  • Creating rich filtering interfaces for users

While tools like KeyBERT do a decent job for basic extraction, we’ve recently switched to using OpenAI’s GPT-4-turbo model for much higher quality and semantic understanding.


📌 Why GPT-4 Turbo?

  • It understands job-related content in natural language — even in Dutch, German, or French
  • It generates not just keywords, but also extended descriptions
  • Perfect for deeply enriching the semantics of each job

πŸ” API Key Setup (Secure)

We recommend storing your OpenAI key securely in an environment variable:

# .env or environment
OPENAI_API_KEY=sk-...your-secret...

And in Django, read it like this:

import os
api_key = os.getenv("OPENAI_API_KEY")

Never hardcode the API key into scripts or Git repositories.
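
In development, you could load the .env file with python-dotenv (a sketch; assumes pip install python-dotenv):

import os

from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY from a local .env file into the environment
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set")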


🧠 Prompt to Extract Keywords from Job Description

prompt = (
    "Extract the most relevant, job-specific Dutch keywords or short phrases "
    "(max 10) from the following job description. "
    "Only include real job functions, required skills, and tools. "
    "Return them as a comma-separated list.\n\n"
    f"{job.description.strip()}"
)

This is passed to GPT-4 Turbo for extremely accurate semantic extraction.
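
For reference, a minimal one-off call with the openai>=1.0 Python client (the Django command below wraps this same call in a management command):

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.3,
    max_tokens=200,
)
print(response.choices[0].message.content)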


💻 Django Command to Generate GPT-Based Keywords

from django.core.management.base import BaseCommand
from permanentjob.models import PermanentJob, JobKeyword
from openai import OpenAI  # openai>=1.0 client
import os

class Command(BaseCommand):
    help = "Use GPT-4 Turbo to extract job keywords"

    def handle(self, *args, **options):
        client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

        jobs = PermanentJob.objects.exclude(description__isnull=True).exclude(description__exact="")

        for job in jobs:
            prompt = (
                "Extract the most relevant, job-specific Dutch keywords (max 10) "
                "from the following job description. Only include short, real skill terms. "
                "Return a comma-separated list.\n\n"
                f"{job.description}"
            )

            try:
                response = client.chat.completions.create(
                    model="gpt-4-turbo",
                    messages=[
                        {"role": "system", "content": "You are a professional HR assistant."},
                        {"role": "user", "content": prompt}
                    ],
                    temperature=0.3,
                    max_tokens=200
                )

                keyword_list = response.choices[0].message.content.strip().split(",")
                keyword_list = [kw.strip().lower() for kw in keyword_list if kw.strip()]

                job.keywords.clear()
                for kw in keyword_list:
                    obj, _ = JobKeyword.objects.get_or_create(name=kw, defaults={'source': 'openai'})
                    job.keywords.add(obj)
                    obj.count = obj.permanentjob_set.count()
                    obj.save()

                self.stdout.write(self.style.SUCCESS(f"Saved keywords for: {job.title}"))

            except Exception as e:
                self.stderr.write(f"❌ Error for job {job.id}: {str(e)}")

📊 Model Used: GPT-4 Turbo

We are using gpt-4-turbo, one of OpenAI’s most capable models available via the API:

  • Model: gpt-4-turbo
  • Pricing: Approximately $0.01–$0.03 per 1K tokens
  • Tokens per request: ~400–1000 tokens per job description

👉 This may cost a few dollars per 1,000 jobs — but it’s worth it for premium-level enrichment.
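
A quick back-of-the-envelope calculation with those numbers:

# 1,000 jobs x ~400-1,000 tokens each, at $0.01-$0.03 per 1K tokens:
# low end (400 tokens, $0.01/1K) is ~$4; high end (1,000 tokens, $0.03/1K) is ~$30.
jobs, tokens_per_job, price_per_1k = 1_000, 700, 0.02  # midpoints of the ranges above
print(f"~${jobs * tokens_per_job / 1000 * price_per_1k:.2f}")  # ~$14.00 per 1,000 jobs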


💬 Output Example

For a nursing job in Dutch, the result might be:

['verpleegkundige', 'ouderenzorg', 'palliatieve zorg', 'diabetesbegeleiding', 'wondverzorging']

All relevant and correctly scoped terms — no generic noise like "we zoeken".


✅ Final Notes

  • Keywords are stored in JobKeyword with source = "openai"
  • Descriptions can be added using a second GPT pass (500+ words in HTML); see the sketch below
  • You can combine this with manual or KeyBERT-based keywords

This approach gives you rich, semantically meaningful job tagging — ready for advanced filtering, search, and SEO.
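
A hedged sketch of that second pass (the prompt wording and the enriched_description field are illustrative, not the production code):

desc_prompt = (
    "Write an extended job description of 500+ words in Dutch, "
    "formatted as HTML, based on these keywords and the original text.\n\n"
    f"Keywords: {', '.join(keyword_list)}\n\n{job.description}"
)

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": desc_prompt}],
    temperature=0.7,
    max_tokens=1500,
)

job.enriched_description = response.choices[0].message.content  # hypothetical field
job.save(update_fields=["enriched_description"])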


πŸ” Optimizing Job Keyword Extraction and Matching via Elasticsearch

Keyword extraction and matching play a central role in modern job platforms — from improving search results to generating tags, filters, and recommendations.

We recently implemented a robust system for:

  • Creating a dedicated Elasticsearch index for permanent jobs
  • Automatically matching extracted keywords to jobs
  • Updating keyword usage with accurate counts
  • And — most importantly — avoiding false positives caused by vague or generic terms

🗂️ 1. Creating a Separate Elasticsearch Index for Permanent Jobs

To keep our search system clean and fast, we separated out different job types using HAYSTACK_CONNECTIONS in Django:

HAYSTACK_CONNECTIONS = {
  'default': {
    'ENGINE': 'elasticstack.backends.ConfigurableElasticSearchEngine',
    'URL': 'http://localhost:9200/',
    'INDEX_NAME': 'jobs_vacaturestoday',
    'EXCLUDED_INDEXES': ['permanentjob.search_indexes.PermanentJobIndex'],
  },
  'permanent_vacaturestoday': {
    'ENGINE': 'elasticstack.backends.ConfigurableElasticSearchEngine',
    'URL': 'http://localhost:9200/',
    'INDEX_NAME': 'permanent_vacaturestoday',
    'EXCLUDED_INDEXES': ['job.search_indexes.JobIndex'],
  },
}

This allowed us to run:

python manage.py rebuild_index --using=permanent_vacaturestoday

…without affecting any other indexes.
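
For reference, a minimal sketch of the PermanentJobIndex referenced in EXCLUDED_INDEXES above (field names are assumptions):

# permanentjob/search_indexes.py
from haystack import indexes

from permanentjob.models import PermanentJob

class PermanentJobIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, model_attr='description')
    title = indexes.CharField(model_attr='title')

    def get_model(self):
        return PermanentJob

    def index_queryset(self, using=None):
        return self.get_model().objects.all()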


🏷️ 2. Keyword Matching via Elasticsearch

We then created a command that uses full-text search via Haystack to match keywords to jobs:

from haystack.query import SearchQuerySet

sqs = SearchQuerySet(using='permanent_vacaturestoday').models(PermanentJob).filter(content=keyword.name)
job_ids = [result.pk for result in sqs]

Each matching job is updated:

for job in PermanentJob.objects.filter(id__in=job_ids):
    job.keywords.add(keyword)

This creates an accurate ManyToMany relationship.
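
Putting both snippets together, the matching command could look roughly like this (a sketch; model and field names follow the snippets in this post):

from django.core.management.base import BaseCommand
from haystack.query import SearchQuerySet

from permanentjob.models import PermanentJob, JobKeyword

class Command(BaseCommand):
    help = "Match keywords to permanent jobs via Elasticsearch"

    def handle(self, *args, **options):
        for keyword in JobKeyword.objects.filter(is_active=True):
            # Full-text match against the dedicated permanent-jobs index
            sqs = (SearchQuerySet(using='permanent_vacaturestoday')
                   .models(PermanentJob)
                   .filter(content=keyword.name))
            job_ids = [result.pk for result in sqs]
            for job in PermanentJob.objects.filter(id__in=job_ids):
                job.keywords.add(keyword)
            self.stdout.write(f"{keyword.name}: {len(job_ids)} jobs")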


⚠️ 3. Problem: Too Many Generic Matches

We quickly found a flaw: keywords like werk, functie, and ervaring matched almost every job 😱

So we implemented filtering:

  • 🛑 Stopword filtering (Dutch job language)
  • 🔠 Minimum keyword length (e.g. > 3 characters)
  • 📉 Max match limit (e.g. ignore keywords matched in > 500 jobs)

In code:

STOPWORDS = {'werk', 'ervaring', 'vacature', 'functie', 'bij', 'ons', 'team'}

if keyword.name in STOPWORDS or len(keyword.name) < 4:
    continue

if len(job_ids) > 500:
    keyword.is_active = False
    keyword.save()
    continue

This eliminated nearly all noisy matches.


📊 4. Count Synchronization

We maintain two fields on each keyword:

count_current = models.PositiveIntegerField(default=0)
count_permanent = models.PositiveIntegerField(default=0)

We update these using Elasticsearch for semantic relevance:

keyword.count_current = SearchQuerySet(using='default').models(Job).filter(content=keyword.name).count()
keyword.count_permanent = SearchQuerySet(using='permanent_vacaturestoday').models(PermanentJob).filter(content=keyword.name).count()
keyword.save()

Alternatively, if the keyword is already attached to the job via M2M, we can let the model calculate it via:

PermanentJob.objects.filter(keywords=keyword).count()

✅ 5. Best Practices We Learned

  • Isolate indexes per model (faster, safer)
  • Always clean your keyword list (stopwords, token length)
  • Don't just count matches — create real DB relations
  • Log and review top-used keywords to catch over-matching
  • Use is_active and source fields to manage keyword quality (model sketch below)
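
A minimal JobKeyword sketch with those quality-management fields (names inferred from the snippets in this post, not the exact production model):

from django.db import models

class JobKeyword(models.Model):
    name = models.CharField(max_length=255, unique=True)
    source = models.CharField(max_length=50, default='manual')  # e.g. 'openai', 'keybert'
    is_active = models.BooleanField(default=True)
    count_current = models.PositiveIntegerField(default=0)
    count_permanent = models.PositiveIntegerField(default=0)

    def __str__(self):
        return self.name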

💡 Conclusion

Using OpenAI, KeyBERT, or manually curated keywords is powerful — but only if matched carefully. With full-text Elasticsearch + smart filtering, we built a system that produces clean, relevant, and meaningful keyword-to-job mapping that can scale across millions of listings.

🎯 The result? Better search UX, more accurate filtering, and fully automated enrichment of job data.
