Why it matters
- Zero cost for large keyword lists — clustering 100,000 keywords with a SaaS tool costs hundreds of dollars in credits; this runs locally for free.
- Full control over clustering parameters — adjust similarity thresholds, choose different embedding models, post-process clusters with custom logic.
- Privacy and data security — no keywords sent to external APIs; useful for industries where keyword research reveals sensitive business strategy.
- Educational value — examine exactly how semantic similarity clustering works as a foundation for understanding NLP-based keyword grouping.
Key capabilities
- Embedding generation: Convert keywords to semantic vectors using sentence-transformers.
- Cosine similarity: Measure keyword similarity in embedding space.
- K-means or hierarchical clustering: Group keywords into a configurable number of clusters.
- CSV input/output: Read standard keyword CSV exports from any SEO tool and write the clustered results back to CSV.
- Customizable models: Swap in different sentence-transformer models (larger/smaller, multilingual).
- Threshold tuning: Adjust similarity thresholds to control cluster granularity.
- Local execution: Runs entirely on your machine without external API calls.
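To make the cosine similarity and threshold tuning items above concrete, here is a minimal sketch that uses only the Python standard library. It is not the repository's code: the toy 3-dimensional vectors stand in for real sentence-transformers embeddings (all-MiniLM-L6-v2 produces 384-dimensional vectors), and group_by_threshold is a hypothetical greedy grouping chosen only to show how a stricter threshold produces more, smaller clusters.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_by_threshold(vectors, threshold):
    """Hypothetical greedy grouping (illustration only).

    Each vector joins the first cluster whose seed (its first member)
    is at least `threshold` similar; otherwise it starts a new cluster.
    """
    seeds = []   # one representative vector per cluster
    labels = []  # cluster id assigned to each input vector
    for vec in vectors:
        for cluster_id, seed in enumerate(seeds):
            if cosine_similarity(vec, seed) >= threshold:
                labels.append(cluster_id)
                break
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


# Toy "embeddings" invented for this example; real ones come from a model.
toy = {
    "buy running shoes": [0.9, 0.1, 0.0],
    "best running shoes": [0.8, 0.3, 0.0],
    "running shoe reviews": [0.6, 0.6, 0.1],
    "python csv tutorial": [0.0, 0.1, 0.9],
}
vectors = list(toy.values())

loose = group_by_threshold(vectors, threshold=0.70)   # [0, 0, 0, 1]
strict = group_by_threshold(vectors, threshold=0.95)  # [0, 0, 1, 2]
```

At 0.70 the three shoe keywords share one cluster; at 0.95 "running shoe reviews" splits off into its own cluster, which is the granularity trade-off that threshold tuning controls.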
Technical notes
- Language: Python
- Key dependencies: sentence-transformers, scikit-learn, pandas
- Install: pip install sentence-transformers scikit-learn pandas
- Embedding model: all-MiniLM-L6-v2 (default); swappable
- Input: CSV with keyword column
- Output: CSV with cluster labels added
- License: Open source (check repository)
- GitHub: github.com/AndreiMikhalevich/KeywordClustering
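The input/output contract above (a CSV with a keyword column in, the same rows with cluster labels out) could be sketched roughly as follows with the listed dependencies. This is an illustration under stated assumptions, not the repository's actual implementation: the function names, the "cluster" output column, the choice of K-means, and n_clusters=20 are all hypothetical. The embedding function is passed in as a parameter so a different model, or a stub for testing, can be swapped in.

```python
from typing import Callable, Sequence

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def embed_keywords(keywords: Sequence[str],
                   model_name: str = "all-MiniLM-L6-v2") -> np.ndarray:
    """Encode keywords into unit-length vectors with sentence-transformers."""
    # Imported lazily so the rest of the sketch works without the model installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(list(keywords), normalize_embeddings=True)


def cluster_keyword_frame(
    df: pd.DataFrame,
    n_clusters: int,
    embed: Callable[[Sequence[str]], np.ndarray] = embed_keywords,
    keyword_col: str = "keyword",
) -> pd.DataFrame:
    """Return a copy of df with an added 'cluster' column (assumed name)."""
    keywords = df[keyword_col].astype(str).tolist()
    embeddings = np.asarray(embed(keywords))
    # K-means on the embeddings; random_state fixed for repeatable labels.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(embeddings)
    out = df.copy()
    out["cluster"] = labels
    return out


# Example usage (file names are illustrative; n_clusters is a guess to tune):
# df = pd.read_csv("keywords.csv")
# cluster_keyword_frame(df, n_clusters=20).to_csv("clustered.csv", index=False)
```

Because the embeddings are normalized to unit length, Euclidean K-means on them tends to group keywords in a way that tracks cosine similarity, which is why no separate similarity matrix is built here.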
Getting started
git clone https://github.com/AndreiMikhalevich/KeywordClustering
cd KeywordClustering
pip install -r requirements.txt
# Add keywords.csv with a 'keyword' column
python cluster_keywords.py --input keywords.csv --output clustered.csv
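If your keywords are not already in a CSV, a small helper like the one below can produce the keywords.csv expected by the command above and then group the resulting rows for a quick look. It uses only the standard library. The helper names are hypothetical, and the "cluster" column name in the output file is an assumption; check the header of your clustered.csv and adjust cluster_col to match.

```python
import csv
from collections import defaultdict


def write_keyword_csv(keywords, path):
    """Write a one-column CSV with the header 'keyword'."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["keyword"])
        for kw in keywords:
            writer.writerow([kw])


def read_clusters(path, keyword_col="keyword", cluster_col="cluster"):
    """Group keywords by cluster label; cluster_col is an assumed column name."""
    groups = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            groups[row[cluster_col]].append(row[keyword_col])
    return dict(groups)


# Example usage:
# write_keyword_csv(["running shoes", "trail shoes"], "keywords.csv")
# ... run the clustering command ...
# for label, kws in read_clusters("clustered.csv").items():
#     print(label, kws)
```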
Ideal for
- Developers and data scientists who want embedding-based keyword clustering with full code control.
- SEO practitioners with large keyword lists who want to avoid per-keyword SaaS costs.
- Teams exploring how embedding-based clustering works before deciding on a commercial tool.
Not ideal for
- Non-technical users — requires Python setup; commercial tools like KeywordInsights or WriterZen are easier.
- High-accuracy clustering for production SEO — SERP-based clustering (KeywordInsights) is more accurate for actual search intent.
- Teams needing features like SERP data, content briefs, or search volume integration alongside clustering.
See also
- KeywordInsights — Commercial keyword clustering with SERP-based accuracy; more reliable for production SEO.
- WriterZen — Keyword research + clustering + briefs in one platform; no coding required.
- LowFruits — Finds weak-competition keywords within clusters; complements clustering for opportunity analysis.