Semantic Search Without Embeddings: Hierarchical Taxonomies and BM25
When you search for “long johns” at an outdoor gear store, you find base layers. Search for “slickers” and discover shells. This works because the search system understands semantic relationships—but not necessarily through vector embeddings.
Most teams assume semantic search requires embeddings and vector databases. You have better options that might align with your domain expertise and existing organizational knowledge.
What Semantic Search Actually Requires
Semantic search needs three components:
- Shared representation - A space where queries and content can be mapped
- Similarity function - A way to measure how related items are
- Match criteria - Rules for including or excluding results
Consider searching for “round red fruit that grows on trees.” You want apples, not baseballs.
We can map this query to properties: [round, red, fruit]. Our indexed items become:
- Apple: [round, red, fruit] (Score: 3)
- Orange: [round, orange, fruit] (Score: 2)
- Baseball: [round, white, ball] (Score: 1)
This simple approach demonstrates the core concept, though it requires comprehensive tagging and ignores important nuances.
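The property-overlap scoring above is a few lines of Python (using the same tags as the example):

```python
# Score items by counting the properties they share with the query.
items = {
    "apple": {"round", "red", "fruit"},
    "orange": {"round", "orange", "fruit"},
    "baseball": {"round", "white", "ball"},
}

def score(query_props, item_props):
    """Similarity = size of the property overlap."""
    return len(query_props & item_props)

query = {"round", "red", "fruit"}
ranked = sorted(items, key=lambda name: score(query, items[name]), reverse=True)
# "apple" scores 3, "orange" 2, "baseball" 1
```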
The Embedding Approach
Vector embeddings learn representations from training data. When users click both apples and oranges for “fruit that grows on trees,” the system nudges their vectors closer together.
The training process repeats these small adjustments across many query-click pairs.
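That nudging can be sketched as a toy update rule. The 2-d vectors and learning rate below are illustrative, not a real training setup:

```python
def nudge(a, b, lr=0.1):
    """Move two vectors a small step toward each other."""
    return (
        [x + lr * (y - x) for x, y in zip(a, b)],
        [y + lr * (x - y) for x, y in zip(a, b)],
    )

apple, orange = [1.0, 0.0], [0.0, 1.0]
for _ in range(50):
    # Users clicked both items for "fruit that grows on trees".
    apple, orange = nudge(apple, orange)
# After many updates the two vectors nearly coincide.
```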
After training, similar items have similar vectors. We measure similarity using cosine similarity or euclidean distance.
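Cosine similarity itself is simple to compute; the 3-d vectors below are made-up stand-ins for learned embeddings:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical learned vectors: the fruits cluster, baseball sits apart.
apple, orange, baseball = [0.9, 0.8, 0.1], [0.8, 0.9, 0.2], [0.1, 0.2, 0.9]
# cosine(apple, orange) is close to 1; cosine(apple, baseball) is much lower
```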
The Missing Piece: Precise Matching
Embeddings excel at representation and similarity but struggle with match criteria. Users react poorly to search results that include baseballs when they search for fruit, even if the ranking is perfect.
Embeddings don’t provide clear cutoff thresholds. A similarity score of 0.8 might work for one domain but fail in another. When users specify multiple criteria, embeddings can’t prioritize which matters more—color or item type for “red apple.”
Hierarchical Taxonomies as an Alternative
Controlled vocabularies organize information into hierarchical trees using domain-specific language. Consider this furniture category from Wayfair:
Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses / Novelty Rocking Horses
A query for “hobby horse” maps to:
Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses
This provides all three semantic search requirements:
- Representation: The category hierarchy
- Similarity: Direct matches rank higher than parents, which rank higher than grandparents
- Match criteria: Include siblings, exclude distant cousins
Implementing Taxonomy Search with BM25
You can achieve hierarchical similarity using standard BM25 indexing with a hierarchical tokenizer.
For the path Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses, the tokenizer produces:
['Baby & Kids',
'Baby & Kids / Toddler & Kids Playroom',
'Baby & Kids / Toddler & Kids Playroom / Indoor Play',
'Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses']
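Such a tokenizer is a one-liner in most languages: split the path and emit every prefix as a token. A Python sketch:

```python
def hierarchy_tokenize(path, sep=" / "):
    """Return every ancestor prefix of a category path, root first."""
    parts = path.split(sep)
    return [sep.join(parts[: i + 1]) for i in range(len(parts))]

tokens = hierarchy_tokenize(
    "Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses"
)
# tokens[0] is the root category, tokens[-1] the full path
```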
BM25 naturally scores rare matches higher than common ones. Root categories appear frequently (low score), while specific leaf categories appear rarely (high score). This creates the desired ranking: direct matches outrank parents, which outrank grandparents.
When searching, you query multiple hierarchy levels:
"Baby & Kids / ... / Rocking Horses" OR
"Baby & Kids / ... / Rocking Horses / Novelty Rocking Horses"
You control match criteria by limiting how far up the hierarchy you search.
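To see why rarity does the ranking work, here is a toy IDF-style scorer over hierarchical tokens. This is a sketch with abbreviated, made-up paths, not full BM25 (which also weights term frequency and document length):

```python
import math
from collections import Counter

# Each "document" is the token list its category path produces.
docs = {
    "rocking horse": ["Baby & Kids", "Baby & Kids / Indoor Play",
                      "Baby & Kids / Indoor Play / Rocking Horses"],
    "play kitchen":  ["Baby & Kids", "Baby & Kids / Indoor Play",
                      "Baby & Kids / Indoor Play / Play Kitchens"],
    "crib":          ["Baby & Kids", "Baby & Kids / Nursery"],
}
doc_freq = Counter(t for toks in docs.values() for t in set(toks))
n_docs = len(docs)

def idf(token):
    """Rare leaf categories get high weight, common root categories low weight."""
    return math.log((n_docs + 1) / (doc_freq[token] + 1)) + 1

def score(query_tokens, doc_tokens):
    return sum(idf(t) for t in query_tokens if t in doc_tokens)

query = ["Baby & Kids", "Baby & Kids / Indoor Play",
         "Baby & Kids / Indoor Play / Rocking Horses"]
ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
# The direct match outranks its sibling, which outranks the distant cousin.
```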
Building Practical Taxonomies
Start simple. Ask an LLM to create basic categories:
Take the 7 primary colors. Create subtypes for each in the form
PRIMARY / SECONDARY
RED / Crimson
RED / Scarlet
RED / Burgundy
ORANGE / Tangerine
ORANGE / Amber
ORANGE / Coral
Evolve your taxonomy as categories become too broad. Split Baby & Kids into:
Baby & Kids / Toddler & Kids Playroom
Baby & Kids / Toddler & Kids Bedroom Furniture
Handle edge cases by assigning items to multiple categories with different weights.
LLM-Powered Classification
LLMs simplify taxonomy classification. Create embeddings for each category, then find the most similar category for new queries.
For better accuracy, ask the LLM to hallucinate realistic classifications first:
Be creative and hallucinate classifications for "hobby horse" that match these examples:
- 'Furniture / Living Room Furniture / Coffee Tables'
- 'Baby & Kids / Toddler & Kids Bedroom Furniture / Kids Beds'
Return diverse, related categories.
This generates more descriptive classifications that better match your actual taxonomy structure.
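Generating candidate classifications still leaves mapping them onto your real taxonomy. One simple approach is word overlap; this is a sketch, and the `generated` string below stands in for a hypothetical LLM output:

```python
def best_category(generated, taxonomy):
    """Pick the taxonomy path sharing the most words with a generated label."""
    def words(s):
        return set(s.lower().replace("/", " ").split())
    gen = words(generated)
    return max(taxonomy, key=lambda path: len(gen & words(path)))

taxonomy = [
    "Furniture / Living Room Furniture / Coffee Tables",
    "Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses",
]
generated = "Kids / Toys / Rocking Horses"  # hypothetical LLM output
match = best_category(generated, taxonomy)
```

Embedding similarity between the generated label and each taxonomy path works the same way, with a softer notion of overlap.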
When to Choose Taxonomies Over Embeddings
Consider taxonomies when:
- Your domain requires precise categorization (legal, medical, fashion)
- Established taxonomies already exist for your field
- You need explainable matching criteria
- Statistical fuzziness creates significant downside
Taxonomies work especially well combined with other ranking signals like keyword matching and targeted embedding search within categories.
Next Steps
Start with a simple taxonomy for your most important content categories. Use LLMs to classify new content and queries. Implement hierarchical tokenization in your existing search index. Measure whether users understand and accept your categorization-based results better than pure embedding approaches.