Semantic Search Without Embeddings: Hierarchical Taxonomies and BM25

When you search for “long johns” at an outdoor gear store, you find base layers. Search for “slickers” and discover shells. This works because the search system understands semantic relationships—but not necessarily through vector embeddings.

Most teams assume semantic search requires embeddings and vector databases. But there are alternatives that may align better with your domain expertise and existing organizational knowledge.

What Semantic Search Actually Requires

Semantic search needs three components:

  1. Shared representation - A space where queries and content can be mapped
  2. Similarity function - A way to measure how related items are
  3. Match criteria - Rules for including or excluding results

Consider searching for “round red fruit that grows on trees.” You want apples, not baseballs.

We can map this query to properties: [round, red, fruit]. Our indexed items become:

  • Apple: [round, red, fruit] (Score: 3)
  • Orange: [round, orange, fruit] (Score: 2)
  • Baseball: [round, white, ball] (Score: 1)

This simple approach demonstrates the core concept, though it requires comprehensive tagging and ignores important nuances.
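A minimal sketch of this property-overlap scoring, using the illustrative tags above (the tag sets and item names are just the running example, not a real index):

```python
# Indexed items, each tagged with a set of properties (illustrative).
items = {
    "apple": {"round", "red", "fruit"},
    "orange": {"round", "orange", "fruit"},
    "baseball": {"round", "white", "ball"},
}

def overlap_score(query_props, item_props):
    """Count how many query properties the item shares."""
    return len(query_props & item_props)

query = {"round", "red", "fruit"}
ranked = sorted(items, key=lambda name: overlap_score(query, items[name]), reverse=True)
# apple scores 3, orange 2, baseball 1
```

Set intersection is the entire similarity function here, which is exactly why the approach needs exhaustive tagging to work in practice.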

The Embedding Approach

Vector embeddings learn representations from training data. When users click both apples and oranges for “fruit that grows on trees,” the system nudges their vectors closer together.

The training process:

for session in sessions:
    for clicked_item_1 in session:
        for clicked_item_2 in session:
            if clicked_item_1 != clicked_item_2:
                # Items clicked in the same session are pulled
                # closer together in vector space
                nudge_closer(clicked_item_1, clicked_item_2)

After training, similar items have similar vectors. We measure similarity using cosine similarity or Euclidean distance.
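Cosine similarity is simple enough to write out directly. The three vectors below are made-up stand-ins for trained embeddings, chosen so that apple and orange land close together:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy post-training vectors (illustrative, not from a real model).
apple = [0.9, 0.8, 0.1]
orange = [0.8, 0.7, 0.2]
baseball = [0.1, 0.2, 0.9]
# cosine_similarity(apple, orange) is much higher than
# cosine_similarity(apple, baseball)
```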

The Missing Piece: Precise Matching

Embeddings excel at representation and similarity but struggle with match criteria. Users react poorly to search results that include baseballs when they search for fruit, even if the ranking is perfect.

Embeddings don’t provide clear cutoff thresholds. A similarity score of 0.8 might work for one domain but fail in another. When users specify multiple criteria, embeddings can’t prioritize which matters more—color or item type for “red apple.”

Hierarchical Taxonomies as an Alternative

Controlled vocabularies organize information into hierarchical trees using domain-specific language. Consider this furniture category from Wayfair:

Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses / Novelty Rocking Horses

A query for “hobby horse” maps to:

Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses

This provides all three semantic search requirements:

  • Representation: The category hierarchy
  • Similarity: Direct matches rank higher than parents, which rank higher than grandparents
  • Match criteria: Include siblings, exclude distant cousins

Implementing Taxonomy Search with BM25

You can achieve hierarchical similarity using standard BM25 indexing with a hierarchical tokenizer.

For the path Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses, the tokenizer produces:

['Baby & Kids',
 'Baby & Kids / Toddler & Kids Playroom', 
 'Baby & Kids / Toddler & Kids Playroom / Indoor Play',
 'Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses']
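One way to sketch such a tokenizer, assuming categories use a `" / "` separator as in the Wayfair paths above:

```python
def hierarchical_tokenize(path, sep=" / "):
    """Emit one token per ancestor prefix of a category path,
    from the root down to the full path."""
    parts = path.split(sep)
    return [sep.join(parts[: i + 1]) for i in range(len(parts))]

tokens = hierarchical_tokenize(
    "Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses"
)
# ['Baby & Kids',
#  'Baby & Kids / Toddler & Kids Playroom',
#  'Baby & Kids / Toddler & Kids Playroom / Indoor Play',
#  'Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses']
```

This is the same idea as a path-hierarchy tokenizer in engines like Elasticsearch, which offer it out of the box.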

BM25 naturally scores rare matches higher than common ones. Root categories appear frequently (low score), while specific leaf categories appear rarely (high score). This creates the desired ranking: direct matches outrank parents, which outrank grandparents.

When searching, you query multiple hierarchy levels:

"Baby & Kids / ... / Rocking Horses" OR
"Baby & Kids / ... / Rocking Horses / Novelty Rocking Horses"

You control match criteria by limiting how far up the hierarchy you search.
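A sketch of that control knob: generate query tokens only for the target category and up to N ancestors, so distant ancestors (and therefore distant cousins) never match. The `max_levels_up` parameter is an illustrative name, not from the original:

```python
def query_tokens(path, max_levels_up=1, sep=" / "):
    """Query tokens for a category plus at most N ancestor levels.
    Smaller max_levels_up means stricter match criteria."""
    parts = path.split(sep)
    prefixes = [sep.join(parts[: i + 1]) for i in range(len(parts))]
    return prefixes[-(max_levels_up + 1):]

tokens = query_tokens(
    "Baby & Kids / Toddler & Kids Playroom / Indoor Play / "
    "Rocking Horses / Novelty Rocking Horses",
    max_levels_up=1,
)
# Matches the leaf category and its direct parent, but nothing
# higher up the tree.
```

With `max_levels_up=0` you get exact-category matching only; larger values trade precision for recall.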

Building Practical Taxonomies

Start simple. Ask an LLM to create basic categories:

Take the 7 primary colors. Create subtypes for each in the form PRIMARY / SECONDARY

RED / Crimson
RED / Scarlet  
RED / Burgundy
ORANGE / Tangerine
ORANGE / Amber
ORANGE / Coral

Evolve your taxonomy as categories become too broad. Split Baby & Kids into:

Baby & Kids / Toddler & Kids Playroom
Baby & Kids / Toddler & Kids Bedroom Furniture

Handle edge cases by assigning items to multiple categories with different weights.

LLM-Powered Classification

LLMs simplify taxonomy classification. Create embeddings for each category, then find the most similar category for new queries.
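A toy sketch of nearest-category classification. A real system would use model-generated embeddings; the bag-of-words `embed` below is a deliberately crude stand-in so the example stays self-contained:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector standing in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

categories = [
    "Baby & Kids / Toddler & Kids Playroom / Indoor Play / Rocking Horses",
    "Furniture / Living Room Furniture / Coffee Tables",
]

def classify(query):
    """Return the category most similar to the query."""
    q = embed(query)
    return max(categories, key=lambda c: cosine(q, embed(c)))
```

Swapping `embed` for a real embedding model keeps the same shape: precompute category vectors once, then take the nearest one per query.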

For better accuracy, ask the LLM to hallucinate realistic classifications first:

Be creative and hallucinate classifications for "hobby horse" that match these examples:
- 'Furniture / Living Room Furniture / Coffee Tables'
- 'Baby & Kids / Toddler & Kids Bedroom Furniture / Kids Beds'

Return diverse, related categories.

This generates more descriptive classifications that better match your actual taxonomy structure.

When to Choose Taxonomies Over Embeddings

Consider taxonomies when:

  • Your domain requires precise categorization (legal, medical, fashion)
  • An established taxonomy already exists for your field
  • You need explainable matching criteria
  • Statistical fuzziness creates significant downside

Taxonomies work especially well combined with other ranking signals like keyword matching and targeted embedding search within categories.

Next Steps

Start with a simple taxonomy for your most important content categories. Use LLMs to classify new content and queries. Implement hierarchical tokenization in your existing search index. Measure whether users understand and accept your categorization-based results better than pure embedding approaches.