Building DoorDash’s Product Knowledge Graph with Large Language Models

DoorDash uses large language models to extract and standardize product attributes from merchant data, solving the cold-start problem that previously required manual operator labeling. The approach cuts processing time from months to weeks while enabling personalized recommendations and accurate product matching.

The Challenge

DoorDash’s retail catalog stores essential product information for SKUs across grocery stores, convenience stores, and other non-restaurant merchants. Each SKU contains attributes like brand, size, and organic status. When merchants onboard, their SKU data arrives in inconsistent formats with missing or incorrect values.
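To make the inconsistency concrete, here is a minimal sketch of a SKU record; the field names are illustrative assumptions, not DoorDash's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical SKU record: attribute fields are often missing or
# inconsistently formatted when merchant data first arrives.
@dataclass
class SKU:
    merchant: str
    title: str
    brand: Optional[str] = None
    size: Optional[str] = None
    is_organic: Optional[bool] = None  # None = unknown / not yet enriched

# A raw onboarding record: abbreviated title, no attributes filled in.
raw = SKU(merchant="Safeway", title="ORG BANANAS 2LB")
```

The enrichment pipelines described below exist to fill in those `None` fields at scale.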

Manual enrichment by contract operators created three problems:

  • Slow turnaround times delayed adding products to the active catalog
  • High costs from multiple human reviews per SKU
  • Quality issues requiring secondary audits

Traditional machine learning models need extensive labeled training data to extract attributes accurately. Collecting this data slowed development and increased costs.

LLM-Powered Attribute Extraction

Large language models solve the cold-start problem. Their broad training enables them to perform natural language processing tasks without requiring many labeled examples. DoorDash built three LLM-powered pipelines to extract different attribute types.

Brand Extraction Pipeline

Brand identification matters for sponsored ads and product affinity. DoorDash maintains a hierarchical brand taxonomy with manufacturers, parent brands, and sub-brands.

The pipeline works in four steps:

  1. An in-house classifier attempts to tag SKUs to existing brands
  2. LLMs extract brand information from SKUs the classifier cannot tag confidently
  3. A second LLM checks whether extracted brands duplicate existing taxonomy entries by retrieving similar brands from the knowledge graph
  4. New brands enter the taxonomy and are used to retrain the classifier

This approach identifies new brands proactively at scale rather than reactively filling gaps as business needs arise.
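The four steps above can be sketched as follows; `classify_brand`, `llm_extract_brand`, and `find_similar_brands` are hypothetical stubs standing in for the in-house classifier, the extraction LLM, and the knowledge-graph retrieval step:

```python
CONFIDENCE_THRESHOLD = 0.9
taxonomy = {"coca-cola", "pepsi"}  # existing brand taxonomy (toy version)

def classify_brand(title):
    # Stub for the in-house classifier: returns (brand, confidence).
    if "coca-cola" in title.lower():
        return "coca-cola", 0.97
    return None, 0.0

def llm_extract_brand(title):
    # Stub for the extraction LLM: here, naively take the first token.
    return title.split()[0].lower()

def find_similar_brands(candidate, taxonomy):
    # Stub for knowledge-graph retrieval of near-duplicate entries.
    return [b for b in taxonomy if b == candidate]

def tag_brand(title):
    # Step 1: classifier handles SKUs it can tag confidently.
    brand, conf = classify_brand(title)
    if conf >= CONFIDENCE_THRESHOLD:
        return brand, False  # (brand, is_new)
    # Step 2: LLM extracts a brand from the remaining SKUs.
    candidate = llm_extract_brand(title)
    # Step 3: a second check guards against taxonomy duplicates.
    if find_similar_brands(candidate, taxonomy):
        return candidate, False
    # Step 4: genuinely new brands enter the taxonomy.
    taxonomy.add(candidate)
    return candidate, True
```

In production the duplicate check would use embedding retrieval over the knowledge graph rather than exact matching, and newly added brands would feed back into classifier retraining.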

Organic Product Labeling

Consumers search for organic products based on dietary preferences. DoorDash built a waterfall pipeline that maximizes speed and coverage:

String matching finds exact mentions of “organic” in product titles. This achieves high precision but misses misspellings and alternative presentations.

LLM reasoning determines organic status from merchant data or from packaging-photo text extracted via optical character recognition (OCR). This catches cases string matching misses.

LLM agents search online for product information and pipe results to another LLM for reasoning. This further boosts coverage.

The pipeline enabled personalized carousels targeting customers with organic affinity, improving engagement metrics.
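A runnable sketch of the waterfall, with `llm_reason` and `llm_agent_search` stubbed out (the real stages call LLMs and web-search agents):

```python
def string_match(sku):
    # Stage 1: high-precision exact match on "organic" in the title.
    # Absence of the word is not evidence of non-organic, so escalate.
    return True if "organic" in sku["title"].lower() else None

def llm_reason(sku):
    # Stub for stage 2: LLM reasoning over merchant data / OCR text.
    return sku.get("ocr_hint")  # None means "not confident"

def llm_agent_search(sku):
    # Stub for stage 3: an agent searches online, a second LLM reasons.
    return sku.get("web_hint", False)

def label_organic(sku):
    # Each stage sees only SKUs the previous stage couldn't label.
    for stage in (string_match, llm_reason, llm_agent_search):
        verdict = stage(sku)
        if verdict is not None:
            return verdict
    return False
```

Ordering the stages cheapest-first means most SKUs never reach an LLM call at all, which is what keeps the pipeline fast.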

Generalized Attribute Extraction

Entity resolution determines whether two SKUs from different merchants represent the same product. For example, “Corona Extra Mexican Lager (12 oz x 12 ct)” from Safeway matches “Corona Extra Mexican Lager Beer Bottles, 12 pk, 12 fl oz” from BevMo!.

Accurate entity resolution requires extracting all defining attributes for each product category. Alcohol needs vintage, aging, and flavor; electronics need an entirely different attribute set.
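Under the assumption that category-specific attributes have already been extracted, entity resolution can be sketched as a comparison of normalized attribute dictionaries (the field names below are illustrative):

```python
def normalize(attrs):
    # Normalize casing, whitespace, and numeric-vs-string mismatches.
    return {k: str(v).strip().lower() for k, v in attrs.items()}

def same_product(a, b, required_keys):
    # Two SKUs match when every category-defining attribute agrees.
    a, b = normalize(a), normalize(b)
    return all(a.get(k) == b.get(k) for k in required_keys)

# The Corona example from above, as two merchants might list it.
safeway = {"brand": "Corona Extra", "style": "Mexican Lager",
           "size_fl_oz": 12, "pack_count": 12}
bevmo = {"brand": "corona extra", "style": "mexican lager",
         "size_fl_oz": "12", "pack_count": "12"}
```

This only works if extraction is complete for the category, which is exactly why the generalized attribute extraction effort matters.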

DoorDash used retrieval augmented generation (RAG) to accelerate annotation:

  1. OpenAI embeddings find the most similar SKUs from golden annotations using approximate nearest neighbors
  2. These examples serve as in-context examples for GPT-4 to generate labels
  3. The generated annotations are used to fine-tune an LLM for scalable inference

Embedding-based example selection reduces hallucination by providing relevant context. The approach generated annotations in one week that would have required months of manual collection.
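A minimal sketch of this annotation flow: `embed` is a toy stand-in for the OpenAI embedding call and the golden set is tiny, but the similarity-based selection and prompt assembly mirror the steps above (exact nearest neighbors here; production would use approximate search):

```python
import math

def embed(text):
    # Toy embedding: word counts over a fixed vocabulary. The real
    # pipeline calls the OpenAI embeddings API instead.
    vocab = ["lager", "vintage", "organic", "oz"]
    return [text.lower().count(w) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_examples(query, golden, k=3):
    # Rank golden annotations by similarity to the query SKU.
    q = embed(query)
    ranked = sorted(golden, key=lambda g: cosine(q, embed(g["title"])),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, examples):
    # Assemble the selected examples as few-shot demonstrations.
    shots = "\n".join(f"{e['title']} -> {e['label']}" for e in examples)
    return f"{shots}\n{query} ->"

golden = [
    {"title": "Corona Extra Mexican Lager 12 oz", "label": "beer"},
    {"title": "Organic Bananas 2 lb", "label": "produce"},
]
```

The assembled prompt would go to GPT-4, and the resulting labels would accumulate into a fine-tuning set for the cheaper specialized model.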

Implementation Details

The brand extraction pipeline processes unstructured product descriptions through sequential LLM calls. The first extracts brand entities. The second queries the knowledge graph to prevent duplicates. This two-stage design separates extraction from validation.

The organic labeling pipeline prioritizes speed. String matching handles obvious cases. LLM reasoning handles ambiguous cases. LLM agents handle edge cases requiring external data. Each stage processes only SKUs the previous stage couldn’t confidently label.

The RAG approach for entity resolution uses cosine similarity between embeddings to select examples. The system passes 3-5 similar examples to GPT-4 as few-shot demonstrations. Fine-tuning on generated labels creates a specialized model that runs faster and cheaper than repeated GPT-4 calls.

Downstream Impact

Extracted attributes power personalized ranking models that recommend products matching consumer preferences. Brand and organic tags serve as ranking features. Product category and size enable relevant substitution recommendations when items are out of stock.

Accurate entity resolution enables sponsored ads by connecting manufacturer inventory to consumer searches across merchants.

Next Steps

Most attribute extraction models use text inputs, but product descriptions are littered with abbreviations and shorthand. Product images offer more consistent quality across merchants.

DoorDash is experimenting with multimodal LLMs that process text and images together. Current tests explore Visual QA and Chat + OCR approaches. The engineering team is building infrastructure for Dashers to photograph in-store items, enabling direct attribute extraction from physical products.

The ML Platform team is building a centralized model platform that democratizes LLM access, letting engineers prompt-engineer, fine-tune, and deploy LLMs without ML expertise.

Key Takeaways

DoorDash solved the cold-start problem for attribute extraction by using LLMs instead of collecting extensive training data. Their multi-stage pipelines balance accuracy and speed. RAG accelerates annotation by providing relevant context to LLMs. The approach reduced processing time from months to weeks while improving accuracy over manual operators.

Start with string matching for obvious cases. Use LLM reasoning for ambiguous cases. Deploy LLM agents only for edge cases requiring external data. This waterfall design optimizes cost and latency.