DRAGON-AI: Using Large Language Models and RAG for Automated Ontology Generation
Ontologies power critical biomedical databases and research platforms, but creating and maintaining them demands enormous human effort. Researchers have developed DRAGON-AI, a system that uses Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) to automate ontology construction tasks.
The Ontology Construction Challenge
Building ontologies requires domain experts, curators, and ontology editors working together to create structured knowledge representations. Each ontology term needs several components: a unique identifier, a human-readable label, a textual definition, and logical relationships connecting it to other terms within and across ontologies.
Currently, most ontology editing involves manual entry of this information using tools like Protégé or spreadsheet-based workflows. While some relationships can be automated through logical reasoning, the majority of ontology construction remains manual work.
How DRAGON-AI Works
DRAGON-AI transforms partial ontology terms into complete ones using a multi-step process:
Vector Indexing and Retrieval
The system creates vector embeddings for existing ontology terms by serializing them as JSON objects containing labels, definitions, and relationships. These embeddings are stored in a ChromaDB database for efficient similarity searches.
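As a sketch of this indexing step — the term schema and the toy letter-frequency `embed` function below are illustrative assumptions; DRAGON-AI embeds the serialized JSON with a real embedding model and stores the vectors in ChromaDB:

```python
import json

def serialize_term(term: dict) -> str:
    """Serialize a term (label, definition, relationships) as one JSON string."""
    return json.dumps(term, sort_keys=True)

def embed(text: str) -> list[float]:
    """Toy 26-dim letter-frequency vector; a real system calls an embedding model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

index: dict[str, list[float]] = {}  # term ID -> embedding vector

def add_term(term_id: str, term: dict) -> None:
    index[term_id] = embed(serialize_term(term))

add_term("MitralCell", {
    "label": "mitral cell",
    "definition": "A neuron located in the olfactory bulb.",
    "relationships": [{"predicate": "subClassOf", "target": "Neuron"}],
})
```

In the real system, the similarity search over these vectors is what ChromaDB provides.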
To avoid confusing LLMs with non-semantic numeric identifiers (like CL:1001502), DRAGON-AI converts them to camel case format (MitralCell) that resembles natural language.
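A minimal sketch of that conversion — the helper name and hyphen handling are assumptions, and the mapping necessarily starts from the term's label, since the numeric ID itself carries no words:

```python
def to_camel_label(label: str) -> str:
    """Turn a term label such as 'mitral cell' into a CamelCase token."""
    return "".join(part.capitalize() for part in label.replace("-", " ").split())

print(to_camel_label("mitral cell"))  # → MitralCell
```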
RAG-Based Prompt Generation
When given a partial term, DRAGON-AI:
- Creates an embedding from the input
- Retrieves the most similar existing terms using vector search
- Applies Maximal Marginal Relevance to diversify results
- Constructs a prompt with relevant examples in JSON format
- Optionally includes GitHub issues and other contextual sources
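The retrieval-and-prompting steps above can be sketched in pure Python. The MMR weighting, the candidate vectors, and the prompt wording below are illustrative assumptions, not DRAGON-AI's actual code:

```python
import json
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(query_vec, candidates: dict, k: int = 3, lam: float = 0.7) -> list[str]:
    """Maximal Marginal Relevance: greedily pick examples that are similar
    to the query but not redundant with examples already selected."""
    selected: list[str] = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def score(tid: str) -> float:
            relevance = cosine(query_vec, remaining[tid])
            redundancy = max(
                (cosine(remaining[tid], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

def build_prompt(partial_term: dict, examples: list[dict]) -> str:
    """Few-shot prompt: retrieved complete terms as JSON, then the partial term."""
    lines = ["Complete the partial ontology term, following these examples:"]
    lines += [json.dumps(ex) for ex in examples]
    lines.append("Partial term: " + json.dumps(partial_term))
    return "\n".join(lines)

# A low lambda trades pure relevance for diversity among the selected examples.
candidates = {"NeuronA": [1.0, 0.0], "NeuronB": [0.9, 0.1], "GlialCell": [0.5, 0.8]}
print(mmr([1.0, 0.0], candidates, k=2, lam=0.3))  # → ['NeuronA', 'GlialCell']
```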
LLM Processing
The system passes the constructed prompt to an LLM (GPT-4, GPT-3.5-turbo, or open models like nous-hermes-13b), which returns a completed JSON object. Results are parsed and post-processed to remove relationships to non-existent terms.
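A sketch of that post-processing pass — the field names and the `postprocess` helper are assumptions about the JSON shape, not DRAGON-AI's actual schema:

```python
import json

def postprocess(llm_output: str, known_terms: set) -> dict:
    """Parse the LLM's JSON completion and drop any relationship whose
    target term does not exist in the ontology."""
    term = json.loads(llm_output)
    term["relationships"] = [
        rel for rel in term.get("relationships", [])
        if rel["target"] in known_terms
    ]
    return term

completion = json.dumps({
    "label": "mitral cell",
    "definition": "A neuron of the olfactory bulb.",
    "relationships": [
        {"predicate": "subClassOf", "target": "Neuron"},
        {"predicate": "partOf", "target": "NonexistentStructure"},  # hallucinated
    ],
})
clean = postprocess(completion, known_terms={"Neuron", "OlfactoryBulb"})
print(clean["relationships"])  # only the subClassOf Neuron edge survives
```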
Performance Results
The researchers evaluated DRAGON-AI across ten diverse ontologies including the Gene Ontology, Cell Ontology, and Human Phenotype Ontology.
Relationship Generation
DRAGON-AI achieved high precision for relationship prediction:
- GPT-4: 89.4% precision, 50% recall for subclass relationships
- GPT-3.5-turbo: 84.6% precision, 41.9% recall
- Performance exceeded OWL reasoning for recall and F1 scores
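Since F1 is the harmonic mean of precision and recall, the reported GPT-4 subclass numbers imply an F1 of roughly 0.64 (arithmetic from the figures above, not a value quoted from the paper):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.894, 0.50), 3))  # → 0.641
```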
Definition Quality
Expert evaluators scored AI-generated definitions on biological accuracy, consistency, and overall utility using a 1-5 scale:
- Human-authored: 4.33 accuracy, 4.07 overall score
- GPT-4: 3.97 accuracy, 3.57 overall score
- GPT-3.5-turbo: 4.06 accuracy, 3.63 overall score
AI definitions scored above the “acceptable” threshold of 3.0 but remained statistically lower than human-authored ones.
Expert Confidence Matters
A crucial finding emerged: evaluators who were more confident in their domain knowledge were better at detecting flaws in AI-generated definitions. Less confident evaluators were more likely to accept AI definitions at face value, suggesting a “gaslighting effect” in which AI output can mislead novice users.
GitHub Integration Improves Results
DRAGON-AI can incorporate GitHub issue trackers where ontology change requests are discussed. Including this contextual information improved definition quality scores:
- GPT-4 with GitHub: 4.24 accuracy vs 4.04 without
- GPT-3.5-turbo with GitHub: 4.18 accuracy vs 4.07 without
Implementation Considerations
Avoiding Training Data Contamination
To prevent test data leakage, researchers used only ontology terms added after the LLM training cutoff dates. This limited test set sizes but ensured valid evaluation.
Integration Workflows
The researchers envision several integration approaches:
- Plugin for Protégé: AI-powered autocompletion similar to GitHub Copilot
- Tabular editing: Integration with ROBOT templates and spreadsheet workflows
- Conversational interfaces: Text-based interaction for high-level requirement specification
Key Takeaways
DRAGON-AI demonstrates that LLMs can meaningfully assist ontology construction with proper RAG implementation. The system achieves:
- High-precision relationship generation (though with moderate recall)
- Acceptable definition quality that approaches human standards
- Ability to leverage diverse knowledge sources including GitHub discussions
However, expert oversight remains essential. The system works best as an autocomplete tool for experienced ontology editors rather than a replacement for human expertise.
The “gaslighting effect” where AI misleads less experienced users underscores the importance of having knowledgeable domain experts drive the ontology generation process, using AI as a productivity enhancement rather than a substitute for human judgment.