DRAGON-AI: Using Large Language Models and RAG for Automated Ontology Generation
Ontologies power critical biomedical databases and research platforms, but creating and maintaining them demands enormous human effort. Researchers have developed DRAGON-AI, a system that uses Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) to automate ontology construction tasks.
The Ontology Construction Challenge
Building ontologies requires domain experts, curators, and ontology editors working together to create structured knowledge representations. Each ontology term needs several components: a unique identifier, a human-readable label, a textual definition, and logical relationships connecting it to other terms within and across ontologies.
Currently, most ontology editing involves manual entry of this information using tools like Protégé or spreadsheet-based workflows. While some relationships can be automated through logical reasoning, the majority of ontology construction remains manual work.
How DRAGON-AI Works
DRAGON-AI transforms partial ontology terms into complete ones using a multi-step process:
Vector Indexing and Retrieval
The system creates vector embeddings for existing ontology terms by serializing them as JSON objects containing labels, definitions, and relationships. These embeddings are stored in a ChromaDB database for efficient similarity searches.
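As a sketch of this indexing step — the term schema and the toy letter-frequency `embed` function below are illustrative assumptions; DRAGON-AI embeds the serialized JSON with a real embedding model and stores the vectors in ChromaDB:

```python
import json

def serialize_term(term: dict) -> str:
    """Serialize a term (label, definition, relationships) as one JSON string."""
    return json.dumps(term, sort_keys=True)

def embed(text: str) -> list[float]:
    """Toy 26-dim letter-frequency vector; a real system calls an embedding model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

index: dict[str, list[float]] = {}  # term ID -> embedding vector

def add_term(term_id: str, term: dict) -> None:
    index[term_id] = embed(serialize_term(term))

add_term("MitralCell", {
    "label": "mitral cell",
    "definition": "A neuron located in the olfactory bulb.",
    "relationships": [{"predicate": "subClassOf", "target": "Neuron"}],
})
```

In the real system, the similarity search over these vectors is what ChromaDB provides.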
To avoid confusing LLMs with non-semantic numeric identifiers (like CL:1001502), DRAGON-AI converts them to camel case format (MitralCell) that resembles natural language.
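A minimal sketch of that conversion — the helper name and hyphen handling are assumptions, and the mapping necessarily starts from the term's label, since the numeric ID itself carries no words:

```python
def to_camel_label(label: str) -> str:
    """Turn a term label such as 'mitral cell' into a CamelCase token."""
    return "".join(part.capitalize() for part in label.replace("-", " ").split())

print(to_camel_label("mitral cell"))  # → MitralCell
```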
RAG-Based Prompt Generation
When given a partial term, DRAGON-AI:
- Creates an embedding from the input
- Retrieves the most similar existing terms using vector search
- Applies Maximal Marginal Relevance to diversify results
- Constructs a prompt with relevant examples in JSON format
- Optionally includes GitHub issues and other contextual sources
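The retrieval-and-prompting steps above can be sketched in pure Python. The MMR weighting, the candidate vectors, and the prompt wording below are illustrative assumptions, not DRAGON-AI's actual code:

```python
import json
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(query_vec, candidates: dict, k: int = 3, lam: float = 0.7) -> list[str]:
    """Maximal Marginal Relevance: greedily pick examples that are similar
    to the query but not redundant with examples already selected."""
    selected: list[str] = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def score(tid: str) -> float:
            relevance = cosine(query_vec, remaining[tid])
            redundancy = max(
                (cosine(remaining[tid], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

def build_prompt(partial_term: dict, examples: list[dict]) -> str:
    """Few-shot prompt: retrieved complete terms as JSON, then the partial term."""
    lines = ["Complete the partial ontology term, following these examples:"]
    lines += [json.dumps(ex) for ex in examples]
    lines.append("Partial term: " + json.dumps(partial_term))
    return "\n".join(lines)

# A low lambda trades pure relevance for diversity among the selected examples.
candidates = {"NeuronA": [1.0, 0.0], "NeuronB": [0.9, 0.1], "GlialCell": [0.5, 0.8]}
print(mmr([1.0, 0.0], candidates, k=2, lam=0.3))  # → ['NeuronA', 'GlialCell']
```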
LLM Processing
The system passes the constructed prompt to an LLM (GPT-4, GPT-3.5-turbo, or open models like nous-hermes-13b), which returns a completed JSON object. Results are parsed and post-processed to remove relationships to non-existent terms.
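A sketch of that post-processing pass — the field names and the `postprocess` helper are assumptions about the JSON shape, not DRAGON-AI's actual schema:

```python
import json

def postprocess(llm_output: str, known_terms: set) -> dict:
    """Parse the LLM's JSON completion and drop any relationship whose
    target term does not exist in the ontology."""
    term = json.loads(llm_output)
    term["relationships"] = [
        rel for rel in term.get("relationships", [])
        if rel["target"] in known_terms
    ]
    return term

completion = json.dumps({
    "label": "mitral cell",
    "definition": "A neuron of the olfactory bulb.",
    "relationships": [
        {"predicate": "subClassOf", "target": "Neuron"},
        {"predicate": "partOf", "target": "NonexistentStructure"},  # hallucinated
    ],
})
clean = postprocess(completion, known_terms={"Neuron", "OlfactoryBulb"})
print(clean["relationships"])  # only the subClassOf Neuron edge survives
```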
Performance Results
The researchers evaluated DRAGON-AI across ten diverse ontologies including the Gene Ontology, Cell Ontology, and Human Phenotype Ontology.
Relationship Generation
DRAGON-AI achieved high precision for relationship prediction:
- GPT-4: 89.4% precision, 50% recall for subclass relationships
- GPT-3.5-turbo: 84.6% precision, 41.9% recall
- Performance exceeded OWL reasoning for recall and F1 scores
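Since F1 is the harmonic mean of precision and recall, the reported GPT-4 subclass numbers imply an F1 of roughly 0.64 (arithmetic from the figures above, not a value quoted from the paper):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.894, 0.50), 3))  # → 0.641
```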
Definition Quality
Expert evaluators scored AI-generated definitions on biological accuracy, consistency, and overall utility using a 1-5 scale:
- Human-authored: 4.33 accuracy, 4.07 overall score
- GPT-4: 3.97 accuracy, 3.57 overall score
- GPT-3.5-turbo: 4.06 accuracy, 3.63 overall score
AI definitions scored above the “acceptable” threshold of 3.0 but remained statistically lower than human-authored ones.
Expert Confidence Matters
A crucial finding emerged: evaluators who were more confident in their domain knowledge were better at detecting flaws in AI-generated definitions. Less confident evaluators were more likely to accept AI definitions at face value, suggesting a “gaslighting effect” in which AI output can mislead novice users.
GitHub Integration Improves Results
DRAGON-AI can incorporate GitHub issue trackers where ontology change requests are discussed. Including this contextual information improved definition quality scores:
- GPT-4 with GitHub: 4.24 accuracy vs 4.04 without
- GPT-3.5-turbo with GitHub: 4.18 accuracy vs 4.07 without
Implementation Considerations
Avoiding Training Data Contamination
To prevent test data leakage, researchers used only ontology terms added after the LLM training cutoff dates. This limited test set sizes but ensured valid evaluation.
Integration Workflows
The researchers envision several integration approaches:
- Plugin for Protégé: AI-powered autocompletion similar to GitHub Copilot
- Tabular editing: Integration with ROBOT templates and spreadsheet workflows
- Conversational interfaces: Text-based interaction for high-level requirement specification
Key Takeaways
DRAGON-AI demonstrates that LLMs can meaningfully assist ontology construction with proper RAG implementation. The system achieves:
- High-precision relationship generation (though with moderate recall)
- Acceptable definition quality that approaches human standards
- Ability to leverage diverse knowledge sources including GitHub discussions
However, expert oversight remains essential. The system works best as an autocomplete tool for experienced ontology editors rather than a replacement for human expertise.
The “gaslighting effect” where AI misleads less experienced users underscores the importance of having knowledgeable domain experts drive the ontology generation process, using AI as a productivity enhancement rather than a substitute for human judgment.