SpeCrawler: Automated OpenAPI Specification Generation from API Documentation Using Large Language Models

API documentation varies widely in structure from one website to the next, which makes it hard to produce standardized OpenAPI Specifications (OAS) at scale. SpeCrawler addresses this by combining rule-based algorithms with large language models to generate OpenAPI Specifications automatically from diverse API documentation.

The Problem with Manual OpenAPI Creation

Creating OpenAPI Specifications manually requires significant expert effort and attention to detail. Developers face several challenges:

Structural diversity: API documentation websites use different formats and layouts
Scattered information: Key API components spread throughout documentation pages
Missing details: Critical information often absent or fragmented
Time-intensive process: Manual creation doesn’t scale across numerous APIs

Previous rule-based approaches failed to generalize across diverse documentation structures, while simple LLM approaches struggled with the complexity of generating complete specifications.

SpeCrawler’s Multi-Stage Approach

SpeCrawler breaks down OAS generation into three manageable stages:

1. Intelligent Scraping

The system first extracts request-response pairs from API documentation:

Identifies HTML elements containing cURL commands and JSON responses
Uses heuristic techniques to activate dynamic content
Aligns request examples with corresponding responses using HTML DOM tree navigation
Handles multiple APIs on single documentation pages
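The pairing step above can be illustrated with a minimal sketch. It is not SpeCrawler's actual scraper: it assumes examples live in <pre> or <code> elements, and it replaces DOM tree navigation with a simpler heuristic (each cURL block is paired with the next JSON block in document order). The names CodeBlockCollector, parse_json_example, and pair_requests_with_responses are hypothetical.

```python
import json
from html.parser import HTMLParser


class CodeBlockCollector(HTMLParser):
    """Collects the text of each outermost <pre>/<code> element in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished code-block texts
        self._depth = 0     # nesting depth inside <pre>/<code>
        self._buffer = []   # text pieces of the block being read

    def handle_starttag(self, tag, attrs):
        if tag in ("pre", "code"):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in ("pre", "code") and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def parse_json_example(text):
    """Return the parsed object/array if text is a JSON example, else None."""
    try:
        value = json.loads(text)
    except ValueError:
        return None
    return value if isinstance(value, (dict, list)) else None


def pair_requests_with_responses(html):
    """Pair each cURL block with the next JSON block that follows it (assumed heuristic)."""
    collector = CodeBlockCollector()
    collector.feed(html)
    collector.close()

    pairs, pending_request = [], None
    for block in collector.blocks:
        if block.startswith("curl "):
            pending_request = block
            continue
        response = parse_json_example(block)
        if pending_request is not None and response is not None:
            pairs.append((pending_request, response))
            pending_request = None
    return pairs
```

Because the pending request is reset after every match, a page that documents several endpoints yields one pair per endpoint. Pages that render examples only after a click would still need the dynamic-content step, which this sketch does not cover.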

2. Base OAS Generation

Rather than generating entire specifications at once, SpeCrawler creates a foundational structure:

Generates skeleton OAS from request examples
Embeds crucial metadata (servers, communication methods, security protocols)
Creates JSON schemas from request/response examples
Fragments large objects into manageable segments to stay within LLM context limits
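As a rough illustration of the first three steps, the sketch below builds a skeleton OAS from one cURL example and infers JSON schemas from the request and response examples. It is an assumption-laden simplification rather than SpeCrawler's implementation: it handles only a few common cURL flags, emits no security section, does not fragment large objects, and the function names infer_schema and skeleton_from_curl are hypothetical.

```python
import json
import shlex
from urllib.parse import urlsplit


def infer_schema(value):
    """Infer a minimal JSON schema from a single example value."""
    if isinstance(value, dict):
        return {"type": "object",
                "properties": {k: infer_schema(v) for k, v in value.items()}}
    if isinstance(value, list):
        # Only the first element is inspected; an empty list gives an open schema.
        return {"type": "array", "items": infer_schema(value[0]) if value else {}}
    if isinstance(value, bool):  # checked before int because bool subclasses int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if value is None:
        return {}  # the example alone does not reveal the type
    return {"type": "string"}


def skeleton_from_curl(curl_command, response_example):
    """Build a one-operation OpenAPI 3.0 skeleton from a cURL example."""
    tokens = shlex.split(curl_command)
    method, url, body = None, None, None
    i = 1  # skip the leading "curl"
    while i < len(tokens):
        token = tokens[i]
        has_value = i + 1 < len(tokens)
        if token in ("-X", "--request") and has_value:
            method, i = tokens[i + 1].lower(), i + 2
        elif token in ("-d", "--data", "--data-raw") and has_value:
            body, i = tokens[i + 1], i + 2
        elif token in ("-H", "--header") and has_value:
            i += 2  # headers are ignored in this sketch
        else:
            if token.startswith(("http://", "https://")):
                url = token
            i += 1
    if url is None:
        raise ValueError("no URL found in cURL command")
    if method is None:
        method = "post" if body is not None else "get"

    parts = urlsplit(url)
    operation = {"responses": {"200": {
        "description": "Example response",
        "content": {"application/json": {"schema": infer_schema(response_example)}},
    }}}
    if body is not None:
        operation["requestBody"] = {"content": {"application/json": {
            "schema": infer_schema(json.loads(body))}}}
    return {
        "openapi": "3.0.3",
        "info": {"title": parts.netloc, "version": "0.0.0"},  # placeholder metadata
        "servers": [{"url": f"{parts.scheme}://{parts.netloc}"}],
        "paths": {parts.path or "/": {method: operation}},
    }
```

Building this part deterministically leaves the language model with a smaller job: it only has to fill in what the examples cannot express, such as descriptions.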

3. Enrichment with Reference Documentation

The system enhances base specifications with descriptive information:

Automatically finds essential HTML elements containing parameter descriptions
Uses semantic signals rather than specific HTML syntax
Generates structured data (TSV for requests, OpenAPI schemas for responses)
Employs in-context learning with manually labeled examples
Validates output and removes hallucinations
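To show why a flat TSV format is convenient for request parameters, here is a small sketch that converts model-generated TSV into OpenAPI parameter objects. The column layout (name, type, required, description) is an assumption made for this example; the article does not state which columns SpeCrawler uses, and tsv_to_parameters is a hypothetical helper.

```python
import csv
import io


def tsv_to_parameters(tsv_text, location="query"):
    """Convert TSV rows (assumed columns: name, type, required, description)
    into OpenAPI parameter objects."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in descriptions literally
    )
    parameters = []
    for row in reader:
        name = (row.get("name") or "").strip()
        if not name:
            continue  # skip malformed rows that lack a parameter name
        parameters.append({
            "name": name,
            "in": location,
            "required": (row.get("required") or "").strip().lower() == "true",
            "description": (row.get("description") or "").strip(),
            "schema": {"type": (row.get("type") or "").strip() or "string"},
        })
    return parameters
```

A row-per-parameter format is easy to parse and easy to reject line by line when the model produces something malformed, which is harder to do with a partially broken nested JSON document.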

Performance Results

SpeCrawler was evaluated stage by stage and end to end:

Base OAS Generation: IBM’s Granite and CodeLlama models achieved the highest valid OAS ratios (73% and 89% respectively) with minimal warnings.

Enrichment Generation: Models successfully retrieved parameter names with high F1 scores (0.94-0.99 for requests, 0.76-0.97 for responses).

End-to-End Comparison: Compared against GPT-4 Turbo and ActionGPT, SpeCrawler reached 54% precision and 47% recall, versus 18% precision and 5% recall for GPT-4.
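For readers who want to reproduce this kind of measurement on their own documentation, the sketch below computes precision, recall, and F1 over sets of parameter names. It uses the standard set-based definitions; the exact matching rules behind the reported figures are not described here, so treat this as an assumption about how such scores are typically computed.

```python
def precision_recall_f1(predicted, expected):
    """Score predicted parameter names against a ground-truth set."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, predicting four names of which three appear in a five-name ground truth gives a precision of 3/4 and a recall of 3/5.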

Key Technical Innovations

SpeCrawler’s effectiveness stems from several design decisions:

Task decomposition: Breaking complex OAS generation into subtasks
Context management: Filtering documentation to optimize signal-to-noise ratio
Adaptive output formats: Using TSV for flat request parameters and JSON schemas for nested responses
Validation pipeline: Removing hallucinations by verifying parameter names appear in input
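The validation idea in the last item is simple enough to sketch directly: keep a generated parameter only if its name occurs in the documentation text that was given to the model. The whole-word, case-insensitive matching rule and the name drop_hallucinated are assumptions for illustration; the article does not specify how SpeCrawler performs the match.

```python
import re


def drop_hallucinated(parameters, source_text):
    """Split parameters into those whose name appears in source_text and those that do not."""
    kept, dropped = [], []
    for param in parameters:
        # Whole-word match, so that "id" is not accepted because of "valid".
        pattern = r"(?<!\w)" + re.escape(param["name"]) + r"(?!\w)"
        if re.search(pattern, source_text, flags=re.IGNORECASE):
            kept.append(param)
        else:
            dropped.append(param)
    return kept, dropped
```

Returning the dropped entries as well makes it possible to log them and to check how often a given model invents parameters.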

Implementation Considerations

When implementing SpeCrawler:

Choose appropriate LLMs based on your specific requirements (Granite excelled overall)
Customize output formats for different models
Implement robust validation to catch hallucinations
Consider the trade-offs between request and response enrichment complexity

Next Steps

SpeCrawler significantly reduces manual effort in creating OpenAPI Specifications while handling diverse documentation structures. The system opens possibilities for:

Large-scale API specification generation
Enhanced tool integration with LLMs
Streamlined API orchestration systems

Start by evaluating SpeCrawler on your API documentation to assess its effectiveness for your specific use cases and documentation formats.
