SpeCrawler: Automated OpenAPI Specification Generation from API Documentation Using Large Language Models

API documentation varies widely in structure from one website to the next, which makes it difficult to create standardized OpenAPI Specifications (OAS) at scale. SpeCrawler addresses this problem by combining rule-based algorithms with large language models (LLMs) to generate OpenAPI Specifications automatically from diverse API documentation.

The Problem with Manual OpenAPI Creation

Creating OpenAPI Specifications manually requires significant expert effort and attention to detail. Developers face several challenges:

  • Structural diversity: API documentation websites use different formats and layouts
  • Scattered information: Key API components are spread throughout documentation pages
  • Missing details: Critical information is often absent or fragmented
  • Time-intensive process: Manual creation doesn’t scale across numerous APIs

Previous rule-based approaches failed to generalize across diverse documentation structures, while simple LLM approaches struggled with the complexity of generating complete specifications.

SpeCrawler’s Multi-Stage Approach

SpeCrawler breaks down OAS generation into three manageable stages:

1. Intelligent Scraping

The system first extracts request-response pairs from API documentation:

  • Identifies HTML elements containing cURL commands and JSON responses
  • Uses heuristic techniques to activate dynamic content
  • Aligns request examples with corresponding responses using HTML DOM tree navigation
  • Handles multiple APIs on single documentation pages
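
The pairing step can be illustrated with a minimal sketch. It assumes that request and response examples live in <pre> blocks and that a response follows its request in document order; real pages need the DOM-aware alignment and dynamic-content handling described above, which this sketch does not attempt. The class and function names (CodeBlockCollector, pair_requests_with_responses) and the sample page are hypothetical.

```python
import json
from html.parser import HTMLParser


class CodeBlockCollector(HTMLParser):
    """Collect the text of every <pre> block in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._depth = 0      # > 0 while inside a <pre> element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def is_json(text):
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def pair_requests_with_responses(html):
    """Pair each cURL block with the first JSON block that follows it."""
    collector = CodeBlockCollector()
    collector.feed(html)
    collector.close()
    pairs, pending_request = [], None
    for block in collector.blocks:
        if block.startswith("curl "):
            pending_request = block
        elif pending_request is not None and is_json(block):
            pairs.append((pending_request, json.loads(block)))
            pending_request = None
    return pairs


# Hypothetical page documenting two APIs.
SAMPLE_PAGE = """
<h2>List users</h2>
<pre>curl -X GET "https://api.example.com/v1/users"</pre>
<pre>{"users": [{"id": 1, "name": "Ada"}]}</pre>
<h2>Create user</h2>
<pre>curl -X POST "https://api.example.com/v1/users" -d '{"name": "Ada"}'</pre>
<pre>{"id": 2, "name": "Ada"}</pre>
"""

pairs = pair_requests_with_responses(SAMPLE_PAGE)
```

Because each request is matched only with the next JSON block, a page that documents several APIs yields one pair per API, as long as the examples appear in request-then-response order.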

2. Base OAS Generation

Rather than generating entire specifications at once, SpeCrawler creates a foundational structure:

  • Generates skeleton OAS from request examples
  • Embeds crucial metadata (servers, communication methods, security protocols)
  • Creates JSON schemas from request/response examples
  • Fragments large objects into manageable segments to stay within LLM context limits
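
As a rough illustration of this stage, the sketch below turns one cURL example and one JSON response into a skeleton specification. It is an assumption-laden simplification, not SpeCrawler’s implementation: the helper names (infer_schema, skeleton_from_curl) are hypothetical, only a few cURL flags are recognized, the response is filed under status 200, and security metadata and object fragmentation are left out.

```python
import shlex
from urllib.parse import urlsplit


def infer_schema(value):
    """Infer a minimal JSON Schema from a single example value."""
    if isinstance(value, bool):      # bool is a subclass of int, so check it first
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if isinstance(value, list):
        # Assumption: the first element is representative of the whole array.
        return {"type": "array", "items": infer_schema(value[0]) if value else {}}
    if isinstance(value, dict):
        return {"type": "object",
                "properties": {k: infer_schema(v) for k, v in value.items()}}
    return {}                        # null carries no type information


def skeleton_from_curl(curl_command, response_example):
    """Build a skeleton OpenAPI document from one request/response pair."""
    tokens = shlex.split(curl_command)
    method, url, has_body = None, None, False
    i = 1                            # skip the leading "curl"
    while i < len(tokens):
        token = tokens[i]
        if token in ("-X", "--request") and i + 1 < len(tokens):
            method = tokens[i + 1].lower()
            i += 2
        elif token in ("-d", "--data", "--data-raw") and i + 1 < len(tokens):
            has_body = True
            i += 2
        elif token in ("-H", "--header") and i + 1 < len(tokens):
            i += 2                   # headers are ignored in this sketch
        else:
            if token.startswith(("http://", "https://")):
                url = token
            i += 1
    if url is None:
        raise ValueError("no URL found in cURL command")
    if method is None:
        method = "post" if has_body else "get"   # mirrors curl's default method
    parts = urlsplit(url)
    return {
        "openapi": "3.0.3",
        "info": {"title": parts.netloc, "version": "0.0.0"},  # placeholder metadata
        "servers": [{"url": f"{parts.scheme}://{parts.netloc}"}],
        "paths": {
            parts.path or "/": {
                method: {
                    "responses": {
                        "200": {     # assumption: the example is a success response
                            "description": "Example response",
                            "content": {"application/json": {
                                "schema": infer_schema(response_example)}},
                        }
                    }
                }
            }
        },
    }


oas = skeleton_from_curl(
    'curl -X GET "https://api.example.com/v1/users"',
    {"users": [{"id": 1, "name": "Ada"}]},
)
```

Everything in this skeleton comes from deterministic parsing of the examples, which is why the descriptive fields are left for the enrichment stage.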

3. Enrichment with Reference Documentation

The system enhances base specifications with descriptive information:

  • Automatically finds essential HTML elements containing parameter descriptions
  • Uses semantic signals rather than specific HTML syntax
  • Generates structured data (TSV for requests, OpenAPI schemas for responses)
  • Employs in-context learning with manually labeled examples
  • Validates output and removes hallucinations
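
The validation step lends itself to a short sketch. It assumes the LLM returns request parameters as TSV with a header row containing a name column; the exact columns, the function names and the sample strings are hypothetical. A row is kept only if its parameter name appears in the documentation text that was given to the model.

```python
import csv
import io
import re


def parse_parameter_tsv(tsv_text):
    """Parse TSV model output (with a header row) into a list of dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


def appears_in(name, text):
    """True if name occurs in text as a whole token, not inside a longer word."""
    pattern = rf"(?<!\w){re.escape(name)}(?!\w)"
    return re.search(pattern, text) is not None


def drop_hallucinated(rows, documentation_text):
    """Keep only rows whose parameter name appears in the input documentation."""
    kept = []
    for row in rows:
        name = (row.get("name") or "").strip()
        if name and appears_in(name, documentation_text):
            kept.append(row)
    return kept


# Hypothetical documentation fragment and model output.
DOC_TEXT = (
    "limit (integer, optional): Maximum number of users to return. "
    "offset (integer): Number of users to skip."
)
LLM_OUTPUT = (
    "name\ttype\tdescription\n"
    "limit\tinteger\tMaximum number of users to return.\n"
    "offset\tinteger\tNumber of users to skip.\n"
    "sort_by\tstring\tField used to sort results.\n"
)

validated = drop_hallucinated(parse_parameter_tsv(LLM_OUTPUT), DOC_TEXT)
kept_names = [row["name"] for row in validated]
```

Here sort_by is dropped because it never occurs in the documentation fragment. The whole-token match is a deliberate choice in this sketch: a plain substring test would accept a short name such as id whenever it appears inside an unrelated word.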

Performance Results

The evaluation reports results for each generation stage and for the full pipeline:

Base OAS Generation: IBM’s Granite and CodeLlama models achieved the highest valid OAS ratios (73% and 89%, respectively) with minimal warnings.

Enrichment Generation: Models successfully retrieved parameter names with high F1 scores (0.94-0.99 for requests, 0.76-0.97 for responses).

End-to-End Comparison: SpeCrawler reached 54% precision and 47% recall, outperforming both GPT-4 Turbo and ActionGPT; GPT-4 Turbo reached 18% precision and 5% recall.
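
For readers who want to run a similar comparison on their own specifications, the sketch below computes precision, recall and F1 over sets of parameter names. It assumes exact string matching between generated and reference names; the matching rules behind the figures above are not restated here, and the example names are made up.

```python
def precision_recall_f1(predicted, reference):
    """Score predicted parameter names against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 2 of 3 generated names are correct,
# and 2 of 4 reference names were found.
generated_names = ["limit", "offset", "sort_by"]
reference_names = ["limit", "offset", "cursor", "fields"]

precision, recall, f1 = precision_recall_f1(generated_names, reference_names)
```

In this example precision is 2/3, recall is 1/2 and F1 is 4/7, which shows how a model can look precise while still missing half of the documented parameters.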

Key Technical Innovations

SpeCrawler’s effectiveness stems from several design decisions:

  • Task decomposition: Breaking complex OAS generation into subtasks
  • Context management: Filtering documentation to optimize signal-to-noise ratio
  • Adaptive output formats: Using TSV for flat request parameters and JSON schemas for nested responses
  • Validation pipeline: Removing hallucinations by verifying parameter names appear in input
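
One way to picture the context management idea is a filter that keeps only the documentation fragments likely to describe known parameters. The sketch below is an assumption about how such a filter could work, not the filtering SpeCrawler performs: it ranks text blocks by how many parameter names from the base OAS they mention and keeps the best ones within a character budget. The function names, sample blocks and budget are hypothetical.

```python
import re


def mentions(name, text):
    """True if name occurs in text as a whole token."""
    return re.search(rf"(?<!\w){re.escape(name)}(?!\w)", text) is not None


def select_relevant_blocks(blocks, parameter_names, max_chars):
    """Keep the blocks that mention known parameters, within a size budget."""
    scored = []
    for position, block in enumerate(blocks):
        hits = sum(mentions(name, block) for name in parameter_names)
        if hits:
            scored.append((hits, position, block))
    # Most mentions first; ties are resolved by document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    selected, used = [], 0
    for hits, position, block in scored:
        if used + len(block) <= max_chars:
            selected.append((position, block))
            used += len(block)
    # Restore document order so the prompt reads like the original page.
    return [block for position, block in sorted(selected)]


# Hypothetical text blocks extracted from a documentation page.
blocks = [
    "Sign up for our newsletter to get product updates.",
    "limit: Maximum number of users to return. offset: Number of users to skip.",
    "All rights reserved.",
    "name: Display name of the user.",
]

context = select_relevant_blocks(blocks, ["limit", "offset", "name"], max_chars=200)
```

Only the two blocks that describe parameters survive, so the prompt carries less navigation and marketing text. A production filter would probably need richer signals than name matching, since descriptions do not always repeat the parameter name verbatim.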

Implementation Considerations

When applying SpeCrawler’s approach to your own documentation:

  • Choose appropriate LLMs based on your specific requirements (Granite excelled overall)
  • Customize output formats for different models
  • Implement robust validation to catch hallucinations
  • Consider the trade-offs between request and response enrichment complexity

Next Steps

SpeCrawler significantly reduces manual effort in creating OpenAPI Specifications while handling diverse documentation structures. The system opens possibilities for:

  • Large-scale API specification generation
  • Enhanced tool integration with LLMs
  • Streamlined API orchestration systems

Start by evaluating SpeCrawler on your API documentation to assess its effectiveness for your specific use cases and documentation formats.