
GOT-OCR2.0: General OCR Theory Towards OCR-2.0 with Unified End-to-End Model

GOT-OCR2.0 is a unified end-to-end OCR model with 580M parameters that handles plain text, mathematical expressions, tables, charts, and sheet music in both scene and document images.


Optical Character Recognition has been a solved problem for decades – for clean scanned documents with straightforward text. But the real world of visual content is far messier and more diverse. Mathematical equations with complex notation, tables with irregular cell structures, musical scores with specialized symbols, and scene text on signs and labels all defy traditional OCR approaches that assume clean, linear text on uniform backgrounds.

GOT-OCR2.0 (General OCR Theory, version 2.0), developed by Haoran Wei and collaborators and published under the Ucas-HaoranWei GitHub organization, represents a paradigm shift toward what the authors call OCR-2.0. Instead of the traditional pipeline of detection, segmentation, and recognition modules strung together, GOT-OCR2.0 is a single end-to-end model with 580 million parameters that directly maps image pixels to structured text output.

The model’s unified architecture allows it to handle an extraordinary range of content types. The same model that transcribes a printed page of English text can parse a LaTeX mathematical expression, extract data from a complex HTML table, identify notes on a sheet of music, or read text from a photograph of a street sign. This versatility is achieved without task-specific fine-tuning – the model learns to recognize the content type from the visual features of the input image itself.


How Does GOT-OCR2.0’s End-to-End Architecture Work?

Unlike traditional OCR pipelines, GOT-OCR2.0 uses a single encoder-decoder transformer architecture.

```mermaid
flowchart LR
    A[Input Image\nScene or Document] --> B[Vision Encoder\nViT-Based Backbone]
    B --> C[Cross-Modal\nAttention]
    C --> D[Text Decoder\nAutoregressive Transformer]

    D --> E{Content Type\nClassification}
    E -->|Plain Text| F[Markdown String]
    E -->|Math Formula| G[LaTeX Expression]
    E -->|Table| H[HTML Table Structure]
    E -->|Sheet Music| I[MusicXML / ABC Notation]
    E -->|Chart| J[Text + Data Points]

    F --> K[Structured Output]
    G --> K
    H --> K
    I --> K
    J --> K
```

The vision encoder processes the input image into feature representations, which are then decoded by an autoregressive text decoder that produces the output token by token. The decoder learns to switch between output formats based on the visual content it sees, outputting LaTeX for mathematical regions, HTML for tables, or plain text for standard paragraphs.
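The format switching described above can be pictured with a toy sketch. Everything here is illustrative: the real model learns the switch implicitly from visual features during training, and these function names, prefix tokens, and the explicit content-type argument are hypothetical, not part of GOT-OCR2.0's actual API.

```python
# Toy illustration of a decoder that wraps its token stream in the
# output format appropriate to the recognized content type.
# (Hypothetical sketch; GOT-OCR2.0 learns this behavior end to end.)

FORMAT_WRAPPERS = {
    "math": (r"\(", r"\)"),        # LaTeX delimiters
    "table": ("<table>", "</table>"),  # HTML table structure
    "text": ("", ""),              # plain text needs no wrapper
}

def decode(content_type: str, tokens: list[str]) -> str:
    """Emit tokens one at a time, wrapped in the target format."""
    prefix, suffix = FORMAT_WRAPPERS[content_type]
    out = [prefix]
    for tok in tokens:             # one token per autoregressive step
        out.append(tok)
    out.append(suffix)
    return "".join(out)

print(decode("math", ["E", "=", "m", "c", "^", "2"]))   # \(E=mc^2\)
print(decode("text", ["hello"]))                        # hello
```

In the real model there is no separate classification argument: the same decoder weights produce LaTeX, HTML, or plain text depending on what the vision encoder saw.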


What Content Types and Performance Metrics Does GOT-OCR2.0 Support?

The model’s scope of supported content types is remarkably broad compared to traditional OCR systems.

| Content Type | Output Format | Typical Accuracy | Traditional OCR Handling |
|---|---|---|---|
| Printed Text | Markdown string | >98% character accuracy | Well supported |
| Mathematical Formulas | LaTeX | >90% expression accuracy | Requires separate Math OCR |
| Tables | HTML + CSS | >85% cell-level accuracy | Requires table detection |
| Sheet Music | ABC notation | >80% note accuracy | Specialized OMR required |
| Scene Text | Plain text | >92% recognition | Requires scene text detector |
| Charts & Figures | Text + data values | >88% key-value accuracy | Not typically supported |

The unified approach eliminates the compounding errors that plague traditional OCR pipelines, where mistakes in the detection stage propagate through recognition and post-processing. A single end-to-end model optimizes for the final output quality directly.
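Because each content type arrives as a standard structured format, downstream code can consume the output with ordinary tooling. As a sketch, an HTML table transcription can be parsed into rows with Python's standard library alone (the sample string below is illustrative, not real model output):

```python
# Parse a GOT-style HTML table transcription into rows of cell text
# using only the standard library.
from html.parser import HTMLParser

class TableReader(HTMLParser):
    """Collect <td>/<th> cell text into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

sample = ("<table><tr><th>Item</th><th>Qty</th></tr>"
          "<tr><td>Pens</td><td>3</td></tr></table>")
reader = TableReader()
reader.feed(sample)
print(reader.rows)   # [['Item', 'Qty'], ['Pens', '3']]
```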


What Is the Installation and Setup Process?

GOT-OCR2.0 uses standard deep learning tooling and is straightforward to set up.

| Component | Requirement | Notes |
|---|---|---|
| Python | 3.9+ | Core runtime |
| PyTorch | 2.0+ | Deep learning framework |
| Transformers | 4.35+ | HuggingFace model loading |
| GPU Memory | 6GB+ (FP16) | 580M parameter model |
| Model Weights | Auto-downloaded | Hosted on HuggingFace |

The model supports FP16 inference to reduce memory requirements, making it feasible to run on consumer GPUs. The 580M parameter size represents a sweet spot between capability and resource requirements – large enough to handle diverse OCR tasks, small enough to deploy on a single GPU.


How Does GOT-OCR2.0 Compare to OCR-1.0 Systems?

The transition from OCR-1.0 to OCR-2.0 represents a fundamental architectural shift.

| Aspect | OCR-1.0 (Traditional) | OCR-2.0 (GOT-OCR2.0) |
|---|---|---|
| Architecture | Multi-module pipeline | Single end-to-end model |
| Text Detection | Separate CNN-based detector | Learned implicitly |
| Character Recognition | Per-character classifier | Autoregressive sequence model |
| Layout Analysis | Separate layout parser | Integrated into decoder |
| Math Recognition | Requires external engine | Native capability |
| Table Recognition | Requires external model | Native capability |
| Error Propagation | Cascading errors | Minimized through joint optimization |

The end-to-end approach also simplifies deployment. Instead of managing and versioning multiple models (detector, recognizer, layout analyzer, math parser), you deploy a single model that handles everything.
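The "cascading errors" contrast can be made concrete with a back-of-the-envelope calculation. If each stage of a three-stage pipeline is independently accurate, the pipeline's accuracy is roughly the product of the stages. The per-stage figures below are hypothetical, chosen only to show the effect:

```python
# Hypothetical error compounding in a three-stage OCR-1.0 pipeline.
# Each stage only sees the previous stage's (possibly wrong) output,
# so per-stage accuracies multiply.

detection, recognition, post_processing = 0.97, 0.95, 0.98

pipeline_accuracy = detection * recognition * post_processing
print(f"pipeline: {pipeline_accuracy:.3f}")   # pipeline: 0.903
```

A single end-to-end model is trained against final-output quality directly, so its accuracy is not bounded by a product of independent per-stage accuracies; three individually strong stages can still lose nearly ten points in combination.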


FAQ

What is GOT-OCR2.0? GOT-OCR2.0 is a unified end-to-end OCR model with 580M parameters that handles multiple content types including plain text, mathematical expressions, tables, charts, and sheet music from both scene and document images.

What content types does GOT-OCR2.0 support? GOT-OCR2.0 supports plain text, LaTeX mathematical expressions, HTML-formatted tables, chart text extraction, musical notation recognition, and document layout-aware transcription.

How do I install GOT-OCR2.0? Install via the GitHub repository. The model requires PyTorch and the HuggingFace Transformers library. Pre-trained weights are downloaded automatically from HuggingFace.

Where are the model weights hosted? GOT-OCR2.0 model weights are hosted on HuggingFace Model Hub and are downloaded automatically when you first run the model. Multiple model sizes may be available for different performance requirements.

What makes GOT-OCR2.0 different from traditional OCR? Unlike traditional OCR systems that use separate detection and recognition modules, GOT-OCR2.0 is a unified end-to-end model that directly maps image pixels to text output, handling diverse content types without specialized sub-modules.

