01. The Scale Problem: Why Traditional Approaches Fail
The AUC Digital Library ingestion project involved hundreds of thousands of scanned documents: academic papers, archival texts, reports, and multi-language content spanning decades. The naive approach — running a single OCR pass over each PDF and storing the raw text — produces a result that is technically complete but practically useless. Raw OCR output from scanned documents is riddled with noise: garbled characters from degraded scans, merged words from tight kerning, and completely lost structure from tables and multi-column layouts.
More fundamentally, a keyword search index over raw OCR text cannot answer the questions researchers actually ask. 'Find all papers that discuss neural network generalisation' requires semantic understanding of content, not exact string matching. The engineering challenge was to build a pipeline that could extract clean, structured, semantically-rich content from heterogeneous documents at scale — and do it automatically, without manual review.
02. Stage 1: Layout Analysis With Florence-2 and YOLO
Before any text is extracted, the pipeline first segments each document page into semantic regions. A YOLO model fine-tuned on document layout datasets detects bounding boxes for figures, tables, headers, footers, captions, and body text blocks. Florence-2, Microsoft's vision foundation model, handles zero-shot region classification and covers the long tail of unusual layouts the YOLO model had not seen in training.
This segmentation step is the most important in the pipeline. Once the page is decomposed into typed regions, each region can be processed with the appropriate extraction tool: OCR for text regions, table parsing algorithms for tabular data, and image captioning models for figures. Attempting to OCR a table as if it were prose, or to parse a figure caption with a table extractor, produces nonsense. The layout layer ensures each region reaches the right downstream processor.
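The routing logic can be sketched as a dispatch table keyed on region type. The region kinds and handler names below are illustrative stand-ins, not the project's actual API; real handlers would invoke OCR, a table parser, and a captioning model.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """A typed page region produced by the layout-analysis stage."""
    kind: str      # e.g. "text", "table", "figure", "footer"
    bbox: tuple    # (x0, y0, x1, y1) in page coordinates

# Placeholder extractors; each returns a tagged string for illustration.
def extract_text(region):   return f"ocr:{region.bbox}"
def extract_table(region):  return f"table:{region.bbox}"
def caption_figure(region): return f"caption:{region.bbox}"

DISPATCH = {
    "text": extract_text,
    "table": extract_table,
    "figure": caption_figure,
}

def process_page(regions):
    # Route each typed region to its dedicated extractor; region types
    # with no content handler (headers, footers) are simply skipped.
    return [DISPATCH[r.kind](r) for r in regions if r.kind in DISPATCH]
```

The table makes the "right downstream processor" guarantee explicit: adding a new region type means adding one handler, without touching the routing code.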
03. Stage 2: Content Extraction and Metadata Enrichment
Text regions are processed through Tesseract OCR with language-specific models (Arabic and English in the AUC case), with a post-processing step that applies language model-based correction to fix common OCR errors — transposed characters, broken hyphenation, and encoding artifacts common in scanned Arabic text.
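A minimal rule-based cleanup pass illustrates the kind of errors the corrector targets; this is a sketch standing in for the language-model-based step, and Arabic text would need its own confusion rules.

```python
import re

def clean_ocr(text: str) -> str:
    """Minimal rule-based OCR cleanup (illustrative, not the LM corrector)."""
    # Re-join words hyphenated across line breaks: "informa-\ntion" -> "information".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Strip a common encoding artifact: the Unicode replacement character.
    text = text.replace("\ufffd", "")
    # Collapse runs of spaces/tabs left behind by multi-column OCR.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```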
Metadata enrichment runs in parallel: NLP models extract named entities (authors, institutions, dates, locations), classify document type (thesis, journal article, report, government document), identify the primary language, and assign subject tags from a controlled vocabulary. This metadata becomes the faceted filter layer in the front-end search interface — enabling users to filter results by document type, date range, subject area, or author without any additional query rewriting.
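Faceted filtering over the enriched records reduces to exact-match predicates on metadata fields. The field names and sample records below are hypothetical, chosen only to show the shape of the filter layer.

```python
def facet_filter(docs, **facets):
    """Return documents whose metadata matches every requested facet.

    `docs` is a list of metadata dicts like those produced by the
    enrichment stage; field names here are illustrative.
    """
    return [
        d for d in docs
        if all(d.get(field) == value for field, value in facets.items())
    ]

catalog = [
    {"title": "Nile Delta survey", "doc_type": "report", "language": "ar"},
    {"title": "GAN robustness", "doc_type": "journal article", "language": "en"},
]
```

Because every facet is a plain equality test, facets compose freely: filtering by `doc_type` and `language` together is just two predicates ANDed over the same records.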
04. Stage 3: Semantic Indexing With Vector Embeddings
The final stage encodes each extracted text chunk as a dense vector embedding using a multilingual sentence transformer model. These embeddings are stored in a vector database (Pinecone in the production deployment) alongside the extracted metadata and original document references.
At query time, the user's search query is encoded with the same embedding model and the vector database returns the most semantically similar chunks — not the chunks that share the most keywords, but the chunks that are most conceptually related. A search for 'the effect of urban density on traffic congestion' will surface relevant documents even if they use the phrase 'metropolitan population concentration and road network saturation'.
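The retrieval step can be sketched as cosine similarity over stored vectors; this in-memory version is a stand-in for the Pinecone query path, and the toy two-dimensional vectors replace real sentence-transformer embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=2):
    """index: list of (chunk_id, vector) pairs; returns the ids of the
    top_k chunks ranked by similarity to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:top_k]]
```

A query embedding close in direction to a chunk embedding ranks highly regardless of shared keywords, which is exactly what lets paraphrased phrasing match.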
The combination of semantic search for discovery and keyword filtering for precision gives researchers a tool that matches how academic investigation actually works: start broad, narrow with filters, verify with exact text.