01. The Scale Problem: Why Traditional Approaches Fail
The AUC Digital Library ingestion project involved hundreds of thousands of scanned documents: academic papers, archival texts, reports, and multi-language content spanning decades. The naive approach — running a single OCR pass over each PDF and storing the raw text — produces a result that is technically complete but practically useless. Raw OCR output from scanned documents is riddled with noise: garbled characters from degraded scans, merged words from tight kerning, and completely lost structure from tables and multi-column layouts.
More fundamentally, a keyword search index over raw OCR text cannot answer the questions researchers actually ask. 'Find all papers that discuss neural network generalisation' requires semantic understanding of content, not exact string matching. The engineering challenge was to build a pipeline that could extract clean, structured, semantically-rich content from heterogeneous documents at scale — and do it automatically, without manual review.
02. Stage 1: Layout Analysis With Florence-2 and YOLO
Before any text is extracted, the pipeline first segments each document page into semantic regions. A YOLO model fine-tuned on document layout datasets detects bounding boxes for figures, tables, headers, footers, captions, and body text blocks. Florence-2, Microsoft's vision foundation model, handles zero-shot region classification and covers the long tail of unusual layouts the YOLO model had not seen in training.
This segmentation step is the most important in the pipeline. Once the page is decomposed into typed regions, each region can be processed with the appropriate extraction tool: OCR for text regions, table parsing algorithms for tabular data, and image captioning models for figures. Attempting to OCR a table as if it were prose, or to parse a figure caption with a table extractor, produces nonsense. The layout layer ensures each region reaches the right downstream processor.
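The routing logic can be sketched as a dispatch table keyed on region type. The region kinds and handler names below are illustrative stand-ins, not the project's actual API; real handlers would invoke OCR, a table parser, and a captioning model.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """A typed page region produced by the layout-analysis stage."""
    kind: str      # e.g. "text", "table", "figure", "footer"
    bbox: tuple    # (x0, y0, x1, y1) in page coordinates

# Placeholder extractors; each returns a tagged string for illustration.
def extract_text(region):   return f"ocr:{region.bbox}"
def extract_table(region):  return f"table:{region.bbox}"
def caption_figure(region): return f"caption:{region.bbox}"

DISPATCH = {
    "text": extract_text,
    "table": extract_table,
    "figure": caption_figure,
}

def process_page(regions):
    # Route each typed region to its dedicated extractor; region types
    # with no content handler (headers, footers) are simply skipped.
    return [DISPATCH[r.kind](r) for r in regions if r.kind in DISPATCH]
```

The table makes the "right downstream processor" guarantee explicit: adding a new region type means adding one handler, without touching the routing code.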
03. Stage 2: Content Extraction and Metadata Enrichment
Text regions are processed through Tesseract OCR with language-specific models (Arabic and English in the AUC case), with a post-processing step that applies language model-based correction to fix common OCR errors — transposed characters, broken hyphenation, and encoding artifacts common in scanned Arabic text.
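A minimal rule-based cleanup pass illustrates the kind of errors the corrector targets; this is a sketch standing in for the language-model-based step, and Arabic text would need its own confusion rules.

```python
import re

def clean_ocr(text: str) -> str:
    """Minimal rule-based OCR cleanup (illustrative, not the LM corrector)."""
    # Re-join words hyphenated across line breaks: "informa-\ntion" -> "information".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Strip a common encoding artifact: the Unicode replacement character.
    text = text.replace("\ufffd", "")
    # Collapse runs of spaces/tabs left behind by multi-column OCR.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```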
Metadata enrichment runs in parallel: NLP models extract named entities (authors, institutions, dates, locations), classify document type (thesis, journal article, report, government document), identify the primary language, and assign subject tags from a controlled vocabulary. This metadata becomes the faceted filter layer in the front-end search interface — enabling users to filter results by document type, date range, subject area, or author without any additional query rewriting.
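Faceted filtering over the enriched records reduces to exact-match predicates on metadata fields. The field names and sample records below are hypothetical, chosen only to show the shape of the filter layer.

```python
def facet_filter(docs, **facets):
    """Return documents whose metadata matches every requested facet.

    `docs` is a list of metadata dicts like those produced by the
    enrichment stage; field names here are illustrative.
    """
    return [
        d for d in docs
        if all(d.get(field) == value for field, value in facets.items())
    ]

catalog = [
    {"title": "Nile Delta survey", "doc_type": "report", "language": "ar"},
    {"title": "GAN robustness", "doc_type": "journal article", "language": "en"},
]
```

Because every facet is a plain equality test, facets compose freely: filtering by `doc_type` and `language` together is just two predicates ANDed over the same records.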
04. Stage 3: Semantic Indexing With Vector Embeddings
The final stage encodes each extracted text chunk as a dense vector embedding using a multilingual sentence transformer model. These embeddings are stored in a vector database (Pinecone in the production deployment) alongside the extracted metadata and original document references.
At query time, the user's search query is encoded with the same embedding model and the vector database returns the most semantically similar chunks — not the chunks that share the most keywords, but the chunks that are most conceptually related. A search for 'the effect of urban density on traffic congestion' will surface relevant documents even if they use the phrase 'metropolitan population concentration and road network saturation'.
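The retrieval step can be sketched as cosine similarity over stored vectors; this in-memory version is a stand-in for the Pinecone query path, and the toy two-dimensional vectors replace real sentence-transformer embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=2):
    """index: list of (chunk_id, vector) pairs; returns the ids of the
    top_k chunks ranked by similarity to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:top_k]]
```

A query embedding close in direction to a chunk embedding ranks highly regardless of shared keywords, which is exactly what lets paraphrased phrasing match.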
The combination of semantic search for discovery and keyword filtering for precision gives researchers a tool that matches how academic investigation actually works: start broad, narrow with filters, verify with exact text.