01. The Problem With Feeding Charts Directly to LLMs
When you ask a standard vision-language model to answer a question about a bar chart, it does something surprisingly naive: it looks at the image as a whole and tries to pattern-match an answer from visual features. This works reasonably well for photographs of objects, but charts are fundamentally different. A chart is a structured encoding of data — every pixel of a bar represents a precise numerical value, every axis label carries semantic meaning, and the spatial relationships between elements are governed by mathematical rules.
Models like GPT-4V and LLaVA, when prompted directly with a chart image, frequently hallucinate values, misread axis scales, and confuse visually similar bars. Benchmark results confirm this: even frontier 70B+ parameter models score below 85% on the ChartQA benchmark, a dataset of real-world chart question-answering tasks. The failure mode is consistent — these models are excellent at visual recognition but weak at visual data reasoning.
02. The Table-First Guarded Reasoning Pipeline
The architecture developed for the IEEE CAI 2026 paper — Table-First Guarded Reasoning — addresses this by inserting a structured extraction layer between the raw image and the language model. The pipeline works in three stages.
First, an OCR pass combined with a layout detection model (built on Qwen2.5-VL) extracts the raw numerical and categorical data from the chart and reconstructs it as a Markdown table. This is the 'table-first' step — converting visual data back into the structured form it was originally encoded from.
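Once the extraction stage has produced a Markdown table, the rest of the pipeline operates on structured rows rather than pixels. As a minimal sketch, here is one way to parse such a table into row dictionaries; the function name and the exact table format are illustrative assumptions, not the paper's actual API:

```python
def parse_markdown_table(md: str) -> list[dict]:
    """Parse a pipe-delimited Markdown table into a list of row dicts."""
    lines = [l.strip() for l in md.strip().splitlines() if l.strip()]
    # Split each line into cells, dropping the outer pipes
    rows = [[c.strip() for c in l.strip("|").split("|")] for l in lines]
    header = rows[0]
    # rows[1] is the |---|---| separator; zip the remaining rows with the header
    return [dict(zip(header, r)) for r in rows[2:]]

table_md = """
| Country | Exports |
|---------|---------|
| France  | 42.5    |
| Brazil  | 17.3    |
"""
rows = parse_markdown_table(table_md)
# rows[0] → {'Country': 'France', 'Exports': '42.5'}
```

In practice the extraction model's output would need more defensive handling (merged cells, missing separators), but the structured form above is what the downstream reasoning stages consume.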
Second, a Zero-Shot Chain-of-Thought prompt is constructed that presents the extracted table to the language model alongside the original question. The model reasons step-by-step over structured text rather than over pixels — a task it is far better suited for.
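A zero-shot CoT prompt of this kind can be assembled with simple string templating. The exact wording below is an assumption for illustration; the paper's prompt may differ, but the classic "Let's think step by step" trigger is the standard zero-shot CoT formulation:

```python
def build_cot_prompt(table_md: str, question: str) -> str:
    """Assemble a zero-shot chain-of-thought prompt from an
    extracted table and the original chart question."""
    return (
        "The following table was extracted from a chart:\n\n"
        f"{table_md}\n\n"
        f"Question: {question}\n"
        "Let's think step by step."  # zero-shot CoT trigger phrase
    )

prompt = build_cot_prompt(
    "| Year | Sales |\n|------|-------|\n| 2023 | 120 |",
    "What were sales in 2023?",
)
```

The resulting string is sent to the language model as plain text, so the model never has to reason over the chart image itself at this stage.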
Third, a 'guarded' validation layer checks the model's answer against the range of values present in the extracted table. If the answer falls outside plausible bounds, the pipeline triggers a second reasoning pass with an explicit correction prompt. This catches the residual hallucinations that survive the first pass.
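The guard can be sketched as a range check over the numbers in the extracted table. The function names, the tolerance margin, and the correction-prompt wording below are all illustrative assumptions rather than the paper's exact implementation:

```python
import re

def extract_numbers(text: str) -> list[float]:
    """Pull all numeric literals out of a string."""
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", text)]

def guard(answer: str, table_md: str, tolerance: float = 0.1):
    """Check a numeric answer against the table's value range.

    Returns (True, None) if the answer is plausible, otherwise
    (False, correction_prompt) to trigger a second reasoning pass.
    """
    nums = extract_numbers(answer)
    table_vals = extract_numbers(table_md)
    if not nums or not table_vals:
        return True, None  # nothing numeric to validate
    lo, hi = min(table_vals), max(table_vals)
    margin = (hi - lo) * tolerance  # allow small derived values near the edges
    if all(lo - margin <= n <= hi + margin for n in nums):
        return True, None
    return False, (
        f"Your previous answer {answer!r} contains values outside the "
        f"table's range [{lo}, {hi}]. Re-answer using only table values."
    )
```

A usage example: for a table whose values span 10 to 20, `guard("15", table)` passes, while `guard("500", table)` fails and returns a correction prompt for the second pass. A real guard would also need to exempt legitimately out-of-range derived quantities such as sums or percentage changes.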
03. Results and What They Mean
On the ChartQA benchmark, this pipeline with Qwen2.5-VL-7B as the base model achieved 93.96% accuracy — surpassing published results from models as large as 72B parameters, including LLaVA-1.6-34B and InternVL2-40B, both of which scored below 90% on the same benchmark.
The practical implication for businesses is significant: you do not need to deploy an enormous, expensive model to get research-grade accuracy on visual data tasks. A well-engineered preprocessing pipeline combined with a mid-size model can outperform brute-force scaling. This is the core Zenith Labs philosophy applied to research — architecture beats raw compute.
04. Where This Technology Applies
Any business that works with charts, dashboards, reports, or figures in PDF form is a candidate for this technology. Financial institutions can automate the extraction of data from analyst reports. Consulting firms can process client-submitted slide decks at scale. Research organisations can index the figures in thousands of academic papers. The underlying insight — that structured extraction before reasoning dramatically improves accuracy — applies broadly across document intelligence use cases.