01. The Problem With Feeding Charts Directly to LLMs
When you ask a standard vision-language model to answer a question about a bar chart, it does something surprisingly naive: it looks at the image as a whole and tries to pattern-match an answer from visual features. This works reasonably well for photographs of objects, but charts are fundamentally different. A chart is a structured encoding of data — every pixel of a bar represents a precise numerical value, every axis label carries semantic meaning, and the spatial relationships between elements are governed by mathematical rules.
Models like GPT-4V and LLaVA, when prompted directly with a chart image, frequently hallucinate values, misread axis scales, and confuse visually similar bars. Benchmark results confirm this: even frontier 70B+ parameter models score below 85% on the ChartQA benchmark, a dataset of real-world chart question-answering tasks. The failure mode is consistent — these models are excellent at visual recognition but weak at visual data reasoning.
02. The Table-First Guarded Reasoning Pipeline
The architecture developed for the IEEE CAI 2026 paper — Table-First Guarded Reasoning — addresses this by inserting a structured extraction layer between the raw image and the language model. The pipeline works in three stages.
First, an OCR pass combined with a layout detection model (built on Qwen2.5-VL) extracts the raw numerical and categorical data from the chart and reconstructs it as a Markdown table. This is the 'table-first' step — converting visual data back into the structured form it was originally encoded from.
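Once the extraction stage has produced a Markdown table, the rest of the pipeline operates on structured rows rather than pixels. As a minimal sketch, here is one way to parse such a table into row dictionaries; the function name and the exact table format are illustrative assumptions, not the paper's actual API:

```python
def parse_markdown_table(md: str) -> list[dict]:
    """Parse a pipe-delimited Markdown table into a list of row dicts."""
    lines = [l.strip() for l in md.strip().splitlines() if l.strip()]
    # Split each line into cells, dropping the outer pipes
    rows = [[c.strip() for c in l.strip("|").split("|")] for l in lines]
    header = rows[0]
    # rows[1] is the |---|---| separator; zip the remaining rows with the header
    return [dict(zip(header, r)) for r in rows[2:]]

table_md = """
| Country | Exports |
|---------|---------|
| France  | 42.5    |
| Brazil  | 17.3    |
"""
rows = parse_markdown_table(table_md)
# rows[0] → {'Country': 'France', 'Exports': '42.5'}
```

In practice the extraction model's output would need more defensive handling (merged cells, missing separators), but the structured form above is what the downstream reasoning stages consume.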
Second, a Zero-Shot Chain-of-Thought prompt is constructed that presents the extracted table to the language model alongside the original question. The model reasons step-by-step over structured text rather than over pixels — a task it is far better suited for.
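A zero-shot CoT prompt of this kind can be assembled with simple string templating. The exact wording below is an assumption for illustration; the paper's prompt may differ, but the classic "Let's think step by step" trigger is the standard zero-shot CoT formulation:

```python
def build_cot_prompt(table_md: str, question: str) -> str:
    """Assemble a zero-shot chain-of-thought prompt from an
    extracted table and the original chart question."""
    return (
        "The following table was extracted from a chart:\n\n"
        f"{table_md}\n\n"
        f"Question: {question}\n"
        "Let's think step by step."  # zero-shot CoT trigger phrase
    )

prompt = build_cot_prompt(
    "| Year | Sales |\n|------|-------|\n| 2023 | 120 |",
    "What were sales in 2023?",
)
```

The resulting string is sent to the language model as plain text, so the model never has to reason over the chart image itself at this stage.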
Third, a 'guarded' validation layer checks the model's answer against the range of values present in the extracted table. If the answer falls outside plausible bounds, the pipeline triggers a second reasoning pass with an explicit correction prompt. This catches the residual hallucinations that survive the first pass.
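The guard can be sketched as a range check over the numbers in the extracted table. The function names, the tolerance margin, and the correction-prompt wording below are all illustrative assumptions rather than the paper's exact implementation:

```python
import re

def extract_numbers(text: str) -> list[float]:
    """Pull all numeric literals out of a string."""
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", text)]

def guard(answer: str, table_md: str, tolerance: float = 0.1):
    """Check a numeric answer against the table's value range.

    Returns (True, None) if the answer is plausible, otherwise
    (False, correction_prompt) to trigger a second reasoning pass.
    """
    nums = extract_numbers(answer)
    table_vals = extract_numbers(table_md)
    if not nums or not table_vals:
        return True, None  # nothing numeric to validate
    lo, hi = min(table_vals), max(table_vals)
    margin = (hi - lo) * tolerance  # allow small derived values near the edges
    if all(lo - margin <= n <= hi + margin for n in nums):
        return True, None
    return False, (
        f"Your previous answer {answer!r} contains values outside the "
        f"table's range [{lo}, {hi}]. Re-answer using only table values."
    )
```

A usage example: for a table whose values span 10 to 20, `guard("15", table)` passes, while `guard("500", table)` fails and returns a correction prompt for the second pass. A real guard would also need to exempt legitimately out-of-range derived quantities such as sums or percentage changes.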
03. Results and What They Mean
On the ChartQA benchmark, this pipeline with Qwen2.5-VL-7B as the base model achieved 93.96% accuracy — surpassing published results from models as large as 72B parameters, including LLaVA-1.6-34B and InternVL2-40B, both of which scored below 90% on the same benchmark.
The practical implication for businesses is significant: you do not need to deploy an enormous, expensive model to get research-grade accuracy on visual data tasks. A well-engineered preprocessing pipeline combined with a mid-size model can outperform brute-force scaling. This is the core Zenith Labs philosophy applied to research — architecture beats raw compute.
04. Where This Technology Applies
Any business that works with charts, dashboards, reports, or figures in PDF form is a candidate for this technology. Financial institutions can automate the extraction of data from analyst reports. Consulting firms can process client-submitted slide decks at scale. Research organisations can index the figures in thousands of academic papers. The underlying insight — that structured extraction before reasoning dramatically improves accuracy — applies broadly across document intelligence use cases.