Due to limited time, I cannot keep track of every new paper promptly. Please feel free to submit a Pull Request to add your papers, or open an Issue to remind me, and I will add them as soon as possible. Let's maintain this paper list collaboratively. 🤝
Different types of tables are widely used to store and present information. To automatically process numerous tables and gain valuable insights, researchers have proposed a series of deep-learning models for various table-based tasks, e.g., table question answering (TQA), table-to-text (T2T), text-to-SQL (NL2SQL) and table fact verification (TFV). Recently, emerging Large Language Models (LLMs) and even more powerful Multimodal Large Language Models (MLLMs) have opened up new possibilities for processing tabular data: a single general model can process diverse tables and fulfill different tabular tasks based on users' natural-language instructions. We refer to LLMs specialized for tabular tasks as Tabular LLMs. In this repository, we collect papers on recent Tabular (M)LLMs and divide them into the following categories based on their key ideas.
Table of Contents:
- Survey of Tabular LLMs and table understanding
- Prompting LLMs for different tabular tasks, e.g., in-context learning, prompt engineering and integrating external tools.
- Training LLMs for better table understanding ability, e.g., training existing LLMs by instruction fine-tuning or post-pretraining.
- Developing Agents for tabular data, e.g., developing a copilot for processing Excel tables.
- RAG with tabular data, e.g., developing RAG systems for understanding long tables.
- Empirical studies evaluating LLMs' table understanding ability, e.g., exploring the influence of various table types or table formats.
- Multimodal table understanding, e.g., training MLLMs to understand diverse table images and textual user requests.
- Table Understanding datasets and benchmarks, e.g., valuable datasets and benchmarks for model training and evaluation.
- Evaluation Metrics for Table Understanding, e.g., devising better evaluation methods for table understanding.
Task Names and Abbreviations:
| Task Names | Abbreviations | Task Descriptions |
|---|---|---|
| Table Question Answering | TQA | Answering questions based on the table(s), e.g., answering look-up or computation questions about table(s). |
| Table-to-Text | Table2Text or T2T | Generating text based on the table(s), e.g., generating an analysis report given a financial statement. |
| Text-to-Table | Text2Table | Generating structured tables based on input text, e.g., generating a statistical table based on a game summary. |
| Table Fact Verification | TFV | Judging whether a statement is true, false, or unsupported (not enough evidence) based on the table(s). |
| Text-to-SQL | NL2SQL | Generating a SQL statement to answer the user's question based on the database schema. |
| Tabular Mathematical Reasoning | TMR | Solving mathematical reasoning problems based on the table(s), e.g., solving math word problems related to a table. |
| Table-and-Text Question Answering | TAT-QA | Answering questions based on both table(s) and their related texts, e.g., answering questions given Wikipedia tables and their surrounding texts. |
| Table Interpretation | TI | Interpreting basic table content and structure information, e.g., column type annotation, entity linking, relation extraction, cell type classification, etc. |
| Table Augmentation | TA | Augmenting existing tables with new data, e.g., schema augmentation, row population, etc. |
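As an illustration of the "one general model, many tabular tasks" idea above, the sketch below shows one common way a table can be linearized into markdown and paired with a natural-language instruction to form a TQA prompt. This is a minimal, generic sketch, not the method of any specific paper in this list; the function names and prompt wording are our own.

```python
# A minimal sketch of table linearization for prompting a general-purpose LLM.
# Markdown is only one of several common serialization formats (others include
# CSV, HTML, and JSON); function names here are illustrative.

def table_to_markdown(header, rows):
    """Serialize a table (header + rows) into a markdown table string."""
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)

def build_tqa_prompt(header, rows, question):
    """Combine the linearized table with the user question into a TQA prompt."""
    return ("Answer the question based on the table below.\n\n"
            f"{table_to_markdown(header, rows)}\n\n"
            f"Question: {question}\nAnswer:")

header = ["Player", "Points"]
rows = [["Alice", 31], ["Bob", 27]]
print(build_tqa_prompt(header, rows, "Who scored the most points?"))
```

The same prompt skeleton generalizes to the other tasks in the table above (T2T, TFV, NL2SQL, etc.) by swapping the instruction and the expected output format.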
| Title | Source | Date | Pages |
|---|---|---|---|
| Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation | arxiv | 2025-10-28 | 25 |
| Toward Real-World Table Agents: Capabilities, Workflows, and Design Principles for LLM-based Table Intelligence | arxiv | 2025-07-14 | 34 |
| Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution | arxiv | 2024-08-20 | 49 |
| Large Language Model for Table Processing: A Survey | arxiv | 2024-02-04 | 9 |
| A Survey of Table Reasoning with Large Language Models | arxiv | 2024-02-13 | 9 |
| Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding -- A Survey | arxiv | 2024-03-01 | 41 |
| Transformers for Tabular Data Representation: A Survey of Models and Applications | TACL 2023 | | 23 |
| Table Pre-training: A Survey on Model Architectures, Pre-training Objectives, and Downstream Tasks | IJCAI 2022 | 2022-01-24 | 15 |
| Title | Source | Date | Task | Code |
|---|---|---|---|---|
| Structural Deep Encoding for Table Question Answering | ACL 2025 Findings | 2025-03-03 | TQA (WTQ, WikiSQL) | |
| HYTREL: Hypergraph-enhanced Tabular Data Representation Learning | NeurIPS 2023 | 2023-07-14 | TA, TI | Github |
| FLAME: A small language model for spreadsheet formulas | AAAI 2024 | 2023-01-31 | Generating Excel Formulas | Github |
| Title | Source | Date | Task | Code |
|---|---|---|---|---|
| TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning | EMNLP 2025 | 2025-06-12 | TQA | Github |
| HD-RAG: Retrieval-Augmented Generation for Hybrid Documents Containing Text and Hierarchical Tables | arxiv | 2025-04-13 | TQA | |
| GTR: Graph-Table-RAG for Cross-Table Question Answering | arxiv | 2025-04-02 | Cross-table Question Answering | |
| TableRAG: Million-Token Table Understanding with Language Models | NeurIPS 2024 | 2024-10-07 | TQA for extremely long tables | |
| Evaluation of Table Representations to Answer Questions from Tables in Documents: A Case Study using 3GPP Specifications | arxiv | 2024-08-30 | How to represent tables for better retrieval within RAG systems | |
| THoRR: Complex Table Retrieval and Refinement for RAG | IR-RAG 2024 workshop | | RAG with large and complex tables | |
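The papers above share a common pattern: rather than feeding an entire long table to the LLM, a retrieval step first selects the table fragments most relevant to the question. The toy sketch below illustrates that step with simple token-overlap scoring over rows; real systems use learned dense retrievers, but the overall flow is the same. All names and the scoring scheme here are illustrative assumptions, not any paper's actual method.

```python
# A toy sketch of the retrieval step in table RAG: score each table row by
# token overlap with the question and keep only the top-k rows for the prompt.
# Assumptions: regex word tokenization, set-overlap scoring (illustrative only).
import re

def tokens(text):
    """Lowercase word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve_rows(rows, question, k=2):
    """Return the k rows with the highest token overlap with the question."""
    q = tokens(question)
    return sorted(rows,
                  key=lambda row: len(q & tokens(" ".join(row))),
                  reverse=True)[:k]

rows = [
    ["2019", "Revenue", "1.2B"],
    ["2020", "Revenue", "1.5B"],
    ["2020", "Net income", "0.3B"],
]
print(retrieve_rows(rows, "What was the revenue in 2020?", k=1))
```

Only the retrieved rows are then linearized into the prompt, which is what lets these systems scale to tables far beyond the model's context window.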
| Title | Source | Date | Task | Data Volume | Domain | Table Type | Data and Code |
|---|---|---|---|---|---|---|---|
| TabReX : Tabular Referenceless eXplainable Evaluation | arxiv | 2025-12-17 | referenceless evaluation for generated tables | 710 source tables and 9,120 perturbed instances (12 perturbations/table) | Multi-domain (finance, healthcare/clinical, sports, open-domain narrative, hierarchical tables) | Flat and hierarchical tables | Github |
| RUST-BENCH: Benchmarking LLM Reasoning on Unstructured Text within Structured Tables | arxiv | 2025-11-06 | Reasoning questions | 7,966 questions and 2,031 tables | Scientific and Sports | Flat and complex tables | Github |
| UniDataBench: Evaluating Data Analytics Agents Across Structured and Unstructured Data | arxiv | 2025-11-03 | Data analysis | 100 analytical tasks over 223 data files | user behavior, sales, business, and so on | CSV, database, TXT, NoSQL | |
| MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark | NeurIPS 2025 | 2025-06-05 | 25 tabular tasks | 28,136 questions and 61,763 tables | Web tables, spreadsheets and database tables | Flat and complex tables | Github |
| WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts | ACL 2025 Findings | 2025-06-18 | QA over tables and charts | 1,000 multiple-choice questions | diverse domains like Economy, Geography, History, Politics, Science, Sport | | Github |
| TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation | ACL 2025 Findings | 2025-05-28 | evaluate generated tables | 50 reference tables + 250 perturbed tables (5 perturbations/table, 16 error types) | Multi-domain (finance, sports, knowledge-base / open-domain) | Flat tables with structural perturbations | Github |
| Are Large Language Models Ready for Multi-Turn Tabular Data Analysis? | ICML | 2025-05-01 | Multi-turn data analysis | | 5 common domains such as ATP Tennis and Credit Card | flat tables supporting Pandas operations | Github |
| GRI-QA: a Comprehensive Benchmark for Table Question Answering over Environmental Data | ACL 2025 Findings | | TQA | 4,089 questions, 204 tables | environmental | flat and hierarchical tables | Github |
| 2Columns1Row: A Russian Benchmark for Textual and Multimodal Table Understanding and Reasoning | EMNLP 2025 Findings | | Textual and multimodal TQA in Russian | 28,800 instances | | | |
| NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables | NeurIPS 2025 | 2025-04-09 | Cell lookup and locating | 750 tables and 287K test cases | Wikipedia, statistic reports, and annual reports of airline companies | Flat, hierarchical, horizontal | Github |
| LongTableBench: Benchmarking Long-Context Table Reasoning across Real-World Formats and Domains | EMNLP 2025 Findings | | Long-table QA | 5,950 QA instances spanning 7 table formats, with input lengths up to 128K tokens, including multi-turn and multi-table settings | 18 domains | flat tables | Github |
| Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers | EMNLP 2025 Findings | 2025-06-12 | scientific table-based claim verification | 372 samples | scientific | flat | Github |
| SportReason: Evaluating Retrieval-Augmented Reasoning across Tables and Text for Sports Question Answering | EMNLP 2025 | - | RAG over table and text data for Sports QA | 3,000 QA pairs | Sports | flat table | Github |
| T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables | EMNLP 2025 | 2025-08-27 | Table2Reports | 457 real-world industrial tables | 19 industry domains | four table types | Github |
| MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space | arxiv | 2025-06-13 | Multi-tabular reasoning | 3,745 complex question-answer pairs | | | HuggingFace |
| TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models | arxiv | 2025-06-23 | 26 table-related tasks such as data analysis | 7,790 samples | |||
| TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering | EMNLP 2025 | 2025-06-11 | Data Analysis, Information Retrieval, Numerical Analysis | 617 tables and 2,325 QA pairs | financial reports, industry/stock research reports, academic papers and government reports | Flat, hierarchical and complex tables | Github |
| RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis | ACL 2025 | 2025-06-19 | Table analysis over complex tables | 708 tables, 3,752 QA pairs | 24 domains like economy, society, science | complex tables in image and textual format | Github |
| Automated Text-to-Table for Reasoning-Intensive Table QA: Pipeline Design and Benchmarking Insights | arxiv | 2025-05-26 | Text2Table | | | | Github |
| MULTITAT: Benchmarking Multilingual Table-and-Text Question Answering | EMNLP 2025 | 2025-02-24 | Multilingual table-and-text question answering | 250 samples | | | Github |
| MT-RAIG: Novel Benchmark and Evaluation Framework for Retrieval-Augmented Insight Generation over Multiple Tables | ACL 2025 | 2025-02-17 | Insight generation over multiple tables | 19,563 tables and 18,532 questions | Tables from SPIDER and Wikipedia | Flat tables | Github |
| TransientTables: Evaluating LLMs' Reasoning on Temporally Evolving Semi-structured Tables | arxiv | 2025-04-02 | TQA over temporally evolving semi-structured tables | 3,971 questions, 14,000 tables | Wikipedia | Infobox tables | Github |
| SCITAT: A Question Answering Benchmark for Scientific Tables and Text Covering Diverse Reasoning Types | ACL 2025 Findings | 2024-12-16 | lookup, numerical reasoning, analysis and tabulation | 953 samples | | | Github |
| MMQA: Evaluating LLMs with Multi-Table Multi-Hop Complex Questions | ICLR 2025 | - | Multi-table retrieval, NL2SQL, Multi-table QA, and Key Selection (primary key and foreign key) | 3,312 tables | Wikipedia | Flat tables | |
| SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation | NeurIPS 2024 | 2024-06-21 | Spreadsheet manipulation | 2,729 spreadsheets, 912 instructions | Excel forums & blogs | Flat tables, hierarchical tables, multi-tables | Github |
| MiMoTable: A Multi-scale Spreadsheet Benchmark with Meta Operations for Table Reasoning | COLING 2024 | 2024-12-16 | TQA,T2T,Table manipulation, Data analysis | 1,719 (spreadsheet, question, answer) triplets from 428 different spreadsheets | Multiple domains | Flat and hierarchical tables | Github |
| DocTabQA: Answering Questions from Long Documents Using Tables | arxiv | 2024-08-21 | Table Generation based on question and document | 300 documents and 1.5k question-table pairs | Financial | Flat tables and hierarchical tables | Github |
| Title | Source | Date | Task | Data Volume | Domain | Table Type | Data and Code |
|---|---|---|---|---|---|---|---|
| ENTRANT: A Large Financial Dataset for Table Understanding | Sci Data | 2024-07-04 | Cell type classification, header extraction, etc. | Millions of tables with cell attributes, as well as positional and hierarchical information | Financial | Flat tables and hierarchical tables | Github |
| TableBench: A Comprehensive and Complex Benchmark for Table Question Answering | arxiv | 2024-08-17 | TMR, TFV, trend forecasting and chart generation | 3,681 tables and 20K samples | Tables collected from academic datasets like WTQ and FeTaQA | Flat tables and a small number of hierarchical tables | Github |
| Title | Source | Date | Task | Code |
|---|---|---|---|---|
| Revisiting Automated Evaluation for Long-form Table Question Answering in the Era of Large Language Models | EMNLP 2024 | | TQA | |
| Is This a Bad Table? A Closer Look at the Evaluation of Table Generation from Text | EMNLP 2024 | 2024-06-21 | Text2Table | |
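For context on what these evaluation papers improve upon: short-answer TQA is traditionally scored with normalized exact match, which is exactly the brittle baseline that motivates LLM-era evaluation methods. The sketch below shows a typical normalization pipeline (lowercasing, article and punctuation stripping); it is a generic illustration of standard practice, not the metric of either paper above.

```python
# A minimal sketch of normalized exact-match scoring, the traditional
# TQA metric that the evaluation papers in this section aim to improve on.
# The normalization steps shown (case, articles, punctuation, whitespace)
# are the common ones; exact details vary across benchmarks.
import re
import string

def normalize(text):
    """Lowercase, drop English articles and punctuation, collapse whitespace."""
    text = text.lower().strip()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1 if the normalized prediction equals the normalized gold answer, else 0."""
    return int(normalize(prediction) == normalize(gold))

print(exact_match("The Eiffel Tower.", "eiffel tower"))  # 1
print(exact_match("Paris", "London"))                    # 0
```

Such string matching fails on semantically equivalent long-form answers (e.g., "1.5 billion" vs. "1.5B"), which is why recent work explores model-based evaluation instead.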