[Figure 1 content: GPT-3.5 (1-shot), with BLEU 34.1 / ROUGE 47.4, gives a factually correct description of the Audi 4.2 quattro target cells, while the fine-tuned model, with BLEU 66.1 / ROUGE 74.2, gives an incorrect one. The lower panel recasts the example as a single-choice question ("Based on the table, what information can you get about V8?") with a correct option A and a deceptive option B.]
Figure 1: Above: A simplified table-to-text generation example illustrating the unreliable evaluation issue. Higher values on surface-level metrics like BLEU and ROUGE do not guarantee better results. Target cells are highlighted. Below: Our benchmark presented in a single-choice format.
tables is crucial for a wide array of real-world applications, including financial analysis, scientific research, etc. Recently, the remarkable advancements of Large Language Models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023a; Touvron et al., 2023; Google, 2023) have transformed the approach to information retrieval, moving from fetching specific passages to directly providing answers. However, the effectiveness of LLMs in seeking information from tables remains underexplored.
Some efforts have been made to evaluate the capabilities of LLMs in table information seeking (TIS), but the evaluation metrics used raise reliability concerns. Previous studies (Zhao et al., 2023b) mainly use table-to-text generation (TTG) as a test bench to assess the TIS abilities of LLMs. TTG aims at transforming complex tabular data into comprehensible descriptions tailored to users' information-seeking needs. The evaluation relies heavily on surface-level metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), or on metrics based on model predictions such as NLI-Acc (Chen et al., 2020a). Given that LLM responses can differ greatly in style from reference answers, using these metrics can lead to inconsistent and unreliable evaluations. An example of this issue is illustrated in Figure 1, where a fine-tuned model's incorrect description receives higher BLEU/ROUGE scores than the correct output from GPT-3.5. This discrepancy may occur because GPT-3.5, without being fine-tuned on this specific dataset, might not mimic the style of the reference response.
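To make this failure mode concrete, the sketch below (not the paper's evaluation code) shows how such surface-level scores could be computed with the sacrebleu and rouge-score packages, using the two candidate descriptions from Figure 1. The reference string is an illustrative stand-in, since the actual dataset reference is not shown in the figure.

```python
# Minimal sketch of surface-level evaluation, assuming the `sacrebleu` and
# `rouge-score` packages. The reference string below is an illustrative
# stand-in (the actual dataset reference is not shown in Figure 1).
import sacrebleu
from rouge_score import rouge_scorer

reference = ("Audi's 4.2 quattro (4172 cc) has 265 kilowatts (355 hp) "
             "and 430 newton metres (317 lb.ft).")

candidates = {
    # Factually correct, but phrased very differently from the reference.
    "GPT-3.5 (correct)": (
        "In the Audi A8, the V8 variant has a 4.2 quattro engine with a "
        "displacement of 4172 cc, power of 360 PS (265 kW; 355 hp), and "
        "torque of 430 N.m (317 lbf.ft)."
    ),
    # Mimics the reference style but states the wrong power figures.
    "fine-tuned (incorrect)": (
        "Audi V8's 4.2 quattro (4172 cc) was developed in 1999, with "
        "309 kW (414 hp) and 430 newton metres (317 lb.ft)."
    ),
}

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for name, cand in candidates.items():
    bleu = sacrebleu.sentence_bleu(cand, [reference]).score
    rouge_l = scorer.score(reference, cand)["rougeL"].fmeasure
    # A candidate that copies the reference's phrasing can outscore a
    # factually correct candidate that is worded differently.
    print(f"{name}: BLEU={bleu:.1f}  ROUGE-L={rouge_l:.2f}")
```

With choices like these, the stylistically similar but incorrect candidate tends to receive the higher surface score, which is exactly the mismatch the single-choice format is designed to avoid.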
To provide a more reliable evaluation, this paper introduces a new benchmark for Table Information Seeking (TabIS). We design our benchmark using a single-choice question format, motivated by popular benchmarks like MMLU (Hendrycks et al., 2020) and BBH (Suzgun et al., 2022), which utilize this format to offer a reliable and widely accepted evaluation of LLMs. We convert TTG datasets like ToTTo (Parikh et al., 2020) and Hitab (Cheng et al., 2022) into this format so that the results can be evaluated simply and reliably. A challenge in curating this benchmark is generating high-quality options for the single-choice questions. The original data's answer naturally serves as the correct option, so the key task is to generate a deceptive wrong option. If the generated option is too simple, e.g. containing obvious logical errors or content unrelated to the table, the benchmark becomes too easy and fails to test LLMs' capabilities. To address this, we devised three prompting-based methods for generating wrong options: Modify-Input, Modify-Output, and Exam-Judge (detailed in Section 2.1). Together, these methods produced a variety of deceptive options. The manually verified accuracy rate of our generated data exceeds 92%. We also noted that the Exam-Judge method we proposed generated more challenging questions, which may be useful for future dataset construction.
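As a rough illustration of the conversion, the sketch below shows what a converted single-choice item might look like and how a prompt could be assembled from it. The field names, table rendering, and prompt wording are assumptions for illustration only, not the benchmark's actual schema (see Section 2).

```python
# Sketch of a TTG example converted into a single-choice TabIS item.
# Field names, the table rendering, and the prompt template are illustrative
# assumptions, not the benchmark's actual schema or prompts.
import random

item = {
    "table": ("Model: 4.2 quattro | Displacement: 4172 cc | "
              "Power: 265 kW (355 hp) | Torque: 430 N.m (317 lb.ft)"),
    "question": "Based on the table, what information can you get about V8?",
    # The original TTG answer serves as the correct option.
    "correct": ("Audi's 4.2 quattro (4172 cc) has 265 kilowatts (355 hp) "
                "and 430 newton metres (317 lb.ft)."),
    # A deceptive wrong option produced by one of the prompting-based methods.
    "distractor": ("Audi V8's 4.2 quattro (4172 cc) was developed in 1999, "
                   "with 309 kW (414 hp) and 430 newton metres (317 lb.ft)."),
}

def build_prompt(item: dict, rng: random.Random) -> tuple[str, str]:
    """Shuffle the two options and return (prompt, gold letter)."""
    options = [("correct", item["correct"]), ("distractor", item["distractor"])]
    rng.shuffle(options)
    lines = [item["table"], item["question"]]
    gold = ""
    for letter, (kind, text) in zip("AB", options):
        lines.append(f"{letter}. {text}")
        if kind == "correct":
            gold = letter
    lines.append("Answer with A or B.")
    return "\n".join(lines), gold

prompt, gold = build_prompt(item, random.Random(0))
print(prompt)
print("gold:", gold)
```

Accuracy over such items reduces to exact match on the chosen letter, with 50% as the random-guess baseline.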
Leveraging the high-quality options, TabIS encompasses three practical scenarios with increasing difficulty for table information seeking: (1) basic TIS derived from TTG (B-TIS); (2) TIS that emphasizes structural understanding (SU-TIS), i.e. when directed to a specific table area with position information (row and column); and (3) TIS from multiple tables (M-TIS), i.e. when confronted with additional pseudo-relevant tables. These scenarios reflect common challenges in real-world applications, such as chatbots and retrieval-augmented systems.
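To sketch the distinction between the three settings, the toy prompt builders below follow the descriptions above; the exact prompt construction is specified in Section 2, so the wording here is purely illustrative.

```python
# Toy prompt builders for the three TabIS settings described above. The exact
# prompts used by the benchmark are defined in Section 2; the wording here is
# an illustrative assumption.

def b_tis_prompt(table: str, question: str, options: str) -> str:
    # B-TIS: a single table plus a question about the target content.
    return f"{table}\n{question}\n{options}"

def su_tis_prompt(table: str, row: int, col: int, options: str) -> str:
    # SU-TIS: the target area is identified only by its position in the table.
    question = f"What information does the cell at row {row}, column {col} convey?"
    return f"{table}\n{question}\n{options}"

def m_tis_prompt(tables: list[str], question: str, options: str) -> str:
    # M-TIS: pseudo-relevant tables are presented alongside the relevant one.
    return "\n\n".join(tables) + f"\n{question}\n{options}"
```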
While previous studies (Zhao et al., 2023b) that test on the basic TIS setting with unreliable metrics demonstrate the superiority of LLMs, TabIS reveals the limitations and potential challenges of LLMs in table information seeking, as follows.

• Most LLMs show suboptimal TIS performance, especially in complex TIS scenarios and when handling tables with rich hierarchies. Experiments on 12 representative LLMs show that only GPT-4-turbo attained 85.7% accuracy on average (a random guess would yield 50% accuracy). The top-performing 70B open-source model achieved 74.4%, with the rest falling in the 50-60% range.

• LLMs exhibit a poor understanding of table structures, with accuracy fluctuating across different cell positions. Surprisingly, we find that LLMs perform almost at random levels in basic lookup tasks, such as repeating the content in a specific row. This highlights the substantial challenges in real-world SU-TIS scenarios, where models struggle to pinpoint the target table area using only positional cues.

• LLMs struggle to balance TIS performance and robustness against pseudo-relevant tables, especially open-source models. This indicates a great challenge for LLMs in retrieval-augmented generation scenarios.

Finally, we fine-tune Llama2-13b-chat on our weakly-supervised training dataset and find that while fine-tuning can significantly improve TIS performance, boosting accuracy from 55.5 to 73.2, it still lags behind GPT-4-turbo, which has not been specifically fine-tuned. This indicates that the proposed benchmark is non-trivial, calling for further investigation and improvement in this field.

2 TabIS Benchmark

We curated the TabIS benchmark to investigate the table information seeking capabilities of LLMs. We use table-to-text generation (TTG) datasets as the original data source in our benchmark. The