[Figure 1 content: GPT-3.5 (1-shot), with BLEU 34.1 / ROUGE 47.4, gives a factually correct description of the Audi 4.2 quattro target cells, while the fine-tuned model, with BLEU 66.1 / ROUGE 74.2, gives an incorrect one. The lower panel recasts the example as a single-choice question ("Based on the table, what information can you get about V8?") with a correct option A and a deceptive option B.]
Figure 1: Above: A simplified table-to-text generation example illustrating the unreliable evaluation issue. Higher values on surface-level metrics like BLEU and ROUGE do not guarantee better results. Target cells are highlighted. Below: Our benchmark presented in a single-choice format.
tables is crucial for a wide array of real-world applications, including financial analysis, scientific research, etc. Recently, the remarkable advancements of Large Language Models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023a; Touvron et al., 2023; Google, 2023) have transformed the approach to information retrieval, moving from fetching specific passages to directly providing answers. However, the effectiveness of LLMs in seeking information from tables remains underexplored.
Some efforts have been made to evaluate the capabilities of LLMs in table information seeking (TIS), but the evaluation metrics used raise reliability concerns. Previous studies (Zhao et al., 2023b) mainly use table-to-text generation (TTG) as a test bench to assess the TIS abilities of LLMs. TTG aims at transforming complex tabular data into comprehensible descriptions tailored to users' information-seeking needs. The evaluation relies heavily on surface-level metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), or on metrics based on model predictions such as NLI-Acc (Chen et al., 2020a). Given that LLM responses can differ greatly in style from reference answers, using these metrics can lead to inconsistent and unreliable evaluations. An example of this issue is illustrated in Figure 1, where a fine-tuned model's incorrect description receives higher BLEU/ROUGE scores than the correct output from GPT-3.5. This discrepancy may occur because GPT-3.5, without being fine-tuned on this specific dataset, might not mimic the style of the reference response.
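To make this failure mode concrete, the sketch below (not the paper's evaluation code) shows how such surface-level scores could be computed with the sacrebleu and rouge-score packages, using the two candidate descriptions from Figure 1. The reference string is an illustrative stand-in, since the actual dataset reference is not shown in the figure.

```python
# Minimal sketch of surface-level evaluation, assuming the `sacrebleu` and
# `rouge-score` packages. The reference string below is an illustrative
# stand-in (the actual dataset reference is not shown in Figure 1).
import sacrebleu
from rouge_score import rouge_scorer

reference = ("Audi's 4.2 quattro (4172 cc) has 265 kilowatts (355 hp) "
             "and 430 newton metres (317 lb.ft).")

candidates = {
    # Factually correct, but phrased very differently from the reference.
    "GPT-3.5 (correct)": (
        "In the Audi A8, the V8 variant has a 4.2 quattro engine with a "
        "displacement of 4172 cc, power of 360 PS (265 kW; 355 hp), and "
        "torque of 430 N.m (317 lbf.ft)."
    ),
    # Mimics the reference style but states the wrong power figures.
    "fine-tuned (incorrect)": (
        "Audi V8's 4.2 quattro (4172 cc) was developed in 1999, with "
        "309 kW (414 hp) and 430 newton metres (317 lb.ft)."
    ),
}

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for name, cand in candidates.items():
    bleu = sacrebleu.sentence_bleu(cand, [reference]).score
    rouge_l = scorer.score(reference, cand)["rougeL"].fmeasure
    # A candidate that copies the reference's phrasing can outscore a
    # factually correct candidate that is worded differently.
    print(f"{name}: BLEU={bleu:.1f}  ROUGE-L={rouge_l:.2f}")
```

With choices like these, the stylistically similar but incorrect candidate tends to receive the higher surface score, which is exactly the mismatch the single-choice format is designed to avoid.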
To provide a more reliable evaluation, this paper introduces a new benchmark for Table Information Seeking (TabIS). We design our benchmark using a single-choice question format, motivated by popular benchmarks like MMLU (Hendrycks et al., 2020) and BBH (Suzgun et al., 2022), which utilize this format to offer a reliable and widely accepted evaluation of LLMs. We convert TTG datasets like ToTTo (Parikh et al., 2020) and Hitab (Cheng et al., 2022) into this format so that the results can be evaluated simply and reliably. A challenge in curating this benchmark is generating high-quality options for the single-choice questions. The original data's answer naturally serves as the correct option, so the key task is to generate a deceptive wrong option. If the generated option is too simple, e.g. containing obvious logical errors or content unrelated to the table, the benchmark becomes too easy and fails to test LLMs' capabilities. To address this, we devised three prompting-based methods for generating wrong options: Modify-Input, Modify-Output, and Exam-Judge (detailed in Section 2.1). Together, these methods produced a variety of deceptive options. The manually verified accuracy rate of our generated data exceeds 92%. We also noted that the Exam-Judge method we proposed generated more challenging questions, which may be useful for future dataset construction.
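As a rough illustration of the conversion, the sketch below shows what a converted single-choice item might look like and how a prompt could be assembled from it. The field names, table rendering, and prompt wording are assumptions for illustration only, not the benchmark's actual schema (see Section 2).

```python
# Sketch of a TTG example converted into a single-choice TabIS item.
# Field names, the table rendering, and the prompt template are illustrative
# assumptions, not the benchmark's actual schema or prompts.
import random

item = {
    "table": ("Model: 4.2 quattro | Displacement: 4172 cc | "
              "Power: 265 kW (355 hp) | Torque: 430 N.m (317 lb.ft)"),
    "question": "Based on the table, what information can you get about V8?",
    # The original TTG answer serves as the correct option.
    "correct": ("Audi's 4.2 quattro (4172 cc) has 265 kilowatts (355 hp) "
                "and 430 newton metres (317 lb.ft)."),
    # A deceptive wrong option produced by one of the prompting-based methods.
    "distractor": ("Audi V8's 4.2 quattro (4172 cc) was developed in 1999, "
                   "with 309 kW (414 hp) and 430 newton metres (317 lb.ft)."),
}

def build_prompt(item: dict, rng: random.Random) -> tuple[str, str]:
    """Shuffle the two options and return (prompt, gold letter)."""
    options = [("correct", item["correct"]), ("distractor", item["distractor"])]
    rng.shuffle(options)
    lines = [item["table"], item["question"]]
    gold = ""
    for letter, (kind, text) in zip("AB", options):
        lines.append(f"{letter}. {text}")
        if kind == "correct":
            gold = letter
    lines.append("Answer with A or B.")
    return "\n".join(lines), gold

prompt, gold = build_prompt(item, random.Random(0))
print(prompt)
print("gold:", gold)
```

Accuracy over such items reduces to exact match on the chosen letter, with 50% as the random-guess baseline.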
Leveraging the high-quality options, TabIS encompasses three practical scenarios with increasing difficulty for table information seeking: (1) basic TIS derived from TTG (B-TIS); (2) TIS that emphasizes structural understanding (SU-TIS), i.e. when directed to a specific table area with position information (row and column); and (3) TIS from multiple tables (M-TIS), i.e. when confronted with additional pseudo-relevant tables. These scenarios reflect common challenges in real-world applications, such as chatbots and retrieval-augmented systems.
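To sketch the distinction between the three settings, the toy prompt builders below follow the descriptions above; the exact prompt construction is specified in Section 2, so the wording here is purely illustrative.

```python
# Toy prompt builders for the three TabIS settings described above. The exact
# prompts used by the benchmark are defined in Section 2; the wording here is
# an illustrative assumption.

def b_tis_prompt(table: str, question: str, options: str) -> str:
    # B-TIS: a single table plus a question about the target content.
    return f"{table}\n{question}\n{options}"

def su_tis_prompt(table: str, row: int, col: int, options: str) -> str:
    # SU-TIS: the target area is identified only by its position in the table.
    question = f"What information does the cell at row {row}, column {col} convey?"
    return f"{table}\n{question}\n{options}"

def m_tis_prompt(tables: list[str], question: str, options: str) -> str:
    # M-TIS: pseudo-relevant tables are presented alongside the relevant one.
    return "\n\n".join(tables) + f"\n{question}\n{options}"
```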
While previous studies (Zhao et al., 2023b) that test on the basic TIS setting with unreliable metrics demonstrate the superiority of LLMs, TabIS reveals the limitations and potential challenges of LLMs in table information seeking, as follows.

• Most LLMs show suboptimal TIS performance, especially in complex TIS scenarios and when handling tables with rich hierarchies. Experiments on 12 representative LLMs show that only GPT-4-turbo attained 85.7% accuracy on average (a random guess would yield 50% accuracy). The top-performing 70B open-source model achieved 74.4%, with the rest falling in the 50-60% range.

• LLMs exhibit a poor understanding of table structures, with accuracy fluctuating across different cell positions. Surprisingly, we find that LLMs perform almost at random levels in basic lookup tasks, such as repeating the content in a specific row. This highlights the substantial challenges in real-world SU-TIS scenarios, where models struggle to pinpoint the target table area using only positional cues.

• LLMs struggle to balance TIS performance and robustness against pseudo-relevant tables, especially open-source models. This indicates a great challenge for LLMs in retrieval-augmented generation scenarios.

Finally, we fine-tune Llama2-13b-chat on our weakly-supervised training dataset and find that while fine-tuning can significantly improve TIS performance, boosting accuracy from 55.5 to 73.2, it still lags behind GPT-4-turbo, which has not been specifically fine-tuned. This indicates that the proposed benchmark is non-trivial, calling for further investigation and improvement in this field.

2 TabIS Benchmark

We curated the TabIS benchmark to investigate the table information seeking capabilities of LLMs. We use table-to-text generation (TTG) datasets as the original data source in our benchmark. The