Long Form Factuality

How factual are LLMs when generating answers to open-ended questions?

Created: by Pradeep Gowda Updated: Mar 28, 2024 Tagged: LLM

Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics.

This paper by Google DeepMind (Wei, Jerry, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, et al. “Long-form factuality in large language models,” 2024) proposes a method they call Search-Augmented Factuality Evaluator (SAFE). SAFE uses an LLM to break a long-form response down into a set of individual facts and to evaluate the accuracy of each fact through a multi-step reasoning process: it sends search queries to Google Search and determines whether the fact is supported by the search results. The authors also propose extending F1 score as an aggregated metric for long-form factuality: the percentage of supported facts in a response (precision) is balanced against the percentage of provided facts relative to a hyper-parameter representing a user’s preferred response length (recall).
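
As a rough illustration of that aggregation, here is a minimal Python sketch of how the precision/recall balance described above could be computed. The function name, variable names, and the example numbers are illustrative, not taken from the paper; the per-fact supported/not-supported verdicts are assumed to come from SAFE's evaluation step.

```python
# Minimal sketch (not the authors' code) of the F1@K-style aggregation:
# precision is the fraction of a response's facts that are supported,
# recall is how many supported facts were provided relative to K, a
# hyper-parameter for the user's preferred response length.

def f1_at_k(num_supported: int, num_facts: int, k: int) -> float:
    """Balance factual precision against recall at a target fact count K."""
    if num_supported == 0 or num_facts == 0:
        return 0.0
    precision = num_supported / num_facts        # fraction of facts supported
    recall = min(num_supported / k, 1.0)         # coverage relative to K
    return 2 * precision * recall / (precision + recall)

# Example (illustrative numbers): 40 supported facts out of 50, scored at K = 64.
print(f1_at_k(num_supported=40, num_facts=50, k=64))  # ≈ 0.70
```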


See also: TruLens for LLMs

Create credible and powerful LLM apps, faster. TruLens is a software tool that helps you objectively measure the quality and effectiveness of your LLM-based applications using feedback functions. Feedback functions programmatically evaluate the quality of inputs, outputs, and intermediate results, so that you can expedite and scale up experiment evaluation. Use it for a wide variety of use cases, including question answering, summarization, retrieval-augmented generation, and agent-based applications.