Skip to content

Latest commit

 

History

History
20 lines (15 loc) · 2.83 KB

2407.11418.md

File metadata and controls

20 lines (15 loc) · 2.83 KB

Background

  • Background The paper discusses the lack of high-level abstractions for deploying semantic queries at scale with existing language models (LMs).

  • Existing Work Existing systems and frameworks are insufficient in tackling applications that require bulk semantic processing; they are mainly limited to data retrieval and generation and lack capabilities for complex semantic query patterns over structured and unstructured data, and on performing semantic aggregations or transformations across documents.

Core Contributions

  • Introduced LOTUS, a query engine
    • Challenge 1: How to express semantic queries The paper addresses this challenge through the definition of semantic operators, proposing a declarative programming interface that extends the relational model with composable AI-based operations to facilitate semantic queries over datasets.

    • Challenge 2: How to design an efficient and accurate underlying query system LOTUS tackles this challenge by offering a variety of algorithms and optimizations for the implementation of semantic operators, abstracting away details like model context length limits and choice of algorithms. It enables operations such as semantic filtering, joining, ranking, aggregating, and projecting, which allows programming with natural language expressions.

Implementation and Deployment

LOTUS’ effectiveness has been demonstrated across real-world applications, including fact-checking, extreme multi-label classification, and search. Compared to existing approaches, LOTUS showed great expressiveness, capturing state-of-the-art query pipelines across these diverse applications with minimal development overhead. For instance, on the FEVER dataset, LOTUS replicated a recent leading fact-checking pipeline FacTool in a few lines of code and improved accuracy by 9.5% while reducing execution time by 7 to 34 times. On the BioDEX dataset's extreme multi-label classification task, LOTUS matched top-quality results with its join operator and provided an efficient algorithm running 800 times faster than a naive join. In search and ranking applications, it utilized simple operator compositions to achieve 5.9 to 49.4% higher nDCG@10 than standard retrievers and re-rankers while offering query efficiency, with 1.67 to 10 times lower execution time than LM-based ranking methods used by prior works. LOTUS is publicly available on GitHub.

Summary

The paper presents the LOTUS system, which enables queries based on natural language through the definition of semantic operators, implementing fast and accurate query execution through efficient algorithms and optimizations. LOTUS demonstrates its wide applicability and high performance in multiple real-world application cases, signifying its importance in advancing LM-based large-scale semantic analysis and query systems.