text-2-sql-cove

This repo implements a basic version of the Chain of Verification (CoVe) approach on the Yale Spider text-to-SQL challenge. Spider is a large-scale, complex, cross-domain semantic parsing and text-to-SQL dataset. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains.

Chain of Verification (CoVe) paper: https://arxiv.org/abs/2309.11495

Yale Spider text-to-sql challenge: https://yale-lily.github.io/spider

Approach

The implementation has four stages:

  1. Baseline response (with question similarity matching): This stage uses the English question and the corresponding schema to get a baseline response from the LLM. Importantly, it uses few-shot examples of English/SQL pairs. These examples are chosen dynamically from the training dataset based on the questions most similar to the one for which SQL is currently being predicted; similarity is computed with 'bert-base-uncased' embeddings and cosine similarity (a sketch of this selection step appears just after this list). See model/baseline.py

  2. Plan verifications: This stage uses the English question and the SQL from the previous stage as inputs to the LLM (along with a one-shot example). The objective of this stage is to use the LLM to interrogate the baseline response. The expected response is a set of questions that can be used to investigate whether the columns, joins, etc. used in the baseline SQL are valid. See build_plan() in model/planner.py

  3. Execute verifications: This stage uses the verification questions generated in the previous stage and the schema as inputs. The LLM is expected to answer each verification question using the schema as a knowledge base. See execute_plan() in model/planner.py

  4. Final response: This stage uses the original English question, the schema, the SQL from the baseline response, and the set of verification questions and answers from stages 2 and 3. The output is a final SQL query. See model/assembler.py (an end-to-end sketch of the four stages also follows this list)
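
Below is a minimal sketch of the stage 1 few-shot example selection, assuming mean-pooled bert-base-uncased embeddings; the function and variable names here are illustrative and may differ from the actual code in model/baseline.py.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pooled bert-base-uncased embeddings for a list of questions."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)         # (batch, tokens, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)

def pick_few_shot(question, train_questions, train_sqls, num_matches=3):
    """Return the (question, SQL) training pairs most similar to `question`."""
    q_emb = embed([question])                            # (1, 768)
    t_emb = embed(train_questions)                       # (N, 768)
    sims = torch.nn.functional.cosine_similarity(q_emb, t_emb)  # (N,)
    top = sims.topk(num_matches).indices.tolist()
    return [(train_questions[i], train_sqls[i]) for i in top]
```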

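Putting the four stages together, a minimal end-to-end sketch of the CoVe flow might look like the following. The prompt wording, the ask_llm helper, and the newline-separated question format are illustrative assumptions, not the repo's actual prompts (those live in model/baseline.py, model/planner.py and model/assembler.py). In the repo, this orchestration is driven by cove.py.

```python
def predict_sql(question: str, schema: str, few_shot, ask_llm) -> str:
    """Illustrative CoVe flow: ask_llm is any callable that maps a prompt
    string to the LLM's text response (e.g. a thin OpenAI API wrapper)."""
    shots = "\n".join(f"Q: {q}\nSQL: {s}" for q, s in few_shot)

    # Stage 1: baseline SQL from the English question, schema and similar examples
    baseline_sql = ask_llm(
        f"Schema:\n{schema}\n\nExamples:\n{shots}\n\nWrite SQL for: {question}"
    )

    # Stage 2: plan verification questions probing the columns, joins, etc.
    plan = ask_llm(
        f"Question: {question}\nSQL: {baseline_sql}\n"
        "List questions (one per line) to verify this SQL is correct."
    )
    verification_qs = [q.strip() for q in plan.splitlines() if q.strip()]

    # Stage 3: answer each verification question using the schema as knowledge base
    qas = [(q, ask_llm(f"Schema:\n{schema}\n\nAnswer briefly: {q}"))
           for q in verification_qs]

    # Stage 4: assemble the final SQL from the baseline and the verification Q&A
    evidence = "\n".join(f"{q} -> {a}" for q, a in qas)
    return ask_llm(
        f"Schema:\n{schema}\nQuestion: {question}\nDraft SQL: {baseline_sql}\n"
        f"Verification:\n{evidence}\nReturn the corrected final SQL only."
    )
```
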
Usage

  1. Clone the repo with git clone https://github.com/cswamy/text-2-sql-cove.git
  2. Navigate into the repo folder
  3. Create a Python virtual env with python -m venv venv and activate it with source venv/bin/activate
  4. Install dependencies with pip install -r requirements.txt
  5. Run export OPENAI_API_KEY="<your-openai-key>" to set up the OpenAI connection
  6. Create predictions against the dev dataset using python cove.py. 🚨 This will run predictions against the full dev Spider dataset (found in data/dev.json). To run it against a smaller sample, use the command-line options, e.g. python cove.py --sample_size 100 --seed 42
  7. The output folder can be found under experiments/, containing pred_sqls.txt with the predicted SQLs, gold_sqls.txt with the gold SQLs from the dev dataset, and outputs.json, a list of dictionaries with the outputs for each sample in the dev dataset. A metrics.json file is also created under experiments/ with some useful statistics for the run (e.g. token counts)
  8. Run cd evaluation to enter the folder with the evaluation scripts from Spider
  9. Run python evaluation.py --gold '<PATH TO OUTPUT FOLDER>/gold_sqls.txt' --pred '<PATH TO OUTPUT FOLDER>/pred_sqls.txt' --etype 'all' --db '../data/database/' --table '../data/tables.json' > '<PATH TO OUTPUT FOLDER>/result.txt'
  10. This will create a result.txt file in the output folder with evaluation metrics from Spider
  11. Optional: To analyse SQL results, go into the experiments folder and run python sql_executor.py --outputfile <PATH TO outputs.json> > <PATH TO OUTPUT FOLDER>/sql_results.txt. This will create an sql_results.txt file under the output folder where you can analyse the SQL outputs

Results

Results from some experiments run on the full dev dataset:

Experiment | Spider execution accuracy
--- | ---
GPT-3.5-Turbo without CoVe and num_matches = 0 (baseline) | 42.4
GPT-3.5-Turbo without CoVe and num_matches = 3 | 64.9
GPT-3.5-Turbo with CoVe and num_matches = 0 | 41.7
GPT-3.5-Turbo with CoVe and num_matches = 3 | 63.2

Notes:

  1. Choosing few-shot examples in stage 1 with bert-base-uncased embeddings and cosine similarity gives a 22.5 percentage point (2250bps) improvement over the baseline
  2. CoVe does not seem to add much improvement (at least with the current prompts and GPT-3.5-Turbo)
  3. GPT-4 was tested on a small sample (n=100, seed=42) and achieved an execution accuracy of 59.0

Potential next steps

  1. Try with a different LLM (e.g. Code Llama)
  2. Test different prompts, especially for baseline and final responses (stages 1 and 4)
