Sen Wu, Peng Xu

NSQL-Llama-2-7b: Advancing SQL Foundation Models for Personalized Analytics

This blog introduces NSQL-Llama-2-7b, Numbers Station's new SQL generation foundation model built off of Meta's Llama 2. NSQL-Llama-2-7b achieves up to a 15.5-point execution accuracy improvement over the previous NSQL 6b model while narrowing the performance gap with GPT-4 on standard SQL benchmarks.

We are thrilled to introduce the NSQL-Llama-2-7b model, a SQL generation foundation model (FM) built on top of Meta’s Llama 2. Through extensive training on NSText2SQL data, the NSQL-Llama-2-7b model achieves up to a 15.5-point execution accuracy improvement over our previous NSQL 6b model. Further, we applied the NSQL-Llama-2-7b model to one of our customers’ production workloads and measured a 7.8-point improvement over GPT-4. We are excited about the progress this model represents, as it marks another significant milestone in our mission to empower every enterprise to develop custom analytics copilots that precisely cater to their unique needs.

Open-source FMs for Analytics

At Numbers Station, our goal is to bring the power of FMs to the modern data stack to democratize access to insights by providing enterprises with models they can fully own and control. In line with this vision, just under a month ago we launched the initial generation of NSQL models and their corresponding training data. This first step marked the beginning of our efforts to improve the SQL generation performance of open-source models, which enterprises can use commercially and develop further. We are thrilled that this release has already garnered more than 4,000 downloads on Hugging Face, a testament to the strong interest from the community.

Within days of introducing NSQL, Meta released a new state-of-the-art open-source foundation model known as Llama 2, which quickly gained recognition as one of the best models in the open-source landscape today. Its outstanding performance across diverse benchmarks, combined with a suite of model sizes that are lean compared to giants like GPT-3, opens up remarkable possibilities in the field of AI. Moreover, the model’s permissive license empowers enterprises to leverage its potential and build their own private models. Thanks to Meta’s efforts, Llama 2 is now freely accessible for both open-source and commercial use, ushering in a new era of AI innovation and fostering collaborative development within the community.

Building on the momentum of both NSQL and Llama 2, we are excited to unveil NSQL-Llama-2-7b. To build it, we trained the Llama-2-7b model on our publicly released NSText2SQL dataset, following the same training process discussed in our previous blog post. In the remainder of this blog, we describe NSQL-Llama-2-7b in more detail by presenting experimental results and a brief technical analysis.

Experimental Results

The results of our experiments are summarized in Table 1. Through our fine-tuning efforts, the NSQL-Llama-2-7b model reaches a 75% execution accuracy on the Spider benchmark dataset, a substantial 46-point improvement over the vanilla Llama-2-7b model. Moreover, compared to our previous NSQL model, the NSQL-Llama-2-7b model delivers an 11.4-point execution accuracy boost on the Spider benchmark, and a noteworthy 2.2-point execution accuracy advantage over ChatGPT on the same benchmark. This advancement is significant in bridging the performance gap between open-source large language models and closed OpenAI models. For the GeoQuery benchmark, we observe similar execution accuracy¹ but measure a substantial 15.2-point improvement in matching accuracy compared to NSQL 6b.

| Group | Model | Size | Spider Execution Accuracy | Spider Matching Accuracy | GeoQuery Execution Accuracy | GeoQuery Matching Accuracy |
|---|---|---|---|---|---|---|
| Open Source (Other) | replit/replit-code-v1-3b | 3b | 0.47 | 0.343 | 0.265 | 0.109 |
| Open Source (Other) | bigcode/starcoder | 15.5b | 0.577 | 0.36 | 0.122 | 0.13 |
| Open Source (Other) | mosaicml/mpt-7b | 7b | 0.358 | 0.264 | 0.143 | 0.261 |
| Open Source (Other) | facebook/incoder-6b | 6b | 0.316 | 0.147 | 0.021 | 0.022 |
| Open Source (Other) | Salesforce/xgen-7b-8k-inst | 7b | 0.393 | 0.273 | 0.061 | 0.239 |
| Open Source (Other) | Salesforce/codegen-350m-multi | 350m | 0.102 | 0.067 | 0.02 | 0.022 |
| Open Source (Other) | Salesforce/codegen-2b-multi | 2b | 0.25 | 0.17 | 0.061 | 0.065 |
| Open Source (Other) | Salesforce/codegen-6b-multi | 6b | 0.321 | 0.226 | 0.061 | 0.109 |
| Open Source (Other) | meta-llama/Llama-2-7b | 7b | 0.291 | 0.193 | 0 | 0.087 |
| Open Source (Other) | meta-llama/Llama-2-13b | 13b | 0.449 | 0.322 | 0.122 | 0.261 |
| Open Source (Other) | meta-llama/Llama-2-70b | 70b | 0.615 | 0.354 | 0.122 | 0.283 |
| Open Source (Ours, Pretrain) | NSQL 350m | 350m | 0.199 | 0.125 | 0.041 | 0.043 |
| Open Source (Ours, Pretrain) | NSQL 2b | 2b | 0.345 | 0.226 | 0.02 | 0.043 |
| Open Source (Ours, Pretrain) | NSQL 6b | 6b | 0.434 | 0.279 | 0.061 | 0.13 |
| Open Source (Ours, Pretrain) | NSQL-Llama-2-7b | 7b | 0.47 | 0.312 | 0.02 | 0.13 |
| Open Source (Ours, Pretrain + Instruct) | NSQL 350m | 350m | 0.517 | 0.456 | 0.184 | 0.043 |
| Open Source (Ours, Pretrain + Instruct) | NSQL 2b | 2b | 0.593 | 0.532 | 0.184 | 0.152 |
| Open Source (Ours, Pretrain + Instruct) | NSQL 6b | 6b | 0.636 | 0.574 | 0.265 | 0.152 |
| Open Source (Ours, Pretrain + Instruct) | NSQL-Llama-2-7b | 7b | 0.75 | 0.663 | 0.265 | 0.304 |
| Closed Source | Text Davinci 003 | 175b | 0.72 | 0.499 | 0.245 | 0.413 |
| Closed Source | GPT 3.5 | | 0.728 | 0.442 | 0.469 | 0.391 |
| Closed Source | GPT 4 | | 0.762 | 0.419 | 0.551 | 0.391 |
Table 1: Execution accuracy and exact match accuracy on the Spider and GeoQuery datasets. Dataset splits taken from here. We use the prompt format specified here.
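To make the setup concrete, our prompts follow the NSQL convention of presenting the table schemas as CREATE TABLE statements and giving the instruction and question as SQL comments, with the model completing the query after a trailing SELECT. The sketch below illustrates this format with an invented table and question; the exact instruction wording in our evaluation prompt may differ slightly.

CREATE TABLE customers (
    id NUMBER,
    name TEXT,
    city TEXT
)

-- Using valid SQLite, answer the following questions for the tables provided above.

-- How many customers are located in Chicago?

SELECT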

In addition to academic benchmarks, we also evaluated NSQL-Llama-2-7b on a customer’s production workload. This workload consists of 30 hand-crafted SQL queries, run in production daily by this customer, and their corresponding natural language descriptions. Each query embeds essential business logic in the form of both single-table queries and complex join queries over multiple tables. In Figure 1, we show that on this workload NSQL-Llama-2-7b significantly boosts execution accuracy, surpassing our previous NSQL 6b by 15.5 points and outperforming GPT-4 by 7.8 points. We found that this is because NSQL-Llama-2-7b excels at handling the complex logic and longer contexts common in production workloads. This important customer win underscores the potential of NSQL-Llama-2-7b for wider adoption across enterprises.

Figure 1: NSQL-Llama-2-7b execution accuracy on a customer production workload, where it outperforms the closed-source models.

Technical Analysis

Thanks to its SQL-specific pretraining, the NSQL-Llama-2-7b model exhibits better query understanding and improved handling of long-context scenarios, which is particularly useful for complex, multi-table join queries.

  • Figure 2 provides a detailed view of these improvements, showcasing a 43% improvement for “Join” queries and a 54% improvement for “Nested” queries in the Spider benchmark.
  • Table 2 provides concrete examples (examples 1, 2, 3, and 4) that showcase the NSQL-Llama-2-7b model’s new capabilities. For instance, example 1 shows that NSQL-Llama-2-7b can now generate ratio-type queries, a capability that was absent in our NSQL 6b model. Similarly, in example 2 the model comprehends complex question logic and generates the correct SQL query, addressing a failure mode of NSQL 6b, which often overlooked vital filters in similar cases. Examples 3 and 4 further demonstrate NSQL-Llama-2-7b’s ability to choose the correct join strategy among three tables and deduce the right query logic, overcoming NSQL 6b’s difficulty in identifying the proper join pattern or the critical GROUP BY logic during query generation.
  • When applied to the customer workload detailed above, NSQL-Llama-2-7b achieved a 60% improvement in questions necessitating a large context (1000 tokens or more in our evaluation setting) compared to our previous NSQL 6B model. This establishes the model’s superior handling of long-context scenarios, setting it apart from previous iterations of NSQL models.
Figure 2: Comparison between the NSQL 6b and NSQL-Llama-2-7b models on different categories of the Spider benchmark test set. “Join” is the subset of test questions that require join SQL queries, “Nested” is the subset that requires nested SQL queries, and “Simple” is the subset that requires neither joins nor nesting.
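To make these categories concrete, the queries below are hand-written illustrations over the property-management schema shown in the next section; they are not drawn from the Spider benchmark itself.

-- "Simple": no join and no nesting
SELECT COUNT(*) FROM work_order WHERE is_open = 1;

-- "Join": requires combining two or more tables
SELECT T2.property_name, COUNT(*)
FROM work_order AS T1 JOIN property AS T2 ON T1.property_id = T2.id
GROUP BY T2.property_name;

-- "Nested": requires a subquery
SELECT property_name FROM property
WHERE area > (SELECT AVG(area) FROM property);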

Nonetheless, while NSQL-Llama-2-7b has made significant strides, we have also come across certain challenges. Specifically, the model encounters difficulties in understanding calculations involving date-related columns, an important query type in business analytics for trend analysis (see example 5 in Table 2). To overcome this limitation, we plan to improve the model’s performance by including more date-related query examples in the finetuning dataset.
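For reference, one reasonable reading of example 5 can be expressed with SQLite’s built-in date functions, assuming the date columns are stored as ISO-8601 text; neither model currently produces a query along these lines.

-- Work orders cancelled within the 3 months before their due date
-- (illustrative query; assumes ISO-formatted DATE columns in SQLite)
SELECT COUNT(*)
FROM work_order
WHERE is_canceled = 1
  AND cancel_date BETWEEN DATE(due_date, '-3 months') AND due_date;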

Furthermore, as emphasized in our previous blog, there are instances where NSQL models may require domain-specific information to generate precise queries (illustrated by examples 6 and 7 in Table 2). To address this issue and enhance the model’s understanding of domain knowledge and business semantics, we offer finetuning on domain-specific data to our customers, as this is a critical step in gaining organizational insights in each distinct domain.

CREATE TABLE work_order (
    id NUMBER,
    property_id NUMBER,
    cost FLOAT,
    invoice_amount FLOAT,
    entered_date DATE,
    due_date DATE,
    complete_date DATE,
    cancel_date DATE,
    is_canceled BOOLEAN,
    is_completed BOOLEAN,
    is_open BOOLEAN,
    order_type TEXT,
    assigned_to TEXT
)

CREATE TABLE property (
    id NUMBER,
    property_name TEXT,
    area FLOAT,
    owner_id NUMBER,
    city TEXT,
    country TEXT
)

CREATE TABLE owner (
    id NUMBER,
    name TEXT,
    salary FLOAT
)
| Example | Question | NSQL-Llama-2-7b | NSQL 6b |
|---|---|---|---|
| 1 | What is the completion percentage in the work orders for each order type? | SELECT order_type, SUM(is_completed) * 100 / COUNT(*) FROM work_order GROUP BY order_type | SELECT order_type, SUM(complete_date) * 100 / SUM(due_date) FROM work_order GROUP BY order_type; |
| 2 | How many open work orders for the largest property in Chicago? | SELECT COUNT(*) FROM work_order AS T1 JOIN property AS T2 ON T1.property_id = T2.id WHERE T2.city = 'Chicago' AND T1.is_open = 1 AND T2.area = (SELECT MAX(area) FROM property WHERE city = 'Chicago') | SELECT COUNT(*) FROM work_order AS T1 JOIN property AS T2 ON T1.property_id = T2.id WHERE T2.city = 'Chicago' AND T1.is_open = 1; |
| 3 | What is the average work order cost for the properties by each owner? | SELECT AVG(t1.cost), owner_id FROM work_order AS t1 JOIN property AS t2 ON t1.property_id = t2.id GROUP BY owner_id | SELECT AVG(T1.cost), T2.name FROM work_order AS T1 JOIN owner AS T2 ON T1.owner_id = T2.id GROUP BY T2.name; |
| 4 | Who has the property that has the most number of work orders? | SELECT T1.name FROM owner AS T1 JOIN property AS T2 ON T1.id = T2.owner_id JOIN work_order AS T3 ON T2.id = T3.property_id GROUP BY T2.id ORDER BY COUNT(*) DESC LIMIT 1 | SELECT T1.property_name FROM property AS T1 JOIN work_order AS T2 ON T1.id = T2.property_id GROUP BY T1.property_name ORDER BY COUNT(*) DESC LIMIT 1; |
| 5 | How many work order got cancelled within 3 months before the due date? | SELECT COUNT(*) FROM work_order WHERE cancel_date <= due_date - 3 | SELECT COUNT(*) FROM work_order WHERE cancel_date BETWEEN due_date - 3 AND due_date; |
| 6 | What is the expected profit of open work orders? | SELECT SUM(invoice_amount) FROM work_order WHERE is_open = 1 | SELECT SUM(invoice_amount) FROM work_order WHERE is_open = 1; |
| 7 | What is the expected profit of open work orders? Profit is total difference of invoice and cost across orders. | SELECT SUM(invoice_amount - cost) FROM work_order WHERE is_open = 1; | SELECT SUM(invoice_amount - cost) FROM work_order WHERE is_open = 1; |
Table 2: Examples of query generations using NSQL models. The input table schemas are defined above.

Summary

With the release of the NSQL-Llama-2-7b model, we invite data practitioners and AI enthusiasts to join us in shaping the future of AI-driven analytics. You can access the NSQL-Llama-2-7b model and its weights on Hugging Face here, or try out one of our example notebooks here. Together, we can continue to bridge the divide between cutting-edge AI capabilities and accessible, enterprise-owned models, fostering innovation and driving the data ecosystem forward. At Numbers Station, we are dedicated to driving positive change in the AI landscape, and we look forward to the impact of this next step in our journey.

If you are interested in bringing the power of NSQL-Llama-2-7b and related analytics FMs to your enterprise in a personalized manner, see more at https://www.numbersstation.ai/. Join us on this exciting journey of transforming the world of data analytics and decision-making.

1. The execution accuracy did not improve due to case sensitivity in the emitted SQL (e.g. the model emits “California” when the query requires “california”). We plan to address this issue in future releases. Regardless, the exact match accuracy provides a more accurate representation of the model’s overall improvement.