Numbers Station Labs releases Meadow—the simplest agentic framework built for complete data workflows.
Laurel Orr, Xiao Ling, Vishal Motwani

SQL Coding Assistants Customized to Enterprise Logs

Numbers Station wants to bring the productivity of foundation models (FM) for code to the data analyst and help them generate SQL to transform their data. SQL FMs need to be heavily personalized to the data and infrastructure the analyst interacts with, but customizing FMs is difficult because they are difficult to prompt, prohibitively expensive to finetune over enterprise data, and not applicable over private data. Numbers Station introduces Data Transformation Assistant to help the enterprise customize their own smaller SQL FM that can be deployed locally. Numbers Stations produces competitive models on two benchmark datasets – even outperforming GPT-4 - and can be more than 2,000 times cheaper than conventional approaches.

Introduction

Foundation models (FMs) for code, such as OpenAI’s GPT Codex or Salesforce’s CodeGen, are large, pretrained models that can generate entire blocks of code given a natural language instruction or partial code input. Code FMs have already had a massive impact on programmer productivity. Companies have started using code FMs to help users query over data, generate data visualizations, and develop front end designs. One of the most widely successful use cases has been CoPilot – an embedded coding assistant backed by Codex – that claims to code up to 30% of new user code. Code FMs have the potential to massively improve programmer productivity and code quality.

Code Assistants for the Data Analyst

For data analysts who spend the majority of their time generating data transformation pipelines, often via SQL queries, code FMs promise to facilitate their workflow. With a code FM, a user with limited knowledge of SQL can simply describe the output table they want or the question they want to answer, and the model can help them generate the query. We’re already seeing exciting work in this space with Adept and Hex building natural language to SQL solutions on top of OpenAI and open source efforts such as Salesforce’s Picard achieving state-of-the-art results on text to SQL benchmarks by finetuning smaller FMs.

Existing FM solutions for SQL coding assistance, however, are difficult to customize via prompting or training to meet the personalization needs of the data analyst. As CoPilot demonstrated, successful coding assistants need to be easily adaptable to the individual user and workflow to be successful. For a data analyst, the FM will need to understand the enterprise data and infrastructure the analyst engages with as well as any idiosyncratic coding styles of the user. Achieving a high level data personalization with a FM is challenging for three reasons.

  1. Difficult to Customize with Prompt Engineering: The canonical method for customizing FMs is through prompt engineering and painstaking demonstration construction. Prompting is a notoriously fickle and sensitive process. For example, just changing the order of examples in the prompt can take performance on a sentiment classification task from near random to over 85% accuracy. Even with manual prompt engineering, FMs can still underperform finetuned models. For SQL code generation, manually prompted Codex still underperforms state-of-the-art models on text to SQL benchmarks by up to 16 pts of accuracy.
  2. Expensive to Customize with Finetuning: An alternative method for customization is finetuning a FM either via open source training or through training APIs. Training, however, requires prohibitively expensive ML expertise and manual effort from a SQL expert to generate high quality training data. For example, generating the training data for the popular Spider text to SQL benchmark took 1,000 person hours.
  3. Not Owned or Deployed by the Enterprise: Many closed source FMs are available only via APIs. Enterprises with sensitive data and code are reluctant to send data out of their network and are therefore hesitant, and in the worst case prohibited, from using closed source FMs. Enterprises with high privacy requirements need SQL assistants that can be cheaply hosted within their network.

The Numbers Station SQL Transformation Solution

At Numbers Station, our goal is to use FM technology to help the data analyst prepare, explore, and analyze their data more efficiently. The Numbers Station Data Transformation Assistant uses FMs to both enhance the capabilities of the user and make them more productive. Our prototyping platform (described in post) provides users without any ML expertise the ability to build classification and extraction pipelines with natural language. Our SQL transformation solution, the focus of this post, helps users build small, customized SQL code assistants that are small enough to be deployed locally with a single GPU.

Figure 1: The Numbers Station SQL Transformation Solution workflow. Using a pool of FMs, Numbers Station weakly labels SQL queries from company-wide log data to customize a small, open source SQL coding assistant. As logs continue to be updated, we continually feed in-context demonstrations to the assistant in real time to adapt to new usage patterns and queries.

Numbers Station builds on the incredible work of open source code generation models to help the analyst (or enterprise) build a small SQL coding assistant FM. Our insight is twofold:

  1. SQL logs are prevalent but do not have text instructions associated with them.
  2. Using a FM to go from SQL to text (i.e. SQL summarization) is much easier than text to SQL.

With company SQL query logs, Numbers Station uses existing general purpose FMs to automatically generate text questions for each query (see Figure 1). We then train smaller, personalized generative code FMs on the weakly labeled data. To continuously update and personalize the model, at inference time, we retrieve from company logs to give the model demonstrations of relevant queries.

Our method addresses the above challenges by:

  1. Customized to Company Logs with no Engineering Effort. Instead of requiring SQL experts to manually generate text instructions from the query logs, we use existing general purpose, pretrained text FMs, such as GPT, to weakly annotate the queries for training. No engineer effort is required.
  2. Continuously Adaptable with Knowledge from the Entire Organization. To avoid prompt engineering, we retrieve from the company log during inference to provide further guidance to the model. As logs are continuously updated, once a single user “figures a query out”, that knowledge can instantly be transferred to other analysts doing similar tasks across the entire organization.
  3. Small and Locally Deployed. Our SQL models are small generative FMs customized from open source code FMs that are up to 1/60th of the size of OpenAI’s GPT models. Not only can an enterprise easily and cheaply host the model, but also analysts could even run inference on a local desktop or laptop.


Numbers Station by the Numbers

To demonstrate the quality of the Numbers Station SQL transformation solution, we experiment on two benchmark datasets: Spider and a custom DBT benchmark built on public DBT projects mined from GitHub and manually annotated. We first show how we achieve state-of-the-art quality when customizing a code FM to log data and adapting to new databases and queries at inference time. We then demonstrate our FM-labeled text to SQL data is comparable to manually labeled training data. We finally ablate our solution’s ability to continuously learn from new logs over time, gaining 14 pts of accuracy as log data grows.

Numbers Station SQL transformation solution can achieve state-of-the-art performance.

To evaluate overall performance, we customize a 2.7B GPT Codegen model (60x smaller than OpenAI’s GPT-3) to Spider training data using gold labels and evaluate on Spider validation data which consists of entirely unseen databases and on our DBT benchmark. We compare against existing private FMs, open source FMs, and the state-of-the-art fine-tuned T5 FM (without any architectural modifications) from Scholak et. al..[1]

On Spider, as shown in Table 1, we outperform all baseline models, including GPT-4. While we are 0.3 pts higher than GPT-4 in terms of execution performance, we far surpass GPT-4 in terms of matching performance, achieving 30 pts lift. Numbers Station’s high performance indicates our model’s ability to pick up on preferred query formats and patterns from the logs (e.g. using BETWEEN versus two OR conditions).

Table 1: Performance comparison across models on the Spider validation set. We use Spider exact match accuracy and Spider execution accuracy as implemented in Picard.

On the DBT benchmark (Table 2), our model outperforms all baseline models. In particular, we outperform the Spider fine-tuned T5 model by over 39 pts of PCM F1, indicating the T5 model is dramatically overfit to the Spider data.

Table 2: Performance comparison across models on our DBT benchmark. We use exact match accuracy and PCM F1.

Numbers Station automatic data generation method is comparable to the expert labeled SQL queries and can be 2,000x cheaper.

We compare how the Numbers Station weak labeling method of generating training data using pretrained FMs compares to gold manually labeled data. In Table 3, we show the cost and performance of the baseline code FM without any customization, customization over weakly labeled logs, and customization over gold data. When we compare the cost of using FMs to generate data versus having SQL experts manually annotate queries, we are over 2,000 times cheaper and lose no execution accuracy.

Table 3: Cost and performance comparison of our un-customized model, our weakly labeled customized model, and our model customized over manually labeled data. Inference is done with manually labeled logs for the database being queries.


As user log data grows, Numbers Station’s models continue to learn from the logs.

A key benefit of customizing a generative model is its ability to be continuously updated via demonstrations retrieved in real time from company-wide data. We demonstrate how the Numbers Station’s SQL assistant continuously adapts to user log data even after being initially customized. We take our weakly labeled trained model and evaluate Spider performance retrieving weakly labeled logs as the log data increases. When we have no log data for the relevant database, our platform generates a seed set of queries with a pretrained FM. As shown in Figure 2, from using no log data (zero demonstrations) to using 8 demonstrations of logs over the same database as the query, we see a 14 pts improvement in execution accuracy. Further, we are only 1.7 pts behind using manually labeled logs for inference in terms of execution accuracy.

Figure 2: Spider execution performance of the weakly labeled customized model as the company log data grows. As more logs are generated, we weakly label the query with a relevant text question and retrieve the most relevant logs to use as demonstrations during inference. If there are no logs for the database being queried, we generate a seed set of queries using a FM (left side). If there are logs for the query database, we retrieve from those (right side). The top dotted line shows the performance if we used manually labeled logs from the query database.

Conclusion

In this post, we explained how Numbers Station provides an effective platform for customizing small, text to SQL coding assistant FMs that can be deployed locally within an enterprise’s network. Our SQL coding assistants learn from company log data and help the analyst generate SQL more efficiently and easily. Numbers Station wants to bring FM technology to the data analyst beyond SQL generation. We explain the Numbers Station Data Transformation Assistant here and how we use pretrained FMs to help analysts automate data transformation tasks such as classification, cleaning, and extraction. If interested in learning more, please email info@numbersstation.ai.

[1] To ensure no true answer is leaked into the retrieval set for inference, we remove all retrieved examples that have the same text instruction or SQL query as the evaluation example.  For the pretrained FMs, we use the text to SQL prompting format from Rajkumar et. al.