Peng Xu, Chris Lin, Sen Wu

Prototype then Deploy: shrinking foundation model quality into small models

Foundation Models (FMs) such as GPT-4 promise a natural language interface for data analysts to perform complicated transformation tasks and gain insights from their data. Despite the wide success of FMs, there are still high barriers to using them on enterprise data: they are hard to prompt, privately owned, and expensive to run (a single run over 1M rows could cost more than $3.7K!). At Numbers Station, our goal is to bring state-of-the-art FM technology to all data analysts. With a few natural language examples from an analyst, Numbers Station customizes a small FM on the analyst’s task and data. Our smaller FMs match the capabilities of proprietary large FMs and can be up to 2000x cheaper. At Numbers Station, we are excited to take the first step towards making FM technology accessible to all enterprise analysts.


Foundation Models (FMs) are changing the way engineers and analysts interact with their data. FMs are massive models trained over terabytes of text corpora. FMs present emergent capabilities and can perform a wide variety of tasks through simple natural language instructions. For example, to build a sentiment analysis system before FMs, teams of engineers had to spend weeks if not months curating data and training a classifier. With FMs such as GPT-4 and ChatGPT from OpenAI, PaLM from Google, OPT and LLaMA from Meta, and BLOOM and GPT-NeoX from the open source community, a single engineer can simply provide a prompt like “Classify the sentiment of the tweet as positive or negative” followed by a tweet to the FM. The FM will generate the sentiment of the example.

Promises of Foundation Models for Data Analytics

The application of FMs to data analytics promises to significantly transform the way an analyst processes and gains insights from data. With an FM, an analyst without any ML expertise can “talk” to the model to perform tasks that were previously labor-intensive or beyond the analyst’s skill set, such as data cleaning, information extraction, or classification (see our blogpost on how we use an FM to perform data wrangling tasks). Task simplification and automation allow analysts to spend more time on the task that matters most – analyzing the data and making business decisions – rather than where they currently spend most of their time – cleaning and preprocessing data.

Challenges for Data Analysts with Foundation Models

Applying FMs to data analytics, however, comes with three challenges:

  1. Prompting is still an art: While FMs are incredibly powerful, they can also be very finicky and brittle to prompt. Small wording or formatting changes to the prompt can result in significantly different outputs, which can be frustrating for analysts. For example, small differences in prompts were shown to cause up to a 39 pt performance drop in sentiment analysis.
  2. Lack of model ownership: Many of the popular FMs, e.g. OpenAI’s GPT-4 or Anthropic’s Claude, are proprietary models. This prevents FM use in environments with sensitive or private enterprise data. It also puts the analyst at the mercy of the FM provider’s availability, which can be unreliable.
  3. Expensive to run: FMs are not cheap. These models usually have billions of parameters and cost millions to serve. In the best case, running inference over 1M rows of short-context user data could cost $3.7K. Alternatively, if each row is a long passage that requires the 4K-token context length of GPT-4, inference can cost more than $122K. For an analyst running over large-scale datasets multiple times a day, the costs can quickly become prohibitive.
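
As a rough sketch of how estimates like these arise, the calculation below multiplies the number of rows by the tokens per row and a per-token price. The price of $0.03 per 1K prompt tokens and the per-row token counts are illustrative assumptions rather than figures from this post, and completion tokens are ignored for simplicity.

```python
def estimate_inference_cost(num_rows, tokens_per_row, usd_per_1k_tokens):
    """Return the estimated API cost in USD for one pass over the data."""
    total_tokens = num_rows * tokens_per_row
    return total_tokens / 1000 * usd_per_1k_tokens


# All values below are assumptions chosen for illustration only.
ASSUMED_PRICE = 0.03          # USD per 1K prompt tokens (assumption)
SHORT_ROW_TOKENS = 120        # short-context row, prompt included (assumption)
LONG_ROW_TOKENS = 4_000       # long passage filling a 4K-token context

short_cost = estimate_inference_cost(1_000_000, SHORT_ROW_TOKENS, ASSUMED_PRICE)
long_cost = estimate_inference_cost(1_000_000, LONG_ROW_TOKENS, ASSUMED_PRICE)

print(f"1M short rows: ${short_cost:,.0f}")
print(f"1M long rows:  ${long_cost:,.0f}")
```

Because cost scales linearly with both row count and tokens per row, rerunning the same job several times a day multiplies these totals directly.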

The Numbers Station AI Transformation Solution

Figure 1: The Numbers Station AI Transformation Solution workflow. Users start by defining and giving feedback on a task over their data. We use the feedback to generate training data to customize a small FM specialized to their task and deploy it over their entire data.

At Numbers Station, we are building a Data Transformation Assistant that brings FM technology to the data analyst. Through natural language and a few examples, an analyst with limited technical knowledge can clean, preprocess, and transform their data. The core insight at Numbers Station is to use the FM for its ability to quickly prototype a task in natural language. Once the prototype meets the user’s requirements, typically in terms of data quality, Numbers Station crosses the chasm to production by automatically generating a small, customized model specifically for the task. The resulting model can be deployed at scale and provides efficient inference over user data. Numbers Station’s validate-then-build workflow addresses the above challenges through:

  1. Intelligent user feedback: To overcome prompting brittleness, Numbers Station gathers feedback from the user on how the FM is performing on the task. We use model ensembling to more robustly use the feedback to improve the FM.
  2. Customized model: When the FM performs adequately on a user task, Numbers Station selects a small model from available open source models and uses the pretrained FM to generate training data to customize the small model to the user’s task. This model is owned by the analyst and can be deployed within their network.
  3. Cheap inference: Our small, customized models are generally a hundred to a thousand times smaller than popular FMs and therefore cheaper to run at scale – all while achieving comparable performance to the large, pretrained FM.
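
To make the ensembling idea in step 1 concrete, here is a minimal sketch of one common form of it: majority voting over several prompt variants. This is an assumption about what such ensembling could look like, not a description of the Numbers Station implementation. The `call_model` argument and the `toy_model` stub are hypothetical stand-ins for a real FM call.

```python
from collections import Counter


def ensemble_predict(text, prompt_templates, call_model):
    """Query the model once per prompt variant and return the majority label
    together with the fraction of variants that agreed with it."""
    votes = [call_model(t.format(text=text)) for t in prompt_templates]
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def toy_model(prompt):
    """Hypothetical stub that mimics a brittle FM: one phrasing trips it up."""
    if prompt.startswith("Is this"):
        return "negative"
    return "positive" if "love" in prompt else "negative"


TEMPLATES = [
    "Classify the sentiment of the tweet as positive or negative.\nTweet: {text}\nSentiment:",
    "Is this tweet positive or negative? {text}",
    "Tweet: {text}\nLabel (positive/negative):",
]

label, agreement = ensemble_predict("I love this phone", TEMPLATES, toy_model)
print(label, round(agreement, 2))
```

A single badly phrased prompt is outvoted by the others, and the agreement score gives a simple signal for which rows to show the user for feedback.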

Numbers Station by the Numbers

Below, we present several use cases from benchmark datasets to show that the Numbers Station prototype-then-deploy platform customizes a high-quality model to each use case from user feedback and enables production-ready inference at a low cost.

Numbers Station’s prototyping method improves standard prompting performance across closed and open source FMs.

We first evaluate our FM prototyping method across closed and open source FMs on the RAFT benchmark. RAFT consists of 11 real-world text classification tasks. Using their leave-one-out evaluation (see Table 7 in Sec A.6), we find that the Numbers Station prototype platform using OpenAI’s GPT-4 scores 15.2 pts higher on average across the RAFT tasks than the reported OpenAI GPT-3 prompting results, with up to a 26.7 pt lift on the banking_77 task. We find that the Numbers Station prototype platform is general and provides lift across all evaluated FMs (OpenAI FMs, Together’s GPT-JT FM, and Eleuther’s GPT-NeoX FM). Finally, on a real customer dataset that has long, messy contexts, the Numbers Station prototyping method achieves a 28 pt lift compared to standard prompting methods such as those used in RAFT.
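
For readers unfamiliar with leave-one-out evaluation, the sketch below shows the general idea under our own simplifying assumptions: each labeled example is held out in turn, the remaining examples serve as demonstrations, and accuracy is averaged over all held-out predictions. The `overlap_predict` function is a hypothetical toy predictor standing in for a prompted FM, and the example data is made up.

```python
def leave_one_out_accuracy(examples, predict):
    """examples: list of (text, label); predict(demos, text) -> label."""
    correct = 0
    for i, (text, label) in enumerate(examples):
        demos = examples[:i] + examples[i + 1:]   # everything except example i
        if predict(demos, text) == label:
            correct += 1
    return correct / len(examples)


def overlap_predict(demos, text):
    """Toy stand-in for an FM: copy the label of the demonstration that
    shares the most words with the query text."""
    words = set(text.split())
    best = max(demos, key=lambda d: len(words & set(d[0].split())))
    return best[1]


# Made-up examples for illustration only.
EXAMPLES = [
    ("refund my card payment", "billing"),
    ("card payment failed twice", "billing"),
    ("app crashes on login", "bug"),
    ("login screen crashes", "bug"),
]

accuracy = leave_one_out_accuracy(EXAMPLES, overlap_predict)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```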

Figure 2: Performance comparison across models between NS prototype AI model and standard prompting on the RAFT benchmark.

Numbers Station’s customization produces models 800 times smaller than GPT-3 while maintaining the same level of quality.

We next evaluate the model customization process of the Numbers Station platform. Recall that once the user has provided enough task feedback, the Numbers Station platform selects a smaller model from available open source models and customizes it to the user’s task. In this experiment, Numbers Station selects a T5-based model (with 250M parameters) and uses a general pretrained FM to generate the training data for customization. Across 7 standard data wrangling tasks (spanning imputation, classification, and extraction), we find that the customized Numbers Station model, which is 800 times smaller than GPT-3 (reported as 175B parameters), is within 1 pt on average of the pretrained FM’s performance during Numbers Station prototyping. We further find that, using the same amount of user feedback, our customized model surpasses a baseline method of fine-tuning a small model directly on the feedback by 6.3 pts.
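
The data-generation step can be pictured with the following minimal sketch, in which a large "teacher" model labels unlabeled rows and the results are stored as input/target pairs suitable for fine-tuning a text-to-text model. The confidence filter, the `teacher_label` interface, and the `toy_teacher` stub are all assumptions made for illustration; the post does not describe how the actual pipeline selects or filters examples.

```python
def build_training_set(unlabeled_texts, teacher_label, min_confidence=0.8):
    """Label each text with the teacher and keep only confident predictions.

    teacher_label(text) must return a (label, confidence) pair.
    """
    training_set = []
    for text in unlabeled_texts:
        label, confidence = teacher_label(text)
        if confidence >= min_confidence:
            training_set.append({"input": text, "target": label})
    return training_set


def toy_teacher(text):
    """Hypothetical stub standing in for the large pretrained FM."""
    if "great" in text:
        return "positive", 0.95
    if "terrible" in text:
        return "negative", 0.90
    return "positive", 0.50   # unsure: this row will be filtered out


ROWS = ["great battery life", "terrible screen", "it arrived on tuesday"]
training_set = build_training_set(ROWS, toy_teacher)
for pair in training_set:
    print(pair)
```

The resulting pairs would then be passed to whatever fine-tuning routine is used for the small model; that step is omitted here.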

Table 1: Performance comparison among the NS Custom AI model, the NS Prototype AI model, and a standard T5-base model fine-tuned on the same amount of labeled data, on 7 standard data wrangling tasks.

Numbers Station’s customization produces models that are cheaper to serve and easy to scale.

To illustrate the affordability and scalability of the Numbers Station customized model, we ran inference over up to 1M records of an Amazon sentiment analysis dataset. We compared the serving and inference cost on AWS EC2 instances with the cost of OpenAI’s inference API. For 1M rows, Numbers Station’s inference cost less than $1.7, compared to $3.7K for OpenAI’s GPT-4, a 2275x cost reduction (Table 2).

Table 2: Inference cost comparison among Numbers Station’s custom model and OpenAI’s API endpoint.


In this post, we explained how Numbers Station provides an effective platform that lets an analyst easily and cheaply leverage FM technology at scale to perform data wrangling tasks. Numbers Station goes beyond data wrangling and applies FM technology to other painful tasks analysts perform daily. Our post on building smaller, customized text-to-SQL models shows how analysts can generate SQL queries more efficiently with their own SQL coding assistant. Stay tuned for more posts on how we tackle record linkage. If interested in learning more, please email
