Foundation models, due to their size, can be prohibitively expensive to deploy and difficult to adapt to enterprise data tasks. This blog describes our recent research, in collaboration with the Stanford AI Lab, on deploying cost-effective, smaller foundation models that match or exceed the performance of foundation models 30x their size.
Foundation models (FMs) are generative models with billions of parameters trained on vast amounts of text. At inference time, these models can be adapted to a wide variety of tasks through a natural language input text (prompt) provided to the model. Prompts usually contain an instruction for a task and a few-shot set of demonstrations of the task being performed. For instance, a prompt for a translation task could be the instruction “Translate from French to English” followed by the demonstrations “fromage: cheese, voiture: car” and the text to translate.
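As a concrete illustration, the instruction, demonstrations, and input above might be assembled into a single prompt string as follows (a minimal sketch; the `build_prompt` helper is hypothetical, not part of any FM library):

```python
def build_prompt(instruction, demonstrations, query):
    """Assemble a few-shot prompt: instruction, then demonstrations, then the new input."""
    demo_lines = "\n".join(f"{src}: {tgt}" for src, tgt in demonstrations)
    return f"{instruction}\n{demo_lines}\n{query}:"

prompt = build_prompt(
    "Translate from French to English",
    [("fromage", "cheese"), ("voiture", "car")],
    "maison",  # the text to translate
)
print(prompt)
```

The FM is then asked to complete this string, and its completion (here, the English translation of the final word) is read off as the task output.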
Due to their task-agnostic nature and out-of-the-box performance, FMs are drastically changing the way engineers and domain experts use AI to gain insights from their data. Before FMs, teams of ML engineers needed to curate large amounts of labeled data to train and deploy a model for a single task, leading to an expensive life cycle of training models before validating their performance. With the aid of FMs, a subject matter expert can quickly prototype tasks expressible through natural language by simply iterating on natural language prompts, a process called prompt engineering.
At Numbers Station, we are building the next generation of data automation using foundation model-powered technology (see our blog post on wrangling data with FMs). We tap into FM technology to foster a validate-before-build paradigm in which domain experts can prototype data tasks (e.g., data cleaning, data matching, data transformation, …) with an FM via prompting before paying the cost of deployment. This lowers the barrier to entry relative to existing solutions and enables any data worker to quickly prototype data pipelines without writing code or labeling data.
Despite the promise of FMs, harnessing them in a full prototype-then-deploy workflow comes with a major scalability challenge due to the tradeoff between model quality and size. FM performance tends to increase as models get larger; i.e., bigger is better. Larger models, however, are more costly than their smaller counterparts, requiring more resources to train and host. Many organizations lack the resources to run the largest open-source FMs locally, and privacy concerns can prohibit them from using FMs that are only available via API access.
To address the challenges of deploying FMs at scale, we want a method for using small, open-source FMs without sacrificing performance, as these models can be hosted and run at scale with fewer resources. However, research finds that smaller FMs are more sensitive to slight changes in the input prompt, which increases the burden and difficulty of prompt engineering. Furthermore, as new FMs are released, their behavior on previously curated prompts can change significantly.
In this blog, we present Ask Me Anything (AMA) prompting, our recent work in collaboration with the Stanford AI lab. AMA is a prompting strategy that supports efficient and effective prompt design. Rather than requiring users to iterate until they find the “perfect prompt”, AMA aggregates the outputs produced by multiple decent prompts.
To construct multiple decent prompts, we first study what makes a good prompt structure (e.g., is it better to use a multiple-choice question, a fill-in-the-blank question, or something else?). We then show how to automatically convert task inputs into these effective prompt structures, and use weak supervision, a method for aggregating signals from noisy labelers, to combine the responses. Using AMA, we are the first to show a path forward for small open-source models to outperform significantly larger models. Specifically, we show that a model with 30x fewer parameters exceeds the performance of the few-shot (with demonstrations in the prompt) GPT-3 175B parameter model on 15 natural language benchmark tasks. Notably, the benchmark tasks are taken from the original GPT-3 paper!
To understand what makes a good prompt format, we start our investigation by examining three common prompting methods from the literature.
- (a) Traditional free-form questions, e.g., “What movie did John invite Mark to watch?”
- (b) Cloze questions, which ask the model to fill in missing text, e.g., “John invited Mark to come watch _”, with the FM filling in the blank: “Jurassic Park”
- (c) Restrictive questions, which restrict the model output to particular tokens, e.g., “John invited Mark to come watch Jurassic Park. Output True or False?” Here the model is restricted to True or False outputs.
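The three formats can be made concrete on the running example as plain prompt strings (the variable names below are ours, for illustration only):

```python
context = "John invited Mark to come watch Jurassic Park."

# (a) traditional free-form question
freeform = f"{context}\nWhat movie did John invite Mark to watch?"

# (b) cloze question: the FM fills in the blank after "watch"
cloze = "John invited Mark to come watch _"

# (c) restrictive question: the output space is limited to two tokens
restrictive = f"{context}\nOutput True or False?"

for p in (freeform, cloze, restrictive):
    print(p, end="\n\n")
```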
We measure performance gains across model sizes and prompt formats as reported in Brown et al., shown in Figure 2. We find that open-ended question-answer (QA) formats (a and b above) outperform the more restrictive format, especially for smaller models. Intuitively, answering open-ended questions is better aligned with the next-token-prediction language modeling objective.
Motivated by this observation, we develop a two-step prompting pipeline in AMA that casts a task into a question-answer prompt chain. The first step takes the task input and generates a question (either free-form, type (a), or cloze-style, type (b)). The second step answers the generated question given the input. In AMA, we use the FM itself both to generate the question from the original input and to answer the generated question, avoiding the need to hand-craft input-specific prompt questions (see Figure 1).
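One such chain can be sketched as follows. Here `fm` stands in for any text-completion call (hosted or local), and the prompt templates and demonstrations are illustrative, not the exact ones used in AMA:

```python
def question_prompt(task_input):
    # Step 1: ask the FM to rewrite the task input as an open-ended question,
    # guided by a fixed demonstration.
    return (
        "Rewrite the statement as a question.\n"
        "Statement: Jack camped with Mark.\n"
        "Question: Did Jack camp with Mark?\n"
        f"Statement: {task_input}\n"
        "Question:"
    )

def answer_prompt(context, question):
    # Step 2: answer the generated question, given the original input as context.
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

def ama_chain(fm, task_input):
    """Run one question-answer prompt chain, using the same FM for both steps."""
    question = fm(question_prompt(task_input)).strip()
    return fm(answer_prompt(task_input, question)).strip()
```

Because both steps are ordinary completions, swapping in a different FM only requires changing the `fm` callable.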
Relying on a single FM prompt can be brittle. To alleviate this brittleness, AMA aggregates the outputs of multiple question-answer prompt chains. Specifically, we vary a chain through two key levers: the in-context demonstrations and the style of the prompt question. Once we have outputs from the multiple prompt chains, we aggregate their responses. Because prompt chains have varying accuracies and dependencies, AMA draws on recent work in weak supervision [Ratner et al., 2017] [Varma et al., 2019] [Ratner et al., 2018] to learn the dependency structure among chains and aggregate their results. Importantly, our aggregation method requires no labeled data, making it applicable at scale.
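As a simplified stand-in for that aggregation step, an unweighted majority vote over chain outputs looks like this (AMA's actual weak-supervision aggregator learns per-chain accuracies and dependencies, which plain voting ignores):

```python
from collections import Counter

def majority_vote(chain_outputs):
    """Aggregate the outputs of multiple prompt chains by unweighted majority vote."""
    return Counter(chain_outputs).most_common(1)[0][0]

# Three hypothetical chains vote on the same input.
print(majority_vote(["yes", "yes", "no"]))  # prints "yes"
```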
We validate AMA on the 20 standard benchmark tasks used in Brown et al. 2020 and Sanh et al. 2021. We report benchmark results in Table 1 comparing the open-source GPT-Neo-6B and the few-shot GPT-3 175B. We find that the open-source 6B parameter model prompted with AMA exceeds the average few-shot performance of the GPT-3 175B model on 15 of the 20 benchmarks. We further evaluate the lift from AMA over out-of-the-box few-shot performance across different sizes of four open-source FM families (EleutherAI, OPT, BLOOM, and T0). Excitingly, the AMA prompt chains apply quite generally and provide lift across models and tasks (see Figure 3).
AMA is one of the first prompting methods that uses off-the-shelf, open-source small FMs and outperforms 30x larger models across a range of natural language benchmark tasks. We hope the strategies in AMA and subsequent work help enable FM applications, especially those previously blocked by model scale or by the privacy concerns of sending sensitive data to external APIs, such as data wrangling and QA over private data. At Numbers Station, we are using AMA to build enterprise-ready FMs that achieve state-of-the-art quality on data tasks and help subject matter experts prototype data automation pipelines. Numbers Station further taps into AMA methods to help users deploy smaller, cost-effective FMs. If you are interested in learning more, please contact us or join the Numbers Station waitlist.