Chris Lin

Navigating the Evaluation of Multi-Agent Analytics

A multi-agent design requires thoughtful evaluation of all participating agents. Numbers Station has built a proprietary framework to dynamically evaluate the various outputs of each highly specialized agent within our system.

At Numbers Station, we understand that data is difficult. And we don’t just mean writing SQL. We mean everything about your data is hard – from understanding what your data contains to deducing the correct action to take. At Numbers Station, we take a multi-agent approach to make data analysis easier. 

Our platform goes far beyond Text-to-SQL and empowers users to understand their data, ask the right questions, and act accordingly. As we build out an open-ended platform, one key challenge we’ve faced is accurate evaluation – how can we ensure that we consistently output the best possible response in order to efficiently derive powerful business insights for our users?

Let’s take a look at some examples.

Suppose the CEO of a large retailer, Colin, is using Numbers Station to drive profit. Below are some of the questions he might ask:

  1. “What is in my data?”
  2. “How can I increase my profit?”
  3. “Show me products where our price is higher than the competitor’s price. Sort by the difference.”
  4. “Show me the differences for products that are marketed towards well-off customers.”

Each of these questions requires a very different response, and each could have a variety of good responses – which makes evaluation tricky. Let’s take a deeper look at each question, and how we would typically evaluate each one.

  1. “What is in my data?”

Vague in nature, this question signals that Colin lacks a strong idea of what exactly his data warehouse contains and what Numbers Station is capable of answering. A good response might provide descriptions of tables, columns, dimensions, and/or metrics available within his data warehouse along with suggestions of questions that can be answered from his data. A poor response would be to answer with something entirely generic like “Your data contains information about sales.” 

  2. “How can I increase my profit?”

While still high-level like the previous question, this question is less exploratory and has a slightly more specific goal in mind. However, to reach an insight, there are still multiple ways of attacking the question that may require different cuts of data. A good response might propose several ways Colin can approach the question based on what data is available, such as “Which products do I run out of frequently?” or “Which products return a negative profit?”. A poor response would propose questions that aren’t answerable with the available data, frustrating Colin in the process. 
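One way to operationalize this kind of evaluation is to check whether the questions a response proposes are actually grounded in the warehouse. Below is a minimal sketch of such a check; the schema, the proposal format, and the scoring function are all illustrative assumptions, not Numbers Station's actual framework:

```python
# Minimal sketch (assumed schema and proposal format, not Numbers Station's
# actual framework): a proposed follow-up question only counts as "good"
# if every table and column it needs exists in the warehouse.
SCHEMA = {
    "sales": {"product_id", "quantity", "revenue", "cost"},
    "inventory": {"product_id", "stock_level", "restock_date"},
}

def grounded(tables, columns):
    """True if all referenced tables exist and cover the needed columns."""
    if not tables <= SCHEMA.keys():
        return False
    available = set().union(*(SCHEMA[t] for t in tables))
    return columns <= available

def proposal_score(proposals):
    """Fraction of proposed questions answerable with the available data.

    Each proposal is (question_text, tables_needed, columns_needed).
    """
    ok = sum(grounded(t, c) for _, t, c in proposals)
    return ok / len(proposals)

proposals = [
    ("Which products do I run out of frequently?",
     {"inventory"}, {"stock_level"}),
    ("Which products return a negative profit?",
     {"sales"}, {"revenue", "cost"}),
    ("Which ad campaigns drove the most clicks?",  # no ads data available
     {"ad_campaigns"}, {"clicks"}),
]
print(proposal_score(proposals))  # 2 of 3 proposals are answerable
```

A response that proposes the third question would be penalized, since answering it would frustrate the user with data that does not exist.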

  3. “Show me products where our price is higher than the competitor’s price. Sort by the difference.”

This question can be answered directly with a SQL query against the data warehouse. However, there is still some ambiguity in the question. Does Colin only want to see the differences or does he want to see the prices as well? In what order would he like to sort? A good response would display one of these options. Standard Text-to-SQL evaluation methods like exact match or execution accuracy aren’t capable of handling this kind of ambiguity. 
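One common workaround is to extend execution accuracy so a candidate passes if its result set matches any of several acceptable gold queries, one per reasonable reading of the question. Here is a minimal sketch using an in-memory SQLite database; the schema, data, and queries are illustrative assumptions, not Numbers Station's evaluation framework:

```python
import sqlite3

# Minimal sketch (assumed schema and queries, not Numbers Station's actual
# framework): execution accuracy compares result sets, so we accept a
# candidate if it matches ANY acceptable gold query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (name TEXT, our_price REAL, competitor_price REAL);
INSERT INTO products VALUES
  ('widget', 12.0, 10.0),
  ('gadget', 8.0, 9.0),
  ('gizmo', 15.0, 11.0);
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Two equally reasonable readings of the ambiguous question:
gold_queries = [
    # Reading 1: just the differences
    """SELECT name, our_price - competitor_price AS diff
       FROM products WHERE our_price > competitor_price
       ORDER BY diff DESC""",
    # Reading 2: include both prices as well
    """SELECT name, our_price, competitor_price,
              our_price - competitor_price AS diff
       FROM products WHERE our_price > competitor_price
       ORDER BY diff DESC""",
]

def execution_accuracy(candidate_sql, golds):
    """Pass if the candidate's result set matches any acceptable gold."""
    result = run(candidate_sql)
    return any(result == run(g) for g in golds)

candidate = """SELECT name, our_price - competitor_price AS diff
               FROM products WHERE our_price > competitor_price
               ORDER BY diff DESC"""
print(execution_accuracy(candidate, gold_queries))  # True
```

With a single gold query, this same candidate could be marked wrong simply for picking the other valid reading, which is exactly why one-gold exact-match or execution-accuracy metrics fall short here.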

  4. “Show me the differences for products that are marketed towards well-off customers.”

This question requires understanding that Colin wants to ask the same question as before (“Show me products where our price is higher than the competitor’s price. Sort by the difference.”) but with a slight modification to only show data for products that are marketed towards “well-off customers”. This question can be directly answered with a SQL query if the data warehouse contains data on which customers are well-off. If it doesn’t, a good response would pull the human into the loop and request the missing information from Colin. In our evaluation, we would score highly on responses that ask for clarification or state that information is missing, and penalize responses that generate a random filter.
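A scorer along these lines can be sketched as follows. The marker phrases, the available-columns set, and the scoring rule are all simplified assumptions for illustration, not Numbers Station's proprietary scorer:

```python
# Minimal sketch (assumed heuristics, not Numbers Station's actual scorer):
# when the warehouse lacks the data a question needs, reward responses that
# ask for clarification and penalize ones that invent a filter.
AVAILABLE_COLUMNS = {"name", "our_price", "competitor_price"}  # no income data

CLARIFICATION_MARKERS = (
    "could you clarify", "which column", "don't have data", "missing",
)

def score_response(response_text, sql, required_columns):
    has_required = required_columns <= AVAILABLE_COLUMNS
    asked = any(m in response_text.lower() for m in CLARIFICATION_MARKERS)
    if has_required:
        # Data exists: a real framework would also check the SQL itself.
        return 1.0 if sql else 0.0
    # Data is missing: clarification is correct, a guessed filter is not.
    return 1.0 if asked and sql is None else 0.0

# "Well-off customers" needs an income/segment column we don't have:
print(score_response(
    "I don't have data identifying well-off customers. "
    "Could you clarify which customers should count as well-off?",
    sql=None,
    required_columns={"customer_income"},
))  # 1.0
```

A response that instead silently filtered on, say, products priced above some arbitrary threshold would pass `sql` and score 0.0 under this rule.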

Enabling a multi-agent architecture involves thoughtful evaluation of all participating agents. At Numbers Station, we have built a proprietary framework that allows us to constantly evaluate the various aspects of our system – so our users can unlock new and accurate business insights in minutes instead of months.

Truly integrating AI into your data workflows is far more complicated than Text-to-SQL alone, and Numbers Station’s full-stack solution takes you well beyond a single LLM.

Uncover the method behind our framework by contacting us.