Maureen Daum

How Numbers Station uses SQLGlot

At Numbers Station, we leverage SQLGlot to streamline our SQL analysis, using it to parse, transform, and normalize SQL across diverse data warehouses. In this post, we’ll share practical insights into how SQLGlot enables our team to write dialect-agnostic SQL transformation logic, making complex query handling more efficient.

At Numbers Station, we use Generative AI to transform natural language questions into SQL queries. These queries can be quite complex, often spanning many tables and views. Beyond generating SQL, we also analyze sample queries to initialize our Knowledge Layer. These tasks involve many steps of SQL analysis and transformation.

One integral tool that helps us with our SQL analysis is SQLGlot. SQLGlot is a library for parsing, transforming, and transpiling SQL. Numbers Station works with many different data warehouses, all of which use slightly different syntax. SQLGlot allows us to write common parsing and transformation logic over dialect-agnostic expressions.

While SQLGlot’s documentation is extremely thorough, we want to share a few practical examples of how we use SQLGlot in our codebase.

Dialect-independent query transformation

We occasionally want to run a simplified query to check for runtime errors or data types. To do this we start with a target query, remove expensive operators (such as ORDER BY), and limit the output (using LIMIT 1). Doing this robustly via string editing is difficult: queries may or may not already contain a limit clause, and with string matching it is hard to tell whether these clauses appear in a common table expression (CTE), a subquery, or the parent SELECT. Further adding to the challenge, different dialects specify LIMIT differently. Snowflake, BigQuery, and Databricks use the following syntax:

SELECT ...
FROM ...
WHERE ...
LIMIT n

while Teradata uses a TOP clause:

SELECT TOP n ...
FROM ...
WHERE ...

SQLGlot allows us to apply the same transformations regardless of the target warehouse.

To add a LIMIT 1 (or the dialect equivalent) to a SQL expression, or to replace an existing limit with LIMIT 1, we use the following transformation. The dialect can be any of the dialects that SQLGlot supports.

from typing import cast

import sqlglot

def add_or_replace_limit(sql: str, dialect: str) -> str:
"""Add a LIMIT 1 to a sql statement, or replace the current LIMIT clause with LIMIT 1

Example Input:
    WITH foo AS (SELECT foo FROM table LIMIT 10) SELECT * FROM FOO
Example Output:
    WITH foo AS (SELECT foo FROM table LIMIT 10) SELECT * FROM FOO LIMIT 1

Example Input:
    WITH foo AS (SELECT foo FROM table LIMIT 10) SELECT * FROM FOO LIMIT 10
Example Output:
    WITH foo AS (SELECT foo FROM table LIMIT 10) SELECT * FROM FOO LIMIT 1
"""

    limit1_sql = (
        cast(
            sqlglot.exp.Select,
            sqlglot.parse_one(sql, dialect=dialect),
        )
        .limit(1)
        .sql(dialect=dialect)
    )
    return limit1_sql

To remove ORDER BY clauses from a SQL fragment, we use one of the following transformations, depending on whether we want to remove all ORDER BY clauses or just the one in the parent SELECT clause.

def remove_order_by_from_parent_select(sql: str, dialect: str) -> str:
    """Remove the ORDER BY clause from the parent SELECT clause if it exists

    Example Input:
        SELECT foo FROM (SELECT foo FROM table WHERE bar=2 ORDER BY bar) AS f2 ORDER BY foo
    Example Output:
        SELECT foo FROM (SELECT foo FROM table WHERE bar=2 ORDER BY bar) AS f2
    """
    parsed_sql = sqlglot.parse_one(sql, dialect=dialect)
    if "order" in parsed_sql.args:
        del parsed_sql.args["order"]
    return parsed_sql.sql(dialect=dialect)


def remove_order_by_from_all_select(sql: str, dialect: str) -> str:
    """Remove the ORDER BY clause from all SELECT clauses if it exists

    Example Input:
        SELECT foo FROM (SELECT foo FROM table WHERE bar=2 ORDER BY bar) AS f2 ORDER BY foo
    Example Output:
        SELECT foo FROM (SELECT foo FROM table WHERE bar=2) AS f2
    """
    def _drop_order_by(node: sqlglot.exp.Expression) -> sqlglot.exp.Expression | None:
        """Drop ORDER BY nodes; returning None removes the node from the tree"""
        if isinstance(node, sqlglot.exp.Order):
            return None
        return node

    parsed_sql = sqlglot.parse_one(sql, dialect=dialect)
    return parsed_sql.transform(_drop_order_by).sql(dialect=dialect)

Adding quotes around identifiers

Some warehouses like Snowflake are case-sensitive and require that identifiers with special characters be surrounded by double quotes. However, SQL generated by a model may not be properly quoted and may lack user-friendly aliases. We handle this by deterministically adding quotes when needed.

The following code snippet shows how we quote all column, table, and alias references (represented as Identifiers). Again, this logic can be shared across warehouses that use different quote characters (e.g., Snowflake uses double quotes while Databricks uses backticks).

def quote_identifiers(sql: str, dialect: str) -> str:
    """Quote all identifiers in the SQL"""
    def _quote_identifiers(node: sqlglot.exp.Expression) -> sqlglot.exp.Expression:
        """Quote all identifiers"""
        if isinstance(node, sqlglot.exp.Identifier):
            node.args["quoted"] = True
        return node
    
    return (
        sqlglot.parse_one(sql, dialect=dialect)
        .transform(_quote_identifiers)
        .sql(dialect=dialect)
    )

To create an identifier from a model-generated name, we use the following snippet. This identifier can be used to build any of sqlglot.exp.TableAlias, sqlglot.exp.Column, or sqlglot.exp.Table.

identifier_from_string: sqlglot.exp.Identifier = sqlglot.exp.to_identifier(model_generated_name, quoted=True)

Qualifying SQL to analyze tables and columns

Numbers Station ingests sample SQL queries to initialize our Knowledge Layer. To do so effectively, we need to analyze which tables and columns are used in each expression. The input SQL isn’t always clean, so we use SQLGlot to standardize it and qualify all tables and columns used. As shown below, SQLGlot’s qualify makes it simple for us to link expressions to the base tables and views.

import sqlglot.optimizer.qualify

def qualify_sql(sql: str, dialect: str, database_schema: dict[str, dict[str, str]]) -> str:
    """Qualify all identifiers.

    Example input:
        WITH table1_cte AS (
        SELECT
            a AS a1,
            b AS b1
        FROM table1
        WHERE
            c > 5
        )
        SELECT
        a1,
        b1,
        c2 AS c2_alias,
        c2_alias * 2 AS c2_times2
        FROM table1_cte, table2

    Example database_schema:
        {
            "table1": {
                "a": "INT",
                "b": "INT",
                "c": "INT",
            },
            "table2": {
                "c2": "INT",
            }
        }

    Example output (for `dialect="databricks"`):
        WITH `table1_cte` AS (
        SELECT
            `table1`.`a` AS `a1`,
            `table1`.`b` AS `b1`
        FROM `table1` AS `table1`
        WHERE
            `table1`.`c` > 5
        )
        SELECT
        `table1_cte`.`a1` AS `a1`,
        `table1_cte`.`b1` AS `b1`,
        `table2`.`c2` AS `c2_alias`,
        `table2`.`c2` * 2 AS `c2_times2`
        FROM `table1_cte` AS `table1_cte`, `table2` AS `table2`
    """
    parsed = sqlglot.parse_one(sql, dialect=dialect)
    qualified = sqlglot.optimizer.qualify.qualify(
        parsed,
        dialect=dialect,
        schema=database_schema,
        infer_schema=True,
        quote_identifiers=True,
    )
    return qualified.sql(dialect=dialect, pretty=True)

Conclusion

We hope that these examples inspire you to check out SQLGlot and use it in your own work if you interact with SQL queries or multiple data warehouses. Its utilities for parsing, modifying, and transpiling SQL simplify the logic needed to work with even highly complex queries.