Migrating SAS to BigQuery: DATA Steps and PROC SQL to BigQuery SQL and Dataform

April 8, 2026 · 18 min read · MigryX Team

SAS has been the dominant analytics platform in regulated industries for decades, powering everything from clinical trial analysis in pharma to fraud detection in banking. Yet organizations are increasingly finding that SAS's per-seat licensing, proprietary runtime, and single-server architecture create bottlenecks as data volumes grow into terabytes and petabytes. Google BigQuery offers a serverless, petabyte-scale SQL analytics engine with consumption-based pricing that eliminates infrastructure management entirely.

This guide provides a detailed technical mapping of every major SAS concept — DATA steps, PROC SQL, macros, scheduling, statistical procedures, and reporting — to their BigQuery-native equivalents, including BigQuery SQL, Dataform (SQL workflows), Scheduled Queries, Cloud Composer, and BigQuery ML. Whether you are migrating a handful of SAS programs or thousands, understanding these mappings is the foundation for a successful migration.

SAS Architecture vs. BigQuery Architecture

SAS programs execute on a single SAS server (or SAS Grid for distributed workloads). The SAS runtime reads data from SAS datasets (sas7bdat files), RDBMS connections via libname engines, or flat files. All processing — DATA step row-by-row logic, PROC SQL queries, statistical procedures — runs on the SAS server's CPU and memory. Results are written back to SAS datasets or external targets. SAS scheduling is handled by Platform LSF, Control-M, or cron jobs invoking SAS batch programs.

BigQuery is a fully serverless, columnar analytics engine. There is no server to provision, patch, or scale. Storage and compute are fully separated — data lives in BigQuery's managed columnar storage (Capacitor format), and compute is allocated dynamically per query from a shared pool (on-demand) or reserved slots. BigQuery SQL is ANSI-compliant with extensions for nested/repeated data (STRUCT, ARRAY), geographic functions, ML model training, and more. Dataform provides SQL-based workflow orchestration with dependency management, testing, and documentation built in.

| SAS Concept | BigQuery Equivalent | Notes |
| --- | --- | --- |
| SAS DATA step | BigQuery SQL with window functions / scripting | Row-by-row logic replaced by set-based SQL with LAG, LEAD, running totals |
| PROC SQL | BigQuery SQL | Near-direct translation; BigQuery supports CTEs, subqueries, JOINs natively |
| SAS macros (%macro/%mend) | Dataform SQLX macros / JavaScript | Dataform uses JavaScript for dynamic SQL generation, SQLX for templating |
| SAS datasets (sas7bdat) | BigQuery tables (columnar) | Automatic compression, partitioning, clustering; no file management |
| SAS formats (PUT/INPUT) | CASE expressions / FORMAT_* functions | FORMAT_DATE, FORMAT_TIMESTAMP; CASE for custom value mappings |
| PROC MEANS / PROC FREQ | BigQuery aggregate functions | AVG, STDDEV, COUNTIF, APPROX_COUNT_DISTINCT; window aggregates |
| PROC REPORT / PROC TABULATE | BigQuery Connected Sheets / Looker | Connected Sheets for spreadsheet-style reporting; Looker for dashboards |
| SAS libname | BigQuery dataset | Logical grouping of tables with access controls at the dataset level |
| SAS scheduling (cron/LSF) | Scheduled Queries / Cloud Composer | Scheduled Queries for simple jobs; Cloud Composer (Airflow) for DAGs |
| SAS/STAT procedures | BigQuery ML | CREATE MODEL for regression, classification, clustering, time series |
SAS to BigQuery migration — automated end-to-end by MigryX

DATA Step to BigQuery SQL: The Fundamental Shift

The SAS DATA step is the most distinctive feature of the SAS language. It processes data row by row through an implicit loop, with a Program Data Vector (PDV) that maintains state across rows. Features like RETAIN, arrays, FIRST./LAST. processing, and OUTPUT statements have no direct equivalent in standard SQL — but BigQuery's window functions, scripting blocks, and ARRAY/STRUCT types provide equivalent functionality in a set-based paradigm.

RETAIN and Running Totals

In SAS, RETAIN preserves a variable's value across iterations of the DATA step loop, enabling running totals, counters, and state tracking. In BigQuery, window functions with SUM() OVER (ORDER BY ...) replace RETAIN for running calculations.

/* SAS: Running total with RETAIN and array processing */
data work.running_totals;
    set saslib.daily_sales;
    by region;
    retain cumulative_revenue 0;
    array quarter_sales{4} q1_sales q2_sales q3_sales q4_sales;
    array quarter_targets{4} q1_target q2_target q3_target q4_target;

    if first.region then cumulative_revenue = 0;
    cumulative_revenue + daily_revenue;

    total_annual_sales = sum(of quarter_sales{*});
    pct_of_target = total_annual_sales /
                    sum(of quarter_targets{*}) * 100;

    if last.region then do;
        region_complete = 1;
        output;
    end;
run;
-- BigQuery SQL equivalent: window functions replace RETAIN
-- ARRAY and STRUCT handle multi-column operations
WITH daily_with_running AS (
    SELECT
        *,
        SUM(daily_revenue) OVER (
            PARTITION BY region
            ORDER BY sale_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumulative_revenue,
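        -- Note: unlike SAS SUM(of ...), which ignores missing values,
        -- the + operator below propagates NULLs; wrap each column in
        -- IFNULL(..., 0) if missing values are possible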
        (q1_sales + q2_sales + q3_sales + q4_sales) AS total_annual_sales,
        SAFE_DIVIDE(
            (q1_sales + q2_sales + q3_sales + q4_sales),
            (q1_target + q2_target + q3_target + q4_target)
        ) * 100 AS pct_of_target,
        ROW_NUMBER() OVER (
            PARTITION BY region ORDER BY sale_date DESC
        ) AS rn
    FROM `project.dataset.daily_sales`
)
SELECT
    * EXCEPT(rn),
    1 AS region_complete
FROM daily_with_running
WHERE rn = 1;  -- last row per region

FIRST. and LAST. Processing

SAS's FIRST.variable and LAST.variable automatic variables identify the first and last observations within a BY group. BigQuery achieves identical functionality with ROW_NUMBER(), LAG(), and LEAD() window functions.

/* SAS: Deduplication with FIRST./LAST. */
proc sort data=saslib.transactions;
    by customer_id transaction_date;
run;

data work.first_last_purchase;
    set saslib.transactions;
    by customer_id;
    if first.customer_id then first_purchase_date = transaction_date;
    retain first_purchase_date;
    if last.customer_id then do;
        last_purchase_date = transaction_date;
        output;
    end;
run;
-- BigQuery SQL equivalent
SELECT
    customer_id,
    MIN(transaction_date) OVER (PARTITION BY customer_id) AS first_purchase_date,
    MAX(transaction_date) OVER (PARTITION BY customer_id) AS last_purchase_date
FROM `project.dataset.transactions`
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id ORDER BY transaction_date DESC
) = 1;
BigQuery's QUALIFY clause is a powerful shorthand that filters the result of window functions directly, eliminating the need for a subquery wrapper. It replaces the common SAS pattern of DATA step deduplication with FIRST./LAST. variables in a single, readable SQL statement.

MigryX: Idiomatic Code, Not Line-by-Line Translation

The difference between MigryX and manual migration is not just speed — it is code quality. MigryX generates idiomatic, platform-optimized code that leverages the native features of your target platform. A SAS DATA step does not become a clunky row-by-row loop — it becomes clean, set-based SQL built on window functions and QUALIFY. A PROC SQL query does not become a literal translation — it becomes an optimized query that takes advantage of BigQuery's partition pruning and clustering.

PROC SQL to BigQuery SQL

PROC SQL is largely ANSI SQL, making it the most straightforward SAS component to migrate. However, SAS PROC SQL has several extensions and quirks that require attention during conversion: automatic remerging (using aggregates alongside detail columns without GROUP BY), the CALCULATED keyword, macro variable creation (INTO :), and SAS-specific functions.
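
For example, PROC SQL's INTO : clause, which captures a query result into a macro variable, maps to BigQuery scripting variables. A minimal sketch (table and column names are illustrative):

/* SAS: capture a value into a macro variable */
proc sql noprint;
    select max(order_date) into :max_dt
    from saslib.orders;
quit;
-- BigQuery scripting equivalent: DECLARE and SET replace INTO :
DECLARE max_dt DATE;
SET max_dt = (SELECT MAX(order_date) FROM `project.dataset.orders`);
SELECT * FROM `project.dataset.orders` WHERE order_date = max_dt;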

Complex PROC SQL Query Migration

/* SAS PROC SQL: Complex analytical query */
proc sql;
    create table work.customer_analytics as
    select
        a.customer_id,
        a.customer_name,
        a.segment,
        b.total_orders,
        b.total_revenue,
        b.avg_order_value,
        b.total_revenue / c.segment_revenue * 100
            as pct_of_segment format=8.2,
        case
            when b.total_revenue > 50000 then 'Platinum'
            when b.total_revenue > 20000 then 'Gold'
            when b.total_revenue > 5000  then 'Silver'
            else 'Bronze'
        end as loyalty_tier,
        d.last_order_date,
        intck('day', d.last_order_date, today()) as days_since_last_order
    from saslib.customers a
    left join (
        select customer_id,
               count(*) as total_orders,
               sum(order_amount) as total_revenue,
               mean(order_amount) as avg_order_value
        from saslib.orders
        where order_date >= intnx('year', today(), -2, 'B')
        group by customer_id
    ) b on a.customer_id = b.customer_id
    left join (
        select segment, sum(order_amount) as segment_revenue
        from saslib.customers x
        inner join saslib.orders y on x.customer_id = y.customer_id
        group by segment
    ) c on a.segment = c.segment
    left join (
        select customer_id, max(order_date) as last_order_date
        from saslib.orders
        group by customer_id
    ) d on a.customer_id = d.customer_id
    order by b.total_revenue desc;
quit;
-- BigQuery SQL equivalent
CREATE OR REPLACE TABLE `project.dataset.customer_analytics` AS
WITH order_summary AS (
    SELECT
        customer_id,
        COUNT(*) AS total_orders,
        SUM(order_amount) AS total_revenue,
        AVG(order_amount) AS avg_order_value,
        MAX(order_date) AS last_order_date
    FROM `project.dataset.orders`
    WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR)
    GROUP BY customer_id
),
segment_totals AS (
    SELECT
        c.segment,
        SUM(o.order_amount) AS segment_revenue
    FROM `project.dataset.customers` c
    INNER JOIN `project.dataset.orders` o
        ON c.customer_id = o.customer_id
    GROUP BY c.segment
)
SELECT
    a.customer_id,
    a.customer_name,
    a.segment,
    COALESCE(b.total_orders, 0) AS total_orders,
    COALESCE(b.total_revenue, 0) AS total_revenue,
    b.avg_order_value,
    ROUND(SAFE_DIVIDE(b.total_revenue, c.segment_revenue) * 100, 2)
        AS pct_of_segment,
    CASE
        WHEN b.total_revenue > 50000 THEN 'Platinum'
        WHEN b.total_revenue > 20000 THEN 'Gold'
        WHEN b.total_revenue > 5000  THEN 'Silver'
        ELSE 'Bronze'
    END AS loyalty_tier,
    b.last_order_date,
    DATE_DIFF(CURRENT_DATE(), b.last_order_date, DAY) AS days_since_last_order
FROM `project.dataset.customers` a
LEFT JOIN order_summary b ON a.customer_id = b.customer_id
LEFT JOIN segment_totals c ON a.segment = c.segment
ORDER BY b.total_revenue DESC;

Key Translation Rules: PROC SQL to BigQuery

| SAS PROC SQL Feature | BigQuery Equivalent | Example |
| --- | --- | --- |
| CALCULATED keyword | CTE or repeated expression | Use a WITH clause to define computed columns |
| INTO :macro_var | DECLARE variable; SET variable = (SELECT ...) | BigQuery scripting variables |
| INTCK('day', a, b) | DATE_DIFF(b, a, DAY) | Note: argument order is reversed |
| INTNX('month', dt, 1) | DATE_ADD(dt, INTERVAL 1 MONTH) | DATE_SUB for negative intervals |
| PUT(var, format.) | FORMAT_DATE / CAST / CASE | Custom formats become CASE expressions |
| INPUT(char, numfmt.) | CAST(char AS NUMERIC) | SAFE_CAST for error-tolerant conversion |
| CATX('-', a, b, c) | ARRAY_TO_STRING([a, b, c], '-') | Skips NULL elements, much as CATX skips blanks |
| COMPRESS(str) | REGEXP_REPLACE(str, r'\s', '') | REPLACE for specific character removal |
| Automatic remerging | Window function / self-join | SUM(x) OVER () for total alongside detail |
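
Automatic remerging deserves particular care: SAS silently joins an aggregate back onto detail rows, which BigQuery rejects with a GROUP BY error. A sketch of the idiomatic translation (table names are illustrative):

/* SAS PROC SQL: automatic remerge of an aggregate onto detail rows */
proc sql;
    select region, revenue,
           revenue / sum(revenue) * 100 as pct_of_total
    from saslib.sales;
quit;
-- BigQuery: the implicit remerge becomes an explicit window aggregate
SELECT
    region,
    revenue,
    SAFE_DIVIDE(revenue, SUM(revenue) OVER ()) * 100 AS pct_of_total
FROM `project.dataset.sales`;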

SAS Macros to Dataform SQLX and JavaScript

SAS macros generate dynamic SAS code at compile time through macro variable substitution and macro program execution. They are used for parameterization, looping over datasets, conditional code generation, and building reusable utility libraries. Dataform, Google's SQL workflow tool for BigQuery, provides an equivalent system through SQLX (SQL with JavaScript templating) and JavaScript definitions.

Parameterized Macro Migration

/* SAS Macro: Parameterized aggregation */
%macro aggregate_by_period(input_ds, output_ds, date_col, period);
    proc sql;
        create table &output_ds as
        select
            %if &period = month %then %do;
                intnx('month', &date_col, 0, 'B') as period_start format=date9.,
            %end;
            %else %if &period = quarter %then %do;
                intnx('quarter', &date_col, 0, 'B') as period_start format=date9.,
            %end;
            count(*) as record_count,
            sum(amount) as total_amount,
            mean(amount) as avg_amount
        from &input_ds
        group by calculated period_start
        order by period_start;
    quit;
%mend;

%aggregate_by_period(saslib.transactions, work.monthly_agg, trans_date, month);
%aggregate_by_period(saslib.transactions, work.quarterly_agg, trans_date, quarter);
-- Dataform equivalent: the SAS macro becomes a reusable JavaScript helper
-- includes/aggregations.js (functions in includes/ are shared project-wide)

// DATE_TRUNC replaces INTNX(..., 0, 'B') for snapping dates to the
// start of a period; macro parameters become function arguments
function aggregateByPeriod(inputTable, dateCol, period) {
    const granularity = period === 'quarter' ? 'QUARTER' : 'MONTH';
    return `
        SELECT
            DATE_TRUNC(${dateCol}, ${granularity}) AS period_start,
            COUNT(*) AS record_count,
            SUM(amount) AS total_amount,
            AVG(amount) AS avg_amount
        FROM ${inputTable}
        GROUP BY period_start
        ORDER BY period_start`;
}
module.exports = { aggregateByPeriod };

-- definitions/monthly_agg.sqlx
config { type: "table" }
${aggregations.aggregateByPeriod(ref("transactions"), "trans_date", "month")}

-- definitions/quarterly_agg.sqlx
config { type: "table" }
${aggregations.aggregateByPeriod(ref("transactions"), "trans_date", "quarter")}

SAS Formats to BigQuery Expressions

SAS formats (applied with FORMAT and PUT statements) control how values are displayed without changing the underlying data. SAS also uses informats (INPUT) for reading data. BigQuery has no direct format concept — value mappings become CASE expressions, and display formatting is handled by FORMAT_DATE, FORMAT_TIMESTAMP, and CAST functions.

/* SAS: Custom format for status codes */
proc format;
    value $status_fmt
        'A' = 'Active'
        'I' = 'Inactive'
        'P' = 'Pending'
        'S' = 'Suspended'
        other = 'Unknown';
    value revenue_tier
        low -< 10000 = 'Small'
        10000 -< 100000 = 'Medium'
        100000 - high = 'Enterprise';
run;

data work.formatted;
    set saslib.accounts;
    status_label = put(status_code, $status_fmt.);
    tier = put(annual_revenue, revenue_tier.);
run;
-- BigQuery SQL equivalent: CASE expressions replace SAS formats
-- Option 1: Inline CASE
SELECT
    *,
    CASE status_code
        WHEN 'A' THEN 'Active'
        WHEN 'I' THEN 'Inactive'
        WHEN 'P' THEN 'Pending'
        WHEN 'S' THEN 'Suspended'
        ELSE 'Unknown'
    END AS status_label,
    CASE
        WHEN annual_revenue < 10000 THEN 'Small'
        WHEN annual_revenue < 100000 THEN 'Medium'
        ELSE 'Enterprise'
    END AS tier
FROM `project.dataset.accounts`;

-- Option 2: Reusable UDF for frequently-used formats
CREATE OR REPLACE FUNCTION `project.dataset.status_format`(code STRING)
RETURNS STRING
AS (
    CASE code
        WHEN 'A' THEN 'Active'
        WHEN 'I' THEN 'Inactive'
        WHEN 'P' THEN 'Pending'
        WHEN 'S' THEN 'Suspended'
        ELSE 'Unknown'
    END
);
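
Date display formats translate just as mechanically. A short sketch, assuming an events table with a DATE column (names are illustrative); FORMAT_DATE uses strftime-style elements, so reproducing SAS's uppercase date9. output requires UPPER():

/* SAS: date display formats */
data work.dates;
    set saslib.events;
    event_dt9 = put(event_date, date9.);   /* 08APR2026 */
    event_ym  = put(event_date, yymmn6.);  /* 202604   */
run;
-- BigQuery: FORMAT_DATE with format elements
SELECT
    UPPER(FORMAT_DATE('%d%b%Y', event_date)) AS event_dt9,  -- 08APR2026
    FORMAT_DATE('%Y%m', event_date) AS event_ym             -- 202604
FROM `project.dataset.events`;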

SAS Datasets to BigQuery Tables: Storage Architecture

SAS datasets are proprietary files (sas7bdat) stored on the SAS server's filesystem or in a shared directory. BigQuery tables are stored in BigQuery's managed columnar format with automatic compression, encryption, and replication. The migration involves both data transfer and schema optimization.

Partitioning and Clustering

BigQuery's partitioned and clustered tables replace SAS's index and sort-based optimization strategies. Partitioning physically divides a table by a date/timestamp column (or integer range), reducing the amount of data scanned per query. Clustering sorts data within partitions by up to four columns, enabling further scan reduction.

-- BigQuery: Optimized table with partitioning and clustering
-- Replaces SAS indexed datasets with automatic query optimization
CREATE OR REPLACE TABLE `project.dataset.transactions`
PARTITION BY DATE(transaction_date)
CLUSTER BY customer_id, product_category
AS
SELECT
    transaction_id,
    customer_id,
    transaction_date,
    product_category,
    amount,
    quantity,
    store_id
FROM `project.staging.raw_transactions`;

-- Queries filtering on transaction_date scan only relevant partitions
-- Queries also filtering on customer_id or product_category benefit
-- from clustering (data is sorted within each partition)

Nested and Repeated Data with STRUCT and ARRAY

BigQuery supports nested and repeated fields through STRUCT and ARRAY types, enabling denormalized schemas that would require multiple SAS datasets and merges. This is a paradigm shift — BigQuery encourages denormalization for query performance.

-- BigQuery: Nested data eliminates multi-dataset JOINs
-- Instead of separate SAS datasets for orders and line items:
CREATE OR REPLACE TABLE `project.dataset.orders_nested` AS
SELECT
    o.order_id,
    o.customer_id,
    o.order_date,
    o.total_amount,
    ARRAY_AGG(STRUCT(
        li.line_item_id,
        li.product_id,
        li.product_name,
        li.quantity,
        li.unit_price
    )) AS line_items
FROM `project.dataset.orders` o
JOIN `project.dataset.line_items` li ON o.order_id = li.order_id
GROUP BY o.order_id, o.customer_id, o.order_date, o.total_amount;
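
Reading the nested structure back out uses UNNEST. A sketch against the orders_nested table created above:

-- Flatten line items when detail-level rows are needed
-- (replaces a SAS merge between the orders and line-item datasets)
SELECT
    o.order_id,
    o.order_date,
    li.product_name,
    li.quantity * li.unit_price AS line_total
FROM `project.dataset.orders_nested` AS o,
     UNNEST(o.line_items) AS li
WHERE o.order_date >= DATE '2026-01-01';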

PROC MEANS and PROC FREQ to BigQuery Aggregates

SAS's statistical procedures produce summary statistics with rich output tables. BigQuery provides the same analytical power through aggregate functions, window functions, and BigQuery ML for advanced statistics.

/* SAS: PROC MEANS with CLASS and OUTPUT */
proc means data=saslib.sales n mean std min max p25 median p75;
    class region product_line;
    var revenue units_sold margin_pct;
    output out=work.sales_stats
        n= mean= std= min= max= / autoname;
run;
-- BigQuery SQL equivalent with APPROX_QUANTILES
SELECT
    region,
    product_line,
    COUNT(*) AS n,
    AVG(revenue) AS revenue_mean,
    STDDEV(revenue) AS revenue_std,
    MIN(revenue) AS revenue_min,
    MAX(revenue) AS revenue_max,
    APPROX_QUANTILES(revenue, 4)[OFFSET(1)] AS revenue_p25,
    APPROX_QUANTILES(revenue, 4)[OFFSET(2)] AS revenue_median,
    APPROX_QUANTILES(revenue, 4)[OFFSET(3)] AS revenue_p75,
    AVG(units_sold) AS units_sold_mean,
    STDDEV(units_sold) AS units_sold_std,
    AVG(margin_pct) AS margin_pct_mean,
    STDDEV(margin_pct) AS margin_pct_std
FROM `project.dataset.sales`
GROUP BY GROUPING SETS (
    (region, product_line),
    (region),
    (product_line),
    ()
)
ORDER BY region, product_line;
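
PROC FREQ translates the same way: frequency tables become GROUP BY counts, with window aggregates supplying the percent columns. A sketch against the same sales table:

/* SAS: two-way frequency table */
proc freq data=saslib.sales;
    tables region * product_line;
run;
-- BigQuery: GROUP BY counts plus a window total for percentages
SELECT
    region,
    product_line,
    COUNT(*) AS frequency,
    ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS percent
FROM `project.dataset.sales`
GROUP BY region, product_line
ORDER BY region, product_line;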

SAS/STAT Procedures to BigQuery ML

BigQuery ML allows creating and executing machine learning models directly in BigQuery using SQL, replacing SAS/STAT procedures like PROC REG, PROC LOGISTIC, PROC CLUSTER, and PROC FORECAST without exporting data or managing separate compute environments.

/* SAS: PROC LOGISTIC for churn prediction */
proc logistic data=saslib.customers descending;
    class segment region / param=ref;
    model churned = tenure monthly_spend support_tickets
                    segment region;
    output out=work.predictions p=churn_probability;
run;
-- BigQuery ML equivalent
-- Step 1: Create the model
CREATE OR REPLACE MODEL `project.dataset.churn_model`
OPTIONS (
    model_type = 'LOGISTIC_REG',
    input_label_cols = ['churned'],
    auto_class_weights = TRUE,
    l2_reg = 0.01
) AS
SELECT
    tenure,
    monthly_spend,
    support_tickets,
    segment,
    region,
    churned
FROM `project.dataset.customers`
WHERE split_col = 'TRAIN';

-- Step 2: Generate predictions (replaces PROC LOGISTIC output)
SELECT
    customer_id,
    predicted_churned AS churn_prediction,
    -- probs holds one STRUCT(label, prob) per class; select the
    -- positive class explicitly (assumes churned is coded 1/0)
    (SELECT p.prob
     FROM UNNEST(predicted_churned_probs) AS p
     WHERE p.label = 1) AS churn_probability
FROM ML.PREDICT(
    MODEL `project.dataset.churn_model`,
    (SELECT * FROM `project.dataset.customers` WHERE split_col = 'SCORE')
);
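
The fit statistics that PROC LOGISTIC prints by default come from ML.EVALUATE. A sketch using the same assumed split_col holdout convention:

-- Model quality metrics (replaces PROC LOGISTIC fit and ROC tables):
-- returns precision, recall, accuracy, f1_score, log_loss, roc_auc
SELECT *
FROM ML.EVALUATE(
    MODEL `project.dataset.churn_model`,
    (SELECT * FROM `project.dataset.customers` WHERE split_col = 'TEST')
);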

MigryX precision parser — Deep AST-level analysis ensures every construct is understood before conversion begins

Platform-Specific Optimization by MigryX

MigryX maintains deep knowledge of every target platform’s strengths and best practices. When converting to Snowflake, it leverages Snowpark and native SQL functions. When targeting Databricks, it uses PySpark DataFrame operations optimized for distributed execution. When generating dbt models, it follows dbt best practices for modularity and testability. And when targeting BigQuery and Dataform, it applies the partitioning, clustering, and QUALIFY idioms shown throughout this guide. This platform awareness is what makes MigryX output production-ready from day one.

PROC REPORT to BigQuery Connected Sheets and Looker

SAS PROC REPORT and PROC TABULATE generate formatted reports with computed columns, across variables, and traffic-lighting. In the BigQuery ecosystem, reporting is separated from compute: BigQuery produces the analytical results, and connected visualization tools render the reports. BigQuery Connected Sheets allows users to work with BigQuery data directly in Google Sheets, while Looker provides enterprise dashboards and self-service analytics.
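
In practice, the hand-off point is usually a curated BigQuery view that Connected Sheets or Looker reads directly. A minimal sketch (dataset and column names are illustrative):

-- Curated reporting view consumed by Connected Sheets / Looker
CREATE OR REPLACE VIEW `project.gold.regional_revenue_report` AS
SELECT
    region,
    product_line,
    SUM(revenue) AS total_revenue,
    AVG(margin_pct) AS avg_margin_pct
FROM `project.silver.sales`
GROUP BY region, product_line;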

SAS Libname to BigQuery Datasets

A SAS libname assigns a logical name to a physical storage location (directory, database schema, etc.). A BigQuery dataset serves the same purpose — it is a logical container for tables, views, and routines within a project. Access controls are applied at the dataset level, and datasets can be organized by environment (dev, staging, prod) or domain (finance, marketing, operations).

/* SAS: Libname assignments */
libname raw '/data/raw';
libname silver '/data/silver';
libname gold '/data/gold';
-- BigQuery: Datasets as logical containers
-- Created once via DDL or Terraform
CREATE SCHEMA IF NOT EXISTS `project.raw`
    OPTIONS (location = 'US');
CREATE SCHEMA IF NOT EXISTS `project.silver`
    OPTIONS (location = 'US');
CREATE SCHEMA IF NOT EXISTS `project.gold`
    OPTIONS (location = 'US');

-- Dataform manages dataset references with environments
-- dataform.json
-- { "defaultSchema": "silver", "defaultDatabase": "project" }

SAS Scheduling to Scheduled Queries and Cloud Composer

SAS batch programs are typically scheduled via cron, Control-M, or Platform LSF. BigQuery provides two native scheduling mechanisms: Scheduled Queries for simple recurring SQL, and Cloud Composer (managed Apache Airflow) for complex DAGs with dependencies, retries, and cross-system orchestration.

Scheduled Queries

-- BigQuery Scheduled Query: Daily aggregation
-- Configured via Console or bq CLI
-- Replaces: cron job invoking "sas -batch daily_agg.sas"

-- Schedule: Every day at 06:00 UTC
-- Destination: project.gold.daily_revenue
-- Write preference: WRITE_TRUNCATE

SELECT
    DATE(transaction_date) AS report_date,
    region,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM `project.silver.transactions`
WHERE DATE(transaction_date) = @run_date
GROUP BY report_date, region;

Cloud Composer for Complex Pipelines

# Cloud Composer (Airflow) DAG replacing SAS batch scheduling
# Replaces a multi-step SAS batch job with dependencies

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator
)
from datetime import datetime

with DAG(
    'sas_migration_daily_pipeline',
    schedule_interval='0 6 * * *',
    start_date=datetime(2026, 1, 1),
    catchup=False,
    tags=['sas-migration']
) as dag:

    # "{{ ds }}" is Airflow's templated logical date (YYYY-MM-DD),
    # rendered at runtime and passed to each stored procedure
    load_raw = BigQueryInsertJobOperator(
        task_id='load_raw_data',
        configuration={
            'query': {
                'query': "CALL `project.raw.load_daily_files`(DATE '{{ ds }}')",
                'useLegacySql': False
            }
        }
    )

    transform_silver = BigQueryInsertJobOperator(
        task_id='transform_to_silver',
        configuration={
            'query': {
                'query': "CALL `project.silver.transform_transactions`(DATE '{{ ds }}')",
                'useLegacySql': False
            }
        }
    )

    build_gold = BigQueryInsertJobOperator(
        task_id='build_gold_aggregates',
        configuration={
            'query': {
                'query': "CALL `project.gold.build_daily_summary`(DATE '{{ ds }}')",
                'useLegacySql': False
            }
        }
    )

    load_raw >> transform_silver >> build_gold

Data Ingestion: Replacing SAS File Reading

SAS reads data from flat files (CSV, fixed-width), SAS datasets, databases (via libname engines), and increasingly from cloud storage. BigQuery provides multiple ingestion methods: LOAD DATA for batch loads, BigQuery Storage Write API for streaming, external tables for query-in-place, and BigQuery Data Transfer Service for SaaS connectors.

-- BigQuery: Load CSV data (replaces SAS INFILE/INPUT)
LOAD DATA INTO `project.raw.daily_transactions`
FROM FILES (
    format = 'CSV',
    uris = ['gs://company-data/daily/transactions_*.csv'],
    skip_leading_rows = 1
);

-- External table for query-in-place (replaces SAS filename + INFILE)
CREATE OR REPLACE EXTERNAL TABLE `project.raw.external_log_data`
OPTIONS (
    format = 'JSON',
    uris = ['gs://company-logs/2026/04/*'],
    max_staleness = INTERVAL 1 HOUR
);

How MigryX Automates SAS to BigQuery Migration

The most common SAS-to-BigQuery migration error is attempting line-by-line translation of DATA step logic. SAS DATA steps are imperative and row-by-row; BigQuery SQL is declarative and set-based. MigryX's AST-based approach understands the semantic intent of DATA step logic and generates idiomatic BigQuery SQL using window functions, CTEs, and scripting — not a mechanical transliteration.

Migration Architecture: End-to-End Pipeline

A typical SAS-to-BigQuery migration follows a phased approach that addresses code conversion, data migration, validation, and operational cutover. The architecture involves several GCP services working together.

  1. Discovery and Assessment — Inventory all SAS programs, datasets, macros, and schedules. MigryX's scanner analyzes the complete SAS codebase and generates a migration complexity report with effort estimates per program.
  2. Data Migration — Export SAS datasets to Parquet or CSV, stage in Google Cloud Storage, and load into BigQuery raw tables. Use BigQuery Data Transfer Service for ongoing database connections.
  3. Code Conversion — Convert SAS programs to BigQuery SQL and Dataform SQLX using MigryX's automated translation engine. Review and refine output with MigryX's Merlin AI for optimization suggestions.
  4. Validation — Run parallel execution of SAS and BigQuery pipelines, comparing row counts, aggregates, and sample records; a parity-check sketch follows this list. MigryX generates validation queries automatically from the column-level lineage map.
  5. Scheduling and Orchestration — Replace SAS batch scheduling with BigQuery Scheduled Queries (simple jobs) or Cloud Composer DAGs (complex pipelines with dependencies). Dataform provides built-in scheduling for SQL workflow orchestration.
  6. Cutover and Decommission — Redirect downstream consumers to BigQuery outputs, monitor for parity, and decommission SAS licenses incrementally.
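
A typical step 4 parity check compares the SAS baseline, exported and loaded into a staging table, against the converted pipeline's output. A sketch with illustrative table names:

-- Parity check: row counts and key aggregates, SAS baseline vs. BigQuery
SELECT
    'sas_baseline' AS source,
    COUNT(*) AS row_count,
    ROUND(SUM(total_revenue), 2) AS sum_revenue
FROM `project.validation.customer_analytics_sas_export`
UNION ALL
SELECT
    'bigquery_pipeline',
    COUNT(*),
    ROUND(SUM(total_revenue), 2)
FROM `project.dataset.customer_analytics`;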

Comparison: SAS vs. BigQuery Operational Model

| Dimension | SAS | BigQuery |
| --- | --- | --- |
| Pricing | Annual per-seat license ($5K-$50K+ per user) | Pay-per-query ($6.25/TiB on-demand) or capacity-based slot reservations |
| Scaling | Vertical (bigger server) or SAS Grid | Automatic, serverless, elastic concurrency |
| Infrastructure | On-premise servers or SAS Viya cloud | Fully managed, zero administration |
| Storage Format | Proprietary sas7bdat | Managed columnar (Capacitor) with auto-optimization |
| Query Language | SAS language + PROC SQL | ANSI SQL with extensions (BigQuery SQL) |
| ML/Statistics | SAS/STAT, SAS/ETS, SAS Enterprise Miner | BigQuery ML (SQL-based), Vertex AI integration |
| Scheduling | cron, Control-M, LSF | Scheduled Queries, Cloud Composer, Dataform |
| Data Governance | SAS metadata server | Dataplex, Data Catalog, column-level security |
| Version Control | Typically file-based (poor Git integration) | Dataform integrates with Git natively |
| Collaboration | SAS Enterprise Guide (desktop) | BigQuery Console, Connected Sheets, Colab notebooks |

Migrating from SAS to BigQuery is not simply a code translation exercise — it is a fundamental shift from a proprietary, server-based, row-by-row processing model to a serverless, set-based, columnar analytics platform. The SAS DATA step's imperative loop gives way to BigQuery's declarative SQL with window functions. SAS macro libraries become Dataform SQLX templates with JavaScript. SAS batch scheduling yields to Cloud Composer DAGs and Scheduled Queries. And SAS/STAT procedures are replaced by BigQuery ML models trained directly in SQL.

The payoff is substantial: elimination of SAS licensing costs, serverless scaling that handles petabyte workloads without infrastructure planning, native integration with the Google Cloud data ecosystem (Dataplex for governance, Looker for visualization, Vertex AI for advanced ML), and a SQL-centric approach that dramatically expands the pool of talent who can work with the platform.

Why MigryX Delivers Superior Migration Results

The challenges described throughout this article, from DATA step semantics to macro translation and batch scheduling, are exactly what MigryX was built to solve.

MigryX combines precision AST parsing with Merlin AI to deliver 99% accurate, production-ready migration — turning what used to be a multi-year manual effort into a streamlined, validated process. See it in action.

Ready to migrate from SAS to BigQuery?

See how MigryX converts SAS DATA steps, PROC SQL, and macro libraries to production-ready BigQuery SQL, Dataform SQLX, and Cloud Composer pipelines — with full column-level lineage.

Explore BigQuery Migration   Schedule a Demo