Tables

Sumatra tables provide a way to enrich events with externally-loaded batch data. A lookup feature fetches the table row for the given key and returns the requested field values.

Tables are static, meaning that their contents are not updated by an ongoing process. Instead, you update a table by uploading a new version, then publishing a Scowl change that references the new version. For dynamic table functionality, see Temporal Aggregates.

The primary method for uploading a table is the Python SDK's create_table_from_dataframe method.

Lookup

Returns one or more fields from the table for the specified key.

Syntax:

Lookup<table_name>(
    feature(s)
    by feature
)

Examples:

Lookup<geoip>(lat, lng by zip)
Lookup<region_to_iso2>(iso2 by billing_state)
Lookup<product_dim>(sku, category by product_id)
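
As a mental model only, a lookup behaves like a keyed dictionary read. The Python sketch below is not how Sumatra implements lookups: the in-memory geoip dict, its sample rows, and the lookup helper are all hypothetical, and returning None for a missing key is an assumption made for the sketch.

```python
from typing import Any, Optional

# Hypothetical in-memory stand-in for an uploaded table, keyed by zip.
geoip = {
    "94105": {"lat": 37.789, "lng": -122.394, "city": "San Francisco"},
    "10001": {"lat": 40.750, "lng": -73.997, "city": "New York"},
}


def lookup(table: dict, fields: list, key: Any) -> Optional[dict]:
    """Return the requested fields from the row stored under `key`."""
    row = table.get(key)
    if row is None:
        # Assumption for this sketch only: a missing key yields None.
        return None
    return {name: row[name] for name in fields}


# Roughly analogous to Lookup<geoip>(lat, lng by zip) with zip = "94105".
result = lookup(geoip, ["lat", "lng"], "94105")
```

Only the requested fields come back; the city column in the sample row is ignored.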

Dependencies

Before a Lookup feature can reference a table, the table must be added to a special Scowl file named deps.scowl. This file contains only require statements, which tell the topology to import a particular version of a named resource (i.e. a table).

Require

Import a named, versioned resource (i.e. table). You may reference a single table or multiple tables using the group syntax.

Syntax:

require table name version

require table (
    name version
    name version
    ...
)

Examples:

require table (
    geoip v20220927202524
    region_to_iso2 v20220927202928
)

require table product_dim v20220927220048

Attention

A require statement is valid Scowl only within your deps.scowl file.
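
When scripting around deps.scowl (for example, to report which versions a branch pins), it can help to read the file into a name-to-version mapping. The following Python sketch is a simplified reader that handles only the two require forms shown above. The parse_deps helper is hypothetical, not part of the Sumatra SDK, and the table-name pattern it accepts is an assumption.

```python
import re

# Assumed shape of one entry, e.g. "geoip v20220927202524".
_ENTRY = re.compile(r"([A-Za-z_]\w*)\s+(v\d+)")
_GROUP_OPEN = re.compile(r"require\s+table\s*\(")
_SINGLE = re.compile(r"require\s+table\s+(.+)")


def parse_deps(text: str) -> dict:
    """Map table name -> version for single and grouped require forms."""
    tables = {}
    in_group = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if in_group:
            if line == ")":
                in_group = False
                continue
            entry = line
        elif _GROUP_OPEN.fullmatch(line):
            in_group = True
            continue
        else:
            single = _SINGLE.fullmatch(line)
            if single is None:
                raise ValueError(f"unexpected line: {raw!r}")
            entry = single.group(1)
        match = _ENTRY.fullmatch(entry)
        if match is None:
            raise ValueError(f"malformed entry: {raw!r}")
        tables[match.group(1)] = match.group(2)
    if in_group:
        raise ValueError("unclosed require group")
    return tables


deps = parse_deps("""
require table (
    geoip v20220927202524
    region_to_iso2 v20220927202928
)

require table product_dim v20220927220048
""")
```

Here deps maps each of the three table names from the examples to its pinned version.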

Versions

Version identifiers are automatically generated during the table upload process. The format is the letter v, followed by the UTC timestamp of the upload.

A new table version is created every time a table is uploaded for a given table name. Old versions are kept around and may be referenced in the LIVE topology or materialization experiments.
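
Because a version identifier encodes its upload time, versions can be compared or converted to timestamps locally. The Python sketch below assumes the timestamp layout is YYYYMMDDHHMMSS, which is inferred from identifiers such as v20220927202524 rather than stated explicitly. Both helper names are hypothetical.

```python
from datetime import datetime, timezone

# Assumed layout, inferred from examples such as v20220927202524.
_FORMAT = "v%Y%m%d%H%M%S"


def version_to_datetime(version: str) -> datetime:
    """Parse a version identifier into a timezone-aware UTC datetime."""
    return datetime.strptime(version, _FORMAT).replace(tzinfo=timezone.utc)


def latest_version(versions: list) -> str:
    """Pick the most recently uploaded version by its timestamp."""
    return max(versions, key=version_to_datetime)


uploaded_at = version_to_datetime("v20220927202524")
newest = latest_version(["v20220927202524", "v20220927220048"])
```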

CLI

sumatra deps

The Sumatra CLI includes a sumatra deps command to help you manage the versions in your deps.scowl file.

To fetch the latest versions of all tables and save to your local deps.scowl file:

sumatra deps update

To preview the deps without saving them:

sumatra deps list

sumatra table

Additionally, the sumatra CLI includes a sumatra table command to inspect table versions.

To list all tables:

sumatra table list

To list all versions of a particular table:

sumatra table history my_table

Uploading Data

To create a table (or a new version of an existing table), use the create_table_from_dataframe method in the Python SDK.

In addition to the table name and the Pandas dataframe, you must specify which column to use as the key (primary index) of the table.

Example

df = ...query data warehouse...

tbl = sumatra.create_table_from_dataframe('geozip', df, 'zip_code')
tbl.wait()

The create_table_from_dataframe method:

  1. Saves the dataframe to parquet
  2. Uploads the parquet file to Sumatra (S3)
  3. Kicks off a job to validate and load the data

Because step (3) may take a while, the method returns a handle that you can .wait() on. The handle's .status property indicates the status of the job: Processing, then Ready (or Error).

Schema Inference

The table's schema is inferred from the dtypes of the Pandas dataframe. Currently, the supported types are Scowl's basic types: int, float, bool, string, time.

A time column should appear in the dataframe as a pandas.Timestamp. Note that due to a Pandas limitation, int columns may not contain null values, and will be cast to float if any nulls are present. Null values are supported for all other types.
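
The int-to-float cast is easy to miss because Pandas performs it silently when the dataframe is built. The Python sketch below shows the effect and one way to spot affected columns before upload. The sample data and the accidental_float_columns helper are hypothetical and not part of the Sumatra SDK.

```python
import pandas as pd

# A column of whole numbers with no nulls keeps an integer dtype.
complete = pd.DataFrame(
    {"zip_code": ["94105", "10001"], "population": [100, 200]}
)

# The same column with one missing value is stored as a float dtype.
with_null = pd.DataFrame(
    {"zip_code": ["94105", "10001"], "population": [100, None]}
)


def accidental_float_columns(df: pd.DataFrame) -> list:
    """List float columns that hold only whole numbers plus nulls."""
    suspects = []
    for name in df.columns:
        col = df[name]
        if not pd.api.types.is_float_dtype(col):
            continue
        present = col.dropna()
        if col.isna().any() and (present == present.round()).all():
            suspects.append(name)
    return suspects


suspects = accidental_float_columns(with_null)
```

A column flagged this way will be inferred as float, so either fill the nulls before upload or treat the field as a float downstream.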

To inspect the schema of an uploaded table, use the following CLI command:

sumatra table schema my_table

Restrictions

For a dataframe to be valid for upload, it must meet all of the following criteria:

  • All column names must match the pattern [a-z][a-zA-Z0-9_]*
  • The key column must appear in the list of columns
  • The values in the key column must be non-null and unique
  • The dtypes of all columns must be a supported type (see previous section)
  • Row count must not exceed the maximum
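
Several of these criteria can be checked locally before starting an upload. The Python sketch below covers column names, the key column, and row count. The check_table_frame helper is hypothetical, and max_rows is a caller-supplied parameter because the actual maximum is not stated here. The dtype check is left out because the exact mapping from Pandas dtypes to Scowl types is not specified in this section.

```python
import re
from typing import Optional

import pandas as pd

# Column-name pattern taken from the restrictions above.
_COLUMN_NAME = re.compile(r"[a-z][a-zA-Z0-9_]*")


def check_table_frame(
    df: pd.DataFrame, key: str, max_rows: Optional[int] = None
) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for name in df.columns:
        if not isinstance(name, str) or not _COLUMN_NAME.fullmatch(name):
            problems.append(f"invalid column name: {name!r}")
    if key not in df.columns:
        problems.append(f"key column {key!r} is missing")
    else:
        if df[key].isna().any():
            problems.append(f"key column {key!r} contains nulls")
        if df[key].duplicated().any():
            problems.append(f"key column {key!r} contains duplicates")
    if max_rows is not None and len(df) > max_rows:
        problems.append(f"row count {len(df)} exceeds {max_rows}")
    return problems
```

Collecting every problem in one pass, rather than raising on the first, lets you fix a dataframe once instead of cycling through repeated failed uploads.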

Change Management

Sumatra's table capability was designed to meet two important change management requirements:

  • Users can upload tables and experiment freely, without any worry that they will impact the LIVE topology.
  • Going live with a new table version requires publishing an updated Scowl topology, and all the oversight that entails.

The typical workflow for deploying a table update is:

  1. Upload table from dataframe
  2. Run sumatra deps update in your branch
  3. sumatra push your branch
  4. Run a materialization to validate the new table version
  5. git commit the change(s) to your deps.scowl file
  6. PR, code review, publish the updated scowl

Dev / Prod

Each Sumatra instance will have its own versions of resources. If, for example, the same table data is loaded into Dev and Prod, the table will be assigned a different version number in each, based on upload timestamp.

Therefore, if you plan to deploy the same folder of scowl files to multiple instances, you will need to keep a different deps.scowl file per instance. The convention is to use the local deps.scowl file for Prod and store the deps files for other instances elsewhere.

There are two options for storing Dev's deps file:

  1. Store Dev's deps.scowl file in a separate folder, e.g. ../dev/deps.scowl
  2. Maintain a deps.scowl.dev file in the primary folder. Note that the filename must have an extension other than .scowl to avoid conflicts.

To reference Dev's deps file, in CI/CD or manual use of the CLI, use the --deps-file parameter, e.g.:

sumatra pull --deps-file deps.scowl.dev
sumatra push --deps-file deps.scowl.dev
sumatra deps update --deps-file ../dev/deps.scowl
sumatra plan --deps-file ../dev/deps.scowl
sumatra apply --deps-file ../dev/deps.scowl

Note that, regardless of the local name used, the deps are maintained as deps.scowl in the server-side branch.
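
In CI/CD, one way to keep the per-instance deps file consistent is to build every CLI invocation through a single helper. The Python sketch below assumes option 2 (a deps.scowl.dev file in the primary folder). The DEPS_FILES mapping and the sumatra_command helper are hypothetical, and how the CLI is pointed at a particular instance is outside the scope of this sketch.

```python
# Hypothetical mapping from instance name to its local deps file.
# None means "use the default local deps.scowl" (Prod, by convention).
DEPS_FILES = {
    "prod": None,
    "dev": "deps.scowl.dev",
}


def sumatra_command(instance: str, *args: str) -> list:
    """Build the argument list for a sumatra CLI call on one instance."""
    command = ["sumatra", *args]
    deps_file = DEPS_FILES[instance]
    if deps_file is not None:
        # Only non-default deps files need the --deps-file parameter.
        command += ["--deps-file", deps_file]
    return command


dev_update = sumatra_command("dev", "deps", "update")
prod_push = sumatra_command("prod", "push")
```

The resulting list can be passed directly to subprocess.run(..., check=True) so that a failed CLI call stops the pipeline.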

Deleting Tables

The sumatra table delete command allows you to delete a table, including its complete version history.

To avoid impacting live decisions, Sumatra will not allow a table to be deleted if it is referenced in the LIVE topology.