Tables
Sumatra tables provide a way to enrich events with externally-loaded batch data. A lookup feature fetches the table row for the given key and returns the requested field values.
Tables are static, meaning that the contents of a table are not updated by any ongoing process. Instead, a table is updated by uploading a new version of the table, then publishing a Scowl change to reference the new version. For dynamic table functionality, see Temporal Aggregates.
The primary method for uploading a table is the Python SDK's create_table_from_dataframe method.
Lookup
Returns one or more fields from the table row that matches the specified key.
Syntax:
Lookup<table_name>(
feature(s)
by feature
)
Examples:
Lookup<geoip>(lat, lng by zip)
Lookup<region_to_iso2>(iso2 by billing_state)
Lookup<product_dim>(sku, category by product_id)
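As a mental model only, a Lookup behaves like a keyed dictionary read. The sketch below is not Sumatra's implementation: the in-memory geoip dictionary and the lookup helper are hypothetical, and returning nulls for a missing key is an assumption, since the behavior for unmatched keys is not stated here.

```python
from typing import Any, Dict, List, Optional

# Hypothetical in-memory stand-in for an uploaded table: key -> row.
geoip: Dict[str, Dict[str, Any]] = {
    "94107": {"lat": 37.77, "lng": -122.39},
    "10001": {"lat": 40.75, "lng": -73.99},
}


def lookup(table: Dict[str, Dict[str, Any]], fields: List[str], key: str) -> List[Optional[Any]]:
    """Return the requested fields from the row stored under `key`."""
    row = table.get(key)
    if row is None:
        # Assumption: an unmatched key yields a null for every requested field.
        return [None] * len(fields)
    return [row.get(field) for field in fields]


# Rough analogue of: Lookup<geoip>(lat, lng by zip)
lat, lng = lookup(geoip, ["lat", "lng"], "94107")
```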
Dependencies
To reference a table in a Lookup feature, you must first add it to a special Scowl file named deps.scowl. This file includes only require statements, which tell the topology to import a particular version of a named resource (i.e. a table).
Require
Import a named, versioned resource (i.e. table). You may reference a single table or multiple tables using the group syntax.
Syntax:
require table name version
require table (
name version
name version
...
)
Examples:
require table (
geoip v20220927202524
region_to_iso2 v20220927202928
)
require table product_dim v20220927220048
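For CI scripts it can be handy to read the pinned versions out of a deps file. The following is a minimal sketch, not Sumatra's own parser: parse_deps is a hypothetical helper that handles only the two forms shown above (single and grouped) and assumes the file contains no comments or other statements.

```python
import re
from typing import Dict

# One "name version" pair, e.g. "geoip v20220927202524".
_ENTRY = re.compile(r"(\w+)\s+(v\d+)")
_GROUP_OPEN = re.compile(r"require\s+table\s*\(")
_SINGLE = re.compile(r"require\s+table\s+(.+)")


def parse_deps(text: str) -> Dict[str, str]:
    """Map each required table name to its pinned version."""
    tables: Dict[str, str] = {}
    in_group = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if in_group and line == ")":
            in_group = False
            continue
        if in_group:
            entry = line
        elif _GROUP_OPEN.fullmatch(line):
            in_group = True
            continue
        else:
            single = _SINGLE.fullmatch(line)
            if single is None:
                raise ValueError(f"unexpected line: {line!r}")
            entry = single.group(1)
        match = _ENTRY.fullmatch(entry)
        if match is None:
            raise ValueError(f"expected 'name version', got: {entry!r}")
        tables[match.group(1)] = match.group(2)
    if in_group:
        raise ValueError("unclosed 'require table (' group")
    return tables


DEPS = """\
require table (
  geoip v20220927202524
  region_to_iso2 v20220927202928
)
require table product_dim v20220927220048
"""
pinned = parse_deps(DEPS)
```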
Attention
A require statement is only valid Scowl within your deps.scowl file.
Versions
Version identifiers are automatically generated during the table upload process. The format is the letter v, followed by the UTC timestamp of the upload.
A new table version is created every time a table is uploaded for a given table name. Old versions are kept around and may be referenced in the LIVE topology or materialization experiments.
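Because a version identifier encodes its upload time, it can be converted back to a timestamp, for example to sort versions. This sketch assumes the timestamp layout is YYYYMMDDHHMMSS, which is inferred from the examples on this page (such as v20220927202524) rather than from a documented format; version_to_datetime and latest are hypothetical helper names.

```python
from datetime import datetime, timezone
from typing import Iterable


def version_to_datetime(version: str) -> datetime:
    """Convert an identifier such as 'v20220927202524' to an aware UTC datetime."""
    if not version.startswith("v"):
        raise ValueError(f"not a version identifier: {version!r}")
    # Assumption: the digits after 'v' are YYYYMMDDHHMMSS in UTC.
    parsed = datetime.strptime(version[1:], "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def latest(versions: Iterable[str]) -> str:
    """Return the most recently uploaded version from a collection."""
    return max(versions, key=version_to_datetime)


uploaded_at = version_to_datetime("v20220927202524")
newest = latest(["v20220927202524", "v20220927202928"])
```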
CLI
sumatra deps
The Sumatra CLI includes a sumatra deps command to help you manage the versions in your deps.scowl file.
To fetch the latest versions of all tables and save them to your local deps.scowl file:
sumatra deps update
To preview the deps without saving them:
sumatra deps list
sumatra table
Additionally, the Sumatra CLI includes a sumatra table command to inspect table versions.
To list all tables:
sumatra table list
To list all versions of a particular table:
sumatra table history my_table
Uploading Data
To create a table (or a new version of an existing table), use the create_table_from_dataframe method in the Python SDK.
In addition to the table name and the Pandas dataframe, you must specify which column to use as the key (primary index) of the table.
Example
df = ...query data warehouse...
tbl = sumatra.create_table_from_dataframe('geozip', df, 'zip_code')
tbl.wait()
The create_table_from_dataframe method:
1. Saves the dataframe to parquet
2. Uploads the parquet file to Sumatra (S3)
3. Kicks off a job to validate and load the data
Because step 3 may take a while, the method returns a handle that you can .wait() on. The handle's .status property indicates the status of the job: Processing → Ready (or Error).
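In a script, it is usually worth checking the final status after waiting, so that a failed load stops the pipeline. The sketch below relies only on the .wait() method and .status property described above. It assumes that .status compares equal to the strings "Ready" and "Error" and that .wait() returns without raising when the job fails; neither is confirmed here. FakeHandle is a hypothetical stand-in so the example runs without the SDK.

```python
class FakeHandle:
    """Hypothetical stand-in for the handle returned by create_table_from_dataframe."""

    def __init__(self, final_status: str) -> None:
        self.status = "Processing"
        self._final_status = final_status

    def wait(self) -> None:
        # The real handle would block until the load job finishes.
        self.status = self._final_status


def wait_until_ready(handle):
    """Block on the load job and raise if it did not end in the Ready state."""
    handle.wait()
    # Assumption: .status is comparable to the plain string "Ready".
    if handle.status != "Ready":
        raise RuntimeError(f"table load did not succeed: status={handle.status}")
    return handle


ready = wait_until_ready(FakeHandle("Ready"))
```

With the SDK, the same helper would wrap the returned handle, e.g. wait_until_ready(sumatra.create_table_from_dataframe('geozip', df, 'zip_code')).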
Schema Inference
The table's schema is inferred from the dtypes of the Pandas dataframe. Currently, the supported types are Scowl's basic types: int, float, bool, string, time.
A time column should appear in the dataframe as a pandas.Timestamp. Note that due to a Pandas limitation, int columns may not contain null values, and will be cast to float if any nulls are present. Null values are supported for all other types.
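The int-to-float cast happens in Pandas before anything is uploaded, so it can be spotted locally. The sketch below shows the effect and a hypothetical helper, upcast_int_columns, that flags float columns whose non-null values are all whole numbers. The column names and data are made up for illustration, and the heuristic will also flag columns that are genuinely float but happen to hold whole numbers.

```python
import pandas as pd

# Made-up data: "population" has a missing value, "households" does not.
df = pd.DataFrame(
    {
        "zip_code": ["94107", "10001", "60601"],
        "population": [29920, None, 12345],
        "households": [100, 200, 300],
    }
)


def upcast_int_columns(frame: pd.DataFrame) -> list:
    """List float columns that contain nulls and otherwise only whole numbers."""
    flagged = []
    for name in frame.columns:
        col = frame[name]
        if col.dtype.kind != "f":
            continue
        present = col.dropna()
        if col.isna().any() and len(present) > 0 and (present % 1 == 0).all():
            flagged.append(name)
    return flagged


# "population" was silently stored as float64 because of the null.
suspects = upcast_int_columns(df)
```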
To inspect the schema of an uploaded table, use the following CLI command:
sumatra table schema my_table
Restrictions
For a dataframe to be valid for upload, it must meet all of the following criteria:
- All column names must match the pattern [a-z][a-zA-Z0-9_]*
- The key column must appear in the list of columns
- The values in the key column must be non-null and unique
- The dtype of every column must be a supported type (see Schema Inference)
- Row count must not exceed the maximum
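Some of these criteria can be checked locally before uploading. The sketch below covers only the column-name pattern and the key-column rules; it does not check dtypes or row count (the maximum is not stated here). check_upload is a hypothetical helper, not part of the SDK, and it works on plain lists so that it runs without Pandas. It treats None and float NaN as null and does not recognize other null markers such as pandas.NaT.

```python
import math
import re
from typing import Any, List, Sequence

COLUMN_NAME = re.compile(r"[a-z][a-zA-Z0-9_]*")


def _is_null(value: Any) -> bool:
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_upload(columns: Sequence[str], key: str, key_values: Sequence[Any]) -> List[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    for name in columns:
        if not COLUMN_NAME.fullmatch(name):
            problems.append(f"invalid column name: {name!r}")
    if key not in columns:
        problems.append(f"key column {key!r} is not in the list of columns")
    if any(_is_null(v) for v in key_values):
        problems.append("key column contains null values")
    non_null = [v for v in key_values if not _is_null(v)]
    if len(set(non_null)) != len(non_null):
        problems.append("key column contains duplicate values")
    return problems


ok = check_upload(["zip_code", "lat", "lng"], "zip_code", ["94107", "10001"])
bad = check_upload(["Zip", "lat"], "zip", ["94107", "94107", None])
```

With a dataframe, the inputs would be list(df.columns) and df['zip_code'].tolist().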
Change Management
Sumatra's table capability was designed to meet two important change management requirements:
- Users can upload tables and experiment freely, without any worry that they will impact the LIVE topology.
- Going live with a new table version requires publishing an updated Scowl topology, and all the oversight that entails.
The typical workflow for deploying a table update is:
- Upload table from dataframe
- Run sumatra deps update in your branch
- sumatra push your branch
- Run a materialization to validate the new table version
- git commit the change(s) to your deps.scowl file
- PR, code review, publish the updated Scowl
Dev / Prod
Each Sumatra instance will have its own versions of resources. If, for example, the same table data is loaded into Dev and Prod, the table will be assigned a different version number in each, based on upload timestamp.
Therefore, if you plan to deploy the same folder of Scowl files to multiple instances, you will need to keep a different deps.scowl file per instance. The convention is to use the local deps.scowl file for Prod and store the deps files for other instances elsewhere.
There are two options:
- Store Dev's deps.scowl file in a separate folder, e.g. ../dev/deps.scowl
- Maintain a deps.scowl.dev file in the primary folder. Note that the filename must have an extension other than .scowl to avoid conflicts.
To reference Dev's deps file in CI/CD or in manual use of the CLI, use the --deps-file parameter, e.g.:
sumatra pull --deps-file deps.scowl.dev
sumatra push --deps-file deps.scowl.dev
sumatra deps update --deps-file ../dev/deps.scowl
sumatra plan --deps-file ../dev/deps.scowl
sumatra apply --deps-file ../dev/deps.scowl
Note that, regardless of the local name used, the deps are maintained as deps.scowl in the server-side branch.
Deleting Tables
The sumatra table delete command allows you to delete a table, including its complete version history.
To avoid impacting live decisions, Sumatra will not allow a table to be deleted if it is referenced in the LIVE topology.