Models
Models are a convenient way to incorporate machine learning into your Sumatra real-time data services. Like any other Sumatra feature, a model transforms input features into new output features; in this case, the transformation is a trained machine learning model.
Benefits
You can always fetch Sumatra features from a standalone model prediction service, if that best suits your needs.
However, the key benefits of performing model inference directly in Sumatra are:
- Fast, self-service deployment, without the need to manage your own microservices
- Post-scoring business rules, such as thresholds and exception lists, directly in your Scowl code
- Model inference works in replay as well, for consistent online-offline prediction
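As a rough illustration of the post-scoring rules mentioned in the list above, the logic can be sketched in plain Python. This is not Scowl and not a Sumatra API; the threshold value, the exception list, and the `decide` function are hypothetical names chosen only for this example.

```python
# Hypothetical post-scoring rules: plain Python, not Scowl or a Sumatra API.
REVIEW_THRESHOLD = 0.8                # assumed cutoff applied to the model score
TRUSTED_EMAILS = {"ops@example.com"}  # assumed exception list


def decide(risk_score: float, email: str) -> str:
    """Combine a model score with simple business rules."""
    # The exception list takes precedence over the score.
    if email in TRUSTED_EMAILS:
        return "approve"
    # Otherwise, send high scores to manual review.
    if risk_score >= REVIEW_THRESHOLD:
        return "review"
    return "approve"


print(decide(0.93, "new.user@example.com"))  # review
print(decide(0.93, "ops@example.com"))       # approve
```

In Sumatra, the equivalent rules would live in your Scowl topology next to the model call, so that they are versioned and replayed together with the score.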
Workflow
To build and deploy a Sumatra model:
:one: Train your model in any package that can export to PMML
:two: Upload your model to Sumatra from the CLI:
sumatra model put "my_model_name" "my_model_v1.0.pmml" --comment "Initial xgboost"
Or equivalently from the Python Client:
create_model_from_pmml("my_model_name",
                       "my_model_v1.0.pmml",
                       comment="Initial xgboost")
:three: Import the latest model version in your deps.scowl file:
require model cash_out v20230120212732
Tip
Run sumatra deps update to fetch and save the latest versions of all resources.
:four: Add a ModelPredict call to your topology:
risk_score := ModelPredict<cash_out>({
    amount,
    dollars_in_out_1h,
    dollars_out_by_email,
    emails_per_bank,
    emails_per_device
}).probability_fraud
:five: Publish your branch, as usual, to go Live.
Schema
When you upload a model artifact, Sumatra infers both the input and output schemas, including the name and data type of each field.
Both the input and output are represented as Scowl Structs.
The Models UI presents the schema in a Scowl-like format, with the input and output separated by the -> operator, e.g.
model cash_out v20230120212732 {
    amount: float,
    dollars_in_out_1h: float,
    dollars_out_by_email: float,
    emails_per_bank: float,
    emails_per_device: float
} -> {
    probability_fraud: float,
    probability_good: float
}
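When testing offline, it can help to check a candidate feature set against the input side of such a schema. The sketch below mirrors the cash_out input schema shown above as a Python dict; the dict layout and the `check_inputs` helper are assumptions made for illustration and are not part of the Sumatra client.

```python
# Input schema copied from the cash_out example; the dict representation
# and check_inputs helper are hypothetical, not part of the Sumatra client.
INPUT_SCHEMA = {
    "amount": float,
    "dollars_in_out_1h": float,
    "dollars_out_by_email": float,
    "emails_per_bank": float,
    "emails_per_device": float,
}


def check_inputs(features, schema=INPUT_SCHEMA):
    """Return a list of problems found in a feature dict (empty if none)."""
    problems = []
    for name, expected in schema.items():
        if name not in features:
            problems.append(f"missing: {name}")
        elif not isinstance(features[name], expected):
            problems.append(f"wrong type: {name}")
    for name in features:
        if name not in schema:
            problems.append(f"unexpected: {name}")
    return problems


features = {
    "amount": 250.0,
    "dollars_in_out_1h": 1200.0,
    "dollars_out_by_email": 900.0,
    "emails_per_bank": 2.0,
    "device_count": 1.0,
}
print(check_inputs(features))
# ['missing: emails_per_device', 'unexpected: device_count']
```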
Warning
At model-training time, you must choose feature names that follow the strict requirements of Scowl feature names: lowercase alphanumeric with underscores (i.e. [a-z][_a-z0-9]*).
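One way to catch naming problems before training is to check your column names against the pattern quoted in the warning. The following is a minimal sketch using only the Python standard library; the `invalid_feature_names` helper and the sample column names are made up for illustration.

```python
import re

# Pattern quoted in the warning: lowercase alphanumeric with underscores,
# starting with a letter.
FEATURE_NAME = re.compile(r"[a-z][_a-z0-9]*")


def invalid_feature_names(names):
    """Return the names that do not fully match the naming rule."""
    return [n for n in names if not FEATURE_NAME.fullmatch(n)]


columns = [
    "amount",
    "dollars_in_out_1h",
    "EmailsPerBank",      # uppercase letters
    "1h_count",           # starts with a digit
    "emails per device",  # contains spaces
]
print(invalid_feature_names(columns))
# ['EmailsPerBank', '1h_count', 'emails per device']
```

Renaming the offending columns before fitting the model means the exported PMML already carries names that Scowl accepts.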
ModelPredict
To invoke the model on a particular set of features, call the ModelPredict function with the name of the model, e.g.:
ModelPredict<iris>({
    sepal_length: 5.1,
    sepal_width: 3.5,
    petal_length: 1.4,
    petal_width: 0.2
})
= {probability_setosa: 0.3, probability_versicolor: 0.25, probability_virginica: 0.45}
See the ModelPredict docs for full details.
PMML
Sumatra currently supports models serialized to the PMML format.
The Predictive Model Markup Language (PMML) is an XML-based standard for describing models. It lets a model trained in one tool be executed in another environment without a custom integration for each tool and environment pair.
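Because PMML is plain XML, you can inspect an exported file with standard tooling before uploading it. The sketch below lists the fields declared in a PMML `DataDictionary` using Python's standard library. The PMML fragment is hand-written for illustration (a real export also contains the model element itself), and `data_fields` is a hypothetical helper, not a Sumatra function.

```python
import xml.etree.ElementTree as ET

# Hand-written PMML fragment for illustration only.
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header/>
  <DataDictionary>
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="emails_per_device" optype="continuous" dataType="double"/>
    <DataField name="is_fraud" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>
"""


def data_fields(pmml_text):
    """Return (name, dataType) pairs from the PMML DataDictionary."""
    root = ET.fromstring(pmml_text)
    # Compare local tag names so the PMML namespace version does not matter.
    return [
        (el.get("name"), el.get("dataType"))
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "DataField"
    ]


for name, data_type in data_fields(PMML_DOC):
    print(name, data_type)
```

For a file on disk, read its contents and pass the text to `data_fields` in the same way.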
Supported Tools
Many machine learning packages, across a variety of languages, are able to export to PMML:
- scikit-learn (Python)
- MLlib (Spark)
- Dataiku
- many more ...
Supported Models
The full PMML standard supports a broad range of classification, regression, and preprocessing stages. Sumatra implements a core subset, which includes:
Preprocessing
- Missing value replacement (Impute)
- One-hot encoding
Classification
- XGBoost
- Random Forest
- Decision Tree
- Naive Bayes
Regression
- Linear
- Logistic
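For readers unfamiliar with the two preprocessing stages listed above, the following sketch shows what each one computes. It is a conceptual illustration in plain Python, not how Sumatra or any PMML engine implements them; the function names and sample categories are hypothetical.

```python
def impute(value, replacement):
    """Missing value replacement: substitute a fixed value for None."""
    return replacement if value is None else value


def one_hot(value, categories):
    """One-hot encoding: one 0/1 indicator per known category."""
    return [1 if value == c else 0 for c in categories]


print(impute(None, 0.0))                              # 0.0
print(impute(12.5, 0.0))                              # 12.5
print(one_hot("debit", ["credit", "debit", "wire"]))  # [0, 1, 0]
```

When these stages are part of the exported PMML pipeline, they are applied at inference time as well, so raw features can be passed to the model without repeating the preprocessing by hand.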