Apache Airflow Provider - DataRobot

An Apache Airflow provider for DataRobot.

Version: 0.0.4
Last published: January 6, 2023
DataRobot Provider for Apache Airflow

This package provides operators, sensors, and a hook to integrate DataRobot into Apache Airflow. Using these components, you can build the essential DataRobot pipeline: create a project, train models, deploy a model, and score predictions against the model deployment.

Install the Airflow provider

The DataRobot provider for Apache Airflow requires an environment with Apache Airflow and the DataRobot Python API client (`datarobot`) installed.

To install the DataRobot provider, run the following command:

```shell
pip install airflow-provider-datarobot
```

Create a connection from Airflow to DataRobot

The next step is to create a connection from Airflow to DataRobot:

  1. In the Airflow user interface, click Admin > Connections to add an Airflow connection.

  2. On the List Connection page, click + Add a new record.

  3. In the Add Connection dialog box, configure the following fields:

     | Field | Description |
     | --- | --- |
     | Connection Id | `datarobot_default` (this name is used by default in all operators) |
     | Connection Type | DataRobot |
     | API Key | A DataRobot API key, created in the DataRobot Developer Tools, from the API Keys section. |
     | DataRobot endpoint URL | `https://app.datarobot.com/api/v2` by default |
  4. Click Test to establish a test connection between Airflow and DataRobot.

  5. When the connection test is successful, click Save.

Create preconfigured connections to DataRobot

You can create preconfigured connections to store and manage credentials for use with Airflow operators; the connections are replicated on the DataRobot side.

Currently, the supported credential types are:

| Credentials | Description |
| --- | --- |
| DataRobot Basic Credentials | Login/password pairs |
| DataRobot GCP Credentials | Google Cloud Service account key |
| DataRobot AWS Credentials | AWS access keys |
| DataRobot Azure Storage Credentials | Azure Storage secret |
| DataRobot OAuth Credentials | OAuth tokens |
| DataRobot JDBC DataSource | JDBC connection attributes |

After creating a preconfigured connection through the Airflow UI or API, you can access your stored credentials with GetOrCreateCredentialOperator or GetOrCreateDataStoreOperator to replicate them in DataRobot and retrieve the corresponding credentials_id or datastore_id.

JSON configuration for the DAG run

Operators and sensors use parameters from the config JSON submitted when triggering the DAG; for example:

```json
{
    "training_data": "s3-presigned-url-or-local-path-to-training-data",
    "project_name": "Project created from Airflow",
    "autopilot_settings": {
        "target": "readmitted"
    },
    "deployment_label": "Deployment created from Airflow",
    "score_settings": {
        "intake_settings": {
            "type": "s3",
            "url": "s3://path/to/scoring-data/Diabetes10k.csv",
            "credential_id": "<credential_id>"
        },
        "output_settings": {
            "type": "s3",
            "url": "s3://path/to/results-dir/Diabetes10k_predictions.csv",
            "credential_id": "<credential_id>"
        }
    }
}
```

These config values are accessible in the execute() method of any operator in the DAG through the context["params"] variable; for example, to get training data, you could use the following:

```python
def execute(self, context: Dict[str, Any]) -> str:
    ...
    training_data = context["params"]["training_data"]
    ...
```
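The same lookup can be exercised outside Airflow with a plain dict standing in for the task context; this is a hypothetical sketch (the helper name and stand-in context are not part of the provider), mirroring the pattern above:

```python
from typing import Any, Dict


def read_training_data(context: Dict[str, Any]) -> str:
    """Mimic an operator's execute() reading the DAG-run params."""
    return context["params"]["training_data"]


# A stand-in for the context dict Airflow passes to execute()
context = {
    "params": {
        "training_data": "s3-presigned-url-or-local-path-to-training-data",
        "project_name": "Project created from Airflow",
    }
}
print(read_training_data(context))  # prints the configured training data location
```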

Modules

Operators

GetOrCreateCredentialOperator

Fetches a credential by name. This operator attempts to find a DataRobot credential with the provided name. If the credential doesn't exist, the operator creates it using the Airflow preconfigured connection with the same connection name.

Returns a credential ID.

Required config parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| credentials_param_name | str | The name of the parameter in the config file for the credential name. |

GetOrCreateDataStoreOperator

Fetches a DataStore by connection name. If the DataStore does not exist, the operator attempts to create it using the Airflow preconfigured connection with the same connection name.

Returns a DataStore ID.

Required config params:

| Parameter | Type | Description |
| --- | --- | --- |
| connection_param_name | str | The name of the parameter in the config file for the connection name. |

CreateDatasetFromDataStoreOperator

Loads a dataset from a JDBC Connection to the DataRobot AI Catalog.

Returns a dataset ID.

Required config params:

| Parameter | Type | Description |
| --- | --- | --- |
| datarobot_jdbc_connection | str | The existing preconfigured DataRobot JDBC connection name. |
| dataset_name | str | The name of the loaded dataset. |
| table_schema | str | The database table schema. |
| table_name | str | The source table name. |
| do_snapshot | bool | If True, creates a snapshot dataset; if False, creates a remote dataset. If unset, uses the server default (True). Creating snapshots from non-file sources may be disabled by the Disable AI Catalog Snapshots permission. |
| persist_data_after_ingestion | bool | If True, enforce saving all data (for download and sampling) and allow the user to view the extended data profile (which includes data statistics like min, max, median, mean, histogram, etc.); if False, don't enforce saving data. If unset, uses the server default (True). The data schema (feature names and types) is still available either way. Setting this parameter to False while do_snapshot is True results in an error. |
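The interaction between `do_snapshot` and `persist_data_after_ingestion` can be sketched as a small pre-flight check. This helper is hypothetical (not part of the provider); it only encodes the one forbidden combination described above:

```python
from typing import Optional


def check_ingestion_flags(do_snapshot: Optional[bool],
                          persist_data_after_ingestion: Optional[bool]) -> None:
    """Reject the combination the docs forbid:
    persist_data_after_ingestion=False with do_snapshot=True."""
    if persist_data_after_ingestion is False and do_snapshot is True:
        raise ValueError(
            "persist_data_after_ingestion=False cannot be combined with do_snapshot=True"
        )


check_ingestion_flags(do_snapshot=True, persist_data_after_ingestion=True)   # OK
check_ingestion_flags(do_snapshot=None, persist_data_after_ingestion=None)   # server defaults, OK
```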

UploadDatasetOperator

Uploads a local file to the DataRobot AI Catalog.

Returns a dataset ID.

Required config params:

| Parameter | Type | Description |
| --- | --- | --- |
| dataset_file_path | str | The local path to the training dataset. |

UpdateDatasetFromFileOperator

Creates a new dataset version from a file.

Returns a dataset version ID when the new version uploads successfully.

Required config params:

| Parameter | Type | Description |
| --- | --- | --- |
| dataset_id | str | The DataRobot AI Catalog dataset ID. |
| dataset_file_path | str | The local path to the training dataset. |

CreateDatasetVersionOperator

Creates a new version of the existing dataset in the AI Catalog.

Returns a dataset version ID.

Required config params:

| Parameter | Type | Description |
| --- | --- | --- |
| dataset_id | str | The DataRobot AI Catalog dataset ID. |
| datasource_id | str | The existing DataRobot datasource ID. |
| credential_id | str | The existing DataRobot credential ID. |

CreateOrUpdateDataSourceOperator

Creates a data source or updates it if it already exists.

Returns a DataRobot DataSource ID.

Required config params:

| Parameter | Type | Description |
| --- | --- | --- |
| data_store_id | str | The DataRobot datastore ID. |

CreateProjectOperator

Creates a DataRobot project.

Returns a project ID.

Several source dataset options are supported:

Local file or pre-signed S3 URL

Create a project directly from a local file or a pre-signed S3 URL.

Required config params:

| Parameter | Type | Description |
| --- | --- | --- |
| training_data | str | The pre-signed S3 URL or the local path to the training dataset. |
| project_name | str | The project name. |

Note: For S3 input, the training_data value must be a pre-signed AWS S3 URL.
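A DAG author could guard against accidentally passing a plain `s3://` URI where a pre-signed URL is required. The helper below is a hypothetical, stdlib-only sketch; it assumes a pre-signed S3 URL is an `https://` link carrying a signature in its query string:

```python
from urllib.parse import urlparse


def looks_presigned(url: str) -> bool:
    """Rough check: pre-signed S3 URLs are https links with a signature
    query parameter; plain s3:// URIs are rejected."""
    parts = urlparse(url)
    if parts.scheme != "https":
        return False
    query = parts.query.lower()
    return "x-amz-signature" in query or "signature" in query


print(looks_presigned("s3://bucket/data.csv"))  # False
print(looks_presigned("https://bucket.s3.amazonaws.com/data.csv?X-Amz-Signature=abc"))  # True
```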

AI Catalog dataset from config file

Create a project from an existing dataset in the DataRobot AI Catalog using a dataset ID defined in the config file.

Required config params:

| Parameter | Type | Description |
| --- | --- | --- |
| training_dataset_id | str | The dataset ID of an existing dataset in the DataRobot AI Catalog. |
| project_name | str | The project name. |

AI Catalog dataset from previous operator

Create a project from an existing dataset in the DataRobot AI Catalog using a dataset ID returned by a previous operator. In this case, the previous operator must return a valid dataset ID (for example, UploadDatasetOperator), and you should pass this output value as the dataset_id argument when creating the CreateProjectOperator.

Required config params:

| Parameter | Type | Description |
| --- | --- | --- |
| project_name | str | The project name. |

For more project settings, see the DataRobot documentation.


TrainModelsOperator

Runs DataRobot Autopilot to train models.

Returns None.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| project_id | str | The DataRobot project ID. |

Required config params:

| Parameter | Type | Description |
| --- | --- | --- |
| target | str | The name of the column defining the modeling target. |

```json
"autopilot_settings": {
    "target": "readmitted"
}
```

For more autopilot settings, see the DataRobot documentation.


DeployModelOperator

Deploy a specified model.

Returns a deployment ID.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| model_id | str | The DataRobot model ID. |

Required config params:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_label | str | The deployment label name. |

For more deployment settings, see the DataRobot documentation.


DeployRecommendedModelOperator

Deploys a recommended model.

Returns a deployment ID.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| project_id | str | The DataRobot project ID. |

Required config params:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_label | str | The deployment label name. |

For more deployment settings, see the DataRobot documentation.


ScorePredictionsOperator

Scores batch predictions against the deployment.

Returns a batch prediction job ID.

Prerequisites:

  • Use GetOrCreateCredentialOperator to pass a credential_id from the preconfigured DataRobot Credentials (Airflow Connections) or manually set the credential_id parameter in the config.

    Note: You can add S3 credentials to DataRobot via the Python API client.

  • Or use a Dataset ID from the DataRobot AI Catalog.

  • Or use a DataStore ID for a JDBC source connection; you can use GetOrCreateDataStoreOperator to pass datastore_id from a preconfigured Airflow Connection.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | The DataRobot deployment ID. |
| intake_datastore_id | str | The DataRobot datastore ID for the JDBC source connection. |
| output_datastore_id | str | The DataRobot datastore ID for the JDBC destination connection. |
| intake_credential_id | str | The DataRobot credentials ID for the source connection. |
| output_credential_id | str | The DataRobot credentials ID for the destination connection. |

Sample config: Pre-signed S3 URL

```json
"score_settings": {
    "intake_settings": {
        "type": "s3",
        "url": "s3://my-bucket/Diabetes10k.csv"
    },
    "output_settings": {
        "type": "s3",
        "url": "s3://my-bucket/Diabetes10k_predictions.csv"
    }
}
```

Sample config: Pre-signed S3 URL with a manually set credential ID

```json
"score_settings": {
    "intake_settings": {
        "type": "s3",
        "url": "s3://my-bucket/Diabetes10k.csv",
        "credential_id": "<credential_id>"
    },
    "output_settings": {
        "type": "s3",
        "url": "s3://my-bucket/Diabetes10k_predictions.csv",
        "credential_id": "<credential_id>"
    }
}
```

Sample config: Scoring dataset in the AI Catalog

```json
"score_settings": {
    "intake_settings": {
        "type": "dataset",
        "dataset_id": "<datasetId>"
    },
    "output_settings": { }
}
```

For more batch prediction settings, see the DataRobot documentation.
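The S3 sample configs above differ only in whether a credential_id is attached, so a small builder keeps them consistent. This is a hypothetical stdlib-only helper, not part of the provider:

```python
from typing import Optional


def s3_score_settings(intake_url: str, output_url: str,
                      credential_id: Optional[str] = None) -> dict:
    """Assemble a score_settings dict for S3 intake and output,
    attaching a credential_id only when one is supplied."""
    def endpoint(url: str) -> dict:
        settings = {"type": "s3", "url": url}
        if credential_id is not None:
            settings["credential_id"] = credential_id
        return settings

    return {
        "intake_settings": endpoint(intake_url),
        "output_settings": endpoint(output_url),
    }


cfg = s3_score_settings("s3://my-bucket/Diabetes10k.csv",
                        "s3://my-bucket/Diabetes10k_predictions.csv",
                        credential_id="<credential_id>")
```

The resulting dict can be placed under the `score_settings` key of the DAG-run config.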


GetTargetDriftOperator

Gets the target drift from a deployment.

Returns a dict with the target drift data.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | The DataRobot deployment ID. |

No config params are required; however, the optional params may be passed in the config as follows:

```json
"target_drift": { }
```

GetFeatureDriftOperator

Gets the feature drift from a deployment.

Returns a dict with the feature drift data.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | The DataRobot deployment ID. |

No config params are required; however, the optional params may be passed in the config as follows:

```json
"feature_drift": { }
```

GetServiceStatsOperator

Gets service stats measurements from a deployment.

Returns a dict with the service stats measurements data.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | The DataRobot deployment ID. |

No config params are required; however, the optional params may be passed in the config as follows:

```json
"service_stats": { }
```

GetAccuracyOperator

Gets the accuracy of a deployment’s predictions.

Returns a dict with the accuracy for a deployment.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | The DataRobot deployment ID. |

No config params are required; however, the optional params may be passed in the config as follows:

```json
"accuracy": { }
```

GetBiasAndFairnessSettingsOperator

Gets the bias and fairness settings for a deployment.

Returns a dict with the bias and fairness settings for a deployment.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | The DataRobot deployment ID. |

No config params are required.


UpdateBiasAndFairnessSettingsOperator

Updates the bias and fairness settings for a deployment.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | The DataRobot deployment ID. |

Sample config params:

```json
"protected_features": ["attribute1"],
"preferable_target_value": "True",
"fairness_metrics_set": "equalParity",
"fairness_threshold": 0.1
```

GetSegmentAnalysisSettingsOperator

Gets the segment analysis settings for a deployment.

Returns a dict with the segment analysis settings for a deployment.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | The DataRobot deployment ID. |

No config params are required.


UpdateSegmentAnalysisSettingsOperator

Updates the segment analysis settings for a deployment.

Parameters:

ParameterTypeDescription
deployment_idstrThe DataRobot deployment ID.

Sample config params:

```json
"segment_analysis_enabled": true,
"segment_analysis_attributes": ["attribute1", "attribute2"]
```

GetMonitoringSettingsOperator

Gets the monitoring settings for a deployment.

Returns a dict with the config params for a deployment.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | The DataRobot deployment ID. |

No config params are required.

Sample monitoring settings:

```json
{
    "drift_tracking_settings": { },
    "association_id_settings": { },
    "predictions_data_collection_settings": { }
}
```

| Dictionary | Description |
| --- | --- |
| drift_tracking_settings | The drift tracking settings for this deployment. |
| association_id_settings | The association ID settings for this deployment. |
| predictions_data_collection_settings | The predictions data collection settings of this deployment. |

UpdateMonitoringSettingsOperator

Updates monitoring settings for a deployment.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | The DataRobot deployment ID. |

Sample config params:

```json
"target_drift_enabled": true,
"feature_drift_enabled": true,
"association_id_column": ["id"],
"required_association_id": false,
"predictions_data_collection_enabled": false
```

BatchMonitoringOperator

Creates a batch monitoring job for the deployment.

Returns a batch monitoring job ID.

Prerequisites:

  • Use GetOrCreateCredentialOperator to pass a credential_id from the preconfigured DataRobot Credentials (Airflow Connections) or manually set the credential_id parameter in the config.

    Note: You can add S3 credentials to DataRobot via the Python API client.

  • Or use a Dataset ID from the DataRobot AI Catalog.

  • Or use a DataStore ID for a JDBC source connection; you can use GetOrCreateDataStoreOperator to pass datastore_id from a preconfigured Airflow Connection.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | The DataRobot deployment ID. |
| datastore_id | str | The DataRobot datastore ID. |
| credential_id | str | The DataRobot credentials ID. |

Sample config:

```json
"deployment_id": "61150a2fadb5586af4118980",
"monitoring_settings": {
    "intake_settings": {
        "type": "bigquery",
        "dataset": "integration_example_demo",
        "table": "actuals_demo",
        "bucket": "datarobot_demo_airflow"
    },
    "monitoring_columns": {
        "predictions_columns": [
            {"class_name": "True", "column_name": "target_True_PREDICTION"},
            {"class_name": "False", "column_name": "target_False_PREDICTION"}
        ],
        "association_id_column": "id",
        "actuals_value_column": "ACTUAL"
    }
}
```

Sample config: Manually set credential ID

```json
"deployment_id": "61150a2fadb5586af4118980",
"monitoring_settings": {
    "intake_settings": {
        "type": "bigquery",
        "dataset": "integration_example_demo",
        "table": "actuals_demo",
        "bucket": "datarobot_demo_airflow",
        "credential_id": "<credential_id>"
    },
    "monitoring_columns": {
        "predictions_columns": [
            {"class_name": "True", "column_name": "target_True_PREDICTION"},
            {"class_name": "False", "column_name": "target_False_PREDICTION"}
        ],
        "association_id_column": "id",
        "actuals_value_column": "ACTUAL"
    }
}
```

For more batch monitoring settings, see the DataRobot documentation.
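The predictions_columns entries in the sample above follow a `<target>_<class>_PREDICTION` naming pattern, so they can be generated rather than typed by hand. This helper is hypothetical and assumes that naming convention holds for your deployment:

```python
from typing import List


def prediction_columns(target: str, class_names: List[str]) -> List[dict]:
    """Build predictions_columns entries for monitoring_settings, assuming
    prediction columns are named <target>_<class>_PREDICTION."""
    return [
        {"class_name": name, "column_name": f"{target}_{name}_PREDICTION"}
        for name in class_names
    ]


cols = prediction_columns("target", ["True", "False"])
print(cols[0]["column_name"])  # target_True_PREDICTION
```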


DownloadModelScoringCodeOperator

Downloads scoring code artifact from a model.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| project_id | str | The DataRobot project ID. |
| model_id | str | The DataRobot model ID. |
| base_path | str | The base path for storing a downloaded model artifact. |

Sample config params:

```json
"source_code": false
```

For more scoring code download parameters, see the DataRobot documentation.


DownloadDeploymentScoringCodeOperator

Downloads scoring code artifact from a deployment.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | The DataRobot deployment ID. |
| base_path | str | The base path for storing a downloaded model artifact. |

Sample config params:

```json
"source_code": false,
"include_agent": false,
"include_prediction_explanations": false,
"include_prediction_intervals": false
```

For more scoring code download parameters, see the DataRobot documentation.


SubmitActualsFromCatalogOperator

Submits actuals from the DataRobot AI Catalog to a deployment.

Returns an actuals upload job ID.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | The DataRobot deployment ID. |
| dataset_id | str | The DataRobot AI Catalog dataset ID. |
| dataset_version_id | str | The DataRobot AI Catalog dataset version ID. |

Sample config params:

```json
"association_id_column": "id",
"actual_value_column": "ACTUAL",
"timestamp_column": "timestamp"
```

StartAutopilotOperator

Triggers DataRobot Autopilot to train a set of models.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| project_id | str | The DataRobot project ID. |
| featurelist_id | str | Specifies which feature list to use. |
| relationships_configuration_id | str | ID of the relationships configuration to use. |
| segmentation_task_id | str | ID of the segmentation task to use. |

Sample config params:

```
"autopilot_settings": {
    "target": "column_name",
    "mode": AUTOPILOT_MODE.QUICK
}
```

For more analyze_and_model parameters, see the DataRobot documentation.
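Since `target` is the only required key inside `autopilot_settings`, a DAG can fail fast before triggering Autopilot. This validation helper is a hypothetical sketch, not provider code:

```python
def validate_autopilot_settings(params: dict) -> str:
    """Return the modeling target, or raise if the DAG-run config lacks it."""
    settings = params.get("autopilot_settings") or {}
    target = settings.get("target")
    if not target:
        raise ValueError("autopilot_settings.target is required to start Autopilot")
    return target


print(validate_autopilot_settings({"autopilot_settings": {"target": "readmitted"}}))  # readmitted
```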


CreateExecutionEnvironmentOperator

Create an execution environment.

Returns an execution environment ID.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| name | str | The execution environment name. |
| description | str | The execution environment description. |
| programming_language | str | The programming language of the environment to be created. |

Sample config params:

```json
"execution_environment_name": "My Demo Env",
"custom_model_description": "This is a custom model created by Airflow",
"programming_language": "python"
```

For more execution environment creation parameters, see the DataRobot documentation.


CreateExecutionEnvironmentVersionOperator

Create an execution environment version.

Returns a version ID for the newly created execution environment.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| execution_environment_id | str | The ID of the execution environment. |
| docker_context_path | str | The file path to a Docker context archive or folder. |
| environment_version_label | str | A short, human-readable string to label the environment version. |
| environment_version_description | str | The execution environment version description. |

For more execution environment version creation parameters, see the DataRobot documentation.


CreateCustomInferenceModelOperator

Create a custom inference model.

Returns the ID for the created custom model.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| name | str | Name of the custom model. |
| description | str | Description of the custom model. |

Sample DAG config params:

| Parameter | Description |
| --- | --- |
| target_type | Target type of the custom inference model. One of `datarobot.TARGET_TYPE.BINARY`, `datarobot.TARGET_TYPE.REGRESSION`, `datarobot.TARGET_TYPE.MULTICLASS`, `datarobot.TARGET_TYPE.UNSTRUCTURED`. |
| target_name | Target feature name. Optional (ignored if provided) for the `datarobot.TARGET_TYPE.UNSTRUCTURED` target type. |
| programming_language | Programming language of the custom learning model. |
| positive_class_label | Custom inference model positive class label for binary classification. |
| negative_class_label | Custom inference model negative class label for binary classification. |
| prediction_threshold | Custom inference model prediction threshold. |
| class_labels | Custom inference model class labels for multiclass classification. |
| network_egress_policy | Determines whether the given custom model is isolated or can access the public network. |
| maximum_memory | The maximum memory that might be allocated by the custom model. |
| replicas | A fixed number of replicas that will be deployed in the cluster. |

For more custom inference model creation parameters, see the DataRobot documentation.
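Some of these keys only make sense together: class labels for binary targets, class_labels for multiclass. A consistency check could be sketched as below; the helper is hypothetical, the provider may validate differently, and the plain strings "Binary" and "Multiclass" stand in for the `datarobot.TARGET_TYPE` constants:

```python
def check_model_config(cfg: dict) -> None:
    """Cross-check DAG config keys for a custom inference model."""
    target_type = cfg.get("target_type")
    if target_type == "Binary":
        missing = [k for k in ("positive_class_label", "negative_class_label")
                   if k not in cfg]
        if missing:
            raise ValueError(f"Binary models need class labels, missing: {missing}")
    if target_type == "Multiclass" and not cfg.get("class_labels"):
        raise ValueError("Multiclass models need class_labels")


check_model_config({
    "target_type": "Binary",
    "positive_class_label": "True",
    "negative_class_label": "False",
})  # passes silently
```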


CreateCustomModelVersionOperator

Create a custom model version.

Returns the version ID for the created custom model.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| custom_model_id | str | The ID of the custom model. |
| base_environment_id | str | The ID of the base environment to use with the custom model version. |
| training_dataset_id | str | The ID of the training dataset to assign to the custom model. |
| holdout_dataset_id | str | The ID of the holdout dataset to assign to the custom model. |
| custom_model_folder | str | The path to a folder containing files to be uploaded. Each file in the folder is uploaded under a path relative to the folder path. |
| create_from_previous | bool | If set to True, this parameter creates a custom model version containing files from a previous version. |

Sample DAG config params:

| Parameter | Description |
| --- | --- |
| is_major_update | The flag defining whether the new custom model version is a major or a minor version. |
| files | A list of tuples, where the values in each tuple are the local filesystem path and the path the file should be placed at in the model. |
| files_to_delete | A list of file item IDs to be deleted. |
| network_egress_policy | Determines whether the given custom model is isolated or can access the public network. |
| maximum_memory | The maximum memory that might be allocated by the custom model. |
| replicas | A fixed number of replicas that will be deployed in the cluster. |
| required_metadata_values | Additional parameters required by the execution environment. |
| keep_training_holdout_data | Whether the version should inherit training and holdout data from the previous version. |

For more custom inference model creation parameters, see the DataRobot documentation.


CustomModelTestOperator

Create and start a custom model test.

Returns an ID for the custom model test.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| custom_model_id | str | The ID of the custom model. |
| custom_model_version_id | str | The ID of the custom model version. |
| dataset_id | str | The ID of the testing dataset for structured custom models. Ignored and not required for unstructured models. |

Sample DAG config params:

| Parameter | Description |
| --- | --- |
| network_egress_policy | Determines whether the given custom model is isolated or can access the public network. |
| maximum_memory | The maximum memory that might be allocated by the custom model. |
| replicas | A fixed number of replicas that will be deployed in the cluster. |

For more custom model test creation parameters, see the DataRobot documentation.


GetCustomModelTestOverallStatusOperator

Get the overall status for custom model tests.

Returns the custom model test status.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| custom_model_test_id | str | The ID of the custom model test. |

For more parameters for retrieving custom model test status, see the DataRobot documentation.


CreateCustomModelDeploymentOperator

Create a deployment from a DataRobot custom model image.

Returns the deployment ID.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| custom_model_version_id | str | The ID of the deployed custom model. |
| deployment_name | str | A human-readable label for the deployment. |
| default_prediction_server_id | str | An identifier for the default prediction server. |
| description | str | A human-readable description of the deployment. |
| importance | str | The deployment importance level. |

For more create_from_custom_model_version parameters, see the DataRobot documentation.


GetDeploymentModelOperator

Gets information about the deployment's current model.

Returns model information from a deployment.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | An identifier for the deployed model. |

For more get deployment parameters, see the DataRobot documentation.


ReplaceModelOperator

Replaces the current model for a deployment.

Returns model information for the model replacing the deployed model.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | An identifier for the deployed model. |
| new_model_id | str | The ID of the replacement model. If you are replacing the deployment's model with a custom inference model, you must use a specific custom model version ID. |
| reason | str | The reason for the model replacement. Must be one of 'ACCURACY', 'DATA_DRIFT', 'ERRORS', 'SCHEDULED_REFRESH', 'SCORING_SPEED', or 'OTHER'. This value is stored in the model history to keep track of why a model was replaced. |

For more replace_model parameters, see the DataRobot documentation.
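The allowed reason values form a closed set, so a DAG could validate them before calling the operator. The guard below is a hypothetical sketch built from the list documented above:

```python
ALLOWED_REASONS = {"ACCURACY", "DATA_DRIFT", "ERRORS",
                   "SCHEDULED_REFRESH", "SCORING_SPEED", "OTHER"}


def check_replacement_reason(reason: str) -> str:
    """Validate the model-replacement reason against the documented set."""
    if reason not in ALLOWED_REASONS:
        raise ValueError(
            f"invalid reason {reason!r}; expected one of {sorted(ALLOWED_REASONS)}"
        )
    return reason


print(check_replacement_reason("DATA_DRIFT"))  # DATA_DRIFT
```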


ActivateDeploymentOperator

Activate or deactivate a Deployment.

Returns the Deployment status (active or inactive).

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | An identifier for the deployed model. |
| activate | bool | If set to True, this parameter activates the deployment. Set to False to deactivate the deployment. |

For more activate deployment parameters, see the DataRobot documentation.


GetDeploymentStatusOperator

Get the deployment status (active or inactive).

Returns the deployment status.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| deployment_id | str | An identifier for the deployed model. |

For more deployment parameters, see the DataRobot documentation.


RelationshipsConfigurationOperator

Creates a relationship configuration.

Returns the relationships configuration ID.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| dataset_definitions | str | A list of dataset definitions. Each element is a dict retrieved from the DatasetDefinitionOperator operator. |
| relationships | str | A list of relationships. Each element is a dict retrieved from the DatasetRelationshipOperator operator. |
| feature_discovery_settings | str | Optional. A list of Feature Discovery settings. If not provided, the settings are retrieved from the DAG configuration parameters; if absent there as well, default settings are used. |

For more Feature Discovery parameters, see the DataRobot documentation.


DatasetDefinitionOperator

Creates a dataset definition for Feature Discovery.

Returns a dataset definition dict.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| dataset_identifier | str | The alias of the dataset, used directly as part of the generated feature names. |
| dataset_id | str | The identifier of the dataset in the AI Catalog. |
| dataset_version_id | str | The identifier of the dataset version in the AI Catalog. |
| primary_temporal_key | str | The name of the column indicating the time of record creation. |
| feature_list_id | str | Specifies the feature list to use. |
| snapshot_policy | str | The policy to use when creating a project or making predictions. If omitted, the endpoint uses 'latest' by default. |

For more information about creating dataset definitions and relationships using helper functions, see the DataRobot documentation.


DatasetRelationshipOperator

Create a relationship between datasets defined in DatasetDefinition.

Returns a dataset definition dict.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| dataset1_identifier | List[str] | The identifier of the first dataset in this relationship, as specified in the identifier field of the dataset_definition structure. If set to None, the relationship is with the primary dataset. |
| dataset2_identifier | List[str] | The identifier of the second dataset in this relationship, as specified in the identifier field of the dataset_definition schema. |
| dataset1_keys | List[str] | A list of strings (min length: 1, max length: 10). The column(s) from the first dataset used to join to the second dataset. |
| dataset2_keys | List[str] | A list of strings (min length: 1, max length: 10). The column(s) from the second dataset used to join to the first dataset. |
| feature_derivation_window_start | int | How many time units of each dataset's primary temporal key into the past, relative to the datetimePartitionColumn, the feature derivation window should begin. If present, it is a negative integer and the feature engineering graph performs time-aware joins. |
| feature_derivation_window_end | int | How many time units of each dataset's primary temporal key into the past, relative to the datetimePartitionColumn, the feature derivation window should end. If present, it is a non-positive integer and the feature engineering graph performs time-aware joins. |
| feature_derivation_window_time_unit | str | The unit of time for the feature derivation window; one of datarobot.enums.AllowedTimeUnitsSAFER. If present, time-aware joins are used. Only applicable when dataset1_identifier is not provided. |
| feature_derivation_windows | List | A list of feature derivation window settings. If present, time-aware joins are used. Only allowed when feature_derivation_window_start, feature_derivation_window_end, and feature_derivation_window_time_unit are not provided. |
| prediction_point_rounding | List[dict] | The closest value of prediction_point_rounding_time_unit to round the prediction point into the past when applying the feature derivation, if present. Only applicable when dataset1_identifier is not provided. |
| prediction_point_rounding_time_unit | str | The time unit of the prediction point rounding; one of datarobot.enums.AllowedTimeUnitsSAFER. Only applicable when dataset1_identifier is not provided. |

For more information about creating dataset definitions and relationships using helper functions, see the DataRobot documentation.
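The 1-to-10 bound on dataset1_keys and dataset2_keys can be enforced up front in a DAG. This is a hypothetical stdlib-only sketch of that check:

```python
from typing import List


def check_join_keys(dataset1_keys: List[str], dataset2_keys: List[str]) -> None:
    """Enforce the documented 1..10 length bound on join key lists."""
    for name, keys in (("dataset1_keys", dataset1_keys),
                       ("dataset2_keys", dataset2_keys)):
        if not 1 <= len(keys) <= 10:
            raise ValueError(
                f"{name} must contain between 1 and 10 columns, got {len(keys)}"
            )


check_join_keys(["customer_id"], ["cust_id"])  # OK
```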


ComputeFeatureImpactOperator

Creates a Feature Impact job in DataRobot.

Returns a Feature Impact job ID.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| project_id | str | DataRobot project ID. |
| model_id | str | DataRobot model ID. |

For more request_feature_impact parameters, see the DataRobot documentation.


ComputeFeatureEffectsOperator

Submit a request to compute Feature Effects for the model.

Returns the Feature Effects job ID.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| project_id | str | DataRobot project ID. |
| model_id | str | DataRobot model ID. |

For more request_feature_effect parameters, see the DataRobot documentation.


ComputeShapOperator

Submit a request to compute a SHAP impact job for the model.

Returns a SHAP impact job ID.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| project_id | str | DataRobot project ID. |
| model_id | str | DataRobot model ID. |

For more shap-impact parameters, see the DataRobot documentation.


CreateExternalModelPackageOperator

Create an external model package in DataRobot MLOps from JSON configuration.

Returns a model package ID of the newly created model package.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| model_info | str | A JSON object of external model parameters. |

Example of JSON configuration for a regression model:

```json
{
    "name": "Lending club regression",
    "modelDescription": {
        "description": "Regression on lending club dataset"
    },
    "target": {
        "type": "Regression",
        "name": "loan_amnt"
    }
}
```

Example JSON for a binary classification model:

```json
{
    "name": "Surgical Model",
    "modelDescription": {
        "description": "Binary classification on surgical dataset",
        "location": "/tmp/myModel"
    },
    "target": {
        "type": "Binary",
        "name": "complication",
        "classNames": ["Yes", "No"],
        "predictionThreshold": 0.5
    }
}
```

The minority/positive class should be listed first in classNames.

Example JSON for a multiclass classification model:

```json
{
    "name": "Iris classifier",
    "modelDescription": {
        "description": "Classification on iris dataset",
        "location": "/tmp/myModel"
    },
    "target": {
        "type": "Multiclass",
        "name": "Species",
        "classNames": [
            "Iris-versicolor",
            "Iris-virginica",
            "Iris-setosa"
        ]
    }
}
```
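The three model_info shapes above share a common skeleton, so they can be assembled programmatically. This builder is a hypothetical stdlib-only helper, not part of the provider:

```python
def external_model_info(name: str, description: str,
                        target_type: str, target_name: str,
                        **target_extra) -> dict:
    """Assemble a model_info payload for an external model package.
    Extra keyword args (classNames, predictionThreshold, ...) land
    inside the target object."""
    target = {"type": target_type, "name": target_name}
    target.update(target_extra)
    return {
        "name": name,
        "modelDescription": {"description": description},
        "target": target,
    }


info = external_model_info(
    "Surgical Model", "Binary classification on surgical dataset",
    "Binary", "complication",
    classNames=["Yes", "No"],  # minority/positive class listed first
    predictionThreshold=0.5,
)
```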

DeployModelPackageOperator

Create a deployment from a DataRobot model package.

Returns the created deployment ID.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `deployment_name` | str | A human-readable label for the deployment. |
| `model_package_id` | str | The ID of the DataRobot model package to deploy. |
| `default_prediction_server_id` | str | An identifier of a prediction server to use as the default prediction server. Do not provide this when working with prediction environments. |
| `prediction_environment_id` | str | An identifier of the prediction environment to use for model deployment. |
| `description` | str | A human-readable description of the deployment. |
| `importance` | str | The deployment importance level. |
| `user_provided_id` | str | A user-provided unique ID associated with a deployment definition in a remote git repository. |
| `additional_metadata` | Dict[str, str] | A dict of additional key/value metadata. |
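As the table notes, `default_prediction_server_id` and `prediction_environment_id` are alternatives. A small helper (purely illustrative, not part of the provider) could guard against passing both before constructing the operator:

```python
def validate_deploy_kwargs(deployment_name, model_package_id,
                           default_prediction_server_id=None,
                           prediction_environment_id=None, **extra):
    """Illustrative check: when a prediction environment is used,
    a default prediction server must not also be supplied."""
    if prediction_environment_id and default_prediction_server_id:
        raise ValueError(
            "Provide either default_prediction_server_id or "
            "prediction_environment_id, not both."
        )
    return {
        "deployment_name": deployment_name,
        "model_package_id": model_package_id,
        "default_prediction_server_id": default_prediction_server_id,
        "prediction_environment_id": prediction_environment_id,
        **extra,
    }
```

The returned dict could then be unpacked into `DeployModelPackageOperator(**kwargs, ...)`.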

AddExternalDatasetOperator

Upload a new dataset from the AI Catalog to make predictions for a model.

Returns an external dataset ID for the model.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `project_id` | str | DataRobot project ID. |
| `dataset_id` | str | DataRobot AI Catalog dataset ID. |
| `credential_id` | str | DataRobot credentials ID. |
| `dataset_version_id` | str | DataRobot AI Catalog dataset version ID. |

For more upload_dataset_from_catalog parameters, see the DataRobot documentation.


RequestModelPredictionsOperator

Request predictions against a previously uploaded dataset.

Returns a model predictions job ID.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `project_id` | str | DataRobot project ID. |
| `model_id` | str | DataRobot model ID. |
| `external_dataset_id` | str | DataRobot external dataset ID. |

For more request_predictions parameters, see the DataRobot documentation.


TrainModelOperator

Submit a job to the queue to train a model from a specific blueprint.

Returns a model training job ID.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `project_id` | str | DataRobot project ID. |
| `blueprint_id` | str | DataRobot blueprint ID. |
| `featurelist_id` | str | The identifier of the feature list to use. If not defined, the default feature list for this project is used. |
| `source_project_id` | str | The source project that created the `blueprint_id`. If None, it defaults to looking in this project. Note that you must have read permissions in this project. |

Example of DAG config params: `{"sample_pct": ..., "scoring_type": ..., "training_row_count": ..., "n_clusters": ...}`

For more model training parameters, see start-training-a-model in the DataRobot documentation.


RetrainModelOperator

Submit a job to the queue to retrain a model on a specific sample size and/or custom feature list.

Returns a model retraining job ID.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `project_id` | str | DataRobot project ID. |
| `model_id` | str | DataRobot model ID. |
| `featurelist_id` | str | The identifier of the feature list to use. If not defined, the default for this project is used. |

Example of DAG config params: `{"sample_pct": ..., "scoring_type": ..., "training_row_count": ...}`

For more retraining parameters, see train-a-model-on-a-different-sample-size in the DataRobot documentation.


PredictionExplanationsInitializationOperator

Initialize prediction explanations for a model.

Returns a prediction explanations initialization job ID.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `project_id` | str | DataRobot project ID. |
| `model_id` | str | DataRobot model ID. |

For more prediction-explanations parameters, see the DataRobot documentation.


ComputePredictionExplanationsOperator

Create prediction explanations for the specified dataset.

Returns a job ID for the prediction explanations for the specified dataset.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `project_id` | str | DataRobot project ID. |
| `model_id` | str | DataRobot model ID. |
| `external_dataset_id` | str | DataRobot external dataset ID. |

Example of DAG config params: `{"max_explanations": ..., "threshold_low": ..., "threshold_high": ...}`

For more prediction-explanations parameters, see the DataRobot documentation.
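These parameters can be supplied at trigger time as the DAG run configuration. A sketch with hypothetical values (only the key names come from the table above; the threshold semantics noted in the comments are assumptions):

```python
import json

# Hypothetical example values; only the key names are documented.
conf = {
    "max_explanations": 5,  # cap on explanations returned per prediction
    "threshold_low": 0.1,   # lower prediction threshold (assumed semantics)
    "threshold_high": 0.9,  # upper prediction threshold (assumed semantics)
}

# Airflow accepts such a payload as JSON, e.g. via
# `airflow dags trigger --conf '<json>' <dag_id>`.
conf_json = json.dumps(conf)
```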


Sensors

AutopilotCompleteSensor

Checks if Autopilot is complete.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `project_id` | str | The DataRobot project ID. |

ScoringCompleteSensor

Checks if batch scoring is complete.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `job_id` | str | The batch prediction job ID. |

MonitoringJobCompleteSensor

Checks if a monitoring job is complete.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `job_id` | str | The batch monitoring job ID. |

BaseAsyncResolutionSensor

Checks if the DataRobot Async API call is complete.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `job_id` | str | The DataRobot async API call status check ID. |

DataRobotJobSensor

Checks whether a DataRobot job is complete.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `project_id` | str | DataRobot project ID. |
| `job_id` | str | DataRobot job ID. |

ModelTrainingJobSensor

Checks whether a DataRobot model training job is complete.

Returns `False` while the job has not yet completed; when model training finishes, returns `PokeReturnValue(True, trained_model.id)`.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `project_id` | str | DataRobot project ID. |
| `job_id` | str | DataRobot job ID. |

Hooks

DataRobotHook

A hook to initialize the DataRobot Public API client.


Pipeline

The modules described above allow you to construct a standard DataRobot pipeline in an Airflow DAG:

`create_project_op >> train_models_op >> autopilot_complete_sensor >> deploy_model_op >> score_predictions_op >> scoring_complete_sensor`

Example DAGS

The datarobot_provider/example_dags directory contains the following example DAGs, including examples that use a preconfigured connection:

| Example DAG | Description |
|-------------|-------------|
| datarobot_pipeline_dag.py | Run the basic end-to-end workflow in DataRobot. |
| datarobot_score_dag.py | Perform DataRobot batch scoring. |
| datarobot_jdbc_batch_scoring_dag.py | Perform DataRobot batch scoring with a JDBC data source. |
| datarobot_aws_s3_batch_scoring_dag.py | Use DataRobot AWS Credentials with ScorePredictionsOperator. |
| datarobot_gcp_storage_batch_scoring_dag.py | Use DataRobot GCP Credentials with ScorePredictionsOperator. |
| datarobot_bigquery_batch_scoring_dag.py | Use DataRobot GCP Credentials with ScorePredictionsOperator. |
| datarobot_azure_storage_batch_scoring_dag.py | Use DataRobot Azure Storage Credentials with ScorePredictionsOperator. |
| datarobot_jdbc_dataset_dag.py | Upload a dataset to the AI Catalog through a JDBC connection. |
| datarobot_batch_monitoring_job_dag.py | Run a batch monitoring job. |
| datarobot_create_project_from_ai_catalog_dag.py | Create a DataRobot project from a DataRobot AI Catalog dataset. |
| datarobot_create_project_from_dataset_version_dag.py | Create a DataRobot project from a specific dataset version in the DataRobot AI Catalog. |
| datarobot_dataset_new_version_dag.py | Create a new version of an existing dataset in the AI Catalog. |
| datarobot_dataset_upload_dag.py | Upload a local file to the AI Catalog. |
| datarobot_get_datastore_dag.py | Create a DataRobot data store with GetOrCreateDataStoreOperator. |
| datarobot_jdbc_dataset_dag.py | Create a DataRobot project from a JDBC data source. |
| datarobot_jdbc_dynamic_dataset_dag.py | Create a DataRobot project from a JDBC dynamic data source. |
| datarobot_upload_actuals_catalog_dag.py | Upload actuals from the DataRobot AI Catalog. |
| deployment_service_stats_dag.py | Get a deployment's service statistics with GetServiceStatsOperator. |
| deployment_stat_and_accuracy_dag.py | Get a deployment's service statistics and accuracy. |
| deployment_update_monitoring_settings_dag.py | Update a deployment's monitoring settings. |
| deployment_update_segment_analysis_settings_dag.py | Update a deployment's segment analysis settings. |
| download_scoring_code_from_deployment_dag.py | Download a Scoring Code JAR file from a DataRobot deployment. |
| advanced_datarobot_pipeline_jdbc_dag.py | Run the advanced end-to-end workflow in DataRobot. |
| datarobot_autopilot_options_pipeline_dag.py | Create a DataRobot project and start Autopilot with advanced options. |
| datarobot_custom_model_pipeline_dag.py | Create an end-to-end workflow with custom models in DataRobot. |
| datarobot_custom_partitioning_pipeline_dag.py | Create a custom partitioned project and train models. |
| datarobot_datetime_partitioning_pipeline_dag.py | Create a datetime partitioned project. |
| datarobot_external_model_pipeline_dag.py | Create an end-to-end workflow with external models in DataRobot. |
| datarobot_feature_discovery_pipeline_dag.py | Create a Feature Discovery project and train models. |
| datarobot_timeseries_pipeline_dag.py | Create a time series DataRobot project. |
| deployment_activate_deactivate_dag.py | Activate or deactivate a deployment and get its status. |
| deployment_replace_model_dag.py | Replace the model for a deployment. |
| model_compute_insights_dag.py | Compute Feature Impact and Feature Effects for a model. |
| model_compute_prediction_explanations_dag.py | Run a compute prediction explanations job. |
| model_compute_predictions_dag.py | Compute predictions for a model. |
| model_compute_shap_dag.py | Compute SHAP values for a model. |
| model_retrain_dag.py | Retrain a model on a specific sample size and/or feature list. |
| model_train_dag.py | Train a model from a specific blueprint. |

The advanced end-to-end workflow in DataRobot (advanced_datarobot_pipeline_jdbc_dag.py) contains the following steps:

  • Ingest a dataset into the AI Catalog from a JDBC data source
  • Create a DataRobot project
  • Train models using Autopilot
  • Deploy the recommended model
  • Update deployment settings (enable monitoring settings, segment analysis, and bias and fairness)
  • Run batch scoring using a JDBC data source
  • Upload actuals from a JDBC data source
  • Collect deployment metrics (service statistics, feature drift, target drift, and accuracy) and process them with a custom Python operator
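The final step could be implemented as a plain Python callable wrapped in Airflow's PythonOperator. The sketch below is purely illustrative; the payload field names are assumptions, not the provider's actual output format:

```python
# A hypothetical callable in the spirit of the last step above: it could
# be wrapped in a PythonOperator to post-process deployment metrics
# returned by the upstream tasks. The payload shapes are illustrative.
def summarize_deployment_metrics(service_stats, accuracy):
    """Flatten selected metrics into a single report dict."""
    report = {
        "total_predictions": service_stats.get("totalPredictions", 0),
        "median_latency_ms": service_stats.get("medianLatency"),
    }
    # Copy over whichever accuracy metrics are present (e.g. RMSE, LogLoss).
    for metric, value in accuracy.items():
        report[f"accuracy_{metric}"] = value
    return report
```

A callable like this would receive the upstream operators' results (e.g. via XCom) and return a combined report for downstream tasks.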

Issues

Please submit issues and pull requests in our official repo: https://github.com/datarobot/airflow-provider-datarobot

We are happy to hear from you. Please email any feedback to the authors at support@datarobot.com.

Copyright Notice

Copyright 2023 DataRobot, Inc. and its affiliates.

All rights reserved.

This is proprietary source code of DataRobot, Inc. and its affiliates.

Released under the terms of DataRobot Tool and Utility Agreement.