The User’s API

miraiml provides the following components:

miraiml.SearchSpace

class miraiml.SearchSpace(id, model_class, parameters_values=None, parameters_rules=lambda x: None)

This class represents the search space of hyperparameters for a base model.

Parameters:
  • id (str) – The id that will be associated with the models generated within this search space.
  • model_class (type) – Any class that represents a statistical model. It must implement fit, plus predict for regression problems or predict_proba for classification problems.
  • parameters_values (dict, optional, default=None) – A dictionary containing lists of values to be tested as parameters when instantiating objects of model_class for id.
  • parameters_rules (function, optional, default=lambda x: None) –

    A function that constrains certain parameters because of the values assumed by others. It must receive a dictionary as input and doesn’t need to return anything. Not used if parameters_values has no keys.

    Warning

    Make sure that the parameters accessed in parameters_rules exist in the set of parameters defined on parameters_values, otherwise the engine will attempt to access an invalid key.

Raises:

NotImplementedError if model_class does not implement fit, or implements neither predict nor predict_proba.

Raises:

TypeError if some parameter is of a prohibited type.

Raises:

ValueError if a provided id is not allowed.

Example:
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from miraiml import SearchSpace

>>> def logistic_regression_parameters_rules(parameters):
...     if parameters['solver'] in ['newton-cg', 'sag', 'lbfgs']:
...         parameters['penalty'] = 'l2'

>>> search_space = SearchSpace(
...     id = 'Logistic Regression',
...     model_class = LogisticRegression,
...     parameters_values = {
...         'penalty': ['l1', 'l2'],
...         'C': np.arange(0.1, 2, 0.1),
...         'max_iter': np.arange(50, 300),
...         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
...         'random_state': [0]
...     },
...     parameters_rules = logistic_regression_parameters_rules
... )

Warning

Do not allow random_state to assume multiple values. If model_class has a random_state parameter, force the engine to always choose the same value by providing a list with a single element.

Allowing random_state to assume multiple values will confuse the engine because the scores will be unstable even with the same choice of hyperparameters and features.
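
To make the role of parameters_rules more concrete, the sketch below draws one combination from parameters_values and then lets the rules function adjust it in place. The helper draw_parameters is hypothetical and exists only for illustration; it is not how miraiml samples parameters internally.

```python
import random

def logistic_regression_parameters_rules(parameters):
    # These solvers only support the 'l2' penalty, so override the drawn value.
    if parameters['solver'] in ['newton-cg', 'sag', 'lbfgs']:
        parameters['penalty'] = 'l2'

def draw_parameters(parameters_values, parameters_rules, rng):
    # Hypothetical helper: pick one value per parameter, then apply the rules.
    parameters = {name: rng.choice(list(values))
                  for name, values in parameters_values.items()}
    parameters_rules(parameters)  # mutates the dictionary; return value is ignored
    return parameters

parameters_values = {
    'penalty': ['l1', 'l2'],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'random_state': [0],  # a single value keeps the scores reproducible
}

rng = random.Random(0)
samples = [draw_parameters(parameters_values,
                           logistic_regression_parameters_rules, rng)
           for _ in range(50)]
```

Every dictionary in samples respects the constraint, and random_state is always 0 because its list has a single element.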

miraiml.Config

class miraiml.Config(local_dir, problem_type, score_function, search_spaces, use_all_features=False, n_folds=5, stratified=True, ensemble_id=None, stagnation=60)

This class defines the general behavior of the engine.

Parameters:
  • local_dir (str) – The name of the folder in which the engine will save its internal files. If the directory doesn’t exist, it will be created automatically. local_dir must not contain '..' or '/'.
  • problem_type (str) – The problem type: 'classification' or 'regression'. Multi-class classification problems are not supported.
  • search_spaces (list) – The list of miraiml.SearchSpace objects to optimize. If search_spaces has length 1, the engine will not run ensemble cycles.
  • score_function (function) – A function that receives the “truth” and the predictions (in this order) and returns the score. Bigger scores must mean better models.
  • use_all_features (bool, optional, default=False) – Whether to force MiraiML to always use all features or not.
  • n_folds (int, optional, default=5) – The number of folds for the fitting/predicting process. The minimum value allowed is 2.
  • stratified (bool, optional, default=True) – Whether to stratify folds on target or not. Only used if problem_type == 'classification'.
  • ensemble_id (str, optional, default=None) – The id for the ensemble. If none is given, the engine will not ensemble base models.
  • stagnation (int or float, optional, default=60) –

    The amount of time (in minutes) for the engine to automatically interrupt itself if no improvement happens. Negative numbers are interpreted as “infinite”.

    Warning

    Stagnation checks only happen after the engine finishes at least one optimization cycle. In other words, every base model and the ensemble (if set) must be scored at least once.

Raises:

NotImplementedError if a model class does not implement the proper method for prediction.

Raises:

TypeError if some parameter is not of its allowed type.

Raises:

ValueError if some parameter has an invalid value.

Example:
>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.tree import DecisionTreeClassifier
>>> from miraiml import SearchSpace, Config

>>> search_spaces = [
...     SearchSpace('Naive Bayes', GaussianNB),
...     SearchSpace('Decision Tree', DecisionTreeClassifier)
... ]

>>> config = Config(
...     local_dir = 'miraiml_local',
...     problem_type = 'classification',
...     score_function = roc_auc_score,
...     search_spaces = search_spaces,
...     use_all_features = False,
...     n_folds = 5,
...     stratified = True,
...     ensemble_id = 'Ensemble',
...     stagnation = -1
... )
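
Since bigger scores must mean better models, error metrics (where smaller is better) have to be flipped before they are used as score_function. The sketch below shows one way to do that for regression; the name negative_rmse is hypothetical and not part of miraiml.

```python
import math

def negative_rmse(y_true, y_pred):
    # Root mean squared error, negated so that larger values mean better models.
    # The "truth" comes first and the predictions second, as score_function expects.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return -math.sqrt(sum(squared_errors) / len(squared_errors))

truth = [3.0, 5.0, 7.0]
good = negative_rmse(truth, [3.0, 5.0, 7.5])  # small error, score close to zero
bad = negative_rmse(truth, [0.0, 0.0, 0.0])   # large error, very negative score
```

Here good is greater than bad, so a function like this could be passed as score_function together with problem_type = 'regression'.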

miraiml.Engine

class miraiml.Engine(config, on_improvement=None)

This class offers the controls for the engine.

Parameters:
  • config (miraiml.Config) – The configurations for the behavior of the engine.
  • on_improvement (function, optional, default=None) – A function that will be executed every time the engine finds an improvement for some id. It must receive a status parameter, which is the return value of the method request_status() (an instance of miraiml.Status).
Raises:

TypeError if config is not an instance of miraiml.Config or on_improvement (if provided) is not callable.

Example:
>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.tree import DecisionTreeClassifier
>>> from miraiml import SearchSpace, Config, Engine

>>> search_spaces = [
...     SearchSpace('Naive Bayes', GaussianNB),
...     SearchSpace('Decision Tree', DecisionTreeClassifier)
... ]

>>> config = Config(
...     local_dir = 'miraiml_local',
...     problem_type = 'classification',
...     score_function = roc_auc_score,
...     search_spaces = search_spaces,
...     ensemble_id = 'Ensemble'
... )

>>> def on_improvement(status):
...     print('Scores:', status.scores)

>>> engine = Engine(config, on_improvement=on_improvement)
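
A callback can also keep a record of improvements instead of printing them. The sketch below logs the best id and its score on each call. It assumes that best_id is always a key of scores, and it uses a stand-in object in place of a real miraiml.Status so that it runs without an engine.

```python
from types import SimpleNamespace

improvements = []

def on_improvement(status):
    # Record the best id and its score each time the engine reports progress.
    improvements.append((status.best_id, status.scores[status.best_id]))

# Stand-in for demonstration only; the engine passes a miraiml.Status instead.
fake_status = SimpleNamespace(
    best_id='Ensemble',
    scores={'Naive Bayes': 0.81, 'Decision Tree': 0.78, 'Ensemble': 0.84},
)
on_improvement(fake_status)
```

After this call, improvements holds a single entry for 'Ensemble'. The same function can be passed as on_improvement when instantiating the engine.
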

  • is_running – Tells whether the engine is running or not.
  • interrupt – Makes the engine stop on the first opportunity.
  • load_train_data – Interrupts the engine and loads the train dataset.
  • load_test_data – Interrupts the engine and loads the test dataset.
  • shuffle_train_data – Interrupts the engine and shuffles the training data.
  • reconfigure – Interrupts the engine and loads a new configuration.
  • restart – Interrupts the engine and starts again from last checkpoint (if any).
  • request_status – Queries the current status of the engine.

is_running()

Tells whether the engine is running or not.

Return type: bool
Returns: True if the engine is running and False otherwise.

interrupt()

Makes the engine stop on the first opportunity.

Note

This method is not asynchronous. It will wait until the engine stops.

load_train_data(train_data, target_column, restart=False)

Interrupts the engine and loads the train dataset. All of its column names must be instances of either str or int.

Warning

Loading new training data will always trigger the loss of history for optimization.

Parameters:
  • train_data (pandas.DataFrame) – The training data.
  • target_column (str or int) – The target column identifier.
  • restart (bool, optional, default=False) – Whether to restart the engine after updating data or not.
Raises:

TypeError if train_data is not an instance of pandas.DataFrame.

Raises:

ValueError if target_column is not a column of train_data or if some column name is of a prohibited type.

load_test_data(test_data, restart=False)

Interrupts the engine and loads the test dataset. All of its columns must be columns in the train data.

The test dataset is the one for which we don’t have the values for the target column. This method should be used to load data in production.

Warning

This method can only be called after miraiml.Engine.load_train_data().

Parameters:
  • test_data (pandas.DataFrame, optional, default=None) – The testing data. Use the default value if you don’t need to make predictions for data with unknown labels.
  • restart (bool, optional, default=False) – Whether to restart the engine after loading data or not.
Raises:

RuntimeError if this method is called before loading the train data.

Raises:

ValueError if the column names are not consistent.
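
Because every test column must also be a train column, it can help to compare the column names before loading. The sketch below is a hypothetical pre-check written for this page; it is not part of miraiml and does not reproduce the engine's own validation.

```python
def columns_missing_from_train(train_columns, test_columns):
    # Return the test columns that do not appear in the train data.
    known = set(train_columns)
    return [column for column in test_columns if column not in known]

# Hypothetical column names, used only for illustration.
train_columns = ['age', 'fare', 'survived']
missing = columns_missing_from_train(train_columns, ['age', 'cabin'])
```

With pandas objects, the same function can receive train_data.columns and test_data.columns. An empty result means the column names are consistent.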

clean_test_data(restart=False)

Cleans the test data from the buffer.

Note

Keep in mind that if you don’t intend to make predictions for unlabeled data, the engine will run faster with a clean test data buffer.

Parameters: restart (bool, optional, default=False) – Whether to restart the engine after cleaning test data or not.

shuffle_train_data(restart=False)

Interrupts the engine and shuffles the training data.

Parameters: restart (bool, optional, default=False) – Whether to restart the engine after shuffling data or not.
Raises: RuntimeError if the engine has no data loaded.

Note

It’s a good practice to shuffle the training data periodically to avoid overfitting on a particular folding pattern.

reconfigure(config, restart=False)

Interrupts the engine and loads a new configuration.

Warning

Reconfiguring the engine will always trigger the loss of history for optimization.

Parameters:
  • config (miraiml.Config) – The configurations for the behavior of the engine.
  • restart (bool, optional, default=False) – Whether to restart the engine after reconfiguring it or not.

restart()

Interrupts the engine and starts again from last checkpoint (if any). It is also used to start the engine for the first time.

Raises: RuntimeError if no data is loaded.

request_status()

Queries the current status of the engine.

Return type: miraiml.Status
Returns: The current status of the engine as an instance of miraiml.Status. If no score has been computed yet, returns None.

miraiml.Status

class miraiml.Status(**kwargs)

Represents the current status of the engine. Objects of this class are not supposed to be instantiated by the user. Rather, they are returned by the miraiml.Engine.request_status() method.

The following attributes are accessible:

  • best_id: the id of the best base model (or ensemble)
  • scores: a dictionary containing the current score of each id
  • train_predictions: a pandas.DataFrame object containing the predictions for the train data for each id
  • test_predictions: a pandas.DataFrame object containing the predictions for the test data for each id
  • ensemble_weights: a dictionary containing the ensemble weights for each base model id
  • base_models: a dictionary containing the characteristics of each base model (accessed by its respective id)
  • histories: a dictionary of pandas.DataFrame objects for each id, containing the history of base models attempts and their respective scores. Hyperparameters columns end with the '__(hyperparameter)' suffix and features columns end with the '__(feature)' suffix. The score column can be accessed with the key 'score'. For more information, please check the User Guide.

The characteristics of each base model are represented by dictionaries containing the following keys:

  • 'model_class': The name of the base model’s modeling class
  • 'parameters': The dictionary of hyperparameters values
  • 'features': The list of features used
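
The suffix convention used in histories makes it easy to separate hyperparameter columns from feature columns. The sketch below works on plain column names; the helper split_history_columns and the sample names are hypothetical.

```python
def split_history_columns(columns):
    # Group column names by the suffixes used in the history tables.
    names = [str(column) for column in columns]
    hyperparameters = [n for n in names if n.endswith('__(hyperparameter)')]
    features = [n for n in names if n.endswith('__(feature)')]
    return hyperparameters, features

# Hypothetical column names, shaped like the ones described for histories.
columns = ['score', 'max_depth__(hyperparameter)',
           'age__(feature)', 'fare__(feature)']
hyperparameters, features = split_history_columns(columns)
```

With a real status object, the input would be something like status.histories['Decision Tree'].columns, where 'Decision Tree' is the id of a search space.
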

build_report(include_features=False)

Returns the report of the current status of the engine in a formatted string.

Parameters: include_features (bool, optional, default=False) – Whether to include the list of features on the report or not (may cause some visual mess).
Return type: str
Returns: The formatted report.

miraiml.pipeline

miraiml.pipeline contains a function that lets you build your own pipeline classes. It also contains a few pre-defined pipelines for baselines.

  • compose – A function that defines pipeline classes dynamically.
  • NaiveBayesBaseliner – This is a baseline pipeline for classification problems.
  • LinearRegressionBaseliner – This is a baseline pipeline for regression problems.

miraiml.pipeline.compose(steps)

A function that defines pipeline classes dynamically. It builds a pipeline class that can be instantiated with particular parameters for each of its transformers/estimator without needing to call set_params, as you would with scikit-learn’s Pipeline when performing hyperparameter optimization.

Similarly to scikit-learn’s Pipeline, steps is a list of tuples containing an alias and the respective pipeline element. However, since this function is a class factory, you shouldn’t instantiate the transformers/estimator as you would with scikit-learn’s Pipeline. This is how compose() should be called:

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.preprocessing import StandardScaler

>>> from miraiml.pipeline import compose

>>> MyPipelineClass = compose(
...     steps = [
...         ('scaler', StandardScaler), # StandardScaler instead of StandardScaler()
...         ('rfc', RandomForestClassifier) # No instantiation either
...     ]
... )

And then, in order to instantiate MyPipelineClass with the desired parameters, you just need to refer to them as a concatenation of their respective class aliases and their names, separated by '__'.

>>> pipeline = MyPipelineClass(scaler__with_mean=False, rfc__max_depth=3)
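
The naming scheme can be pictured as splitting each keyword at its first '__' to recover the alias and the parameter name. The sketch below uses a hypothetical helper, group_by_alias, to show the idea; it is not miraiml's implementation.

```python
def group_by_alias(**kwargs):
    # Split each 'alias__parameter' keyword at the first '__' occurrence.
    grouped = {}
    for key, value in kwargs.items():
        alias, parameter = key.split('__', 1)
        grouped.setdefault(alias, {})[parameter] = value
    return grouped

grouped = group_by_alias(scaler__with_mean=False, rfc__max_depth=3)
```

Here grouped maps 'scaler' to {'with_mean': False} and 'rfc' to {'max_depth': 3}. A split like this would be ambiguous if an alias itself contained '__'.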

If you want to know which parameters you’re allowed to play with, just call get_params:

>>> params = pipeline.get_params()
>>> print("\n".join(params))
scaler__with_mean
scaler__with_std
rfc__bootstrap
rfc__class_weight
rfc__criterion
rfc__max_depth
rfc__max_features
rfc__max_leaf_nodes
rfc__min_impurity_decrease
rfc__min_impurity_split
rfc__min_samples_leaf
rfc__min_samples_split
rfc__min_weight_fraction_leaf
rfc__n_estimators
rfc__n_jobs
rfc__oob_score
rfc__random_state
rfc__verbose
rfc__warm_start

You can check the available methods for your instantiated pipelines in the documentation for miraiml.core.BasePipelineClass, which is the class from which the composed classes inherit.

The intended purpose of such pipeline classes is that they can work as base models to build instances of miraiml.SearchSpace.

>>> from miraiml import SearchSpace

>>> search_space = SearchSpace(
...     id='MyPipelineClass',
...     model_class=MyPipelineClass,
...     parameters_values=dict(
...         scaler__with_mean=[True, False],
...         scaler__with_std=[True, False],
...         rfc__max_depth=[3, 4, 5, 6]
...     )
... )
Parameters: steps (list) – The list of pairs (alias, class) to define the pipeline.

Warning

Repeated aliases are not allowed and none of the aliases can start with numbers or contain '__'.

The classes used to compose a pipeline must implement get_params and set_params, such as scikit-learn’s classes, or compose() will break.

Return type: type
Returns: The composed pipeline class.
Raises: TypeError if an alias is not a string.
Raises: ValueError if an alias has an invalid name.
Raises: NotImplementedError if some class of the pipeline does not implement the required methods.
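
The alias restrictions for steps can be checked before calling compose(). The sketch below is a hypothetical pre-check that mirrors the documented rules. Raising ValueError for repeated aliases is an assumption, and the function is not miraiml's own validation code.

```python
def check_aliases(steps):
    # Hypothetical pre-check mirroring the documented alias restrictions.
    seen = set()
    for alias, _ in steps:
        if not isinstance(alias, str):
            raise TypeError('alias must be a string: {!r}'.format(alias))
        # Aliases cannot start with a number or contain '__'.
        if alias[:1].isdigit() or '__' in alias:
            raise ValueError('invalid alias: {!r}'.format(alias))
        # Assumption: repeated aliases are reported as ValueError too.
        if alias in seen:
            raise ValueError('repeated alias: {!r}'.format(alias))
        seen.add(alias)

# object is a placeholder for real transformer/estimator classes.
check_aliases([('scaler', object), ('rfc', object)])
```

The call above passes silently, whereas aliases such as '1st' or 'my__scaler' would raise ValueError.
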
class miraiml.pipeline.NaiveBayesBaseliner

This is a baseline pipeline for classification problems. It’s composed of the following transformers/estimator:

  1. sklearn.preprocessing.OneHotEncoder
  2. sklearn.impute.SimpleImputer
  3. sklearn.preprocessing.MinMaxScaler
  4. sklearn.naive_bayes.GaussianNB

The available parameters to tweak are:

>>> from miraiml.pipeline import NaiveBayesBaseliner

>>> for param in NaiveBayesBaseliner().get_params():
...     print(param)
...
ohe__categorical_features
ohe__categories
ohe__drop
ohe__dtype
ohe__handle_unknown
ohe__n_values
ohe__sparse
impute__add_indicator
impute__fill_value
impute__missing_values
impute__strategy
impute__verbose
min_max__feature_range
naive__priors
naive__var_smoothing

class miraiml.pipeline.LinearRegressionBaseliner

This is a baseline pipeline for regression problems. It’s composed of the following transformers/estimator:

  1. sklearn.preprocessing.OneHotEncoder
  2. sklearn.impute.SimpleImputer
  3. sklearn.preprocessing.MinMaxScaler
  4. sklearn.linear_model.LinearRegression

The available parameters to tweak are:

>>> from miraiml.pipeline import LinearRegressionBaseliner

>>> for param in LinearRegressionBaseliner().get_params():
...     print(param)
...
ohe__categorical_features
ohe__categories
ohe__drop
ohe__dtype
ohe__handle_unknown
ohe__n_values
ohe__sparse
impute__add_indicator
impute__fill_value
impute__missing_values
impute__strategy
impute__verbose
min_max__feature_range
lin_reg__fit_intercept
lin_reg__n_jobs
lin_reg__normalize