The User’s API

miraiml provides the following components:

miraiml.SearchSpace represents the search space for a base model
miraiml.Config defines the general behavior for miraiml.Engine
miraiml.Engine manages the optimization process
miraiml.pipeline has some features related to pipelines (hot!)

miraiml.SearchSpace

class miraiml.SearchSpace(id, model_class, parameters_values=None, parameters_rules=lambda x: None)

This class represents the search space of hyperparameters for a base model.

Parameters:	id (str) – The id that will be associated with the models generated within this search space.
	model_class (type) – Any class that represents a statistical model. It must implement the methods `fit` as well as `predict` for regression or `predict_proba` for classification problems.
	parameters_values (dict, optional, default=None) – A dictionary containing lists of values to be tested as parameters when instantiating objects of `model_class` for `id`.
	parameters_rules (function, optional, default=lambda x: None) – A function that constrains certain parameters because of the values assumed by others. It must receive a dictionary as input and doesn’t need to return anything. Not used if `parameters_values` has no keys.
	Warning: Make sure that the parameters accessed in `parameters_rules` exist in the set of parameters defined on `parameters_values`, otherwise the engine will attempt to access an invalid key.
Raises:	`NotImplementedError` if `model_class` does not implement `fit`, or implements neither `predict` nor `predict_proba`.
Raises:	`TypeError` if some parameter is of a prohibited type.
Raises:	`ValueError` if a provided `id` is not allowed.
Example:

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from miraiml import SearchSpace

>>> def logistic_regression_parameters_rules(parameters):
...     if parameters['solver'] in ['newton-cg', 'sag', 'lbfgs']:
...         parameters['penalty'] = 'l2'

>>> search_space = SearchSpace(
...     id = 'Logistic Regression',
...     model_class = LogisticRegression,
...     parameters_values = {
...         'penalty': ['l1', 'l2'],
...         'C': np.arange(0.1, 2, 0.1),
...         'max_iter': np.arange(50, 300),
...         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
...         'random_state': [0]
...     },
...     parameters_rules = logistic_regression_parameters_rules
... )
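
Because `parameters_rules` receives a plain dictionary and modifies it in place, a rule function can be checked on its own before it is handed to the engine. The sketch below repeats the rule from the example and applies it to two hand-written dictionaries. These dictionaries are illustrative stand-ins, not something produced by MiraiML.

```python
def logistic_regression_parameters_rules(parameters):
    # These solvers do not support the 'l1' penalty, so force 'l2'
    # in place whenever one of them has been chosen.
    if parameters['solver'] in ['newton-cg', 'sag', 'lbfgs']:
        parameters['penalty'] = 'l2'


# Hand-written candidates standing in for a sampled set of parameters.
candidate = {'penalty': 'l1', 'C': 0.5, 'solver': 'lbfgs'}
logistic_regression_parameters_rules(candidate)
print(candidate['penalty'])  # l2

untouched = {'penalty': 'l1', 'C': 0.5, 'solver': 'liblinear'}
logistic_regression_parameters_rules(untouched)
print(untouched['penalty'])  # l1
```

Note that the rule reads `parameters['solver']` directly, so it relies on `'solver'` being one of the keys of `parameters_values`.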

Warning

Do not allow random_state to assume multiple values. If model_class has a random_state parameter, force the engine to always choose the same value by providing a list with a single element.

Allowing random_state to assume multiple values will confuse the engine because the scores will be unstable even with the same choice of hyperparameters and features.
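
A quick way to guard against this mistake is to inspect the dictionary before building the search space. The helper below is a hypothetical convenience, not part of MiraiML. It only checks that `random_state`, when present, has exactly one candidate value.

```python
def has_fixed_random_state(parameters_values):
    """Return True if 'random_state' is absent or has exactly one candidate value."""
    values = parameters_values.get('random_state')
    return values is None or len(values) == 1


print(has_fixed_random_state({'C': [0.1, 1.0], 'random_state': [0]}))     # True
print(has_fixed_random_state({'C': [0.1, 1.0], 'random_state': [0, 1]}))  # False
```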

miraiml.Config

class miraiml.Config(local_dir, problem_type, score_function, search_spaces, use_all_features=False, n_folds=5, stratified=True, ensemble_id=None, stagnation=60)

This class defines the general behavior of the engine.

Parameters:	local_dir (str) – The name of the folder in which the engine will save its internal files. If the directory doesn’t exist, it will be created automatically. `..` and `/` are not allowed to compose `local_dir`.
	problem_type (str) – `'classification'` or `'regression'`. The problem type. Multi-class classification problems are not supported.
	search_spaces (list) – The list of `miraiml.SearchSpace` objects to optimize. If `search_spaces` has length 1, the engine will not run ensemble cycles.
	score_function (function) – A function that receives the “truth” and the predictions (in this order) and returns the score. Bigger scores must mean better models.
	use_all_features (bool, optional, default=False) – Whether to force MiraiML to always use all features or not.
	n_folds (int, optional, default=5) – The number of folds for the fitting/predicting process. The minimum value allowed is 2.
	stratified (bool, optional, default=True) – Whether to stratify folds on target or not. Only used if `problem_type == 'classification'`.
	ensemble_id (str, optional, default=None) – The id for the ensemble. If none is given, the engine will not ensemble base models.
	stagnation (int or float, optional, default=60) – The amount of time (in minutes) for the engine to automatically interrupt itself if no improvement happens. Negative numbers are interpreted as “infinite”.
	Warning: Stagnation checks only happen after the engine finishes at least one optimization cycle. In other words, every base model and the ensemble (if set) must be scored at least once.
Raises:	`NotImplementedError` if a model class does not implement the proper method for prediction.
Raises:	`TypeError` if some parameter is not of its allowed type.
Raises:	`ValueError` if some parameter has an invalid value.
Example:

>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.tree import DecisionTreeClassifier
>>> from miraiml import SearchSpace, Config

>>> search_spaces = [
...     SearchSpace('Naive Bayes', GaussianNB),
...     SearchSpace('Decision Tree', DecisionTreeClassifier)
... ]

>>> config = Config(
...     local_dir = 'miraiml_local',
...     problem_type = 'classification',
...     score_function = roc_auc_score,
...     search_spaces = search_spaces,
...     use_all_features = False,
...     n_folds = 5,
...     stratified = True,
...     ensemble_id = 'Ensemble',
...     stagnation = -1
... )
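
Since `score_function` must take the truth first and the predictions second, and bigger scores must mean better models, error metrics need their sign flipped before they can be used. The sketch below shows one way to do that for RMSE using only the standard library. The name `negative_rmse` and the sample values are illustrative assumptions, and plain lists are used only to keep the example self-contained.

```python
import math


def negative_rmse(y_true, y_pred):
    """Score function: truth first, predictions second, bigger is better."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    # RMSE is an error (smaller is better), so flip its sign.
    return -rmse


truth = [1.0, 2.0, 3.0]
good = negative_rmse(truth, [1.0, 2.0, 3.0])  # perfect predictions
bad = negative_rmse(truth, [2.0, 3.0, 4.0])   # every prediction off by 1
print(good > bad)  # True
```

A function like this would be passed as `score_function` for a `'regression'` problem, in the same position where `roc_auc_score` is used above.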

miraiml.Engine

class miraiml.Engine(config, on_improvement=None)

This class offers the controls for the engine.

Parameters:	config (miraiml.Config) – The configurations for the behavior of the engine.
	on_improvement (function, optional, default=None) – A function that will be executed every time the engine finds an improvement for some id. It must receive a `status` parameter, which is the return of the method `request_status()` (an instance of `miraiml.Status`).
Raises:	`TypeError` if `config` is not an instance of `miraiml.Config` or `on_improvement` (if provided) is not callable.
Example:

>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.tree import DecisionTreeClassifier
>>> from miraiml import SearchSpace, Config, Engine

>>> search_spaces = [
...     SearchSpace('Naive Bayes', GaussianNB),
...     SearchSpace('Decision Tree', DecisionTreeClassifier)
... ]

>>> config = Config(
...     local_dir = 'miraiml_local',
...     problem_type = 'classification',
...     score_function = roc_auc_score,
...     search_spaces = search_spaces,
...     ensemble_id = 'Ensemble'
... )

>>> def on_improvement(status):
...     print('Scores:', status.scores)

>>> engine = Engine(config, on_improvement=on_improvement)
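
The callback can do more than print. As a rough sketch, the variant below keeps a log of every improvement so that it can be inspected later. It only touches the `best_id` and `scores` attributes of the status object. The `fake_status` object and its score values are made up so that the callback can be tried without running the engine.

```python
from types import SimpleNamespace

score_log = []


def on_improvement(status):
    # Keep a copy of the scores dictionary each time the engine reports progress.
    score_log.append((status.best_id, dict(status.scores)))


# Stand-in for a miraiml.Status object, exposing only the attributes used above.
fake_status = SimpleNamespace(
    best_id='Ensemble',
    scores={'Naive Bayes': 0.81, 'Decision Tree': 0.78, 'Ensemble': 0.84},
)
on_improvement(fake_status)
print(score_log[-1][0])  # Ensemble
```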

`is_running`	Tells whether the engine is running or not.
`interrupt`	Makes the engine stop on the first opportunity.
`load_train_data`	Interrupts the engine and loads the train dataset.
`load_test_data`	Interrupts the engine and loads the test dataset.
`clean_test_data`	Cleans the test data from the buffer.
`shuffle_train_data`	Interrupts the engine and shuffles the training data.
`reconfigure`	Interrupts the engine and loads a new configuration.
`restart`	Interrupts the engine and starts again from the last checkpoint (if any).
`request_status`	Queries the current status of the engine.

is_running()

Tells whether the engine is running or not.

Return type:	bool
Returns:	`True` if the engine is running and `False` otherwise.

interrupt()

Makes the engine stop on the first opportunity.

Note

This method is not asynchronous. It will wait until the engine stops.

load_train_data(train_data, target_column, restart=False)

Interrupts the engine and loads the train dataset. All of its column names must be instances of either str or int.

Warning

Loading new training data will always trigger the loss of history for optimization.

Parameters:	train_data (pandas.DataFrame) – The training data.
	target_column (str or int) – The target column identifier.
	restart (bool, optional, default=False) – Whether to restart the engine after updating data or not.
Raises:	`TypeError` if `train_data` is not an instance of `pandas.DataFrame`.
Raises:	`ValueError` if `target_column` is not a column of `train_data` or if some column name is of a prohibited type.

load_test_data(test_data, restart=False)

Interrupts the engine and loads the test dataset. All of its columns must be columns in the train data.

The test dataset is the one for which we don’t have the values for the target column. This method should be used to load data in production.

Warning

This method can only be called after miraiml.Engine.load_train_data().

Parameters:	test_data (pandas.DataFrame, optional, default=None) – The testing data. Use the default value if you don’t need to make predictions for data with unknown labels.
	restart (bool, optional, default=False) – Whether to restart the engine after loading data or not.
Raises:	`RuntimeError` if this method is called before loading the train data.
Raises:	`ValueError` if the column names are not consistent.
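
The column requirements of `load_train_data` and `load_test_data` can be checked ahead of time on plain lists of column names. The helper below is a hypothetical pre-check written for illustration. It mirrors the rules stated above but is not MiraiML's own validation code.

```python
def check_columns(train_columns, target_column, test_columns=()):
    """Raise ValueError if the column names break the documented rules."""
    for name in train_columns:
        # Column names must be instances of str or int.
        if not isinstance(name, (str, int)):
            raise ValueError('prohibited column name type: %r' % (name,))
    # The target must be one of the training columns.
    if target_column not in train_columns:
        raise ValueError('target_column is not a column of the train data')
    # Every test column must also exist in the train data.
    unknown = [name for name in test_columns if name not in train_columns]
    if unknown:
        raise ValueError('test columns missing from the train data: %r' % unknown)


# Passes silently: valid names, known target, consistent test columns.
check_columns(['age', 'income', 'label'], 'label', test_columns=['age', 'income'])
```

With pandas data frames, the same check could be fed with `list(train_data.columns)` and `list(test_data.columns)`.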

clean_test_data(restart=False)

Cleans the test data from the buffer.

Note

Keep in mind that if you don’t intend to make predictions for unlabeled data, the engine will run faster with a clean test data buffer.

Parameters:	restart (bool, optional, default=False) – Whether to restart the engine after cleaning test data or not.

shuffle_train_data(restart=False)

Interrupts the engine and shuffles the training data.

Parameters:	restart (bool, optional, default=False) – Whether to restart the engine after shuffling data or not.
Raises:	`RuntimeError` if the engine has no data loaded.

Note

It’s a good practice to shuffle the training data periodically to avoid overfitting on a particular folding pattern.

reconfigure(config, restart=False)

Interrupts the engine and loads a new configuration.

Warning

Reconfiguring the engine will always trigger the loss of history for optimization.

Parameters:	config (miraiml.Config) – The configurations for the behavior of the engine.
	restart (bool, optional, default=False) – Whether to restart the engine after reconfiguring it or not.

restart()

Interrupts the engine and starts again from the last checkpoint (if any). It is also used to start the engine for the first time.

Raises:	`RuntimeError` if no data is loaded.

request_status()

Queries the current status of the engine.

Return type:	miraiml.Status
Returns:	The current status of the engine as an instance of miraiml.Status. If no score has been computed yet, returns `None`.
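
Taken together, the methods above suggest a simple control loop: start the engine, let it work for a while, stop it and read the status. The sketch below shows one possible shape for such a loop. The function `run_for` and the class `StubEngine` are hypothetical names. The stub only mimics the method names documented here so that the loop can be run without MiraiML. With a real `miraiml.Engine`, training data must be loaded before `restart()` is called.

```python
import time


def run_for(engine, seconds, poll_interval=0.1):
    """Start (or resume) the engine, let it run for a while, then stop it."""
    engine.restart()
    deadline = time.monotonic() + seconds
    # The engine may stop by itself (e.g. on stagnation), so poll is_running().
    while engine.is_running() and time.monotonic() < deadline:
        time.sleep(poll_interval)
    engine.interrupt()  # waits until the engine has actually stopped
    return engine.request_status()  # may be None if nothing was scored yet


class StubEngine:
    """Minimal stand-in used only to exercise run_for without MiraiML."""

    def __init__(self):
        self.running = False
        self.calls = []

    def restart(self):
        self.running = True
        self.calls.append('restart')

    def is_running(self):
        return self.running

    def interrupt(self):
        self.running = False
        self.calls.append('interrupt')

    def request_status(self):
        self.calls.append('request_status')
        return None


stub = StubEngine()
status = run_for(stub, seconds=0.2, poll_interval=0.05)
print(stub.calls)  # ['restart', 'interrupt', 'request_status']
```

Because `request_status()` can return `None`, callers of a loop like this should check the result before reading attributes such as `scores`.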

miraiml.Status

class miraiml.Status(**kwargs)

Represents the current status of the engine. Objects of this class are not supposed to be instantiated by the user. Rather, they are returned by the miraiml.Engine.request_status() method.

The following attributes are accessible:

best_id: the id of the best base model (or ensemble)
scores: a dictionary containing the current score of each id
train_predictions: a pandas.DataFrame object containing the predictions for the train data for each id
test_predictions: a pandas.DataFrame object containing the predictions for the test data for each id
ensemble_weights: a dictionary containing the ensemble weights for each base model id
base_models: a dictionary containing the characteristics of each base model (accessed by its respective id)
histories: a dictionary of pandas.DataFrame objects for each id, containing the history of base model attempts and their respective scores. Hyperparameter columns end with the '__(hyperparameter)' suffix and feature columns end with the '__(feature)' suffix. The score column can be accessed with the key 'score'. For more information, please check the User Guide.

The characteristics of each base model are represented by dictionaries containing the following keys:

'model_class': The name of the base model’s modeling class
'parameters': The dictionary of hyperparameters values
'features': The list of features used
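
The suffix convention described for `histories` above makes it easy to tell hyperparameter columns from feature columns. The sketch below separates them, working on a plain list of column names. The function name and the sample column names are made up for illustration.

```python
HYPERPARAMETER_SUFFIX = '__(hyperparameter)'
FEATURE_SUFFIX = '__(feature)'


def split_history_columns(columns):
    """Group history column names by suffix, returning the names without it."""
    hyperparameters = [c[:-len(HYPERPARAMETER_SUFFIX)] for c in columns
                       if c.endswith(HYPERPARAMETER_SUFFIX)]
    features = [c[:-len(FEATURE_SUFFIX)] for c in columns
                if c.endswith(FEATURE_SUFFIX)]
    return hyperparameters, features


# Made-up column names in the documented format.
columns = ['score', 'C__(hyperparameter)', 'max_iter__(hyperparameter)',
           'age__(feature)', 'income__(feature)']
hyperparameters, features = split_history_columns(columns)
print(hyperparameters)  # ['C', 'max_iter']
print(features)         # ['age', 'income']
```

With a real status object, the input would presumably be the column list of one of the data frames in `status.histories`.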

build_report(include_features=False)

Returns the report of the current status of the engine in a formatted string.

Parameters:	include_features (bool, optional, default=False) – Whether to include the list of features on the report or not (may cause some visual mess).
Return type:	str
Returns:	The formatted report.

miraiml.pipeline

miraiml.pipeline contains a function that lets you build your own pipeline classes. It also contains a few pre-defined pipelines for baselines.

`compose`	A function that defines pipeline classes dynamically.
`NaiveBayesBaseliner`	This is a baseline pipeline for classification problems.
`LinearRegressionBaseliner`	This is a baseline pipeline for regression problems.

miraiml.pipeline.compose(steps)

A function that defines pipeline classes dynamically. It builds a pipeline class that can be instantiated with particular parameters for each of its transformers/estimator, without needing to call set_params as you would with scikit-learn’s Pipeline when optimizing hyperparameters.

Similarly to scikit-learn’s Pipeline, steps is a list of tuples containing an alias and the respective pipeline element. However, since this function is a class factory, you should pass the transformer/estimator classes themselves rather than instances, unlike with scikit-learn’s Pipeline. Thus, this is how compose() should be called:

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.preprocessing import StandardScaler

>>> from miraiml.pipeline import compose

>>> MyPipelineClass = compose(
...     steps = [
...         ('scaler', StandardScaler), # StandardScaler instead of StandardScaler()
...         ('rfc', RandomForestClassifier) # No instantiation either
...     ]
... )

Then, to instantiate MyPipelineClass with the desired parameters, name each parameter as its step alias followed by '__' and the parameter name.

>>> pipeline = MyPipelineClass(scaler__with_mean=False, rfc__max_depth=3)
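
To make the naming convention concrete, the sketch below shows how such keyword arguments can be routed back to their steps by splitting each name at the first '__'. This is an illustration of the convention only, with a hypothetical function name. It is not a description of how MiraiML implements it internally.

```python
def group_step_parameters(**kwargs):
    """Group 'alias__parameter' keyword arguments by their step alias."""
    grouped = {}
    for name, value in kwargs.items():
        # Split on the first '__' only: aliases cannot contain '__',
        # but parameter names might.
        alias, parameter = name.split('__', 1)
        grouped.setdefault(alias, {})[parameter] = value
    return grouped


print(group_step_parameters(scaler__with_mean=False, rfc__max_depth=3))
# {'scaler': {'with_mean': False}, 'rfc': {'max_depth': 3}}
```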

If you want to know which parameters you’re allowed to play with, just call get_params:

>>> params = pipeline.get_params()
>>> print("\n".join(params))
scaler__with_mean
scaler__with_std
rfc__bootstrap
rfc__class_weight
rfc__criterion
rfc__max_depth
rfc__max_features
rfc__max_leaf_nodes
rfc__min_impurity_decrease
rfc__min_impurity_split
rfc__min_samples_leaf
rfc__min_samples_split
rfc__min_weight_fraction_leaf
rfc__n_estimators
rfc__n_jobs
rfc__oob_score
rfc__random_state
rfc__verbose
rfc__warm_start

You can check the available methods for your instantiated pipelines in the documentation for miraiml.core.BasePipelineClass, the class from which the composed classes inherit.

The intended purpose of such pipeline classes is that they can work as base models to build instances of miraiml.SearchSpace.

>>> from miraiml import SearchSpace

>>> search_space = SearchSpace(
...     id='MyPipelineClass',
...     model_class=MyPipelineClass,
...     parameters_values=dict(
...         scaler__with_mean=[True, False],
...         scaler__with_std=[True, False],
...         rfc__max_depth=[3, 4, 5, 6]
...     )
... )
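
To get a feeling for the size of this search space, the candidate lists can be expanded into every possible combination. The sketch below does this with the standard library only. It is meant for inspection and makes no claim about how the engine actually samples the space.

```python
from itertools import product

parameters_values = dict(
    scaler__with_mean=[True, False],
    scaler__with_std=[True, False],
    rfc__max_depth=[3, 4, 5, 6],
)

names = list(parameters_values)
# One dictionary per combination of candidate values.
combinations = [dict(zip(names, values))
                for values in product(*parameters_values.values())]
print(len(combinations))  # 16
print(combinations[0])
# {'scaler__with_mean': True, 'scaler__with_std': True, 'rfc__max_depth': 3}
```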

Parameters:	steps (list) – The list of pairs (alias, class) to define the pipeline.
	Warning: Repeated aliases are not allowed and none of the aliases can start with numbers or contain `'__'`. The classes used to compose a pipeline must implement `get_params` and `set_params`, such as scikit-learn’s classes, or `compose()` will break.
Return type:	type
Returns:	The composed pipeline class
Raises:	`TypeError` if an alias is not a string.
Raises:	`ValueError` if an alias has an invalid name.
Raises:	`NotImplementedError` if some class of the pipeline does not implement the required methods.
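
The alias rules can be summarised in a small pre-check. The function below is a hypothetical sketch that follows the rules and exception types listed above. It is not the validation code used by `compose()`, and it does not check for `get_params` or `set_params`.

```python
def validate_aliases(steps):
    """Check the aliases of a list of (alias, class) pairs."""
    seen = set()
    for alias, _ in steps:
        if not isinstance(alias, str):
            raise TypeError('alias must be a string: %r' % (alias,))
        if alias in seen:
            raise ValueError('repeated alias: %r' % alias)
        # Aliases cannot start with a number or contain '__'.
        if alias[:1].isdigit() or '__' in alias:
            raise ValueError('invalid alias name: %r' % alias)
        seen.add(alias)


# `object` is a placeholder for real transformer/estimator classes.
validate_aliases([('scaler', object), ('rfc', object)])  # passes silently
```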

class miraiml.pipeline.NaiveBayesBaseliner

This is a baseline pipeline for classification problems. It is composed of the following transformers/estimator:

sklearn.preprocessing.OneHotEncoder
sklearn.impute.SimpleImputer
sklearn.preprocessing.MinMaxScaler
sklearn.naive_bayes.GaussianNB

The available parameters to tweak are:

>>> from miraiml.pipeline import NaiveBayesBaseliner

>>> for param in NaiveBayesBaseliner().get_params():
...     print(param)
...
ohe__categorical_features
ohe__categories
ohe__drop
ohe__dtype
ohe__handle_unknown
ohe__n_values
ohe__sparse
impute__add_indicator
impute__fill_value
impute__missing_values
impute__strategy
impute__verbose
min_max__feature_range
naive__priors
naive__var_smoothing

class miraiml.pipeline.LinearRegressionBaseliner

This is a baseline pipeline for regression problems. It is composed of the following transformers/estimator:

sklearn.preprocessing.OneHotEncoder
sklearn.impute.SimpleImputer
sklearn.preprocessing.MinMaxScaler
sklearn.linear_model.LinearRegression

The available parameters to tweak are:

>>> from miraiml.pipeline import LinearRegressionBaseliner

>>> for param in LinearRegressionBaseliner().get_params():
...     print(param)
...
ohe__categorical_features
ohe__categories
ohe__drop
ohe__dtype
ohe__handle_unknown
ohe__n_values
ohe__sparse
impute__add_indicator
impute__fill_value
impute__missing_values
impute__strategy
impute__verbose
min_max__feature_range
lin_reg__fit_intercept
lin_reg__n_jobs
lin_reg__normalize