The User’s API
miraiml provides the following components:
- miraiml.SearchSpace represents the search space for a base model
- miraiml.Config defines the general behavior for miraiml.Engine
- miraiml.Engine manages the optimization process
- miraiml.pipeline has some features related to pipelines (hot!)
miraiml.SearchSpace
class miraiml.SearchSpace(id, model_class, parameters_values=None, parameters_rules=lambda x: None)
This class represents the search space of hyperparameters for a base model.
Parameters:
- id (str) – The id that will be associated with the models generated within this search space.
- model_class (type) – Any class that represents a statistical model. It must implement the method fit, as well as predict for regression problems or predict_proba for classification problems.
- parameters_values (dict, optional, default=None) – A dictionary containing lists of values to be tested as parameters when instantiating objects of model_class for id.
- parameters_rules (function, optional, default=lambda x: None) – A function that constrains certain parameters because of the values assumed by others. It must receive a dictionary as input and doesn’t need to return anything. Not used if parameters_values has no keys.

Warning
Make sure that the parameters accessed in parameters_rules exist in the set of parameters defined in parameters_values, otherwise the engine will attempt to access an invalid key.
Raises: NotImplementedError if a model class does not implement fit, or implements neither predict nor predict_proba.
Raises: TypeError if some parameter is of a prohibited type.
Raises: ValueError if a provided id is not allowed.

Example:
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from miraiml import SearchSpace
>>> def logistic_regression_parameters_rules(parameters):
...     if parameters['solver'] in ['newton-cg', 'sag', 'lbfgs']:
...         parameters['penalty'] = 'l2'
>>> search_space = SearchSpace(
...     id='Logistic Regression',
...     model_class=LogisticRegression,
...     parameters_values={
...         'penalty': ['l1', 'l2'],
...         'C': np.arange(0.1, 2, 0.1),
...         'max_iter': np.arange(50, 300),
...         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
...         'random_state': [0]
...     },
...     parameters_rules=logistic_regression_parameters_rules
... )
Warning
Do not allow random_state to assume multiple values. If model_class has a random_state parameter, force the engine to always choose the same value by providing a list with a single element.

Allowing random_state to assume multiple values will confuse the engine because the scores will be unstable even with the same choice of hyperparameters and features.
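The interplay between parameters_values and parameters_rules can be sketched in plain Python. The helper draw_parameters below is hypothetical and is not part of miraiml; it only assumes that one value is picked per key and that the rule function then adjusts the resulting dictionary in place.

```python
import random


def logistic_regression_parameters_rules(parameters):
    # Same rule as in the example above: these solvers only support 'l2'.
    if parameters['solver'] in ['newton-cg', 'sag', 'lbfgs']:
        parameters['penalty'] = 'l2'


def draw_parameters(parameters_values, parameters_rules, rng):
    # Hypothetical helper (an assumption, not miraiml's code): pick one value
    # per key, then let the rule function fix incompatible combinations.
    parameters = {name: rng.choice(list(values))
                  for name, values in parameters_values.items()}
    if parameters:  # rules are not used when there are no keys
        parameters_rules(parameters)
    return parameters


parameters_values = {
    'penalty': ['l1', 'l2'],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'random_state': [0],  # a single value keeps the scores comparable
}

rng = random.Random(0)
samples = [draw_parameters(parameters_values,
                           logistic_regression_parameters_rules, rng)
           for _ in range(100)]
```

Every key read inside the rule function ('solver') and every key written ('penalty') is present in parameters_values, which is what the warning about invalid keys asks for.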
miraiml.Config
class miraiml.Config(local_dir, problem_type, score_function, search_spaces, use_all_features=False, n_folds=5, stratified=True, ensemble_id=None, stagnation=60)
This class defines the general behavior of the engine.
Parameters:
- local_dir (str) – The name of the folder in which the engine will save its internal files. If the directory doesn’t exist, it will be created automatically. local_dir must not contain '..' or '/'.
- problem_type (str) – The problem type: 'classification' or 'regression'. Multi-class classification problems are not supported.
- search_spaces (list) – The list of miraiml.SearchSpace objects to optimize. If search_spaces has length 1, the engine will not run ensemble cycles.
- score_function (function) – A function that receives the “truth” and the predictions (in this order) and returns the score. Bigger scores must mean better models.
- use_all_features (bool, optional, default=False) – Whether to force MiraiML to always use all features or not.
- n_folds (int, optional, default=5) – The number of folds for the fitting/predicting process. The minimum value allowed is 2.
- stratified (bool, optional, default=True) – Whether to stratify folds on target or not. Only used if problem_type == 'classification'.
- ensemble_id (str, optional, default=None) – The id for the ensemble. If none is given, the engine will not ensemble base models.
- stagnation (int or float, optional, default=60) – The amount of time (in minutes) for the engine to automatically interrupt itself if no improvement happens. Negative numbers are interpreted as “infinite”.

Warning
Stagnation checks only happen after the engine finishes at least one optimization cycle. In other words, every base model and the ensemble (if set) must be scored at least once.
Raises: NotImplementedError if a model class does not implement the proper method for prediction.
Raises: TypeError if some parameter is not of its allowed type.
Raises: ValueError if some parameter has an invalid value.

Example:
>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.tree import DecisionTreeClassifier
>>> from miraiml import SearchSpace, Config
>>> search_spaces = [
...     SearchSpace('Naive Bayes', GaussianNB),
...     SearchSpace('Decision Tree', DecisionTreeClassifier)
... ]
>>> config = Config(
...     local_dir='miraiml_local',
...     problem_type='classification',
...     score_function=roc_auc_score,
...     search_spaces=search_spaces,
...     use_all_features=False,
...     n_folds=5,
...     stratified=True,
...     ensemble_id='Ensemble',
...     stagnation=-1
... )
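Because bigger scores must mean better models, an error metric needs its sign flipped before it can serve as score_function. The following is a minimal sketch in plain Python; the name negative_rmse is only illustrative and does not come from miraiml.

```python
import math


def negative_rmse(y_true, y_pred):
    # Receives the "truth" first and the predictions second, as required.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    # RMSE is an error (smaller is better), so negate it.
    return -rmse


y_true = [3.0, 5.0, 7.0]
good = negative_rmse(y_true, [3.0, 5.0, 7.0])  # perfect predictions
bad = negative_rmse(y_true, [0.0, 0.0, 0.0])   # poor predictions
```

With this convention the perfect predictions receive the larger score, so a function like this could be passed as score_function for a 'regression' problem.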
miraiml.Engine
class miraiml.Engine(config, on_improvement=None)
This class offers the controls for the engine.
Parameters:
- config (miraiml.Config) – The configurations for the behavior of the engine.
- on_improvement (function, optional, default=None) – A function that will be executed every time the engine finds an improvement for some id. It must receive a status parameter, which is the return of the method request_status() (an instance of miraiml.Status).
Raises: TypeError if config is not an instance of miraiml.Config or on_improvement (if provided) is not callable.

Example:
>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.tree import DecisionTreeClassifier
>>> from miraiml import SearchSpace, Config, Engine
>>> search_spaces = [
...     SearchSpace('Naive Bayes', GaussianNB),
...     SearchSpace('Decision Tree', DecisionTreeClassifier)
... ]
>>> config = Config(
...     local_dir='miraiml_local',
...     problem_type='classification',
...     score_function=roc_auc_score,
...     search_spaces=search_spaces,
...     ensemble_id='Ensemble'
... )
>>> def on_improvement(status):
...     print('Scores:', status.scores)
>>> engine = Engine(config, on_improvement=on_improvement)
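An on_improvement callback can be tried out without running the engine by handing it a stand-in object. The sketch below uses only the best_id and scores attributes documented for miraiml.Status; the SimpleNamespace stand-in and its values are hypothetical, and it assumes that best_id is one of the keys of scores.

```python
from types import SimpleNamespace

improvement_log = []


def on_improvement(status):
    # Record the best id and its score each time the callback fires.
    best_id = status.best_id
    improvement_log.append((best_id, status.scores[best_id]))


# Stand-in object with made-up values (not a real miraiml.Status).
fake_status = SimpleNamespace(
    best_id='Ensemble',
    scores={'Naive Bayes': 0.81, 'Decision Tree': 0.78, 'Ensemble': 0.84},
)
on_improvement(fake_status)
```

Keeping the callback small is a reasonable precaution, since it is executed every time an improvement is found.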
is_running | Tells whether the engine is running or not.
interrupt | Makes the engine stop at the first opportunity.
load_train_data | Interrupts the engine and loads the train dataset.
load_test_data | Interrupts the engine and loads the test dataset.
shuffle_train_data | Interrupts the engine and shuffles the training data.
reconfigure | Interrupts the engine and loads a new configuration.
restart | Interrupts the engine and starts again from the last checkpoint (if any).
request_status | Queries the current status of the engine.

is_running()
Tells whether the engine is running or not.
Return type: bool
Returns: True if the engine is running and False otherwise.
interrupt()
Makes the engine stop at the first opportunity.

Note
This method is not asynchronous. It will wait until the engine stops.
load_train_data(train_data, target_column, restart=False)
Interrupts the engine and loads the train dataset. All of its column names must be instances of either str or int.

Warning
Loading new training data will always trigger the loss of history for optimization.

Parameters:
- train_data (pandas.DataFrame) – The training data.
- target_column (str or int) – The target column identifier.
- restart (bool, optional, default=False) – Whether to restart the engine after updating data or not.
Raises: TypeError if train_data is not an instance of pandas.DataFrame.
Raises: ValueError if target_column is not a column of train_data or if some column name is of a prohibited type.
load_test_data(test_data, restart=False)
Interrupts the engine and loads the test dataset. All of its columns must be columns in the train data.

The test dataset is the one for which we don’t have the values for the target column. This method should be used to load data in production.

Warning
This method can only be called after miraiml.Engine.load_train_data().

Parameters:
- test_data (pandas.DataFrame, optional, default=None) – The testing data. Use the default value if you don’t need to make predictions for data with unknown labels.
- restart (bool, optional, default=False) – Whether to restart the engine after loading data or not.
Raises: RuntimeError if this method is called before loading the train data.
Raises: ValueError if the column names are not consistent.
clean_test_data(restart=False)
Cleans the test data from the buffer.

Note
Keep in mind that if you don’t intend to make predictions for unlabeled data, the engine will run faster with a clean test data buffer.

Parameters: restart (bool, optional, default=False) – Whether to restart the engine after cleaning test data or not.
shuffle_train_data(restart=False)
Interrupts the engine and shuffles the training data.

Parameters: restart (bool, optional, default=False) – Whether to restart the engine after shuffling data or not.
Raises: RuntimeError if the engine has no data loaded.

Note
It’s a good practice to shuffle the training data periodically to avoid overfitting on a particular folding pattern.
reconfigure(config, restart=False)
Interrupts the engine and loads a new configuration.

Warning
Reconfiguring the engine will always trigger the loss of history for optimization.

Parameters:
- config (miraiml.Config) – The configurations for the behavior of the engine.
- restart (bool, optional, default=False) – Whether to restart the engine after reconfiguring it or not.
restart()
Interrupts the engine and starts again from the last checkpoint (if any). It is also used to start the engine for the first time.

Raises: RuntimeError if no data is loaded.
request_status()
Queries the current status of the engine.

Return type: miraiml.Status
Returns: The current status of the engine in the form of a dictionary. If no score has been computed yet, returns None.
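The methods above suggest a simple lifecycle: load the data, start the engine, let it work for a while, stop it and read the status. The sketch below wraps that sequence in a helper; run_for and its poll_interval argument are hypothetical names, and only the engine methods documented above are called.

```python
import time


def run_for(engine, train_data, target_column, seconds, poll_interval=1.0):
    # Load the data and start the engine.
    engine.load_train_data(train_data, target_column)
    engine.restart()  # also used to start the engine for the first time

    # Wait until the time budget is spent or the engine stops on its own
    # (for instance because of stagnation).
    deadline = time.monotonic() + seconds
    while engine.is_running() and time.monotonic() < deadline:
        time.sleep(poll_interval)

    engine.interrupt()  # not asynchronous: waits until the engine stops
    return engine.request_status()  # may be None if nothing was scored
```

A call could look like status = run_for(engine, train_data, 'target', seconds=60), where engine is a miraiml.Engine and train_data is a pandas.DataFrame that has a column named 'target' (both assumed here).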
miraiml.Status
class miraiml.Status(**kwargs)
Represents the current status of the engine. Objects of this class are not supposed to be instantiated by the user. Rather, they are returned by the miraiml.Engine.request_status() method.

The following attributes are accessible:
- best_id: the id of the best base model (or ensemble)
- scores: a dictionary containing the current score of each id
- train_predictions: a pandas.DataFrame object containing the predictions for the train data for each id
- test_predictions: a pandas.DataFrame object containing the predictions for the test data for each id
- ensemble_weights: a dictionary containing the ensemble weights for each base model id
- base_models: a dictionary containing the characteristics of each base model (accessed by its respective id)
- histories: a dictionary of pandas.DataFrame objects for each id, containing the history of base models attempts and their respective scores. Hyperparameters columns end with the '__(hyperparameter)' suffix and features columns end with the '__(feature)' suffix. The score column can be accessed with the key 'score'. For more information, please check the User Guide.
The characteristics of each base model are represented by dictionaries containing the following keys:
- 'model_class': The name of the base model’s modeling class
- 'parameters': The dictionary of hyperparameters values
- 'features': The list of features used
build_report(include_features=False)
Returns the report of the current status of the engine in a formatted string.

Parameters: include_features (bool, optional, default=False) – Whether to include the list of features on the report or not (may cause some visual mess).
Return type: str
Returns: The formatted report.
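The suffix convention of the histories columns makes it easy to tell hyperparameters from features. The helper below is a sketch that works on a plain list of column names; split_history_columns and the sample names are hypothetical, and in practice the list would presumably come from the columns of one of the histories DataFrames.

```python
HYPERPARAMETER_SUFFIX = '__(hyperparameter)'
FEATURE_SUFFIX = '__(feature)'


def split_history_columns(columns):
    # Group column names by suffix and strip the suffix from each name.
    hyperparameters = [c[:-len(HYPERPARAMETER_SUFFIX)] for c in columns
                       if c.endswith(HYPERPARAMETER_SUFFIX)]
    features = [c[:-len(FEATURE_SUFFIX)] for c in columns
                if c.endswith(FEATURE_SUFFIX)]
    return hyperparameters, features


# Made-up column names for illustration only.
columns = ['score', 'max_depth__(hyperparameter)', 'age__(feature)',
           'income__(feature)']
hyperparameters, features = split_history_columns(columns)
```

The 'score' column carries neither suffix, so it ends up in neither list.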
miraiml.pipeline
miraiml.pipeline contains a function that lets you build your own
pipeline classes. It also contains a few pre-defined pipelines for baselines.
compose | A function that defines pipeline classes dynamically.
NaiveBayesBaseliner | This is a baseline pipeline for classification problems.
LinearRegressionBaseliner | This is a baseline pipeline for regression problems.
miraiml.pipeline.compose(steps)
A function that defines pipeline classes dynamically. It builds a pipeline class that can be instantiated with particular parameters for each of its transformers/estimator without needing to call set_params as you would do with scikit-learn’s Pipeline when performing hyperparameters optimizations.

Similarly to scikit-learn’s Pipeline, steps is a list of tuples containing an alias and the respective pipeline element. However, since this function is a class factory, you shouldn’t instantiate the transformer/estimator as you would do with scikit-learn’s Pipeline. Thus, this is how compose() should be called:

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.preprocessing import StandardScaler
>>> from miraiml.pipeline import compose
>>> MyPipelineClass = compose(
...     steps = [
...         ('scaler', StandardScaler), # StandardScaler instead of StandardScaler()
...         ('rfc', RandomForestClassifier) # No instantiation either
...     ]
... )
And then, in order to instantiate MyPipelineClass with the desired parameters, you just need to refer to them as a concatenation of their respective class aliases and their names, separated by '__'.

>>> pipeline = MyPipelineClass(scaler__with_mean=False, rfc__max_depth=3)
If you want to know which parameters you’re allowed to play with, just call get_params:

>>> params = pipeline.get_params()
>>> print("\n".join(params))
scaler__with_mean
scaler__with_std
rfc__bootstrap
rfc__class_weight
rfc__criterion
rfc__max_depth
rfc__max_features
rfc__max_leaf_nodes
rfc__min_impurity_decrease
rfc__min_impurity_split
rfc__min_samples_leaf
rfc__min_samples_split
rfc__min_weight_fraction_leaf
rfc__n_estimators
rfc__n_jobs
rfc__oob_score
rfc__random_state
rfc__verbose
rfc__warm_start
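The 'alias__name' convention can be illustrated by splitting keyword arguments at the first '__'. This is only a sketch of the naming scheme, not the code compose() actually runs; group_by_alias is a hypothetical helper.

```python
def group_by_alias(**params):
    # Split each 'alias__name' key at the first '__'. Aliases cannot
    # contain '__', so the first occurrence is always the separator.
    grouped = {}
    for key, value in params.items():
        alias, separator, name = key.partition('__')
        if not separator or not name:
            raise ValueError('expected <alias>__<parameter>, got %r' % key)
        grouped.setdefault(alias, {})[name] = value
    return grouped


grouped = group_by_alias(scaler__with_mean=False, rfc__max_depth=3)
```

Here grouped maps 'scaler' to its with_mean setting and 'rfc' to its max_depth setting, which is the per-step view of the same keyword arguments.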
You can check the available methods for your instantiated pipelines in the documentation for miraiml.core.BasePipelineClass, which is the class from which the composed classes inherit.

The intended purpose of such pipeline classes is that they can work as base models to build instances of miraiml.SearchSpace.

>>> from miraiml import SearchSpace
>>> search_space = SearchSpace(
...     id='MyPipelineClass',
...     model_class=MyPipelineClass,
...     parameters_values=dict(
...         scaler__with_mean=[True, False],
...         scaler__with_std=[True, False],
...         rfc__max_depth=[3, 4, 5, 6]
...     )
... )
Parameters: steps (list) – The list of pairs (alias, class) to define the pipeline.

Warning
Repeated aliases are not allowed and none of the aliases can start with numbers or contain '__'.

The classes used to compose a pipeline must implement get_params and set_params, such as scikit-learn’s classes, or compose() will break.

Return type: type
Returns: The composed pipeline class
Raises: TypeError if an alias is not a string.
Raises: ValueError if an alias has an invalid name.
Raises: NotImplementedError if some class of the pipeline does not implement the required methods.
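The alias rules can be checked before calling compose(). The validator below is a hypothetical sketch that mirrors the documented rules and exception types; it is not miraiml's own implementation, and raising ValueError for repeated aliases is an assumption.

```python
def check_aliases(steps):
    # Validate the aliases of a list of (alias, class) pairs.
    seen = set()
    for alias, _ in steps:
        if not isinstance(alias, str):
            raise TypeError('alias must be a string: %r' % (alias,))
        if alias[:1].isdigit() or '__' in alias:
            raise ValueError('invalid alias: %r' % alias)
        if alias in seen:
            # Assumption: repeated aliases are reported as ValueError.
            raise ValueError('repeated alias: %r' % alias)
        seen.add(alias)
    return [alias for alias, _ in steps]


# Placeholder classes stand in for real transformers/estimators.
aliases = check_aliases([('scaler', object), ('rfc', object)])
```

Running such a check first turns a malformed steps list into an early, readable error.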
class miraiml.pipeline.NaiveBayesBaseliner
This is a baseline pipeline for classification problems. It’s composed of the following transformers/estimator:
- sklearn.preprocessing.OneHotEncoder
- sklearn.impute.SimpleImputer
- sklearn.preprocessing.MinMaxScaler
- sklearn.naive_bayes.GaussianNB

The available parameters to tweak are:

>>> from miraiml.pipeline import NaiveBayesBaseliner
>>> for param in NaiveBayesBaseliner().get_params():
...     print(param)
...
ohe__categorical_features
ohe__categories
ohe__drop
ohe__dtype
ohe__handle_unknown
ohe__n_values
ohe__sparse
impute__add_indicator
impute__fill_value
impute__missing_values
impute__strategy
impute__verbose
min_max__feature_range
naive__priors
naive__var_smoothing
class miraiml.pipeline.LinearRegressionBaseliner
This is a baseline pipeline for regression problems. It’s composed of the following transformers/estimator:
- sklearn.preprocessing.OneHotEncoder
- sklearn.impute.SimpleImputer
- sklearn.preprocessing.MinMaxScaler
- sklearn.linear_model.LinearRegression

The available parameters to tweak are:

>>> from miraiml.pipeline import LinearRegressionBaseliner
>>> for param in LinearRegressionBaseliner().get_params():
...     print(param)
...
ohe__categorical_features
ohe__categories
ohe__drop
ohe__dtype
ohe__handle_unknown
ohe__n_values
ohe__sparse
impute__add_indicator
impute__fill_value
impute__missing_values
impute__strategy
impute__verbose
min_max__feature_range
lin_reg__fit_intercept
lin_reg__n_jobs
lin_reg__normalize
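Since the baseliners expose their parameters under the 'alias__name' scheme, a parameters_values dictionary for a search space can be drafted and checked against the listing above. The candidate values below are illustrative assumptions rather than recommendations, and the names available and unknown are hypothetical.

```python
# A subset of the parameter names listed for LinearRegressionBaseliner.
available = {
    'impute__strategy', 'impute__fill_value', 'min_max__feature_range',
    'lin_reg__fit_intercept', 'lin_reg__n_jobs',
}

# Each key maps to a list of candidate values, as parameters_values expects.
parameters_values = {
    'impute__strategy': ['mean', 'median', 'most_frequent'],
    'min_max__feature_range': [(0, 1), (-1, 1)],
    'lin_reg__fit_intercept': [True, False],
}

# Names that are not in the listing would be reported here.
unknown = sorted(set(parameters_values) - available)
```

A dictionary like this could then be passed as parameters_values to a miraiml.SearchSpace whose model_class is the baseliner.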