The User’s API
miraiml provides the following components:
- miraiml.SearchSpace represents the search space for a base model
- miraiml.Config defines the general behavior for miraiml.Engine
- miraiml.Engine manages the optimization process
- miraiml.pipeline has some features related to pipelines (hot!)
miraiml.SearchSpace
class miraiml.SearchSpace(id, model_class, parameters_values=None, parameters_rules=lambda x: None)
This class represents the search space of hyperparameters for a base model.
Parameters:
- id (str) – The id that will be associated with the models generated within this search space.
- model_class (type) – Any class that represents a statistical model. It must implement fit, as well as predict for regression problems or predict_proba for classification problems.
- parameters_values (dict, optional, default=None) – A dictionary containing lists of values to be tested as parameters when instantiating objects of model_class for id.
- parameters_rules (function, optional, default=lambda x: None) – A function that constrains certain parameters because of the values assumed by others. It must receive a dictionary as input and doesn’t need to return anything. Not used if parameters_values has no keys.
Warning
Make sure that the parameters accessed in parameters_rules exist in the set of parameters defined on parameters_values, otherwise the engine will attempt to access an invalid key.
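One way to stay clear of the invalid-key problem described in the warning is to guard every key access inside the rules function. The following is a minimal sketch in plain Python; the function name and the sample dictionaries are hypothetical and only imitate parameter combinations that the engine might pass to parameters_rules.

```python
def safe_logistic_regression_rules(parameters):
    """Force the 'l2' penalty for solvers that need it, touching only known keys."""
    # Guard both keys so the function never reads a parameter that was
    # left out of parameters_values.
    if 'solver' in parameters and 'penalty' in parameters:
        if parameters['solver'] in ['newton-cg', 'sag', 'lbfgs']:
            parameters['penalty'] = 'l2'


# Hypothetical combination of values, as the engine might draw it
drawn = {'solver': 'lbfgs', 'penalty': 'l1', 'C': 0.5}
safe_logistic_regression_rules(drawn)  # mutates the dictionary in place
print(drawn['penalty'])

# A combination without 'solver' or 'penalty' is left untouched
incomplete = {'C': 0.5}
safe_logistic_regression_rules(incomplete)
print(incomplete)
```

The function returns nothing, which matches the documented contract: the rules act by modifying the dictionary they receive.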
Raises: NotImplementedError if a model class does not implement fit, or implements neither predict nor predict_proba.
Raises: TypeError if some parameter is of a prohibited type.
Raises: ValueError if a provided id is not allowed.
Example:
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from miraiml import SearchSpace
>>> def logistic_regression_parameters_rules(parameters):
...     if parameters['solver'] in ['newton-cg', 'sag', 'lbfgs']:
...         parameters['penalty'] = 'l2'
>>> search_space = SearchSpace(
...     id = 'Logistic Regression',
...     model_class = LogisticRegression,
...     parameters_values = {
...         'penalty': ['l1', 'l2'],
...         'C': np.arange(0.1, 2, 0.1),
...         'max_iter': np.arange(50, 300),
...         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
...         'random_state': [0]
...     },
...     parameters_rules = logistic_regression_parameters_rules
... )
Warning
Do not allow random_state to assume multiple values. If model_class has a random_state parameter, force the engine to always choose the same value by providing a list with a single element.
Allowing random_state to assume multiple values will confuse the engine because the scores will be unstable even with the same choice of hyperparameters and features.
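The random_state rule can be checked before a search space is built. The helper below is a hypothetical sketch in plain Python, not part of miraiml; it only inspects a parameters_values dictionary and reports whether random_state is absent or pinned to a single value.

```python
def has_fixed_random_state(parameters_values):
    """Return True if 'random_state' is absent or limited to a single value."""
    if 'random_state' not in parameters_values:
        return True
    return len(parameters_values['random_state']) == 1


# Pinned to one value: scores stay comparable between attempts
good = {'max_depth': [3, 4, 5], 'random_state': [0]}

# Several values: the same hyperparameters could yield different scores
bad = {'max_depth': [3, 4, 5], 'random_state': [0, 1, 2]}

print(has_fixed_random_state(good))
print(has_fixed_random_state(bad))
```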
miraiml.Config
class miraiml.Config(local_dir, problem_type, score_function, search_spaces, use_all_features=False, n_folds=5, stratified=True, ensemble_id=None, stagnation=60)
This class defines the general behavior of the engine.
Parameters:
- local_dir (str) – The name of the folder in which the engine will save its internal files. If the directory doesn’t exist, it will be created automatically. local_dir must not contain .. or /.
- problem_type (str) – The problem type: 'classification' or 'regression'. Multi-class classification problems are not supported.
- search_spaces (list) – The list of miraiml.SearchSpace objects to optimize. If search_spaces has length 1, the engine will not run ensemble cycles.
- score_function (function) – A function that receives the “truth” and the predictions (in this order) and returns the score. Bigger scores must mean better models.
- use_all_features (bool, optional, default=False) – Whether to force MiraiML to always use all features or not.
- n_folds (int, optional, default=5) – The number of folds for the fitting/predicting process. The minimum value allowed is 2.
- stratified (bool, optional, default=True) – Whether to stratify folds on target or not. Only used if problem_type == 'classification'.
- ensemble_id (str, optional, default=None) – The id for the ensemble. If none is given, the engine will not ensemble base models.
- stagnation (int or float, optional, default=60) – The amount of time (in minutes) after which the engine automatically interrupts itself if no improvement happens. Negative numbers are interpreted as “infinite”.
Warning
Stagnation checks only happen after the engine finishes at least one optimization cycle. In other words, every base model and the ensemble (if set) must be scored at least once.
Raises: NotImplementedError if a model class does not implement the proper method for prediction.
Raises: TypeError if some parameter is not of its allowed type.
Raises: ValueError if some parameter has an invalid value.
Example:
>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.tree import DecisionTreeClassifier
>>> from miraiml import SearchSpace, Config
>>> search_spaces = [
...     SearchSpace('Naive Bayes', GaussianNB),
...     SearchSpace('Decision Tree', DecisionTreeClassifier)
... ]
>>> config = Config(
...     local_dir = 'miraiml_local',
...     problem_type = 'classification',
...     score_function = roc_auc_score,
...     search_spaces = search_spaces,
...     use_all_features = False,
...     n_folds = 5,
...     stratified = True,
...     ensemble_id = 'Ensemble',
...     stagnation = -1
... )
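Because bigger scores must mean better models, an error metric has to be inverted before it can serve as score_function. The sketch below does this for the root mean squared error in plain Python. It assumes that the truth and the predictions arrive as equal-length sequences of numbers; negative_rmse is a hypothetical name, not something miraiml provides.

```python
import math


def negative_rmse(y_true, y_pred):
    """Score function: truth first, predictions second, bigger is better."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    # Negate the error so that a smaller error produces a bigger score
    return -rmse


truth = [1.0, 2.0, 3.0]
perfect = negative_rmse(truth, [1.0, 2.0, 3.0])
worse = negative_rmse(truth, [2.0, 3.0, 4.0])
print(perfect > worse)
```

Such a function would be passed as score_function in a regression configuration, in the same position where the example above passes roc_auc_score.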
miraiml.Engine
class miraiml.Engine(config, on_improvement=None)
This class offers the controls for the engine.
Parameters:
- config (miraiml.Config) – The configurations for the behavior of the engine.
- on_improvement (function, optional, default=None) – A function that will be executed every time the engine finds an improvement for some id. It must receive a status parameter, which is the return value of the method request_status() (an instance of miraiml.Status).
Raises: TypeError if config is not an instance of miraiml.Config or on_improvement (if provided) is not callable.
Example:
>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.tree import DecisionTreeClassifier
>>> from miraiml import SearchSpace, Config, Engine
>>> search_spaces = [
...     SearchSpace('Naive Bayes', GaussianNB),
...     SearchSpace('Decision Tree', DecisionTreeClassifier)
... ]
>>> config = Config(
...     local_dir = 'miraiml_local',
...     problem_type = 'classification',
...     score_function = roc_auc_score,
...     search_spaces = search_spaces,
...     ensemble_id = 'Ensemble'
... )
>>> def on_improvement(status):
...     print('Scores:', status.scores)
>>> engine = Engine(config, on_improvement=on_improvement)
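An on_improvement callback can do more than print. The sketch below keeps a running log of the best id and its score, relying only on the best_id and scores attributes documented for miraiml.Status. The SimpleNamespace object is a hypothetical stand-in used to exercise the callback without running an engine, and its score values are made up for illustration.

```python
from types import SimpleNamespace

improvements = []


def on_improvement(status):
    """Record the best id and its score each time the engine improves."""
    best_id = status.best_id
    improvements.append((best_id, status.scores[best_id]))


# Stand-in for a miraiml.Status object (illustrative values only)
fake_status = SimpleNamespace(
    best_id='Ensemble',
    scores={'Naive Bayes': 0.81, 'Decision Tree': 0.78, 'Ensemble': 0.84},
)

on_improvement(fake_status)
print(improvements)
```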
- is_running: Tells whether the engine is running or not.
- interrupt: Makes the engine stop on the first opportunity.
- load_train_data: Interrupts the engine and loads the train dataset.
- load_test_data: Interrupts the engine and loads the test dataset.
- shuffle_train_data: Interrupts the engine and shuffles the training data.
- reconfigure: Interrupts the engine and loads a new configuration.
- restart: Interrupts the engine and starts again from last checkpoint (if any).
- request_status: Queries the current status of the engine.
is_running()
Tells whether the engine is running or not.
Return type: bool
Returns: True if the engine is running and False otherwise.
interrupt()
Makes the engine stop on the first opportunity.
Note
This method is not asynchronous. It will wait until the engine stops.
load_train_data(train_data, target_column, restart=False)
Interrupts the engine and loads the train dataset. All of its column names must be instances of either str or int.
Warning
Loading new training data will always trigger the loss of history for optimization.
Parameters:
- train_data (pandas.DataFrame) – The training data.
- target_column (str or int) – The target column identifier.
- restart (bool, optional, default=False) – Whether to restart the engine after updating data or not.
Raises: TypeError if train_data is not an instance of pandas.DataFrame.
Raises: ValueError if target_column is not a column of train_data or if some column name is of a prohibited type.
load_test_data(test_data, restart=False)
Interrupts the engine and loads the test dataset. All of its columns must be columns in the train data.
The test dataset is the one for which we don’t have the values for the target column. This method should be used to load data in production.
Warning
This method can only be called after miraiml.Engine.load_train_data().
Parameters:
- test_data (pandas.DataFrame, optional, default=None) – The testing data. Use the default value if you don’t need to make predictions for data with unknown labels.
- restart (bool, optional, default=False) – Whether to restart the engine after loading data or not.
Raises: RuntimeError if this method is called before loading the train data.
Raises: ValueError if the column names are not consistent.
clean_test_data(restart=False)
Cleans the test data from the buffer.
Note
Keep in mind that if you don’t intend to make predictions for unlabeled data, the engine will run faster with a clean test data buffer.
Parameters: restart (bool, optional, default=False) – Whether to restart the engine after cleaning test data or not.
shuffle_train_data(restart=False)
Interrupts the engine and shuffles the training data.
Parameters: restart (bool, optional, default=False) – Whether to restart the engine after shuffling data or not.
Raises: RuntimeError if the engine has no data loaded.
Note
It’s a good practice to shuffle the training data periodically to avoid overfitting on a particular folding pattern.
reconfigure(config, restart=False)
Interrupts the engine and loads a new configuration.
Warning
Reconfiguring the engine will always trigger the loss of history for optimization.
Parameters:
- config (miraiml.Config) – The configurations for the behavior of the engine.
- restart (bool, optional, default=False) – Whether to restart the engine after reconfiguring it or not.
restart()
Interrupts the engine and starts again from the last checkpoint (if any). It is also used to start the engine for the first time.
Raises: RuntimeError if no data is loaded.
request_status()
Queries the current status of the engine.
Return type: miraiml.Status
Returns: The current status of the engine in the form of a dictionary. If no score has been computed yet, returns None.
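The methods above are typically chained: start, wait, stop, inspect. The sketch below wraps that sequence in a hypothetical helper, run_for, which uses only the documented method names restart, interrupt and request_status. It assumes that restart() returns immediately while optimization continues in the background, which the note on interrupt() suggests but the text does not state outright. StubEngine is a stand-in that mimics those method names so the helper can be exercised without data or a real engine.

```python
import time


def run_for(engine, seconds):
    """Start the engine, let it work for a while, then stop it and report."""
    engine.restart()       # also used to start the engine for the first time
    time.sleep(seconds)    # assumed: optimization proceeds in the background
    engine.interrupt()     # waits until the engine has actually stopped
    return engine.request_status()  # None if no score has been computed yet


class StubEngine:
    """Hypothetical stand-in that mimics the documented method names."""

    def __init__(self):
        self.calls = []
        self.running = False

    def restart(self):
        self.calls.append('restart')
        self.running = True

    def interrupt(self):
        self.calls.append('interrupt')
        self.running = False

    def is_running(self):
        return self.running

    def request_status(self):
        self.calls.append('request_status')
        return None


stub = StubEngine()
status = run_for(stub, seconds=0.01)
print(stub.calls)
```

With a real engine, the training data would have to be loaded with load_train_data before calling the helper, since restart() raises RuntimeError when no data is loaded.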
miraiml.Status
class miraiml.Status(**kwargs)
Represents the current status of the engine. Objects of this class are not supposed to be instantiated by the user. Rather, they are returned by the miraiml.Engine.request_status() method.
The following attributes are accessible:
- best_id: the id of the best base model (or ensemble)
- scores: a dictionary containing the current score of each id
- train_predictions: a pandas.DataFrame object containing the predictions for the train data for each id
- test_predictions: a pandas.DataFrame object containing the predictions for the test data for each id
- ensemble_weights: a dictionary containing the ensemble weights for each base model id
- base_models: a dictionary containing the characteristics of each base model (accessed by its respective id)
- histories: a dictionary of pandas.DataFrame objects for each id, containing the history of base model attempts and their respective scores. Hyperparameter columns end with the '__(hyperparameter)' suffix and feature columns end with the '__(feature)' suffix. The score column can be accessed with the key 'score'. For more information, please check the User Guide.
The characteristics of each base model are represented by dictionaries containing the following keys:
- 'model_class': The name of the base model’s modeling class
- 'parameters': The dictionary of hyperparameter values
- 'features': The list of features used
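The suffix convention used in histories makes it easy to tell hyperparameter columns from feature columns. The sketch below works on a plain list of column names; the helper name and the sample column names are hypothetical, and only the two suffixes and the 'score' key come from the description above.

```python
def split_history_columns(columns):
    """Separate history column names by their documented suffixes."""
    hyperparameters = [c for c in columns if c.endswith('__(hyperparameter)')]
    features = [c for c in columns if c.endswith('__(feature)')]
    return hyperparameters, features


# Hypothetical column names of one history DataFrame
columns = [
    'max_depth__(hyperparameter)',
    'age__(feature)',
    'income__(feature)',
    'score',
]

hyperparameters, features = split_history_columns(columns)
print(hyperparameters)
print(features)
```

With a real status object, the column names could be taken from one of the DataFrames in status.histories, for example with list(history.columns).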
build_report(include_features=False)
Returns the report of the current status of the engine in a formatted string.
Parameters: include_features (bool, optional, default=False) – Whether to include the list of features on the report or not (may cause some visual mess).
Return type: str
Returns: The formatted report.
miraiml.pipeline
miraiml.pipeline contains a function that lets you build your own pipeline classes. It also contains a few pre-defined pipelines for baselines.
compose | A function that defines pipeline classes dynamically.
NaiveBayesBaseliner | This is a baseline pipeline for classification problems.
LinearRegressionBaseliner | This is a baseline pipeline for regression problems.
miraiml.pipeline.compose(steps)
A function that defines pipeline classes dynamically. It builds a pipeline class that can be instantiated with particular parameters for each of its transformers/estimator without needing to call set_params as you would do with scikit-learn’s Pipeline when performing hyperparameter optimization.
Similarly to scikit-learn’s Pipeline, steps is a list of tuples containing an alias and the respective pipeline element. However, since this function is a class factory, you shouldn’t instantiate the transformer/estimator as you would do with scikit-learn’s Pipeline. Thus, this is how compose() should be called:
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.preprocessing import StandardScaler
>>> from miraiml.pipeline import compose
>>> MyPipelineClass = compose(
...     steps = [
...         ('scaler', StandardScaler), # StandardScaler instead of StandardScaler()
...         ('rfc', RandomForestClassifier) # No instantiation either
...     ]
... )
And then, in order to instantiate MyPipelineClass with the desired parameters, you just need to refer to them as a concatenation of their respective class aliases and their names, separated by '__'.
>>> pipeline = MyPipelineClass(scaler__with_mean=False, rfc__max_depth=3)
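The alias__name convention can be illustrated by splitting each keyword at its first '__' and grouping the result per alias. The sketch below shows the naming scheme only; group_by_alias is a hypothetical helper and makes no claim about how miraiml routes the parameters internally.

```python
def group_by_alias(**params):
    """Group 'alias__parameter' keyword arguments by their alias."""
    grouped = {}
    for name, value in params.items():
        # Aliases cannot contain '__', so the first '__' is the separator
        alias, _, parameter = name.partition('__')
        grouped.setdefault(alias, {})[parameter] = value
    return grouped


grouped = group_by_alias(scaler__with_mean=False, rfc__max_depth=3)
print(grouped)
```

Splitting at the first '__' is unambiguous precisely because aliases are not allowed to contain that separator.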
If you want to know which parameters you’re allowed to play with, just call get_params:
>>> params = pipeline.get_params()
>>> print("\n".join(params))
scaler__with_mean
scaler__with_std
rfc__bootstrap
rfc__class_weight
rfc__criterion
rfc__max_depth
rfc__max_features
rfc__max_leaf_nodes
rfc__min_impurity_decrease
rfc__min_impurity_split
rfc__min_samples_leaf
rfc__min_samples_split
rfc__min_weight_fraction_leaf
rfc__n_estimators
rfc__n_jobs
rfc__oob_score
rfc__random_state
rfc__verbose
rfc__warm_start
You can check the available methods for your instantiated pipelines in the documentation for miraiml.core.BasePipelineClass, which is the class from which the composed classes inherit.
The intended purpose of such pipeline classes is that they can work as base models to build instances of miraiml.SearchSpace.
>>> from miraiml import SearchSpace
>>> search_space = SearchSpace(
...     id='MyPipelineClass',
...     model_class=MyPipelineClass,
...     parameters_values=dict(
...         scaler__with_mean=[True, False],
...         scaler__with_std=[True, False],
...         rfc__max_depth=[3, 4, 5, 6]
...     )
... )
Parameters: steps (list) – The list of pairs (alias, class) to define the pipeline.
Warning
Repeated aliases are not allowed and none of the aliases can start with numbers or contain '__'.
The classes used to compose a pipeline must implement get_params and set_params, such as scikit-learn’s classes, or compose() will break.
Return type: type
Returns: The composed pipeline class.
Raises: TypeError if an alias is not a string.
Raises: ValueError if an alias has an invalid name.
Raises: NotImplementedError if some class of the pipeline does not implement the required methods.
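The alias rules can be checked before calling compose(). The sketch below is a hypothetical pre-check in plain Python that mirrors the documented rules and exception types; it is not the validation code of miraiml, and it treats "start with numbers" as "the first character is a digit", which is an assumption. The object placeholders stand in for real transformer and estimator classes.

```python
def check_aliases(steps):
    """Raise if an alias breaks the documented rules for compose()."""
    seen = set()
    for alias, _ in steps:
        if not isinstance(alias, str):
            raise TypeError('alias must be a string: {!r}'.format(alias))
        if alias in seen:
            raise ValueError('repeated alias: ' + alias)
        # Assumption: "start with numbers" means a leading digit
        if alias[:1].isdigit() or '__' in alias:
            raise ValueError('invalid alias: ' + alias)
        seen.add(alias)


def aliases_are_valid(steps):
    """Return True if check_aliases accepts the steps, False otherwise."""
    try:
        check_aliases(steps)
    except (TypeError, ValueError):
        return False
    return True


print(aliases_are_valid([('scaler', object), ('rfc', object)]))
print(aliases_are_valid([('1st', object)]))
print(aliases_are_valid([('a__b', object)]))
print(aliases_are_valid([('rfc', object), ('rfc', object)]))
```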
class miraiml.pipeline.NaiveBayesBaseliner
This is a baseline pipeline for classification problems. It’s composed of the following transformers/estimator:
- sklearn.preprocessing.OneHotEncoder
- sklearn.impute.SimpleImputer
- sklearn.preprocessing.MinMaxScaler
- sklearn.naive_bayes.GaussianNB
The available parameters to tweak are:
>>> from miraiml.pipeline import NaiveBayesBaseliner
>>> for param in NaiveBayesBaseliner().get_params():
...     print(param)
...
ohe__categorical_features
ohe__categories
ohe__drop
ohe__dtype
ohe__handle_unknown
ohe__n_values
ohe__sparse
impute__add_indicator
impute__fill_value
impute__missing_values
impute__strategy
impute__verbose
min_max__feature_range
naive__priors
naive__var_smoothing
class miraiml.pipeline.LinearRegressionBaseliner
This is a baseline pipeline for regression problems. It’s composed of the following transformers/estimator:
- sklearn.preprocessing.OneHotEncoder
- sklearn.impute.SimpleImputer
- sklearn.preprocessing.MinMaxScaler
- sklearn.linear_model.LinearRegression
The available parameters to tweak are:
>>> from miraiml.pipeline import LinearRegressionBaseliner
>>> for param in LinearRegressionBaseliner().get_params():
...     print(param)
...
ohe__categorical_features
ohe__categories
ohe__drop
ohe__dtype
ohe__handle_unknown
ohe__n_values
ohe__sparse
impute__add_indicator
impute__fill_value
impute__missing_values
impute__strategy
impute__verbose
min_max__feature_range
lin_reg__fit_intercept
lin_reg__n_jobs
lin_reg__normalize