crandas.crlearn¶
The module crandas.crlearn provides the following functionality:
Machine learning models:
General:
Model interface¶
- class crandas.crlearn.model.CModel(instance=None)¶
Bases:
object
Base class for machine learning models stored at the server
The API for crandas machine learning models is similar to that of scikit-learn estimators. Functions such as .fit() and set_params() are applied in-place to the CModel. Internally, the CModel has a field instance that refers to a stateobject.StateObject at the server and that is updated when such functions are applied.
Similarly to scikit-learn, models have parameters (set by the user) and attributes (set based on fitting the model on data). Typically, at least some of the attributes are encrypted (e.g., fitted model parameters). The encrypted attributes can be retrieved by opening the model using open().
- property attributes¶
Retrieve attributes for the estimator.
Only available for fitted estimators. Encrypted attributes are set to None. To retrieve their values, use open().
- classmethod from_opened(params_attributes, **query_args)¶
Upload model to the server
Should be called on an instance of the final model class, e.g., linear_model.CLinearRegression.
- Parameters:
params_attributes (dict) – Parameters and attributes as returned by open()
- property handle¶
Return the handle of the current instance of the model
- instance = None¶
Current model instance (updated by .fit(), set_params(), etc.)
- open()¶
Download the model
The model is returned as a dictionary of parameters and attributes.
- property params¶
Retrieve parameters for the estimator.
Note that the parameters can be set using set_params().
- save(name=None, **query_args)¶
Save the (current instance of the) model. See stateobject.StateObject.save().
- set_params(**params_and_query_args)¶
Set parameters of the estimator
- Parameters:
params_and_query_args (dict) – Model parameters (see documentation for specific model) or query arguments (see Query Arguments)
Metrics¶
- crandas.crlearn.metrics.classification_accuracy(y, y_pred, n_classes=2, **query_args)¶
Compute the classification accuracy on class predictions
- Parameters:
y (CDataFrame) – column with the actual values in range
y_pred (CDataFrame) – column with the predictions in range
n_classes (int) – number of classes (default = 2)
query_args – See Query Arguments
- Returns:
fixed point number between 0 and 1
- Return type:
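As a plaintext point of reference, classification accuracy is the fraction of records whose predicted class equals the actual class. The sketch below computes it on ordinary Python lists. The helper name accuracy_plain is hypothetical and not part of crandas, which evaluates the metric on encrypted columns at the server.

```python
def accuracy_plain(y, y_pred):
    """Fraction of positions where the prediction equals the actual class."""
    if len(y) != len(y_pred) or not y:
        raise ValueError("y and y_pred must be non-empty and of equal length")
    matches = sum(1 for actual, predicted in zip(y, y_pred) if actual == predicted)
    return matches / len(y)


# Three of the four predictions are correct
print(accuracy_plain([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```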
- crandas.crlearn.metrics.confusion_matrix(y, y_pred, n_classes=2, **query_args)¶
Compute the confusion matrix on class predictions
The y-axis of the result represents the true class; the x-axis represents the predicted class.
- Parameters:
y (CDataFrame) – column with the actual values in range
y_pred (CDataFrame) – column with the predictions in range
n_classes (int) – number of classes (default = 2)
query_args – See Query Arguments
- Returns:
matrix of size n_classes * n_classes
- Return type:
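The layout described above (true class on the y-axis, predicted class on the x-axis) can be illustrated on plaintext data. This is a minimal sketch, assuming classes are encoded as integers 0, ..., n_classes - 1; confusion_matrix_plain is a hypothetical helper, not a crandas function.

```python
def confusion_matrix_plain(y, y_pred, n_classes=2):
    """Return a matrix where matrix[t][p] counts records with true class t
    and predicted class p (rows: true class, columns: predicted class)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_class, predicted_class in zip(y, y_pred):
        matrix[true_class][predicted_class] += 1
    return matrix


print(confusion_matrix_plain([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))  # [[1, 1], [1, 2]]
```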
- crandas.crlearn.metrics.mcfadden_r2(model, X, y, **query_args)¶
Compute the McFadden R^2 metric
- Parameters:
model (LogisticModel) – logistic regression model
X (CDataFrame) – predictor variables
y (CDataFrame) – binary response variable (should have only 1 column)
query_args – See Query Arguments
- Returns:
fixed point number between 0 and 1
- Return type:
- crandas.crlearn.metrics.model_deviance(model, X, y, **query_args)¶
Compute the model deviance
- Parameters:
model (LogisticModel) – logistic regression model
X (CDataFrame) – predictor variables
y (CDataFrame) – binary response variable (should have only 1 column)
query_args – See Query Arguments
- Returns:
fixed point number between 0 and 1
- Return type:
- crandas.crlearn.metrics.null_deviance(y, **query_args)¶
Compute the null deviance
- Parameters:
y (CDataFrame) – binary response variable (should have only 1 column)
query_args – See Query Arguments
Note: both classes need to be present in y; otherwise the computation is undefined internally (logarithm of 0).
- Returns:
fixed point number between 0 and 1
- Return type:
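The relation between the model deviance, the null deviance and McFadden's R^2 can be sketched on plaintext data. The code below uses the textbook definitions (deviance as -2 times the Bernoulli log-likelihood, McFadden's R^2 as 1 minus the ratio of model deviance to null deviance). Whether crandas applies any additional scaling is an assumption not settled here, and all function names are hypothetical.

```python
import math


def deviance_plain(y, p):
    """-2 times the Bernoulli log-likelihood of binary labels y under
    predicted probabilities p."""
    return -2.0 * sum(
        math.log(pi) if yi == 1 else math.log(1.0 - pi)
        for yi, pi in zip(y, p)
    )


def null_deviance_plain(y):
    """Deviance of the intercept-only model, which predicts mean(y) for
    every record."""
    p0 = sum(y) / len(y)
    if p0 in (0.0, 1.0):
        # With only one class present, y*log(p) + (1-y)*log(1-p) hits log(0)
        raise ValueError("both classes must be present in y")
    return deviance_plain(y, [p0] * len(y))


def mcfadden_r2_plain(y, p):
    """1 - (model deviance / null deviance)."""
    return 1.0 - deviance_plain(y, p) / null_deviance_plain(y)


y = [0, 0, 1, 1]
p = [0.2, 0.3, 0.7, 0.9]  # probabilities from some fitted model (made up here)
print(null_deviance_plain(y), deviance_plain(y, p), mcfadden_r2_plain(y, p))
```

A model that is no better than always predicting the class frequency gives a McFadden R^2 close to 0; a model whose probabilities approach the true labels gives a value close to 1.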
- crandas.crlearn.metrics.precision_recall(y, y_pred, **query_args)¶
Compute the precision and recall on predictions
- Parameters:
y (CDataFrame) – column with the actual values (binary)
y_pred (CDataFrame) – column with the predictions (binary)
query_args – See Query Arguments
- Returns:
two fixed point numbers between 0 and 1
- Return type:
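For reference, precision is the share of predicted positives that are actually positive, and recall is the share of actual positives that were predicted as positive. A plaintext sketch with a hypothetical helper name follows; it assumes at least one predicted positive and one actual positive, and how crandas treats the degenerate cases is not specified here.

```python
def precision_recall_plain(y, y_pred):
    """Return (precision, recall) for binary labels and binary predictions."""
    pairs = list(zip(y, y_pred))
    true_pos = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
    false_pos = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)
    false_neg = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# 2 true positives, 1 false positive, 2 false negatives
precision, recall = precision_recall_plain([1, 1, 1, 0, 0, 1], [1, 0, 0, 1, 0, 1])
print(precision, recall)
```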
- crandas.crlearn.metrics.score_r2(y, y_pred, **query_args)¶
Compute the R^2 metric on predictions
- Parameters:
y (CDataFrame) – column with the actual values
y_pred (CDataFrame) – column with the predictions
query_args – See Query Arguments
- Returns:
fixed point number of at most 1
- Return type:
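The R^2 metric compares the squared prediction errors to the variation of the actual values around their mean. A minimal plaintext sketch, using the standard definition 1 - SS_res / SS_tot and a hypothetical helper name:

```python
def r2_plain(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((actual - pred) ** 2 for actual, pred in zip(y, y_pred))
    ss_tot = sum((actual - mean_y) ** 2 for actual in y)
    return 1.0 - ss_res / ss_tot


good = r2_plain([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])  # close to 1
bad = r2_plain([1, 2, 3, 4], [4, 3, 2, 1])           # worse than predicting the mean
print(good, bad)
```

The score equals 1 for perfect predictions and becomes negative when the predictions are worse than always predicting the mean of y.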
- crandas.crlearn.metrics.tjur_r2(y, y_pred, **query_args)¶
Compute the Tjur R^2 metric on predictions
- Parameters:
y (CDataFrame) – column with the actual values (binary)
y_pred (CDataFrame) – column with the predictions (probabilities!)
query_args – See Query Arguments
- Returns:
fixed point number between -1 and 1
- Return type:
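Tjur's R^2 (the coefficient of discrimination) is the mean predicted probability over the records with actual value 1 minus the mean predicted probability over the records with actual value 0. A plaintext sketch under that standard definition, with a hypothetical helper name:

```python
def tjur_r2_plain(y, p):
    """Mean predicted probability of the positives minus that of the negatives."""
    positives = [pi for yi, pi in zip(y, p) if yi == 1]
    negatives = [pi for yi, pi in zip(y, p) if yi == 0]
    return sum(positives) / len(positives) - sum(negatives) / len(negatives)


score = tjur_r2_plain([0, 0, 1, 1], [0.2, 0.3, 0.7, 0.9])
print(score)  # approximately 0.55
```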
Utility¶
- crandas.crlearn.utils.min_max_normalize(table, columns=None, **query_args)¶
Apply min-max normalization on columns of a table, to get values in [0, 1]
- Parameters:
table (CDataFrame) – table to normalize
columns (list of strings, optional) – columns that should be normalized. If None (the default), all columns are normalized. Columns that are not in this list remain untouched.
query_args – See Query Arguments
- Returns:
new table with normalized columns
- Return type:
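Min-max normalization maps each value v of a column to (v - min) / (max - min). The sketch below applies this to a plain dictionary of lists rather than a CDataFrame. The helper name is hypothetical, and raising an error for constant columns is a choice of this sketch, not documented crandas behaviour.

```python
def min_max_normalize_plain(table, columns=None):
    """table maps column names to lists of numbers; returns a new dict in
    which the selected columns are rescaled to [0, 1]."""
    selected = list(table) if columns is None else columns
    result = {name: list(values) for name, values in table.items()}
    for name in selected:
        lowest, highest = min(table[name]), max(table[name])
        if highest == lowest:
            raise ValueError(f"column {name!r} is constant; scaling is undefined")
        result[name] = [(v - lowest) / (highest - lowest) for v in table[name]]
    return result


table = {"age": [20, 30, 40], "income": [1000, 3000, 2000]}
out = min_max_normalize_plain(table, columns=["age"])
print(out)  # {'age': [0.0, 0.5, 1.0], 'income': [1000, 3000, 2000]}
```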
Linear regression¶
- class crandas.crlearn.linear_model.CLinearRegression(instance=None)¶
Bases:
CModel
Linear ridge regression model corresponding to the scikit-learn Ridge class (see the scikit-learn documentation).
Parameters:
alpha: regularization strength (see scikit-learn documentation); defaults to 1.0
Attributes:
n_features_in_: number of input features
feature_names_in_: input feature names
beta_: (encrypted) fitted parameters (intercept and respective feature coefficients)
- fit(X, y, **query_args)¶
Fit a Linear Regression model on the data
- Parameters:
X (CDataFrame) – Training data
y (CDataFrame) – Target data (should have only 1 column)
query_args – See Query Arguments
- Return type:
self
- get_beta(**query_args)¶
Get the fitted parameters (i.e., intercept and feature coefficients) as a table
This function is deprecated; instead, use CModel.open() to open the model, and use the returned beta_ attribute.
- predict(X, **query_args)¶
Make predictions on a dataset using a linear regression model
Note: this returns predictions on the target, not probabilities!
- Parameters:
X (CDataFrame) – predictor variables
query_args – See Query Arguments
- Returns:
table containing the column consisting of the predicted target values
- Return type:
- score(X, y, **query_args)¶
Score the linear regression model using the R^2 metric
- Parameters:
X (CDataFrame) – Test data
y (CDataFrame) – Target test data (should have only 1 column)
query_args – See Query Arguments
- Return type:
self
- crandas.crlearn.linear_model.LinearRegression(alpha=0.0, *, fit_intercept=True, copy_X=True, n_jobs=None, positive=False, **params_and_query_args)¶
Create a new linear regression model (CLinearRegression) with given alpha (0.0 by default).
Other parameters are for compatibility with scikit-learn and cannot be overridden.
- crandas.crlearn.linear_model.Ridge(alpha=1.0, *, fit_intercept=True, copy_X=True, max_iter=None, tol=None, solver='cholesky', positive=False, random_state=None, **params_and_query_args)¶
Create a new ridge regression model (CLinearRegression) with given alpha (1.0 by default).
Other parameters are for compatibility with scikit-learn and cannot be overridden.
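The effect of alpha can be seen in a one-feature plaintext sketch. It follows the scikit-learn Ridge convention, in which the slope is penalized and the intercept is not; that crandas uses the same convention internally is an assumption, and ridge_fit_1d is a hypothetical helper. With alpha = 0.0 the result is the ordinary least-squares fit; a larger alpha shrinks the slope towards zero.

```python
def ridge_fit_1d(x, y, alpha):
    """Minimize sum((y - b0 - b1*x)^2) + alpha * b1^2 for a single feature.

    Returns (intercept, slope). The intercept is not penalized.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / (sxx + alpha)
    intercept = y_mean - slope * x_mean
    return intercept, slope


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]  # exactly y = 1 + 2x

print(ridge_fit_1d(x, y, alpha=0.0))  # (1.0, 2.0): ordinary least squares
print(ridge_fit_1d(x, y, alpha=1.0))  # slope shrinks to about 1.667
```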
Logistic regression¶
- class crandas.crlearn.logistic_regression.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=10, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None, classes=[], n_classes=2)¶
Bases:
object
Logistic regression classifier with the same parameters as the scikit-learn LogisticRegression class.
See the scikit-learn documentation for its parameters.
- fit(X, y, sample_weight=None, max_iter=None, warm_start=None, **query_args)¶
Fit a Logistic Regression model on the data
- Parameters:
X (CDataFrame) – predictor variables
y (CDataFrame) – response variable (should have only 1 column); the column should be of integer type
sample_weight – array of weights assigned to individual samples (not yet supported)
max_iter (int) – deviation from scikit-learn (see note below)
warm_start (bool) – deviation from scikit-learn (see note below). If True, successive fits continue the approximation from where the previous fit stopped; otherwise, each successive fit starts from scratch.
query_args – See Query Arguments
- Returns:
self
- Return type:
Notes
Note
Compared to scikit-learn, fit() has the additional parameters max_iter and warm_start. Scikit-learn treats max_iter and warm_start as object configurations which are set at construction and cannot be changed. We prefer to give the user the freedom of deviating from the global setting in successive calls to fit(). The corresponding class attributes are used as default values for each call to fit().
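The per-call override pattern described in this note can be sketched with a stand-in class. FitDefaultsSketch is hypothetical and does not fit anything; a counter takes the place of the fitted state, so that the interaction between the constructor defaults and the arguments of fit() is visible.

```python
class FitDefaultsSketch:
    """Hypothetical stand-in (not the crandas class) for per-call overrides."""

    def __init__(self, max_iter=10, warm_start=False):
        self.max_iter = max_iter
        self.warm_start = warm_start
        self.iterations_done = 0  # stands in for the fitted state

    def fit(self, max_iter=None, warm_start=None):
        # Fall back to the constructor settings when no per-call value is given
        max_iter = self.max_iter if max_iter is None else max_iter
        warm_start = self.warm_start if warm_start is None else warm_start
        if not warm_start:
            self.iterations_done = 0  # start from scratch
        self.iterations_done += max_iter  # continue from the current state
        return self


model = FitDefaultsSketch(max_iter=10)
model.fit()                             # 10 iterations from scratch
model.fit(max_iter=5, warm_start=True)  # 5 more, continuing the previous fit
print(model.iterations_done)            # 15
model.fit()                             # defaults again: restart with 10 iterations
print(model.iterations_done)            # 10
```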
- get_beta(**kwargs)¶
Get the fitted parameters (i.e. intercept_ and coef_ combined in 1 table named beta).
- predict(X, decision_boundary=0.5, **query_args)¶
Make (binary) predictions on a dataset using a logistic regression model
Note: this returns binary predictions, not probabilities!
- Parameters:
X (CDataFrame) – predictor variables
decision_boundary (float) – number between 0 and 1; records with a predicted probability below this value are classified as 0, all other records as 1
query_args – See Query Arguments
- Returns:
column consisting of the predicted classes (binary)
- Return type:
- predict_proba(X, **query_args)¶
Make (probability) predictions on a dataset using a logistic regression model
Note: this returns probabilities, not binary predictions
- Parameters:
X (CDataFrame) – predictor variables
query_args – See Query Arguments
- Returns:
column consisting of the predicted probabilities
- Return type:
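The difference between predict_proba() and predict() can be illustrated on plaintext data: the probability is the logistic function of the linear predictor, and the class is obtained by comparing that probability to the decision boundary. The function names below are hypothetical and the coefficients are made up for the example.

```python
import math


def predict_proba_plain(rows, intercept, coefs):
    """Logistic function applied to intercept + sum(coef * feature) per row."""
    probabilities = []
    for row in rows:
        z = intercept + sum(c * v for c, v in zip(coefs, row))
        probabilities.append(1.0 / (1.0 + math.exp(-z)))
    return probabilities


def predict_plain(rows, intercept, coefs, decision_boundary=0.5):
    """Class 0 below the decision boundary, class 1 at or above it."""
    probabilities = predict_proba_plain(rows, intercept, coefs)
    return [1 if p >= decision_boundary else 0 for p in probabilities]


rows = [[-2.0], [0.0], [2.0]]
print(predict_proba_plain(rows, intercept=0.0, coefs=[1.0]))
print(predict_plain(rows, intercept=0.0, coefs=[1.0]))                         # [0, 1, 1]
print(predict_plain(rows, intercept=0.0, coefs=[1.0], decision_boundary=0.9))  # [0, 0, 0]
```

The middle record has probability exactly 0.5, so with the default boundary it is classified as 1.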
- class crandas.crlearn.logistic_regression.LogisticRegressionStateObject(reg_type=None, **kwargs)¶
Bases:
StateObject
Random forest classifier¶
- class crandas.crlearn.ensemble.CRandomForestClassifier(instance=None)¶
Bases:
CModel
Random forest classifier
Note
Do not instantiate directly by calling CRandomForestClassifier(...). Instead, use the RandomForestClassifier() function.
Random forest classifier with an interface similar to scikit-learn’s RandomForestClassifier. See https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
Features can be ordinal (node “val <= T”) or categorical (node “val == T”), as specified when fitting. Output labels are categorical values 0, 1, …, M, where the number M is derived from the column metadata, e.g., use labels.astype({"labels": "int[min=0,max=2]"}) to set M=2.
Configurable parameters:
n_estimators: number of trees (integer; default: 10)
max_depth: depth (number of layers of internal nodes) per tree (integer; default: 4)
bootstrap: whether to use bootstrapping, i.e., training respective trees on respective samples (drawn with replacement) from the input data (boolean; default: True)
max_features: number of features to consider per split (integer number/float fraction/“sqrt”/“log2”/“all”; default: “sqrt”)
max_samples: number of samples per tree if bootstrapping is used (integer number/float fraction; default: 0.3)
Other sklearn parameters are either not applicable to the current implementation or cannot be changed from their defaults; see RandomForestClassifier() for details.
Attributes of fitted models:
n_features_in: number of input features
feature_names_in: input feature names
feature_types_in: column types of input feature columns
n_classes: number of classes
feature_name_out: name of output feature
depths: depths of respective trees
nodes_featureids: (encrypted) one-hot encoded features selected per internal node
nodes_values: (encrypted) threshold values per internal node
nodes_modes: (encrypted) mode per internal node: discrete or continuous
class_weights: (encrypted) class probabilities per leaf node
Implementation notes:
The implemented random forest classifier uses the same training technique as sklearn, except that trees are always trained up to their maximum depth max_depth (for this reason, options min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_leaf_nodes, min_impurity_decrease are not applicable).
- fit(X, y, categorical_features=None, max_categories=None, **query_args)¶
Build a forest of trees from the training set
The dataset X can contain any mix of ordinal features (nodes val <= T) and categorical features (nodes val == T). Features are considered ordinal unless specified in the list of categorical features. Fitting of categorical features is considerably more efficient but costs more memory, so the number of different categories is limited by the server. This limit can be overridden using the max_categories parameter.
- Parameters:
X (CDataFrame) – Training data
y (CDataFrame) – Target data (should have only 1 column)
categorical_features (None or list of str) – Column names of columns that are considered to contain categorical features
max_categories (None or int) – Maximum number of categories per categorical feature
query_args – See Query Arguments
- Return type:
self
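The two node types mentioned above (val <= T for ordinal features, val == T for categorical features) can be illustrated with a plaintext walk through a single tree. The sketch assumes a complete binary tree stored in heap order, in line with trees always being grown to max_depth, and assumes that a satisfied test sends the record to the left child. Both the storage layout and the branch direction are choices of this sketch and do not describe how crandas stores its (encrypted) trees.

```python
def node_test(value, threshold, categorical):
    """Evaluate one internal node: 'val == T' if categorical, else 'val <= T'."""
    return value == threshold if categorical else value <= threshold


def predict_tree(record, nodes, leaves, depth):
    """Walk a complete binary tree of the given depth.

    nodes[i] is (feature_name, threshold, categorical) for internal node i,
    with the children of node i at positions 2*i + 1 and 2*i + 2.
    leaves[j] is the class label of the j-th leaf, counted from left to right.
    """
    index = 0
    for _ in range(depth):
        feature, threshold, categorical = nodes[index]
        passed = node_test(record[feature], threshold, categorical)
        index = 2 * index + (1 if passed else 2)
    return leaves[index - len(nodes)]


# Depth 2: three internal nodes, four leaves, labels in 0..2
nodes = [("age", 30, False), ("color", 1, True), ("age", 50, False)]
leaves = [0, 1, 2, 1]

print(predict_tree({"age": 25, "color": 1}, nodes, leaves, depth=2))  # 0
print(predict_tree({"age": 25, "color": 2}, nodes, leaves, depth=2))  # 1
print(predict_tree({"age": 40, "color": 1}, nodes, leaves, depth=2))  # 2
```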
- open_to_graphs(**query_args)¶
Open the random forest into a list of pydot.Dot instances
To plot a single tree, you can select a particular index in the returned list, for example:
    graph = model.open_to_graphs()[0]
    graph.write_png("out.png")
    from IPython.display import Image
    Image("out.png")
- predict(X, **query_args)¶
Predict class for X
- Parameters:
X (CDataFrame) – predictor variables
query_args – See Query Arguments
- Returns:
table with predicted class per input record
- Return type:
- predict_proba(X, **query_args)¶
Predict class probabilities for X
- Parameters:
X (CDataFrame) – predictor variables
query_args – See Query Arguments
- Returns:
table with columns representing predicted class probabilities per input record
- Return type:
- crandas.crlearn.ensemble.RandomForestClassifier(n_estimators=10, *, max_depth=4, bootstrap=True, max_features='sqrt', max_samples=0.3, criterion='gini', random_state=None, warm_start=False, class_weight=None, ccp_alpha=0.0, monitonic_cst=0, **query_args)¶
Create a new random forest classifier (class CRandomForestClassifier) with the given parameters.
The parameters n_estimators, max_depth, bootstrap, max_features, and max_samples can be changed. See CRandomForestClassifier for their meaning.
The parameters criterion, random_state, warm_start, class_weight, ccp_alpha, monotonic_cst have the same meaning as in sklearn but cannot be changed from their defaults.
Other sklearn parameters are not applicable to the present implementation. Options min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_leaf_nodes, min_impurity_decrease are not applicable since the current implementation trains all trees up to their maximum depth max_depth. Generalization scores are not provided, so oob_score is not supported. The parameters n_jobs and verbose for controlling the fitting process are also not available.
k-nearest neighbours regressor¶
- class crandas.crlearn.neighbors.KNeighborsRegressor(n_neighbors=5, *, weights='uniform', algorithm='auto', p=2, metric='minkowski', metric_weights=None)¶
Bases:
object
Regression based on k-nearest neighbors, with usage similar to the scikit-learn KNeighborsRegressor class.
The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set.
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
- Parameters:
n_neighbors (int, default=5) – Number of neighbors to use.
p (int, default=2) – Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. Currently, integer values between 1 and 5 are supported.
metric_weights (CDataFrame, default=None) –
Weights given to the different columns for the metric. The differences between columns are multiplied by the corresponding factors given in metric_weights. This is equivalent to multiplying all columns by the corresponding weights. None means no extra factors, equivalent to all weights being 1.
Notes
Warning
Regarding the Nearest Neighbors algorithms, if it is found that two neighbors, neighbor k+1 and k, have identical distances but different labels, the results will depend on the ordering of the training data.
- fit(X, y)¶
Fit the k-nearest neighbors regressor from the training dataset.
- Parameters:
X (CDataFrame) – Predictor variables.
y (CDataFrame) – Response variable (should have only 1 column).
- Returns:
self
- Return type:
- predict_value(X, **query_args)¶
Predict the target value for the provided data.
- Parameters:
X (CDataFrame) – Predictor variables. Required to contain a single row.
query_args – See Query Arguments
- Returns:
y – Predicted value.
- Return type:
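How n_neighbors, p and metric_weights interact can be sketched on plaintext data. The code below computes a weighted Minkowski distance from a single query row to every training row and averages the targets of the nearest rows (uniform weights). The function name is hypothetical, and resolving ties by training order is a property of this sketch that mirrors the warning above; it is not a statement about the crandas internals.

```python
def knn_predict_value(train_X, train_y, query, n_neighbors=5, p=2, metric_weights=None):
    """Average the targets of the n_neighbors training rows closest to query."""
    weights = metric_weights if metric_weights is not None else [1.0] * len(query)
    distances = []
    for row, target in zip(train_X, train_y):
        # Column differences are multiplied by the metric weights
        total = sum(abs(w * (a - b)) ** p for w, a, b in zip(weights, row, query))
        distances.append((total ** (1.0 / p), target))
    # sorted() is stable, so rows at equal distance keep their training order
    nearest = sorted(distances, key=lambda pair: pair[0])[:n_neighbors]
    return sum(target for _, target in nearest) / len(nearest)


train_X = [[0.0], [1.0], [2.0], [10.0]]
train_y = [0.0, 1.0, 2.0, 10.0]
print(knn_predict_value(train_X, train_y, [1.2], n_neighbors=2))  # 1.5

# Two rows at the same distance but with different targets: order decides
print(knn_predict_value([[0.0], [2.0]], [0.0, 10.0], [1.0], n_neighbors=1))  # 0.0
print(knn_predict_value([[2.0], [0.0]], [10.0, 0.0], [1.0], n_neighbors=1))  # 10.0
```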