crandas.crlearn

The module crandas.crlearn provides the following functionality:

Machine learning models:

General:

Model interface

class crandas.crlearn.model.CModel(instance=None)

Bases: object

Base class for machine learning models stored at the server

The API for crandas machine learning models is similar to that of scikit-learn estimators. Functions such as .fit() and set_params() are applied in-place to the CModel. Internally, the CModel has a field instance that refers to a stateobject.StateObject at the server and that is updated when such functions are applied.

Similarly to scikit-learn, models have parameters (set by the user) and attributes (set based on fitting the model on data). Typically, at least some of the attributes are encrypted (e.g., fitted model parameters). The encrypted attributes can be retrieved by opening the model using open().

property attributes

Retrieve attributes for the estimator.

Only available for fitted estimators. Encrypted attributes are set to None. To retrieve their values, use open().

classmethod from_opened(params_attributes, **query_args)

Upload model to the server

Should be called on an instance of the final model class, e.g., linear_model.CLinearRegression.

Parameters:

params_attributes (dict) – Parameters and attributes as returned by open()

property handle

Return the handle of the current instance of the model

instance = None

Current model instance (updated by .fit(), set_params(), etc)

open()

Download the model

The model is returned as a dictionary of parameters and attributes.

property params

Retrieve parameters for the estimator.

Note that the parameters can be set using set_params().

save(name=None, **query_args)

Save the (current instance of the) model. See stateobject.StateObject.save().

set_params(**params_and_query_args)

Set parameters of the estimator

Parameters:

params_and_query_args (dict) – Model parameters (see documentation for specific model) or query arguments (see Query Arguments)

Metrics

crandas.crlearn.metrics.classification_accuracy(y, y_pred, n_classes=2, **query_args)

Compute the classification accuracy on class predictions

Parameters:
  • y (CDataFrame) – column with the actual values in range

  • y_pred (CDataFrame) – column with the predictions in range

  • n_classes (int) – number of classes (default = 2)

  • query_args – See Query Arguments

Returns:

fixed point number between 0 and 1

Return type:

CDataFrame

crandas.crlearn.metrics.confusion_matrix(y, y_pred, n_classes=2, **query_args)

Compute the confusion matrix on class predictions

The y-axis of the result represents the true class; the x-axis represents the predicted class.

Parameters:
  • y (CDataFrame) – column with the actual values in range

  • y_pred (CDataFrame) – column with the predictions in range

  • n_classes (int) – number of classes (default = 2)

  • query_args – See Query Arguments

Returns:

matrix of size n_classes * n_classes

Return type:

CDataFrame
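As a plaintext illustration of what these two metrics compute, the sketch below builds a confusion matrix with rows for the true class and columns for the predicted class, then derives the accuracy from its diagonal. It runs on ordinary Python lists rather than on a CDataFrame, and the helper names are hypothetical; it is not the crandas implementation.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """Plaintext reference: rows index the true class, columns the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Accuracy is the diagonal (correct predictions) divided by the total count."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
acc = accuracy_from_matrix(cm)
print(cm)   # [[1, 1, 0], [0, 1, 1], [0, 0, 2]]
print(acc)
```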

crandas.crlearn.metrics.mcfadden_r2(model, X, y, **query_args)

Compute the McFadden R^2 metric

Parameters:
  • model (LogisticModel) – logistic regression model

  • X (CDataFrame) – predictor variables

  • y (CDataFrame) – binary response variable (should have only 1 column)

  • query_args – See Query Arguments

Returns:

fixed point number between 0 and 1

Return type:

CDataFrame

crandas.crlearn.metrics.model_deviance(model, X, y, **query_args)

Compute the model deviance

Parameters:
  • model (LogisticModel) – logistic regression model

  • X (CDataFrame) – predictor variables

  • y (CDataFrame) – binary response variable (should have only 1 column)

  • query_args – See Query Arguments

Returns:

fixed point number between 0 and 1

Return type:

CDataFrame

crandas.crlearn.metrics.null_deviance(y, **query_args)

Compute the null deviance

Parameters:
  • y (CDataFrame) – binary response variable (should have only 1 column)

  • query_args – See Query Arguments

Note: both classes must be present in y; otherwise the computation is undefined internally (logarithm of 0).

Returns:

fixed point number between 0 and 1

Return type:

CDataFrame
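The relation between the null deviance, the model deviance, and the McFadden R^2 can be sketched in plaintext using the standard textbook definitions. The sketch assumes predicted probabilities are already available as a list, and that the null deviance is computed in aggregated form, which is one way a missing class ends in a logarithm of 0. All function names are illustrative and not taken from crandas.

```python
import math


def log_likelihood(y, probabilities):
    """Bernoulli log-likelihood of binary labels under predicted probabilities."""
    return sum(
        math.log(p) if label == 1 else math.log(1.0 - p)
        for label, p in zip(y, probabilities)
    )


def null_deviance(y):
    """Deviance of the intercept-only model, which predicts the base rate everywhere."""
    n_ones = sum(y)
    n_zeros = len(y) - n_ones
    base_rate = n_ones / len(y)
    # If one class is missing, base_rate is 0 or 1 and one of the logs is log(0).
    return -2.0 * (n_ones * math.log(base_rate) + n_zeros * math.log(1.0 - base_rate))


def model_deviance(y, probabilities):
    """Deviance of the fitted model: minus twice its log-likelihood."""
    return -2.0 * log_likelihood(y, probabilities)


def mcfadden_r2(y, probabilities):
    """McFadden R^2: one minus the ratio of model deviance to null deviance."""
    return 1.0 - model_deviance(y, probabilities) / null_deviance(y)


y = [0, 0, 1, 1]
probabilities = [0.2, 0.3, 0.7, 0.9]
print(null_deviance(y), model_deviance(y, probabilities), mcfadden_r2(y, probabilities))
```

In Python, calling null_deviance on a list with only one class raises a ValueError from math.log, which mirrors the undefined case described in the note.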

crandas.crlearn.metrics.precision_recall(y, y_pred, **query_args)

Compute the precision and recall on predictions

Parameters:
  • y (CDataFrame) – column with the actual values (binary)

  • y_pred (CDataFrame) – column with the predictions (binary)

  • query_args – See Query Arguments

Returns:

two fixed point numbers between 0 and 1

Return type:

CDataFrame
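For reference, precision and recall on binary labels follow the usual definitions. The sketch below computes them on plain Python lists; the function name is reused only for readability and the code is not the crandas implementation.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels given as lists of 0/1."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
    false_pos = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)
    false_neg = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)
    precision = true_pos / (true_pos + false_pos)  # share of predicted 1s that are correct
    recall = true_pos / (true_pos + false_neg)     # share of actual 1s that were found
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.75 0.6
```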

crandas.crlearn.metrics.score_r2(y, y_pred, **query_args)

Compute the R^2 metric on predictions

Parameters:
Returns:

fixed point number less than or equal to 1

Return type:

CDataFrame
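The R^2 metric is commonly defined as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal plaintext sketch under that definition (an assumption about the metric, not a description of the crandas internals) shows that the value is at most 1 and can be negative for poor predictions.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((actual - pred) ** 2 for actual, pred in zip(y_true, y_pred))
    ss_tot = sum((actual - mean) ** 2 for actual in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(y_true, [1.0, 2.0, 3.0, 4.0])   # exact predictions
baseline = r2_score(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
reversed_fit = r2_score(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than the mean
print(perfect, baseline, reversed_fit)  # 1.0 0.0 -3.0
```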

crandas.crlearn.metrics.tjur_r2(y, y_pred, **query_args)

Compute the Tjur R^2 metric on predictions

Parameters:
Returns:

fixed point number between -1 and 1

Return type:

CDataFrame
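Tjur's R^2 (the coefficient of discrimination) is usually defined as the mean predicted probability over records with y = 1 minus the mean over records with y = 0. The plaintext sketch below assumes that y_pred holds predicted probabilities, which the text above does not state explicitly.

```python
def tjur_r2(y_true, y_prob):
    """Difference between the mean predicted probability of the two classes."""
    positives = [p for label, p in zip(y_true, y_prob) if label == 1]
    negatives = [p for label, p in zip(y_true, y_prob) if label == 0]
    return sum(positives) / len(positives) - sum(negatives) / len(negatives)


y_true = [0, 0, 1, 1]
y_prob = [0.2, 0.3, 0.7, 0.9]
tjur = tjur_r2(y_true, y_prob)
print(tjur)  # approximately 0.55
```

Under this definition the result lies between -1 and 1, and is negative when the model assigns higher probabilities to the 0 class than to the 1 class.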

Utility

crandas.crlearn.utils.min_max_normalize(table, columns=None, **query_args)

Apply min-max normalization on columns of a table, to get values in [0, 1]

Parameters:
  • table (CDataFrame) – table to normalize

  • columns (list of strings, optional) – columns that should be normalized. If None, all columns will be normalized. The columns that are not specified in this list will remain untouched, by default None

  • query_args – See Query Arguments

Returns:

new table with normalized columns

Return type:

CDataFrame
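Min-max normalization maps each value v of a column to (v - min) / (max - min). The sketch below applies this to a plain dictionary of column lists standing in for a table; it illustrates the arithmetic and the handling of the columns argument, and is not the crandas code path.

```python
def min_max_normalize(table, columns=None):
    """Return a new table (dict of column lists) with the selected columns scaled to [0, 1]."""
    selected = list(table) if columns is None else columns
    result = {}
    for name, values in table.items():
        if name in selected:
            low, high = min(values), max(values)
            # Assumes high > low; a constant column would divide by zero here.
            result[name] = [(v - low) / (high - low) for v in values]
        else:
            result[name] = list(values)  # unselected columns stay untouched
    return result


table = {"age": [20, 30, 40], "income": [1000, 3000, 2000]}
normalized = min_max_normalize(table, columns=["age"])
print(normalized)  # {'age': [0.0, 0.5, 1.0], 'income': [1000, 3000, 2000]}
```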

Linear regression

class crandas.crlearn.linear_model.CLinearRegression(instance=None)

Bases: CModel

Linear ridge regression model corresponding to the scikit-learn Ridge class.

Parameters:

  • alpha: regularization strength (see scikit-learn documentation); defaults to 1.0

Attributes:

  • n_features_in_: number of input features

  • feature_names_in_: input feature names

  • beta_: (encrypted) fitted parameters (intercept and respective feature coefficients)

fit(X, y, **query_args)

Fit a Linear Regression model on the data

Parameters:
Return type:

self

get_beta(**query_args)

Get the fitted parameters (i.e., intercept and feature coefficients) as a table

This function is deprecated; instead, use CModel.open() to open the model, and use the returned beta_ attribute.

predict(X, **query_args)

Make predictions on a dataset using a linear regression model

Note: this returns predictions on the target, not probabilities!

Parameters:
Returns:

table containing the column consisting of the predicted target values

Return type:

CDataFrame

score(X, y, **query_args)

Scores the linear regression model using the R2 metric

Parameters:
Return type:

self

crandas.crlearn.linear_model.LinearRegression(alpha=0.0, *, fit_intercept=True, copy_X=True, n_jobs=None, positive=False, **params_and_query_args)

Create a new linear regression model (CLinearRegression) with given alpha (0.0 by default)

Other parameters are for compatibility with scikit-learn and cannot be overridden.

crandas.crlearn.linear_model.Ridge(alpha=1.0, *, fit_intercept=True, copy_X=True, max_iter=None, tol=None, solver='cholesky', positive=False, random_state=None, **params_and_query_args)

Create a new ridge regression model (CLinearRegression) with given alpha (1.0 by default)

Other parameters are for compatibility with scikit-learn and cannot be overridden.
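The effect of alpha can be illustrated with a one-feature ridge fit in plaintext. The sketch assumes, as in scikit-learn's Ridge with fit_intercept=True, that the intercept is not penalized; whether crandas treats the intercept the same way is not stated here. With alpha = 0 it reduces to ordinary least squares, matching the LinearRegression default, and with alpha = 1 the slope shrinks toward zero.

```python
def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge fit for a single feature; returns (intercept, slope)."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / (s_xx + alpha)  # alpha penalizes the slope only
    intercept = y_mean - slope * x_mean
    return intercept, slope


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
ols = fit_ridge_1d(x, y, alpha=0.0)
ridge = fit_ridge_1d(x, y, alpha=1.0)
print(ols)    # (1.0, 2.0)
print(ridge)  # slope shrinks to 10/6, intercept moves to about 1.5
```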

Logistic regression

class crandas.crlearn.logistic_regression.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=10, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None, classes=[], n_classes=2)

Bases: object

Logistic Regression Classifier Object with the same parameters as the Scikit-learn Logistic Regression Class

See the scikit-learn LogisticRegression documentation for its parameters.

fit(X, y, sample_weight=None, max_iter=None, warm_start=None, **query_args)

Fit a Logistic Regression model on the data

Parameters:
  • X (CDataFrame) – predictor variables

  • y (CDataFrame) – response variable (should have only 1 column); that column should be of integer type

  • sample_weight – array of weights assigned to individual samples (not yet supported)

  • max_iter (int) – deviation from scikit-learn (see the note below)

  • warm_start (bool) – deviation from scikit-learn (see the note below); if True, successive fits continue the approximation from where the previous fit stopped; if False, each successive fit starts from scratch

  • query_args – See Query Arguments

Returns:

self

Return type:

LogisticRegression

Notes

Note

Compared to scikit-learn, fit() takes the additional parameters max_iter and warm_start. Scikit-learn treats max_iter and warm_start as object configuration that is set at construction and cannot be changed. We prefer to give the user the freedom to deviate from the global setting in successive calls to fit().

Instead, the corresponding class attributes are used as default values for each call to fit().
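The pattern described in this note, where constructor settings act as defaults that each call to fit() may override, can be sketched with a hypothetical toy class. TinyEstimator and its iterations_done counter are invented for illustration and do not exist in crandas; the counter stands in for actually running a solver.

```python
class TinyEstimator:
    """Hypothetical estimator illustrating per-call overrides of constructor settings."""

    def __init__(self, max_iter=10, warm_start=False):
        self.max_iter = max_iter
        self.warm_start = warm_start
        self.iterations_done = 0

    def fit(self, max_iter=None, warm_start=None):
        # Fall back to the attributes set at construction when no override is given.
        max_iter = self.max_iter if max_iter is None else max_iter
        warm_start = self.warm_start if warm_start is None else warm_start
        if not warm_start:
            self.iterations_done = 0  # start from scratch
        self.iterations_done += max_iter  # stand-in for running the solver
        return self


estimator = TinyEstimator(max_iter=10)
first = estimator.fit().iterations_done                               # 10
continued = estimator.fit(max_iter=5, warm_start=True).iterations_done  # 15
restarted = estimator.fit().iterations_done                           # 10 again
print(first, continued, restarted)
```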

get_beta(**kwargs)

Get the fitted parameters (i.e. intercept_ and coef_ combined in 1 table named beta).

predict(X, decision_boundary=0.5, **query_args)

Make (binary) predictions on a dataset using a logistic regression model

Note: this returns binary predictions, not probabilities!

Parameters:
  • X (CDataFrame) – predictor variables

  • decision_boundary (float) – number between 0 and 1; records with a probability below this value are classified as 0, greater than or equal to as 1

  • query_args – See Query Arguments

Returns:

column consisting of the predicted classes (0 or 1)

Return type:

CDataFrame
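The role of decision_boundary can be shown on plaintext probabilities: values below the boundary map to class 0, and values greater than or equal to it map to class 1. The helper name classify is hypothetical.

```python
def classify(probabilities, decision_boundary=0.5):
    """Map each probability to 0 (below the boundary) or 1 (at or above it)."""
    return [1 if p >= decision_boundary else 0 for p in probabilities]


probabilities = [0.1, 0.5, 0.49, 0.9]
default_labels = classify(probabilities)
strict_labels = classify(probabilities, decision_boundary=0.8)
print(default_labels)  # [0, 1, 0, 1]
print(strict_labels)   # [0, 0, 0, 1]
```

Raising the boundary trades recall for precision, since fewer records are classified as 1.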

predict_proba(X, **query_args)

Make (probability) predictions on a dataset using a logistic regression model

Note: this returns probabilities, not binary predictions

Parameters:
Returns:

column consisting of the predicted probabilities

Return type:

CDataFrame

class crandas.crlearn.logistic_regression.LogisticRegressionStateObject(reg_type=None, **kwargs)

Bases: StateObject

Random forest classifier

class crandas.crlearn.ensemble.CRandomForestClassifier(instance=None)

Bases: CModel

Random forest classifier

Note

Do not instantiate directly by calling CRandomForestClassifier(...). Instead, use the RandomForestClassifier() method.

Random forest classifier with an interface similar to scikit-learn’s RandomForestClassifier. See https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.

Features can be ordinal (node “val <= T”) or categorical (node “val == T”), as specified when fitting. Output labels are categorical values 0, 1, …, M, where the number M is derived from the column metadata, e.g., use labels.astype({"labels": "int[min=0,max=2]"}) to set M=2.

Configurable parameters:

  • n_estimators: number of trees (integer; default: 10)

  • max_depth: depth (number of layers of internal nodes) per tree (integer; default: 4)

  • bootstrap: whether to use bootstrapping, i.e., training respective trees on respective samples (drawn with replacement) from the input data (boolean; default: True)

  • max_features: number of features to consider per split (integer number/float fraction/ “sqrt”/”log2”/”all”; default: “sqrt”)

  • max_samples: number of samples per tree if bootstrapping is used (integer number/float fraction; default: 0.3)

Other sklearn parameters are either not applicable to the current implementation or cannot be changed from their defaults; see RandomForestClassifier() for details.
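How the max_features and max_samples settings might translate into concrete counts can be sketched as follows. The rounding rules are assumptions borrowed from common practice and may differ from what the server does; the helper names are hypothetical. Bootstrapping draws row indices with replacement, so the same record can appear more than once in the sample for a tree.

```python
import math
import random


def resolve_count(setting, total):
    """Turn an integer count or a float fraction into an absolute count (rounding is an assumption)."""
    if isinstance(setting, float):
        return max(1, round(setting * total))
    return setting


def resolve_max_features(setting, n_features):
    """Interpret "sqrt", "log2", "all", an integer count, or a float fraction."""
    if setting == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if setting == "log2":
        return max(1, int(math.log2(n_features)))
    if setting == "all":
        return n_features
    return resolve_count(setting, n_features)


def bootstrap_indices(n_rows, max_samples, rng):
    """Draw row indices with replacement for one tree."""
    n_draw = resolve_count(max_samples, n_rows)
    return [rng.randrange(n_rows) for _ in range(n_draw)]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
sample = bootstrap_indices(100, 0.3, rng)
print(resolve_max_features("sqrt", 16), len(sample))
```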

Attributes of fitted models:

  • n_features_in: number of input features

  • feature_names_in: input feature names

  • feature_types_in: column types of input feature columns

  • n_classes: number of classes

  • feature_name_out: name of output feature

  • depths: depths of respective trees

  • nodes_featureids: (encrypted) one-hot encoded features selected per internal node

  • nodes_values: (encrypted) threshold values per internal node

  • nodes_modes: (encrypted) mode per internal node: discrete or continuous

  • class_weights: (encrypted) class probabilities per leaf node

Implementation notes:

The implemented random forest classifier uses the same training technique as sklearn, except that trees are always trained up to their maximum depth max_depth (for this reason, options min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_leaf_nodes, min_impurity_decrease are not applicable).

fit(X, y, categorical_features=None, max_categories=None, **query_args)

Build a forest of trees from the training set

The dataset X can contain any mix of ordinal features (nodes val <= T) and categorical features (nodes val == T). Features are considered ordinal unless specified in the list of categorical features. Fitting of categorical features is considerably more efficient but costs more memory, so the number of different categories is limited by the server. This limit can be overridden using the max_categories parameter.

Parameters:
  • X (CDataFrame) – Training data

  • y (CDataFrame) – Target data (should have only 1 column)

  • categorical_features (None or list of str) – Column names of columns that are considered to contain categorical features

  • max_categories (None or int) – Maximum number of categories per categorical feature

  • query_args – See Query Arguments

Return type:

self
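The difference between ordinal and categorical nodes can be illustrated with a small hand-built tree evaluated in plaintext. The tree layout, the feature names, and the convention that a passing test follows the "true" branch are all assumptions made for this sketch; the fitted model stores its nodes in encrypted form and in a different representation.

```python
def node_test(value, threshold, categorical):
    """Ordinal nodes test val <= T; categorical nodes test val == T."""
    return value == threshold if categorical else value <= threshold


# Hypothetical tree of depth 2; leaves hold class probabilities for classes 0 and 1.
tree = {
    "feature": "age", "threshold": 30, "categorical": False,
    "true": {"feature": "color", "threshold": 2, "categorical": True,
             "true": [0.9, 0.1], "false": [0.4, 0.6]},
    "false": {"feature": "color", "threshold": 1, "categorical": True,
              "true": [0.2, 0.8], "false": [0.7, 0.3]},
}


def tree_predict_proba(node, record):
    """Walk from the root to a leaf and return the leaf's class probabilities."""
    while isinstance(node, dict):
        passed = node_test(record[node["feature"]], node["threshold"], node["categorical"])
        node = node["true"] if passed else node["false"]
    return node


def tree_predict(node, record):
    """Return the class with the highest probability at the reached leaf."""
    probabilities = tree_predict_proba(node, record)
    return max(range(len(probabilities)), key=probabilities.__getitem__)


young = {"age": 25, "color": 2}
older = {"age": 45, "color": 1}
print(tree_predict_proba(tree, young), tree_predict(tree, young))  # [0.9, 0.1] 0
print(tree_predict_proba(tree, older), tree_predict(tree, older))  # [0.2, 0.8] 1
```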

open_to_graphs(**query_args)

Open the random forest into a list of pydot.Dot instances

To plot a single tree, you can select a particular index in the returned list, for example:

from IPython.display import Image

graph = model.open_to_graphs()[0]  # pydot.Dot instance for the first tree
graph.write_png("out.png")
Image("out.png")  # display the image in a notebook

predict(X, **query_args)

Predict class for X

Parameters:
Returns:

table with predicted class per input record

Return type:

CDataFrame

predict_proba(X, **query_args)

Predict class probabilities for X

Parameters:
Returns:

table with columns representing predicted class probabilities per input record

Return type:

CDataFrame

crandas.crlearn.ensemble.RandomForestClassifier(n_estimators=10, *, max_depth=4, bootstrap=True, max_features='sqrt', max_samples=0.3, criterion='gini', random_state=None, warm_start=False, class_weight=None, ccp_alpha=0.0, monitonic_cst=0, **query_args)

Create a new random forest classifier (class CRandomForestClassifier) with the given parameters.

The parameters n_estimators, max_depth, bootstrap, max_features, and max_samples can be changed. See CRandomForestClassifier for their meaning.

The parameters criterion, random_state, warm_start, class_weight, ccp_alpha, monotonic_cst have the same meaning as in sklearn but cannot be changed from their defaults.

Other sklearn parameters are not applicable to the present implementation. Options min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_leaf_nodes, min_impurity_decrease are not applicable since the current implementation trains all trees up to their maximum depth max_depth. Generalization scores are not provided, so oob_score is not supported. Also parameters n_jobs and verbose for controlling the fitting process are not available.

k-nearest neighbours regressor

class crandas.crlearn.neighbors.KNeighborsRegressor(n_neighbors=5, *, weights='uniform', algorithm='auto', p=2, metric='minkowski', metric_weights=None)

Bases: object

Regression based on k-nearest neighbors with similar use as the Scikit-learn K-Nearest Regressor Class.

The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set.

https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

Parameters:
  • n_neighbors (int, default=5) – Number of neighbors to use.

  • p (int, default=2) – Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. Currently, integer values between 1 and 5 are supported.

  • metric_weights (CDataFrame, default=None) –

    Weights given to the different columns for the metric. The differences between columns are multiplied by the corresponding factors given in metric_weights. This is equivalent to multiplying all columns by the corresponding weights.

    None means no extra factors, equivalent to all weights being 1.

Notes

Warning

Regarding the Nearest Neighbors algorithms, if it is found that two neighbors, neighbor k+1 and k, have identical distances but different labels, the results will depend on the ordering of the training data.

fit(X, y)

Fit the k-nearest neighbors regressor from the training dataset.

Parameters:
  • X (CDataFrame) – Predictor variables.

  • y (CDataFrame) – Response variable (should have only 1 column).

Returns:

self

Return type:

KNeighborsRegressor

predict_value(X, **query_args)

Predict the target value for the provided data.

Parameters:
Returns:

y – Predicted value.

Return type:

ReturnValue
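A plaintext sketch of k-nearest neighbors regression with a weighted Minkowski metric is given below. It assumes uniform averaging of the neighbor targets and multiplies each column difference by its metric weight before raising it to the power p, as described for metric_weights. The function names are hypothetical. Because Python's sorted() is stable, ties in distance are broken by training order, which reproduces the order dependence mentioned in the warning above.

```python
def weighted_minkowski(a, b, p=2, weights=None):
    """Minkowski distance where each column difference is scaled by its weight."""
    weights = weights or [1.0] * len(a)
    return sum((w * abs(x - y)) ** p for w, x, y in zip(weights, a, b)) ** (1.0 / p)


def knn_predict(train_X, train_y, query, n_neighbors=5, p=2, weights=None):
    """Average the targets of the n_neighbors training rows closest to the query."""
    distances = [weighted_minkowski(row, query, p, weights) for row in train_X]
    # sorted() is stable, so rows at equal distance keep their training order.
    nearest = sorted(range(len(train_X)), key=distances.__getitem__)[:n_neighbors]
    return sum(train_y[i] for i in nearest) / n_neighbors


train_X = [[0.0], [2.0], [4.0], [10.0]]
train_y = [1.0, 3.0, 5.0, 100.0]
query = [1.0]  # equally far from the first two training rows

two_nearest = knn_predict(train_X, train_y, query, n_neighbors=2)
tie_forward = knn_predict(train_X, train_y, query, n_neighbors=1)
tie_reversed = knn_predict(train_X[::-1], train_y[::-1], query, n_neighbors=1)
print(two_nearest, tie_forward, tie_reversed)  # 2.0 1.0 3.0
```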