
You are viewing the documentation for metbit 9.0.0.

metbit.analysis.opls_da

Analysis and models module in metbit 9.0.0.

import metbit.analysis.opls_da

Classes

opls_da

OPLS-DA model

Parameters

X : array-like, shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape (n_samples,)

Target data, where n_samples is the number of samples.

n_components : int, default=2

Number of components to keep.

scale : str, default='par'

Method of scaling: 'par' for Pareto scaling, 'mc' for mean centering, 'uv' for unit variance scaling.

cv : int, default=5

Number of cross-validation folds.

n_permutations : int, default=1000

Number of permutations for the permutation test.

random_state : int, default=42

Random state for the permutation test.

kfold : int, default=3

Number of cross-validation folds.
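The three scaling options named in the parameter list differ only in the divisor applied after mean centering. The following is a minimal NumPy sketch of the conventional definitions, not metbit's own implementation. The function name scale_matrix and the demo data are hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption.

```python
import numpy as np

def scale_matrix(X, method="par"):
    """Column-wise scaling (illustrative only, not metbit's code)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)        # mean centering is shared by all three
    if method == "mc":
        return centered
    std = X.std(axis=0, ddof=1)          # assumption: sample standard deviation
    if method == "uv":
        return centered / std            # unit variance: divide by std
    if method == "par":
        return centered / np.sqrt(std)   # Pareto: divide by sqrt(std)
    raise ValueError(f"unknown scaling method: {method!r}")

# Hypothetical data: three features with very different spreads
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(200, 3))

mc_scaled = scale_matrix(X_demo, "mc")
uv_scaled = scale_matrix(X_demo, "uv")
par_scaled = scale_matrix(X_demo, "par")
```

After unit variance scaling every column has standard deviation 1, while Pareto scaling leaves each column with a standard deviation equal to the square root of its original one, so large features are damped without being fully equalized. A zero-variance column would cause a division by zero in this sketch.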

Examples:

## Import packages
from metbit import opls_da, pca
import pandas as pd
import numpy as np

## Load dataset (here a randomly generated example)
data = pd.DataFrame(np.random.rand(500, 50000))
class_ = pd.Series(np.random.choice(['A', 'B'], 500), name='Group')
# Illustrative time-point labels, so that the 'Time point' column used below exists
time_ = pd.Series(np.random.choice(['T0', 'T1'], 500), name='Time point')

datasets = pd.concat([class_, time_, data], axis=1)

# Assign X and target (the first two columns hold the metadata)
X = datasets.iloc[:, 2:]
y = datasets['Group']
time = datasets['Time point']
features_name = list(X.columns.astype(float))

## Fit the OPLS-DA model

opls_da_model = opls_da(X=X, y=y, features_name=features_name, n_components=2, scaling_method='pareto', kfold=3, estimator='opls', random_state=42)

opls_da_model.fit()

opls_da_model.permutation_test(n_permutations=1000, cv=3, n_jobs=-1, verbose=10)

opls_da_model.vip_scores()

## Visualisation of the OPLS-DA model

opls_da_model.plot_oplsda_scores()

opls_da_model.vip_plot()

opls_da_model.plot_hist()

opls_da_model.plot_s_scores()

opls_da_model.plot_loading()

Methods

__init__(self, X: Union[pd.DataFrame, np.ndarray], y: Union[pd.Series, np.ndarray, List[Any]], features_name: Optional[Union[pd.Series, np.ndarray, List[Any]]]=None, n_components: int=2, scaling_method: str='pareto', kfold: int=3, estimator: str='opls', random_state: int=42, auto_ncomp: bool=True, dtype: Optional[type]=None)

Purpose: Initializes the model, validates the inputs, and preprocesses the data for further analysis.

Parameters:

X : array-like (Data matrix)

The input features for the model. Must be a 2D array-like structure, such as pandas.DataFrame or numpy.ndarray.

y : array-like (Target vector)

The target variable or class labels associated with the rows of X (can be a pandas.Series, numpy.ndarray, or list).

features_name : array-like, optional (default=None)

List of feature names corresponding to the columns of X. If not provided, column names will be generated or inferred.

n_components : int, optional (default=2)

The number of components to use for OPLS-DA analysis. Must be a positive integer.

scaling_method : str, optional (default='pareto')

The method used for scaling the data. Options are 'pareto' (Pareto scaling, power 0.5), 'mean' (mean centering), 'uv' (unit variance scaling, i.e. standardization), and 'minmax' (min-max scaling).

kfold : int, optional (default=3)

The number of folds for cross-validation.

estimator : str, optional (default='opls')

The estimator method used for modeling. Default is 'opls' (Orthogonal Projections to Latent Structures).

random_state : int, optional (default=42)

The random seed used for reproducibility.

auto_ncomp : bool, optional (default=True)

If True, automatically selects the optimal number of components for the model. If False, the number of components is set manually.

dtype : numpy dtype, optional (default=None)

Storage dtype for the feature matrix. None (the default) uses float64. For cohorts with more than 10,000 samples or more than 100,000 features, pass numpy.float32 to roughly halve peak memory usage. float32 carries about 7 significant decimal digits and metabolomics data rarely requires more than 6, so float32 is usually safe in practice.

Raises:

ValueError : If any input is invalid, such as mismatched dimensions or incorrect data types.
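To see what the dtype parameter trades off, the storage cost of a dense feature matrix can be worked out directly. This is a back-of-the-envelope sketch: the helper matrix_bytes and the 10,000 x 100,000 cohort size are hypothetical, and it counts a single copy of the matrix only, ignoring any intermediates a model would allocate.

```python
import numpy as np

def matrix_bytes(n_samples, n_features, dtype):
    """Bytes needed to store one dense n_samples x n_features matrix."""
    return n_samples * n_features * np.dtype(dtype).itemsize

# Hypothetical large cohort
n_samples, n_features = 10_000, 100_000
bytes64 = matrix_bytes(n_samples, n_features, np.float64)
bytes32 = matrix_bytes(n_samples, n_features, np.float32)

print(bytes64 / 1e9, "GB as float64")   # 8.0 GB as float64
print(bytes32 / 1e9, "GB as float32")   # 4.0 GB as float32

# Decimal digits NumPy reports as reliably representable in float32
float32_digits = np.finfo(np.float32).precision
print(float32_digits)                   # 6
```

Each float32 value takes 4 bytes instead of 8, which is where the factor of two comes from.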

fit(self)

Purpose: Fits the OPLS-DA model to the data and computes model performance metrics.

Parameters: None

Returns:

Prints a summary of the model fit, including sample size and class distribution, number of features and components, the scaling method used, and the R2 and Q2 metrics.
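R2 and Q2 are conventionally the same statistic computed on different predictions: R2 uses predictions for the samples the model was fitted on, and Q2 uses predictions for samples held out during cross-validation. The sketch below illustrates that convention with made-up numbers. It is not taken from metbit, and the 0/1 coding of the class labels is an assumption.

```python
def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Class labels coded as 0/1 (assumption made for this example)
y = [0, 0, 0, 1, 1, 1]
fitted = [0.1, 0.0, 0.2, 0.9, 1.0, 0.8]     # hypothetical predictions on training samples
held_out = [0.3, 0.1, 0.4, 0.7, 0.9, 0.6]   # hypothetical predictions from held-out folds

r2y = r_squared(y, fitted)     # goodness of fit
q2 = r_squared(y, held_out)    # goodness of prediction
print(round(r2y, 3), round(q2, 3))
```

A Q2 far below R2 is the usual warning sign of overfitting, which is why both are reported together.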

get_oplsda_scores(self)

Get OPLS-DA scores

get_s_scores(self)

Get S scores

get_oplsda_model(self)

Get OPLS-DA model

get_cv_model(self)

Get cross-validation model

permutation_test(self, n_permutations: int=500, cv: int=3, n_jobs: int=-1, verbose: int=10)

get_permutation_scores(self)

Get permutation scores

vip_scores(self, model: Optional[Any]=None, features_name: Optional[Union[pd.Series, np.ndarray, List[Any]]]=None)

Get VIP score

Parameters

model : object, default=None

OPLS-DA model.

features_name : array-like, shape (n_features,), default=None

Names of the features.

get_vip_scores(self, filter_: bool=False, threshold: float=1)

Get VIP score

Parameters

filter_ : bool, default=False

If True, filter VIP scores based on the threshold.

threshold : float, default=1

Threshold of VIP score.
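The effect of filter_ and threshold can be pictured as a simple cut on a table of scores. The sketch below uses a plain dictionary and a hypothetical helper filter_vip. Whether metbit treats the boundary as strict (>) or inclusive (>=) is not stated here, so the strict comparison is an assumption.

```python
def filter_vip(vip_by_feature, threshold=1.0):
    """Keep features whose VIP score is strictly above the threshold."""
    return {name: vip for name, vip in vip_by_feature.items() if vip > threshold}

# Hypothetical VIP scores keyed by chemical shift (ppm)
vip_demo = {1.33: 2.41, 3.05: 0.62, 3.22: 1.08, 7.36: 1.00}

kept = filter_vip(vip_demo, threshold=1.0)
print(kept)  # {1.33: 2.41, 3.22: 1.08}
```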

vip_plot(self, x_range: int=9, threshold: float=2, marker_size: int=12, fig_width=1000, fig_height=500, filter_=False, vip_transform=True, font_size=20, title_font_size=20, xaxis_direction: str='reversed')

Purpose: Plots the VIP scores for the features, allowing for customization and thresholding.

Parameters:

x_range : int, optional (default=9)

The range of the x-axis in the plot.

threshold : float, optional (default=2)

The threshold used to categorize VIP scores into "High" or "Low" importance.

marker_size : int, optional (default=12)

The size of the markers in the scatter plot.

fig_width : int, optional (default=1000)

The width of the plot.

fig_height : int, optional (default=500)

The height of the plot.

filter_ : bool, optional (default=False)

If True, only features with VIP scores above the threshold are plotted.

vip_transform : bool, optional (default=True)

If True, the VIP scores are transformed to reflect their contribution to the model.

font_size : int, optional (default=20)

The font size for labels.

title_font_size : int, optional (default=20)

The font size for the plot title.

xaxis_direction : str, optional (default='reversed')

The direction of the x-axis ('reversed' or 'normal').

Returns:

fig : Plotly figure

A Plotly scatter plot visualizing the VIP scores for the features.

plot_oplsda_scores(self, x_='t_scores', y_='t_ortho', color_=None, color_dict=None, symbol_=None, symbol_dict=None, fig_height=900, fig_width=1300, marker_size=35, marker_opacity=0.7, marker_label=None, font_size=20, title_font_size=21, legend_name=['Group', 'Time point'], individual_ellipse=True)

Plot OPLS-DA scores plot

Parameters

color_ : array-like, shape (n_samples,), default=None

Color of the group for each sample. If None, the color is based on the group in y.

color_dict : dict, default=None

Dictionary mapping each group to a color. If None, the color is based on the group in y.

symbol_ : array-like, shape (n_samples,), default=None

Symbol of the group for each sample. If None, the symbol is based on the group in y.

symbol_dict : dict, default=None

Dictionary mapping each group to a symbol. If None, the symbol is based on the group in y.

fig_height : int, default=900

Height of the figure.

fig_width : int, default=1300

Width of the figure.

marker_size : int, default=35

Size of the marker.

marker_opacity : float, default=0.7

Opacity of the marker.

Opacity of the marker.

Purpose: This function generates an OPLS-DA scores plot, showing how samples are positioned along the predictive component (t_scores) and the orthogonal component (t_ortho).

Parameters:

x_, y_ : The names of the columns in the DataFrame (df_opls_scores) that contain the scores for the x and y axes (default to 't_scores' and 't_ortho').

color_ : An array that assigns a color to each sample (optional).

color_dict : A dictionary of color mappings for the groups (optional).

symbol_ : An array of symbols for the samples (optional).

fig_height, fig_width : Dimensions for the plot.

marker_size, marker_opacity : Control the appearance of the markers.

legend_name : Custom labels for the legend.

individual_ellipse : Whether to add ellipses to individual groups (default True).

Plot Details: Uses Plotly (px.scatter) to create an interactive scatter plot. Can display confidence ellipses around each group. Adds annotations for R2X, R2Y, and Q2 statistics. The plot is highly customizable (marker size, opacity, labels, colors, etc.).
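One common way to draw a confidence ellipse around a group of scores is to scale the group's covariance ellipse by a chi-square quantile. The sketch below shows that approach with NumPy. How metbit actually constructs its ellipses is not described here (Hotelling's T2 is another frequent choice), so the function confidence_ellipse, the 95% level, and the bivariate-normal assumption are all illustrative.

```python
import math
import numpy as np

def confidence_ellipse(x, y, level=0.95, n_points=100):
    """Return (ex, ey) boundary points of a covariance ellipse for 2-D scores."""
    pts = np.column_stack([x, y])
    center = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)
    # The chi-square quantile with 2 degrees of freedom has a closed form
    chi2_q = -2.0 * math.log(1.0 - level)
    eigvals, eigvecs = np.linalg.eigh(cov)
    theta = np.linspace(0.0, 2.0 * np.pi, n_points)
    circle = np.stack([np.cos(theta), np.sin(theta)])   # shape (2, n_points)
    # Stretch the unit circle along the principal axes, then rotate and shift
    boundary = eigvecs @ (np.sqrt(chi2_q * eigvals)[:, None] * circle)
    return boundary[0] + center[0], boundary[1] + center[1]

# Hypothetical scores for one group
rng = np.random.default_rng(1)
t_scores = rng.normal(0.0, 2.0, size=300)
t_ortho = rng.normal(0.0, 1.0, size=300)

ex, ey = confidence_ellipse(t_scores, t_ortho)
```

The returned arrays can be passed to any line-drawing call as an overlay trace. Calling the function once per group corresponds to drawing an individual ellipse for each class.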

plot_hist(self, nbins_: int=50, fig_height: int=500, fig_width: int=1000, font_size: int=14, title_font_size: int=20)

Plot histogram of permutation scores

Parameters

nbins_ : int, default=50

Number of bins for histogram.

fig_height : int, default=500

Height of the figure.

fig_width : int, default=1000

Width of the figure.

Purpose: This function creates a histogram of permutation scores from a permutation test, commonly used to evaluate model stability and significance.

Parameters:

nbins_ : Number of bins in the histogram.

fig_height, fig_width : Dimensions of the plot.

font_size, title_font_size : Font size for labels and title.

Plot Details: Plots the permutation scores as a histogram with Plotly. Marks the actual model accuracy score with a red dashed line. Adds additional annotations for the number of permutations, the accuracy score, and the p-value.
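The p-value annotated on such a histogram is typically the share of permuted-label scores that match or beat the score obtained with the true labels. The sketch below uses the same formula as scikit-learn's permutation_test_score. Whether metbit computes it identically is an assumption, and the helper name and scores are made up.

```python
def permutation_p_value(observed_score, permutation_scores):
    """Fraction of permuted-label scores at least as good as the observed one."""
    n_as_good = sum(1 for s in permutation_scores if s >= observed_score)
    # The +1 terms count the true labelling itself as one permutation
    return (n_as_good + 1) / (len(permutation_scores) + 1)

# Hypothetical accuracies from 9 label permutations, and the true-label accuracy
perm_scores = [0.48, 0.52, 0.55, 0.47, 0.61, 0.50, 0.44, 0.53, 0.49]
p_value = permutation_p_value(0.90, perm_scores)
print(p_value)  # 0.1
```

Because of the +1 terms, the smallest p-value attainable with n permutations is 1 / (n + 1), so the number of permutations sets the resolution of the test.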

plot_s_scores(self, fig_height=900, fig_width=2000, range_color_=[-0.05, 0.05], color_continuous_scale_='jet', marker_size=14, font_size=20, title_font_size=20)

Plot S-plot

Parameters

fig_height : int, default=900

Height of the figure.

fig_width : int, default=2000

Width of the figure.

range_color_ : list, default=[-0.05, 0.05]

Color range for the plot.

color_continuous_scale_ : str, default='jet'

Color scale for the plot.

Purpose: This function generates an S-plot, a scatter plot of each feature's covariance against its correlation with the model scores.

Parameters:

fig_height, fig_width : Dimensions of the plot.

range_color_ : Range of colors to display.

color_continuous_scale_ : Color scale for the plot.

marker_size, font_size, title_font_size : Customize marker size and font sizes.

Plot Details: The plot visualizes the relationship between covariance and correlation for features. Uses Plotly's scatter plot to create an interactive S-plot. The axes are customizable, and the plot is set to be visually clean (e.g., axes lines and tick marks).
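In the usual S-plot construction, each feature gets two coordinates: its covariance with the predictive score vector (magnitude of contribution) and its correlation with that vector (reliability). The sketch below computes both with NumPy under that textbook definition. It is not metbit's code, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def s_plot_coordinates(X, t):
    """Covariance and correlation of each column of X with the score vector t."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    Xc = X - X.mean(axis=0)
    tc = t - t.mean()
    n = X.shape[0]
    cov = tc @ Xc / (n - 1)                                   # one value per feature
    corr = cov / (tc.std(ddof=1) * Xc.std(axis=0, ddof=1))    # scale-free version
    return cov, corr

# Hypothetical scores and three synthetic features
rng = np.random.default_rng(2)
t = rng.normal(size=50)
X = np.column_stack([
    3.0 * t,                                 # large and perfectly correlated
    0.1 * t + 0.01 * rng.normal(size=50),    # small but reliable
    rng.normal(size=50),                     # unrelated noise
])

cov, corr = s_plot_coordinates(X, t)
```

The second feature illustrates why both axes are plotted: its covariance is small because its amplitude is small, yet its correlation with the scores is close to 1.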

plot_loading(self, fig_height=900, fig_width=2000, range_color_=[-0.05, 0.05], color_continuous_scale_='jet', marker_size=5, font_size=20, title_font_size=20, xaxis_direction='reversed', xaxis_title='𝛿<sub>H</sub> in ppm')

Plot loading plot

Parameters

fig_height : int, default=900

Height of the figure.

fig_width : int, default=2000

Width of the figure.

range_color_ : list, default=[-0.05, 0.05]

Color range for the plot.

color_continuous_scale_ : str, default='jet'

Color scale for the plot.

Purpose: This function generates a loading plot, typically used in multivariate analysis to visualize the relationship between features and the scores.

Parameters:

fig_height, fig_width : Dimensions of the plot.

range_color_ : Color range to represent the covariance values.

color_continuous_scale_ : Color scale used for continuous color mapping.

marker_size, font_size, title_font_size : Customize the appearance of the markers and fonts.

xaxis_direction : Set the direction for the x-axis (e.g., reversed or not).

xaxis_title : Title for the x-axis.

Plot Details: The loading plot shows the relationship between features (usually a set of variables or compounds) and the model scores. The correlation for each feature is displayed alongside its covariance, with colors representing the covariance values. The plot is interactive and customizable (e.g., marker size, color scale, axis settings).

Source

metbit/analysis/opls_da.py at v9.0.0