metbit.analysis.opls_da
Analysis and models module in metbit 9.0.0.
import metbit.analysis.opls_da
Classes
opls_da
OPLS-DA model
Parameters
X : array-like, shape (n_samples, n_features)
    Training data, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape (n_samples,)
    Target data, where n_samples is the number of samples.
n_components : int, default=2
    Number of components to keep.
scale : str, default='par'
    Method of scaling. 'par' for Pareto scaling, 'mc' for mean centering, 'uv' for unit variance scaling.
cv : int, default=5
    Number of cross-validation folds.
n_permutations : int, default=1000
    Number of permutations for the permutation test.
random_state : int, default=42
    Random state for the permutation test.
kfold : int, default=3
    Number of cross-validation folds.
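The scaling options named above can be illustrated with a minimal NumPy sketch. It assumes the textbook definitions (mean centering, then division by the standard deviation for unit variance scaling or by its square root for Pareto scaling). The helper `scale_matrix` is hypothetical and is not part of metbit, whose internal implementation may differ.

```python
import numpy as np

def scale_matrix(X, method="pareto"):
    """Column-wise scaling of a samples-by-features matrix (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)        # mean centering is common to all three methods
    if method == "mean":
        return centered
    std = X.std(axis=0, ddof=1)          # sample standard deviation per feature
    std[std == 0] = 1.0                  # leave constant features unscaled
    if method == "uv":
        return centered / std            # unit variance: every feature ends with std 1
    if method == "pareto":
        return centered / np.sqrt(std)   # Pareto: large features are damped, not equalised
    raise ValueError(f"unknown scaling method: {method!r}")

# Five features whose magnitudes differ by orders of magnitude
rng = np.random.default_rng(42)
X_demo = rng.random((20, 5)) * np.array([1, 10, 100, 1000, 10000])
X_mean = scale_matrix(X_demo, "mean")
X_uv = scale_matrix(X_demo, "uv")
X_par = scale_matrix(X_demo, "pareto")
```

After Pareto scaling each feature keeps a standard deviation equal to the square root of its original one, so intense signals still carry more weight than noise-level ones, but less than they would with mean centering alone.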
Examples:
## Import package into Python
from metbit import opls_da, pca
import pandas as pd
import numpy as np

## Load dataset. This example dataset is randomly generated.
data = pd.DataFrame(np.random.rand(500, 50000))
class_ = pd.Series(np.random.choice(['A', 'B'], 500), name='Group')
datasets = pd.concat([class_, data], axis=1)

# Assign X and target
X = datasets.iloc[:, 1:]  # column 0 is 'Group'; the features start at column 1
y = datasets['Group']
# time = datasets['Time point']  # only if the dataset has a 'Time point' column
features_name = list(X.columns.astype(float))
## Fit the OPLS-DA model
opls_da_model = opls_da(X=X, y=y, features_name=features_name, n_components=2, scaling_method='pareto', kfold=3, estimator='opls', random_state=42)
opls_da_model.fit()
opls_da_model.permutation_test(n_permutations=1000, cv=3, n_jobs=-1, verbose=10)
opls_da_model.vip_scores()
## Visualisation of the OPLS-DA model
opls_da_model.plot_oplsda_scores()
opls_da_model.vip_plot()
opls_da_model.plot_hist()
opls_da_model.plot_s_scores()
opls_da_model.plot_loading()
Methods
__init__(self, X: Union[pd.DataFrame, np.ndarray], y: Union[pd.Series, np.ndarray, List[Any]], features_name: Optional[Union[pd.Series, np.ndarray, List[Any]]]=None, n_components: int=2, scaling_method: str='pareto', kfold: int=3, estimator: str='opls', random_state: int=42, auto_ncomp: bool=True, dtype: Optional[type]=None)
Purpose: Initializes the model, validates the inputs, and preprocesses the data for further analysis.
Parameters:
X : array-like (Data matrix)
    The input features for the model (must be a 2D array-like structure, such as pandas.DataFrame or numpy.ndarray).
y : array-like (Target vector)
    The target variable or class labels associated with the rows of X (can be a pandas.Series, numpy.ndarray, or list).
features_name : array-like, optional (default=None)
    List of feature names corresponding to the columns of X. If not provided, column names will be generated or inferred.
n_components : int, optional (default=2)
    The number of components to use for OPLS-DA analysis. Must be a positive integer.
scaling_method : str, optional (default='pareto')
    The method used for scaling the data. Options include:
    'pareto': Pareto scaling (power 0.5)
    'mean': Mean-centered data
    'uv': Unit variance scaling (standardization)
    'minmax': Min-max scaling
kfold : int, optional (default=3)
    The number of folds for cross-validation.
estimator : str, optional (default='opls')
    The estimator method used for modeling. Default is 'opls' (Orthogonal Projections to Latent Structures).
random_state : int, optional (default=42)
    The random seed used for reproducibility.
auto_ncomp : bool, optional (default=True)
    If True, automatically selects the optimal number of components for the model. If False, the number of components is set manually.
dtype : numpy dtype, optional (default=None)
    Storage dtype for the feature matrix. None uses float64 (default). For cohorts with more than 10,000 samples or more than 100,000 features, pass numpy.float32 to halve peak memory usage. Metabolomics data rarely need more than 6 significant figures, and float32 preserves at least 6 significant decimal digits, so float32 is usually safe in practice.
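To make the dtype trade-off concrete, the short sketch below compares the storage needed for one feature matrix in float64 and float32, together with the decimal precision NumPy reports for each. `matrix_bytes` is a hypothetical helper. It counts only the matrix itself, so actual peak usage inside metbit (temporary copies, cross-validation splits) is likely to be higher.

```python
import numpy as np

def matrix_bytes(n_samples, n_features, dtype):
    """Bytes needed to store one dense samples-by-features matrix (hypothetical helper)."""
    return n_samples * n_features * np.dtype(dtype).itemsize

n_samples, n_features = 500, 50_000   # shape of the randomly generated example dataset
for dtype in (np.float64, np.float32):
    size_mb = matrix_bytes(n_samples, n_features, dtype) / 1e6
    digits = np.finfo(dtype).precision   # decimal digits the type is guaranteed to preserve
    print(f"{np.dtype(dtype).name}: {size_mb:.0f} MB, at least {digits} decimal digits")
```

For this shape the float64 matrix takes 200 MB and the float32 matrix 100 MB.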
Raises:
ValueError
    If any input is invalid, such as mismatched dimensions or incorrect data types.
fit(self)
Purpose: Fits the OPLS-DA model to the data and computes model performance metrics.
Parameters: None
Returns:
Prints a summary of the model fit, including sample size and class distributions, the number of features and components, the scaling method used, and the R2 and Q2 metrics.
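As a reminder of what the two reported metrics measure, here is a minimal sketch using the conventional definitions: R2 compares the response with predictions from the model fitted on all samples, while Q2 applies the same formula to predictions made for held-out samples during cross-validation. The function `explained_fraction` and the toy numbers are illustrative assumptions. This is not metbit's code, and its exact computation may differ.

```python
import numpy as np

def explained_fraction(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only: two classes coded as 0 and 1
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
y_fitted = np.array([0.1, 0.0, 0.2, 0.9, 1.0, 0.8])   # predictions from the full model
y_cv = np.array([0.3, 0.1, 0.4, 0.7, 0.9, 0.5])       # predictions for held-out samples

r2y = explained_fraction(y, y_fitted)   # goodness of fit
q2 = explained_fraction(y, y_cv)        # goodness of prediction
```

A Q2 far below R2 usually indicates overfitting, which is why the two are reported side by side.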
get_oplsda_scores(self)
Get OPLS-DA scores
get_s_scores(self)
Get S scores
get_oplsda_model(self)
Get OPLS-DA model
get_cv_model(self)
Get cross-validation model
permutation_test(self, n_permutations: int=500, cv: int=3, n_jobs: int=-1, verbose: int=10)
get_permutation_scores(self)
Get permutation scores
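The idea behind a permutation test can be sketched without metbit. The observed statistic is compared with the same statistic recomputed after shuffling the class labels, and the empirical p-value is the share of shuffles that do at least as well. The sketch below uses the difference in class means as a stand-in statistic. Both helper functions are hypothetical, and metbit's own test may use a different score (for example cross-validated accuracy).

```python
import numpy as np

def permutation_p_value(observed, permuted):
    """Empirical p-value with the usual +1 correction."""
    permuted = np.asarray(permuted, dtype=float)
    return (np.sum(permuted >= observed) + 1) / (permuted.size + 1)

def class_separation(scores, labels):
    """Stand-in statistic: absolute difference between the two class means."""
    return abs(scores[labels == 1].mean() - scores[labels == 0].mean())

rng = np.random.default_rng(42)
labels = np.repeat([0, 1], 20)
scores = rng.normal(loc=labels * 5.0, scale=1.0)   # two well-separated groups

observed = class_separation(scores, labels)
permuted = [class_separation(scores, rng.permutation(labels)) for _ in range(199)]
p_value = permutation_p_value(observed, permuted)
```

With n permutations the smallest attainable p-value is 1 / (n + 1), so the number of permutations sets the resolution of the test.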
vip_scores(self, model: Optional[Any]=None, features_name: Optional[Union[pd.Series, np.ndarray, List[Any]]]=None)
Get VIP score
Parameters
model : object, default=None
    OPLS-DA model.
features_name : array-like, shape (n_features,), default=None
    Names of the features.
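For orientation, the commonly used VIP (variable importance in projection) formula for a PLS-type model can be written as a short NumPy function. This is a sketch of the textbook definition, assuming access to a weight matrix, a score matrix, and the y-loadings. `vip_from_pls` and its arguments are hypothetical names, and the way metbit derives VIP from its OPLS model is not shown here.

```python
import numpy as np

def vip_from_pls(weights, scores, y_loadings):
    """Textbook VIP for a single-response PLS-type model (hypothetical helper)."""
    W = np.asarray(weights, dtype=float)              # (n_features, n_components)
    T = np.asarray(scores, dtype=float)               # (n_samples, n_components)
    q = np.asarray(y_loadings, dtype=float).ravel()   # (n_components,)
    n_features = W.shape[0]
    ss = q ** 2 * np.sum(T ** 2, axis=0)              # y variation explained per component
    W_unit = W / np.linalg.norm(W, axis=0, keepdims=True)
    return np.sqrt(n_features * (W_unit ** 2 @ ss) / ss.sum())

# Random inputs, used only to exercise the formula
rng = np.random.default_rng(0)
vip = vip_from_pls(rng.normal(size=(8, 2)), rng.normal(size=(30, 2)), rng.normal(size=2))
```

By construction the mean of the squared VIP values equals 1, which is why 1 is a common cut-off for calling a feature important.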
get_vip_scores(self, filter_: bool=False, threshold: float=1)
Get VIP score
Parameters
filter_ : bool, default=False
    If True, filter VIP scores based on the threshold.
threshold : float, default=1
    Threshold of the VIP score.
vip_plot(self, x_range: int=9, threshold: float=2, marker_size: int=12, fig_width=1000, fig_height=500, filter_=False, vip_transform=True, font_size=20, title_font_size=20, xaxis_direction: str='reversed')
Purpose: Plots the VIP scores for the features, allowing for customization and thresholding.
Parameters:
x_range : int, optional (default=9)
    The range of the x-axis in the plot.
threshold : float, optional (default=2)
    The threshold used to categorize VIP scores into "High" or "Low" importance.
marker_size : int, optional (default=12)
    The size of the markers in the scatter plot.
fig_width : int, optional (default=1000)
    The width of the plot.
fig_height : int, optional (default=500)
    The height of the plot.
filter_ : bool, optional (default=False)
    If True, only features with VIP scores above the threshold are plotted.
vip_transform : bool, optional (default=True)
    If True, the VIP scores are transformed to reflect their contribution to the model.
font_size : int, optional (default=20)
    The font size for labels.
title_font_size : int, optional (default=20)
    The font size for the plot title.
xaxis_direction : str, optional (default='reversed')
    The direction of the x-axis ('reversed' or 'normal').
Returns:
fig : Plotly figure
    A Plotly scatter plot visualizing the VIP scores for the features.
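How a VIP threshold separates features can be shown with a few lines of NumPy. The feature names and VIP values below are made up for illustration, and the strict `>` comparison is an assumption, since the source does not say whether a score equal to the threshold counts as "High".

```python
import numpy as np

feature_names = np.array(["0.90", "1.33", "3.05", "7.26"])   # hypothetical chemical shifts (ppm)
vip = np.array([0.4, 1.7, 1.0, 2.3])                          # hypothetical VIP scores
threshold = 1.0

keep = vip > threshold                            # boolean mask of features above the threshold
importance = np.where(keep, "High", "Low")        # label for every feature
filtered = dict(zip(feature_names[keep], vip[keep]))   # only the features that pass
```

Labelling keeps every feature and marks its category, whereas filtering drops the features at or below the threshold.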
plot_oplsda_scores(self, x_='t_scores', y_='t_ortho', color_=None, color_dict=None, symbol_=None, symbol_dict=None, fig_height=900, fig_width=1300, marker_size=35, marker_opacity=0.7, marker_label=None, font_size=20, title_font_size=21, legend_name=['Group', 'Time point'], individual_ellipse=True)
Plot OPLS-DA scores plot
Parameters
color_ : array-like, shape (n_samples,), default=None
    Values used to color the samples. If None, colors are based on the groups in y.
color_dict : dict, default=None
    Dictionary mapping each group to a color. If None, colors are based on the groups in y.
symbol_ : array-like, shape (n_samples,), default=None
    Values used to choose the marker symbol of each sample. If None, symbols are based on the groups in y.
symbol_dict : dict, default=None
    Dictionary mapping each group to a marker symbol. If None, symbols are based on the groups in y.
fig_height : int, default=900
    Height of the figure.
fig_width : int, default=1300
    Width of the figure.
marker_size : int, default=35
    Size of the markers.
marker_opacity : float, default=0.7
    Opacity of the markers.
Purpose: This function generates an OPLS-DA (Orthogonal Projections to Latent Structures Discriminant Analysis) scores plot, showing how samples are positioned by their scores on the predictive component (t_scores) and the orthogonal component (t_ortho).
Parameters:
x_, y_
    The names of the columns in the DataFrame (df_opls_scores) that contain the scores for the x and y axes (default to 't_scores' and 't_ortho').
color_
    An array that assigns a color to each sample (optional).
color_dict
    A dictionary of color mappings for the groups (optional).
symbol_
    An array of symbols for the samples (optional).
fig_height, fig_width
    Dimensions of the plot.
marker_size, marker_opacity
    Control the appearance of the markers.
legend_name
    Custom labels for the legend.
individual_ellipse
    Whether to draw a separate ellipse for each group (default True).
Plot Details: Uses Plotly (px.scatter) to create an interactive scatter plot. Can display confidence ellipses around each group. Adds annotations for the R2X, R2Y, and Q2 statistics. The plot is highly customizable (marker size, opacity, labels, colors, etc.).
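One common way to draw a confidence ellipse around a group of two-dimensional scores is sketched below. It assumes a 95% ellipse derived from the group's sample covariance and the chi-square quantile with two degrees of freedom. The source does not state which construction metbit uses (a Hotelling's T2 ellipse is another frequent choice), so `confidence_ellipse` is purely illustrative.

```python
import numpy as np

def confidence_ellipse(points, level=0.95, n_vertices=100):
    """Outline of a covariance-based confidence ellipse for 2D points (illustrative)."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    chi2_q = -2.0 * np.log(1.0 - level)        # chi-square quantile, exact for 2 degrees of freedom
    eigval, eigvec = np.linalg.eigh(cov)       # principal axes of the ellipse
    theta = np.linspace(0.0, 2.0 * np.pi, n_vertices)
    circle = np.stack([np.cos(theta), np.sin(theta)])   # unit circle, shape (2, n_vertices)
    outline = center + (eigvec @ (np.sqrt(chi2_q * eigval)[:, None] * circle)).T
    return center, cov, chi2_q, outline

# Simulated scores for one group (predictive vs orthogonal component)
rng = np.random.default_rng(1)
group_scores = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.5], [1.5, 1.0]], size=2000)
center, cov, chi2_q, outline = confidence_ellipse(group_scores)

# Fraction of samples that fall inside the ellipse (squared Mahalanobis distance test)
diff = group_scores - center
d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
inside = np.mean(d2 <= chi2_q)
```

The `outline` array can be passed to any plotting library as x and y coordinates. Calling the function once per group gives one ellipse per group, while calling it on all samples gives a single shared ellipse.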
plot_hist(self, nbins_: int=50, fig_height: int=500, fig_width: int=1000, font_size: int=14, title_font_size: int=20)
Plot histogram of permutation scores
Parameters
nbins_ : int, default=50
    Number of bins for the histogram.
fig_height : int, default=500
    Height of the figure.
fig_width : int, default=1000
    Width of the figure.
Purpose: This function creates a histogram of permutation scores from a permutation test, commonly used to evaluate model stability and significance.
Parameters:
nbins_
    Number of bins in the histogram.
fig_height, fig_width
    Dimensions of the plot.
font_size, title_font_size
    Font size for labels and title.
Plot Details: Plots the permutation scores as a histogram with Plotly. Marks the actual model accuracy score with a red dashed line. Adds annotations for the number of permutations, the accuracy score, and the p-value.
plot_s_scores(self, fig_height=900, fig_width=2000, range_color_=[-0.05, 0.05], color_continuous_scale_='jet', marker_size=14, font_size=20, title_font_size=20)
Plot S-plot
Parameters
fig_height : int, default=900
    Height of the figure.
fig_width : int, default=2000
    Width of the figure.
range_color_ : list, default=[-0.05, 0.05]
    Range of the color scale for the plot.
color_continuous_scale_ : str, default='jet'
    Color scale for the plot.
Purpose: This function generates an S-plot, a scatter plot of the covariance against the correlation between each feature and the model scores.
Parameters:
fig_height, fig_width
    Dimensions of the plot.
range_color_
    Range of colors to display.
color_continuous_scale_
    Color scale for the plot.
marker_size, font_size, title_font_size
    Customize marker size and font sizes.
Plot Details: The plot visualizes the relationship between covariance and correlation for the features. Uses Plotly's scatter plot to create an interactive S-plot. The axes are customizable, and the layout is kept visually clean (e.g., axis lines and tick marks).
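The two S-plot coordinates can be computed per feature as the covariance and the correlation between that feature and the predictive score vector. The sketch below follows this conventional definition. `s_plot_coordinates` is a hypothetical helper, and any scaling that metbit applies before computing these values is not reproduced here.

```python
import numpy as np

def s_plot_coordinates(X, t):
    """Covariance and correlation of every feature with the score vector t (illustrative)."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    Xc = X - X.mean(axis=0)                    # center every feature
    tc = t - t.mean()                          # center the scores
    n_samples = X.shape[0]
    covariance = tc @ Xc / (n_samples - 1)     # magnitude of each feature's contribution
    correlation = covariance / (Xc.std(axis=0, ddof=1) * tc.std(ddof=1))   # reliability, in [-1, 1]
    return covariance, correlation

# Three synthetic features: one follows t, one mirrors it, one is noise
rng = np.random.default_rng(7)
t = rng.normal(size=50)
X_demo = np.column_stack([2.0 * t, -t, rng.normal(size=50)])
covariance, correlation = s_plot_coordinates(X_demo, t)
```

Features far from the origin on both axes combine a large contribution with high reliability, which is what makes the S-shaped tails of the plot the usual place to look for candidate markers.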
plot_loading(self, fig_height=900, fig_width=2000, range_color_=[-0.05, 0.05], color_continuous_scale_='jet', marker_size=5, font_size=20, title_font_size=20, xaxis_direction='reversed', xaxis_title='𝛿<sub>H</sub> in ppm')
Plot loading plot
Parameters
fig_height : int, default=900
    Height of the figure.
fig_width : int, default=2000
    Width of the figure.
range_color_ : list, default=[-0.05, 0.05]
    Range of the color scale for the plot.
color_continuous_scale_ : str, default='jet'
    Color scale for the plot.
Purpose: This function generates a loading plot, typically used in multivariate analysis to visualize the relationship between features and the scores.
Parameters:
fig_height, fig_width
    Dimensions of the plot.
range_color_
    Color range to represent the covariance values.
color_continuous_scale_
    Color scale used for continuous color mapping.
marker_size, font_size, title_font_size
    Customize the appearance of the markers and fonts.
xaxis_direction
    Set the direction for the x-axis (e.g., reversed or not).
xaxis_title
    Title for the x-axis.
Plot Details: The loading plot shows the relationship between features (usually a set of variables or compounds) and the model scores. The correlation for each feature is displayed alongside its covariance, with colors representing the covariance values. The plot is interactive and customizable (e.g., marker size, color scale, axis settings).