metbit.metbit
Analysis and models module in metbit 6.6.0.
import metbit.metbitClasses
opls_da
Methods
__init__(self, X, y, features_name=None, n_components=2, scaling_method='pareto', kfold=3, estimator='opls', random_state=42, auto_ncomp=True)
Purpose: Initializes the model, validates the inputs, and preprocesses the data for further analysis.
Parameters:
• X: array-like (data matrix). The input features for the model. Must be a 2D array-like structure, such as pandas.DataFrame or numpy.ndarray.
• y: array-like (target vector). The target variable or class labels associated with the rows of X. Can be a pandas.Series, numpy.ndarray, or list.
• features_name: array-like, optional (default=None). List of feature names corresponding to the columns of X. If not provided, column names will be generated or inferred.
• n_components: int, optional (default=2). The number of components to use for OPLS-DA analysis. Must be a positive integer.
• scaling_method: str, optional (default='pareto'). The method used for scaling the data. Options include:
  • 'pareto': Pareto scaling (power 0.5)
  • 'mean': Mean-centered data
  • 'uv': Unit variance scaling (standardization)
  • 'minmax': Min-max scaling
• kfold: int, optional (default=3). The number of folds for cross-validation.
• estimator: str, optional (default='opls'). The estimator method used for modeling. Default is 'opls' (Orthogonal Projections to Latent Structures).
• random_state: int, optional (default=42). The random seed used for reproducibility.
• auto_ncomp: bool, optional (default=True). If True, automatically selects the optimal number of components for the model. If False, the number of components is set manually.
Raises:
• ValueError: If any input is invalid, such as mismatched dimensions or incorrect data types.
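The scaling options listed for scaling_method can be sketched in NumPy as follows. This is an illustration based on the usual chemometrics definitions, not metbit's internal code: the helper name scale_matrix is hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption.

```python
import numpy as np

def scale_matrix(X, method="pareto"):
    """Column-wise scaling of a samples-by-features matrix (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    if method == "mean":
        # Mean centering only
        return centered
    std = X.std(axis=0, ddof=1)  # assumption: sample standard deviation
    if method == "uv":
        # Unit variance: divide by the standard deviation
        return centered / std
    if method == "pareto":
        # Pareto: divide by the square root of the standard deviation
        return centered / np.sqrt(std)
    if method == "minmax":
        # Min-max: map each column onto the interval [0, 1]
        span = X.max(axis=0) - X.min(axis=0)
        return (X - X.min(axis=0)) / span
    raise ValueError(f"Unknown scaling method: {method!r}")

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(50, 3))
X_uv = scale_matrix(X_demo, "uv")
X_pareto = scale_matrix(X_demo, "pareto")
X_minmax = scale_matrix(X_demo, "minmax")
```

Pareto scaling sits between mean centering and unit variance scaling: it reduces the dominance of high-variance features while keeping more of the original data structure than 'uv' does.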
fit(self)
Purpose: Fits the OPLS-DA model to the data and computes model performance metrics.
Parameters: None
Returns:
• Prints a summary of the model fit, including:
  • Sample size and class distributions
  • Number of features and components
  • Scaling method used
  • R² and Q² metrics
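R² and Q² are conventionally computed with the same formula, 1 - (residual sum of squares / total sum of squares), applied to fitted values for R² and to cross-validated predictions for Q². The sketch below assumes these standard definitions and uses ordinary least squares as a stand-in model; it is not OPLS and does not claim to reproduce how metbit computes its metrics. The helper names are hypothetical.

```python
import numpy as np

def explained_fraction(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - rss / tss

def fit_predict(X_train, y_train, X_test):
    """Stand-in model: ordinary least squares with an intercept (not OPLS)."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=60)

# R2: goodness of fit on the data used for training
r2 = explained_fraction(y, fit_predict(X, y, X))

# Q2: same formula, but every prediction comes from a model
# that did not see that sample (3-fold cross-validation here)
kfold = 3
folds = np.array_split(rng.permutation(len(y)), kfold)
y_cv = np.empty_like(y)
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    y_cv[test_idx] = fit_predict(X[train_idx], y[train_idx], X[test_idx])
q2 = explained_fraction(y, y_cv)
```

A Q² far below R² is the usual sign of overfitting, which is why both values are reported together.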
get_oplsda_scores(self)
Get OPLS-DA scores
get_s_scores(self)
Get S scores
get_oplsda_model(self)
Get OPLS-DA model
get_cv_model(self)
Get cross-validation model
permutation_test(self, n_permutations=500, cv=3, n_jobs=-1, verbose=10)
get_permutation_scores(self)
Get permutation scores
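A permutation test compares the score of the real model with scores obtained after shuffling the class labels. The sketch below shows the idea with a simple stand-in score (the difference between two class means) instead of a cross-validated OPLS-DA model. The p-value formula with the +1 correction is a common convention and is assumed here, not taken from metbit's source.

```python
import numpy as np

def group_separation(x, labels):
    """Stand-in score: absolute difference between the two class means."""
    return abs(x[labels == 1].mean() - x[labels == 0].mean())

rng = np.random.default_rng(42)
labels = np.repeat([0, 1], 20)
x = rng.normal(size=40) + 1.5 * labels  # class 1 is shifted upwards

# Score obtained with the true labels
observed = group_separation(x, labels)

# Scores obtained after breaking the link between data and labels
n_permutations = 500
perm_scores = np.array([
    group_separation(x, rng.permutation(labels))
    for _ in range(n_permutations)
])

# Empirical p-value with the +1 correction so it can never be exactly zero
p_value = (np.sum(perm_scores >= observed) + 1) / (n_permutations + 1)
```

A small p-value means that few label shufflings score as well as the real labels, which supports the claim that the model captures genuine class structure.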
vip_scores(self, model=None, features_name=None)
Get VIP score
Parameters
model : object, default=None
    OPLS-DA model.
features_name : array-like, shape (n_features,), default=None
    Name of features.
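For orientation, the VIP (variable importance in projection) score is commonly defined for PLS-type models from the weights, the scores, and the y loadings. The sketch below implements that textbook definition; the function name vip_from_pls and its arguments are hypothetical, and whether metbit uses exactly this variant for OPLS-DA is not stated in this reference.

```python
import numpy as np

def vip_from_pls(weights, scores, y_loadings):
    """VIP scores from PLS-type model matrices.

    weights    : (n_features, n_components) X weights W
    scores     : (n_samples, n_components) X scores T
    y_loadings : (n_components,) y loadings q
    """
    W = np.asarray(weights, dtype=float)
    T = np.asarray(scores, dtype=float)
    q = np.asarray(y_loadings, dtype=float).ravel()
    n_features = W.shape[0]
    # Sum of squares of y explained by each component
    ss = (q ** 2) * np.sum(T ** 2, axis=0)
    # Squared, column-normalised weights: share of each feature per component
    w_norm_sq = (W / np.linalg.norm(W, axis=0)) ** 2
    return np.sqrt(n_features * (w_norm_sq @ ss) / ss.sum())

# Random matrices stand in for a fitted model
rng = np.random.default_rng(0)
W_demo = rng.normal(size=(8, 2))
T_demo = rng.normal(size=(30, 2))
q_demo = np.array([0.9, 0.3])
vip = vip_from_pls(W_demo, T_demo, q_demo)
```

With this definition the squared VIP scores average to 1 across features, which is why 1 is a widely used importance threshold.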
get_vip_scores(self, filter_=False, threshold=1)
Get VIP score
Parameters
filter_ : bool, default=False
    If True, filter VIP score based on threshold.
threshold : int, default=1
    Threshold of VIP score.
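The effect of filter_ and threshold can be pictured as a simple mask over the VIP scores. The values below are hypothetical, and the strict comparison (greater than, not greater than or equal) is an assumption; the return type of get_vip_scores is not documented here.

```python
import numpy as np

features = np.array([0.90, 1.33, 2.05, 3.21, 3.56])  # hypothetical chemical shifts (ppm)
vip = np.array([0.4, 1.8, 0.9, 2.6, 1.0])            # hypothetical VIP scores
threshold = 1

keep = vip > threshold  # assumption: strictly greater than the threshold
filtered = dict(zip(features[keep].tolist(), vip[keep].tolist()))
```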
vip_plot(self, x_range=9, threshold=2, marker_size=12, fig_width=1000, fig_height=500, filter_=False, vip_transform=True, font_size=20, title_font_size=20, xaxis_direction='reversed')
Purpose: Plots the VIP scores for the features, allowing for customization and thresholding.
Parameters:
• x_range: int, optional (default=9). The range of the x-axis in the plot.
• threshold: int, optional (default=2). The threshold used to categorize VIP scores into "High" or "Low" importance.
• marker_size: int, optional (default=12). The size of the markers in the scatter plot.
• fig_width: int, optional (default=1000). The width of the plot.
• fig_height: int, optional (default=500). The height of the plot.
• filter_: bool, optional (default=False). If True, only features with VIP scores above the threshold are plotted.
• vip_transform: bool, optional (default=True). If True, the VIP scores are transformed to reflect their contribution to the model.
• font_size: int, optional (default=20). The font size for labels.
• title_font_size: int, optional (default=20). The font size for the plot title.
• xaxis_direction: str, optional (default='reversed'). The direction of the x-axis ('reversed' or 'normal').
Returns:
• fig: Plotly figure A Plotly scatter plot visualizing the VIP scores for the features.
plot_oplsda_scores(self, x_='t_scores', y_='t_ortho', color_=None, color_dict=None, symbol_=None, symbol_dict=None, fig_height=900, fig_width=1300, marker_size=35, marker_opacity=0.7, marker_label=None, font_size=20, title_font_size=21, legend_name=['Group', 'Time point'], individual_ellipse=True)
Plot OPLS-DA scores plot
Parameters
color_ : array-like, shape (n_samples,), default=None
    Group used to color each sample. If None, colors are based on the groups in y.
color_dict : dict, default=None
    Mapping from group to color. If None, colors are based on the groups in y.
symbol_ : array-like, shape (n_samples,), default=None
    Group used to choose the marker symbol of each sample. If None, symbols are based on the groups in y.
symbol_dict : dict, default=None
    Mapping from group to marker symbol. If None, symbols are based on the groups in y.
fig_height : int, default=900
    Height of the figure.
fig_width : int, default=1300
    Width of the figure.
marker_size : int, default=35
    Size of the marker.
marker_opacity : float, default=0.7
    Opacity of the marker.
• Purpose: Generates an OPLS-DA (Orthogonal Partial Least Squares Discriminant Analysis) scores plot, showing how samples are positioned by their scores on the predictive component (t_scores) and the orthogonal component (t_ortho).
• Parameters:
  • x_, y_: The names of the columns in the DataFrame (df_opls_scores) that contain the scores for the x and y axes (default to 't_scores' and 't_ortho').
  • color_: An array that assigns a color to each sample (optional).
  • color_dict: A dictionary of color mappings for the groups (optional).
  • symbol_: An array of symbols for the samples (optional).
  • fig_height, fig_width: Dimensions for the plot.
  • marker_size, marker_opacity: Control the appearance of the markers.
  • legend_name: Custom labels for the legend.
  • individual_ellipse: Whether to add ellipses to individual groups (default True).
• Plot Details:
  • Uses Plotly (px.scatter) to create an interactive scatter plot.
  • Can display confidence ellipses around each group.
  • Adds annotations for R²X, R²Y, and Q² statistics.
  • Marker size, opacity, labels, and colors are customizable.
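The confidence ellipses mentioned above can be derived from the covariance of a group's scores. The sketch below is one common construction, assuming the scores of a group are roughly bivariate normal and using a 95% level; the function name confidence_ellipse is hypothetical, and metbit's own ellipse calculation may differ.

```python
import numpy as np

def confidence_ellipse(x, y, level=0.95, n_points=100):
    """Boundary points of the confidence ellipse for 2-D points.

    Assumes the points are roughly bivariate normal.
    """
    pts = np.column_stack([x, y])
    center = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)
    # Chi-square quantile with 2 degrees of freedom has a closed form
    chi2_q = -2.0 * np.log(1.0 - level)
    # Eigenvectors give the axis directions, eigenvalues the axis variances
    eigvals, eigvecs = np.linalg.eigh(cov)
    theta = np.linspace(0.0, 2.0 * np.pi, n_points)
    unit_circle = np.column_stack([np.cos(theta), np.sin(theta)])
    # Stretch the circle along the principal axes, rotate, then shift
    return center + (unit_circle * np.sqrt(chi2_q * eigvals)) @ eigvecs.T

# Simulated scores for one group
rng = np.random.default_rng(1)
t_scores = rng.normal(scale=3.0, size=40)
t_ortho = 0.5 * t_scores + rng.normal(scale=1.0, size=40)
ellipse = confidence_ellipse(t_scores, t_ortho)
```

The returned array holds (x, y) boundary points that can be drawn as a closed line on top of the scatter plot, one ellipse per group.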
plot_hist(self, nbins_=50, fig_height=500, fig_width=1000, font_size=14, title_font_size=20)
Plot histogram of permutation scores
Parameters
nbins_ : int, default=50
    Number of bins for histogram.
fig_height : int, default=500
    Height of the figure.
fig_width : int, default=1000
    Width of the figure.
• Purpose: Creates a histogram of permutation scores from a permutation test, commonly used to evaluate model stability and significance.
• Parameters:
  • nbins_: Number of bins in the histogram.
  • fig_height, fig_width: Dimensions of the plot.
  • font_size, title_font_size: Font size for labels and title.
• Plot Details:
  • Plots the permutation scores as a histogram with Plotly.
  • Marks the actual model accuracy score with a red dashed line.
  • Adds annotations for the number of permutations, the accuracy score, and the p-value.
plot_s_scores(self, fig_height=900, fig_width=2000, range_color_=[-0.05, 0.05], color_continuous_scale_='jet', marker_size=14, font_size=20, title_font_size=20)
Plot S-plot
Parameters
fig_height : int, default=900
    Height of the figure.
fig_width : int, default=2000
    Width of the figure.
range_color_ : list, default=[-0.05, 0.05]
    Range of the color scale for the plot.
color_continuous_scale_ : str, default='jet'
    Color scale for the plot.
• Purpose: Generates an S-plot, a scatter plot of the covariance and correlation between each feature and the model scores.
• Parameters:
  • fig_height, fig_width: Dimensions of the plot.
  • range_color_: Range of colors to display.
  • color_continuous_scale_: Color scale for the plot.
  • marker_size, font_size, title_font_size: Customize marker size and font sizes.
• Plot Details:
  • The plot visualizes the relationship between covariance and correlation for features.
  • Uses Plotly's scatter plot to create an interactive S-plot.
  • The axes are customizable, and the plot is kept visually clean (e.g., axis lines and tick marks).
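The two S-plot coordinates of a feature are conventionally its covariance and its correlation with the predictive score vector. The sketch below computes both with NumPy under that conventional definition; the function name s_plot_coordinates is hypothetical and the data are simulated.

```python
import numpy as np

def s_plot_coordinates(X, t):
    """Covariance and correlation of each column of X with the score vector t."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    tc = t - t.mean()
    # Covariance measures the magnitude of each feature's contribution
    cov = tc @ Xc / (n - 1)
    # Correlation measures its reliability, independent of scale
    corr = cov / (tc.std(ddof=1) * Xc.std(axis=0, ddof=1))
    return cov, corr

rng = np.random.default_rng(7)
t_demo = rng.normal(size=25)
# Three features: positively related, negatively related, unrelated to t
X_demo = np.outer(t_demo, [2.0, -1.0, 0.0]) + rng.normal(size=(25, 3))
cov, corr = s_plot_coordinates(X_demo, t_demo)
```

Features that lie far from the origin on both axes combine a large contribution with high reliability, which is what gives the plot its S shape.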
plot_loading(self, fig_height=900, fig_width=2000, range_color_=[-0.05, 0.05], color_continuous_scale_='jet', marker_size=5, font_size=20, title_font_size=20, xaxis_direction='reversed', xaxis_title='𝛿<sub>H</sub> in ppm')
Plot loading plot
Parameters
fig_height : int, default=900
    Height of the figure.
fig_width : int, default=2000
    Width of the figure.
range_color_ : list, default=[-0.05, 0.05]
    Range of the color scale for the plot.
color_continuous_scale_ : str, default='jet'
    Color scale for the plot.
• Purpose: Generates a loading plot, typically used in multivariate analysis to visualize the relationship between features and the scores.
• Parameters:
  • fig_height, fig_width: Dimensions of the plot.
  • range_color_: Color range to represent the covariance values.
  • color_continuous_scale_: Color scale used for continuous color mapping.
  • marker_size, font_size, title_font_size: Customize the appearance of the markers and fonts.
  • xaxis_direction: Set the direction for the x-axis (e.g., reversed or not).
  • xaxis_title: Title for the x-axis.
• Plot Details:
  • The loading plot shows the relationship between features (usually a set of variables or compounds) and the model scores.
  • The correlation for each feature is displayed alongside its covariance, with colors representing the covariance values.
  • The plot is interactive and customizable (e.g., marker size, color scale, axis settings).
pca
PCA model
Parameters
X : array-like, shape (n_samples, n_features)
    Training data, where n_samples is the number of samples and n_features is the number of features.
label : array-like, shape (n_samples,)
    Target data, where n_samples is the number of samples.
features_name : array-like, shape (n_features,), default=None
    Name of features.
n_components : int, default=2
    Number of components to keep.
scaling_method : str, default='pareto'
    Method of scaling: 'pareto' for Pareto scaling, 'mean' for mean centering, 'uv' for unit variance scaling.
random_state : int, default=42
    Random state for permutation test.
test_size : float, default=0.3
    Size of test set.
Examples:
import pandas as pd
import numpy as np
from metbit import pca

# Create a dataset
data = pd.DataFrame(np.random.rand(500, 50000))
class_ = pd.Series(np.random.choice(['A', 'B', 'C'], 500), name='Group')
time = pd.Series(np.random.choice(['1-wk', '2-wk', '3-wk', '4-wk'], 500), name='Time point')

# Combine labels and data (assumed layout: Group, Time point, then features)
datasets = pd.concat([class_, time, data], axis=1)

# Assign X and target
X = datasets.iloc[:, 2:]
y = datasets['Group']
time = datasets['Time point']
features_name = list(X.columns.astype(float))

# Perform PCA model
pca_mod = pca(X=X, label=y, features_name=features_name, n_components=2, scaling_method='pareto', random_state=42, test_size=0.3)
pca_mod.fit()

# Visualisation of PCA model
pca_mod.plot_observe_variance()
pca_mod.plot_cumulative_observed()

# Map each time point to a marker symbol
shape_ = {'1-wk': 'circle', '2-wk': 'square', '3-wk': 'diamond', '4-wk': 'cross'}
pca_mod.plot_pca_scores(symbol_=time, symbol_dict=shape_)
pca_mod.plot_loading_()
pca_mod.plot_pca_trajectory(time_=time, time_order={'1-wk': 0, '2-wk': 1, '3-wk': 2, '4-wk': 3}, color_dict={'A': '#636EFA', 'B': '#EF553B', 'C': '#00CC96'}, symbol_dict=shape_)
Methods
__init__(self, X: pd.DataFrame, label: list=None, features_name: list=None, n_components=2, scaling_method='pareto', random_state=42, test_size=0.3)
fit(self)
get_explained_variance(self)
get_scores(self)
get_loadings(self)
get_q2_test(self)
plot_observe_variance(self, fig_height=600, fig_width=800, font_size=15)
Visualise explained variance plot
Returns
fig : plotly.graph_objects.Figure
    Explained variance plot.
plot_cumulative_observed(self, fig_height=600, fig_width=800, font_size=15, marker_size=10)
Visualise cumulative variance plot
Returns
fig : plotly.graph_objects.Figure
    Cumulative variance plot.
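The quantities shown in the explained variance and cumulative variance plots can be sketched with a plain SVD. This is a generic PCA calculation on simulated, mean-centered data, offered as an illustration; it does not reproduce metbit's scaling step or its internal PCA implementation.

```python
import numpy as np

# Simulated data whose columns have decreasing spread
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])

# PCA via SVD of the mean-centred matrix
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)

# Variance captured by each component, as a fraction of the total
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)

# Running total, as shown in a cumulative variance plot
cumulative_ratio = np.cumsum(explained_ratio)
```

The cumulative curve is a convenient guide for choosing n_components: pick the smallest number of components whose cumulative ratio reaches the coverage you need.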
plot_pca_scores(self, pc=['PC1', 'PC2'], color_=None, color_dict=None, symbol_=None, symbol_dict=None, marker_label=None, fig_height=900, fig_width=1300, marker_size=35, marker_opacity=0.7, font_size=20, title_font_size=21, individual_ellipse=True, legend_name=['Group', 'Time point'])
Visualise PCA scores plot
Parameters
pc : list, default=['PC1', 'PC2']
    List of principal components to plot.
color_ : array-like, shape (n_samples,), default=None
    Group used to color each sample, where n_samples is the number of samples.
color_dict : dict, default=None
    Dictionary of color mapping.
symbol_ : array-like, shape (n_samples,), default=None
    Group used to choose the marker symbol of each sample, where n_samples is the number of samples.
symbol_dict : dict, default=None
    Dictionary of symbol mapping.
marker_label : array-like, shape (n_samples,), default=None
    Text to display on each point.
fig_height : int, default=900
    Height of figure.
fig_width : int, default=1300
    Width of figure.
marker_size : int, default=35
    Size of marker.
marker_opacity : float, default=0.7
    Opacity of marker.
Returns
fig : plotly.graph_objects.Figure
    PCA scores plot.
plot_loading_(self, pc=['PC1', 'PC2'], fig_height=600, fig_width=1800, font_size=20, title_font_size=20, marker_size=1, x_axis_title='𝛿<sub>H</sub> in ppm', xaxis_direction='reversed')
Visualise PCA loadings
Parameters
pc : list, default=['PC1', 'PC2']
    Principal components to plot.
fig_height : int, default=600
    Height of figure.
fig_width : int, default=1800
    Width of figure.
Returns
fig : plotly.graph_objects.Figure
    Plotly figure.
plot_pca_trajectory(self, time_, time_order, stat_=['mean', 'sem'], pc=['PC1', 'PC2'], color_dict=None, symbol_dict=None, fig_height=900, fig_width=1300, marker_size=35, marker_opacity=0.7, title_font_size=20, font_size=20, legend_name=['Group', 'Time point'])
Visualise PCA trajectory
Parameters
time_ : array-like, shape (n_samples,)
    Time point of each sample.
time_order : dict
    Order of the time points.
stat_ : list, default=['mean', 'sem']
    Statistics to calculate. The first element is 'mean' or 'median'; the second is 'sem' or 'std'.
pc : list, default=['PC1', 'PC2']
    Principal components to plot.
color_dict : dict, default=None
    Dictionary of colors for each group.
symbol_dict : dict, default=None
    Dictionary of symbols for each time point.
fig_height : int, default=900
    Height of figure.
fig_width : int, default=1300
    Width of figure.
marker_size : int, default=35
    Size of marker.
marker_opacity : float, default=0.7
    Opacity of marker.
Returns
fig : plotly.graph_objects.Figure
    Plotly figure.
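A trajectory plot summarises each group at each time point by a central value and a spread, then connects the points in time order. The sketch below computes the default statistics ('mean' and 'sem') with NumPy on a small made-up score table; the function name trajectory_points and the returned structure are hypothetical, and metbit's own aggregation may be organised differently.

```python
import numpy as np

def trajectory_points(scores, groups, times, time_order):
    """Mean and SEM of the scores for every (group, time point) pair.

    Returns a dict: group -> list of (time, mean, sem) sorted by time_order.
    """
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    times = np.asarray(times)
    out = {}
    for g in np.unique(groups):
        rows = []
        # Visit the time points of this group in the requested order
        for t in sorted(np.unique(times[groups == g]), key=time_order.get):
            block = scores[(groups == g) & (times == t)]
            mean = block.mean(axis=0)
            # Standard error of the mean: sample std divided by sqrt(n)
            sem = block.std(axis=0, ddof=1) / np.sqrt(len(block))
            rows.append((str(t), mean, sem))
        out[str(g)] = rows
    return out

# Made-up PC1/PC2 scores for six samples
scores = np.array([[1, 0], [3, 0], [2, 2], [4, 2], [0, 1], [0, 3]])
groups = ['A', 'A', 'A', 'A', 'B', 'B']
times = ['2-wk', '2-wk', '1-wk', '1-wk', '1-wk', '1-wk']
traj = trajectory_points(scores, groups, times, {'1-wk': 0, '2-wk': 1})
```

The time_order dictionary matters because time labels such as '10-wk' and '2-wk' do not sort chronologically as strings.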