You are viewing the documentation for metbit 9.0.0.

metbit.analysis.large_scale

Memory-efficient algorithms for large-scale metabolomics data.

import metbit.analysis.large_scale

Classes

MemoryEstimator

Estimate RAM requirements before loading large datasets.

Methods

estimate(n_samples: int, n_features: int, dtype: type=np.float64, copies: int=1)

Return a dict with estimated byte counts and a human-readable summary.

Parameters

n_samples : int

n_features : int

dtype : numpy dtype

Storage dtype. float64=8 bytes, float32=4 bytes.

copies : int

Number of simultaneous matrix copies (e.g., 2 for train/test split).
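The arithmetic behind such an estimate is simple enough to sketch independently of metbit. The helper below is hypothetical (its name and the dict keys `bytes` and `summary` are assumptions, not the library's actual return format); it only illustrates how sample count, feature count, dtype width, and copy count combine.

```python
import numpy as np

def estimate_bytes(n_samples, n_features, dtype=np.float64, copies=1):
    """Hypothetical helper: bytes needed for `copies` dense matrices."""
    itemsize = np.dtype(dtype).itemsize  # 8 for float64, 4 for float32
    total = n_samples * n_features * itemsize * copies
    return {
        "bytes": total,  # assumed key name, for illustration only
        "summary": f"{total / 1e9:.2f} GB for {copies} x "
                   f"({n_samples} x {n_features}) {np.dtype(dtype).name}",
    }

# Two float32 copies of a 10,000 x 50,000 matrix.
est = estimate_bytes(10_000, 50_000, dtype=np.float32, copies=2)
print(est["summary"])
```

Halving the dtype width and doubling the number of copies cancel out, so this case needs the same memory as a single float64 copy.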

print_estimate(n_samples: int, n_features: int, dtype: type=np.float64, copies: int=2)

ChunkedSTOCSY

Memory-efficient STOCSY for datasets with large feature counts.

Replaces the full-matrix centered copy in the standard STOCSY with chunked Pearson correlation. Peak memory is O(n_samples * chunk_size) instead of O(n_samples * n_features).

Parameters

chunk_size : int

Number of features processed per chunk.

- 50,000 features @ 10,000 samples (float64) ~ 4 GB peak
- Reduce if RAM is constrained; increase for throughput.
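The 4 GB figure follows from 50,000 x 10,000 x 8 bytes. To pick a chunk size from a memory budget, the same relation can be inverted, as in the minimal sketch below. The helper name is hypothetical, and it counts only a single (n_samples x chunk_size) working block; temporaries created by a real backend are not included.

```python
import numpy as np

def chunk_size_for_budget(n_samples, budget_bytes, dtype=np.float64):
    """Hypothetical helper: largest chunk whose working block fits the budget."""
    bytes_per_feature = n_samples * np.dtype(dtype).itemsize
    return max(1, budget_bytes // bytes_per_feature)

# Peak for the default chunk size at 10,000 samples in float64.
peak = 10_000 * 50_000 * np.dtype(np.float64).itemsize

# Chunk size that keeps one working block under roughly 1 GB.
chunk = chunk_size_for_budget(10_000, budget_bytes=1_000_000_000)
print(peak, chunk)
```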

p_value_threshold : float

Significance threshold for highlighting correlations in the plot.

Methods

__init__(self, chunk_size: int=50000, p_value_threshold: float=0.0001)

active_backend()

Return the current compute backend configuration.

compute(self, spectra: pd.DataFrame, anchor_ppm_value: float)

Compute correlations and two-sided p-values.

Returns

ppm : ndarray (n_features,)

correlations : ndarray (n_features,)

p_values : ndarray (n_features,)

plot(self, spectra: pd.DataFrame, anchor_ppm_value: float)

Compute and return a Plotly figure identical in style to STOCSY().

Compatible with existing downstream code that calls .show() or saves the figure.

LargeScaleAlignment

Alignment wrapper that avoids full matrix copies.

Delegates to icoshift_align but uses a single numpy allocation for the output, avoiding the two-copy pattern in the original implementation.

Parameters

chunk_size : int

Spectra processed per batch (for very large sample counts). Currently all spectra are processed together; chunk_size is reserved for future batched support.

Methods

__init__(self, chunk_size: int=500)

align(self, spectra: pd.DataFrame, ppm: np.ndarray, windows, reference: str='median', max_shift_ppm: float=0.02)

Align spectra using icoshift with memory-efficient allocation.
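The single-allocation idea can be illustrated without icoshift itself. In the sketch below, a per-spectrum integer shift via `np.roll` is a stand-in for the real alignment step, and the function name is hypothetical. The output buffer is allocated once and each aligned row is written straight into it, rather than collecting rows in a list and stacking them into a second full-size array.

```python
import numpy as np

def shift_into_single_buffer(values, shifts):
    """Hypothetical sketch: write shifted rows into one preallocated output."""
    out = np.empty_like(values)  # the only full-size allocation
    for i, shift in enumerate(shifts):
        # np.roll is a stand-in for the alignment step, not icoshift.
        out[i] = np.roll(values[i], shift)
    return out

values = np.arange(12.0).reshape(3, 4)
aligned = shift_into_single_buffer(values, shifts=[0, 1, -1])
print(aligned)
```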

Functions

feature_preselection(X: Union[pd.DataFrame, np.ndarray], percentile: float=20.0, method: str='variance', chunk_size: int=100000)

Remove low-information features before modeling.

Scientifically justified: removes spectral bins that carry only instrument noise. The threshold is data-driven (percentile of the distribution) rather than a fixed constant.

Parameters

X : DataFrame or ndarray, shape (n_samples, n_features)

percentile : float

Remove features below this percentile of the score distribution. 20 removes the bottom fifth; 0 keeps everything.

method : str

- 'variance' - inter-sample variance (fast, one-pass)
- 'iqr' - interquartile range (robust to outliers, slower)

chunk_size : int

Features processed per chunk to bound peak memory.

Returns

X_reduced : same type as X, shape (n_samples, n_kept)

mask : bool ndarray, shape (n_features,) - True for kept features
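As a rough illustration of the 'variance' method, the sketch below scores features chunk by chunk and keeps those at or above the requested percentile. It is not metbit's implementation: the function name is hypothetical, it handles ndarray input only, and treating the threshold as inclusive (so that `percentile=0` keeps everything) is an assumption.

```python
import numpy as np

def variance_preselect(X, percentile=20.0, chunk_size=100_000):
    """Hypothetical sketch of percentile-based variance filtering."""
    X = np.asarray(X)
    n_features = X.shape[1]
    scores = np.empty(n_features, dtype=np.float64)
    # Score features in column chunks so temporaries stay bounded.
    for start in range(0, n_features, chunk_size):
        stop = min(start + chunk_size, n_features)
        scores[start:stop] = X[:, start:stop].var(axis=0)
    threshold = np.percentile(scores, percentile)
    mask = scores >= threshold  # True for kept features (inclusive: assumption)
    return X[:, mask], mask

# Ten columns whose variance grows with the column index.
X = np.outer(np.arange(5.0), np.arange(1, 11))
X_reduced, mask = variance_preselect(X, percentile=20.0, chunk_size=4)
print(X_reduced.shape, mask)
```

With ten features of distinct variance, `percentile=20.0` drops the two lowest-variance columns.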

chunked_pearson(matrix: np.ndarray, anchor_index: int, chunk_size: int=50000, out_dtype: type=np.float64)

Pearson correlation between one column and all other columns.

Delegates to the _native dispatch layer, which selects the fastest available backend: GPU > C+OpenMP > multiprocessing > chunked NumPy.

Parameters

matrix : ndarray, shape (n_samples, n_features)

anchor_index : int

chunk_size : int

Feature chunk size used by the CPU fallback paths.
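The chunked NumPy idea can be sketched as follows. This is an illustration of the technique only, not the `_native` dispatch layer or any of its backends: the function name is hypothetical, `out_dtype` is omitted, and returning 0.0 for zero-variance columns is an assumption. Only one (n_samples x chunk_size) block is centered at a time, so no full centered copy of the matrix is created.

```python
import numpy as np

def chunked_pearson_sketch(matrix, anchor_index, chunk_size=50_000):
    """Hypothetical sketch: Pearson r of one column against every column."""
    n_samples, n_features = matrix.shape
    anchor = matrix[:, anchor_index].astype(np.float64)
    anchor_c = anchor - anchor.mean()
    anchor_norm = np.sqrt(anchor_c @ anchor_c)
    r = np.empty(n_features, dtype=np.float64)
    for start in range(0, n_features, chunk_size):
        stop = min(start + chunk_size, n_features)
        block = matrix[:, start:stop].astype(np.float64)  # copy of this chunk only
        block -= block.mean(axis=0)                       # center in place
        norms = np.sqrt(np.einsum("ij,ij->j", block, block))
        denom = norms * anchor_norm
        with np.errstate(divide="ignore", invalid="ignore"):
            # Zero-variance columns get r = 0.0 (assumption).
            r[start:stop] = np.where(denom > 0, anchor_c @ block / denom, 0.0)
    return r

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 17))
r = chunked_pearson_sketch(X, anchor_index=3, chunk_size=5)
print(r.shape)
```

The result includes the anchor column itself, whose correlation is 1 up to rounding.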

memory_report(X: Union[pd.DataFrame, np.ndarray])

Print a memory usage report for a given dataset.

Source

metbit/analysis/large_scale.py at v9.0.0