metbit.analysis.large_scale
Memory-efficient algorithms for large-scale metabolomics data.
import metbit.analysis.large_scale
Classes
MemoryEstimator
Estimate RAM requirements before loading large datasets.
Methods
estimate(n_samples: int, n_features: int, dtype: type=np.float64, copies: int=1)
Return a dict with estimated byte counts and a human-readable summary.
Parameters
n_samples : int
n_features : int
dtype : numpy dtype
    Storage dtype. float64 = 8 bytes, float32 = 4 bytes.
copies : int
    Number of simultaneous matrix copies (e.g., 2 for train/test split).
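The arithmetic behind such an estimate is simple enough to sketch. The helpers below, estimate_bytes and human_readable, are hypothetical names for illustration and are not part of metbit; they assume a dense matrix and decimal units (1 GB = 10^9 bytes). The actual dict returned by estimate() may be structured differently.

```python
import numpy as np


def estimate_bytes(n_samples, n_features, dtype=np.float64, copies=1):
    """Hypothetical helper: bytes needed for `copies` dense matrices."""
    itemsize = np.dtype(dtype).itemsize  # 8 for float64, 4 for float32
    return n_samples * n_features * itemsize * copies


def human_readable(n_bytes):
    """Format a byte count using decimal units (an assumption of this sketch)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1000 or unit == "TB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000


# 10,000 samples x 50,000 features stored as float64, one copy
total = estimate_bytes(10_000, 50_000, dtype=np.float64, copies=1)
print(total, human_readable(total))
```

Doubling copies doubles the result, which is why a train/test split or a centered copy of the matrix is worth counting before loading the data.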
print_estimate(n_samples: int, n_features: int, dtype: type=np.float64, copies: int=2)
ChunkedSTOCSY
Memory-efficient STOCSY for datasets with large feature counts.
Replaces the full-matrix centered copy in the standard STOCSY with chunked Pearson correlation. Peak memory is O(n_samples * chunk_size) instead of O(n_samples * n_features).
Parameters
chunk_size : int
    Number of features processed per chunk.
    - 50,000 features @ 10,000 samples (float64) ~ 4 GB peak
    - Reduce if RAM is constrained; increase for throughput.
p_value_threshold : float
    Significance threshold for highlighting correlations in the plot.
Methods
__init__(self, chunk_size: int=50000, p_value_threshold: float=0.0001)
active_backend()
Return the current compute backend configuration.
compute(self, spectra: pd.DataFrame, anchor_ppm_value: float)
Compute correlations and two-sided p-values.
Returns
ppm : ndarray, shape (n_features,)
correlations : ndarray, shape (n_features,)
p_values : ndarray, shape (n_features,)
plot(self, spectra: pd.DataFrame, anchor_ppm_value: float)
Compute and return a Plotly figure identical in style to STOCSY().
Compatible with existing downstream code that calls .show() or saves the figure.
LargeScaleAlignment
Alignment wrapper that avoids full matrix copies.
Delegates to icoshift_align but uses a single numpy allocation for the output, avoiding the two-copy pattern in the original implementation.
Parameters
chunk_size : int
    Spectra processed per batch (for very large sample counts). Currently all spectra are processed together; chunk_size is reserved for future batched support.
Methods
__init__(self, chunk_size: int=500)
align(self, spectra: pd.DataFrame, ppm: np.ndarray, windows, reference: str='median', max_shift_ppm: float=0.02)
Align spectra using icoshift with memory-efficient allocation.
Functions
feature_preselection(X: Union[pd.DataFrame, np.ndarray], percentile: float=20.0, method: str='variance', chunk_size: int=100000)
Remove low-information features before modeling.
Scientifically justified: removes spectral bins that carry only instrument noise. The threshold is data-driven (a percentile of the score distribution) rather than a fixed constant.
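To make the percentile threshold concrete, here is a minimal sketch of variance-based preselection computed in feature chunks. The function name preselect_by_variance is hypothetical and this is not metbit's implementation; in particular, keeping features whose score equals the threshold is an assumption of the sketch.

```python
import numpy as np


def preselect_by_variance(X, percentile=20.0, chunk_size=100_000):
    """Hypothetical sketch: drop features whose variance falls below a percentile."""
    X = np.asarray(X)
    n_features = X.shape[1]
    scores = np.empty(n_features, dtype=np.float64)

    # Score features chunk by chunk so temporaries stay bounded by chunk_size.
    for start in range(0, n_features, chunk_size):
        stop = min(start + chunk_size, n_features)
        scores[start:stop] = X[:, start:stop].var(axis=0)

    # Data-driven threshold: a percentile of the score distribution.
    threshold = np.percentile(scores, percentile)
    mask = scores >= threshold  # assumption: ties at the threshold are kept
    return X[:, mask], mask


# Toy data: column j is a scaled ramp, so variance grows with j.
X = np.outer(np.arange(5.0), np.arange(1.0, 11.0))
X_reduced, mask = preselect_by_variance(X, percentile=20.0, chunk_size=3)
print(X_reduced.shape, mask)
```

With percentile=0 the threshold is the minimum score, so every feature is kept, which matches the documented behavior of that setting.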
Parameters
X : DataFrame or ndarray, shape (n_samples, n_features)
percentile : float
    Remove features below this percentile of the score distribution. 20 removes the bottom fifth; 0 keeps everything.
method : str
    'variance' - inter-sample variance (fast, one-pass)
    'iqr' - interquartile range (robust to outliers, slower)
chunk_size : int
    Features processed per chunk to bound peak memory.
Returns
X_reduced : same type as X, shape (n_samples, n_kept)
mask : bool ndarray, shape (n_features,)
    True for kept features
chunked_pearson(matrix: np.ndarray, anchor_index: int, chunk_size: int=50000, out_dtype: type=np.float64)
Pearson correlation between one column and all other columns.
Delegates to the _native dispatch layer, which selects the fastest available backend: GPU > C+OpenMP > multiprocessing > chunked NumPy.
Parameters
matrix : ndarray, shape (n_samples, n_features)
anchor_index : int
chunk_size : int
    Feature chunk size used by the CPU fallback paths.
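The idea behind a chunked NumPy fallback can be sketched as follows. This is an illustrative version under stated assumptions, not the code behind the _native dispatch layer: the name chunked_pearson_sketch is hypothetical, only one feature chunk is centered at a time, and constant columns yield NaN.

```python
import numpy as np


def chunked_pearson_sketch(matrix, anchor_index, chunk_size=50_000):
    """Hypothetical sketch: Pearson r between one column and every column."""
    n_samples, n_features = matrix.shape

    # Center the anchor column once; it is only n_samples long.
    anchor = matrix[:, anchor_index].astype(np.float64)
    anchor_c = anchor - anchor.mean()
    anchor_norm = np.sqrt(np.dot(anchor_c, anchor_c))

    r = np.empty(n_features, dtype=np.float64)
    for start in range(0, n_features, chunk_size):
        stop = min(start + chunk_size, n_features)
        # astype copies the chunk, so the input matrix is never modified
        # and the centered temporary is at most n_samples x chunk_size.
        block = matrix[:, start:stop].astype(np.float64)
        block -= block.mean(axis=0)
        norms = np.sqrt(np.einsum("ij,ij->j", block, block))
        # Zero-variance columns produce 0/0; report NaN without warnings.
        with np.errstate(divide="ignore", invalid="ignore"):
            r[start:stop] = (anchor_c @ block) / (anchor_norm * norms)
    return r


rng = np.random.default_rng(0)
M = rng.normal(size=(30, 7))
r = chunked_pearson_sketch(M, anchor_index=2, chunk_size=3)
print(r)
```

The result should agree with the corresponding row of np.corrcoef(M, rowvar=False), while never materializing a centered copy of the full matrix.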
memory_report(X: Union[pd.DataFrame, np.ndarray])
Print a memory usage report for a given dataset.