
You are viewing the documentation for metbit 9.1.0.

metbit.analysis.large_scale

Memory-efficient algorithms for large-scale metabolomics data.

import metbit.analysis.large_scale

Classes

MemoryEstimator

Estimate RAM requirements before loading large datasets.

Examples:
>>> import numpy as np
>>> import metbit.analysis.large_scale
>>> info = metbit.analysis.large_scale.MemoryEstimator.estimate(10000, 50000, np.float32)
>>> print(info["summary"])
>>> metbit.analysis.large_scale.MemoryEstimator.print_estimate(10000, 50000)

Methods

estimate(n_samples: int, n_features: int, dtype: type=np.float64, copies: int=1)

Return a dict with estimated byte counts and a human-readable summary.

Parameters

n_samples : int

n_features : int

dtype : numpy dtype

Storage dtype. float64=8 bytes, float32=4 bytes.

copies : int

Number of simultaneous matrix copies (e.g., 2 for train/test split).

Examples:
>>> import numpy as np
>>> from metbit.analysis.large_scale import MemoryEstimator
>>> info = MemoryEstimator.estimate(5000, 20000, np.float32, copies=2)
>>> info["single_matrix_gb"]
>>> info["recommended_dtype"]
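The arithmetic behind an estimate of this kind can be sketched in a few lines. The helper below is hypothetical: it is not metbit's implementation, and its name and dictionary keys are assumptions chosen for illustration. It multiplies the matrix shape by the dtype's item size and by the number of simultaneous copies.

```python
import numpy as np


def estimate_matrix_bytes(n_samples, n_features, dtype=np.float64, copies=1):
    """Hypothetical helper: bytes needed for `copies` dense matrices."""
    itemsize = np.dtype(dtype).itemsize  # 8 for float64, 4 for float32
    single = n_samples * n_features * itemsize
    return {
        "single_matrix_bytes": single,
        "total_bytes": single * copies,
        "total_gb": single * copies / 1e9,  # decimal gigabytes
    }


# Same shape as the example above: 5,000 samples x 20,000 features, float32, 2 copies.
info = estimate_matrix_bytes(5000, 20000, np.float32, copies=2)
print(info["total_bytes"], "bytes,", info["total_gb"], "GB")
```

A dense-matrix estimate like this is a lower bound; temporaries created during processing are not counted.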

print_estimate(n_samples: int, n_features: int, dtype: type=np.float64, copies: int=2)

Print a human-readable memory estimate to stdout.

Parameters

n_samples : int

n_features : int

dtype : numpy dtype

copies : int

Examples:
>>> import numpy as np
>>> from metbit.analysis.large_scale import MemoryEstimator
>>> MemoryEstimator.print_estimate(10000, 100000, np.float32, copies=2)

ChunkedSTOCSY

Memory-efficient STOCSY for datasets with large feature counts.

Replaces the full-matrix centered copy in the standard STOCSY with chunked Pearson correlation. Peak memory is O(n_samples * chunk_size) instead of O(n_samples * n_features).

Parameters

chunk_size : int

Number of features processed per chunk.
- A chunk of 50,000 features at 10,000 samples (float64) needs about 4 GB at peak.
- Reduce it if RAM is constrained; increase it for throughput.

p_value_threshold : float

Significance threshold for highlighting correlations in the plot.

Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.analysis.large_scale import ChunkedSTOCSY
>>> ppm = np.linspace(10, 0, 2000)
>>> spectra = pd.DataFrame(np.random.rand(50, 2000), columns=ppm)
>>> stocsy = ChunkedSTOCSY(chunk_size=500, p_value_threshold=1e-4)
>>> ppm_out, r, p = stocsy.compute(spectra, anchor_ppm_value=3.05)
>>> fig = stocsy.plot(spectra, anchor_ppm_value=3.05)
>>> fig.show()
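The 4 GB figure quoted for chunk_size follows from 10,000 samples x 50,000 features x 8 bytes. Working backwards from a RAM budget gives a rough way to pick chunk_size. The helper below is a hypothetical sketch, not part of metbit, and it assumes a single chunk-sized block dominates peak memory.

```python
import numpy as np


def chunk_size_for_budget(n_samples, budget_bytes, dtype=np.float64):
    """Hypothetical helper: largest feature chunk that fits in `budget_bytes`."""
    bytes_per_feature = n_samples * np.dtype(dtype).itemsize
    return max(1, budget_bytes // bytes_per_feature)


# 10,000 float64 samples with a 4 GB (decimal) budget
print(chunk_size_for_budget(10_000, 4_000_000_000))
```

In practice it is safer to leave headroom below the computed value, since intermediate arrays may add to the peak.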

Methods

__init__(self, chunk_size: int=50000, p_value_threshold: float=0.0001)

active_backend()

Return the current compute backend configuration.

Examples:
>>> from metbit.analysis.large_scale import ChunkedSTOCSY
>>> info = ChunkedSTOCSY.active_backend()
>>> info["gpu"]
>>> info["native_c"]

compute(self, spectra: pd.DataFrame, anchor_ppm_value: float)

Compute correlations and two-sided p-values.

Returns

ppm : ndarray, shape (n_features,)

correlations : ndarray, shape (n_features,)

p_values : ndarray, shape (n_features,)

Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.analysis.large_scale import ChunkedSTOCSY
>>> ppm = np.linspace(10, 0, 1000)
>>> spectra = pd.DataFrame(np.random.rand(80, 1000), columns=ppm)
>>> stocsy = ChunkedSTOCSY(chunk_size=200)
>>> ppm_out, r, p_vals = stocsy.compute(spectra, anchor_ppm_value=5.0)
>>> r.shape
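The reference does not say how the two-sided p-values are derived. A common choice, assumed here, is the Student t test for a Pearson coefficient with n_samples - 2 degrees of freedom. The function below is a hypothetical sketch of that test using SciPy, not metbit's code.

```python
import numpy as np
from scipy import stats


def pearson_two_sided_p(r, n_samples):
    """Two-sided p-values for Pearson r under the t-test assumption."""
    if n_samples <= 2:
        raise ValueError("need more than 2 samples")
    r = np.clip(np.asarray(r, dtype=np.float64), -1.0, 1.0)
    dof = n_samples - 2
    # t = r * sqrt(dof / (1 - r^2)); |r| == 1 yields an infinite t and p == 0
    with np.errstate(divide="ignore", invalid="ignore"):
        t = r * np.sqrt(dof / (1.0 - r**2))
    return 2.0 * stats.t.sf(np.abs(t), dof)


p = pearson_two_sided_p(np.array([0.0, 0.3, 0.9]), n_samples=80)
print(p)
```

Correlations whose p-value falls below p_value_threshold would then be the ones highlighted in the plot.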

plot(self, spectra: pd.DataFrame, anchor_ppm_value: float)

Compute and return a Plotly figure identical in style to STOCSY().

Compatible with existing downstream code that calls .show() or saves the figure.

Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.analysis.large_scale import ChunkedSTOCSY
>>> ppm = np.linspace(10, 0, 1000)
>>> spectra = pd.DataFrame(np.random.rand(80, 1000), columns=ppm)
>>> stocsy = ChunkedSTOCSY(chunk_size=200)
>>> fig = stocsy.plot(spectra, anchor_ppm_value=3.56)
>>> fig.show()

LargeScaleAlignment

Alignment wrapper that avoids full matrix copies.

Delegates to icoshift_align but uses a single numpy allocation for the output, avoiding the two-copy pattern in the original implementation.

Parameters

chunk_size : int

Spectra processed per batch (for very large sample counts). Currently all spectra are processed together; chunk_size is reserved for future batched support.

Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.analysis.large_scale import LargeScaleAlignment
>>> ppm = np.linspace(10, 0, 500)
>>> spectra = pd.DataFrame(np.random.rand(30, 500), columns=ppm)
>>> aligner = LargeScaleAlignment(chunk_size=500)
>>> windows = [(3.0, 3.5), (5.0, 5.5)]
>>> aligned, info = aligner.align(spectra, ppm, windows)

Methods

__init__(self, chunk_size: int=500)

align(self, spectra: pd.DataFrame, ppm: np.ndarray, windows, reference: str='median', max_shift_ppm: float=0.02)

Align spectra using icoshift with memory-efficient allocation.

Parameters

spectra : pd.DataFrame, shape (n_samples, n_features)

ppm : ndarray, shape (n_features,)

windows : list of (float, float)

PPM regions used as alignment targets.

reference : str

Reference spectrum strategy ('median', 'mean', or integer index).

max_shift_ppm : float

Maximum allowed shift in ppm units.

Returns

aligned_spectra : pd.DataFrame

info : dict

Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.analysis.large_scale import LargeScaleAlignment
>>> ppm = np.linspace(10, 0, 500)
>>> spectra = pd.DataFrame(np.random.rand(30, 500), columns=ppm)
>>> aligner = LargeScaleAlignment(chunk_size=500)
>>> windows = [(3.0, 3.5)]
>>> aligned, info = aligner.align(spectra, ppm, windows, reference="median")
>>> aligned.shape
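Both windows and max_shift_ppm are given in ppm, while the spectra are indexed by point. The sketch below shows one plausible way to translate between the two. The helper names are hypothetical, and nothing here is taken from metbit or icoshift; it assumes a monotonic, evenly spaced ppm axis.

```python
import numpy as np


def window_to_slice(ppm, window):
    """Hypothetical helper: index slice covering one (low, high) ppm window."""
    low, high = sorted(window)
    idx = np.flatnonzero((ppm >= low) & (ppm <= high))
    if idx.size == 0:
        raise ValueError(f"window {window} lies outside the ppm axis")
    # On a monotonic axis the matching indices are contiguous
    return slice(int(idx[0]), int(idx[-1]) + 1)


def shift_in_points(ppm, max_shift_ppm):
    """Hypothetical helper: whole points that fit in a ppm shift limit."""
    step = float(np.mean(np.abs(np.diff(ppm))))  # ppm per point
    return int(np.floor(max_shift_ppm / step))


ppm = np.linspace(10, 0, 1001)  # 0.01 ppm per point, descending
region = window_to_slice(ppm, (3.004, 3.496))
max_points = shift_in_points(ppm, 0.025)
print(region, max_points)
```

A window narrower than the point spacing selects nothing and raises ValueError in this sketch, which is worth checking before aligning.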

Functions

feature_preselection(X: Union[pd.DataFrame, np.ndarray], percentile: float=20.0, method: str='variance', chunk_size: int=100000)

Remove low-information features before modeling.

Scientifically justified: removes spectral bins that carry only instrument noise. The threshold is data-driven (percentile of the distribution) rather than a fixed constant.

Parameters

X : DataFrame or ndarray, shape (n_samples, n_features)

percentile : float

Remove features below this percentile of the score distribution. 20 removes the bottom fifth; 0 keeps everything.

method : str

- 'variance': inter-sample variance (fast, one-pass)
- 'iqr': interquartile range (robust to outliers, slower)

chunk_size : int

Features processed per chunk to bound peak memory.

Returns

X_reduced : same type as X, shape (n_samples, n_kept)

mask : bool ndarray, shape (n_features,). True for kept features.

Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.analysis.large_scale import feature_preselection
>>> ppm = np.linspace(10, 0, 1000)
>>> spectra = pd.DataFrame(np.random.rand(200, 1000), columns=ppm)
>>> X_reduced, mask = feature_preselection(spectra, percentile=20, method="variance")
>>> X_reduced.shape
>>> mask.sum()
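The percentile rule described above can be sketched in plain NumPy. This is a hypothetical illustration of the 'variance' method only, not metbit's implementation. It assumes features scoring at or above the threshold are kept, and it works on ndarrays rather than preserving the DataFrame type.

```python
import numpy as np


def variance_preselect(X, percentile=20.0, chunk_size=100_000):
    """Hypothetical sketch: drop features below a variance percentile."""
    X = np.asarray(X)
    n_features = X.shape[1]
    scores = np.empty(n_features, dtype=np.float64)
    # Score features chunk by chunk so only one column block is reduced at a time
    for start in range(0, n_features, chunk_size):
        stop = min(start + chunk_size, n_features)
        scores[start:stop] = X[:, start:stop].var(axis=0)
    threshold = np.percentile(scores, percentile)
    mask = scores >= threshold  # True for kept features
    return X[:, mask], mask


rng = np.random.default_rng(0)
X = rng.random((50, 1000))
X_reduced, mask = variance_preselect(X, percentile=20, chunk_size=256)
print(X_reduced.shape, int(mask.sum()))
```

With percentile=0 the threshold equals the smallest score, so every feature passes, matching the documented behavior.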

chunked_pearson(matrix: np.ndarray, anchor_index: int, chunk_size: int=50000, out_dtype: type=np.float64)

Pearson correlation between one column and all other columns.

Delegates to the _native dispatch layer, which selects the fastest available backend: GPU > C+OpenMP > multiprocessing > chunked NumPy.

Parameters

matrix : ndarray, shape (n_samples, n_features)

anchor_index : int

chunk_size : int

Feature chunk size used by the CPU fallback paths.

Examples:
>>> import numpy as np
>>> from metbit.analysis.large_scale import chunked_pearson
>>> spectra = np.random.rand(100, 500).astype(np.float64)
>>> r = chunked_pearson(spectra, anchor_index=250, chunk_size=100)
>>> r.shape
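For orientation, the last fallback in that list (chunked NumPy) might look roughly like the sketch below. It is an assumption about the general technique, not the library's code: the function name is hypothetical, and it returns one coefficient per column, including the anchor itself. Only one centered chunk is held in memory at a time.

```python
import numpy as np


def chunked_pearson_sketch(matrix, anchor_index, chunk_size=50_000):
    """Hypothetical sketch: Pearson r of one column against every column."""
    n_samples, n_features = matrix.shape
    anchor = matrix[:, anchor_index].astype(np.float64)
    anchor_c = anchor - anchor.mean()
    anchor_norm = np.sqrt(anchor_c @ anchor_c)
    r = np.empty(n_features, dtype=np.float64)
    for start in range(0, n_features, chunk_size):
        stop = min(start + chunk_size, n_features)
        # astype copies, so centering in place leaves the input untouched
        block = matrix[:, start:stop].astype(np.float64)
        block -= block.mean(axis=0)
        norms = np.sqrt(np.einsum("ij,ij->j", block, block))
        # Constant columns have zero norm and produce NaN
        with np.errstate(divide="ignore", invalid="ignore"):
            r[start:stop] = (anchor_c @ block) / (anchor_norm * norms)
    return r


rng = np.random.default_rng(0)
spectra = rng.random((30, 25))
r = chunked_pearson_sketch(spectra, anchor_index=3, chunk_size=7)
print(r.shape)
```

Peak extra memory here is one float64 block of shape (n_samples, chunk_size), which is the O(n_samples * chunk_size) behavior the chunked approach aims for.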

memory_report(X: Union[pd.DataFrame, np.ndarray])

Print a memory usage report for a given dataset.

Parameters

X : DataFrame or ndarray, shape (n_samples, n_features)

Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.analysis.large_scale import memory_report
>>> X = pd.DataFrame(np.random.rand(500, 10000).astype(np.float32))
>>> memory_report(X)
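The contents of the printed report are not specified here. As a rough cross-check, the in-memory size of a dataset can be measured directly with NumPy and pandas. The helper below is hypothetical and only illustrates those two standard calls.

```python
import numpy as np
import pandas as pd


def dataset_bytes(X):
    """Hypothetical helper: bytes currently held by a DataFrame or ndarray."""
    if isinstance(X, pd.DataFrame):
        # deep=True also counts object columns; index=True includes the index
        return int(X.memory_usage(index=True, deep=True).sum())
    return int(np.asarray(X).nbytes)


arr = np.zeros((500, 100), dtype=np.float32)
frame = pd.DataFrame(arr)
print(dataset_bytes(arr), dataset_bytes(frame))
```

The DataFrame figure is slightly larger than the raw array because the index is counted as well.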

Source

metbit/analysis/large_scale.py at v9.1.0
