Classes
MemoryEstimator
Estimate RAM requirements before loading large datasets.
Examples:
>>> import numpy as np
>>> import metbit
>>> info = metbit.analysis.large_scale.MemoryEstimator.estimate(10000, 50000, np.float32)
>>> print(info["summary"])
>>> metbit.analysis.large_scale.MemoryEstimator.print_estimate(10000, 50000)
Methods
estimate(n_samples: int, n_features: int, dtype: type=np.float64, copies: int=1)
Return a dict with estimated byte counts and a human-readable summary.
Parameters
n_samples : int
n_features : int
dtype : numpy dtype
    Storage dtype. float64 = 8 bytes, float32 = 4 bytes.
copies : int
    Number of simultaneous matrix copies (e.g., 2 for train/test split).
Examples:
>>> import numpy as np
>>> from metbit.analysis.large_scale import MemoryEstimator
>>> info = MemoryEstimator.estimate(5000, 20000, np.float32, copies=2)
>>> info["single_matrix_gb"]
>>> info["recommended_dtype"]
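The arithmetic behind such an estimate can be sketched as follows. This is a minimal illustration, not metbit's implementation: the helper name `estimate_bytes`, the returned keys other than `single_matrix_gb`, and the choice of 1 GB = 1e9 bytes are all assumptions.

```python
import numpy as np

def estimate_bytes(n_samples, n_features, dtype=np.float64, copies=1):
    """Hypothetical helper: bytes needed for dense matrices of this shape."""
    itemsize = np.dtype(dtype).itemsize          # 8 for float64, 4 for float32
    single = n_samples * n_features * itemsize   # one dense matrix
    return {
        "single_matrix_bytes": single,
        "total_bytes": single * copies,          # all simultaneous copies
        "single_matrix_gb": single / 1e9,        # assumes 1 GB = 1e9 bytes
    }

est = estimate_bytes(5000, 20000, np.float32, copies=2)
print(est["single_matrix_bytes"])  # 400000000
print(est["total_bytes"])          # 800000000
```

Halving the dtype width (float64 to float32) halves every figure, which is why the dtype is usually the first thing to check when an estimate exceeds available RAM.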
print_estimate(n_samples: int, n_features: int, dtype: type=np.float64, copies: int=2)
Print a human-readable memory estimate to stdout.
Parameters
n_samples : int
n_features : int
dtype : numpy dtype
copies : int
Examples:
>>> import numpy as np
>>> from metbit.analysis.large_scale import MemoryEstimator
>>> MemoryEstimator.print_estimate(10000, 100000, np.float32, copies=2)
ChunkedSTOCSY
Memory-efficient STOCSY for datasets with large feature counts.
Replaces the full-matrix centered copy in the standard STOCSY with chunked Pearson correlation. Peak memory is O(n_samples * chunk_size) instead of O(n_samples * n_features).
Parameters
chunk_size : int
    Number of features processed per chunk.
    - 50,000 features @ 10,000 samples (float64) ~ 4 GB peak
    - Reduce if RAM is constrained; increase for throughput.
p_value_threshold : float
    Significance threshold for highlighting correlations in the plot.
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.analysis.large_scale import ChunkedSTOCSY
>>> ppm = np.linspace(10, 0, 2000)
>>> spectra = pd.DataFrame(np.random.rand(50, 2000), columns=ppm)
>>> stocsy = ChunkedSTOCSY(chunk_size=500, p_value_threshold=1e-4)
>>> ppm_out, r, p = stocsy.compute(spectra, anchor_ppm_value=3.05)
>>> fig = stocsy.plot(spectra, anchor_ppm_value=3.05)
>>> fig.show()
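Since peak memory scales with n_samples * chunk_size, a chunk size can be derived from a RAM budget. The sketch below assumes that a single chunk buffer dominates peak memory and ignores temporaries; `max_chunk_size` is a hypothetical helper, not part of metbit.

```python
import numpy as np

def max_chunk_size(n_samples, budget_bytes, dtype=np.float64):
    """Hypothetical helper: largest feature chunk that fits in budget_bytes."""
    itemsize = np.dtype(dtype).itemsize
    bytes_per_feature = n_samples * itemsize   # one column of the chunk
    return max(1, budget_bytes // bytes_per_feature)

# A 4 GB budget at 10,000 float64 samples reproduces the figure quoted above.
chunk = max_chunk_size(10_000, 4 * 10**9)
print(chunk)  # 50000
```

In practice it is safer to pick a value well below this bound, because centering and intermediate products need working memory of their own.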
Methods
__init__(self, chunk_size: int=50000, p_value_threshold: float=0.0001)
active_backend()
Return the current compute backend configuration.
Examples:
>>> from metbit.analysis.large_scale import ChunkedSTOCSY
>>> info = ChunkedSTOCSY.active_backend()
>>> info["gpu"]
>>> info["native_c"]
compute(self, spectra: pd.DataFrame, anchor_ppm_value: float)
Compute correlations and two-sided p-values.
Returns
ppm : ndarray (n_features,)
correlations : ndarray (n_features,)
p_values : ndarray (n_features,)
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.analysis.large_scale import ChunkedSTOCSY
>>> ppm = np.linspace(10, 0, 1000)
>>> spectra = pd.DataFrame(np.random.rand(80, 1000), columns=ppm)
>>> stocsy = ChunkedSTOCSY(chunk_size=200)
>>> ppm_out, r, p_vals = stocsy.compute(spectra, anchor_ppm_value=5.0)
>>> r.shape
plot(self, spectra: pd.DataFrame, anchor_ppm_value: float)
Compute and return a Plotly figure identical in style to STOCSY().
Compatible with existing downstream code that calls .show() or saves the figure.
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.analysis.large_scale import ChunkedSTOCSY
>>> ppm = np.linspace(10, 0, 1000)
>>> spectra = pd.DataFrame(np.random.rand(80, 1000), columns=ppm)
>>> stocsy = ChunkedSTOCSY(chunk_size=200)
>>> fig = stocsy.plot(spectra, anchor_ppm_value=3.56)
>>> fig.show()
LargeScaleAlignment
Alignment wrapper that avoids full matrix copies.
Delegates to icoshift_align but uses a single numpy allocation for the output, avoiding the two-copy pattern in the original implementation.
Parameters
chunk_size : int
    Spectra processed per batch (for very large sample counts). Currently all spectra are processed together; chunk_size is reserved for future batched support.
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.analysis.large_scale import LargeScaleAlignment
>>> ppm = np.linspace(10, 0, 500)
>>> spectra = pd.DataFrame(np.random.rand(30, 500), columns=ppm)
>>> aligner = LargeScaleAlignment(chunk_size=500)
>>> windows = [(3.0, 3.5), (5.0, 5.5)]
>>> aligned, info = aligner.align(spectra, ppm, windows)
Methods
__init__(self, chunk_size: int=500)
align(self, spectra: pd.DataFrame, ppm: np.ndarray, windows, reference: str='median', max_shift_ppm: float=0.02)
Align spectra using icoshift with memory-efficient allocation.
Parameters
spectra : pd.DataFrame, shape (n_samples, n_features)
ppm : ndarray, shape (n_features,)
windows : list of (float, float)
    PPM regions used as alignment targets.
reference : str
    Reference spectrum strategy ('median', 'mean', or integer index).
max_shift_ppm : float
    Maximum allowed shift in ppm units.
Returns
aligned_spectra : pd.DataFrame
info : dict
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.analysis.large_scale import LargeScaleAlignment
>>> ppm = np.linspace(10, 0, 500)
>>> spectra = pd.DataFrame(np.random.rand(30, 500), columns=ppm)
>>> aligner = LargeScaleAlignment(chunk_size=500)
>>> windows = [(3.0, 3.5)]
>>> aligned, info = aligner.align(spectra, ppm, windows, reference="median")
>>> aligned.shape
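Both windows and max_shift_ppm are given in ppm units, while alignment operates on data points. The sketch below shows one way such arguments could map onto point indices; it assumes a uniformly spaced ppm axis and is only an illustration, not how icoshift_align or LargeScaleAlignment handles them internally.

```python
import numpy as np

ppm = np.linspace(10, 0, 500)        # descending ppm axis, as in the examples
windows = [(3.0, 3.5), (5.0, 5.5)]
max_shift_ppm = 0.02

# ppm covered by one data point (uniform spacing assumed)
step = abs(ppm[1] - ppm[0])
max_shift_pts = int(round(max_shift_ppm / step))

# Convert each (low, high) ppm window to a half-open index range [start, stop)
window_indices = []
for lo, hi in windows:
    idx = np.flatnonzero((ppm >= lo) & (ppm <= hi))
    window_indices.append((int(idx[0]), int(idx[-1]) + 1))

print(max_shift_pts)   # 1
print(window_indices)
```

On this coarse 500-point axis a 0.02 ppm limit corresponds to about one data point, so the resolution of the ppm axis bounds how finely shifts can be constrained.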
Functions
feature_preselection(X: Union[pd.DataFrame, np.ndarray], percentile: float=20.0, method: str='variance', chunk_size: int=100000)
Remove low-information features before modeling.
Scientifically justified: removes spectral bins that carry only instrument noise. The threshold is data-driven (a percentile of the score distribution) rather than a fixed constant.
Parameters
X : DataFrame or ndarray, shape (n_samples, n_features)
percentile : float
    Remove features below this percentile of the score distribution. 20 removes the bottom fifth; 0 keeps everything.
method : str
    'variance' - inter-sample variance (fast, one-pass)
    'iqr' - interquartile range (robust to outliers, slower)
chunk_size : int
    Features processed per chunk to bound peak memory.
Returns
X_reduced : same type as X, shape (n_samples, n_kept)
mask : bool ndarray, shape (n_features,)
    True for kept features.
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.analysis.large_scale import feature_preselection
>>> ppm = np.linspace(10, 0, 1000)
>>> spectra = pd.DataFrame(np.random.rand(200, 1000), columns=ppm)
>>> X_reduced, mask = feature_preselection(spectra, percentile=20, method="variance")
>>> X_reduced.shape
>>> mask.sum()
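The 'variance' method can be approximated in plain NumPy as below. This is a sketch under stated assumptions rather than the library's code: `variance_preselect` is a hypothetical name, it accepts ndarrays only, and it keeps features whose score is at or above the percentile threshold.

```python
import numpy as np

def variance_preselect(X, percentile=20.0, chunk_size=100_000):
    """Hypothetical sketch: drop features below a variance percentile."""
    X = np.asarray(X)
    n_features = X.shape[1]
    scores = np.empty(n_features, dtype=np.float64)
    # Score features chunk by chunk so temporaries stay small
    for start in range(0, n_features, chunk_size):
        stop = min(start + chunk_size, n_features)
        scores[start:stop] = X[:, start:stop].var(axis=0)
    threshold = np.percentile(scores, percentile)
    mask = scores >= threshold          # True for kept features
    return X[:, mask], mask

rng = np.random.default_rng(0)
X = rng.random((200, 1000))
X[:, :100] *= 0.01                      # simulate 100 low-variance "noise" bins
X_reduced, mask = variance_preselect(X, percentile=20, chunk_size=256)
print(X_reduced.shape)                  # (200, 800)
```

With percentile=0 the threshold equals the minimum score, so every feature is kept, matching the behaviour described for the percentile parameter.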
chunked_pearson(matrix: np.ndarray, anchor_index: int, chunk_size: int=50000, out_dtype: type=np.float64)
Pearson correlation between one column and all other columns.
Delegates to the _native dispatch layer, which selects the fastest available backend: GPU > C+OpenMP > multiprocessing > chunked NumPy.
Parameters
matrix : ndarray, shape (n_samples, n_features)
anchor_index : int
chunk_size : int
    Feature chunk size used by the CPU fallback paths.
Examples:
>>> import numpy as np
>>> from metbit.analysis.large_scale import chunked_pearson
>>> spectra = np.random.rand(100, 500).astype(np.float64)
>>> r = chunked_pearson(spectra, anchor_index=250, chunk_size=100)
>>> r.shape
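The chunked NumPy fallback idea can be illustrated with the sketch below: only one chunk of columns is centered at a time, so no full centered copy of the matrix is ever made. The function `pearson_vs_anchor` is a hypothetical stand-in, not the library's dispatch code, and it does not guard against constant columns (which would yield NaN).

```python
import numpy as np

def pearson_vs_anchor(matrix, anchor_index, chunk_size=50_000):
    """Hypothetical sketch: Pearson r of one column against all columns."""
    n_samples, n_features = matrix.shape
    anchor = matrix[:, anchor_index].astype(np.float64)
    a = anchor - anchor.mean()                  # centered anchor column
    a_norm = np.sqrt(a @ a)
    r = np.empty(n_features, dtype=np.float64)
    for start in range(0, n_features, chunk_size):
        stop = min(start + chunk_size, n_features)
        # astype copies the chunk, so the input matrix is left untouched
        block = matrix[:, start:stop].astype(np.float64)
        block -= block.mean(axis=0)             # center this chunk only
        denom = np.sqrt((block * block).sum(axis=0)) * a_norm
        r[start:stop] = (a @ block) / denom
    return r

rng = np.random.default_rng(0)
spectra = rng.random((100, 500))
r = pearson_vs_anchor(spectra, anchor_index=250, chunk_size=100)
print(r.shape)  # (500,)
```

The result can be checked against `np.corrcoef(spectra, rowvar=False)[250]`, which computes the same values but materialises the full 500 x 500 correlation matrix.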
memory_report(X: Union[pd.DataFrame, np.ndarray])
Print a memory usage report for a given dataset.
Parameters
X : DataFrame or ndarray, shape (n_samples, n_features)
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.analysis.large_scale import memory_report
>>> X = pd.DataFrame(np.random.rand(500, 10000).astype(np.float32))
>>> memory_report(X)