Classes
TrainTestSplit
Simple stratified train/test holdout split with Plotly diagnostics.
Args:
X: Feature matrix (DataFrame or ndarray).
y: Target labels (Series, ndarray, or list).
test_size: Fraction of samples in the test set.
stratify: Whether to stratify by y (keeps class proportions).
random_state: Random seed for reproducibility.
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from metbit.validation.splitter import TrainTestSplit
>>> X = pd.DataFrame(np.random.rand(100, 20))
>>> y = pd.Series(["A"] * 50 + ["B"] * 50)
>>> tts = TrainTestSplit(X, y, test_size=0.2)
>>> X_train, X_test, y_train, y_test = tts.split()
>>> fig = tts.plot_split()
Methods
__init__(self, X: Union[pd.DataFrame, np.ndarray], y: Union[pd.Series, np.ndarray, List[Any]], test_size: float=0.2, stratify: bool=True, random_state: int=42)
split(self)
Perform the split and return (X_train, X_test, y_train, y_test).
Returns: Tuple of (X_train, X_test, y_train, y_test).
Examples: >>> X_train, X_test, y_train, y_test = tts.split()
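To make "keeps class proportions" concrete, here is a minimal standard-library sketch of a stratified holdout split. It is not metbit's implementation: the helper name stratified_holdout and the per-class rounding rule are assumptions made for illustration, and the library may allocate samples differently.

```python
import random
from collections import Counter, defaultdict


def stratified_holdout(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling the test set class by class.

    Hypothetical helper for illustration only; not part of metbit.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: each class contributes round(n * test_size) test samples.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


y = ["A"] * 50 + ["B"] * 50
train_idx, test_idx = stratified_holdout(y, test_size=0.2)
train_counts = Counter(y[i] for i in train_idx)
test_counts = Counter(y[i] for i in test_idx)
print(train_counts)  # 40 samples of each class
print(test_counts)   # 10 samples of each class
```

Both sets keep the original 50/50 class balance, which is what stratification is meant to guarantee.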
get_summary(self)
Return a DataFrame summarising train/test class distributions.
Returns:
DataFrame with columns class, train_n, test_n, train_pct, test_pct.
Examples:
>>> summary = tts.get_summary()
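The sketch below shows one way the documented columns could be derived from the train and test labels. It assumes train_pct and test_pct are percentages of each set's own total on a 0 to 100 scale; the reference does not state the scale, and class_distribution is a hypothetical helper rather than a metbit function.

```python
from collections import Counter


def class_distribution(y_train, y_test):
    """Build one row per class with counts and within-set percentages.

    Hypothetical helper; the 0-100 percentage scale is an assumption.
    """
    train_counts, test_counts = Counter(y_train), Counter(y_test)
    rows = []
    for label in sorted(set(train_counts) | set(test_counts)):
        rows.append({
            "class": label,
            "train_n": train_counts[label],
            "test_n": test_counts[label],
            "train_pct": 100 * train_counts[label] / len(y_train),
            "test_pct": 100 * test_counts[label] / len(y_test),
        })
    return rows


rows = class_distribution(["A"] * 40 + ["B"] * 40, ["A"] * 10 + ["B"] * 10)
for row in rows:
    print(row)
```

When the split is stratified, train_pct and test_pct should be close to each other for every class; a large gap suggests the split was not stratified or a class is too small to divide evenly.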
plot_split(self, fig_height: int=400, fig_width: int=700, font_size: int=13, title: Optional[str]=None)
Bar chart showing class distribution in train and test sets.
Args:
fig_height: Figure height in pixels.
fig_width: Figure width in pixels.
font_size: Base font size.
title: Optional plot title.
Returns: go.Figure
Examples:
>>> fig = tts.plot_split()
>>> fig.show()
CrossValidator
Unified cross-validation with multiple splitting strategies and Plotly output.
Supports any sklearn-compatible estimator.
Args:
estimator: sklearn-compatible classifier.
X: Feature matrix (DataFrame or ndarray).
y: Target labels (Series, ndarray, or list).
cv_strategy: Splitting strategy. One of:
- ``"kfold"`` – K-Fold (shuffled)
- ``"stratified_kfold"`` – Stratified K-Fold (default)
- ``"loo"`` – Leave-One-Out
- ``"leave_p_out"`` – Leave-P-Out
- ``"repeated_kfold"`` – Repeated Stratified K-Fold
- ``"shuffle_split"`` – Monte Carlo (ShuffleSplit)
- ``"group_kfold"`` – Group K-Fold (requires ``groups``)
- ``"time_series"`` – Time Series Split
n_splits: Number of CV folds (ignored for LOO).
scoring: Scoring metric string (sklearn convention). Default ``"balanced_accuracy"``.
random_state: Random seed.
n_jobs: Parallel jobs for cross_val_score (-1 = all cores).
groups: Sample group labels for GroupKFold.
n_repeats: Number of repeats for ``"repeated_kfold"``.
p: P value for ``"leave_p_out"``.
test_size: Test fraction for ``"shuffle_split"``.
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestClassifier
>>> from metbit.validation.splitter import CrossValidator
>>> X = pd.DataFrame(np.random.rand(80, 20))
>>> y = pd.Series(["A"] * 40 + ["B"] * 40)
>>> cv = CrossValidator(RandomForestClassifier(n_estimators=10), X, y)
>>> cv.fit()
>>> fig = cv.plot_scores()
>>> summary = cv.get_summary()
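The strategies differ widely in how many times the estimator is fitted, which matters when choosing between them. The sketch below estimates the number of train/test splits per strategy, assuming each key maps to the scikit-learn splitter of the same name and that n_splits is passed through to the shuffle-split, group, and time-series splitters. The function expected_n_fits is hypothetical and not part of metbit.

```python
from math import comb


def expected_n_fits(strategy, n_samples, n_splits=5, n_repeats=5, p=2):
    """Estimate how many train/test splits a strategy produces.

    Hypothetical helper; assumes scikit-learn splitter semantics.
    """
    if strategy == "loo":
        # Leave-One-Out: one split per sample.
        return n_samples
    if strategy == "leave_p_out":
        # Leave-P-Out: every possible test set of size p.
        return comb(n_samples, p)
    if strategy == "repeated_kfold":
        # Repeated K-Fold: n_splits folds, repeated n_repeats times.
        return n_splits * n_repeats
    if strategy in {"kfold", "stratified_kfold", "shuffle_split",
                    "group_kfold", "time_series"}:
        return n_splits
    raise ValueError(f"Unknown strategy: {strategy!r}")


fits = {
    name: expected_n_fits(name, n_samples=80)
    for name in ("stratified_kfold", "loo", "leave_p_out", "repeated_kfold")
}
print(fits)
```

With 80 samples and p=2, Leave-P-Out already requires 3160 fits, so it is usually practical only for small datasets.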
Methods
__init__(self, estimator: Any, X: Union[pd.DataFrame, np.ndarray], y: Union[pd.Series, np.ndarray, List[Any]], cv_strategy: str='stratified_kfold', n_splits: int=5, scoring: str='balanced_accuracy', random_state: int=42, n_jobs: int=-1, groups: Optional[Union[np.ndarray, List]]=None, n_repeats: int=5, p: int=2, test_size: float=0.2)
fit(self)
Run cross-validation and store per-fold scores.
Returns: self
Examples:
>>> cv.fit()
>>> print(cv.scores_)
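For orientation, the following sketch runs the default configuration (stratified K-Fold, 5 splits, balanced accuracy) directly with scikit-learn. The n_jobs description above mentions cross_val_score, but whether fit() wires the splitter exactly this way is an assumption; the synthetic data and the fixed seeds are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data shaped like the class example: 80 samples, 2 balanced classes.
rng = np.random.default_rng(42)
X = rng.random((80, 20))
y = np.array(["A"] * 40 + ["B"] * 40)

# Assumed equivalent of cv_strategy="stratified_kfold", n_splits=5.
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    RandomForestClassifier(n_estimators=10, random_state=42),
    X,
    y,
    cv=splitter,
    scoring="balanced_accuracy",
    n_jobs=1,  # single process keeps the sketch deterministic and portable
)
print(scores)  # one balanced-accuracy value per fold
```

Because the features here are random noise, the scores should hover around 0.5, the chance level for balanced accuracy with two classes.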
get_scores(self)
Return per-fold scores as a DataFrame.
Returns:
DataFrame with columns fold, score.
Examples:
>>> df = cv.get_scores()
get_summary(self)
Return mean, std, min, max of CV scores.
Returns: Single-row DataFrame with summary statistics.
Examples: >>> summary = cv.get_summary()
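The reference does not say whether std is the population or the sample standard deviation. NumPy's std defaults to the population form (ddof=0) and pandas' std to the sample form (ddof=1), so the two can differ noticeably with only a handful of folds. The sketch below computes both on hypothetical fold scores.

```python
from statistics import mean, pstdev, stdev

# Hypothetical per-fold scores, for illustration only.
scores = [0.80, 0.85, 0.75, 0.90, 0.70]

summary = {
    "mean": mean(scores),
    "std_population": pstdev(scores),  # divides by n (NumPy default)
    "std_sample": stdev(scores),       # divides by n - 1 (pandas default)
    "min": min(scores),
    "max": max(scores),
}
for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```

With five folds the sample standard deviation is about 12% larger than the population one, so it is worth checking which convention applies before comparing numbers across tools.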
plot_scores(self, fig_height: int=420, fig_width: int=700, font_size: int=13, title: Optional[str]=None, color: str='#2563eb')
Bar chart of per-fold CV scores with mean ± std annotation.
Args:
fig_height: Figure height in pixels.
fig_width: Figure width in pixels.
font_size: Base font size.
title: Optional plot title.
color: Bar color.
Returns: go.Figure
Examples:
>>> fig = cv.plot_scores()
>>> fig.show()
plot_score_distribution(self, fig_height: int=400, fig_width: int=500, font_size: int=13, title: Optional[str]=None)
Box + strip plot of CV score distribution.
Args:
fig_height: Figure height in pixels.
fig_width: Figure width in pixels.
font_size: Base font size.
title: Optional plot title.
Returns: go.Figure
Examples:
>>> fig = cv.plot_score_distribution()
>>> fig.show()
plot_learning_curve(self, train_sizes: Optional[np.ndarray]=None, fig_height: int=450, fig_width: int=750, font_size: int=13, title: Optional[str]=None, n_jobs: int=-1)
Learning curve: training size vs. train and CV score.
Args:
train_sizes: Array of training set fractions. Defaults to ``np.linspace(0.1, 1.0, 8)``.
fig_height: Figure height in pixels.
fig_width: Figure width in pixels.
font_size: Base font size.
title: Optional plot title.
n_jobs: Parallel jobs.
Returns: go.Figure
Examples:
>>> fig = cv.plot_learning_curve()
>>> fig.show()
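To show what the default fractions mean in samples, the sketch below converts them to absolute training-set sizes. It assumes the method follows scikit-learn's learning_curve convention, where fractions are relative to the largest training set the CV scheme allows and are truncated to integers. The 80-sample, 5-fold setup is taken from the class example, and linspace is a small stand-in for np.linspace.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (np.linspace stand-in)."""
    step = (stop - start) / (num - 1)
    values = [start + i * step for i in range(num - 1)]
    values.append(stop)  # pin the endpoint to avoid floating-point drift
    return values


n_samples, n_splits = 80, 5

# With 5-fold CV, each training fold holds 4/5 of the data.
max_train = n_samples - n_samples // n_splits

fractions = linspace(0.1, 1.0, 8)
# Assumption: fractions are truncated to whole samples, as scikit-learn does.
train_sizes_abs = [int(f * max_train) for f in fractions]
print(train_sizes_abs)  # [6, 14, 22, 31, 39, 47, 55, 64]
```

The smallest default size is only 6 samples here, so with stratified folds and several classes the first points of the curve can be noisy or fail if a class is missing from the subset.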
compare_strategies(self, strategies: Optional[List[str]]=None, fig_height: int=450, fig_width: int=900, font_size: int=13, title: Optional[str]=None)
Run and compare multiple CV strategies side-by-side.
Runs the same estimator under each strategy (using this instance's n_splits and random_state) and returns a grouped box plot.
Args:
strategies: List of strategy names to compare. Defaults to ``["kfold", "stratified_kfold", "shuffle_split", "repeated_kfold"]``.
fig_height: Figure height in pixels.
fig_width: Figure width in pixels.
font_size: Base font size.
title: Optional plot title.
Returns: go.Figure
Examples:
>>> fig = cv.compare_strategies()
>>> fig.show()
Functions
available_cv_strategies()
Return a DataFrame listing all supported CV strategies.
Returns:
DataFrame with columns key, name.
Examples:
>>> from metbit.validation.splitter import available_cv_strategies
>>> print(available_cv_strategies())