time_split

Time-based k-fold validation splits for heterogeneous data.

Functions

`log_split_progress`(splits, *[, logger, ...])	Log iteration progress.
`plot`(schedule, *[, before, after, step, ...])	Fold visualization.
`split`(schedule, *[, before, after, step, ...])	Create time-based cross-validation splits.

log_split_progress(splits: Sequence[DatetimeSplitBounds], *, logger: Logger | LoggerAdapter[Any] | str = 'time_split', start_level: int = 20, end_level: int = 20, extra: dict[str, Any] | None = None, get_metrics: Callable[[Timestamp], MetricsType] | None = None) → LogSplitProgress

Log iteration progress.

Parameters:

splits – Splits to iterate over.
logger – Logger or logger name to use.
start_level – Log level to use for the fold-begin message.
end_level – Log level to use for the fold-end message.
extra – Immutable, user-defined extra-arguments to use when logging, merged with progress-related extras (see SplitProgressExtras).
get_metrics – A callable (training_date) -> fold_metrics | str (see training_date). If given, metrics are added to the fold-end message. The metrics are formatted using the default formatter unless FORMAT_METRICS is set. If the callback returns a str, the default formatter assumes that the metrics are pre-formatted and appends them to the fold-end message as-is.

Returns:

A LogSplitProgress object.

Examples

Configuring the logger name and fold-begin message log level.

>>> import logging
>>> from time_split import split, log_split_progress
>>> schedule = ["2023-08-16", "2023-08-17 12", "2023-08-19"]
>>> tracked_splits = log_split_progress(
...     split(schedule),
...     logger="progress",
...     start_level=logging.DEBUG,
... )
>>> list(tracked_splits)
[progress:DEBUG] Begin fold 1/2: '2023-08-09' <= [schedule: '2023-08-16' (Wednesday)] < '2023-08-17 12:00:00'.
[progress:INFO] Finished fold 1/2 [schedule: '2023-08-16' (Wednesday)] after 5m 18s.
[progress:DEBUG] Begin fold 2/2: '2023-08-10 12:00:00' <= [schedule: '2023-08-17 12:00:00' (Thursday)] < '2023-08-19'.
[progress:INFO] Finished fold 2/2 [schedule: '2023-08-17 12:00:00' (Thursday)] after 4m 3s.

Using the get_metrics callback argument.

>>> metrics = {
...     "2023-08-16 00:00:00": {"rmse": {"train": 0.11, "test": 0.5}},
...     "2023-08-17 12:00:00": {"rmse": {"test": 0.5, "future": 20.19}},
... }
>>> tracked_splits = log_split_progress(
...     split(schedule),
...     get_metrics=lambda key: metrics[str(key)],
... )
>>> list(tracked_splits)  
[time_split:INFO] Begin fold 1/2: '2023-08-09' <= [schedule: '2023-08-16' (Wednesday)] < '2023-08-17 12:00:00'.
[time_split:INFO] Finished fold 1/2 [schedule: '2023-08-16' (Wednesday)] after 5m 18s. Fold metrics:
rmse.train   0.11
rmse.test     0.5
[time_split:INFO] Begin fold 2/2: '2023-08-10 12:00:00' <= [schedule: '2023-08-17 12:00:00' (Thursday)] < '2023-08-19'.
[time_split:INFO] Finished fold 2/2 [schedule: '2023-08-17 12:00:00' (Thursday)] after 4m 3s. Fold metrics:
rmse.test       0.5
rmse.future   20.19

Formatting was done using the default formatter, since the FORMAT_METRICS setting is None.
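
The dotted names in the output above (rmse.train, rmse.test) suggest that nested metric dictionaries are flattened before they are printed. A minimal sketch of that idea follows. flatten_metrics and format_metrics are hypothetical helpers written for illustration, not the library's default formatter, and the column alignment of the real output is not reproduced exactly.

```python
from typing import Any


def flatten_metrics(metrics: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level with dot-separated keys (hypothetical helper)."""
    flat: dict[str, Any] = {}
    for key, value in metrics.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested metric groups, e.g. {"rmse": {"train": ...}}.
            flat.update(flatten_metrics(value, prefix=name))
        else:
            flat[name] = value
    return flat


def format_metrics(metrics: dict[str, Any]) -> str:
    """Render one left-aligned 'name value' line per flattened metric (hypothetical helper)."""
    flat = flatten_metrics(metrics)
    width = max(len(name) for name in flat)
    return "\n".join(f"{name:<{width}} {value}" for name, value in flat.items())


fold_metrics = {"rmse": {"train": 0.11, "test": 0.5}}
print(format_metrics(fold_metrics))
```

Because a str returned by get_metrics is appended to the fold-end message as-is, a helper like this could be wrapped in the callback to control the layout without touching FORMAT_METRICS.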

plot(schedule, *[, before, after, step, ...])

Fold visualization.

This function plots the folds and in-fold splits that would be made by passing the same arguments to the split()-function.

Parameters:

schedule – A DatetimeIterable, pandas offset alias, or cron expression.
before – Range before schedule timestamps. Either a pandas offset alias, an integer (schedule-based offsets), or ‘all’ (requires available data).
after – Range after schedule timestamps. Either a pandas offset alias, an integer (schedule-based offsets), or ‘all’ (requires available data).
step – Select a subset of folds, preferring folds later in the schedule.
n_splits – Maximum number of folds, preferring folds later in the schedule.
available – Binds schedule to a range. If bar_labels is given but is not a list, this data will be used to compute fold sizes.
expand_limits – A pandas offset alias used to expand available data to its likely “true” limits. Pass False to disable.
filter – A callable (start, mid, end) -> bool applied to each fold. Strings are converted using get_by_full_name().
ignore_filters – Set to ignore filtering parameters (e.g. step and filter).
bar_labels – Labels to draw on the bars. A string is interpreted as a time unit (see Offset aliases for valid frequency strings), and the bars show the number of units contained. Pass ‘rows’ to count the number of elements in data (if given). To write custom bar labels, pass a list [(data_label, future_data_label), ...] with one tuple for each fold. This may be used to write metric values per data set after cross-validation.
show_removed – If True, splits removed by n_splits or step are included in the figure.
row_count_bin – A pandas offset alias. If given, show normalized row count per row_count_bin in the background. Pass pandas.Series to use pre-computed row counts.
ax – Axis to use for plotting. If None, create new axes.

For more information about the schedule, before/after and expand_limits-arguments, see the User guide.
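
The step and n_splits parameters above both prefer folds later in the schedule, and show_removed controls whether the folds they drop still appear in the figure. One plausible reading of "preferring later folds" is sketched below: stride through the folds from the end, then keep at most the last n_splits. select_folds is a hypothetical helper, and the library's actual selection order is an assumption here.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def select_folds(
    folds: Sequence[T], step: int = 1, n_splits: Optional[int] = None
) -> list[T]:
    """Keep every `step`-th fold counted from the end, then at most the last `n_splits`."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    if n_splits is not None and n_splits < 1:
        raise ValueError("n_splits must be a positive integer")
    # Reverse, stride, and reverse again so that the final fold always survives.
    kept = list(folds)[::-1][::step][::-1]
    if n_splits is not None:
        kept = kept[-n_splits:]
    return kept


folds = ["fold-1", "fold-2", "fold-3", "fold-4", "fold-5"]
print(select_folds(folds, step=2))      # ['fold-1', 'fold-3', 'fold-5']
print(select_folds(folds, n_splits=2))  # ['fold-4', 'fold-5']
```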

Returns:

Matplotlib axes.

Raises:

ValueError – For invalid plot/split argument combinations.

Examples

Cron schedule, keeping all data before the schedule.

Filters: minimum before='all' size.

List-schedule, without available data.

Removing folds with n_splits. Schedule-based before and after-data.

Plotting metrics per fold and data set.

Fold sampling using the step-argument.

Timedelta-based schedule and after arguments.

split(schedule, *[, before, after, step, ...])

Create time-based cross-validation splits.

To visualize the folds, pass the same arguments to the plot()-function.

Parameters:

schedule – A DatetimeIterable, pandas offset alias, or cron expression.
before – Range before schedule timestamps. Either a pandas offset alias, an integer (schedule-based offsets), or ‘all’ (requires available data).
after – Range after schedule timestamps. Either a pandas offset alias, an integer (schedule-based offsets), or ‘all’ (requires available data).
step – Select a subset of folds, preferring folds later in the schedule.
n_splits – Maximum number of folds, preferring folds later in the schedule.
available – Binds schedule to a range. Passing a tuple (min, max) is enough.
expand_limits – A pandas offset alias used to expand available data to its likely “true” limits. Pass False to disable.
filter – A callable (start, mid, end) -> bool applied to each fold. Strings are converted using get_by_full_name().
ignore_filters – Set to ignore filtering parameters (e.g. step and filter).
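
The filter parameter above takes a callable (start, mid, end) -> bool. The sketch below shows what such a callable might look like: it keeps a fold only when the range before the schedule timestamp spans at least seven days. It uses plain datetime objects so that it runs without the library (pandas timestamps subclass datetime, so the same arithmetic applies). The function name and the second fold are made up for illustration.

```python
from datetime import datetime, timedelta


def at_least_one_week_before(start: datetime, mid: datetime, end: datetime) -> bool:
    """Keep a fold only if its 'before' range covers seven days or more."""
    return mid - start >= timedelta(days=7)


# (start, mid, end) tuples; the second fold is hypothetical and has a short 'before' range.
folds = [
    (datetime(2023, 8, 9), datetime(2023, 8, 16), datetime(2023, 8, 17, 12)),
    (datetime(2023, 8, 14), datetime(2023, 8, 17, 12), datetime(2023, 8, 19)),
]

# Apply the predicate by hand, the way a fold filter would be applied to each fold.
kept = [fold for fold in folds if at_least_one_week_before(*fold)]
print(len(kept))
```

A predicate like this is mostly useful together with before='all', where the size of the range before each schedule timestamp depends on the available data.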

For more information about the schedule, before/after and expand_limits-arguments, see the User guide.
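
The fold bounds logged in the log_split_progress examples on this page are consistent with a seven-day before range and an after range of one schedule step. The sketch below reproduces those bounds using only the standard library. It illustrates the difference between a duration-based and a schedule-based (integer) offset under that assumption. It is not the library's implementation, and it does not cover 'all', cron schedules, or available data.

```python
from datetime import datetime, timedelta

schedule = [datetime(2023, 8, 16), datetime(2023, 8, 17, 12), datetime(2023, 8, 19)]

before = timedelta(days=7)  # assumed duration-based 'before' range
after = 1                   # assumed schedule-based 'after' range, counted in schedule steps

folds = []
for i, mid in enumerate(schedule):
    if i + after >= len(schedule):
        break  # no later schedule timestamp left to bound the 'after' range
    start = mid - before         # duration-based: a fixed time span before mid
    end = schedule[i + after]    # schedule-based: the timestamp `after` steps ahead
    folds.append((start, mid, end))

for start, mid, end in folds:
    print(start, "<=", mid, "<", end)
```

Under this reading, the last schedule timestamp produces no fold of its own because there is no later timestamp to close its after range, which matches the two folds reported for a three-element schedule.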

Returns:

A list of tuples [(start, mid, end), ...].
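
Each returned (start, mid, end) tuple can be used to partition timestamped observations into data and future data. The sketch below does this by hand for the first fold from the logging example. Treating both ranges as half-open intervals (start <= t < mid and mid <= t < end) is an assumption based on the logged bounds, and the observation timestamps are made up.

```python
from datetime import datetime

# Bounds of fold 1/2 from the logging example.
start, mid, end = datetime(2023, 8, 9), datetime(2023, 8, 16), datetime(2023, 8, 17, 12)

# Hypothetical observation timestamps.
timestamps = [
    datetime(2023, 8, 8, 23),       # before start: in neither set
    datetime(2023, 8, 10),
    datetime(2023, 8, 15, 23, 59),
    datetime(2023, 8, 16),          # equal to mid: counted as future data
    datetime(2023, 8, 17, 11),
    datetime(2023, 8, 17, 12),      # equal to end: in neither set
]

data = [t for t in timestamps if start <= t < mid]        # e.g. training data
future_data = [t for t in timestamps if mid <= t < end]   # e.g. evaluation data

print(len(data), len(future_data))
```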

Examples

Cron schedule, keeping all data before the schedule.

Filters: minimum before='all' size.

List-schedule, without available data.

Removing folds with n_splits. Schedule-based before and after-data.

Plotting metrics per fold and data set.

Fold sampling using the step-argument.

Timedelta-based schedule and after arguments.

Modules

`app`	Supporting functions for the Streamlit companion app.
`cli`	CLI entrypoint.
`integration`	Convenience functions and classes for common libraries.
`settings`	Global settings for the splitting logic.
`support`	Supporting functions.
`types`	Types related to splitting data.