preprocessing

boxcox(method='mle')

Applies the Box-Cox transformation to numeric columns in a panel DataFrame.

Parameters:

Name Type Description Default
method str

The method used to determine the lambda parameter of the Box-Cox transformation.

Supported methods:

  • mle: maximum likelihood estimation
  • pearsonr: Pearson correlation coefficient
'mle'
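
For orientation, the transformation itself can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `boxcox_transform` is hypothetical, and `lam` is passed in as a fixed value, whereas the transformer estimates it with the chosen `method`.

```python
import math


def boxcox_transform(values, lam):
    """Apply the Box-Cox formula to strictly positive values for a fixed lambda.

    Hypothetical helper for illustration only; lambda is given, not estimated.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # Limiting case of (v**lam - 1) / lam as lam approaches 0.
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


transformed = boxcox_transform([1.0, 2.0, 4.0], lam=0.5)
```

A value of `lam=1` only shifts the data by one, and `lam=0` is the log transform.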

coerce_dtypes(schema)

Coerces the column datatypes of a DataFrame using the provided schema.

Parameters:

Name Type Description Default
schema Mapping[str, DataType]

A dictionary-like object mapping column names to the desired data types.

required

deseasonalize_fourier(sp, K, robust=False)

Removes seasonality via residualized regression with Fourier terms.

Parameters:

Name Type Description Default
sp int

Seasonal period.

required
K int

Maximum order(s) of Fourier terms. Must be less than sp.

required

Note: part of this transformer uses sklearn under the hood, so it is neither pure Polars nor lazy.
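
The Fourier terms used as regressors can be sketched as below. This only builds the design matrix; the regression and residual step are omitted. The helper name `fourier_terms`, the zero-based time index, and the column order are assumptions made for the example.

```python
import math


def fourier_terms(n, sp, K):
    """Return n rows of [sin_1, cos_1, ..., sin_K, cos_K] for seasonal period sp.

    Hypothetical helper: time is indexed 0..n-1, which is an assumption.
    """
    if not 0 < K < sp:
        raise ValueError("K must be positive and less than sp")
    rows = []
    for t in range(n):
        row = []
        for k in range(1, K + 1):
            angle = 2.0 * math.pi * k * t / sp
            row.extend([math.sin(angle), math.cos(angle)])
        rows.append(row)
    return rows


# Two harmonics for monthly data with a yearly cycle: 4 columns per row.
terms = fourier_terms(n=24, sp=12, K=2)
```

Each pair of columns repeats every `sp / k` steps, so regressing a series on them captures periodic structure up to order `K`.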

detrend(freq, method='linear')

Removes mean or linear trend from numeric columns in a panel DataFrame.

Parameters:

Name Type Description Default
freq str

Offset alias supported by Polars.

required
method str

If mean, subtracts mean from each time-series. If linear, subtracts line of best-fit (via OLS) from each time-series. Defaults to linear.

'linear'
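
The two methods can be illustrated on a single series in plain Python. This is a sketch under simplifying assumptions: the helper name `detrend_series` is hypothetical, observations are taken to be evenly spaced (so the time index 0, 1, 2, ... stands in for `freq`), and there are no nulls.

```python
def detrend_series(values, method="linear"):
    """Subtract the mean or the OLS line of best fit from one series.

    Hypothetical helper; assumes evenly spaced observations and no nulls.
    """
    n = len(values)
    mean_y = sum(values) / n
    if method == "mean":
        return [y - mean_y for y in values]
    if method != "linear":
        raise ValueError("method must be 'mean' or 'linear'")
    mean_t = (n - 1) / 2.0
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_t
    # The residuals of the fitted line are the detrended series.
    return [y - (intercept + slope * t) for t, y in enumerate(values)]


residuals = detrend_series([1.0, 3.0, 5.0, 7.0])
```

For a perfectly linear input such as the one above, the residuals are all zero up to floating-point error.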

diff(order, sp=1, fill_strategy=None)

Differences time-series in panel data by the given order and seasonal period.

Parameters:

Name Type Description Default
order int

The order to difference.

required
sp int

Seasonal periodicity.

1
fill_strategy Optional[str]

Strategy used to fill nulls. Nulls are not filled if None. Supported strategies include: ["backward", "forward", "mean", "zero"].

None
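
A minimal sketch of differencing on one series follows. It assumes that `order` is the number of times the lag-`sp` difference is applied, which is one plausible reading of the parameters above rather than confirmed behavior; the helper name `difference` is hypothetical, and `None` stands in for null.

```python
def difference(values, order, sp=1):
    """Apply the lag-sp difference `order` times; leading gaps become None.

    Hypothetical helper; the meaning of `order` here is an assumption.
    """
    out = list(values)
    for _ in range(order):
        out = [
            None
            if t < sp or out[t] is None or out[t - sp] is None
            else out[t] - out[t - sp]
            for t in range(len(out))
        ]
    return out


first = difference([1, 4, 9, 16], order=1)
second = difference([1, 4, 9, 16], order=2)
seasonal = difference([1, 2, 3, 4, 5, 6], order=1, sp=2)
```

Each pass introduces `sp` more leading nulls, which is what `fill_strategy` would then address.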

fractional_diff(d, min_weight=None, window_size=None)

Compute the fractional differential of a time series.

This technique is described in Advances in Financial Machine Learning by Marcos Lopez de Prado (2018).

For feature creation purposes, it is suggested to use the minimum value of d that makes the time series stationary. This can be found by running the Augmented Dickey-Fuller test on the differenced series for different values of d and selecting the smallest d for which the series is stationary.

Parameters:

Name Type Description Default
d float

The fractional order of the differencing operator.

required
min_weight float

The minimum weight to use for calculations. If specified, the window size is derived from this value, so window_size is not needed.

None
window_size int

The window size of the fractional differencing operator. If specified, the minimum weight is not needed.

None
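
The relationship between `d`, `min_weight`, and `window_size` can be sketched with the standard binomial-series weights for the operator (1 - B)^d. This is an illustration, not the library's code: the helper names are hypothetical, and requiring exactly one of `min_weight` or `window_size` is an assumption based on the descriptions above.

```python
def frac_diff_weights(d, min_weight=None, window_size=None):
    """Weights w_0=1, w_k = -w_{k-1} * (d - k + 1) / k, truncated by one rule.

    Hypothetical helper; exactly one truncation argument is assumed.
    """
    if (min_weight is None) == (window_size is None):
        raise ValueError("Specify exactly one of min_weight or window_size")
    if min_weight is not None and min_weight <= 0:
        raise ValueError("min_weight must be positive")
    weights = [1.0]
    k = 1
    while window_size is None or len(weights) < window_size:
        w = -weights[-1] * (d - k + 1) / k
        if min_weight is not None and abs(w) < min_weight:
            break  # remaining weights are too small to matter
        weights.append(w)
        k += 1
    return weights


def frac_diff(values, weights):
    """Weighted sum over the trailing window; None until the window is full."""
    n_w = len(weights)
    return [
        None
        if t < n_w - 1
        else sum(w * values[t - k] for k, w in enumerate(weights))
        for t in range(len(values))
    ]


weights = frac_diff_weights(0.5, window_size=3)
result = frac_diff([1.0, 3.0, 6.0, 10.0], weights)
```

With `d=1` the weights reduce to `[1.0, -1.0]`, which is ordinary first differencing.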

impute(method)

Performs missing value imputation on numeric columns of a DataFrame grouped by entity.

Parameters:

Name Type Description Default
method Union[str, int, float]

The imputation method to use.

Supported methods are:

  • 'mean': Replace missing values with the mean of the corresponding column.
  • 'median': Replace missing values with the median of the corresponding column.
  • 'fill': Replace missing values with the mean for float columns and the median for integer columns.
  • 'ffill': Forward fill missing values.
  • 'bfill': Backward fill missing values.
  • 'interpolate': Interpolate missing values using linear interpolation.
  • int or float: Replace missing values with the specified constant.
required
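
A few of these methods can be sketched on a single series in plain Python. The helper name `impute_series` is hypothetical, `None` stands in for a missing value, and only the 'mean', 'ffill', and constant cases are shown.

```python
def impute_series(values, method):
    """Fill None entries in one series; covers only a subset of the methods.

    Hypothetical helper for illustration, not the library's implementation.
    """
    if isinstance(method, (int, float)):
        # Constant fill.
        return [method if v is None else v for v in values]
    if method == "mean":
        observed = [v for v in values if v is not None]
        if not observed:
            return list(values)  # nothing to compute a mean from
        fill = sum(observed) / len(observed)
        return [fill if v is None else v for v in values]
    if method == "ffill":
        out, last = [], None
        for v in values:
            if v is not None:
                last = v
            out.append(last)  # leading gaps stay None
        return out
    raise ValueError(f"unsupported method in this sketch: {method!r}")


filled = impute_series([1.0, None, 3.0], "mean")
```

Note that forward fill cannot fill gaps at the start of a series, since there is no earlier value to carry forward.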

lag(lags, is_sorted=False)

Applies lag transformation to a LazyFrame. The time series is assumed to have no null values.

Parameters:

Name Type Description Default
lags List[int]

A list of lag values to apply.

required
is_sorted bool

If True, the data is assumed to be already sorted by the entity and time columns, so it is not sorted again, which can save some time.

False
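
What a lag transformation produces for one already-sorted series can be sketched as follows. The helper name `lag_features` and the `lag_<k>` column names are assumptions for the example, and `None` marks positions with no earlier observation.

```python
def lag_features(values, lags):
    """Return one shifted copy of the series per lag, padded with None.

    Hypothetical helper; output naming is an assumption.
    """
    n = len(values)
    return {
        f"lag_{k}": [None] * min(k, n) + list(values[: max(n - k, 0)])
        for k in lags
    }


features = lag_features([1, 2, 3], lags=[1, 2])
```

Each lagged column has the same length as the input, so the first `k` rows of `lag_k` carry no information.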

one_hot_encode(drop_first=False)

Encode categorical features as a one-hot numeric array.

Parameters:

Name Type Description Default
drop_first bool

Drop the first one hot feature.

False

Raises:

Type Description
ValueError

if X passed into transform_new contains unknown categories.

reindex(drop_duplicates=False)

Reindexes the entity and time columns to have every possible combination of (entity, time).

Parameters:

Name Type Description Default
drop_duplicates bool

Defaults to False. If True, duplicates are dropped before reindexing.

False
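
The idea of filling in every (entity, time) combination can be sketched with a cross product. This illustration assumes the pairs are already unique (so `drop_duplicates` is not modeled), represents rows as `(entity, time, value)` tuples, and uses the hypothetical name `reindex_panel`.

```python
from itertools import product


def reindex_panel(rows):
    """Expand rows to the full entity x time grid; missing values become None.

    Hypothetical helper; assumes (entity, time) pairs are unique.
    """
    lookup = {(entity, time): value for entity, time, value in rows}
    entities = sorted({entity for entity, _, _ in rows})
    times = sorted({time for _, time, _ in rows})
    return [
        (entity, time, lookup.get((entity, time)))
        for entity, time in product(entities, times)
    ]


full = reindex_panel([("a", 1, 10), ("a", 2, 11), ("b", 2, 20)])
```

In this example entity "b" gains a row for time 1 with a null value, which a later imputation step could fill.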

resample(freq, agg_method, impute_method)

Resamples and transforms a DataFrame using the specified frequency, aggregation method, and imputation method.

Parameters:

Name Type Description Default
freq str

Offset alias supported by Polars.

required
agg_method str

The aggregation method to use for resampling. Supported values are 'sum', 'mean', and 'median'.

required
impute_method Union[str, int, float]

The method used for imputing missing values. If a string, supported values are 'ffill' (forward fill) and 'bfill' (backward fill). If an int or float, missing values will be filled with the provided value.

required

roll(window_sizes, stats, freq, fill_strategy=None)

Performs rolling window calculations on specified columns of a DataFrame.

Parameters:

Name Type Description Default
window_sizes List[int]

A list of integers representing the window sizes for the rolling calculations.

required
stats List[Literal['mean', 'min', 'max', 'mlm', 'sum', 'std', 'cv']]

A list of statistical measures to calculate for each rolling window.

Supported values are:

  • 'mean' for mean
  • 'min' for minimum
  • 'max' for maximum
  • 'mlm' for maximum minus minimum
  • 'sum' for sum
  • 'std' for standard deviation
  • 'cv' for coefficient of variation
required
freq str

Offset alias supported by Polars.

required
fill_strategy Optional[str]

Strategy used to fill nulls. Nulls are not filled if None. Supported strategies include: ["backward", "forward", "mean", "zero"].

None
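
The supported statistics can be illustrated on one series with the standard library. Several details here are assumptions rather than documented behavior: the window ends at the current observation, 'std' is the sample standard deviation, and positions without a full window are null. The helper name `rolling_stat` is hypothetical.

```python
import statistics

# Each statistic maps a window (list of numbers) to a single value.
STAT_FUNCS = {
    "mean": statistics.fmean,
    "min": min,
    "max": max,
    "mlm": lambda w: max(w) - min(w),
    "sum": sum,
    "std": statistics.stdev,  # sample standard deviation (an assumption)
    "cv": lambda w: statistics.stdev(w) / statistics.fmean(w),
}


def rolling_stat(values, window_size, stat):
    """Compute one rolling statistic; None until a full window is available.

    Hypothetical helper. 'std' and 'cv' need window_size >= 2, and 'cv'
    fails on windows whose mean is zero.
    """
    func = STAT_FUNCS[stat]
    return [
        None
        if t + 1 < window_size
        else func(values[t + 1 - window_size : t + 1])
        for t in range(len(values))
    ]


spread = rolling_stat([1.0, 3.0, 2.0, 5.0], window_size=2, stat="mlm")
```

Calling the helper once per combination of window size and statistic mirrors how `window_sizes` and `stats` multiply into output columns.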

scale(use_mean=True, use_std=True, rescale_bool=False)

Performs scaling and rescaling operations on the numeric columns of a DataFrame.

Parameters:

Name Type Description Default
use_mean bool

Whether to subtract the mean from the numeric columns. Defaults to True.

True
use_std bool

Whether to divide the numeric columns by the standard deviation. Defaults to True.

True
rescale_bool bool

Whether to rescale boolean columns to the range [-1, 1]. Defaults to False.

False
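
How `use_mean` and `use_std` combine can be sketched for a single numeric column. The helper name `scale_series` is hypothetical, the sample standard deviation is an assumption, and boolean rescaling is not shown.

```python
import statistics


def scale_series(values, use_mean=True, use_std=True):
    """Optionally center by the mean and divide by the standard deviation.

    Hypothetical helper; needs at least two values when use_std is True.
    """
    mean = statistics.fmean(values) if use_mean else 0.0
    std = statistics.stdev(values) if use_std else 1.0
    return [(v - mean) / std for v in values]


standardized = scale_series([1.0, 2.0, 3.0])
centered_only = scale_series([1.0, 2.0, 3.0], use_std=False)
```

With both flags set to False the values pass through unchanged.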

time_to_arange(eager=False)

Coerces time column into arange per entity.

Assumes even-spaced time-series and homogeneous start dates.

trim(direction='both')

Trims time-series in panel to have the same start or end dates as the shortest time-series.

Parameters:

Name Type Description Default
direction Literal['both', 'left', 'right']

Defaults to "both". If "left", trims from the start date of the shortest time-series; if "right", trims up to the end date of the shortest time-series; if "both", trims to the range between the start and end dates of the shortest time-series.

'both'
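
The three directions can be sketched on a small panel held as a dictionary. This follows the description above literally, taking the start and end dates from the shortest series; the helper name `trim_panel`, the data layout, and the tie-breaking when several series share the shortest length are assumptions.

```python
def trim_panel(panel, direction="both"):
    """Trim every series to the start and/or end of the shortest series.

    Hypothetical helper. `panel` maps entity -> list of (time, value)
    tuples sorted by time; every series is assumed to be non-empty.
    """
    if direction not in ("both", "left", "right"):
        raise ValueError("direction must be 'both', 'left', or 'right'")
    shortest = min(panel.values(), key=len)
    start, end = shortest[0][0], shortest[-1][0]
    trim_left = direction in ("both", "left")
    trim_right = direction in ("both", "right")
    return {
        entity: [
            (t, v)
            for t, v in series
            if (not trim_left or t >= start) and (not trim_right or t <= end)
        ]
        for entity, series in panel.items()
    }


panel = {
    "a": [(1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0)],
    "b": [(2, 5.0), (3, 6.0)],
}
trimmed = trim_panel(panel, direction="both")
```

Here series "a" is cut down to times 2 and 3, matching the span of the shorter series "b".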

yeojohnson(brack=(-2, 2))

Applies the Yeo-Johnson transformation to numeric columns in a panel DataFrame.

Parameters:

Name Type Description Default
brack 2-tuple

The starting interval for a downhill bracket search with scipy.optimize.brent. In most cases this is not critical; the final result is allowed to be outside this bracket.

(-2, 2)
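
For reference, the Yeo-Johnson formula for a fixed lambda can be sketched in plain Python. The function name `yeojohnson_transform` is hypothetical and `lam` is supplied directly; the search for lambda that `brack` controls is not shown.

```python
import math


def yeojohnson_transform(values, lam):
    """Apply the Yeo-Johnson formula for a fixed lambda.

    Hypothetical helper for illustration; lambda is given, not estimated.
    """
    out = []
    for y in values:
        if y >= 0:
            if lam == 0:
                out.append(math.log1p(y))
            else:
                out.append(((y + 1.0) ** lam - 1.0) / lam)
        else:
            if lam == 2:
                out.append(-math.log1p(-y))
            else:
                out.append(-(((1.0 - y) ** (2.0 - lam)) - 1.0) / (2.0 - lam))
    return out


unchanged = yeojohnson_transform([-2.0, 0.0, 3.0], lam=1.0)
```

Unlike Box-Cox, the formula accepts zero and negative inputs, and `lam=1` leaves the data unchanged.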