Preprocess, tune, train, and test supervised learning models in a single call using nested resampling.

Usage

train(
  x,
  dat_validation = NULL,
  dat_test = NULL,
  weights = NULL,
  algorithm = NULL,
  preprocessor_config = NULL,
  hyperparameters = NULL,
  tuner_config = NULL,
  outer_resampling_config = NULL,
  execution_config = setup_ExecutionConfig(),
  question = NULL,
  outdir = NULL,
  verbosity = 1L,
  ...
)

Arguments

x

tabular data, i.e. data.frame, data.table, or tbl_df (tibble): Training set data.

dat_validation

tabular data: Validation set data.

dat_test

tabular data: Test set data.

weights

Optional vector of case weights.

algorithm

Character: Algorithm to use. Can be left NULL if hyperparameters is defined.

preprocessor_config

PreprocessorConfig object or NULL: Set up using setup_Preprocessor.

hyperparameters

Hyperparameters object: Set up using one of the setup_* functions.

tuner_config

TunerConfig object: Set up using setup_GridSearch.

outer_resampling_config

ResamplerConfig object or NULL: Set up using setup_Resampler. Defines the outer resampling method, i.e. how the data are split into training and test sets to assess model performance. If NULL, no outer resampling is performed; in that case, consider supplying dat_test to assess performance on a single test set.

execution_config

ExecutionConfig object: Set up using setup_ExecutionConfig. Sets the backend ("future", "mirai", or "none"), the number of workers, and the future plan when backend = "future".

question

Optional character string defining the question that the model is trying to answer.

outdir

Character, optional: String defining the output directory.

verbosity

Integer: Verbosity level.

...

Not used.

Value

Object of class Regression(Supervised), RegressionRes(SupervisedRes), Classification(Supervised), or ClassificationRes(SupervisedRes).

Details

Online book & documentation

See rdocs.rtemis.org/train for detailed documentation.

Binary Classification

For binary classification, the outcome should be a factor where the 2nd level corresponds to the positive class.
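As a minimal base R sketch of the level-ordering requirement above (the labels "control" and "case" are hypothetical and not part of rtemis), the positive class can be placed second either when the factor is created or afterwards with relevel():

```r
# Hypothetical outcome vector; the labels "control" and "case" are illustrative only.
outcome <- c("case", "control", "control", "case", "control")

# factor() sorts levels alphabetically by default, so "case" would be the 1st level.
default_levels <- levels(factor(outcome))

# List the levels explicitly so that the positive class ("case") comes second.
outcome_f <- factor(outcome, levels = c("control", "case"))
positive_class <- levels(outcome_f)[2]

# For an existing factor, relevel() moves the chosen negative (reference) level first.
outcome_r <- relevel(factor(outcome), ref = "control")
```

Checking levels(outcome)[2] before calling train() is a quick way to confirm which class will be treated as positive.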

Resampling

Do not use an outer resampling method with replacement (e.g. bootstrap) if you will also be using inner resampling for tuning. Cases duplicated by the outer resampling may appear in both the training and test sets of the inner resamples, leading to underestimated test error.
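The leakage described above can be illustrated with base R alone. This is a simplified sketch that does not use the rtemis resampling code: the sample size, the number of folds, and all variable names are assumptions chosen for illustration.

```r
set.seed(2026)
n <- 100L

# Outer resample drawn WITH replacement (bootstrap): some case IDs repeat.
outer_train_ids <- sample.int(n, size = n, replace = TRUE)

# Inner 5-fold assignment over the rows of that outer training set.
fold <- sample(rep_len(1:5, length(outer_train_ids)))
inner_test_ids <- outer_train_ids[fold == 1L]
inner_train_ids <- outer_train_ids[fold != 1L]

# Original cases that ended up on both sides of the inner split.
n_leaked <- length(intersect(inner_train_ids, inner_test_ids))

# Same procedure WITHOUT replacement: every ID is unique, so no overlap is possible.
outer_ids_norep <- sample.int(n, size = 80L, replace = FALSE)
fold_norep <- sample(rep_len(1:5, length(outer_ids_norep)))
n_leaked_norep <- length(intersect(
  outer_ids_norep[fold_norep != 1L],
  outer_ids_norep[fold_norep == 1L]
))
```

Here n_leaked counts cases the inner model is both trained and tested on, while n_leaked_norep is always zero.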

Reproducibility

If using outer resampling, you can set a seed when defining outer_resampling_config, e.g.

outer_resampling_config = setup_Resampler(n_resamples = 10L, type = "KFold", seed = 2026L)

If using tuning with inner resampling, you can set a seed when defining tuner_config, e.g.

tuner_config = setup_GridSearch(
  resampler_config = setup_Resampler(n_resamples = 5L, type = "KFold", seed = 2027L)
)

Parallelization

There are three levels of parallelization that may be used during training:

  1. Algorithm training (e.g. a parallelized learner like LightGBM)

  2. Tuning (inner resampling, where multiple resamples can be processed in parallel)

  3. Outer resampling (where multiple outer resamples can be processed in parallel)

The train() function will automatically manage parallelization depending on:

  • The number of workers specified by the user using n_workers

  • Whether the training algorithm supports parallelization itself

  • Whether hyperparameter tuning is needed
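As a hedged sketch of how the number of workers might be supplied, the call below builds an execution configuration and passes it to train(). The argument names backend and n_workers are taken from the descriptions on this page, but their exact signature in setup_ExecutionConfig() is an assumption; check the function's own documentation before relying on it.

```r
# Runs only if rtemis is installed; otherwise the block is skipped.
if (requireNamespace("rtemis", quietly = TRUE)) {
  library(rtemis)

  # Assumption: setup_ExecutionConfig() accepts `backend` and `n_workers`.
  exec_config <- setup_ExecutionConfig(backend = "mirai", n_workers = 4L)

  # train() then decides how to distribute the workers across the levels above.
  mod <- train(
    iris,
    algorithm = "LightRF",
    outer_resampling_config = setup_Resampler(),
    execution_config = exec_config
  )
}
```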

Author

EDG

Examples

# \donttest{
# Classify iris with LightRF, using the default outer resampling configuration
iris_c_lightRF <- train(
  iris,
  algorithm = "LightRF",
  outer_resampling_config = setup_Resampler()
)
# }