Streaming Dataset API

build_multi_epoch_triplets(cfg, sampling_cfg, n_epochs=100)

Build triplet rows for multiple epochs with different seeds per epoch.

Each epoch gets a different random seed for negative sampling, providing diversity in training examples across epochs.

Parameters:

Name          Type             Description                                    Default
cfg           StreamingConfig  Streaming configuration                        required
sampling_cfg  SamplingConfig   Sampling configuration                         required
n_epochs      int              Number of epochs to pre-sample (default: 100)  100

Returns:

Type                  Description
List[Dict[str, Any]]  List of triplet rows for all epochs combined
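
The per-epoch seeding described above can be illustrated with a minimal, self-contained sketch. It does not use the real StreamingConfig or SamplingConfig objects; the helper names (sample_epoch_triplets, build_rows), the row keys, and the base_seed + epoch seeding rule are all assumptions made for illustration, not the library's actual implementation.

```python
import random
from typing import Any, Dict, List, Tuple


def sample_epoch_triplets(
    pairs: List[Tuple[str, str]],
    corpus_ids: List[str],
    seed: int,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: one triplet row per (query, positive) pair."""
    rng = random.Random(seed)  # independent, reproducible RNG for this epoch
    rows = []
    for query_id, positive_id in pairs:
        # Any document other than the positive is a candidate negative.
        candidates = [doc_id for doc_id in corpus_ids if doc_id != positive_id]
        rows.append(
            {
                "query": query_id,
                "positive": positive_id,
                "negative": rng.choice(candidates),
                "epoch_seed": seed,
            }
        )
    return rows


def build_rows(
    pairs: List[Tuple[str, str]],
    corpus_ids: List[str],
    base_seed: int = 0,
    n_epochs: int = 100,
) -> List[Dict[str, Any]]:
    """Hypothetical stand-in: concatenate rows sampled with one seed per epoch."""
    rows: List[Dict[str, Any]] = []
    for epoch in range(n_epochs):
        # Assumed seeding rule: base seed offset by the epoch index.
        rows.extend(sample_epoch_triplets(pairs, corpus_ids, base_seed + epoch))
    return rows


pairs = [("q1", "d1"), ("q2", "d2")]
corpus_ids = ["d1", "d2", "d3", "d4", "d5"]
rows = build_rows(pairs, corpus_ids, base_seed=42, n_epochs=3)
print(len(rows))  # 2 pairs x 3 epochs
```

Because each epoch uses its own seeded generator, rebuilding with the same base seed reproduces the same rows, while different epochs can draw different negatives for the same positive.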

create_streaming_dataset(token_cache, cfg, sampling_cfg=None)

Create a streaming dataset that yields one triplet per positive, with tokenized embeddings looked up from the token cache.
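
As a rough sketch of what "per-positive" means here: each positive attached to a query produces its own triplet, and the text fields are replaced by cached token IDs. The cache layout (a dict from ID to token IDs), the example schema, the round-robin choice of negatives, and the function name iter_tokenized_triplets are assumptions for illustration only.

```python
from typing import Any, Dict, Iterator, List


def iter_tokenized_triplets(
    token_cache: Dict[str, List[int]],
    examples: List[Dict[str, Any]],
) -> Iterator[Dict[str, List[int]]]:
    """Hypothetical generator: yield one tokenized triplet per positive."""
    for example in examples:
        negatives = example["negatives"]
        for i, positive_id in enumerate(example["positives"]):
            # Assumed pairing rule: cycle through the available negatives.
            negative_id = negatives[i % len(negatives)]
            yield {
                "query_tokens": token_cache[example["query"]],
                "positive_tokens": token_cache[positive_id],
                "negative_tokens": token_cache[negative_id],
            }


# Assumed cache layout: ID -> precomputed token IDs.
token_cache = {"q1": [1, 2], "d1": [3], "d2": [4, 5], "d3": [6]}
examples = [{"query": "q1", "positives": ["d1", "d2"], "negatives": ["d3"]}]
triplets = list(iter_tokenized_triplets(token_cache, examples))
print(len(triplets))  # one triplet for each of the two positives
```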

create_streaming_generator(cfg, sampling_cfg=None)

Create a generator that yields triplets for training, using cached data when available.
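
The "cached data when available" behavior might look roughly like the sketch below: read rows from a cache file if one exists, otherwise build them, write the cache, and yield them. The JSON Lines cache format, the cache_path argument, and the names cached_triplet_generator and build_rows are assumptions, not the library's documented interface.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict, Iterator, List


def cached_triplet_generator(
    cache_path: str,
    build_rows: Callable[[], List[Dict[str, Any]]],
) -> Iterator[Dict[str, Any]]:
    """Hypothetical generator: prefer cached rows, otherwise build and cache."""
    if os.path.exists(cache_path):
        # Cache hit: stream rows back without rebuilding them.
        with open(cache_path, "r", encoding="utf-8") as fh:
            for line in fh:
                yield json.loads(line)
        return

    # Cache miss: build once, persist as JSON Lines, then yield.
    rows = build_rows()
    with open(cache_path, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    yield from rows


calls: List[int] = []


def build_rows() -> List[Dict[str, Any]]:
    calls.append(1)  # track how often the expensive build runs
    return [{"query": "q1", "positive": "d1", "negative": "d3"}]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "triplets.jsonl")
    first = list(cached_triplet_generator(cache_path, build_rows))
    second = list(cached_triplet_generator(cache_path, build_rows))

print(len(calls))  # the build function runs only on the first pass
```

Writing the cache before yielding keeps the file complete even if the consumer stops iterating early, at the cost of holding one full set of rows in memory.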