Streaming Dataset API
build_multi_epoch_triplets(cfg, sampling_cfg, n_epochs=100)
Build triplet rows for multiple epochs, using a different random seed for negative sampling in each epoch.
Varying the seed per epoch diversifies the training examples seen across epochs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg | StreamingConfig | Streaming configuration | required |
| sampling_cfg | SamplingConfig | Sampling configuration | required |
| n_epochs | int | Number of epochs to pre-sample (default: 100) | 100 |
Returns:
| Type | Description |
|---|---|
| List[Dict[str, Any]] | List of triplet rows for all epochs combined |
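
The per-epoch seeding described above can be illustrated with a minimal sketch. This is not the library's implementation: `sketch_multi_epoch_triplets`, `positives`, `candidate_pool`, `base_seed`, and the row keys are all hypothetical names chosen for illustration, and the real function takes `StreamingConfig` and `SamplingConfig` objects whose fields are not documented here.

```python
import random
from typing import Any, Dict, List


def sketch_multi_epoch_triplets(
    positives: List[Dict[str, str]],
    candidate_pool: List[str],
    n_epochs: int = 100,
    base_seed: int = 0,
) -> List[Dict[str, Any]]:
    """Hypothetical stand-in: draw one negative per positive, reseeded each epoch."""
    rows: List[Dict[str, Any]] = []
    for epoch in range(n_epochs):
        # A distinct seed per epoch gives different negatives for the same positive.
        rng = random.Random(base_seed + epoch)
        for pair in positives:
            # Exclude the positive itself so it is never drawn as a negative.
            choices = [c for c in candidate_pool if c != pair["positive"]]
            rows.append(
                {
                    "epoch": epoch,
                    "anchor": pair["anchor"],
                    "positive": pair["positive"],
                    "negative": rng.choice(choices),
                }
            )
    return rows
```

Because each epoch's generator is seeded deterministically, calling the sketch twice with the same arguments reproduces the same rows, while the negatives can still differ from one epoch to the next.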
create_streaming_dataset(token_cache, cfg, sampling_cfg=None)
Create a streaming dataset that yields per-positive triplets with tokenized embeddings.
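
One way to picture this is a generator that replaces each text in a triplet with its pre-tokenized form. The sketch below assumes, purely for illustration, that the token cache behaves like a mapping from text to token IDs and that triplet rows use the keys `anchor`, `positive`, and `negative`; neither assumption is confirmed by this page, and `sketch_streaming_dataset` is a hypothetical name.

```python
from typing import Dict, Iterable, Iterator, List


def sketch_streaming_dataset(
    token_cache: Dict[str, List[int]],
    triplets: Iterable[Dict[str, str]],
) -> Iterator[Dict[str, List[int]]]:
    """Hypothetical stand-in: yield one tokenized triplet per input row."""
    for row in triplets:
        # Look up the cached tokens for each role instead of re-tokenizing.
        yield {
            role: token_cache[row[role]]
            for role in ("anchor", "positive", "negative")
        }
```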
create_streaming_generator(cfg, sampling_cfg=None)
Create a generator that yields triplets for training, using cached data when available.
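
The "use cached data when available" behavior can be sketched as a read-through cache around a row builder. This is an assumption about the general pattern, not the library's actual caching scheme: the JSON Lines file format, `cache_path`, `build_rows`, and `sketch_cached_generator` are all hypothetical.

```python
import json
import os
from typing import Any, Callable, Dict, Iterator, List


def sketch_cached_generator(
    cache_path: str,
    build_rows: Callable[[], List[Dict[str, Any]]],
) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in: stream rows from a cache file, building it on a miss."""
    if os.path.exists(cache_path):
        # Cache hit: stream rows straight from disk, one JSON object per line.
        with open(cache_path, "r", encoding="utf-8") as fh:
            for line in fh:
                yield json.loads(line)
        return

    # Cache miss: build the rows, persist them, then yield them.
    rows = build_rows()
    with open(cache_path, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    yield from rows
```

In this sketch the expensive sampling step runs only on the first pass; later passes read the stored rows without calling `build_rows` again.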