HGCN Streaming Dataset

create_streaming_dataset(token_cache, cfg)

Create a streaming dataset that yields triplets with tokenized embeddings.

create_streaming_generator(cfg)

Create a generator that yields triplets for training.

load_streaming_triplets(cfg, *, worker_id='Main', allow_cache_save=True, log_stats=True)

Materialize the cached streaming triplets for reuse outside of DataLoader workers.

Uses taxonomy-based stratified positive sampling:

- Stratum 0 (descendants): for levels 2-5, next-level descendants
- Stratum 1 (ancestors): for levels 3-6, parent codes up to level 2
- Stratum 2 (siblings): codes sharing the same parent

Samples up to 4 positives per stratum, then samples negatives for each.
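The stratum rules above can be illustrated with a minimal sketch. It is not the module's implementation: it assumes that each taxonomy code is a string whose length equals its level and whose parent is the code with its last character removed, and the names `build_children` and `stratified_positives` are hypothetical. Negative sampling is not shown.

```python
import random
from collections import defaultdict

MAX_POSITIVES_PER_STRATUM = 4  # "up to 4 positives per stratum"


def build_children(codes):
    """Map each code to its next-level descendants.

    Assumption: the parent of a code is the code minus its last character.
    """
    code_set = set(codes)
    children = defaultdict(list)
    for code in codes:
        parent = code[:-1]
        if parent in code_set:
            children[parent].append(code)
    return children


def stratified_positives(anchor, codes, rng, k=MAX_POSITIVES_PER_STRATUM):
    """Return up to k sampled positives per stratum for one anchor code."""
    code_set = set(codes)
    children = build_children(codes)
    level = len(anchor)  # assumption: level == number of characters
    strata = {0: [], 1: [], 2: []}

    # Stratum 0 (descendants): for levels 2-5, next-level descendants.
    if 2 <= level <= 5:
        strata[0] = list(children.get(anchor, []))

    # Stratum 1 (ancestors): for levels 3-6, parent codes up to level 2.
    if 3 <= level <= 6:
        strata[1] = [
            anchor[:n] for n in range(level - 1, 1, -1) if anchor[:n] in code_set
        ]

    # Stratum 2 (siblings): codes sharing the same parent, excluding the anchor.
    strata[2] = [c for c in children.get(anchor[:-1], []) if c != anchor]

    # Sample without replacement, capped at k per stratum.
    return {s: rng.sample(pool, min(k, len(pool))) for s, pool in strata.items()}
```

Passing a seeded `random.Random` instance as `rng` keeps the sampling reproducible. A stratum with fewer than `k` candidates simply returns all of them.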