HGCN Streaming Dataset
create_streaming_dataset(token_cache, cfg)
Create a streaming dataset that yields triplets with tokenized embeddings.
create_streaming_generator(cfg)
Create a generator that yields triplets for training.
load_streaming_triplets(cfg, *, worker_id='Main', allow_cache_save=True, log_stats=True)
Materialize the cached streaming triplets for reuse outside of DataLoader workers.
Uses taxonomy-based stratified positive sampling:
- Stratum 0 (descendants): for levels 2-5, next-level descendants
- Stratum 1 (ancestors): for levels 3-6, parent codes up to level 2
- Stratum 2 (siblings): codes sharing the same parent
Samples up to 4 positives per stratum, then samples negatives for each.
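The three strata can be illustrated with a minimal sketch. This is not the library's implementation. It assumes codes are strings whose length equals their taxonomy level and whose parent is the code minus its last character. It also assumes siblings are only drawn when the parent itself exists in the code set. The helper name `stratified_positives` and the constant `MAX_POSITIVES_PER_STRATUM` are hypothetical, and negative sampling is left out.

```python
import random
from typing import Dict, List, Sequence

# "Up to 4 positives per stratum", as stated in the description above.
MAX_POSITIVES_PER_STRATUM = 4


def stratified_positives(
    anchor: str, codes: Sequence[str], rng: random.Random
) -> Dict[int, List[str]]:
    """Hypothetical helper: pick positives for `anchor` from three strata.

    Assumption: a code's level is its string length, and its parent is the
    code with the last character removed.
    """
    level = len(anchor)
    universe = set(codes)
    strata: Dict[int, List[str]] = {0: [], 1: [], 2: []}

    # Stratum 0 (descendants): next-level children, for anchors at levels 2-5.
    if 2 <= level <= 5:
        strata[0] = sorted(
            c for c in universe if len(c) == level + 1 and c.startswith(anchor)
        )

    # Stratum 1 (ancestors): parent codes up to level 2, for anchors at levels 3-6.
    if 3 <= level <= 6:
        strata[1] = [
            anchor[:n] for n in range(level - 1, 1, -1) if anchor[:n] in universe
        ]

    # Stratum 2 (siblings): other codes sharing the anchor's parent.
    # Assumption: skipped when the parent is not itself a known code.
    parent = anchor[:-1]
    if parent in universe:
        strata[2] = sorted(
            c
            for c in universe
            if c != anchor and len(c) == level and c[:-1] == parent
        )

    # Cap each stratum at MAX_POSITIVES_PER_STRATUM by sampling without replacement.
    return {
        s: rng.sample(pool, min(MAX_POSITIVES_PER_STRATUM, len(pool)))
        for s, pool in strata.items()
    }
```

For example, with the toy code set `["11", "111", "112", "1111", "1112", "1121"]` and anchor `"111"`, the descendant pool is `1111` and `1112`, the ancestor pool is `11`, and the sibling pool is `112`. Passing a seeded `random.Random` keeps the draw reproducible.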