Training Guide¶
This guide explains how to train the NAICS Hyperbolic Embedding System with the Structure-Aware Dynamic Curriculum (SADC) scheduler. The current workflow uses a single configuration file (conf/config.yaml) to control model, data, trainer, and curriculum settings.
Table of Contents¶
- Training Guide
- Table of Contents
- Quick Start
- SADC Scheduler
- CLI Reference
- Resuming and Overrides
- Sampling Architecture: Data Layer vs Model Layer
- Migration for Legacy Chains
- Performance Optimization
- Troubleshooting
Quick Start¶
Preprocess data first (see docs/usage.md), then launch training with the default config:
uv run naics-embedder train --config conf/config.yaml
Apply overrides inline—SADC will stay active and adapt phases automatically:
uv run naics-embedder train --config conf/config.yaml \
training.learning_rate=1e-4 training.trainer.max_epochs=20
SADC Scheduler¶
The scheduler runs three phases within a single training invocation:
- Structural Initialization (0–30%)
    - Flags: `use_tree_distance`, `mask_siblings`
    - Effect: weights negatives by inverse tree distance and masks siblings.
- Geometric Refinement (30–70%)
    - Flags: `enable_hard_negative_mining`, `enable_router_guided_sampling`
    - Effect: activates Lorentzian hard-negative mining and router-guided MoE sampling.
- False Negative Mitigation (70–100%)
    - Flags: `enable_clustering`
    - Effect: enables clustering-driven false-negative elimination.
Phase boundaries are derived from the trainer's max_epochs. Flag transitions are logged to help
verify when each mechanism is active.
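For intuition, here is a minimal sketch of how phase flags could be derived from `max_epochs` and the percentage boundaries above (the helper name and return shape are hypothetical, not the actual `CurriculumScheduler` API):

```python
# Illustrative sketch only: map an epoch to SADC phase flags using the
# 0-30% / 30-70% / 70-100% boundaries described above. Names are hypothetical.
def sadc_flags(epoch: int, max_epochs: int) -> dict[str, bool]:
    progress = epoch / max(max_epochs, 1)
    if progress < 0.30:  # Phase 1: Structural Initialization
        return {"use_tree_distance": True, "mask_siblings": True}
    if progress < 0.70:  # Phase 2: Geometric Refinement
        return {"enable_hard_negative_mining": True,
                "enable_router_guided_sampling": True}
    return {"enable_clustering": True}  # Phase 3: False Negative Mitigation
```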
Two additional knobs were added for the experimentation tracks in Issue #44:
- `curriculum.phase_mode=two_phase` merges Phase 3 behaviors into Phase 2 for a simpler two-stage schedule.
- `curriculum.anneal.*` enables continuous schedules (e.g., annealing the tree-distance exponent or router mix ratio over `epochs` or when a metric threshold is reached).
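As a sketch of what such a continuous schedule could compute, the snippet below linearly anneals the tree-distance exponent over epochs (the `start`, `end`, and `epochs` parameters are assumptions for illustration, not the actual `curriculum.anneal.*` schema):

```python
# Illustrative sketch: linearly anneal the tree-distance exponent alpha across
# training epochs. Parameter names are assumptions, not the real anneal schema.
def anneal_tree_distance_alpha(epoch: int, start: float = 2.0,
                               end: float = 0.5, epochs: int = 20) -> float:
    t = min(epoch / max(epochs, 1), 1.0)  # training progress in [0, 1]
    return start + t * (end - start)
```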
CLI Reference¶
Use the `train` command for all new runs:
uv run naics-embedder train --config conf/config.yaml
Key options:
- `--config PATH` — Base config file (default: `conf/config.yaml`).
- `--ckpt-path PATH` — Resume from a checkpoint, or use `last` to pick the most recent run artifact.
- `--skip-validation` — Bypass pre-flight validation when inputs are already verified.
- `OVERRIDES...` — Space-separated config overrides (e.g., `training.learning_rate=1e-4`).
Resuming and Overrides¶
Resume the latest checkpoint produced under the current experiment name:
uv run naics-embedder train --ckpt-path last
Override trainer settings without editing YAML:
uv run naics-embedder train \
training.trainer.max_epochs=15 training.trainer.accumulate_grad_batches=4
Sampling Architecture: Data Layer vs Model Layer¶
This section clarifies the split between the streaming data pipeline and the model layer during curriculum-driven training.
Data Layer (Streaming Dataset)¶
- Build candidate pools from precomputed triplets and taxonomy indices.
- Phase 1 sampling (see the sketch after this list):
    - Inverse tree-distance weighting (`P(n) ∝ 1 / d_tree(a, n)^α`).
    - Sibling masking (`d_tree <= 2` set to zero).
    - Explicit exclusion mining (`excluded_codes` map) with high-priority weights and an `explicit_exclusion` flag on sampled negatives.
- Static baseline (SANS):
    - Set `sampling.strategy=sans_static` to replace the dynamic weighting with fixed near/far buckets.
    - Configure bucket ratios under `sampling.sans_static` (e.g., `near_bucket_weight`, `near_distance_threshold`).
    - The dataloader emits `sampling_metadata` so the model can log near/far percentages per batch, making it easier to benchmark against Issue #43.
- Outputs:
    - Tokenized anchors/positives.
    - Negatives annotated with `relation_margin`, `distance_margin`, and `explicit_exclusion`.
    - Shared negatives reused for multi-level positives (ancestor supervision).
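A minimal sketch of the Phase 1 weighting and sibling masking, assuming NumPy and a precomputed vector of tree distances (the real streaming dataset code differs):

```python
import numpy as np

# Illustrative sketch of Phase 1 negative weighting: inverse tree-distance
# weighting P(n) ∝ 1 / d_tree(a, n)^alpha, with siblings (d_tree <= 2) masked out.
def phase1_negative_weights(tree_distances: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    weights = 1.0 / np.power(np.maximum(tree_distances, 1), alpha)
    weights[tree_distances <= 2] = 0.0  # sibling masking
    total = weights.sum()
    return weights / total if total > 0 else weights
```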
Model Layer (NAICSContrastiveModel)¶
The model is decomposed into functional mixins for maintainability:
| Mixin | Responsibility |
|---|---|
| `DistributedMixin` | Global batch sampling utilities for multi-GPU training |
| `LossMixin` | Hierarchy loss, LambdaRank loss, radius regularization |
| `CurriculumMixin` | Hard negative mining, router-guided sampling logic |
| `LoggingMixin` | Training and validation metric logging |
| `ValidationMixin` | Validation step and evaluation metrics |
| `OptimizerMixin` | Optimizer and scheduler configuration |
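For orientation only, the decomposition composes into a single model class roughly as in the sketch below (placeholder mixin bodies; the actual class definition in the codebase differs):

```python
# Illustrative sketch of the mixin composition; the real mixins live in the
# naics_embedder package and carry the responsibilities listed in the table above.
class DistributedMixin: ...
class LossMixin: ...
class CurriculumMixin: ...
class LoggingMixin: ...
class ValidationMixin: ...
class OptimizerMixin: ...

class NAICSContrastiveModel(DistributedMixin, LossMixin, CurriculumMixin,
                            LoggingMixin, ValidationMixin, OptimizerMixin):
    """Each mixin contributes one functional slice; the composed class is the
    single training module used by the trainer."""
```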
Curriculum-driven behavior:
- Reads curriculum flags from `CurriculumScheduler`.
- Phase 2+ sampling:
    - Embedding-based hard negative mining (Lorentzian distance).
    - Router-guided negative mining (gate confusion).
    - Norm-adaptive margins via `NormAdaptiveMargin` (sech-based decay; see the sketch after this list).
- Phase 3 sampling:
    - False-negative masking via clustering/pseudo-labels.
    - False-negative strategy (`false_negatives.strategy`):
        - `eliminate` masks pseudo-labeled false negatives (default).
        - `attract` keeps them and applies an auxiliary attraction loss scaled by `attraction_weight`.
        - `hybrid` combines both behaviors for higher precision at the cost of extra compute.
- Logging:
    - Negative relationship distribution.
    - Tree-distance bins.
    - Router confusion and adaptive margins.
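As an illustration of the sech-based decay, a norm-adaptive margin can shrink as the embedding norm grows, as in the sketch below (the parameterization is an assumption; the actual `NormAdaptiveMargin` module may differ):

```python
import torch

# Illustrative sketch: margin decays with the embedding norm via sech(x) = 1/cosh(x).
# base_margin and scale are assumed parameters, not the module's real interface.
def norm_adaptive_margin(embedding_norm: torch.Tensor,
                         base_margin: float = 0.2,
                         scale: float = 1.0) -> torch.Tensor:
    return base_margin / torch.cosh(scale * embedding_norm)
```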
Interface Contract¶
- Inputs expected from the data layer: negative embeddings and an optional `explicit_exclusion` flag; negatives per anchor are already filtered/weighted for Phase 1 (see the sketch after this list).
- Curriculum flags influence:
    - Phase 1 flags (`use_tree_distance`, `mask_siblings`) act in the data layer.
    - Phase 2/3 flags (`enable_hard_negative_mining`, `enable_router_guided_sampling`, `enable_clustering`) act in the model layer.
- Re-sampling: Phase 1 weighting occurs in `streaming_dataset`; later phases reuse the provided negatives but reorder/mix them based on mining strategies.
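A minimal sketch of the per-negative payload implied by this contract (the annotated keys follow the list above; `input_ids` and the exact shapes are assumptions):

```python
# Illustrative sketch of the metadata each sampled negative carries from the
# data layer to the model layer; exact field names and shapes may differ.
negative = {
    "input_ids": [101, 2054, 102],  # tokenized negative text (assumed field)
    "relation_margin": 0.15,        # margin from the taxonomy relation type
    "distance_margin": 0.30,        # margin from tree distance to the anchor
    "explicit_exclusion": False,    # True when mined from the excluded_codes map
}
```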
Migration for Legacy Chains¶
The legacy stage-by-stage curriculum files and chain configs are retired. To reproduce an old multi-stage job, acknowledge the deprecated workflow explicitly:
uv run naics-embedder train-seq --legacy --num-stages 3 --config conf/config.yaml
New work should rely on `train` plus overrides—the SADC scheduler replaces manual chains and
static curriculum files.
Performance Optimization¶
torch.compile Support¶
Core Lorentz operations are optimized using PyTorch 2.0+ torch.compile for improved throughput:
- Exponential/Logarithmic maps — Fused element-wise operations
- Distance computations — Compiled Lorentzian distance
- MoE gating — Compiled softmax operations
- Hard negative mining — Compiled norm and margin computations
Compilation is enabled by default when PyTorch 2.0+ is available. Configure via:
from naics_embedder.utils.compile import CompileConfig, set_compile_config
set_compile_config(CompileConfig(
enabled=True,
mode='reduce-overhead', # Best for small tensors / repeated calls
dynamic=True, # Support varying batch sizes
))
Disable compilation via environment variable:
NAICS_DISABLE_COMPILE=1 uv run naics-embedder train
Benchmark compiled vs eager operations:
from naics_embedder.utils.compile import benchmark_compile_speedup
results = benchmark_compile_speedup(batch_size=256, embedding_dim=768)
print(f"exp_map speedup: {results['exp_map']['speedup']:.2f}x")
print(f"distance speedup: {results['lorentz_distance']['speedup']:.2f}x")
Troubleshooting¶
- Dataset checks — Use `uv run naics-embedder tools config` to confirm paths before training.
- Flag visibility — Curriculum phase transitions and flag values are emitted in training logs.
- Memory pressure — Lower `data_loader.batch_size` or increase `accumulate_grad_batches` via overrides.
- Compile issues — If torch.compile causes problems, disable with `NAICS_DISABLE_COMPILE=1`.