Training Guide

This guide explains how to train the NAICS Hyperbolic Embedding System with the Structure-Aware Dynamic Curriculum (SADC) scheduler. The current workflow uses a single configuration file (conf/config.yaml) to control model, data, trainer, and curriculum settings.

Quick Start

Preprocess data first (see docs/usage.md), then launch training with the default config:

uv run naics-embedder train --config conf/config.yaml

Apply overrides inline—SADC will stay active and adapt phases automatically:

uv run naics-embedder train --config conf/config.yaml \
  training.learning_rate=1e-4 training.trainer.max_epochs=20

SADC Scheduler

The scheduler runs three phases within a single training invocation:

  1. Structural Initialization (0–30%)
     • Flags: use_tree_distance, mask_siblings
     • Effect: weights negatives by inverse tree distance and masks siblings.
  2. Geometric Refinement (30–70%)
     • Flags: enable_hard_negative_mining, enable_router_guided_sampling
     • Effect: activates Lorentzian hard-negative mining and router-guided MoE sampling.
  3. False Negative Mitigation (70–100%)
     • Flags: enable_clustering
     • Effect: enables clustering-driven false-negative elimination.

Phase boundaries are derived from the trainer's max_epochs. Flag transitions are logged to help verify when each mechanism is active.
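
For orientation, here is a minimal sketch of how a progress-based phase switch can follow from max_epochs; the function name and edge handling are illustrative, not the CurriculumScheduler API:

def phase_for_epoch(epoch: int, max_epochs: int) -> int:
    """Illustrative mapping of training progress to SADC phases 1-3.

    Assumes the 0-30% / 30-70% / 70-100% boundaries described above;
    the real scheduler may differ in naming and boundary handling.
    """
    progress = epoch / max(max_epochs, 1)
    if progress < 0.30:
        return 1  # Structural Initialization
    if progress < 0.70:
        return 2  # Geometric Refinement
    return 3      # False Negative Mitigation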

Two additional knobs were added for the experimentation tracks in Issue #44:

  • curriculum.phase_mode=two_phase merges Phase 3 behaviors into Phase 2 for a simpler two-stage schedule.
  • curriculum.anneal.* enables continuous schedules (e.g., annealing the tree-distance exponent or router mix ratio over epochs or when a metric threshold is reached).
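
As an illustration of the continuous-schedule idea (the parameter names and start/end values below are placeholders, not the exact curriculum.anneal.* contract), a linear anneal of the tree-distance exponent could look like this:

def annealed_alpha(epoch: int, max_epochs: int,
                   alpha_start: float = 2.0, alpha_end: float = 0.5) -> float:
    """Linearly anneal the tree-distance exponent over training.

    Hypothetical helper: the defaults stand in for whatever
    curriculum.anneal.* configures.
    """
    t = min(epoch / max(max_epochs, 1), 1.0)
    return alpha_start + t * (alpha_end - alpha_start)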

CLI Reference

Use the train command for all new runs:

uv run naics-embedder train --config conf/config.yaml

Key options:

  • --config PATH — Base config file (default: conf/config.yaml).
  • --ckpt-path PATH — Resume from a checkpoint or use last to pick the most recent run artifact.
  • --skip-validation — Bypass pre-flight validation when inputs are already verified.
  • OVERRIDES... — Space-separated config overrides (e.g., training.learning_rate=1e-4).

Resuming and Overrides

Resume the latest checkpoint produced under the current experiment name:

uv run naics-embedder train --ckpt-path last

Override trainer settings without editing YAML:

uv run naics-embedder train \
  training.trainer.max_epochs=15 training.trainer.accumulate_grad_batches=4

Sampling Architecture: Data Layer vs Model Layer

This section clarifies the split between the streaming data pipeline and the model during curriculum-driven training.

Data Layer (Streaming Dataset)

  • Build candidate pools from precomputed triplets and taxonomy indices.
  • Phase 1 sampling (see the sketch after this list):
      • Inverse tree-distance weighting (P(n) ∝ 1 / d_tree(a, n)^α).
      • Sibling masking (d_tree <= 2 set to zero).
      • Explicit exclusion mining (excluded_codes map) with high-priority weights and an explicit_exclusion flag on sampled negatives.
  • Static baseline (SANS):
      • Set sampling.strategy=sans_static to replace the dynamic weighting with fixed near/far buckets.
      • Configure bucket ratios under sampling.sans_static (e.g., near_bucket_weight, near_distance_threshold).
      • The dataloader emits sampling_metadata so the model can log near/far percentages per batch, making it easier to benchmark against Issue #43.
  • Outputs:
      • Tokenized anchors/positives.
      • Negatives annotated with relation_margin, distance_margin, and explicit_exclusion.
      • Shared negatives reused for multi-level positives (ancestor supervision).
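
A minimal sketch of the Phase 1 weighting, assuming tree distances are available as an integer array per anchor; the function name and defaults are illustrative, not the streaming dataset's actual code:

import numpy as np

def phase1_negative_weights(tree_dists: np.ndarray, alpha: float = 1.0,
                            sibling_threshold: int = 2) -> np.ndarray:
    """Weight candidate negatives by inverse tree distance and mask siblings.

    Illustrative only: P(n) ∝ 1 / d_tree(a, n)**alpha, with candidates at
    d_tree <= sibling_threshold zeroed out, then renormalized.
    """
    weights = 1.0 / np.power(np.maximum(tree_dists, 1), alpha)
    weights[tree_dists <= sibling_threshold] = 0.0   # sibling masking
    total = weights.sum()
    return weights / total if total > 0 else weights

# Example: sample one negative index according to the Phase 1 distribution.
dists = np.array([2, 3, 4, 6, 8])
probs = phase1_negative_weights(dists, alpha=1.0)
neg_idx = np.random.choice(len(dists), p=probs)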

Model Layer (NAICSContrastiveModel)

The model is decomposed into functional mixins for maintainability:

  • DistributedMixin — Global batch sampling utilities for multi-GPU training
  • LossMixin — Hierarchy loss, LambdaRank loss, radius regularization
  • CurriculumMixin — Hard negative mining, router-guided sampling logic
  • LoggingMixin — Training and validation metric logging
  • ValidationMixin — Validation step and evaluation metrics
  • OptimizerMixin — Optimizer and scheduler configuration
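
The composition itself is ordinary Python multiple inheritance. A structural sketch, assuming a PyTorch Lightning base class (the stub classes stand in for the real mixin implementations in naics_embedder):

import pytorch_lightning as pl

# Stubs so the composition pattern is runnable in isolation; the real
# mixins carry the responsibilities listed above.
class DistributedMixin: ...
class LossMixin: ...
class CurriculumMixin: ...
class LoggingMixin: ...
class ValidationMixin: ...
class OptimizerMixin: ...

class NAICSContrastiveModel(
    DistributedMixin, LossMixin, CurriculumMixin,
    LoggingMixin, ValidationMixin, OptimizerMixin,
    pl.LightningModule,
):
    """Each mixin contributes one concern; the LightningModule base supplies
    the training/validation loop. Method resolution runs left to right, so
    mixin hooks can override or extend the Lightning defaults."""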

Curriculum-driven behavior:

  • Reads curriculum flags from CurriculumScheduler.
  • Phase 2+ sampling:
      • Embedding-based hard negative mining (Lorentzian distance).
      • Router-guided negative mining (gate confusion).
      • Norm-adaptive margins via NormAdaptiveMargin (sech-based decay); see the sketch after this list.
  • Phase 3 sampling:
      • False-negative masking via clustering/pseudo-labels.
      • False-negative strategy (false_negatives.strategy):
          • eliminate masks pseudo-labeled false negatives (default).
          • attract keeps them and applies an auxiliary attraction loss scaled by attraction_weight.
          • hybrid combines both behaviors for higher precision at the cost of extra compute.
  • Logging:
      • Negative relationship distribution.
      • Tree-distance bins.
      • Router confusion and adaptive margins.
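
A minimal sketch of a sech-based, norm-adaptive margin; the base margin, decay constant, and the norm it acts on are assumptions rather than NormAdaptiveMargin's actual parameters:

import torch

def norm_adaptive_margin(embedding_norms: torch.Tensor,
                         base_margin: float = 0.5,
                         decay: float = 1.0) -> torch.Tensor:
    """Scale a contrastive margin by sech(decay * norm).

    sech(x) = 1 / cosh(x), so points near the origin keep close to the
    full base margin while points with large hyperbolic norms receive a
    smaller one. Illustrative only; the real module may parameterize
    this differently.
    """
    return base_margin / torch.cosh(decay * embedding_norms)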

Interface Contract

  • Inputs expected from data layer (see the example after this list): negative embeddings and optional explicit_exclusion flag; negatives per anchor already filtered/weighted for Phase 1.
  • Curriculum flags influence:
      • Phase 1 flags (use_tree_distance, mask_siblings) act in the data layer.
      • Phase 2/3 flags (enable_hard_negative_mining, enable_router_guided_sampling, enable_clustering) act in the model layer.
  • Re-sampling: Phase 1 weighting occurs in streaming_dataset; later phases reuse provided negatives but reorder/mix based on mining strategies.
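
For orientation, a hypothetical shape for one sampled negative as it crosses this contract; the container layout, extra fields, and values are assumptions, and only the three annotation fields come from the bullets above:

# Hypothetical example of one sampled negative handed to the model layer.
negative_annotation = {
    "input_ids": [101, 2054, 102],   # tokenized negative text (placeholder ids)
    "relation_margin": 0.30,          # margin derived from the relation type
    "distance_margin": 0.15,          # margin derived from tree distance
    "explicit_exclusion": True,       # negative came from the excluded_codes map
}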

Migration for Legacy Chains

The legacy stage-by-stage curriculum files and chain configs are retired. To reproduce an old multi-stage job, acknowledge the deprecated workflow explicitly:

uv run naics-embedder train-seq --legacy --num-stages 3 --config conf/config.yaml

New work should rely on train plus overrides—the dynamic SADC scheduler replaces manual chains and static curriculum files.


Performance Optimization

torch.compile Support

Core Lorentz operations are optimized using PyTorch 2.0+ torch.compile for improved throughput:

  • Exponential/Logarithmic maps — Fused element-wise operations
  • Distance computations — Compiled Lorentzian distance
  • MoE gating — Compiled softmax operations
  • Hard negative mining — Compiled norm and margin computations

Compilation is enabled by default when PyTorch 2.0+ is available. Configure via:

from naics_embedder.utils.compile import CompileConfig, set_compile_config

set_compile_config(CompileConfig(
    enabled=True,
    mode='reduce-overhead',  # Best for small tensors / repeated calls
    dynamic=True,            # Support varying batch sizes
))

Disable compilation via environment variable:

NAICS_DISABLE_COMPILE=1 uv run naics-embedder train

Benchmark compiled vs eager operations:

from naics_embedder.utils.compile import benchmark_compile_speedup

results = benchmark_compile_speedup(batch_size=256, embedding_dim=768)
print(f"exp_map speedup: {results['exp_map']['speedup']:.2f}x")
print(f"distance speedup: {results['lorentz_distance']['speedup']:.2f}x")

Troubleshooting

  • Dataset checks — Use uv run naics-embedder tools config to confirm paths before training.
  • Flag visibility — Curriculum phase transitions and flag values are emitted in training logs.
  • Memory pressure — Lower data_loader.batch_size or increase accumulate_grad_batches via overrides.
  • Compile issues — If torch.compile causes problems, disable with NAICS_DISABLE_COMPILE=1.