CLI Usage Guide¶
This guide covers all available CLI commands for the NAICS Embedder system.
Overview¶
The NAICS Embedder CLI is organized into three main command groups:
- data - Data generation and preprocessing commands
- tools - Utility tools for configuration, GPU optimization, and metrics
- train - Model training with the dynamic SADC curriculum
Installation¶
The CLI is available as the naics-embedder command after installation:
uv run naics-embedder --help
Data Commands¶
data preprocess¶
Download and preprocess all raw NAICS data files.
Generates: data/naics_descriptions.parquet
uv run naics-embedder data preprocess
data relations¶
Compute pairwise graph relationships between all NAICS codes.
Requires: data/naics_descriptions.parquet
Generates: data/naics_relations.parquet
uv run naics-embedder data relations
data distances¶
Compute pairwise graph distances between all NAICS codes.
Requires: data/naics_descriptions.parquet
Generates: data/naics_distances.parquet
uv run naics-embedder data distances
data triplets¶
Generate (anchor, positive, negative) training triplets.
Requires:
- data/naics_descriptions.parquet
- data/naics_distances.parquet
Generates: data/naics_training_pairs.parquet
uv run naics-embedder data triplets
data all¶
Run the full data generation pipeline: preprocess, distances, and triplets.
uv run naics-embedder data all
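The Requires lines above imply an ordering between the data steps. As a minimal sketch, the function below reports which of the documented input files are still missing before a given step. The function name missing_inputs and its optional root-directory argument are illustrative and not part of the CLI; the file names are the ones listed in this guide.

```shell
# Sketch only: print the documented input files that a data step still needs.
# Usage: missing_inputs STEP [ROOT]   (ROOT defaults to the current directory)
missing_inputs() {
    step=$1
    root=${2:-.}
    case "$step" in
        preprocess|all)
            # These steps start from raw downloads, so nothing is required.
            required="" ;;
        relations|distances)
            required="data/naics_descriptions.parquet" ;;
        triplets)
            required="data/naics_descriptions.parquet data/naics_distances.parquet" ;;
        *)
            echo "unknown step: $step" >&2
            return 2 ;;
    esac
    # Paths contain no spaces, so word splitting on $required is safe here.
    for f in $required; do
        [ -f "$root/$f" ] || echo "$f"
    done
    return 0
}

# Example: list what is missing before generating triplets.
missing_inputs triplets
```

The function prints nothing once every required file exists, so its output can gate a step, for example by running data triplets only when `missing_inputs triplets` is empty.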
Tools Commands¶
tools config¶
Display current training configuration, including the Structure-Aware Dynamic Curriculum (SADC) schedule.
uv run naics-embedder tools config
Options:
- --config PATH - Path to base config YAML file (default: conf/config.yaml)
uv run naics-embedder tools config --config conf/config.yaml
tools gpu¶
Optimize training configuration for available GPU memory. Suggests optimal batch_size and accumulate_grad_batches based on your GPU.
# Auto-detect GPU memory
uv run naics-embedder tools gpu --auto
# Specify GPU memory manually
uv run naics-embedder tools gpu --gpu-memory 24
# Apply suggested configuration
uv run naics-embedder tools gpu --auto --apply
Options:
- --gpu-memory FLOAT - GPU memory in GB (e.g., 24 for RTX 6000, 80 for A100)
- --auto - Auto-detect GPU memory
- --target-effective-batch INT - Target effective batch size (default: 256)
- --apply - Apply suggested configuration to config files
- --config PATH - Path to base config YAML file (default: conf/config.yaml)
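The two suggested values are linked through the effective batch size. Assuming the usual gradient-accumulation relationship on a single device (effective batch = batch_size × accumulate_grad_batches), the arithmetic can be sketched as below. The helper name accum_steps is hypothetical, and the selection logic inside tools gpu may differ.

```shell
# Sketch only: smallest accumulate_grad_batches that reaches the target
# effective batch size for a given batch_size (integer ceiling division).
# accum_steps is an illustrative helper, not part of naics-embedder.
accum_steps() {
    target=$1
    batch=$2
    echo $(( (target + batch - 1) / batch ))
}

accum_steps 256 64   # prints 4, since 64 * 4 = 256
accum_steps 256 96   # prints 3, since 96 * 3 = 288 >= 256
```

When batch_size does not divide the target evenly, rounding up overshoots the target slightly, as in the second call.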
tools visualize¶
Visualize training metrics from log files. Creates visualizations and analysis of training metrics, including:
- Hyperbolic radius over time
- Hierarchy preservation correlations
- Embedding diversity metrics
uv run naics-embedder tools visualize --stage 02_text
Options:
- --stage, -s STR - Stage name to filter (e.g., 02_text, default: 02_text)
- --log-file PATH - Path to log file (default: logs/train_sequential.log)
- --output-dir PATH - Output directory for plots (default: outputs/visualizations/)
tools investigate¶
Investigate why hierarchy preservation correlations might be low. Analyzes ground truth distances, evaluation configuration, and provides recommendations.
uv run naics-embedder tools investigate
Options:
- --distance-matrix PATH - Path to ground truth distance matrix (default: data/naics_distance_matrix.parquet)
- --config PATH - Path to config file (default: conf/config.yaml)
Training Commands¶
train¶
Train the contrastive encoder with the Structure-Aware Dynamic Curriculum (SADC). The scheduler drives phase transitions automatically, so no curriculum files or chain configs are needed.
uv run naics-embedder train --config conf/config.yaml
Options:
- --config PATH - Path to base config YAML file (default: conf/config.yaml)
- --ckpt-path PATH - Path to checkpoint file to resume from, or "last" to auto-detect the latest checkpoint in the experiment directory
- --skip-validation - Skip pre-flight validation of data files and caches
- OVERRIDES... - Config overrides (e.g., training.learning_rate=1e-4 data.batch_size=64)
Examples:
# Standard run with SADC
uv run naics-embedder train
# Resume from last checkpoint in the experiment
uv run naics-embedder train --ckpt-path last
# Apply overrides for learning rate and epochs
uv run naics-embedder train --config conf/config.yaml \
training.learning_rate=1e-4 training.trainer.max_epochs=20
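Because overrides are plain key=value arguments, they are easy to script. The sketch below sweeps the documented training.learning_rate override over several values. The function name sweep_learning_rates and the DRY_RUN variable are assumptions of this example, not CLI features; by default the function only prints the commands it would run.

```shell
# Sketch only: run (or print) one training command per learning rate.
# DRY_RUN=1 (the default here) prints the commands; DRY_RUN=0 executes them
# and stops at the first failing run.
sweep_learning_rates() {
    for lr in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "uv run naics-embedder train --config conf/config.yaml training.learning_rate=$lr"
        else
            uv run naics-embedder train --config conf/config.yaml \
                "training.learning_rate=$lr" || return 1
        fi
    done
}

# Prints two commands, one per learning rate.
sweep_learning_rates 1e-4 3e-4
```

Reviewing the printed commands before setting DRY_RUN=0 is a cheap way to catch a mistyped override key.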
train-seq¶
Deprecated sequential training workflow, retained for legacy stage-chain jobs. Use train (SADC) for new runs; train-seq now requires --legacy to acknowledge the deprecation.
uv run naics-embedder train-seq --legacy --num-stages 3
Options:
- --num-stages, -n INT - Number of sequential stages to run (default: 3)
- --config PATH - Path to base config YAML file (default: conf/config.yaml)
- --resume - Resume from last checkpoint if available
- --legacy - Required to continue using the deprecated workflow
- OVERRIDES... - Config overrides applied to every stage
Examples:
# Reproduce a historical 3-stage run
uv run naics-embedder train-seq --legacy --num-stages 3
Common Workflows¶
Complete Data Pipeline¶
# Generate all required data files
uv run naics-embedder data all
Standard Training¶
# Train with the dynamic SADC scheduler
uv run naics-embedder train
Legacy Sequential Training¶
# Legacy sequential flow (deprecated)
uv run naics-embedder train-seq --legacy --num-stages 3
View Configuration¶
# Display current configuration
uv run naics-embedder tools config
Analyze Training Metrics¶
# Visualize training metrics
uv run naics-embedder tools visualize --stage 02_text
# Investigate hierarchy preservation issues
uv run naics-embedder tools investigate
Getting Help¶
For help on any command, use the --help flag:
uv run naics-embedder --help
uv run naics-embedder data --help
uv run naics-embedder tools --help
uv run naics-embedder train --help
Configuration Files¶
The CLI reads a single configuration file, conf/config.yaml:
- Base Config: Paths, model hyperparameters, and trainer settings
- Curriculum: curriculum.* fields configure SADC phase boundaries and false-negative elimination cadence
See the Configuration Documentation for details on configuration structure.