CLI Usage Guide

This guide covers all available CLI commands for the NAICS Embedder system.

Overview

The NAICS Embedder CLI is organized into three main command groups:

  • data - Data generation and preprocessing commands
  • tools - Utility tools for configuration, GPU optimization, and metrics
  • train - Model training with the dynamic SADC curriculum

Installation

The CLI is available as the naics-embedder command after installation:

uv run naics-embedder --help

Data Commands

data preprocess

Download and preprocess all raw NAICS data files.

Generates: data/naics_descriptions.parquet

uv run naics-embedder data preprocess

data relations

Compute pairwise graph relationships between all NAICS codes.

Requires: data/naics_descriptions.parquet
Generates: data/naics_relations.parquet

uv run naics-embedder data relations

data distances

Compute pairwise graph distances between all NAICS codes.

Requires: data/naics_descriptions.parquet
Generates: data/naics_distances.parquet

uv run naics-embedder data distances

data triplets

Generate (anchor, positive, negative) training triplets.

Requires:

  • data/naics_descriptions.parquet
  • data/naics_distances.parquet

Generates: data/naics_training_pairs.parquet

uv run naics-embedder data triplets
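
Because each stage reads files produced by earlier stages, it can help to confirm that the inputs exist before launching a stage. The following is a minimal sketch and not part of the CLI: missing_inputs and triplets_if_ready are hypothetical helpers, and the paths are the ones listed under Requires above.

```shell
# Hypothetical helper (not part of naics-embedder): print each path
# from the argument list that does not exist as a regular file.
missing_inputs() {
  for path in "$@"; do
    [ -f "$path" ] || printf '%s\n' "$path"
  done
}

# Run `data triplets` only when both required inputs are present;
# otherwise list the missing files on stderr and return non-zero.
triplets_if_ready() {
  missing=$(missing_inputs \
    data/naics_descriptions.parquet \
    data/naics_distances.parquet)
  if [ -n "$missing" ]; then
    printf 'Missing inputs:\n%s\n' "$missing" >&2
    return 1
  fi
  uv run naics-embedder data triplets
}
```

Calling triplets_if_ready from the project root either starts the command or reports which of the two files still needs to be generated.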

data all

Run the full data generation pipeline: preprocess, distances, and triplets.

uv run naics-embedder data all

Tools Commands

tools config

Display current training configuration, including the Structure-Aware Dynamic Curriculum (SADC) schedule.

uv run naics-embedder tools config

Options:

  • --config PATH - Path to base config YAML file (default: conf/config.yaml)

uv run naics-embedder tools config --config conf/config.yaml

tools gpu

Optimize training configuration for available GPU memory. Suggests optimal batch_size and accumulate_grad_batches based on your GPU.

# Auto-detect GPU memory
uv run naics-embedder tools gpu --auto

# Specify GPU memory manually
uv run naics-embedder tools gpu --gpu-memory 24

# Apply suggested configuration
uv run naics-embedder tools gpu --auto --apply

Options:

  • --gpu-memory FLOAT - GPU memory in GB (e.g., 24 for RTX 6000, 80 for A100)
  • --auto - Auto-detect GPU memory
  • --target-effective-batch INT - Target effective batch size (default: 256)
  • --apply - Apply suggested configuration to config files
  • --config PATH - Path to base config YAML file (default: conf/config.yaml)
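
For context on the two suggested values: with gradient accumulation, the effective batch size is conventionally the per-step batch_size multiplied by accumulate_grad_batches. The sketch below only illustrates that arithmetic for the default target of 256. accum_steps is a hypothetical helper, and the heuristic that tools gpu actually uses to choose batch_size from GPU memory is not described in this guide.

```shell
# Hypothetical helper: smallest number of accumulation steps such that
# batch_size * steps >= target (integer ceiling division).
accum_steps() {
  target=$1
  batch_size=$2
  echo $(( (target + batch_size - 1) / batch_size ))
}

accum_steps 256 64   # prints 4  (64 * 4 = 256)
accum_steps 256 96   # prints 3  (96 * 3 = 288, the first multiple >= 256)
```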

tools visualize

Visualize training metrics from log files. Creates comprehensive visualizations and analysis of training metrics including:

  • Hyperbolic radius over time
  • Hierarchy preservation correlations
  • Embedding diversity metrics

uv run naics-embedder tools visualize --stage 02_text

Options:

  • --stage, -s STR - Stage name to filter (e.g., 02_text, default: 02_text)
  • --log-file PATH - Path to log file (default: logs/train_sequential.log)
  • --output-dir PATH - Output directory for plots (default: outputs/visualizations/)

tools investigate

Investigate why hierarchy preservation correlations might be low. Analyzes the ground truth distances and the evaluation configuration, then provides recommendations.

uv run naics-embedder tools investigate

Options:

  • --distance-matrix PATH - Path to ground truth distance matrix (default: data/naics_distance_matrix.parquet)
  • --config PATH - Path to config file (default: conf/config.yaml)


Training Commands

train

Train the contrastive encoder with the Structure-Aware Dynamic Curriculum (SADC). The scheduler drives phase transitions automatically, so no curriculum files or chain configs are needed.

uv run naics-embedder train --config conf/config.yaml

Options:

  • --config PATH - Path to base config YAML file (default: conf/config.yaml)
  • --ckpt-path PATH - Path to checkpoint file to resume from, or "last" to auto-detect the latest checkpoint in the experiment directory
  • --skip-validation - Skip pre-flight validation of data files and caches
  • OVERRIDES... - Config overrides (e.g., training.learning_rate=1e-4 data.batch_size=64)

Examples:

# Standard run with SADC
uv run naics-embedder train

# Resume from last checkpoint in the experiment
uv run naics-embedder train --ckpt-path last

# Apply overrides for learning rate and epochs
uv run naics-embedder train --config conf/config.yaml \
  training.learning_rate=1e-4 training.trainer.max_epochs=20
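
Since overrides are plain key=value arguments, a small sweep can be scripted by varying one override per run. The sketch below prints the commands rather than executing them. sweep_commands is a hypothetical helper, and the learning-rate values are illustrative, not recommendations. It uses only the --config flag and the training.learning_rate key shown above.

```shell
# Hypothetical helper: print one train command per learning rate.
# Nothing is executed here; review the output before running it.
sweep_commands() {
  for lr in "$@"; do
    printf 'uv run naics-embedder train --config conf/config.yaml training.learning_rate=%s\n' "$lr"
  done
}

sweep_commands 1e-4 5e-5
```

Printing first makes it easy to inspect the exact argument strings before committing GPU time to them.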

train-seq

Deprecated sequential training workflow retained for legacy stage-chain jobs. Use SADC via train for new runs; train-seq now requires --legacy to acknowledge deprecation.

uv run naics-embedder train-seq --legacy --num-stages 3

Options:

  • --num-stages, -n INT - Number of sequential stages to run (default: 3)
  • --config PATH - Path to base config YAML file (default: conf/config.yaml)
  • --resume - Resume from last checkpoint if available
  • --legacy - Required to continue using the deprecated workflow
  • OVERRIDES... - Config overrides applied to every stage

Examples:

# Reproduce a historical 3-stage run
uv run naics-embedder train-seq --legacy --num-stages 3

Common Workflows

Complete Data Pipeline

# Generate all required data files
uv run naics-embedder data all

Standard Training

# Train with the dynamic SADC scheduler
uv run naics-embedder train

Legacy Sequential Training (Deprecated)

# Legacy sequential flow (deprecated)
uv run naics-embedder train-seq --legacy --num-stages 3

View Configuration

# Display current configuration
uv run naics-embedder tools config

Analyze Training Metrics

# Visualize training metrics
uv run naics-embedder tools visualize --stage 02_text

# Investigate hierarchy preservation issues
uv run naics-embedder tools investigate
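
The data and training workflows above can be chained into one script. This is a sketch under assumptions: run and pipeline are hypothetical helpers, and the DRY_RUN variable is introduced here purely for illustration. By default the sketch only prints each command.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1 (the default
# in this sketch), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Data generation followed by training; `&&` stops at the first
# command that fails, so training never starts on incomplete data.
pipeline() {
  run uv run naics-embedder data all &&
  run uv run naics-embedder train
}

pipeline
```

Setting DRY_RUN=0 in the environment makes the same script execute the commands instead of printing them.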

Getting Help

For help on any command, use the --help flag:

uv run naics-embedder --help
uv run naics-embedder data --help
uv run naics-embedder tools --help
uv run naics-embedder train --help

Configuration Files

The CLI reads a single configuration file, conf/config.yaml, which contains:

  • Base Config: Paths, model hyperparameters, and trainer settings
  • Curriculum: curriculum.* fields configure SADC phase boundaries and false-negative elimination cadence

See the Configuration Documentation for details on configuration structure.