CLI API

CLI command modules for NAICS Embedder.

This package organizes CLI commands into logical groups:

  - data: Data generation and preprocessing commands
  - tools: Utility tools for configuration, GPU optimization, and metrics
  - training: Model training commands

Data Commands

CLI commands for NAICS data generation and preprocessing.

This module provides the data command group that orchestrates the data preparation pipeline. Commands should be run in order or via data all.

Pipeline Stages
  1. preprocess: Download and clean raw NAICS data files
  2. relations: Compute pairwise graph relationships
  3. distances: Compute pairwise graph distances
  4. triplets: Generate training triplets for contrastive learning

Commands

  - preprocess: Download raw NAICS files and produce descriptions parquet.
  - relations: Build relationship annotations between all NAICS codes.
  - distances: Compute tree distances between all NAICS codes.
  - triplets: Generate (anchor, positive, negative) training triplets.
  - all: Run the complete data generation pipeline.

all_data()

Run the complete data generation pipeline.

Executes all data preparation stages in order: preprocess, relations, distances, and triplets. This is the recommended way to prepare data for training from scratch.
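The stage ordering follows from the "Requires" and "Output" entries listed for each command. As a rough sketch (the names `STAGE_DEPENDENCIES`, `pipeline_order`, and `stage_commands` are hypothetical and not part of the package), the order can be derived with a topological sort from the standard library:

```python
from graphlib import TopologicalSorter

# Stage -> stages whose outputs it requires, taken from the "Requires"
# entries documented for each data command.
STAGE_DEPENDENCIES = {
    "preprocess": set(),
    "relations": {"preprocess"},
    "distances": {"preprocess"},
    "triplets": {"preprocess", "relations", "distances"},
}


def pipeline_order(dependencies):
    """Return one valid order in which every stage follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


def stage_commands(order):
    """Build the argv list for each stage, mirroring the documented invocations."""
    return [["uv", "run", "naics-embedder", "data", stage] for stage in order]


order = pipeline_order(STAGE_DEPENDENCIES)
commands = stage_commands(order)
```

Any order produced this way starts with preprocess and ends with triplets; relations and distances depend only on preprocess, so their relative order does not matter.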

Output

All data files required for training will be created in data/.

Example

Run the full pipeline::

$ uv run naics-embedder data all

Note

This command may take 10-30 minutes depending on your system. Progress is logged to logs/data_all.log.

distances()

Compute pairwise graph distances between all NAICS codes.

Calculates tree distances in the NAICS hierarchy for every pair of codes. The distance between two codes is the total number of edges traversed from each code up to their lowest common ancestor. These distances are used by the hierarchy preservation loss and during evaluation.
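To make the distance definition concrete, here is a minimal sketch, not the package's implementation. It assumes that a code's parent is the code with its last digit removed and that all 2-digit sectors hang off a single virtual root. Real NAICS data also contains sector ranges such as 31-33, which this simplification ignores. The helper names `ancestors` and `tree_distance` are hypothetical.

```python
def ancestors(code):
    """Return the chain from `code` up to the virtual root (the empty string).

    Example: '5415' -> ['5415', '541', '54', ''].
    """
    chain = [code[:n] for n in range(len(code), 1, -1)]
    chain.append("")  # assumed virtual root shared by all sectors
    return chain


def tree_distance(a, b):
    """Count edges from `a` and from `b` up to their lowest common ancestor."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            # Edges climbed from a, plus edges climbed from b, to reach `node`.
            return steps_a + chain_b.index(node)
    raise ValueError("codes share no ancestor")  # unreachable: the root is shared
```

Under these assumptions, two 6-digit codes that share a 5-digit parent are at distance 2, and two different 2-digit sectors are also at distance 2 because they meet at the virtual root.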

Requires

data/naics_descriptions.parquet - From the preprocess stage.

Output

  - data/naics_distances.parquet - Pairwise distance annotations.
  - data/naics_distance_matrix.parquet - Sparse matrix representation.

Example

Compute all pairwise distances::

$ uv run naics-embedder data distances

preprocess()

Download and preprocess all raw NAICS data files.

Downloads the official 2022 NAICS taxonomy files from the U.S. Census Bureau and processes them into a unified descriptions parquet file.

The output file contains columns for code, title, description, examples, and exclusions for each NAICS code at all hierarchy levels (2-6 digit).

Output

data/naics_descriptions.parquet - Unified NAICS taxonomy data.

Example

Download and preprocess NAICS data::

$ uv run naics-embedder data preprocess

relations()

Compute pairwise graph relationships between all NAICS codes.

Analyzes the NAICS hierarchy to determine relationship types between every pair of codes (child, sibling, cousin, etc.). These relationships are used for curriculum-based sampling during training.
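The relationship types can be illustrated with a prefix-based sketch. This is an assumption-laden illustration rather than the package's logic: it treats a code's parent as the code minus its last digit, and the labels other than child, sibling, and cousin (`self`, `descendant`, `parent`, `ancestor`, `other`) are hypothetical, as is the function name `relationship`.

```python
def relationship(a, b):
    """Label how code `b` relates to code `a` under a prefix-based hierarchy."""
    if a == b:
        return "self"
    if b.startswith(a):
        # b lies below a in the tree.
        return "child" if len(b) - len(a) == 1 else "descendant"
    if a.startswith(b):
        # b lies above a in the tree.
        return "parent" if len(a) - len(b) == 1 else "ancestor"
    if len(a) == len(b):
        # Same level: compare the parent, then the grandparent.
        if len(a) > 2 and a[:-1] == b[:-1]:
            return "sibling"
        if len(a) > 3 and a[:-2] == b[:-2]:
            return "cousin"
    return "other"
```

For example, `relationship("5415", "54151")` yields `"child"`, while `relationship("54151", "54161")` yields `"cousin"` because both codes share the grandparent `541`.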

Requires

data/naics_descriptions.parquet - From the preprocess stage.

Output

  - data/naics_relations.parquet - Pairwise relationship annotations.
  - data/naics_relation_matrix.parquet - Sparse matrix representation.

Example

Compute all pairwise relationships::

$ uv run naics-embedder data relations

triplets()

Generate (anchor, positive, negative) training triplets.

Creates training triplets for contrastive learning by sampling anchors from the NAICS taxonomy and pairing them with positive samples (related codes) and negative samples (distant codes).

Triplet generation uses the distance and relation annotations to ensure meaningful contrastive pairs that respect the hierarchical structure.
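One way to picture distance-aware triplet sampling is shown below. This is only a sketch: the thresholds `max_positive_distance` and `min_negative_distance`, the function `sample_triplets`, and the toy distance table are hypothetical and do not describe the actual sampling strategy.

```python
import random


def sample_triplets(anchor, distances, max_positive_distance=2,
                    min_negative_distance=6, count=3, seed=0):
    """Sample (anchor, positive, negative) triplets from a distance lookup.

    `distances` maps every other code to its tree distance from `anchor`.
    Positives are close codes, negatives are distant codes.
    """
    rng = random.Random(seed)
    positives = [c for c, d in distances.items() if 0 < d <= max_positive_distance]
    negatives = [c for c, d in distances.items() if d >= min_negative_distance]
    if not positives or not negatives:
        return []  # no meaningful contrastive pair can be formed
    return [(anchor, rng.choice(positives), rng.choice(negatives))
            for _ in range(count)]


# Toy tree distances from the anchor 541511 (illustrative values only).
distances_from_anchor = {"541512": 2, "541519": 2, "541611": 6, "111110": 10}
triplets = sample_triplets("541511", distances_from_anchor)
```

Every sampled triplet keeps the positive strictly closer to the anchor than the negative, which is the property the hierarchy-aware annotations are meant to guarantee.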

Requires

  - data/naics_descriptions.parquet - From the preprocess stage.
  - data/naics_distances.parquet - From the distances stage.
  - data/naics_relations.parquet - From the relations stage.

Output

data/naics_training_pairs/ - Directory of parquet files with triplets.

Example

Generate training triplets::

$ uv run naics-embedder data triplets

Training Commands

CLI commands for training NAICS embedding models.

The train command is the supported entry point and runs the Structure-Aware Dynamic Curriculum (SADC) workflow. The legacy sequential command is retained only for backward compatibility and is hidden from the public help output.

generate_embeddings_from_checkpoint(checkpoint_path, config, output_path=None, batch_size=32)

Generate hyperbolic embeddings parquet file from a trained checkpoint.

Loads a trained model checkpoint, runs inference on all NAICS codes, and writes the resulting embeddings to a parquet file compatible with HGCN training.

Parameters:

  checkpoint_path
    Type: str
    Description: Filesystem path to the PyTorch Lightning checkpoint that contains the trained contrastive model weights.
    Default: required

  config
    Type: Config
    Description: Project configuration containing data paths used for token caching and parquet loading.
    Default: required

  output_path
    Type: Optional[str]
    Description: Optional path for the embeddings parquet. When omitted, output/hyperbolic_projection/encodings.parquet is used.
    Default: None

  batch_size
    Type: int
    Description: Batch size to use during inference to balance throughput and memory usage.
    Default: 32

Returns:

  Type: str
  Description: Filesystem path to the generated embeddings parquet file.
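The documented defaults for output_path and batch_size can be sketched as follows. The helpers `resolve_output_path` and `iter_batches` are hypothetical illustrations of that behavior, not functions exposed by the package.

```python
from typing import Iterator, List, Optional, Sequence

# Default location documented for the embeddings parquet.
DEFAULT_OUTPUT_PATH = "output/hyperbolic_projection/encodings.parquet"


def resolve_output_path(output_path: Optional[str] = None) -> str:
    """Fall back to the documented default when no path is given."""
    return output_path if output_path is not None else DEFAULT_OUTPUT_PATH


def iter_batches(codes: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` codes for inference."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(codes), batch_size):
        yield list(codes[start:start + batch_size])
```

A larger batch_size means fewer forward passes but more memory per pass, which is the trade-off the parameter description refers to.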

train(config_file='conf/config.yaml', ckpt_path=None, skip_validation=False, overrides=None)

Train the NAICS text encoder with contrastive learning.

Orchestrates the complete training workflow including configuration loading, hardware detection, checkpoint management, and training execution with PyTorch Lightning. Supports resumption from checkpoints and runtime configuration overrides.

Parameters:

  config_file
    Type: Annotated[str, Option(--config, help='Path to base config YAML file')]
    Description: Path to the base YAML configuration file that describes data, model, and training settings. Defaults to conf/config.yaml.
    Default: 'conf/config.yaml'

  ckpt_path
    Type: Annotated[Optional[str], Option(--ckpt-path, help='Path to checkpoint file to resume from, or "last" to auto-detect last checkpoint')]
    Description: Optional checkpoint path to resume training. Use last to automatically pick up the latest checkpoint for the configured experiment. Specify a full path for cross-experiment resumption.
    Default: None

  skip_validation
    Type: Annotated[bool, Option(--skip-validation, help='Skip pre-flight validation of data files and cache')]
    Description: Skip pre-flight validation checks for data files and tokenization cache. Useful when you know files are valid.
    Default: False

  overrides
    Type: Annotated[Optional[List[str]], Argument(help="Config overrides (e.g., 'training.learning_rate=1e-4 data.batch_size=64')")]
    Description: Optional list of key-value override strings. Use dot notation to specify nested config values like training.learning_rate=1e-4.
    Default: None

Example

Train with default configuration::

$ uv run naics-embedder train

Resume from last checkpoint with custom learning rate::

$ uv run naics-embedder train --ckpt-path last training.learning_rate=1e-5
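The dot-notation overrides accepted by train can be pictured as updates to a nested dictionary. The sketch below is an illustration under that assumption. The functions `parse_value` and `apply_overrides` are hypothetical, and the real command may parse and type values differently.

```python
from typing import Any, Dict, List


def parse_value(raw: str) -> Any:
    """Interpret booleans, ints, and floats; keep anything else as a string."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply 'a.b.c=value' strings to a nested config dictionary in place."""
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"Expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})  # create missing sections on the way down
        node[leaf] = parse_value(raw)
    return config


config = {"training": {"learning_rate": 1e-4}}
apply_overrides(config, ["training.learning_rate=1e-5", "data.batch_size=64"])
```

After the call, `config["training"]["learning_rate"]` holds the float `1e-5`, and a new `data` section has been created with `batch_size` set to the integer `64`.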

Tools Commands

CLI utility commands for configuration, GPU optimization, and metrics analysis.

This module provides the tools command group with utilities for inspecting configuration, visualizing training metrics, and investigating model behavior.

Commands

  - config: Display current training configuration.
  - visualize: Generate visualizations from training log files.
  - investigate: Analyze hierarchy preservation metrics.

config(config_file='conf/config.yaml')

Display the current training and curriculum configuration.

Loads the specified configuration file and displays a formatted summary of all settings including data paths, model architecture, training hyperparameters, and loss function weights.

Parameters:

  config_file
    Type: Annotated[str, Option(--config, help='Path to base config YAML file')]
    Description: Path to the YAML configuration file to display. Defaults to conf/config.yaml.
    Default: 'conf/config.yaml'

Example

Display default configuration::

$ uv run naics-embedder tools config

Display custom configuration::

$ uv run naics-embedder tools config --config conf/custom.yaml

investigate(distance_matrix=None, config_file=None)

Analyze why hierarchy preservation correlations might be low.

Investigates potential causes of poor hierarchy preservation metrics by analyzing the ground truth distance matrix and the evaluation configuration, then provides diagnostic recommendations.

Use this command when training produces unexpectedly low hierarchy correlation metrics to identify configuration or data issues.
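As background for what a hierarchy correlation measures, the sketch below computes a plain Pearson correlation between ground truth tree distances and embedding distances for the same code pairs. This is an assumption for illustration: the command may use a different statistic (for example a rank correlation), and the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences of distances."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant distance vector makes the correlation undefined.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)
```

A ground truth matrix with little variation in its distances yields an undefined or unstable value under this definition, which is one kind of data issue worth ruling out when correlations look low.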

Parameters:

  distance_matrix
    Type: Annotated[Optional[str], Option(--distance-matrix, help='Path to ground truth distance matrix')]
    Description: Path to the ground truth distance matrix parquet. When omitted, uses the path from the configuration file.
    Default: None

  config_file
    Type: Annotated[Optional[str], Option(--config, help='Path to config file (default: conf/config.yaml)')]
    Description: Path to the configuration file. When omitted, uses the default conf/config.yaml.
    Default: None

Example

Investigate hierarchy metrics::

$ uv run naics-embedder tools investigate

Use custom distance matrix::

$ uv run naics-embedder tools investigate \
    --distance-matrix data/custom_distances.parquet

verify_stage4_command(stage3_parquet='./output/hyperbolic_projection/encodings.parquet', stage4_parquet='./output/hgcn/encodings.parquet', distance_matrix='./data/naics_distance_matrix.parquet', relations_parquet='./data/naics_relations.parquet', max_cophenetic_drop=0.02, max_ndcg_drop=0.01, min_local_improvement=0.05, ndcg_k=10, parent_top_k=1)

Compare Stage 3 and Stage 4 embeddings to ensure HGCN preserves global structure.

Computes cophenetic correlation, NDCG@K, and parent retrieval accuracy before/after HGCN refinement and enforces configurable degradation thresholds.
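The threshold logic can be sketched as below, using the default values from the signature. How each threshold maps to a metric is an assumption here, in particular that min_local_improvement applies to parent retrieval accuracy. `StageMetrics` and `check_thresholds` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageMetrics:
    cophenetic: float       # cophenetic correlation
    ndcg: float             # NDCG@K
    parent_accuracy: float  # parent retrieval accuracy


def check_thresholds(stage3: StageMetrics, stage4: StageMetrics,
                     max_cophenetic_drop: float = 0.02,
                     max_ndcg_drop: float = 0.01,
                     min_local_improvement: float = 0.05) -> List[str]:
    """Return the names of the checks that fail; an empty list means all pass."""
    failures = []
    if stage3.cophenetic - stage4.cophenetic > max_cophenetic_drop:
        failures.append("cophenetic")
    if stage3.ndcg - stage4.ndcg > max_ndcg_drop:
        failures.append("ndcg")
    # Assumed: the local improvement is measured on parent retrieval accuracy.
    if stage4.parent_accuracy - stage3.parent_accuracy < min_local_improvement:
        failures.append("local_improvement")
    return failures
```

With these defaults, a Stage 4 run that loses 0.05 of cophenetic correlation would be flagged, while a loss of 0.01 would pass.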

visualize(stage='02_text', log_file=None, output_dir=None)

Visualize training metrics from log files.

Parses training log files and generates visualizations showing the progression of key metrics including contrastive loss, hierarchy correlation, embedding statistics, and learning rate schedules.

Output visualizations are saved as PNG files in the specified output directory.

Parameters:

  stage
    Type: Annotated[str, Option(--stage, -s, help="Stage name to filter (e.g., '02_text')")]
    Description: Stage identifier used to filter metrics. Use this to focus on a specific training stage like 02_text.
    Default: '02_text'

  log_file
    Type: Annotated[Optional[str], Option(--log-file, help='Path to log file (default: logs/train_sequential.log)')]
    Description: Path to the training log file to parse. When omitted, defaults to logs/train_sequential.log.
    Default: None

  output_dir
    Type: Annotated[Optional[str], Option(--output-dir, help='Output directory for plots (default: outputs/visualizations/)')]
    Description: Directory for saving visualization files. When omitted, defaults to outputs/visualizations/.
    Default: None

Example

Visualize metrics from default log::

$ uv run naics-embedder tools visualize --stage 02_text

Visualize custom log file::

$ uv run naics-embedder tools visualize --log-file logs/train.log