CLI API¶
CLI command modules for NAICS Embedder.
This package organizes CLI commands into logical groups:

- data: Data generation and preprocessing commands
- tools: Utility tools for configuration, GPU optimization, and metrics
- training: Model training commands
Data Commands¶
CLI commands for NAICS data generation and preprocessing.
This module provides the data command group that orchestrates the data preparation pipeline. Commands should be run in order, or all at once via data all.
Pipeline Stages
- preprocess: Download and clean raw NAICS data files
- relations: Compute pairwise graph relationships
- distances: Compute pairwise graph distances
- triplets: Generate training triplets for contrastive learning
Commands
- preprocess: Download raw NAICS files and produce the descriptions parquet.
- relations: Build relationship annotations between all NAICS codes.
- distances: Compute tree distances between all NAICS codes.
- triplets: Generate (anchor, positive, negative) training triplets.
- all: Run the complete data generation pipeline.
all_data()¶
Run the complete data generation pipeline.
Executes all data preparation stages in order: preprocess, relations, distances, and triplets. This is the recommended way to prepare data for training from scratch.
Output
All data files required for training will be created in data/.
Example
Run the full pipeline::
$ uv run naics-embedder data all
Note
This command may take 10-30 minutes depending on your system.
Progress is logged to logs/data_all.log.
distances()¶
Compute pairwise graph distances between all NAICS codes.
Calculates tree distances in the NAICS hierarchy for every pair of codes. Distance is computed as the sum of edges traversed to reach the lowest common ancestor. Used for hierarchy preservation loss and evaluation.
Requires
data/naics_descriptions.parquet - From the preprocess stage.
Output
data/naics_distances.parquet - Pairwise distance annotations.
data/naics_distance_matrix.parquet - Sparse matrix representation.
Example
Compute all pairwise distances::
$ uv run naics-embedder data distances
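To make the distance definition concrete, here is a minimal sketch of LCA-based tree distance over prefix-structured NAICS codes. It is illustrative only; the helper names are hypothetical and the pipeline's actual implementation may differ.

```python
# Illustrative sketch of the LCA-based tree distance described above (hypothetical
# helper names; not the pipeline's actual code). NAICS codes are prefix-structured
# ("51" -> "511" -> "5111" -> "51111" -> "511110"), with a virtual root above the
# 2-digit sectors.

def node_depth(code: str) -> int:
    """Depth below the virtual root: 2-digit codes are depth 1, 6-digit codes depth 5."""
    return len(code) - 1

def lca_depth(code_a: str, code_b: str) -> int:
    """Depth of the lowest common ancestor of two codes."""
    shared = 0
    for ch_a, ch_b in zip(code_a, code_b):
        if ch_a != ch_b:
            break
        shared += 1
    # A common prefix shorter than 2 digits means the codes meet only at the root.
    return shared - 1 if shared >= 2 else 0

def tree_distance(code_a: str, code_b: str) -> int:
    """Sum of edges traversed from each code up to the lowest common ancestor."""
    lca = lca_depth(code_a, code_b)
    return (node_depth(code_a) - lca) + (node_depth(code_b) - lca)

assert tree_distance("111191", "111199") == 2   # two 6-digit codes under the same 5-digit parent
assert tree_distance("111110", "23") == 6       # codes that meet only at the virtual root
```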
preprocess()¶
Download and preprocess all raw NAICS data files.
Downloads the official 2022 NAICS taxonomy files from the U.S. Census Bureau and processes them into a unified descriptions parquet file.
The output file contains columns for code, title, description, examples, and exclusions for each NAICS code at all hierarchy levels (2-6 digit).
Output
data/naics_descriptions.parquet - Unified NAICS taxonomy data.
Example
Download and preprocess NAICS data::
$ uv run naics-embedder data preprocess
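As a quick sanity check on the output, the parquet can be inspected with pandas. This is an illustrative sketch; the column names follow the description above, but verify them against the actual file.

```python
# Illustrative sketch: inspect the preprocessed taxonomy file with pandas.
# Column names follow the description above; verify against the actual schema.
import pandas as pd

df = pd.read_parquet("data/naics_descriptions.parquet")
print(df.shape)
print(df.columns.tolist())  # expected to include: code, title, description, examples, exclusions

# Codes at every hierarchy level (2-6 digits) should be present.
print(df["code"].astype(str).str.len().value_counts().sort_index())
```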
relations()¶
Compute pairwise graph relationships between all NAICS codes.
Analyzes the NAICS hierarchy to determine relationship types between every pair of codes (child, sibling, cousin, etc.). These relationships are used for curriculum-based sampling during training.
Requires
data/naics_descriptions.parquet - From the preprocess stage.
Output
data/naics_relations.parquet - Pairwise relationship annotations.
data/naics_relation_matrix.parquet - Sparse matrix representation.
Example
Compute all pairwise relationships::
$ uv run naics-embedder data relations
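The prefix-based idea behind these relationship types can be sketched as follows; the label set and edge-case handling in the actual pipeline may differ.

```python
# Illustrative sketch of prefix-based relationship typing (labels are examples only;
# the pipeline's own relation taxonomy may differ).

def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        n += 1
    return n

def relation(a: str, b: str) -> str:
    if a == b:
        return "self"
    if a.startswith(b) or b.startswith(a):
        return "ancestor/descendant"          # parent/child or deeper nesting
    shared = shared_prefix_len(a, b)
    if len(a) == len(b) and shared == len(a) - 1:
        return "sibling"                      # same immediate parent
    if shared >= 2:
        return "cousin"                       # related somewhere below the same sector
    return "unrelated"                        # only the virtual root in common

print(relation("111191", "111199"))  # -> sibling
print(relation("111191", "1111"))    # -> ancestor/descendant
```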
triplets()¶
Generate (anchor, positive, negative) training triplets.
Creates training triplets for contrastive learning by sampling anchors from the NAICS taxonomy and pairing them with positive samples (related codes) and negative samples (distant codes).
Triplet generation uses the distance and relation annotations to ensure meaningful contrastive pairs that respect the hierarchical structure.
Requires
data/naics_descriptions.parquet - From the preprocess stage.
data/naics_distances.parquet - From the distances stage.
data/naics_relations.parquet - From the relations stage.
Output
data/naics_training_pairs/ - Directory of parquet files with triplets.
Example
Generate training triplets::
$ uv run naics-embedder data triplets
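A highly simplified sketch of the sampling idea follows; the column names and thresholds are assumptions for illustration, and the curriculum logic is defined by the pipeline itself.

```python
# Illustrative sketch of distance-based triplet sampling (not the pipeline's code).
# The column names (code_a, code_b, distance) and the thresholds are assumptions.
import random
import pandas as pd

pairs = pd.read_parquet("data/naics_distances.parquet")

def sample_triplet(anchor: str, pos_max_dist: int = 2, neg_min_dist: int = 6):
    """Pick a nearby code as the positive and a distant code as the negative."""
    rows = pairs[pairs["code_a"] == anchor]
    positives = rows.loc[rows["distance"] <= pos_max_dist, "code_b"].tolist()
    negatives = rows.loc[rows["distance"] >= neg_min_dist, "code_b"].tolist()
    return anchor, random.choice(positives), random.choice(negatives)
```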
Training Commands¶
CLI commands for training NAICS embedding models.
The train command is the supported entry point and runs the Structure-Aware Dynamic Curriculum (SADC) workflow. The legacy sequential command is retained only for backwards compatibility and is hidden from the public help output.
generate_embeddings_from_checkpoint(checkpoint_path, config, output_path=None, batch_size=32)¶
Generate hyperbolic embeddings parquet file from a trained checkpoint.
Loads a trained model checkpoint, runs inference on all NAICS codes, and writes the resulting embeddings to a parquet file compatible with HGCN training.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| checkpoint_path | str | Filesystem path to the PyTorch Lightning checkpoint that contains the trained contrastive model weights. | required |
| config | Config | Project configuration containing data paths used for token caching and parquet loading. | required |
| output_path | Optional[str] | Optional path for the embeddings parquet. When omitted, a default output path is used. | None |
| batch_size | int | Batch size to use during inference to balance throughput and memory usage. | 32 |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | Filesystem path to the generated embeddings parquet file. |
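A hedged usage sketch follows. The import paths and the Config loader shown are assumptions; only the function signature documented above is given.

```python
# Illustrative usage sketch; the import paths and Config loader below are assumptions,
# not documented API. Only the signature of generate_embeddings_from_checkpoint is documented.
from naics_embedder.cli.training import generate_embeddings_from_checkpoint  # hypothetical module path
from naics_embedder.config import Config                                     # hypothetical import path

config = Config.load("conf/config.yaml")  # hypothetical loader

embeddings_path = generate_embeddings_from_checkpoint(
    checkpoint_path="outputs/checkpoints/last.ckpt",  # example path
    config=config,
    output_path=None,   # fall back to the default output location
    batch_size=32,
)
print(embeddings_path)
```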
train(config_file='conf/config.yaml', ckpt_path=None, skip_validation=False, overrides=None)¶
Train the NAICS text encoder with contrastive learning.
Orchestrates the complete training workflow including configuration loading, hardware detection, checkpoint management, and training execution with PyTorch Lightning. Supports resumption from checkpoints and runtime configuration overrides.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config_file | Annotated[str, Option(--config, help='Path to base config YAML file')] | Path to the base YAML configuration file that describes data, model, and training settings. Defaults to conf/config.yaml. | 'conf/config.yaml' |
| ckpt_path | Annotated[Optional[str], Option(--ckpt-path, help='Path to checkpoint file to resume from, or "last" to auto-detect last checkpoint')] | Optional checkpoint path to resume training. Use "last" to auto-detect the most recent checkpoint. | None |
| skip_validation | Annotated[bool, Option(--skip-validation, help='Skip pre-flight validation of data files and cache')] | Skip pre-flight validation checks for data files and tokenization cache. Useful when you know files are valid. | False |
| overrides | Annotated[Optional[List[str]], Argument(help="Config overrides (e.g., 'training.learning_rate=1e-4 data.batch_size=64')")] | Optional list of key-value override strings. Use dot notation to specify nested config values like training.learning_rate=1e-4. | None |
Example
Train with default configuration::
$ uv run naics-embedder train
Resume from last checkpoint with custom learning rate::
$ uv run naics-embedder train --ckpt-path last training.learning_rate=1e-5
Tools Commands¶
CLI utility commands for configuration, GPU optimization, and metrics analysis.
This module provides the tools command group with utilities for inspecting
configuration, visualizing training metrics, and investigating model behavior.
Commands
- config: Display current training configuration.
- visualize: Generate visualizations from training log files.
- investigate: Analyze hierarchy preservation metrics.
config(config_file='conf/config.yaml')¶
Display the current training and curriculum configuration.
Loads the specified configuration file and displays a formatted summary of all settings including data paths, model architecture, training hyperparameters, and loss function weights.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config_file | Annotated[str, Option(--config, help='Path to base config YAML file')] | Path to the YAML configuration file to display. Defaults to conf/config.yaml. | 'conf/config.yaml' |
Example
Display default configuration::
$ uv run naics-embedder tools config
Display custom configuration::
$ uv run naics-embedder tools config --config conf/custom.yaml
investigate(distance_matrix=None, config_file=None)¶
Analyze why hierarchy preservation correlations might be low.
Investigates potential causes for poor hierarchy preservation metrics by analyzing the ground truth distance matrix, evaluation configuration, and providing diagnostic recommendations.
Use this command when training produces unexpectedly low hierarchy correlation metrics to identify configuration or data issues.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| distance_matrix | Annotated[Optional[str], Option(--distance-matrix, help='Path to ground truth distance matrix')] | Path to the ground truth distance matrix parquet. When omitted, uses the path from the configuration file. | None |
| config_file | Annotated[Optional[str], Option(--config, help='Path to config file (default: conf/config.yaml)')] | Path to the configuration file. When omitted, uses the default conf/config.yaml. | None |
Example
Investigate hierarchy metrics::
$ uv run naics-embedder tools investigate
Use custom distance matrix::
$ uv run naics-embedder tools investigate \
--distance-matrix data/custom_distances.parquet
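The kind of manual check this command automates can be sketched as below; the column name is an assumption about the distances parquet schema.

```python
# Illustrative diagnostic sketch: inspect the ground-truth distances that feed the
# hierarchy-correlation metrics. The "distance" column name is an assumption.
import pandas as pd

dist = pd.read_parquet("data/naics_distances.parquet")
print(dist["distance"].describe())                    # a near-constant distribution weakens rank correlations
print(dist["distance"].isna().sum(), "missing values")
print(dist["distance"].value_counts().sort_index())   # how many pairs fall at each tree distance
```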
verify_stage4_command(stage3_parquet='./output/hyperbolic_projection/encodings.parquet', stage4_parquet='./output/hgcn/encodings.parquet', distance_matrix='./data/naics_distance_matrix.parquet', relations_parquet='./data/naics_relations.parquet', max_cophenetic_drop=0.02, max_ndcg_drop=0.01, min_local_improvement=0.05, ndcg_k=10, parent_top_k=1)¶
Compare Stage 3 and Stage 4 embeddings to ensure HGCN preserves global structure.
Computes cophenetic correlation, NDCG@K, and parent retrieval accuracy before/after HGCN refinement and enforces configurable degradation thresholds.
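The pass/fail logic amounts to comparing before/after metrics against the configured thresholds, roughly as sketched below. The metric values are placeholders, and treating parent retrieval as the "local" metric is an assumption for illustration.

```python
# Illustrative sketch of the degradation-threshold check. Metric values are placeholders;
# the command computes them from the Stage 3 and Stage 4 parquets. Treating parent
# retrieval as the "local" metric is an assumption.
max_cophenetic_drop = 0.02
max_ndcg_drop = 0.01
min_local_improvement = 0.05

stage3 = {"cophenetic": 0.91, "ndcg@10": 0.88, "parent_top1": 0.74}  # placeholder metrics
stage4 = {"cophenetic": 0.90, "ndcg@10": 0.88, "parent_top1": 0.81}

checks = {
    "cophenetic drop within limit": stage3["cophenetic"] - stage4["cophenetic"] <= max_cophenetic_drop,
    "ndcg@10 drop within limit": stage3["ndcg@10"] - stage4["ndcg@10"] <= max_ndcg_drop,
    "local structure improved enough": stage4["parent_top1"] - stage3["parent_top1"] >= min_local_improvement,
}
for name, ok in checks.items():
    print(f"{name}: {'PASS' if ok else 'FAIL'}")
```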
visualize(stage='02_text', log_file=None, output_dir=None)¶
Visualize training metrics from log files.
Parses training log files and generates visualizations showing the progression of key metrics including contrastive loss, hierarchy correlation, embedding statistics, and learning rate schedules.
Output visualizations are saved as PNG files in the specified output directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| stage | Annotated[str, Option(--stage, -s, help="Stage name to filter (e.g., '02_text')")] | Stage identifier used to filter metrics. Use this to focus on a specific training stage like '02_text'. | '02_text' |
| log_file | Annotated[Optional[str], Option(--log-file, help='Path to log file (default: logs/train_sequential.log)')] | Path to the training log file to parse. When omitted, defaults to logs/train_sequential.log. | None |
| output_dir | Annotated[Optional[str], Option(--output-dir, help='Output directory for plots (default: outputs/visualizations/)')] | Directory for saving visualization files. When omitted, defaults to outputs/visualizations/. | None |
Example
Visualize metrics from default log::
$ uv run naics-embedder tools visualize --stage 02_text
Visualize custom log file::
$ uv run naics-embedder tools visualize --log-file logs/train.log