CLI API¶
CLI command modules for NAICS Embedder.
This package organizes CLI commands into logical groups:

- data: Data generation and preprocessing commands
- tools: Utility tools for configuration, GPU optimization, and metrics
- training: Model training commands
Data Commands¶
CLI commands for NAICS data generation and preprocessing.
This module provides the data command group that orchestrates the data preparation pipeline. Commands should be run in order, or all at once via data all.
Pipeline Stages
- preprocess: Download and clean raw NAICS data files
- relations: Compute pairwise graph relationships
- distances: Compute pairwise graph distances
- triplets: Generate training triplets for contrastive learning
Commands
- preprocess: Download raw NAICS files and produce the descriptions parquet.
- relations: Build relationship annotations between all NAICS codes.
- distances: Compute tree distances between all NAICS codes.
- triplets: Generate (anchor, positive, negative) training triplets.
- all: Run the complete data generation pipeline.
all_data()¶
Run the complete data generation pipeline.
Executes all data preparation stages in order: preprocess, relations, distances, and triplets. This is the recommended way to prepare data for training from scratch.
Output
All data files required for training will be created in data/.
Example
Run the full pipeline::
$ uv run naics-embedder data all
Note
This command may take 10-30 minutes depending on your system.
Progress is logged to logs/data_all.log.
distances()¶
Compute pairwise graph distances between all NAICS codes.
Calculates tree distances in the NAICS hierarchy for every pair of codes. Distance is computed as the sum of edges traversed to reach the lowest common ancestor. Used for hierarchy preservation loss and evaluation.
Requires
data/naics_descriptions.parquet - From the preprocess stage.
Output
data/naics_distances.parquet - Pairwise distance annotations.
data/naics_distance_matrix.parquet - Sparse matrix representation.
Example
Compute all pairwise distances::
$ uv run naics-embedder data distances
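To make the distance definition concrete, here is a minimal sketch of LCA-based tree distance over prefix-structured NAICS codes. It is illustrative only; the helper names are hypothetical and the pipeline's actual implementation may differ.

```python
# Illustrative sketch of the LCA-based tree distance described above (hypothetical
# helper names; not the pipeline's actual code). NAICS codes are prefix-structured
# ("51" -> "511" -> "5111" -> "51111" -> "511110"), with a virtual root above the
# 2-digit sectors.

def node_depth(code: str) -> int:
    """Depth below the virtual root: 2-digit codes are depth 1, 6-digit codes depth 5."""
    return len(code) - 1

def lca_depth(code_a: str, code_b: str) -> int:
    """Depth of the lowest common ancestor of two codes."""
    shared = 0
    for ch_a, ch_b in zip(code_a, code_b):
        if ch_a != ch_b:
            break
        shared += 1
    # A common prefix shorter than 2 digits means the codes meet only at the root.
    return shared - 1 if shared >= 2 else 0

def tree_distance(code_a: str, code_b: str) -> int:
    """Sum of edges traversed from each code up to the lowest common ancestor."""
    lca = lca_depth(code_a, code_b)
    return (node_depth(code_a) - lca) + (node_depth(code_b) - lca)

assert tree_distance("111191", "111199") == 2   # two 6-digit codes under the same 5-digit parent
assert tree_distance("111110", "23") == 6       # codes that meet only at the virtual root
```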
preprocess()¶
Download and preprocess all raw NAICS data files.
Downloads the official 2022 NAICS taxonomy files from the U.S. Census Bureau and processes them into a unified descriptions parquet file.
The output file contains columns for code, title, description, examples, and exclusions for each NAICS code at all hierarchy levels (2-6 digit).
Output
data/naics_descriptions.parquet - Unified NAICS taxonomy data.
Example
Download and preprocess NAICS data::
$ uv run naics-embedder data preprocess
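As a quick sanity check on the output, the parquet can be inspected with pandas. This is an illustrative sketch; the column names follow the description above, but verify them against the actual file.

```python
# Illustrative sketch: inspect the preprocessed taxonomy file with pandas.
# Column names follow the description above; verify against the actual schema.
import pandas as pd

df = pd.read_parquet("data/naics_descriptions.parquet")
print(df.shape)
print(df.columns.tolist())  # expected to include: code, title, description, examples, exclusions

# Codes at every hierarchy level (2-6 digits) should be present.
print(df["code"].astype(str).str.len().value_counts().sort_index())
```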
relations()¶
Compute pairwise graph relationships between all NAICS codes.
Analyzes the NAICS hierarchy to determine relationship types between every pair of codes (child, sibling, cousin, etc.). These relationships are used for curriculum-based sampling during training.
Requires
data/naics_descriptions.parquet - From the preprocess stage.
Output
data/naics_relations.parquet - Pairwise relationship annotations.
data/naics_relation_matrix.parquet - Sparse matrix representation.
Example
Compute all pairwise relationships::
$ uv run naics-embedder data relations
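The prefix-based idea behind these relationship types can be sketched as follows; the label set and edge-case handling in the actual pipeline may differ.

```python
# Illustrative sketch of prefix-based relationship typing (labels are examples only;
# the pipeline's own relation taxonomy may differ).

def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        n += 1
    return n

def relation(a: str, b: str) -> str:
    if a == b:
        return "self"
    if a.startswith(b) or b.startswith(a):
        return "ancestor/descendant"          # parent/child or deeper nesting
    shared = shared_prefix_len(a, b)
    if len(a) == len(b) and shared == len(a) - 1:
        return "sibling"                      # same immediate parent
    if shared >= 2:
        return "cousin"                       # related somewhere below the same sector
    return "unrelated"                        # only the virtual root in common

print(relation("111191", "111199"))  # -> sibling
print(relation("111191", "1111"))    # -> ancestor/descendant
```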
triplets()¶
Generate (anchor, positive, negative) training triplets.
Creates training triplets for contrastive learning by sampling anchors from the NAICS taxonomy and pairing them with positive samples (related codes) and negative samples (distant codes).
Triplet generation uses the distance and relation annotations to ensure meaningful contrastive pairs that respect the hierarchical structure.
Requires
data/naics_descriptions.parquet - From the preprocess stage.
data/naics_distances.parquet - From the distances stage.
data/naics_relations.parquet - From the relations stage.
Output
data/naics_training_pairs/ - Directory of parquet files with triplets.
Example
Generate training triplets::
$ uv run naics-embedder data triplets
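A highly simplified sketch of the sampling idea follows; the column names and thresholds are assumptions for illustration, and the curriculum logic is defined by the pipeline itself.

```python
# Illustrative sketch of distance-based triplet sampling (not the pipeline's code).
# The column names (code_a, code_b, distance) and the thresholds are assumptions.
import random
import pandas as pd

pairs = pd.read_parquet("data/naics_distances.parquet")

def sample_triplet(anchor: str, pos_max_dist: int = 2, neg_min_dist: int = 6):
    """Pick a nearby code as the positive and a distant code as the negative."""
    rows = pairs[pairs["code_a"] == anchor]
    positives = rows.loc[rows["distance"] <= pos_max_dist, "code_b"].tolist()
    negatives = rows.loc[rows["distance"] >= neg_min_dist, "code_b"].tolist()
    return anchor, random.choice(positives), random.choice(negatives)
```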
Training Commands¶
CLI commands for training NAICS embedding models.
The train command is the supported entry point and runs the Structure-Aware Dynamic Curriculum (SADC) workflow. The legacy sequential command is retained only for backwards compatibility and is hidden from the public help output.
generate_embeddings_from_checkpoint(checkpoint_path, config, output_path=None, batch_size=32)¶
Generate hyperbolic embeddings parquet file from a trained checkpoint.
Loads a trained model checkpoint, runs inference on all NAICS codes, and writes the resulting embeddings to a parquet file compatible with HGCN training.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| checkpoint_path | str | Filesystem path to the PyTorch Lightning checkpoint that contains the trained contrastive model weights. | required |
| config | Config | Project configuration containing data paths used for token caching and parquet loading. | required |
| output_path | Optional[str] | Optional path for the embeddings parquet. When omitted, a default output path is used. | None |
| batch_size | int | Batch size to use during inference to balance throughput and memory usage. | 32 |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | Filesystem path to the generated embeddings parquet file. |
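A hedged usage sketch follows. The import paths and the Config loader shown are assumptions; only the function signature documented above is given.

```python
# Illustrative usage sketch; the import paths and Config loader below are assumptions,
# not documented API. Only the signature of generate_embeddings_from_checkpoint is documented.
from naics_embedder.cli.training import generate_embeddings_from_checkpoint  # hypothetical module path
from naics_embedder.config import Config                                     # hypothetical import path

config = Config.load("conf/config.yaml")  # hypothetical loader

embeddings_path = generate_embeddings_from_checkpoint(
    checkpoint_path="outputs/checkpoints/last.ckpt",  # example path
    config=config,
    output_path=None,   # fall back to the default output location
    batch_size=32,
)
print(embeddings_path)
```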
train(config_file='conf/config.yaml', ckpt_path=None, skip_validation=False, overrides=None)¶
Train the NAICS text encoder with contrastive learning.
Orchestrates the complete training workflow including configuration loading, hardware detection, checkpoint management, and training execution with PyTorch Lightning. Supports resumption from checkpoints and runtime configuration overrides.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config_file | Annotated[str, Option(--config, help='Path to base config YAML file')] | Path to the base YAML configuration file that describes data, model, and training settings. Defaults to conf/config.yaml. | 'conf/config.yaml' |
| ckpt_path | Annotated[Optional[str], Option(--ckpt-path, help='Path to checkpoint file to resume from, or "last" to auto-detect last checkpoint')] | Optional checkpoint path to resume training. Use "last" to auto-detect the most recent checkpoint. | None |
| skip_validation | Annotated[bool, Option(--skip-validation, help='Skip pre-flight validation of data files and cache')] | Skip pre-flight validation checks for data files and tokenization cache. Useful when you know files are valid. | False |
| overrides | Annotated[Optional[List[str]], Argument(help="Config overrides (e.g., 'training.learning_rate=1e-4 data.batch_size=64')")] | Optional list of key-value override strings. Use dot notation to specify nested config values like training.learning_rate=1e-4. | None |
Example
Train with default configuration::
$ uv run naics-embedder train
Resume from last checkpoint with custom learning rate::
$ uv run naics-embedder train --ckpt-path last training.learning_rate=1e-5
Tools Commands¶
CLI utility commands for configuration, GPU optimization, and metrics analysis.
This module provides the tools command group with utilities for inspecting
configuration, visualizing training metrics, and investigating model behavior.
Commands
- config: Display current training configuration.
- visualize: Generate visualizations from training log files.
- investigate: Analyze hierarchy preservation metrics.
config(config_file='conf/config.yaml')¶
Display the current training and curriculum configuration.
Loads the specified configuration file and displays a formatted summary of all settings including data paths, model architecture, training hyperparameters, and loss function weights.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config_file | Annotated[str, Option(--config, help='Path to base config YAML file')] | Path to the YAML configuration file to display. Defaults to conf/config.yaml. | 'conf/config.yaml' |
Example
Display default configuration::
$ uv run naics-embedder tools config
Display custom configuration::
$ uv run naics-embedder tools config --config conf/custom.yaml
investigate(distance_matrix=None, config_file=None)¶
Analyze why hierarchy preservation correlations might be low.
Investigates potential causes for poor hierarchy preservation metrics by analyzing the ground truth distance matrix, evaluation configuration, and providing diagnostic recommendations.
Use this command when training produces unexpectedly low hierarchy correlation metrics to identify configuration or data issues.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| distance_matrix | Annotated[Optional[str], Option(--distance-matrix, help='Path to ground truth distance matrix')] | Path to the ground truth distance matrix parquet. When omitted, uses the path from the configuration file. | None |
| config_file | Annotated[Optional[str], Option(--config, help='Path to config file (default: conf/config.yaml)')] | Path to the configuration file. When omitted, uses the default conf/config.yaml. | None |
Example
Investigate hierarchy metrics::
$ uv run naics-embedder tools investigate
Use custom distance matrix::
$ uv run naics-embedder tools investigate \
--distance-matrix data/custom_distances.parquet
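The kind of manual check this command automates can be sketched as below; the column name is an assumption about the distances parquet schema.

```python
# Illustrative diagnostic sketch: inspect the ground-truth distances that feed the
# hierarchy-correlation metrics. The "distance" column name is an assumption.
import pandas as pd

dist = pd.read_parquet("data/naics_distances.parquet")
print(dist["distance"].describe())                    # a near-constant distribution weakens rank correlations
print(dist["distance"].isna().sum(), "missing values")
print(dist["distance"].value_counts().sort_index())   # how many pairs fall at each tree distance
```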
verify_stage4_command(stage3_parquet='./output/hyperbolic_projection/encodings.parquet', stage4_parquet='./output/hgcn/encodings.parquet', distance_matrix='./data/naics_distance_matrix.parquet', relations_parquet='./data/naics_relations.parquet', max_cophenetic_drop=0.02, max_ndcg_drop=0.01, min_local_improvement=0.05, ndcg_k=10, parent_top_k=1)¶
Compare Stage 3 and Stage 4 embeddings to ensure HGCN preserves global structure.
Computes cophenetic correlation, NDCG@K, and parent retrieval accuracy before/after HGCN refinement and enforces configurable degradation thresholds.
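The pass/fail logic amounts to comparing before/after metrics against the configured thresholds, roughly as sketched below. The metric values are placeholders, and treating parent retrieval as the "local" metric is an assumption for illustration.

```python
# Illustrative sketch of the degradation-threshold check. Metric values are placeholders;
# the command computes them from the Stage 3 and Stage 4 parquets. Treating parent
# retrieval as the "local" metric is an assumption.
max_cophenetic_drop = 0.02
max_ndcg_drop = 0.01
min_local_improvement = 0.05

stage3 = {"cophenetic": 0.91, "ndcg@10": 0.88, "parent_top1": 0.74}  # placeholder metrics
stage4 = {"cophenetic": 0.90, "ndcg@10": 0.88, "parent_top1": 0.81}

checks = {
    "cophenetic drop within limit": stage3["cophenetic"] - stage4["cophenetic"] <= max_cophenetic_drop,
    "ndcg@10 drop within limit": stage3["ndcg@10"] - stage4["ndcg@10"] <= max_ndcg_drop,
    "local structure improved enough": stage4["parent_top1"] - stage3["parent_top1"] >= min_local_improvement,
}
for name, ok in checks.items():
    print(f"{name}: {'PASS' if ok else 'FAIL'}")
```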
visualize(stage='02_text', log_file=None, output_dir=None)¶
Visualize training metrics from log files.
Parses training log files and generates visualizations showing the progression of key metrics including contrastive loss, hierarchy correlation, embedding statistics, and learning rate schedules.
Output visualizations are saved as PNG files in the specified output directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| stage | Annotated[str, Option(--stage, -s, help="Stage name to filter (e.g., '02_text')")] | Stage identifier used to filter metrics. Use this to focus on a specific training stage like '02_text'. | '02_text' |
| log_file | Annotated[Optional[str], Option(--log-file, help='Path to log file (default: logs/train_sequential.log)')] | Path to the training log file to parse. When omitted, defaults to logs/train_sequential.log. | None |
| output_dir | Annotated[Optional[str], Option(--output-dir, help='Output directory for plots (default: outputs/visualizations/)')] | Directory for saving visualization files. When omitted, defaults to outputs/visualizations/. | None |
Example
Visualize metrics from default log::
$ uv run naics-embedder tools visualize --stage 02_text
Visualize custom log file::
$ uv run naics-embedder tools visualize --log-file logs/train.log