NAICS Hyperbolic Embedding System

Unified Framework for Hierarchical Representation Learning


Table of Contents

  1. Overview
  2. System Architecture Overview
  3. Multi-Channel Text Encoding
  4. Mixture-of-Experts Fusion
  5. Hyperbolic Geometry & Lorentz Model
  6. Contrastive Learning Framework
  7. Sampling Strategies
  8. Structure-Aware Dynamic Curriculum (SADC)
  9. False Negative Mitigation
  10. Additional Loss Components
  11. Evaluation Metrics
  12. Distributed Training
  13. Sampling Architecture
  14. Implementation Reference

1. Overview

The NAICS Hyperbolic Embedding System is a unified framework for learning hierarchical representations of the North American Industry Classification System (NAICS) taxonomy. The system addresses a fundamental challenge in representation learning: embedding tree-structured categorical data into a continuous vector space while preserving hierarchical relationships.

Unlike standard classification approaches that treat categories as equidistant entities, this system recognizes that NAICS codes exist within a rich taxonomic structure spanning from broad Sectors (2-digit) to precise National Industries (6-digit). The semantic distance between sibling codes like 541511 (Custom Computer Programming) and 541512 (Computer Systems Design) is fundamentally different from the distance to 111110 (Soybean Farming).

Key Architectural Decisions

1. Hyperbolic Geometry (Lorentz Model): Euclidean space is geometrically incompatible with tree structures—tree nodes grow exponentially with depth while Euclidean volume grows only polynomially. Hyperbolic space, with its exponential volume growth, provides a natural, low-distortion embedding environment for hierarchies. The Lorentz model is chosen over the Poincaré ball for its superior numerical stability.

2. Mixture-of-Experts Fusion: Each NAICS code has four text channels (title, description, examples, excluded) with heterogeneous informativeness. MoE with Top-2 gating enables learning multiple specialized fusion strategies, allowing different experts to handle different types of codes.

3. Curriculum-Based Training: A three-phase Structure-Aware Dynamic Curriculum (SADC) progressively introduces complexity: structural initialization → geometric refinement → false negative mitigation.

4. Decoupled Contrastive Learning: DCL provides better gradient flow and numerical stability compared to standard InfoNCE, with the loss computed as:

L = (-pos_sim + logsumexp(neg_sims)).mean()

2. System Architecture Overview

The system consists of four sequential stages, each designed to preserve or enhance the hierarchical geometry of NAICS codes:

Stage Component Output
1 Multi-Channel Text Encoding (4 LoRA-adapted transformers) E_title, E_desc, E_examples, E_excluded (4 × embedding_dim)
2 Mixture-of-Experts Fusion (Top-2 gating, 4 experts) E_fused (embedding_dim)
3 Hyperbolic Projection (Lorentz exponential map) E_hyp (embedding_dim + 1)
4 Contrastive Learning (DCL + auxiliary losses) Trained embeddings on Lorentz hyperboloid

Data Flow Diagram

NAICS Code (4 text channels)
        ↓
[Multi-Channel Encoder]
    ├─→ Title Encoder (LoRA) → E_title
    ├─→ Description Encoder (LoRA) → E_desc
    ├─→ Examples Encoder (LoRA) → E_examples
    └─→ Excluded Encoder (LoRA) → E_excluded
        ↓
[Concatenate] → (embedding_dim × 4)
        ↓
[MoE Fusion] → E_fused (embedding_dim)
        ↓
[Hyperbolic Projection] → E_hyp (embedding_dim + 1)
        ↓
[Lorentz Hyperboloid] → Final Embedding

3. Multi-Channel Text Encoding

Each NAICS code is characterized by four distinct text fields, each providing complementary information about the industry classification:

Channel Content Purpose
Title Short code name (e.g., "Software Publishers") Concise category identification
Description Detailed explanation of what the code encompasses Rich semantic content
Examples Representative businesses in this category Concrete instantiations
Excluded Codes explicitly NOT in this category Disambiguation and boundaries

LoRA Adaptation

Each channel uses a separate LoRA-adapted transformer encoder based on sentence-transformers. LoRA (Low-Rank Adaptation) reduces trainable parameters while maintaining expressiveness:

Parameter Default Value Description
base_model all-mpnet-base-v2 Pre-trained sentence transformer
lora_r 8 LoRA rank (lower = fewer parameters)
lora_alpha 16 LoRA scaling factor
lora_dropout 0.1 Dropout rate for regularization
target_modules all-linear Universal targeting for any transformer

Gradient checkpointing is enabled by default to reduce memory usage during backpropagation, which is critical for large batch sizes or limited GPU memory.
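As a concrete illustration, the snippet below shows how one channel encoder could be LoRA-adapted with the Hugging Face transformers and peft libraries using the defaults above. It is a minimal sketch, not the project's actual encoder class; the variable names are illustrative.

from peft import LoraConfig, get_peft_model
from transformers import AutoModel

base = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')
base.gradient_checkpointing_enable()  # trade compute for memory during backprop

lora_cfg = LoraConfig(
    r=8,                          # LoRA rank
    lora_alpha=16,                # scaling factor
    lora_dropout=0.1,             # regularization
    target_modules='all-linear',  # adapt every linear layer in the transformer
)
title_encoder = get_peft_model(base, lora_cfg)  # repeat (with a fresh base) for each of the four channels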


4. Mixture-of-Experts Fusion

The relative importance of text channels varies across NAICS codes. For some codes, the title and description suffice; for others, the examples are most illustrative; for nuanced codes, the excluded field is critical for disambiguation. A static fusion strategy cannot adapt to this heterogeneity.

Why MoE Over Alternatives

Three fusion strategies were evaluated: learned weighted average (static), gated attention (continuous dynamic control), and Mixture-of-Experts (discrete dynamic selection). MoE provides the most powerful paradigm because it offers coarse-grained selection between multiple specialized processing paths, not just dynamic weighting.

The MoE framework allows the model to effectively perform a learned architectural search, discovering experts optimized for different input types. One expert might specialize in ambiguity resolution (up-weighting the excluded channel), while another becomes a general classification expert focused on title and description.

Architecture

Component Configuration Function
Input embedding_dim × 4 Concatenated channel embeddings
Gating Network Linear(input → num_experts) Computes expert selection scores
Top-K Selection k = 2 Selects 2 most relevant experts
Expert Networks 4 × 2-layer MLP Linear→ReLU→Dropout→Linear
Hidden Dim 1024 Expert network hidden dimension
Output Projection Linear(input → embedding_dim) Projects back to embedding space
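A minimal sketch of this fusion layer is shown below, assuming the configuration in the table above; the expert input/output widths and the class name are assumptions, not the project's MixtureOfExperts implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFusionSketch(nn.Module):
    """Illustrative Top-2 MoE fusion over the four concatenated channel embeddings."""

    def __init__(self, embedding_dim=768, num_experts=4, top_k=2, hidden_dim=1024, dropout=0.1):
        super().__init__()
        in_dim = embedding_dim * 4                       # concatenated channels
        self.gate = nn.Linear(in_dim, num_experts)       # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                nn.Dropout(dropout), nn.Linear(hidden_dim, in_dim),
            )
            for _ in range(num_experts)
        ])
        self.out = nn.Linear(in_dim, embedding_dim)      # project back to embedding space
        self.top_k = top_k

    def forward(self, x):                                # x: (B, 4 * embedding_dim)
        probs = F.softmax(self.gate(x), dim=-1)          # (B, num_experts)
        top_p, top_i = probs.topk(self.top_k, dim=-1)    # Top-2 selection
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalize selected weights
        fused = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):        # weighted sum of selected experts
            sel = (top_i == e)                           # (B, top_k) routing mask
            if sel.any():
                weight = (top_p * sel).sum(dim=-1, keepdim=True)  # 0 where expert unused
                rows = sel.any(dim=-1)
                fused[rows] += weight[rows] * expert(x[rows])
        return self.out(fused), probs                    # probs feed the load balancing loss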

Load Balancing Loss

Without correction, gating networks favor a small subset of "winning" experts, causing mode collapse. An auxiliary load balancing loss ensures even utilization:

L_aux = α · N · Σ(f_i · P_i)

Where N is the number of experts, α = 0.01 (default coefficient), f_i is the fraction of tokens routed to expert i, and P_i is the average gating probability for expert i.

Global-Batch vs. Micro-Batch Statistics

A critical implementation detail: the auxiliary loss must be calculated on global-batch statistics, not micro-batch. Micro-batch balancing forces the router to balance within each sequence, hindering domain specialization. Global-batch balancing allows the router to send all "manufacturing" codes to Expert 1 and all "healthcare" codes to Expert 2, as long as total utilization remains balanced across the entire diverse batch.

This requires synchronizing expert utilization counts (f_i) and router probabilities (P_i) across all distributed workers via AllReduce before computing the loss.
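A sketch of this computation, assuming the gate's softmax probabilities and Top-k expert indices are available; function and variable names are illustrative, and the autograd-aware reduction from torch.distributed.nn is used so the gradient path to the router survives the synchronization of P_i.

import torch
import torch.distributed as dist
import torch.distributed.nn.functional as dist_fn

def load_balancing_loss(router_probs, top_idx, num_experts=4, alpha=0.01):
    """Switch-style auxiliary loss on global-batch statistics (illustrative)."""
    # f_i: fraction of assignments routed to expert i (hard counts, no gradient)
    f = torch.bincount(top_idx.reshape(-1), minlength=num_experts).float() / top_idx.numel()
    # P_i: mean gating probability per expert (keeps the gradient to the router)
    p = router_probs.mean(dim=0)

    if dist.is_available() and dist.is_initialized():
        world = dist.get_world_size()
        dist.all_reduce(f)                    # sum counts across workers
        f = f / world
        p = dist_fn.all_reduce(p) / world     # autograd-aware reduction for P_i

    return alpha * num_experts * (f * p).sum()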


5. Hyperbolic Geometry & Lorentz Model

The Geometric Mismatch Problem

Attempting to embed hierarchical data into Euclidean space faces a fundamental geometric incompatibility: the number of nodes in a tree grows exponentially with depth (~ b^L for branching factor b and depth L), while the volume of a Euclidean ball grows only polynomially with radius (~ r^d). This disparity inevitably leads to distortion.

Hyperbolic geometry provides a principled solution. Hyperbolic spaces have constant negative curvature, causing volume to grow exponentially with radius (~ e^r). This makes hyperbolic space a natural, parsimonious, low-distortion environment for embedding hierarchies.

The Lorentz Model

Two common models of hyperbolic space are the Poincaré Ball and the Lorentz (Hyperboloid) Model. The Lorentz model is chosen for its superior numerical stability—the Poincaré model suffers from "the NaN problem" as embeddings approach the boundary.

The Lorentz model represents points as (x₀, x₁, ..., xₙ) on a hyperboloid satisfying:

-x₀² + x₁² + ... + xₙ² = -1/c

Where x₀ is the time coordinate (hyperbolic radius), x₁...xₙ are spatial coordinates, and c is the curvature parameter (default: c = 1.0).

Key Operations

Lorentz Inner Product:

⟨u, v⟩_L = u₁v₁ + ... + uₙvₙ - u₀v₀

Lorentzian Distance (Geodesic):

d(u, v) = (1/√c) · arccosh(-c · ⟨u, v⟩_L)

Exponential Map (Tangent → Hyperboloid):

x₀ = cosh(√c · ||v||) / √c
x_rest = sinh(√c · ||v||) · v / (√c · ||v||)
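For reference, these operations translate directly into PyTorch. The sketch below is a standalone illustration with basic clamping for numerical stability, not the project's LorentzOps module.

import torch

def minkowski_dot(u, v):
    # <u, v>_L = -u0*v0 + u1*v1 + ... + un*vn
    return -u[..., 0] * v[..., 0] + (u[..., 1:] * v[..., 1:]).sum(dim=-1)

def lorentz_distance(u, v, c=1.0):
    # d(u, v) = (1/sqrt(c)) * arccosh(-c * <u, v>_L)
    inner = torch.clamp(-c * minkowski_dot(u, v), min=1.0 + 1e-7)
    return torch.acosh(inner) / c ** 0.5

def exp_map_zero(v, c=1.0):
    # Map a tangent vector v (spatial part only) at the origin onto the hyperboloid;
    # the output gains the time coordinate, so dim -> dim + 1.
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-7)
    x0 = torch.cosh(sqrt_c * norm) / sqrt_c
    x_rest = torch.sinh(sqrt_c * norm) * v / (sqrt_c * norm)
    return torch.cat([x0, x_rest], dim=-1)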

Hyperbolic Projection Implementation

The fused Euclidean embedding is projected onto the hyperboloid via a linear projection followed by the exponential map at the origin. The projection adds the time coordinate dimension (embedding_dim → embedding_dim + 1) and ensures points satisfy the Lorentz constraint through numerically stable clamping.
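A minimal sketch of such a projection module, reusing exp_map_zero from the snippet above; the class name is illustrative, not the project's HyperbolicProjection.

import torch.nn as nn

class HyperbolicProjectionSketch(nn.Module):
    """Project the fused Euclidean embedding onto the Lorentz hyperboloid."""

    def __init__(self, embedding_dim, c=1.0):
        super().__init__()
        self.proj = nn.Linear(embedding_dim, embedding_dim)  # tangent-space projection
        self.c = c

    def forward(self, e_fused):            # (B, embedding_dim)
        v = self.proj(e_fused)             # tangent vector at the origin
        return exp_map_zero(v, self.c)     # (B, embedding_dim + 1) on the hyperboloid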

Hyperbolic Utilities (utils/hyperbolic.py)

The system provides high-level abstractions for hyperbolic geometry through the utils/hyperbolic module:

LorentzManifold: Extended Lorentz operations with validation and projection.

  • minkowski_dot: Minkowski inner product ⟨x, y⟩_L
  • lorentz_norm_squared: Squared Lorentz norm ⟨x, x⟩_L
  • project_to_hyperboloid: Ensures constraint ⟨x, x⟩_L = -1/c
  • check_on_manifold: Validates points lie on hyperboloid
  • exp_map_zero / log_map_zero: Maps between tangent space and manifold
  • distance: Geodesic distance computation
  • parallel_transport: Transport tangent vectors between points

CurvatureManager: Phase-aware curvature management.

Phase Curvature Behavior
Phase 1 Fixed high (2.0) Anchoring structure
Phases 2-4 Learnable Adapts to data

ManifoldAdapter: Wrapper for consistent hyperbolic operations with automatic projection and validation.

from naics_embedder.utils.hyperbolic import ManifoldAdapter, CurvatureConfig

adapter = ManifoldAdapter(
    curvature_config=CurvatureConfig(phase1_curvature=2.0),
    validate_manifold=True,
    auto_project=True,
)
adapter.set_phase(2)  # Enable learnable curvature
x_hyp = adapter.to_hyperboloid(tangent_vectors)

6. Contrastive Learning Framework

Decoupled Contrastive Learning (DCL)

The system uses Decoupled Contrastive Learning rather than standard InfoNCE. DCL decouples the positive and negative terms for improved gradient flow and numerical stability:

pos_sim = -d(anchor, positive) / τ
neg_sims = [-d(anchor, negative_i) / τ for all i]
L = (-pos_sim + logsumexp(neg_sims)).mean()

Where τ is the temperature parameter (default: 0.07). In hyperbolic space, similarity is defined as negative Lorentzian distance.
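Combined with the Lorentz operations sketched in Section 5, the loss can be written as follows; this is an illustrative sketch, not the HyperbolicInfoNCELoss module itself.

import torch

def dcl_loss(anchor, positive, negatives, tau=0.07, c=1.0):
    # anchor, positive: (B, D+1); negatives: (B, K, D+1), all on the hyperboloid
    pos_sim = -lorentz_distance(anchor, positive, c) / tau                  # (B,)
    neg_sims = -lorentz_distance(anchor.unsqueeze(1), negatives, c) / tau   # (B, K)
    return (-pos_sim + torch.logsumexp(neg_sims, dim=-1)).mean()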

Key Differences from InfoNCE

Aspect InfoNCE DCL
Formulation log(exp(pos) / Σexp(all)) -pos + logsumexp(neg)
Coupling Positive in denominator Decoupled terms
Loss Range Always ≥ 0 Can be negative
Gradient Flow Coupled gradients Independent gradients

Gradient Analysis

The gradient magnitude with respect to a negative sample n is proportional to its probability weight in the softmax distribution:

w_in = exp(-d_L(z_i, z_n) / τ) / Z_i

This mathematical structure dictates the informational value of negatives. Easy negatives (d >> d_pos) contribute near-zero gradient. Hard negatives (d ≈ d_pos) provide strong learning signal. Collapsing negatives (d < d_pos) represent current errors and yield maximal gradients.

False Negative Masking

When false negatives are detected (samples from different but semantically related classes), they are masked from the loss computation using the elimination strategy—setting their similarities to -∞ rather than re-categorizing them as positives. This is more robust to noise in pseudo-labels.

neg_similarities = neg_similarities.masked_fill(false_negative_mask, float('-inf'))

7. Sampling Strategies

The sampling strategy fundamentally governs learning dynamics. In dense hierarchical taxonomies like NAICS, the definition of "negative" is fluid and context-dependent.

The Gradient-Semantic Trade-off

Standard contrastive learning treats all negatives equally. However, a model initialized with random weights will immediately separate "Farming" from "Programming" based on coarse lexical features. Triplets with distant negatives quickly satisfy the margin condition, driving loss to zero and extinguishing gradient signal.

To learn fine-grained features distinguishing "Custom Programming" from "Systems Design", sampling must mine negatives from the local neighborhood—"cousins" and "siblings" of the hierarchy. Yet pushing semantically proximal nodes apart risks shattering cluster structure.

Negative Type Taxonomy

Type Tree Distance Gradient Risk Recommendation
Siblings d = 2 Very High False Negative Mask in Phase 1
Cousins d = 4 High Low Optimal negatives
2nd Cousins d = 6 Medium Very Low Good negatives
Distant d ≥ 8 Near Zero None Low utility

Hard Negative Mining

Embedding-based hard negative mining dynamically selects negatives that are currently close to the anchor in hyperbolic space. The LorentzianHardNegativeMiner computes distances to all candidate negatives and selects the top-k with smallest distances.

This adapts to the model's current state, targeting exact boundaries where the model is confused. However, it risks the "False Negative Trap" in hierarchical data—embeddings closest to an anchor are likely siblings or cousins, which are semantically similar.
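A sketch of the selection step, assuming the Lorentz distance helper from Section 5; the names and the per-anchor candidate-pool layout are illustrative, not the LorentzianHardNegativeMiner API.

import torch

def mine_hard_negatives(anchor, candidates, k=24, c=1.0):
    # anchor: (B, D+1); candidates: (B, M, D+1) hyperbolic embeddings with M >= k
    d = lorentz_distance(anchor.unsqueeze(1), candidates, c)   # (B, M) geodesic distances
    _, idx = torch.topk(d, k, dim=-1, largest=False)           # k closest = hardest
    batch = torch.arange(anchor.size(0)).unsqueeze(-1)
    return candidates[batch, idx]                              # (B, k, D+1)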

Router-Guided Sampling

Router-guided sampling selects negatives that maximize confusion in the MoE gating network. If the router sends anchor and negative to the same experts with similar confidence, they are "computationally indistinguishable." Using these as contrastive negatives forces experts to become more discriminative and combats mode collapse.

Global Batch Sampling

A local micro-batch (e.g., size 32 per GPU) is statistically unlikely to contain "Cousin" negatives (distance-4). Cross-device negative sampling gathers embeddings from all GPUs to create a larger candidate pool, enabling selection of meaningful hard negatives.


8. Structure-Aware Dynamic Curriculum (SADC)

The optimal sampling strategy is not a single static configuration but a dynamic, structure-aware process that evolves over training. The SADC implements three phases:

Phase 1: Structural Initialization (0-30%)

Objective: Establish global topology and local clustering based on the explicit NAICS tree.

Strategy: Tree-Distance Weighted Sampling with Sibling Masking

P_S1(n|a) ∝ 1/d_tree(a,n)^α · 𝟙(d_tree(a,n) > 2)

Inverse distance weighting (α ≈ 1.5) biases selection toward "Cousins" (d = 4). Siblings (d = 2) are explicitly masked: treating siblings as negatives this early is dangerous because the model has not yet learned features fine-grained enough to separate them.

Curriculum Flags: use_tree_distance=True, mask_siblings=True
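A minimal sketch of this Phase 1 weighting; the helper name is illustrative, and the streaming dataset applies the same idea over its candidate pools.

import torch

def phase1_negative_weights(tree_dists, alpha=1.5):
    # tree_dists: (M,) tree distances d_tree(a, n) to the candidate negatives
    w = tree_dists.float().clamp_min(1.0).pow(-alpha)   # inverse-distance weighting
    w = w * (tree_dists > 2)                             # mask siblings (d_tree = 2);
    return w / w.sum()                                   # assumes non-sibling candidates exist

# e.g. torch.multinomial(phase1_negative_weights(d_tree), num_samples=k, replacement=False)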

Phase 2: Geometric Refinement (30-70%)

Objective: Refine decision boundaries using the learned metric space.

Strategy: Annealed Hard Negative Mining in Lorentz Space

As the embedding space matures, transition from symbolic tree priors to learned semantics. Sample a candidate pool, then select top-k negatives minimizing Lorentzian distance. Router-guided sampling is also enabled to force expert specialization.

Curriculum Flags: enable_hard_negative_mining=True, enable_router_guided_sampling=True

Phase 3: False Negative Mitigation (70-100%)

Objective: Clean embedding space of artifacts; resolve semantic ambiguities.

Strategy: Clustering-Based False Negative Elimination (FNE)

Periodically freeze the encoder and perform Hyperbolic K-Means clustering. Assign cluster IDs as pseudo-labels. When sampling negatives, if Cluster(anchor) == Cluster(negative), eliminate that negative from the loss. This accepts that some distinct codes are semantically identical and stops fighting the data.

Curriculum Flags: enable_clustering=True

Phase Transition Summary

Phase Epochs Key Features Goal
1 0-30% Tree-distance weighting, sibling masking Build skeleton
2 30-70% Hard negative mining, router-guided sampling Refine shape
3 70-100% Clustering-based FNE Clean artifacts

9. False Negative Mitigation

The False Negative Problem

In contrastive learning, a "false negative" is a sample treated as negative despite being semantically similar to the anchor. This problem is acute for NAICS: given anchor 541511 (Custom Computer Programming), sibling 541512 (Computer Systems Design) is semantically very close. Standard contrastive loss would incorrectly apply repulsive force, damaging hierarchical structure.

The detrimental effect is pronounced in large-scale datasets with high semantic concept density—a perfect description of NAICS. Consequences include discarding valuable shared semantic information and slowed convergence.

Why Curriculum-Based Detection

Attempting false negative detection too early is counterproductive. In initial training, the embedding space is largely random—any "semantic neighbors" identified via clustering would be spurious. The detection mechanism should activate only after the embedding space has stabilized (typically 70% of training).

This creates a self-correction loop: the model first learns coarse representations, then uses that emergent structure to identify and correct inconsistencies in its own training objective, then refines representations based on this more accurate objective.

Detection via Hyperbolic K-Means

Unlike standard Euclidean K-Means, the system uses Hyperbolic K-Means operating directly in Lorentz space. This is more appropriate for hyperbolic embeddings and preserves geometric structure during clustering.

Parameter Default Description
n_clusters 500 Number of semantic clusters
curvature 1.0 Lorentz model curvature
max_iter 100 Maximum K-Means iterations
tol 1e-4 Convergence tolerance
update_frequency 5 epochs Re-clustering interval in Phase 3

Elimination vs. Attraction Strategy

Two mitigation strategies exist after identifying false negatives. Elimination removes false negatives from the denominator—the model ignores them. Attraction re-categorizes them as positives in the numerator—the model pulls them closer.

Research indicates attraction is less tolerant to noise in pseudo-labels. Since clustering-based detection inevitably produces some noise, elimination is the recommended and implemented strategy.
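In code, elimination reduces to a boolean mask built from the cluster pseudo-labels and applied before the logsumexp of the DCL loss. The sketch below is illustrative.

def false_negative_mask(anchor_clusters, negative_clusters):
    # anchor_clusters: (B,) pseudo-labels from Hyperbolic K-Means
    # negative_clusters: (B, K) pseudo-labels of the sampled negatives
    return anchor_clusters.unsqueeze(-1) == negative_clusters   # True = eliminate

# neg_sims = neg_sims.masked_fill(false_negative_mask(a_clusters, n_clusters), float('-inf'))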


10. Additional Loss Components

Beyond the primary DCL contrastive loss, the system includes several auxiliary losses to enforce specific geometric and structural properties:

Hierarchy Preservation Loss

Directly optimizes embedding distances to match ground-truth tree distances:

L_hierarchy = weight · MSE(d_embedding, d_tree)

For each pair of codes in the batch, the loss penalizes deviations between Lorentzian geodesic distance and NAICS tree distance. Default weight: 0.325.
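A sketch of this term, assuming a precomputed (B, B) matrix of tree distances for the batch and the Lorentz distance helper from Section 5.

import torch.nn.functional as F

def hierarchy_preservation_loss(embeddings, tree_dists, weight=0.325, c=1.0):
    # embeddings: (B, D+1) on the hyperboloid; tree_dists: (B, B) ground-truth tree distances
    d_emb = lorentz_distance(embeddings.unsqueeze(1), embeddings.unsqueeze(0), c)  # (B, B)
    return weight * F.mse_loss(d_emb, tree_dists.float())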

LambdaRank Loss (Rank Order Preservation)

Global ranking optimization using LambdaRank to preserve rank-order relationships.

Unlike pairwise losses, LambdaRank optimizes NDCG@k (Normalized Discounted Cumulative Gain), weighting pairs by their impact on ranking position. This provides position-aware optimization considering all pairs, not just anchor-positive-negative triplets. Default weight: 0.275.

Radius Regularization

Prevents hyperbolic embeddings from collapsing to the origin or expanding too far:

L_radius = weight · ||r - target_radius||²

Where r is the hyperbolic radius (time coordinate x₀). Default weight: 0.01.
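A sketch of the regularizer; target_radius is an assumed configuration value.

def radius_regularization(embeddings, target_radius, weight=0.01):
    r = embeddings[..., 0]                               # time coordinate x0 as radius
    return weight * ((r - target_radius) ** 2).mean()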

MoE Load Balancing Loss

As described in Section 4, ensures even expert utilization:

L_aux = α · N · Σ(f_i · P_i)

Default coefficient α = 0.01.

Total Loss

L_total = L_DCL + L_hierarchy + L_lambdarank + L_radius + L_load_balancing

Loss Component Default Weight Purpose
DCL Contrastive 1.0 (implicit) Primary representation learning
Hierarchy Preservation 0.325 Tree structure alignment
LambdaRank 0.275 Rank-order preservation
Radius Regularization 0.01 Embedding stability
Load Balancing 0.01 Expert utilization balance

11. Evaluation Metrics

The system computes comprehensive evaluation metrics during training to monitor hierarchy preservation, embedding quality, and potential failure modes.

Hierarchy Preservation Metrics

Metric Description Ideal Value
Cophenetic Correlation Correlation between embedding and tree distances → 1.0
Spearman Correlation Rank-order correlation of distance pairs → 1.0
NDCG@5 Ranking quality (top 5 neighbors) → 1.0
NDCG@10 Ranking quality (top 10 neighbors) → 1.0
NDCG@20 Ranking quality (top 20 neighbors) → 1.0
Mean Distortion Average distance distortion from tree → 0.0

Hyperbolic Geometry Metrics

Metric Description Notes
Lorentz Norm Mean Average ⟨x,x⟩_L across embeddings Should be ≈ -1/c
Lorentz Norm Violations Points violating hyperboloid constraint Should be 0
Hyperbolic Radius Mean Average x₀ (time coordinate) Indicates hierarchy depth
Hyperbolic Radius Std Standard deviation of radii Indicates spread

Collapse Detection

The system monitors for embedding collapse, where all embeddings converge to a single point or small region, indicating training failure:

Metric Description Warning Threshold
Norm CV Coefficient of variation of norms < 0.1 indicates collapse
Distance CV Coefficient of variation of pairwise distances < 0.1 indicates collapse
Variance Collapse Boolean flag for detected collapse True = problem
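The two coefficients of variation can be computed directly from a batch of embeddings, for example as below; Euclidean pairwise distances are used here for brevity, and the actual monitor may operate on Lorentzian distances.

import torch

def collapse_metrics(embeddings):
    # Coefficient of variation (std / mean); values below ~0.1 suggest collapse.
    norms = embeddings.norm(dim=-1)
    dists = torch.pdist(embeddings)
    norm_cv = (norms.std() / norms.mean().clamp_min(1e-8)).item()
    dist_cv = (dists.std() / dists.mean().clamp_min(1e-8)).item()
    return {'norm_cv': norm_cv, 'dist_cv': dist_cv,
            'variance_collapse': norm_cv < 0.1 or dist_cv < 0.1}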

12. Distributed Training

Multi-GPU Support

The system supports distributed training with automatic global batch sampling. Key features:

Global Negative Gathering: When hard negative mining or router-guided sampling is enabled, negative embeddings are gathered from all GPUs using torch.distributed.all_gather. This creates a much larger candidate pool for hard negative selection.

Gradient Flow: The implementation preserves gradients through all_gather operations. During backpropagation, gradients are scattered back to each rank, ensuring all GPUs receive gradient updates for their embeddings.
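One way to implement a gradient-preserving gather is via the autograd-aware collectives in torch.distributed.nn; the helper below is a sketch, not the DistributedMixin implementation.

import torch
import torch.distributed as dist
import torch.distributed.nn.functional as dist_fn

def gather_global_negatives(local_embeddings):
    # Returns (world_size * B, D+1); falls back to the local batch when not distributed.
    if not (dist.is_available() and dist.is_initialized()):
        return local_embeddings
    # dist_fn.all_gather is differentiable: backward scatters gradients to the owning rank.
    gathered = dist_fn.all_gather(local_embeddings)
    return torch.cat(list(gathered), dim=0)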

Global-Batch Load Balancing: Expert utilization statistics are synchronized across all workers via AllReduce before computing the auxiliary loss, enabling true domain specialization.

Memory Management

The system monitors and logs VRAM usage for distributed operations:

Metric Example (batch=32, world=4, k=24)
train/global_batch/global_negatives_memory_mb ~9 MB per GPU
train/global_batch/similarity_matrix_memory_mb ~393 KB per batch
train/global_batch/global_batch_size 128 (32 × 4)
train/global_batch/global_k_negatives 96 (24 × 4)

13. Sampling Architecture

  • Data Layer (Streaming Dataset): Builds candidate pools, applies Phase 1 inverse tree-distance weighting, masks siblings, and prioritizes explicit exclusions. Negatives carry explicit_exclusion flags for downstream logging.
  • Model Layer (NAICSContrastiveModel): Performs Phase 2+ mining (embedding-based, router-guided), norm-adaptive margins, and Phase 3 false-negative masking. Curriculum flags control which mechanisms are active.
  • Interface: Data layer supplies pre-weighted negatives and metadata; model reshapes/reorders negatives for harder sampling and logs tree-distance and router confusion metrics.
  • See docs/sampling_architecture.md for full details.

14. Implementation Reference

Key Modules

Module Location Purpose
NAICSContrastiveModel text_model/naics_model.py Main Lightning module (mixin-based)
MultiChannelEncoder text_model/encoder.py 4-channel text encoding
MixtureOfExperts text_model/moe.py MoE fusion layer
HyperbolicProjection text_model/hyperbolic.py Lorentz projection
LorentzDistance text_model/hyperbolic.py Geodesic distance
LorentzOps text_model/hyperbolic.py Static utility class for Lorentz operations
HyperbolicInfoNCELoss text_model/loss.py DCL implementation
HierarchyPreservationLoss text_model/loss.py Tree alignment loss
LambdaRankLoss text_model/loss.py Ranking loss
CurriculumScheduler text_model/curriculum.py SADC phase management
HyperbolicKMeans text_model/hyperbolic_clustering.py Lorentz clustering
LorentzianHardNegativeMiner text_model/hard_negative_mining.py HNM in hyperbolic space
RouterGuidedNegativeMiner text_model/hard_negative_mining.py Router-confusion mining
NormAdaptiveMargin text_model/hard_negative_mining.py Sech-based adaptive margins

Model Mixins

The NAICSContrastiveModel is decomposed into functional mixins for maintainability:

Mixin Location Purpose
DistributedMixin text_model/mixins/distributed.py Global batch sampling for multi-GPU
LossMixin text_model/mixins/loss.py Loss computation (hierarchy, LambdaRank, radius)
CurriculumMixin text_model/mixins/curriculum.py Hard negative mining, router-guided sampling
LoggingMixin text_model/mixins/logging.py Training and validation metric logging
ValidationMixin text_model/mixins/validation.py Validation step and evaluation logic
OptimizerMixin text_model/mixins/optimizer.py Optimizer and scheduler configuration

torch.compile Support

Core hyperbolic operations are optimized using PyTorch 2.0+ torch.compile for improved throughput via kernel fusion:

Module Location Purpose
CompileConfig utils/compile.py Compile mode and backend configuration
CompiledLorentzOps utils/compile.py Drop-in compiled replacement for LorentzOps
maybe_compile utils/compile.py Conditional compilation decorator

Compiled operations:

  • compiled_exp_map_zero: Exponential map from tangent space to hyperboloid
  • compiled_log_map_zero: Logarithmic map from hyperboloid to tangent space
  • compiled_lorentz_distance: Geodesic distance computation
  • compiled_minkowski_dot: Minkowski inner product
  • compiled_project_to_hyperboloid: Projection onto Lorentz manifold

Configuration:

from naics_embedder.utils.compile import CompileConfig, set_compile_config

# Configure compile behavior
config = CompileConfig(
    enabled=True,               # Enable torch.compile (requires PyTorch 2.0+)
    mode='reduce-overhead',     # Best for small tensors and repeated calls
    backend='inductor',         # Default, best performance
    dynamic=True,               # Support varying batch sizes
)
set_compile_config(config)

Compilation can be disabled via environment variable: NAICS_DISABLE_COMPILE=1

Default Hyperparameters

Category Parameter Default
Model base_model_name all-mpnet-base-v2
LoRA r / alpha / dropout 8 / 16 / 0.1
MoE num_experts / top_k / hidden_dim 4 / 2 / 1024
Loss temperature / curvature 0.07 / 1.0
Loss Weights hierarchy / rank_order / radius_reg / level_radius 0.45 / 0.35 / 0.15 / 0.05
MoE load_balancing_coef 0.01
Training learning_rate / weight_decay 2e-4 / 0.01
Training warmup_steps 500
Curriculum phase1_end / phase2_end 0.3 / 0.7
Clustering n_clusters / update_freq 500 / 5 epochs

CLI Commands

# Data preprocessing
uv run naics-embedder data all

# Training
uv run naics-embedder train

Appendix A: Mathematical Notation Reference

Symbol Meaning
⟨u, v⟩_L Lorentz inner product
d_L(u, v) Lorentzian geodesic distance
d_tree(a, n) Tree distance (shortest path in NAICS taxonomy)
τ Temperature parameter
c Curvature parameter
x₀ Time coordinate (hyperbolic radius)
f_i Fraction of tokens routed to expert i
P_i Average gating probability for expert i
α Load balancing coefficient
