NAICS Hyperbolic Embedding System: HGCN Refinement Guide
Overview
This document explains the final stage of the NAICS hyperbolic embedding pipeline: refinement using a Hyperbolic Graph Convolutional Network (HGCN).
1. Purpose of HGCN Refinement
HGCN refinement integrates the NAICS taxonomy directly into the embedding geometry.
2. Input Requirements
- Lorentz hyperbolic embeddings
- NAICS parent–child graph
- Level metadata
3. Running the Refinement
python train_hgcn.py --config configs/hgcn.yaml
4. HGCN Layer Operation
Each layer performs log-map, graph convolution in tangent space, activation, and exp-map.
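The four steps can be sketched in plain Python. This is a minimal illustration, not the code in train_hgcn.py. It assumes the log- and exp-maps are taken at the hyperboloid origin, mean aggregation over each node and its neighbors, a ReLU activation, and a curvature magnitude `k` (sectional curvature `-k`). All function names are hypothetical.

```python
import math

def log_map_origin(x, k=1.0):
    """Map a hyperboloid point (x0, x1, ..., xn) to tangent coordinates at the origin."""
    spatial = x[1:]
    norm = math.sqrt(sum(v * v for v in spatial))
    if norm < 1e-12:
        return [0.0] * len(spatial)
    # Geodesic distance from the origin; the clamp guards against rounding below 1.
    dist = math.acosh(max(1.0, math.sqrt(k) * x[0])) / math.sqrt(k)
    return [dist * v / norm for v in spatial]

def exp_map_origin(u, k=1.0):
    """Map tangent coordinates at the origin back onto the hyperboloid."""
    sk = math.sqrt(k)
    norm = math.sqrt(sum(v * v for v in u))
    if norm < 1e-12:
        return [1.0 / sk] + [0.0] * len(u)
    scale = math.sinh(sk * norm) / (sk * norm)
    return [math.cosh(sk * norm) / sk] + [scale * v for v in u]

def hgcn_layer(points, neighbors, weight, k=1.0):
    """One layer: log-map, tangent-space graph convolution, activation, exp-map."""
    tangent = [log_map_origin(p, k) for p in points]            # 1. log-map
    refined = []
    for i, nbrs in enumerate(neighbors):
        group = [i] + list(nbrs)
        dim = len(tangent[i])
        # 2. graph convolution: mean over the node and its neighbors, then a linear map
        agg = [sum(tangent[j][d] for j in group) / len(group) for d in range(dim)]
        lin = [sum(w * a for w, a in zip(row, agg)) for row in weight]
        act = [max(0.0, v) for v in lin]                        # 3. activation (ReLU)
        refined.append(exp_map_origin(act, k))                  # 4. exp-map
    return refined

def minkowski_norm_sq(x):
    """Lorentzian squared norm; equals -1/k for points on the hyperboloid."""
    return -x[0] * x[0] + sum(v * v for v in x[1:])

# Toy graph: node 0 is the parent of nodes 1 and 2.
points = [exp_map_origin(u) for u in ([0.3, -0.2], [0.5, 0.1], [-0.4, 0.6])]
neighbors = [[1, 2], [0], [0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
refined = hgcn_layer(points, neighbors, identity)
```

Because every output passes through the exp-map, the refined points stay on the hyperboloid regardless of the weights, which `minkowski_norm_sq` can confirm.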
5. Refinement Loss Functions
- Hyperbolic Triplet Loss
- Per-Level Radial Regularization
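As a rough illustration of the two terms, the sketch below computes a margin-based triplet loss on Lorentz geodesic distances and a radial penalty that pulls each code toward a target distance from the origin for its hierarchy level. The margin, the per-level target radii, the squared-error form of the penalty, and all function names are assumptions made for this example. The actual loss definitions live in the training code.

```python
import math

def lorentz_inner(x, y):
    """Minkowski inner product: -x0*y0 + sum(xi*yi)."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y, k=1.0):
    """Geodesic distance on the hyperboloid with curvature -k."""
    arg = max(1.0, -k * lorentz_inner(x, y))  # clamp guards against rounding below 1
    return math.acosh(arg) / math.sqrt(k)

def lift(u, k=1.0):
    """Place spatial coordinates on the hyperboloid by solving for the time coordinate."""
    return [math.sqrt(1.0 / k + sum(v * v for v in u))] + list(u)

def triplet_loss(anchor, positive, negative, margin=0.5, k=1.0):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    d_pos = lorentz_distance(anchor, positive, k)
    d_neg = lorentz_distance(anchor, negative, k)
    return max(0.0, d_pos - d_neg + margin)

def radial_regularizer(points, levels, target_radius, k=1.0):
    """Mean squared gap between each point's radius and its level's target radius."""
    origin = [1.0 / math.sqrt(k)] + [0.0] * (len(points[0]) - 1)
    total = 0.0
    for point, level in zip(points, levels):
        radius = lorentz_distance(origin, point, k)
        total += (radius - target_radius[level]) ** 2
    return total / len(points)

# Toy example: a sibling code is the positive, a distant code is the negative.
anchor, positive, negative = lift([0.1, 0.0]), lift([0.2, 0.0]), lift([3.0, 0.0])
loss_ok = triplet_loss(anchor, positive, negative)    # ordering satisfied
loss_bad = triplet_loss(anchor, negative, positive)   # ordering violated
```

In this toy setup `loss_ok` is zero because the positive is already closer than the negative by more than the margin, while `loss_bad` is positive.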
6. Learnable Curvature
The curvature parameter is optimized jointly with the rest of the model during refinement.
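One common way to keep a learnable curvature valid is to optimize an unconstrained raw value and pass it through softplus so the curvature magnitude stays positive. The toy sketch below fits only the curvature, so that the distance between two fixed points matches a target. In real training the same optimizer step would also update the layer weights. The softplus parameterization, the finite-difference gradient (standing in for autodiff), and all names are assumptions for illustration, not the project's implementation.

```python
import math

def softplus(raw):
    """Map an unconstrained value to a strictly positive curvature magnitude."""
    return math.log1p(math.exp(raw))

def orthogonal_distance(a, b, k):
    """Distance between two points at radii a and b from the origin whose
    directions are orthogonal (hyperbolic law of cosines, curvature -k)."""
    sk = math.sqrt(k)
    return math.acosh(math.cosh(sk * a) * math.cosh(sk * b)) / sk

def loss(raw, target):
    """Squared error between the current distance and a target tree distance."""
    return (orthogonal_distance(1.0, 1.0, softplus(raw)) - target) ** 2

def fit_curvature(target, raw=0.0, lr=20.0, steps=300, eps=1e-5):
    """Gradient descent on the raw parameter using a central-difference gradient."""
    for _ in range(steps):
        grad = (loss(raw + eps, target) - loss(raw - eps, target)) / (2 * eps)
        raw -= lr * grad
    return raw

initial_loss = loss(0.0, 1.6)
raw = fit_curvature(target=1.6)
curvature = softplus(raw)
final_loss = loss(raw, 1.6)
```

For two unit-radius points in orthogonal directions the distance grows from about 1.41 toward 2 as curvature increases, so a target of 1.6 has a positive curvature that satisfies it.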
7. Output of HGCN Refinement
Refined Lorentz-model hyperbolic embeddings aligned with taxonomy structure.
8. Validation Metrics
Stage 4 now mirrors the text-model evaluation suite so you can verify that graph refinement does not erode global structure:
- Cophenetic correlation – correlation between embedding distances and tree distances.
- Spearman correlation – rank-order agreement across the hierarchy.
- NDCG@K (default: 5/10/20) – position-aware ranking quality.
- Distortion stats – mean/std/median stretch between embedding and tree distances.
Metrics are logged once per validation run (default: every epoch). They require the precomputed tree distance matrix produced in Stage 2.
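To make the four metrics concrete, here is a small standard-library sketch. It treats cophenetic correlation as the Pearson correlation between paired embedding and tree distances, computes Spearman as Pearson on average ranks, and reports distortion as the ratio of embedding distance to tree distance. The NDCG relevance `1 / (1 + tree distance)` is an assumption for this example, and the project's evaluation suite may define gains differently. All function names are hypothetical.

```python
import math
from statistics import mean, median, pstdev

def pearson(xs, ys):
    """Pearson correlation; used here for the cophenetic correlation."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks, with ties sharing the average rank of their group."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Rank-order agreement: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

def distortion_stats(emb_dists, tree_dists):
    """Mean, std, and median of embedding distance divided by tree distance."""
    ratios = [e / t for e, t in zip(emb_dists, tree_dists) if t > 0]
    return {"mean": mean(ratios), "std": pstdev(ratios), "median": median(ratios)}

def ndcg_at_k(emb_dists, tree_dists, k):
    """NDCG@K for one query: rank candidates by embedding distance,
    score them with relevance 1 / (1 + tree distance) (an assumption)."""
    rel = [1.0 / (1.0 + t) for t in tree_dists]
    ranked = sorted(range(len(rel)), key=lambda i: emb_dists[i])[:k]
    dcg = sum(rel[i] / math.log2(pos + 2) for pos, i in enumerate(ranked))
    ideal = sorted(rel, reverse=True)[:k]
    idcg = sum(r / math.log2(pos + 2) for pos, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy pairwise distances: the embedding is a uniformly scaled copy of the tree.
emb = [1.0, 2.0, 3.0, 4.0]
tree = [2.0, 4.0, 6.0, 8.0]
report = {
    "cophenetic": pearson(emb, tree),
    "spearman": spearman(emb, tree),
    "distortion": distortion_stats(emb, tree),
    "ndcg@2": ndcg_at_k(emb, tree, 2),
}
```

A uniformly scaled embedding keeps both correlations at their maximum while the distortion statistics expose the scale change, which is why the two kinds of metric are reported together.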
Configuration
Add the following keys to your GraphConfig (or configs/hgcn.yaml) to customize evaluation:
| Key | Description |
|---|---|
| distance_matrix_parquet | Path to naics_distance_matrix.parquet. Required to unlock hierarchy metrics. |
| full_eval_frequency | Run the expensive metrics every N validation epochs (default 1, meaning every validation epoch). |
| ndcg_k_values | List of K values used for NDCG logging. |
If distance_matrix_parquet is missing, HGCN automatically skips the extra metrics and continues with the lightweight batch metrics (triplet accuracy, etc.).
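The gating behavior described above can be sketched as follows. `EvalSettings` is a hypothetical stand-in for the three evaluation keys and is not the project's GraphConfig class. The sketch assumes that "missing" covers both an unset key and a path that does not exist, and that full_eval_frequency counts validation runs starting from 1.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional

@dataclass
class EvalSettings:
    """Hypothetical container mirroring the three evaluation keys."""
    distance_matrix_parquet: Optional[str] = None
    full_eval_frequency: int = 1
    ndcg_k_values: List[int] = field(default_factory=lambda: [5, 10, 20])

def hierarchy_metrics_enabled(settings: EvalSettings) -> bool:
    """Hierarchy metrics need a distance matrix file that actually exists."""
    path = settings.distance_matrix_parquet
    return path is not None and Path(path).is_file()

def should_run_full_eval(settings: EvalSettings, validation_run: int) -> bool:
    """Decide whether this validation run (1-based) computes the expensive metrics."""
    if not hierarchy_metrics_enabled(settings):
        return False  # fall back to the lightweight batch metrics only
    return validation_run % settings.full_eval_frequency == 0

defaults = EvalSettings()
runs_without_matrix = should_run_full_eval(defaults, validation_run=1)
```

With the defaults no distance matrix is configured, so `runs_without_matrix` is `False` and only the lightweight batch metrics would be logged.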
9. Pre/Post Verification Workflow
After both Stage 3 and Stage 4 finish, run the automated comparison from Issue #67 to confirm that HGCN preserved the Stage 3 geometry:
uv run naics-embedder tools verify-stage4 \
--pre ./output/hyperbolic_projection/encodings.parquet \
--post ./output/hgcn/encodings.parquet
Additional options let you override the distance matrix, relations parquet, or the acceptable degradation thresholds:
| Option | Purpose |
|---|---|
| --max-cophenetic-drop | Maximum allowable decrease in cophenetic correlation (default 0.02). |
| --max-ndcg-drop | Maximum allowable decrease in NDCG@K (default 0.01). |
| --min-local-improvement | Required increase in parent retrieval accuracy (default 0.05). |
| --ndcg-k | Which K to evaluate for NDCG (default 10). |
| --parent-top-k | Size of the neighborhood used for parent retrieval (default 1). |
The command prints pre/post metrics, deltas, and PASS/FAIL indicators for each threshold. Integrate it into CI to prevent regressions before shipping updated embeddings.
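The threshold logic can be pictured with the sketch below, which compares pre and post metric values against the documented defaults and prints a PASS/FAIL line for each. It illustrates the comparison only and is not the verify-stage4 implementation. The metric keys (`cophenetic`, `ndcg`, `parent_accuracy`), the class, and the function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Thresholds:
    """Defaults taken from the option table; field names are illustrative."""
    max_cophenetic_drop: float = 0.02
    max_ndcg_drop: float = 0.01
    min_local_improvement: float = 0.05

def check_stage4(pre, post, thresholds=None):
    """Return per-metric deltas (post - pre) and pass/fail flags."""
    t = thresholds or Thresholds()
    deltas = {name: post[name] - pre[name] for name in pre}
    passed = {
        # Global structure: the drop (negative delta) must stay within the limit.
        "cophenetic": -deltas["cophenetic"] <= t.max_cophenetic_drop,
        "ndcg": -deltas["ndcg"] <= t.max_ndcg_drop,
        # Local structure: parent retrieval must improve by at least the minimum.
        "parent_accuracy": deltas["parent_accuracy"] >= t.min_local_improvement,
    }
    return deltas, passed

def print_report(pre, post, deltas, passed):
    """Print pre/post values, deltas, and a PASS/FAIL indicator per metric."""
    for name in pre:
        status = "PASS" if passed[name] else "FAIL"
        print(f"{name:16s} pre={pre[name]:.3f} post={post[name]:.3f} "
              f"delta={deltas[name]:+.3f} {status}")

# Made-up values for illustration only.
pre = {"cophenetic": 0.80, "ndcg": 0.70, "parent_accuracy": 0.50}
post = {"cophenetic": 0.79, "ndcg": 0.70, "parent_accuracy": 0.60}
deltas, passed = check_stage4(pre, post)
print_report(pre, post, deltas, passed)
all_passed = all(passed.values())
```

In CI, a wrapper would typically exit non-zero when `all_passed` is false so the pipeline stops before the embeddings ship.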