Data Validation
Pre-flight validation utilities for data files and configuration.
Overview
The validation module provides functions to check data files, tokenization
caches, and configuration consistency before training begins. Early validation
prevents runtime surprises from missing files or incompatible data.
Usage
from naics_embedder.utils.validation import (
    validate_training_config,
    require_valid_config,
    ValidationError,
)

# Validate and get result
result = validate_training_config(cfg)
if not result.valid:
    for error in result.errors:
        print(f"Error: {error}")

# Or raise on failure
try:
    require_valid_config(cfg)
except ValidationError as e:
    print(e.message)
    for step in e.remediation:
        print(f" - {step}")
Validation Checks
The validation system checks:
- Data Paths - Required parquet files exist
- Schema Validation - Parquet files have expected columns
- Tokenization Cache - Cache exists and has correct structure
- Configuration Consistency - Settings are compatible
API Reference
Pre-flight validation utilities for NAICS Embedder.
This module provides validation functions that check data files, tokenization caches, and configuration consistency before training or embedding generation begins. Early validation prevents runtime surprises from missing files or incompatible data.
Functions:
| Name | Description |
|---|---|
| validate_data_paths | Verify required data files exist and are accessible. |
| validate_parquet_schema | Check parquet file has expected columns. |
| validate_tokenization_cache | Verify tokenization cache compatibility. |
| validate_training_config | Comprehensive pre-flight validation for training. |
| ValidationError | Exception for validation failures with remediation steps. |
ValidationError
Bases: Exception
Exception raised when validation fails.
Includes actionable remediation steps to help users fix the issue.
Attributes:
| Name | Type | Description |
|---|---|---|
| message | | Description of the validation failure. |
| remediation | | List of suggested steps to fix the issue. |
| details | | Optional additional details about the failure. |
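To make the three attributes concrete, here is a minimal sketch of an exception that carries remediation steps. It is not the library's implementation: the class name `SketchValidationError`, the constructor signature, the type of `details`, and the `__str__` layout are all assumptions made for illustration.

```python
from typing import List, Optional


class SketchValidationError(Exception):
    """Hypothetical stand-in for ValidationError (not the library's code)."""

    def __init__(
        self,
        message: str,
        remediation: Optional[List[str]] = None,
        details: Optional[str] = None,  # assumed to be a string
    ) -> None:
        super().__init__(message)
        self.message = message
        self.remediation = remediation or []
        self.details = details

    def __str__(self) -> str:
        # Render the message first, then optional details, then one
        # bullet per remediation step.
        lines = [self.message]
        if self.details:
            lines.append(f"Details: {self.details}")
        lines.extend(f"  - {step}" for step in self.remediation)
        return "\n".join(lines)


try:
    raise SketchValidationError(
        "Descriptions file not found",
        remediation=["Run the data preparation step", "Check the configured path"],
    )
except SketchValidationError as exc:
    caught = exc
```

Keeping the remediation steps as a separate list, rather than folding them into the message, lets callers choose how to present them (log lines, CLI bullets, or a UI panel).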
ValidationResult (dataclass)
Result of a validation check.
Attributes:
| Name | Type | Description |
|---|---|---|
| valid | bool | Whether validation passed. |
| errors | List[str] | List of error messages. |
| warnings | List[str] | List of warning messages. |
add_error(error)
Add an error and mark the result as invalid.
add_warning(warning)
Add a warning (does not affect validity).
failure(error) (classmethod)
Create a failed validation result with an error message.
merge(other)
Merge another result into this one.
success() (classmethod)
Create a successful validation result.
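The documented attributes and methods can be sketched as a small dataclass. This is a hypothetical re-creation, not the library's source: the name `SketchResult`, the default values, the `None` return types, and in particular the merge rule (concatenate messages, stay valid only if both sides are valid) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchResult:
    """Hypothetical re-creation of the documented ValidationResult behavior."""

    valid: bool = True
    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)

    @classmethod
    def success(cls) -> "SketchResult":
        return cls()

    @classmethod
    def failure(cls, error: str) -> "SketchResult":
        return cls(valid=False, errors=[error])

    def add_error(self, error: str) -> None:
        # Recording an error always flips the result to invalid.
        self.errors.append(error)
        self.valid = False

    def add_warning(self, warning: str) -> None:
        # Warnings are recorded but leave `valid` untouched.
        self.warnings.append(warning)

    def merge(self, other: "SketchResult") -> None:
        # Assumption: merging concatenates messages, and the combined
        # result is valid only if both inputs were valid.
        self.errors.extend(other.errors)
        self.warnings.extend(other.warnings)
        self.valid = self.valid and other.valid


combined = SketchResult.success()
combined.add_warning("cache is older than the data files")
combined.merge(SketchResult.failure("missing column: code"))
```

Under these assumptions, a comprehensive check can start from `success()` and merge in each individual check, so a single failure anywhere marks the combined result invalid while all messages are preserved.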
validate_data_paths(cfg)
Verify that required data files exist and are accessible.
Checks for the existence of description, distance, relation, and triplet parquet files required for training.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg | Config | Configuration containing data paths. | required |
Returns:
| Type | Description |
|---|---|
| ValidationResult | ValidationResult indicating success or listing missing files. |
Example
result = validate_data_paths(cfg)
if not result.valid:
    for error in result.errors:
        print(f'Missing: {error}')
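For readers curious what an existence-and-accessibility check might involve, the sketch below uses only the standard library. The helper name `find_missing_files`, the label-to-path mapping, and the message wording are hypothetical; how `validate_data_paths` reads paths from `cfg` is not shown in this page.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, List


def find_missing_files(paths: Dict[str, str]) -> List[str]:
    """Return one message per path that is missing or unreadable."""
    problems = []
    for label, raw in paths.items():
        path = Path(raw)
        if not path.is_file():
            problems.append(f"{label}: {raw} does not exist")
        elif not os.access(path, os.R_OK):
            problems.append(f"{label}: {raw} is not readable")
    return problems


# Demonstrate with a temporary directory holding one of two expected files.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "descriptions.parquet"
    present.write_bytes(b"")
    problems = find_missing_files(
        {
            "descriptions": str(present),
            "triplets": str(Path(tmp) / "triplets.parquet"),
        }
    )
```

Collecting every problem before reporting, instead of stopping at the first one, means a user can fix all missing files in one pass.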
validate_parquet_schema(path, required_columns, file_description='parquet file')
Check that a parquet file has the expected columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the parquet file. | required |
| required_columns | Set[str] | Set of column names that must be present. | required |
| file_description | str | Human-readable description for error messages. | 'parquet file' |
Returns:
| Type | Description |
|---|---|
| ValidationResult | ValidationResult with schema validation status. |
Example
result = validate_parquet_schema(
    'data/naics_descriptions.parquet',
    {'index', 'code', 'title', 'description'},
    'descriptions',
)
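The core of a column check is a set difference between the required names and the names actually present. The sketch below isolates that step; `missing_columns` is a hypothetical helper, and the column list is hard-coded so the example runs without a parquet file. Reading the real names from a file (for instance with `pyarrow.parquet.read_schema(path).names`) is an assumption about tooling, not something this page confirms.

```python
from typing import Iterable, List, Set


def missing_columns(actual: Iterable[str], required: Set[str]) -> List[str]:
    """Return required column names absent from `actual`, sorted for stable output."""
    return sorted(required - set(actual))


# Stand-in for the column names read from a parquet file's schema.
actual = ["index", "code", "title"]
required = {"index", "code", "title", "description"}
missing = missing_columns(actual, required)
```

Extra columns in the file are ignored by this approach; only absent required columns count as a failure.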
validate_tokenization_cache(cfg, tokenization_cfg=None)
Verify that the tokenization cache exists and is compatible.
Checks that the cache file exists and was generated with compatible settings (tokenizer name, max length).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg | Config | Main configuration. | required |
| tokenization_cfg | Optional[TokenizationConfig] | Optional tokenization-specific config. | None |
Returns:
| Type | Description |
|---|---|
| ValidationResult | ValidationResult with cache validation status and regeneration instructions if needed. |
Example
result = validate_tokenization_cache(cfg)
if not result.valid:
    print('Cache needs regeneration')
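A compatibility check of this kind can be pictured as comparing the settings recorded with the cache against the settings the current run expects. The sketch below assumes the cache metadata is available as a plain dictionary with `tokenizer_name` and `max_length` keys; that layout, the helper name `cache_mismatches`, and the example tokenizer name are hypothetical, since the actual cache format is not described here.

```python
from typing import Any, Dict, List


def cache_mismatches(cached: Dict[str, Any], expected: Dict[str, Any]) -> List[str]:
    """List settings whose cached value differs from the expected value."""
    mismatches = []
    for key, want in expected.items():
        have = cached.get(key)  # a missing key compares as None
        if have != want:
            mismatches.append(f"{key}: cache has {have!r}, config expects {want!r}")
    return mismatches


# Hypothetical metadata: same tokenizer, different maximum length.
cached_meta = {"tokenizer_name": "example-tokenizer", "max_length": 256}
expected_meta = {"tokenizer_name": "example-tokenizer", "max_length": 512}
mismatches = cache_mismatches(cached_meta, expected_meta)
```

Any non-empty result would be a reason to regenerate the cache, because token IDs produced under different settings are not interchangeable.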
validate_training_config(cfg)
Run comprehensive pre-flight validation for training.
Checks all data files, schemas, and configuration settings required for a successful training run.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg | Config | Training configuration to validate. | required |
Returns:
| Type | Description |
|---|---|
| ValidationResult | ValidationResult with all validation checks combined. |
Raises:
| Type | Description |
|---|---|
| ValidationError | If critical validation errors are found and |
Example
result = validate_training_config(cfg)
if not result.valid:
    for error in result.errors:
        logger.error(error)
    raise SystemExit(1)
require_valid_config(cfg)
Validate config and raise if validation fails.
Convenience function that validates the configuration and raises a ValidationError with remediation steps if validation fails.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg | Config | Configuration to validate. | required |
Raises:
| Type | Description |
|---|---|
| ValidationError | If any validation checks fail. |
Example
require_valid_config(cfg) # Raises on failure
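The validate-then-raise pattern can be sketched in a self-contained way as follows. Everything here is a stand-in: `PreflightError`, `require_no_errors`, and `preflight_exit_code` are hypothetical names, and the error list is passed in directly instead of being computed from a config.

```python
from typing import List


class PreflightError(Exception):
    """Hypothetical exception carrying remediation steps."""

    def __init__(self, message: str, remediation: List[str]) -> None:
        super().__init__(message)
        self.message = message
        self.remediation = remediation


def require_no_errors(errors: List[str]) -> None:
    """Raise if any errors were collected; otherwise return silently."""
    if errors:
        raise PreflightError(
            f"{len(errors)} validation check(s) failed: " + "; ".join(errors),
            remediation=["Fix the reported problems and rerun validation"],
        )


def preflight_exit_code(errors: List[str]) -> int:
    """Translate the raise-on-failure call into a process exit code."""
    try:
        require_no_errors(errors)
    except PreflightError as exc:
        print(exc.message)
        for step in exc.remediation:
            print(f"  - {step}")
        return 1
    return 0


ok_code = preflight_exit_code([])
bad_code = preflight_exit_code(["tokenization cache not found"])
```

A raising variant suits scripts and entry points that should stop immediately, while the result-returning variant suits callers that want to inspect errors and warnings before deciding what to do.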