Data Validation

Pre-flight validation utilities for data files and configuration.

Overview

The validation module provides functions to check data files, tokenization caches, and configuration consistency before training begins. Early validation prevents runtime surprises from missing files or incompatible data.

Usage

from naics_embedder.utils.validation import (
    validate_training_config,
    require_valid_config,
    ValidationError,
)

# Validate and get result
result = validate_training_config(cfg)
if not result.valid:
    for error in result.errors:
        print(f"Error: {error}")

# Or raise on failure
try:
    require_valid_config(cfg)
except ValidationError as e:
    print(e.message)
    for step in e.remediation:
        print(f"  - {step}")

Validation Checks

The validation system checks:

  1. Data Paths - Required parquet files exist
  2. Schema Validation - Parquet files have expected columns
  3. Tokenization Cache - Cache exists and has correct structure
  4. Configuration Consistency - Settings are compatible

API Reference

Pre-flight validation utilities for NAICS Embedder.

This module provides validation functions that check data files, tokenization caches, and configuration consistency before training or embedding generation begins. Early validation prevents runtime surprises from missing files or incompatible data.

Functions:

Name                         Description
validate_data_paths          Verify required data files exist and are accessible.
validate_parquet_schema      Check parquet file has expected columns.
validate_tokenization_cache  Verify tokenization cache compatibility.
validate_training_config     Comprehensive pre-flight validation for training.
ValidationError              Exception for validation failures with remediation steps.

ValidationError

Bases: Exception

Exception raised when validation fails.

Includes actionable remediation steps to help users fix the issue.

Attributes:

Name         Description
message      Description of the validation failure.
remediation  List of suggested steps to fix the issue.
details      Optional additional details about the failure.
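
To make the three attributes concrete, here is a minimal stand-in for an exception of this shape. It is a sketch, not the library's implementation: the constructor signature, the attribute types, and the format_error helper are all assumptions, since the reference above lists only the attribute names.

```python
from typing import Dict, List, Optional


class ValidationError(Exception):
    """Illustrative stand-in; the real class lives in naics_embedder.utils.validation."""

    def __init__(
        self,
        message: str,
        remediation: Optional[List[str]] = None,
        details: Optional[Dict[str, object]] = None,
    ) -> None:
        super().__init__(message)
        self.message = message
        # Assumption: remediation defaults to an empty list when omitted.
        self.remediation = remediation or []
        # Assumption: details is a free-form mapping (the docs give no type).
        self.details = details


def format_error(err: ValidationError) -> str:
    """Hypothetical helper: render the failure and its remediation steps."""
    lines = [f"Validation failed: {err.message}"]
    lines.extend(f"  - {step}" for step in err.remediation)
    return "\n".join(lines)
```

Keeping the remediation steps as a list rather than folding them into the message lets callers choose how to present them (log lines, a CLI bullet list, and so on).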

ValidationResult dataclass

Result of a validation check.

Attributes:

Name      Type       Description
valid     bool       Whether validation passed.
errors    List[str]  List of error messages.
warnings  List[str]  List of warning messages.

add_error(error)

Add an error and mark as invalid.

add_warning(warning)

Add a warning (does not affect validity).

failure(error) classmethod

Create a failed validation result with an error message.

merge(other)

Merge another result into this one.

success() classmethod

Create a successful validation result.
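
The methods above suggest an accumulator pattern: individual checks each produce a result, and the results are merged into one. The following is a rough sketch of a dataclass with that behavior. It is not the library's source; in particular, the return values and the exact merge semantics (concatenate messages, valid only if both sides are valid) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ValidationResult:
    """Illustrative stand-in mirroring the documented attributes and methods."""

    valid: bool = True
    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)

    def add_error(self, error: str) -> None:
        # An error always marks the result as invalid.
        self.errors.append(error)
        self.valid = False

    def add_warning(self, warning: str) -> None:
        # Warnings are recorded but do not change validity.
        self.warnings.append(warning)

    def merge(self, other: "ValidationResult") -> None:
        # Assumption: merged result is valid only if both inputs were valid.
        self.errors.extend(other.errors)
        self.warnings.extend(other.warnings)
        self.valid = self.valid and other.valid

    @classmethod
    def success(cls) -> "ValidationResult":
        return cls(valid=True)

    @classmethod
    def failure(cls, error: str) -> "ValidationResult":
        return cls(valid=False, errors=[error])


# Combine two checks: one that only warns, one that fails.
combined = ValidationResult.success()
combined.add_warning("cache is older than the data files")
combined.merge(ValidationResult.failure("triplets file not found"))
```

After the merge, combined.valid is False, and combined carries one error and one warning.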

validate_data_paths(cfg)

Verify that required data files exist and are accessible.

Checks for the existence of description, distance, relation, and triplet parquet files required for training.

Parameters:

Name  Type    Description                           Default
cfg   Config  Configuration containing data paths.  required

Returns:

Type              Description
ValidationResult  ValidationResult indicating success or listing missing files.

Example

result = validate_data_paths(cfg)
if not result.valid:
    for error in result.errors:
        print(f'Missing: {error}')
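
An existence check of this kind can be sketched with pathlib alone. The helper below is hypothetical and is not the function documented above: it takes a plain mapping of labels to paths instead of a Config, and the message wording is an assumption.

```python
from pathlib import Path
from typing import Dict, List


def find_missing_files(paths: Dict[str, str]) -> List[str]:
    """Return one message per entry that does not point to an existing file."""
    missing = []
    for label, raw_path in paths.items():
        # is_file() is False for missing paths and for directories alike.
        if not Path(raw_path).is_file():
            missing.append(f"{label} file not found: {raw_path}")
    return missing
```

Collecting every missing file before reporting, rather than stopping at the first one, lets the user fix all the paths in a single pass.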

validate_parquet_schema(path, required_columns, file_description='parquet file')

Check that a parquet file has the expected columns.

Parameters:

Name              Type      Description                                     Default
path              str       Path to the parquet file.                       required
required_columns  Set[str]  Set of column names that must be present.       required
file_description  str       Human-readable description for error messages.  'parquet file'

Returns:

Type              Description
ValidationResult  ValidationResult with schema validation status.

Example

result = validate_parquet_schema(
    'data/naics_descriptions.parquet',
    {'index', 'code', 'title', 'description'},
    'descriptions',
)
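
At its core, a schema check like this is a set difference between the required columns and the columns in the file. The sketch below separates those two steps. Both helper names are hypothetical, and reading the schema with pyarrow is an assumption about how the column names might be obtained, not a statement about what the library uses.

```python
from typing import Iterable, List, Set


def missing_columns(actual: Iterable[str], required: Set[str]) -> List[str]:
    """Return the required column names absent from `actual`, sorted for stable output."""
    return sorted(required - set(actual))


def read_parquet_columns(path: str) -> List[str]:
    """Read column names from the file's schema without loading row data."""
    # Imported lazily so the comparison helper works without pyarrow installed.
    import pyarrow.parquet as pq

    return list(pq.read_schema(path).names)


# Compare a known column list against the required set from the example above.
absent = missing_columns(
    ["index", "code", "title"],
    {"index", "code", "title", "description"},
)
```

Here absent contains only 'description'. Extra columns in the file are ignored, which matches the "must be present" wording of required_columns.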

validate_tokenization_cache(cfg, tokenization_cfg=None)

Verify that the tokenization cache exists and is compatible.

Checks that the cache file exists and was generated with compatible settings (tokenizer name, max length).

Parameters:

Name              Type                          Description                             Default
cfg               Config                        Main configuration.                     required
tokenization_cfg  Optional[TokenizationConfig]  Optional tokenization-specific config.  None

Returns:

Type              Description
ValidationResult  ValidationResult with cache validation status and regeneration instructions if needed.

Example

result = validate_tokenization_cache(cfg)
if not result.valid:
    print('Cache needs regeneration')
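
The compatibility part of this check compares the settings recorded with the cache against the current configuration. A rough sketch follows, assuming the cache's settings are available as a plain dictionary. The function name and the keys tokenizer_name and max_length are hypothetical; the docs say only that the tokenizer name and max length are compared.

```python
from typing import Dict, List


def cache_mismatches(cache_meta: Dict[str, object], expected: Dict[str, object]) -> List[str]:
    """Return one message per expected setting that the cache does not match."""
    problems = []
    for key, want in expected.items():
        have = cache_meta.get(key)  # None when the cache never recorded the setting
        if have != want:
            problems.append(f"{key}: cache has {have!r}, config expects {want!r}")
    return problems


# Hypothetical metadata: the cache was built with a shorter max length.
problems = cache_mismatches(
    {"tokenizer_name": "example-tokenizer", "max_length": 256},
    {"tokenizer_name": "example-tokenizer", "max_length": 512},
)
```

Any mismatch means the cached token ids may not correspond to what the current settings would produce, so regeneration is the safe remedy.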

validate_training_config(cfg)

Run comprehensive pre-flight validation for training.

Checks all data files, schemas, and configuration settings required for a successful training run.

Parameters:

Name  Type    Description                          Default
cfg   Config  Training configuration to validate.  required

Returns:

Type              Description
ValidationResult  ValidationResult with all validation checks combined.

Raises:

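Type             Description
ValidationError  If critical validation errors are found and raise_on_error=True. The signature shown above lists only cfg, so this option may not be available; require_valid_config raises on any failure.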

Example

result = validate_training_config(cfg)
if not result.valid:
    for error in result.errors:
        logger.error(error)
    raise SystemExit(1)

require_valid_config(cfg)

Validate config and raise if validation fails.

Convenience function that validates the configuration and raises a ValidationError with remediation steps if validation fails.

Parameters:

Name  Type    Description                 Default
cfg   Config  Configuration to validate.  required

Raises:

Type             Description
ValidationError  If any validation checks fail.

Example

require_valid_config(cfg) # Raises on failure
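
The validate-then-raise pattern can be sketched as a thin wrapper around any check that returns a list of error strings, plus a caller that turns the exception into a process exit code. Everything here is a stand-in under stated assumptions: the exception class, require_valid, preflight, and the remediation text are hypothetical and do not reproduce the library's code.

```python
from typing import Callable, List


class ValidationError(Exception):
    """Stand-in exception carrying a message and remediation steps."""

    def __init__(self, message: str, remediation: List[str]) -> None:
        super().__init__(message)
        self.message = message
        self.remediation = remediation


def require_valid(check: Callable[[], List[str]], remediation: List[str]) -> None:
    """Run `check`; raise if it reports any errors, otherwise return None."""
    errors = check()
    if errors:
        raise ValidationError("; ".join(errors), remediation)


def preflight(check: Callable[[], List[str]]) -> int:
    """Return a process exit code: 0 when validation passes, 1 otherwise."""
    try:
        require_valid(check, ["Regenerate the missing data files"])
    except ValidationError as exc:
        print(exc.message)
        for step in exc.remediation:
            print(f"  - {step}")
        return 1
    return 0
```

A training entry point could call something like preflight first and exit with its return value, so a bad configuration fails before any expensive setup starts.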