vipr_reflectometry.flow_models.postprocess.cluster.clustering package¶

Subpackages¶

vipr_reflectometry.flow_models.postprocess.cluster.clustering.visualization package

Submodules¶

vipr_reflectometry.flow_models.postprocess.cluster.clustering.algorithms module¶

Clustering Algorithms and Validation Metrics.

Simplified version without excessive scaling/whitening/merging complexity.

vipr_reflectometry.flow_models.postprocess.cluster.clustering.algorithms.apply_bayesian_gmm(samples: ndarray, n_components: int = 10, logger=None, *, seed: int | None = None, weight_concentration_prior: float | None = None, weight_concentration_prior_type: str = 'dirichlet_process', covariance_type: str = 'full', reg_covar: float = 1e-06, n_init: int = 1, whiten: bool = True) → tuple[ndarray, Dict[str, Any]]¶

Bayesian GMM clustering with Dirichlet Process prior. Automatically determines the effective number of components.

Parameters:

samples – Parameter samples (N, D)
n_components – Upper bound on number of components
logger – Logger instance
seed – Random seed for reproducible fitting
weight_concentration_prior – Dirichlet concentration (None=auto, low=few clusters)
weight_concentration_prior_type – ‘dirichlet_process’ or ‘dirichlet_distribution’
covariance_type – Covariance structure (‘full’, ‘tied’, ‘diag’, ‘spherical’)
reg_covar – Covariance regularization
n_init – Number of initializations
whiten – Apply StandardScaler for equal parameter weighting

Returns:

Tuple of (labels, gmm_params), where

labels: Cluster labels array (0 to K-1, never -1)
gmm_params: Dict with ‘weights’, ‘means’, ‘covariances’ (the K Gaussians)

Return type:

tuple[ndarray, Dict[str, Any]]
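
The effective number of components can be inspected from the returned values. The following is a minimal sketch, not the package's implementation. It assumes that gmm_params['weights'] holds one weight per fitted component and that unused components carry a near-zero weight; `effective_components` and `min_weight` are hypothetical names introduced for illustration.

```python
import numpy as np


def effective_components(labels, weights, min_weight=1e-2):
    """Count the components that are actually used.

    Returns two counts: components that received at least one sample,
    and components whose mixture weight exceeds ``min_weight``
    (a hypothetical threshold, not a package default).
    """
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)
    used_by_labels = int(np.unique(labels).size)
    used_by_weight = int(np.sum(weights > min_weight))
    return used_by_labels, used_by_weight


# Hypothetical output of a fit with an upper bound of 4 components.
labels = np.array([0, 0, 2, 2, 2, 0])
weights = np.array([0.48, 0.001, 0.515, 0.004])
n_by_labels, n_by_weight = effective_components(labels, weights)
```

If the two counts disagree, a component has non-negligible weight but wins no hard assignment, which may be worth logging.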

vipr_reflectometry.flow_models.postprocess.cluster.clustering.algorithms.apply_gmm(samples: ndarray, n_components: int | None = None, logger=None, *, seed: int | None = None, k_max: int = 8, reg_covar: float = 1e-06, n_init: int = 3, covariance_type: str = 'full', whiten: bool = True) → tuple[ndarray, Dict[str, Any]]¶

Pure GMM clustering without pruning or noise detection.

K selection via BIC (if n_components is None)
Hard assignment via gmm.predict()
No -1/noise labels, no min_cluster_size filtering
Optional standardization for stable BIC-based K selection

Parameters:

samples – Parameter samples (N, D)
n_components – Fixed K (None = auto via BIC)
logger – Logger instance
seed – Random seed for reproducible GMM fitting
k_max – Maximum K to test for BIC
reg_covar – Covariance regularization
n_init – Number of GMM initializations
covariance_type – Covariance structure (‘full’, ‘tied’, ‘diag’, ‘spherical’)
whiten – Apply StandardScaler for equal parameter weighting

Returns:

Tuple of (labels, gmm_params), where

labels: Cluster labels array (0 to K-1, never -1)
gmm_params: Dict with ‘weights’, ‘means’, ‘covariances’ (the K Gaussians)

Return type:

tuple[ndarray, Dict[str, Any]]
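
BIC-based selection of K can be sketched with the standard library alone. This is an illustration of the criterion, not the code used by apply_gmm: the helper names and the log-likelihood values are hypothetical, and a real run would obtain the log-likelihoods from fitted mixtures for K = 1 … k_max.

```python
import math


def gmm_free_parameters(k, d, covariance_type="full"):
    """Number of free parameters of a K-component GMM in D dimensions."""
    cov_params = {
        "full": k * d * (d + 1) // 2,   # one symmetric matrix per component
        "tied": d * (d + 1) // 2,       # one shared symmetric matrix
        "diag": k * d,                  # one variance per dimension and component
        "spherical": k,                 # one variance per component
    }[covariance_type]
    mean_params = k * d
    weight_params = k - 1               # weights sum to 1
    return cov_params + mean_params + weight_params


def bic(total_log_likelihood, k, d, n_samples, covariance_type="full"):
    """BIC = p * ln(N) - 2 * ln(L); lower is better."""
    p = gmm_free_parameters(k, d, covariance_type)
    return p * math.log(n_samples) - 2.0 * total_log_likelihood


def select_k(log_likelihoods, d, n_samples, covariance_type="full"):
    """Pick the K with the lowest BIC from a {K: total log-likelihood} dict."""
    scores = {
        k: bic(ll, k, d, n_samples, covariance_type)
        for k, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical total log-likelihoods for K = 1..4 on 1000 samples in 3 dimensions.
fits = {1: -3200.0, 2: -2500.0, 3: -2490.0, 4: -2485.0}
best_k, scores = select_k(fits, d=3, n_samples=1000)
```

In this example the likelihood gain beyond K = 2 is too small to offset the parameter penalty, so K = 2 is selected.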

vipr_reflectometry.flow_models.postprocess.cluster.clustering.algorithms.apply_hdbscan(samples: ndarray, min_cluster_size: int, min_samples: int, logger=None, *, seed: int | None = None, whiten: bool = True) → ndarray¶

HDBSCAN clustering with optional standardization.

Parameters:

samples – Array of shape (num_samples, num_params)
min_cluster_size – Minimum number of samples per cluster
min_samples – Minimum number of samples for core points
logger – Optional logger for debug messages
seed – Random seed for reproducibility (unused; HDBSCAN is deterministic)
whiten – Apply StandardScaler for equal parameter weighting

Returns:

Array of cluster labels (-1 for noise)
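
The standardization step and the noise label can be illustrated as follows. This is a sketch under stated assumptions: `standardize` is a hypothetical NumPy stand-in that mirrors what StandardScaler computes (zero mean and unit population standard deviation per column), and the label array is made up for the example.

```python
import numpy as np


def standardize(samples):
    """Scale each parameter to zero mean and unit variance."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    std = samples.std(axis=0)      # population std (ddof=0)
    std[std == 0.0] = 1.0          # leave constant columns unscaled
    return (samples - mean) / std


# Two parameters on very different scales (e.g. a thickness and a roughness).
samples = np.array([[100.0, 0.1], [200.0, 0.2], [300.0, 0.3]])
scaled = standardize(samples)

# Hypothetical HDBSCAN-style labels: -1 marks noise points.
labels = np.array([0, 0, 1, -1, 1, -1])
noise_mask = labels == -1
n_clusters = int(np.unique(labels[~noise_mask]).size)
noise_fraction = float(noise_mask.mean())
```

Without scaling, the first parameter would dominate every distance computation simply because of its units.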

vipr_reflectometry.flow_models.postprocess.cluster.clustering.algorithms.calculate_validation_metrics(samples: ndarray, cluster_labels: ndarray, logger=None) → Dict[str, Any]¶

Calculate validation metrics on scaled features.

Metrics are calculated on standardized features to match the clustering feature space when whiten=True was used.

Parameters:

samples – Parameter samples (num_samples, num_params)
cluster_labels – Cluster assignments
logger – Optional logger for warnings

Returns:

Dict with validation metrics and quality assessment
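
The documentation above does not list which metrics are computed. As an assumed example only, the sketch below computes a mean silhouette score on standardized features, excluding noise points (-1). The function name and the choice of metric are hypothetical and may differ from what calculate_validation_metrics returns.

```python
import numpy as np


def silhouette_on_scaled(samples, labels):
    """Mean silhouette score on standardized features, ignoring noise (-1).

    Returns None when fewer than two clusters remain.
    """
    X = np.asarray(samples, dtype=float)
    labels = np.asarray(labels)

    # Standardize on all samples, then drop noise points.
    std = X.std(axis=0)
    std[std == 0.0] = 1.0
    X = (X - X.mean(axis=0)) / std
    keep = labels != -1
    X, labels = X[keep], labels[keep]

    unique = np.unique(labels)
    if unique.size < 2:
        return None

    # Pairwise Euclidean distances, shape (N, N).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        a = dist[i, same].sum() / (n_same - 1)  # mean intra-cluster distance
        b = min(dist[i, labels == other].mean()
                for other in unique if other != labels[i])
        denom = max(a, b)
        if denom > 0.0:
            scores[i] = (b - a) / denom
    return float(scores.mean())


samples = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
score = silhouette_on_scaled(samples, [0, 0, 1, 1])
```

Scores close to 1 indicate compact, well-separated clusters; scores near 0 indicate overlapping ones.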

vipr_reflectometry.flow_models.postprocess.cluster.clustering.clustering module¶

Clustering Module - Orchestration Layer.

Pure orchestrator that coordinates the clustering workflow. Delegates algorithm execution to the algorithms module.

class vipr_reflectometry.flow_models.postprocess.cluster.clustering.clustering.ClusterProcessor(app)¶

Bases: object

Main clustering orchestrator that coordinates the workflow.

Responsibilities:

- Handle batch processing (multiple spectra)
- Coordinate algorithm execution (delegates to algorithms module)
- Coordinate visualization and simulation

This class is a pure orchestrator: it does not implement algorithms itself.

process(data: Dict[str, Any], method: str = 'gmm', n_components: int | None = 5, min_cluster_size: int = 50, min_samples: int = 10, seed: int | None = None, n_init: int = 10, covariance_type: str = 'full', weight_concentration_prior: float | None = None, weight_concentration_prior_type: str = 'dirichlet_process', benchmark_mode: bool = False, enable_validation_plots: bool = True, enable_cluster_corner_plot: bool = True, enable_cluster_marginals: bool = True, enable_parallel_coordinates: bool = True, enable_centroid_plots: bool = True, enable_interactive_export: bool = True, **_) → Dict[str, Any]¶

Process clustering for a single spectrum.

Parameters:

data – Prediction results
method – Clustering method (‘hdbscan’ or ‘gmm’)
n_components – Number of GMM components
min_cluster_size – Minimum HDBSCAN cluster size
min_samples – Minimum HDBSCAN samples

Returns:

Enriched data with cluster information

vipr_reflectometry.flow_models.postprocess.cluster.clustering.hook module¶

VIPR Filter Adapter for Clustering.

Thin adapter between VIPR filter hook and clustering service.

class vipr_reflectometry.flow_models.postprocess.cluster.clustering.hook.ClusterHook(app: VIPR)¶

Bases: object

Thin adapter between VIPR filter hook and clustering processor.

Responsibilities:

- Register as VIPR filter
- Validate input data
- Delegate to service layer

class vipr_reflectometry.flow_models.postprocess.cluster.clustering.hook.ClusterHookParams(*, method: str = 'gmm', n_components: int | None = 2, min_cluster_size: int = 50, min_samples: int = 10, seed: int = 42, n_init: int = 10, covariance_type: str = 'full', weight_concentration_prior: float | None = None, weight_concentration_prior_type: str = 'dirichlet_process', polish_centroids: bool = False, benchmark_mode: bool = False, enable_validation_plots: bool = True, enable_cluster_corner_plot: bool = True, enable_cluster_marginals: bool = True, enable_parallel_coordinates: bool = True, enable_centroid_plots: bool = True, enable_interactive_export: bool = True)¶

Bases: BaseModel

Configuration for posterior-sample clustering.

benchmark_mode: bool¶

covariance_type: str¶

enable_centroid_plots: bool¶

enable_cluster_corner_plot: bool¶

enable_cluster_marginals: bool¶

enable_interactive_export: bool¶

enable_parallel_coordinates: bool¶

enable_validation_plots: bool¶

method: str¶

min_cluster_size: int¶

min_samples: int¶

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}¶

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

n_components: int | None¶

n_init: int¶

polish_centroids: bool¶

seed: int¶

weight_concentration_prior: float | None¶

weight_concentration_prior_type: str¶
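
The effect of model_config = {'extra': 'forbid'} can be sketched with a reduced stand-in model. `ParamsSketch` is hypothetical and copies only three of the fields listed above; the example assumes pydantic v2, which the ConfigDict annotation suggests.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError


class ParamsSketch(BaseModel):
    """Hypothetical stand-in with a subset of the ClusterHookParams fields."""

    model_config = ConfigDict(extra="forbid")

    method: str = "gmm"
    n_components: Optional[int] = 2
    seed: int = 42


# Known fields are accepted; omitted fields fall back to their defaults.
ok = ParamsSketch(method="hdbscan")

# A misspelled field name is rejected instead of being silently ignored.
try:
    ParamsSketch(n_component=3)
    rejected = False
except ValidationError:
    rejected = True
```

Forbidding extra fields means a typo in a configuration key surfaces as a validation error rather than as a run with unintended defaults.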

vipr_reflectometry.flow_models.postprocess.cluster.clustering.simulation module¶

Forward Simulation Module.

Handles centroid calculation and Reflectorch forward simulation. Contains only simulation logic - visualization is in visualization.py.

vipr_reflectometry.flow_models.postprocess.cluster.clustering.simulation.simulate_centroids(app, samples: ndarray, cluster_labels: ndarray, cluster_sizes: List[Tuple[int, int]], spectrum_idx: int = 0, polish_centroids: bool = False) → Tuple[List[Dict] | None, Any, Any, Dict[str, float] | None]¶

Calculate cluster centroids and perform forward simulation.

This is a pure data transformation function that returns simulation results. The orchestrator (clustering.py) handles visualization.

Parameters:

app – VIPR application instance
samples – Parameter samples (num_samples, num_params)
cluster_labels – Cluster assignments
cluster_sizes – List of (label, size) tuples sorted by size
spectrum_idx – Spectrum index (always 0 in single-spectrum mode)

Returns:

Tuple of (centroid_results, q_values, original_data, timing_summary):

- centroid_results: List of dicts with centroid info and simulated curves
- q_values: Q-values for plotting
- original_data: Original experimental data (if available)
- timing_summary: Timing information for centroid polishing

Returns (None, None, None, None) if simulation fails.
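
The cluster_sizes argument and the centroid step can be sketched as follows. This is an illustration under assumptions, not the package's code: it takes the centroid to be the per-cluster mean, sorts sizes in descending order, and skips noise points (-1). `cluster_sizes_and_centroids` is a hypothetical helper.

```python
import numpy as np


def cluster_sizes_and_centroids(samples, cluster_labels):
    """Build (label, size) tuples sorted by size and per-cluster mean centroids."""
    samples = np.asarray(samples, dtype=float)
    labels = np.asarray(cluster_labels)

    # Count members per cluster, ignoring noise.
    uniq, counts = np.unique(labels[labels != -1], return_counts=True)
    cluster_sizes = sorted(
        ((int(label), int(count)) for label, count in zip(uniq, counts)),
        key=lambda pair: pair[1],
        reverse=True,  # largest cluster first (assumed ordering)
    )

    # Centroid = mean of the samples assigned to each cluster (assumption).
    centroids = {
        label: samples[labels == label].mean(axis=0)
        for label, _ in cluster_sizes
    }
    return cluster_sizes, centroids


samples = np.array(
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0], [20.0, 20.0], [99.0, 99.0]]
)
cluster_labels = np.array([1, 1, 1, 0, 0, -1])
cluster_sizes, centroids = cluster_sizes_and_centroids(samples, cluster_labels)
```

Callers of simulate_centroids should check for the (None, None, None, None) result before unpacking centroid data.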

vipr_reflectometry.flow_models.postprocess.cluster.clustering.simulation copy module¶

vipr_reflectometry.flow_models.postprocess.cluster.clustering.simulation2 module¶

vipr_reflectometry.flow_models.postprocess.cluster.clustering.simulation3 module¶

Module contents¶

Clustering Module.

Provides clustering algorithms, orchestration, and visualization for posterior samples.

class vipr_reflectometry.flow_models.postprocess.cluster.clustering.ClusterHook(app: VIPR)¶

Bases: object

Thin adapter between VIPR filter hook and clustering processor.

Responsibilities:

- Register as VIPR filter
- Validate input data
- Delegate to service layer

class vipr_reflectometry.flow_models.postprocess.cluster.clustering.ClusterProcessor(app)¶

Bases: object

Main clustering orchestrator that coordinates the workflow.

Responsibilities:

- Handle batch processing (multiple spectra)
- Coordinate algorithm execution (delegates to algorithms module)
- Coordinate visualization and simulation

This class is a pure orchestrator: it does not implement algorithms itself.

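process(data: Dict[str, Any], method: str = 'gmm', n_components: int | None = 5, min_cluster_size: int = 50, min_samples: int = 10, seed: int | None = None, n_init: int = 10, covariance_type: str = 'full', weight_concentration_prior: float | None = None, weight_concentration_prior_type: str = 'dirichlet_process', benchmark_mode: bool = False, enable_validation_plots: bool = True, enable_cluster_corner_plot: bool = True, enable_cluster_marginals: bool = True, enable_parallel_coordinates: bool = True, enable_centroid_plots: bool = True, enable_interactive_export: bool = True, **_) → Dict[str, Any]¶

Process clustering for a single spectrum.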

Parameters:

data – Prediction results
method – Clustering method (‘hdbscan’ or ‘gmm’)
n_components – Number of GMM components
min_cluster_size – Minimum HDBSCAN cluster size
min_samples – Minimum HDBSCAN samples

Returns:

Enriched data with cluster information

vipr_reflectometry.flow_models.postprocess.cluster.clustering.apply_gmm(samples: ndarray, n_components: int | None = None, logger=None, *, seed: int | None = None, k_max: int = 8, reg_covar: float = 1e-06, n_init: int = 3, covariance_type: str = 'full', whiten: bool = True) → tuple[ndarray, Dict[str, Any]]¶

Pure GMM clustering without pruning or noise detection.

K selection via BIC (if n_components is None)
Hard assignment via gmm.predict()
No -1/noise labels, no min_cluster_size filtering
Optional standardization for stable BIC-based K selection

Parameters:

samples – Parameter samples (N, D)
n_components – Fixed K (None = auto via BIC)
logger – Logger instance
seed – Random seed for reproducible GMM fitting
k_max – Maximum K to test for BIC
reg_covar – Covariance regularization
n_init – Number of GMM initializations
covariance_type – Covariance structure (‘full’, ‘tied’, ‘diag’, ‘spherical’)
whiten – Apply StandardScaler for equal parameter weighting

Returns:

Tuple of (labels, gmm_params), where

labels: Cluster labels array (0 to K-1, never -1)
gmm_params: Dict with ‘weights’, ‘means’, ‘covariances’ (the K Gaussians)

Return type:

tuple[ndarray, Dict[str, Any]]

vipr_reflectometry.flow_models.postprocess.cluster.clustering.apply_hdbscan(samples: ndarray, min_cluster_size: int, min_samples: int, logger=None, *, seed: int | None = None, whiten: bool = True) → ndarray¶

HDBSCAN clustering with optional standardization.

Parameters:

samples – Array of shape (num_samples, num_params)
min_cluster_size – Minimum number of samples per cluster
min_samples – Minimum number of samples for core points
logger – Optional logger for debug messages
seed – Random seed for reproducibility (unused; HDBSCAN is deterministic)
whiten – Apply StandardScaler for equal parameter weighting

Returns:

Array of cluster labels (-1 for noise)

vipr_reflectometry.flow_models.postprocess.cluster.clustering.calculate_validation_metrics(samples: ndarray, cluster_labels: ndarray, logger=None) → Dict[str, Any]¶

Calculate validation metrics on scaled features.

Metrics are calculated on standardized features to match the clustering feature space when whiten=True was used.

Parameters:

samples – Parameter samples (num_samples, num_params)
cluster_labels – Cluster assignments
logger – Optional logger for warnings

Returns:

Dict with validation metrics and quality assessment