vipr_reflectometry.flow_models.postprocess.cluster.clustering package

Subpackages

Submodules

vipr_reflectometry.flow_models.postprocess.cluster.clustering.algorithms module

Clustering Algorithms and Validation Metrics.

Simplified version without excessive scaling/whitening/merging complexity.

vipr_reflectometry.flow_models.postprocess.cluster.clustering.algorithms.apply_bayesian_gmm(samples: ndarray, n_components: int = 10, logger=None, *, seed: int | None = None, weight_concentration_prior: float | None = None, weight_concentration_prior_type: str = 'dirichlet_process', covariance_type: str = 'full', reg_covar: float = 1e-06, n_init: int = 1, whiten: bool = True) tuple[ndarray, Dict[str, Any]]

Bayesian GMM clustering with Dirichlet Process prior. Automatically determines the effective number of components.

Parameters:
  • samples – Parameter samples (N, D)

  • n_components – Upper bound on number of components

  • logger – Logger instance

  • seed – Random seed for reproducible fitting

  • weight_concentration_prior – Dirichlet concentration (None=auto, low=few clusters)

  • weight_concentration_prior_type – ‘dirichlet_process’ or ‘dirichlet_distribution’

  • covariance_type – Covariance structure (‘full’, ‘tied’, ‘diag’, ‘spherical’)

  • reg_covar – Covariance regularization

  • n_init – Number of initializations

  • whiten – Apply StandardScaler for equal parameter weighting

Returns:

Tuple of (labels, gmm_params), where:

  • labels: Cluster labels array (0 to K-1, never -1)

  • gmm_params: Dict with ‘weights’, ‘means’, ‘covariances’ (the K Gaussians)
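The following is a minimal sketch of the idea behind this function, not the package's implementation. It assumes scikit-learn's BayesianGaussianMixture and StandardScaler; the helper name bayesian_gmm_sketch and the relabeling step are illustrative assumptions. It shows how n_components acts as an upper bound while the number of components that actually receive samples can be smaller.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture
from sklearn.preprocessing import StandardScaler


def bayesian_gmm_sketch(samples, n_components=10, seed=0):
    """Hypothetical stand-in for apply_bayesian_gmm (illustration only)."""
    # Standardize so every parameter contributes equally (the whiten=True idea).
    features = StandardScaler().fit_transform(samples)
    model = BayesianGaussianMixture(
        n_components=n_components,  # upper bound, not the final count
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="full",
        reg_covar=1e-6,
        n_init=1,
        max_iter=500,
        random_state=seed,
    )
    raw_labels = model.fit_predict(features)
    # Assumption: relabel the components that received samples to 0..K-1.
    used, labels = np.unique(raw_labels, return_inverse=True)
    return labels, int(used.size)


# Synthetic posterior samples: two well-separated modes in two parameters.
rng = np.random.default_rng(0)
samples = np.vstack([
    rng.normal(loc=-5.0, scale=0.5, size=(200, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(200, 2)),
])
labels, n_effective = bayesian_gmm_sketch(samples)
```

Here n_effective counts only the components that own at least one sample, which is one way to read "effective number of components".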

vipr_reflectometry.flow_models.postprocess.cluster.clustering.algorithms.apply_gmm(samples: ndarray, n_components: int | None = None, logger=None, *, seed: int | None = None, k_max: int = 8, reg_covar: float = 1e-06, n_init: int = 3, covariance_type: str = 'full', whiten: bool = True) tuple[ndarray, Dict[str, Any]]
Pure GMM clustering without pruning or noise detection:
  • K selection via BIC (if n_components is None)

  • Hard assignment via gmm.predict()

  • No -1/noise labels, no min_cluster_size filtering

  • Optional standardization for stable BIC-based K selection

Parameters:
  • samples – Parameter samples (N, D)

  • n_components – Fixed K (None = auto via BIC)

  • logger – Logger instance

  • seed – Random seed for reproducible GMM fitting

  • k_max – Maximum K to test for BIC

  • reg_covar – Covariance regularization

  • n_init – Number of GMM initializations

  • covariance_type – Covariance structure (‘full’, ‘tied’, ‘diag’, ‘spherical’)

  • whiten – Apply StandardScaler for equal parameter weighting

Returns:

Tuple of (labels, gmm_params), where:

  • labels: Cluster labels array (0 to K-1, never -1)

  • gmm_params: Dict with ‘weights’, ‘means’, ‘covariances’ (the K Gaussians)
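BIC-based selection of K can be sketched as follows. This is an illustration under stated assumptions, not the package's code: it assumes scikit-learn's GaussianMixture, and the helper name select_k_by_bic is hypothetical. Each candidate K from 1 to k_max is fitted, the model with the lowest BIC is kept, and hard labels come from predict().

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def select_k_by_bic(samples, k_max=8, seed=0):
    """Hypothetical helper: fit K = 1..k_max and keep the lowest-BIC model."""
    features = StandardScaler().fit_transform(samples)
    best_model, best_bic = None, np.inf
    for k in range(1, k_max + 1):
        candidate = GaussianMixture(
            n_components=k,
            covariance_type="full",
            reg_covar=1e-6,
            n_init=3,
            random_state=seed,
        ).fit(features)
        bic = candidate.bic(features)  # lower is better
        if bic < best_bic:
            best_model, best_bic = candidate, bic
    labels = best_model.predict(features)  # hard assignment, never -1
    # Note: these parameters live in the standardized feature space.
    gmm_params = {
        "weights": best_model.weights_,
        "means": best_model.means_,
        "covariances": best_model.covariances_,
    }
    return labels, gmm_params


rng = np.random.default_rng(1)
samples = np.vstack([
    rng.normal(loc=-4.0, scale=0.5, size=(150, 2)),
    rng.normal(loc=4.0, scale=0.5, size=(150, 2)),
])
labels, gmm_params = select_k_by_bic(samples)
```

Whether the real function reports means and covariances in standardized or original units is not stated in this reference, so check before comparing them to raw parameters.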

vipr_reflectometry.flow_models.postprocess.cluster.clustering.algorithms.apply_hdbscan(samples: ndarray, min_cluster_size: int, min_samples: int, logger=None, *, seed: int | None = None, whiten: bool = True) ndarray

HDBSCAN clustering with optional standardization.

Parameters:
  • samples – Array of shape (num_samples, num_params)

  • min_cluster_size – Minimum number of samples per cluster

  • min_samples – Minimum number of samples for core points

  • logger – Optional logger for debug messages

  • seed – Random seed for reproducibility (unused - HDBSCAN is deterministic)

  • whiten – Apply StandardScaler for equal parameter weighting

Returns:

Array of cluster labels (-1 for noise)
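Because HDBSCAN marks noise with -1, downstream code usually has to separate noise from real clusters. The sketch below uses only NumPy; the helper name summarize_labels and the descending sort order are assumptions for illustration. It builds the kind of (label, size) list that simulate_centroids expects as cluster_sizes.

```python
import numpy as np


def summarize_labels(cluster_labels):
    """Hypothetical helper: cluster sizes (largest first) and noise count."""
    labels = np.asarray(cluster_labels)
    noise_count = int(np.sum(labels == -1))
    # Count members per cluster, ignoring the noise label -1.
    unique, counts = np.unique(labels[labels != -1], return_counts=True)
    cluster_sizes = sorted(
        zip(unique.tolist(), counts.tolist()),
        key=lambda pair: pair[1],
        reverse=True,  # assumption: largest cluster first
    )
    return cluster_sizes, noise_count


example_labels = np.array([0, 0, 1, -1, 1, 1, 2, -1, 1])
cluster_sizes, noise_count = summarize_labels(example_labels)
# cluster_sizes == [(1, 4), (0, 2), (2, 1)], noise_count == 2
```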

vipr_reflectometry.flow_models.postprocess.cluster.clustering.algorithms.calculate_validation_metrics(samples: ndarray, cluster_labels: ndarray, logger=None) Dict[str, Any]

Calculate validation metrics on scaled features.

Metrics are calculated on standardized features to match the clustering feature space when whiten=True was used.

Parameters:
  • samples – Parameter samples (num_samples, num_params)

  • cluster_labels – Cluster assignments

  • logger – Optional logger for warnings

Returns:

Dict with validation metrics and quality assessment
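This reference does not list which metrics are computed. As one plausible example, the sketch below evaluates the silhouette score on standardized features, assuming scikit-learn; the helper name silhouette_on_scaled and the handling of noise points are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_on_scaled(samples, cluster_labels):
    """Hypothetical helper: silhouette score in the standardized space."""
    labels = np.asarray(cluster_labels)
    # Scale first so the metric uses the same space as whiten=True clustering.
    features = StandardScaler().fit_transform(samples)
    keep = labels != -1  # assumption: noise points are excluded
    if np.unique(labels[keep]).size < 2:
        return None  # the silhouette needs at least two clusters
    return float(silhouette_score(features[keep], labels[keep]))


rng = np.random.default_rng(2)
samples = np.vstack([
    rng.normal(loc=-5.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(100, 2)),
])
labels = np.repeat([0, 1], 100)
labels[0] = -1  # pretend one sample was flagged as noise
score = silhouette_on_scaled(samples, labels)
```

A score close to 1 indicates compact, well-separated clusters; values near 0 indicate overlapping ones.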

vipr_reflectometry.flow_models.postprocess.cluster.clustering.clustering module

Clustering Module - Orchestration Layer.

Pure orchestrator that coordinates the clustering workflow. Delegates algorithm execution to the algorithms module.

class vipr_reflectometry.flow_models.postprocess.cluster.clustering.clustering.ClusterProcessor(app)

Bases: object

Main clustering orchestrator that coordinates the workflow.

Responsibilities:

  • Handle batch processing (multiple spectra)

  • Coordinate algorithm execution (delegates to algorithms module)

  • Coordinate visualization and simulation

This class is a pure orchestrator; it does not implement algorithms itself.

process(data: Dict[str, Any], method: str = 'gmm', n_components: int | None = 5, min_cluster_size: int = 50, min_samples: int = 10, seed: int | None = None, n_init: int = 10, covariance_type: str = 'full', weight_concentration_prior: float | None = None, weight_concentration_prior_type: str = 'dirichlet_process', **_) Dict[str, Any]

Process clustering for a single spectrum.

Parameters:
  • data – Prediction results

  • method – Clustering method (‘hdbscan’ or ‘gmm’)

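  • n_components – Number of GMM components

  • min_cluster_size – Minimum number of samples per HDBSCAN cluster

  • min_samples – Minimum number of samples for HDBSCAN core points

  • seed – Random seed for reproducible fitting

  • n_init – Number of GMM initializations

  • covariance_type – GMM covariance structure (‘full’, ‘tied’, ‘diag’, ‘spherical’)

  • weight_concentration_prior – Dirichlet concentration for the Bayesian GMM (None=auto, low=few clusters)

  • weight_concentration_prior_type – ‘dirichlet_process’ or ‘dirichlet_distribution’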

Returns:

Enriched data with cluster information

vipr_reflectometry.flow_models.postprocess.cluster.clustering.hook module

VIPR Filter Adapter for Clustering.

Thin adapter between VIPR filter hook and clustering service.

class vipr_reflectometry.flow_models.postprocess.cluster.clustering.hook.ClusterHook(app: VIPR)

Bases: object

Thin adapter between VIPR filter hook and clustering processor.

Responsibilities:

  • Register as VIPR filter

  • Validate input data

  • Delegate to service layer

class vipr_reflectometry.flow_models.postprocess.cluster.clustering.hook.ClusterHookParams(*, method: str = 'gmm', n_components: int | None = 2, min_cluster_size: int = 50, min_samples: int = 10, seed: int = 42, n_init: int = 10, covariance_type: str = 'full', weight_concentration_prior: float | None = None, weight_concentration_prior_type: str = 'dirichlet_process', polish_centroids: bool = False)

Bases: BaseModel

Configuration for posterior-sample clustering.

covariance_type: str
method: str
min_cluster_size: int
min_samples: int
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

n_components: int | None
n_init: int
polish_centroids: bool
seed: int
weight_concentration_prior: float | None
weight_concentration_prior_type: str
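The setting model_config = {'extra': 'forbid'} means that unknown keys are rejected rather than silently ignored. The sketch below demonstrates this behaviour with a reduced, hypothetical model (ExampleParams is not part of the package) and assumes pydantic v2.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError


class ExampleParams(BaseModel):
    """Hypothetical, reduced stand-in for ClusterHookParams."""

    model_config = ConfigDict(extra="forbid")

    method: str = "gmm"
    n_components: Optional[int] = 2
    seed: int = 42


# Known fields are accepted; omitted fields fall back to their defaults.
ok = ExampleParams(method="hdbscan")

# A misspelled or unknown field raises instead of being dropped.
try:
    ExampleParams(method="gmm", n_clusters=3)
    rejected = False
except ValidationError:
    rejected = True
```

Forbidding extra fields catches configuration typos early, for example n_clusters instead of n_components.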

vipr_reflectometry.flow_models.postprocess.cluster.clustering.simulation module

Forward Simulation Module.

Handles centroid calculation and Reflectorch forward simulation. Contains only simulation logic - visualization is in visualization.py.

vipr_reflectometry.flow_models.postprocess.cluster.clustering.simulation.simulate_centroids(app, samples: ndarray, cluster_labels: ndarray, cluster_sizes: List[Tuple[int, int]], spectrum_idx: int = 0, polish_centroids: bool = False) Tuple[List[Dict] | None, Any, Any]

Calculate cluster centroids and perform forward simulation.

This is a pure data transformation function that returns simulation results. The orchestrator (clustering.py) handles visualization.

Parameters:
  • app – VIPR application instance

  • samples – Parameter samples (num_samples, num_params)

  • cluster_labels – Cluster assignments

  • cluster_sizes – List of (label, size) tuples sorted by size

  • spectrum_idx – Spectrum index (always 0 in single-spectrum mode)

Returns:

Tuple of (centroid_results, q_values, original_data):

  • centroid_results: List of dicts with centroid info and simulated curves

  • q_values: Q-values for plotting

  • original_data: Original experimental data (if available)

Returns (None, None, None) if simulation fails.
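The centroid step can be sketched with NumPy alone. This illustration assumes that a centroid is the per-parameter mean of a cluster's samples and that noise (label -1) is skipped; both are assumptions, as are the helper name cluster_centroids and the dict keys. The Reflectorch forward simulation and the effect of polish_centroids are not shown.

```python
import numpy as np


def cluster_centroids(samples, cluster_labels, cluster_sizes):
    """Hypothetical helper: mean parameter vector per cluster."""
    samples = np.asarray(samples, dtype=float)
    labels = np.asarray(cluster_labels)
    centroids = []
    for label, size in cluster_sizes:
        if label == -1:
            continue  # assumption: noise points get no centroid
        members = samples[labels == label]
        centroids.append({
            "label": label,
            "size": size,
            "centroid": members.mean(axis=0),  # assumption: mean, not median
        })
    return centroids


samples = np.array([[1.0, 10.0], [3.0, 30.0], [100.0, 200.0], [0.0, 0.0]])
cluster_labels = np.array([0, 0, 1, -1])
cluster_sizes = [(0, 2), (1, 1)]
centroids = cluster_centroids(samples, cluster_labels, cluster_sizes)
# centroids[0]["centroid"] is [2.0, 20.0]; centroids[1]["centroid"] is [100.0, 200.0]
```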

vipr_reflectometry.flow_models.postprocess.cluster.clustering.simulation copy module

vipr_reflectometry.flow_models.postprocess.cluster.clustering.simulation2 module

Forward Simulation Module.

Handles centroid calculation and Reflectorch forward simulation. Contains only simulation logic - visualization is in visualization.py.

vipr_reflectometry.flow_models.postprocess.cluster.clustering.simulation2.simulate_centroids(app, samples: ndarray, cluster_labels: ndarray, cluster_sizes: List[Tuple[int, int]], spectrum_idx: int = 0, polish_centroids: bool = False) Tuple[List[Dict] | None, Any, Any]

Calculate cluster centroids and perform forward simulation.

This is a pure data transformation function that returns simulation results. The orchestrator (clustering.py) handles visualization.

Parameters:
  • app – VIPR application instance

  • samples – Parameter samples (num_samples, num_params)

  • cluster_labels – Cluster assignments

  • cluster_sizes – List of (label, size) tuples sorted by size

  • spectrum_idx – Spectrum index (always 0 in single-spectrum mode)

Returns:

Tuple of (centroid_results, q_values, original_data):

  • centroid_results: List of dicts with centroid info and simulated curves

  • q_values: Q-values for plotting

  • original_data: Original experimental data (if available)

Returns (None, None, None) if simulation fails.

vipr_reflectometry.flow_models.postprocess.cluster.clustering.simulation3 module

Forward Simulation Module.

Handles centroid calculation and Reflectorch forward simulation. Contains only simulation logic - visualization is in visualization.py.

vipr_reflectometry.flow_models.postprocess.cluster.clustering.simulation3.simulate_centroids(app, samples: ndarray, cluster_labels: ndarray, cluster_sizes: List[Tuple[int, int]], spectrum_idx: int = 0, polish_centroids: bool = False) Tuple[List[Dict] | None, Any, Any]

Calculate cluster centroids and perform forward simulation.

This is a pure data transformation function that returns simulation results. The orchestrator (clustering.py) handles visualization.

Parameters:
  • app – VIPR application instance

  • samples – Parameter samples (num_samples, num_params)

  • cluster_labels – Cluster assignments

  • cluster_sizes – List of (label, size) tuples sorted by size

  • spectrum_idx – Spectrum index (always 0 in single-spectrum mode)

Returns:

Tuple of (centroid_results, q_values, original_data):

  • centroid_results: List of dicts with centroid info and simulated curves

  • q_values: Q-values for plotting

  • original_data: Original experimental data (if available)

Returns (None, None, None) if simulation fails.

Module contents

Clustering Module.

Provides clustering algorithms, orchestration, and visualization for posterior samples.

class vipr_reflectometry.flow_models.postprocess.cluster.clustering.ClusterHook(app: VIPR)

Bases: object

Thin adapter between VIPR filter hook and clustering processor.

Responsibilities:

  • Register as VIPR filter

  • Validate input data

  • Delegate to service layer

class vipr_reflectometry.flow_models.postprocess.cluster.clustering.ClusterProcessor(app)

Bases: object

Main clustering orchestrator that coordinates the workflow.

Responsibilities:

  • Handle batch processing (multiple spectra)

  • Coordinate algorithm execution (delegates to algorithms module)

  • Coordinate visualization and simulation

This class is a pure orchestrator; it does not implement algorithms itself.

process(data: Dict[str, Any], method: str = 'gmm', n_components: int | None = 5, min_cluster_size: int = 50, min_samples: int = 10, seed: int | None = None, n_init: int = 10, covariance_type: str = 'full', weight_concentration_prior: float | None = None, weight_concentration_prior_type: str = 'dirichlet_process', **_) Dict[str, Any]

Process clustering for a single spectrum.

Parameters:
  • data – Prediction results

  • method – Clustering method (‘hdbscan’ or ‘gmm’)

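  • n_components – Number of GMM components

  • min_cluster_size – Minimum number of samples per HDBSCAN cluster

  • min_samples – Minimum number of samples for HDBSCAN core points

  • seed – Random seed for reproducible fitting

  • n_init – Number of GMM initializations

  • covariance_type – GMM covariance structure (‘full’, ‘tied’, ‘diag’, ‘spherical’)

  • weight_concentration_prior – Dirichlet concentration for the Bayesian GMM (None=auto, low=few clusters)

  • weight_concentration_prior_type – ‘dirichlet_process’ or ‘dirichlet_distribution’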

Returns:

Enriched data with cluster information

vipr_reflectometry.flow_models.postprocess.cluster.clustering.apply_gmm(samples: ndarray, n_components: int | None = None, logger=None, *, seed: int | None = None, k_max: int = 8, reg_covar: float = 1e-06, n_init: int = 3, covariance_type: str = 'full', whiten: bool = True) tuple[ndarray, Dict[str, Any]]
Pure GMM clustering without pruning or noise detection:
  • K selection via BIC (if n_components is None)

  • Hard assignment via gmm.predict()

  • No -1/noise labels, no min_cluster_size filtering

  • Optional standardization for stable BIC-based K selection

Parameters:
  • samples – Parameter samples (N, D)

  • n_components – Fixed K (None = auto via BIC)

  • logger – Logger instance

  • seed – Random seed for reproducible GMM fitting

  • k_max – Maximum K to test for BIC

  • reg_covar – Covariance regularization

  • n_init – Number of GMM initializations

  • covariance_type – Covariance structure (‘full’, ‘tied’, ‘diag’, ‘spherical’)

  • whiten – Apply StandardScaler for equal parameter weighting

Returns:

Tuple of (labels, gmm_params), where:

  • labels: Cluster labels array (0 to K-1, never -1)

  • gmm_params: Dict with ‘weights’, ‘means’, ‘covariances’ (the K Gaussians)

vipr_reflectometry.flow_models.postprocess.cluster.clustering.apply_hdbscan(samples: ndarray, min_cluster_size: int, min_samples: int, logger=None, *, seed: int | None = None, whiten: bool = True) ndarray

HDBSCAN clustering with optional standardization.

Parameters:
  • samples – Array of shape (num_samples, num_params)

  • min_cluster_size – Minimum number of samples per cluster

  • min_samples – Minimum number of samples for core points

  • logger – Optional logger for debug messages

  • seed – Random seed for reproducibility (unused - HDBSCAN is deterministic)

  • whiten – Apply StandardScaler for equal parameter weighting

Returns:

Array of cluster labels (-1 for noise)

vipr_reflectometry.flow_models.postprocess.cluster.clustering.calculate_validation_metrics(samples: ndarray, cluster_labels: ndarray, logger=None) Dict[str, Any]

Calculate validation metrics on scaled features.

Metrics are calculated on standardized features to match the clustering feature space when whiten=True was used.

Parameters:
  • samples – Parameter samples (num_samples, num_params)

  • cluster_labels – Cluster assignments

  • logger – Optional logger for warnings

Returns:

Dict with validation metrics and quality assessment