User Guide

Pipeline Overview

The Roman Galaxy Redshift Survey (GRS) covariance mocks pipeline generates mock galaxy catalogs from AbacusSummit N-body simulations as part of the Roman GRS Project Infrastructure Team (PIT) analysis framework. The pipeline consists of modular components that handle different aspects of the mock generation process and runs on the Perlmutter system at NERSC.

Architecture

The pipeline consists of five core modules:

  • data_loader - Halo catalog loading and filtering with MPI slab decomposition

  • galaxy_generator - Galaxy population modeling using rgrspit_diffsky

  • hdf5_writer - Parallel HDF5 output operations for data storage

  • mpi_setup - MPI and JAX initialization for distributed computing

  • utils - Common utility functions for path validation and filename generation

The production management system adds:

  • production_config - YAML configuration validation and hierarchical inheritance

  • production_manager - SQLite job tracking and SLURM array orchestration
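
Hierarchical inheritance in a configuration system is commonly implemented as a recursive dictionary merge, in which a production file overrides only the keys it sets and inherits the rest. The sketch below illustrates that general technique under stated assumptions. It is not the code in production_config, and `merge_config` and the key names (`science`, `resources`, and their entries) are hypothetical.

```python
from copy import deepcopy


def merge_config(base: dict, override: dict) -> dict:
    """Return a copy of `base` with `override` merged in recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars, lists, and new keys replace the inherited value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults and a production-level override (names are illustrative).
defaults = {
    "science": {"redshift": "z1.100", "lgmp_min": 10.0},
    "resources": {"nodes": 1, "time": "02:00:00"},
}
alpha = {"science": {"lgmp_min": 11.0}, "resources": {"nodes": 4}}

config = merge_config(defaults, alpha)
```

Because the merge works on copies, the defaults stay unchanged and can be reused as the parent of several productions.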

Configuration

Default Configuration Constants

The pipeline uses the following default configuration constants defined in covariance_mocks.__init__:

  • CURRENT_PHASE = "ph3000" - Default phase identifier

  • CURRENT_REDSHIFT = "z1.100" - Default redshift string

  • CURRENT_Z_OBS = 1.1 - Observational redshift value

  • LGMP_MIN = 10.0 - Minimum log10 halo mass threshold

  • SIMULATION_BOX = "AbacusSummit_small_c000" - Default simulation box

These can be overridden by modifying the constants or passing different values to the pipeline functions.
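
One way to picture overriding by passing different values is a small settings object whose fields default to the documented constants. This is a sketch only: `MockSettings` and `redshift_string` are hypothetical names that do not come from covariance_mocks, and the actual pipeline functions may take these values as plain arguments instead.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MockSettings:
    """Hypothetical container whose defaults mirror the documented constants."""
    phase: str = "ph3000"
    redshift: str = "z1.100"
    z_obs: float = 1.1
    lgmp_min: float = 10.0
    simulation_box: str = "AbacusSummit_small_c000"


def redshift_string(z_obs: float) -> str:
    """Format a redshift in the style of the default string, e.g. 1.1 -> 'z1.100'."""
    return f"z{z_obs:.3f}"


defaults = MockSettings()
# Override only what differs; every other field keeps its default.
custom = replace(defaults, phase="ph3001", lgmp_min=11.0)
```

Deriving the redshift string from the numeric value, as `redshift_string` does, is one way to keep the two redshift settings consistent when only one of them is overridden.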

Data Processing

Halo Catalog Loading

The pipeline loads AbacusSummit halo catalogs and applies several processing steps:

  1. Path Construction: Builds full paths to halo catalog directories

  2. Mass Filtering: Applies minimum mass threshold (default: log10(M) >= 10.0)

  3. Coordinate Transformation: Converts positions from [-Lbox/2, Lbox/2] to [0, Lbox]

  4. Slab Decomposition: Distributes halos across MPI ranks by y-coordinate

  5. Test Mode Support: Optional limitation to N halos for testing
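
Steps 2 through 5 can be sketched in a few lines. The example uses plain Python lists to stay dependency-free, whereas a production implementation would presumably use array operations. The function name, its arguments, equal-width slabs, and the per-rank application of the test-mode limit are all assumptions made for illustration, not the data_loader interface.

```python
import math


def prepare_halos(positions, masses, lbox, rank, n_ranks, lgmp_min=10.0, n_gen=None):
    """Filter, shift, and select the halos that fall in one rank's y-slab.

    positions: list of (x, y, z) tuples in [-lbox/2, lbox/2]
    masses:    list of halo masses in linear units
    """
    slab_width = lbox / n_ranks
    selected = []
    for (x, y, z), mass in zip(positions, masses):
        # Mass filtering: keep halos with log10(M) >= lgmp_min.
        if math.log10(mass) < lgmp_min:
            continue
        # Coordinate transformation: [-lbox/2, lbox/2] -> [0, lbox].
        # The modulo wraps a halo sitting exactly at +lbox/2 back to 0.
        x, y, z = ((c + lbox / 2) % lbox for c in (x, y, z))
        # Slab decomposition: the owning rank follows from the y-coordinate.
        slab = min(int(y // slab_width), n_ranks - 1)
        if slab == rank:
            selected.append((x, y, z, mass))
    # Test mode: optionally keep only the first n_gen halos.
    return selected if n_gen is None else selected[:n_gen]


lbox = 500.0
positions = [(-250.0, -250.0, 0.0), (0.0, -10.0, 0.0), (0.0, 10.0, 0.0), (100.0, 249.0, 0.0)]
masses = [1e11, 1e9, 1e12, 1e13]

rank0 = prepare_halos(positions, masses, lbox, rank=0, n_ranks=2)
rank1 = prepare_halos(positions, masses, lbox, rank=1, n_ranks=2)
```

With two ranks, the first halo lands in slab 0, the second is removed by the mass cut, and the last two land in slab 1.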

Galaxy Generation

Galaxies are generated with the rgrspit_diffsky package:

  1. Reproducible Random Seeds: Fixed random key (0) ensures consistent results

  2. Halo Population: Populates halos with central and satellite galaxies

  3. Stellar Mass Assignment: Assigns stellar masses based on halo properties

  4. Subhalo Modeling: Includes synthetic subhalo population for satellites

Output Management

The pipeline supports both single-process and parallel HDF5 output:

Single Process:

  • Direct HDF5 file creation

  • All data written by single rank

Parallel MPI:

  • Collective I/O operations

  • Coordinated writes across all ranks

  • Efficient handling of large datasets
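
A coordinated parallel write usually requires each rank to know where its rows start in the shared dataset. A common approach, assumed here rather than taken from hdf5_writer, is an exclusive prefix sum over the per-rank galaxy counts. The sketch computes the offsets serially. In an MPI program, the counts would typically be gathered from all ranks first.

```python
from itertools import accumulate


def write_offsets(counts_per_rank):
    """Return (offsets, total) for ranks writing contiguous slices of one dataset."""
    # Exclusive prefix sum: rank r starts after everything owned by ranks < r.
    offsets = list(accumulate(counts_per_rank, initial=0))[:-1]
    return offsets, sum(counts_per_rank)


counts = [120, 0, 75, 305]  # illustrative galaxy counts on ranks 0..3
offsets, total = write_offsets(counts)

# Half-open [start, stop) row ranges, one per rank.
slices = [(start, start + n) for start, n in zip(offsets, counts)]
```

A rank with no galaxies gets an empty range but can still take part in a collective operation, which avoids the deadlock that would occur if it skipped the call.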

HPC Integration

Single Mock SLURM Job Submission

The pipeline runs in HPC environments that use SLURM job scheduling. The following job script generates a single mock:

#!/bin/bash
#SBATCH --job-name=covariance_mocks
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=02:00:00
#SBATCH --output=mock_generation_%j.out

source scripts/load_env.sh

srun python scripts/generate_single_mock.py nersc /output/path

Production SLURM Array Jobs

For large-scale productions, the system uses SLURM array jobs:

#!/bin/bash
#SBATCH --job-name=production_mock_gen
#SBATCH --array=1-500
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --time=02:00:00
#SBATCH --output=production_logs/job_%A_%a.out

source scripts/load_env.sh

# Production manager handles job parameters based on array index
python scripts/run_production.py execute my_production $SLURM_ARRAY_TASK_ID
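
The following sketch shows how an array index could be turned into job parameters. The mapping from index to phase, the output filename pattern, and `job_for_task` itself are hypothetical. The production manager's actual lookup is not described in this guide and may instead read the parameters from its job database.

```python
import os


def job_for_task(task_id: int, first_phase: int = 3000) -> dict:
    """Map a 1-based SLURM array index to hypothetical job parameters."""
    if task_id < 1:
        raise ValueError("array indices in this example start at 1")
    phase = f"ph{first_phase + task_id - 1}"
    return {"task_id": task_id, "phase": phase, "output": f"mock_{phase}.hdf5"}


# SLURM sets SLURM_ARRAY_TASK_ID inside each array task; fall back to 1 elsewhere.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
job = job_for_task(task_id)
```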

MPI Scaling

The pipeline scales across multiple nodes:

  • Slab Decomposition: Halos distributed by spatial coordinates

  • Independent Processing: Each rank processes its assigned halos

  • Collective Output: Coordinated parallel HDF5 writes

  • Memory Usage: Only loads data needed for each rank’s slab

Production Management

For large-scale productions (thousands of mock generation jobs), use the production management system, which follows a three-stage workflow:

Three-Stage Workflow

Stage 1: Initialize Production

# Create production structure and validate configuration
python scripts/run_production.py init alpha config/productions/alpha.yaml

This creates the production directory structure:

/productions/alpha_v1.0/
├── catalogs/           # Generated HDF5 catalogs
├── scripts/            # SLURM scripts (generated in Stage 2)
├── logs/               # Job execution logs
├── production.yaml     # Production configuration
└── production.db       # SQLite job tracking database
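
The directory portion of this layout can be reproduced with pathlib. The sketch below builds it in a temporary directory rather than under /productions. `create_layout` is a hypothetical helper, not the function behind the init command, and it leaves out production.yaml and production.db.

```python
import tempfile
from pathlib import Path


def create_layout(root: Path, name: str, version: str) -> Path:
    """Create <root>/<name>_<version>/ with the documented subdirectories."""
    production_dir = root / f"{name}_{version}"
    for sub in ("catalogs", "scripts", "logs"):
        # exist_ok makes the call safe to repeat on an existing production.
        (production_dir / sub).mkdir(parents=True, exist_ok=True)
    return production_dir


# Build the tree in a scratch directory that is removed afterwards.
with tempfile.TemporaryDirectory() as scratch:
    production_dir = create_layout(Path(scratch), "alpha", "v1.0")
    created = sorted(p.name for p in production_dir.iterdir())
```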

Stage 2: Generate SLURM Scripts (Optional)

# Generate scripts for inspection before submission
python scripts/run_production.py stage alpha

This creates SLURM job scripts in the scripts/ directory that can be reviewed before submission.

Stage 3: Submit Jobs

# Submit pre-generated scripts to SLURM
python scripts/run_production.py submit alpha

# Monitor progress
python scripts/run_production.py status alpha

# Retry failed jobs
python scripts/run_production.py retry alpha
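
Status reporting and retries map naturally onto queries against the job tracking database. The sketch below uses an in-memory SQLite database with a hypothetical `jobs` table. The real schema of production.db and the state names it uses are not documented here and may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for production.db
conn.execute(
    "CREATE TABLE jobs ("
    "job_id INTEGER PRIMARY KEY, "
    "status TEXT NOT NULL, "
    "attempts INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO jobs (job_id, status, attempts) VALUES (?, ?, ?)",
    [(1, "completed", 1), (2, "failed", 1), (3, "running", 1), (4, "failed", 2)],
)


def status_summary(conn):
    """Count jobs per state, as a status report might."""
    rows = conn.execute("SELECT status, COUNT(*) FROM jobs GROUP BY status")
    return dict(rows.fetchall())


def requeue_failed(conn):
    """Mark failed jobs as pending again and return how many were reset."""
    cursor = conn.execute("UPDATE jobs SET status = 'pending' WHERE status = 'failed'")
    conn.commit()
    return cursor.rowcount


before = status_summary(conn)
n_requeued = requeue_failed(conn)
after = status_summary(conn)
```

Keeping an attempt counter alongside the state, as in this hypothetical table, would also allow a retry limit to be enforced.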

Directory Structure

Productions follow a fixed directory layout:

  • Production naming: /productions/<name>_<version>/ (e.g., /productions/alpha_v1.0/)

  • No redundant prefixes: The directory name consists only of the production name and version

  • Organized subdirectories: Separate directories for catalogs, scripts, logs, and metadata

  • Job tracking: SQLite database maintains job state and execution history

Troubleshooting

Common Issues

Environment Problems:

  • Ensure source scripts/load_env.sh is run before execution

  • Verify CONDA_ENV environment variable is set

  • Check that all required modules are loaded

MPI Issues:

  • Verify MPI implementation is available (OpenMPI/MPICH)

  • Check that h5py is compiled with parallel HDF5 support

  • Ensure consistent JAX configuration across ranks

Memory Issues:

  • Large halo catalogs may require more memory per rank

  • Consider reducing the number of MPI ranks per node

  • Use test mode (n_gen parameter) for smaller datasets

File I/O Problems:

  • Verify write permissions to output directory

  • Check available disk space

  • Ensure parallel file system supports concurrent writes
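
Several of these checks can be automated before a job is submitted. The sketch below is a hypothetical pre-flight helper, not part of the pipeline. It relies on `h5py.get_config().mpi`, which reports whether h5py was built against parallel HDF5, and treats a missing h5py installation as a problem in its own right. The 1 GB free-space threshold is an arbitrary placeholder.

```python
import os
import shutil


def preflight(output_dir, environ=os.environ, min_free_gb=1.0):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if not environ.get("CONDA_ENV"):
        problems.append("CONDA_ENV is not set; run 'source scripts/load_env.sh' first")
    if not os.path.isdir(output_dir):
        problems.append(f"output directory does not exist: {output_dir}")
    else:
        if not os.access(output_dir, os.W_OK):
            problems.append(f"no write permission for {output_dir}")
        free_gb = shutil.disk_usage(output_dir).free / 1e9
        if free_gb < min_free_gb:
            problems.append(f"only {free_gb:.1f} GB free in {output_dir}")
    try:
        import h5py  # imported lazily so the other checks work without it
        if not h5py.get_config().mpi:
            problems.append("h5py was built without parallel (MPI) HDF5 support")
    except ImportError:
        problems.append("h5py is not installed in this environment")
    return problems


# Example call with an empty environment and a directory that should not exist.
issues = preflight("/path/that/does/not/exist", environ={})
```

Returning the full list of problems, instead of stopping at the first one, lets all misconfigurations be fixed in a single pass.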

Performance Optimization

MPI Configuration:

  • Use appropriate number of ranks per node based on memory requirements

  • Consider NUMA topology for optimal performance

  • Test different slab decomposition strategies

JAX Optimization:

  • Enable GPU acceleration when available

  • Configure JAX memory allocation settings

  • Use appropriate precision settings (float32 vs float64)

I/O Configuration:

  • Use parallel file systems (Lustre, GPFS)

  • Configure HDF5 chunking and compression

  • Consider collective I/O vs independent writes
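
JAX reads its memory and precision settings from environment variables at startup, so they need to be in place before `import jax` runs. The sketch below sets two of them. `configure_jax_env` is a hypothetical helper, the values shown are examples rather than the pipeline's settings, and whether mpi_setup already handles this is not stated in this guide.

```python
import os

# Example values only; suitable settings depend on the node and the workload.
JAX_ENV_DEFAULTS = {
    # Allocate GPU memory on demand instead of reserving a large share up front,
    # which can matter when several MPI ranks share one GPU.
    "XLA_PYTHON_CLIENT_PREALLOCATE": "false",
    # Use float64 instead of JAX's default float32.
    "JAX_ENABLE_X64": "True",
}


def configure_jax_env(environ=os.environ, overrides=None):
    """Apply defaults without clobbering values the job script already exported."""
    settings = {**JAX_ENV_DEFAULTS, **(overrides or {})}
    for name, value in settings.items():
        environ.setdefault(name, value)
    return {name: environ[name] for name in settings}


# Demonstrated on a plain dict; pass nothing to modify the real environment.
applied = configure_jax_env(environ={"JAX_ENABLE_X64": "False"})
```

Using `setdefault` means a value exported in the SLURM script takes precedence, so individual jobs can still opt out of a setting.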