User Guide

Pipeline Overview

The Roman Galaxy Redshift Survey (GRS) covariance mocks pipeline generates mock galaxy catalogs from AbacusSummit N-body simulations as part of the Roman GRS Project Infrastructure Team (PIT) analysis framework. The pipeline consists of modular components that handle different aspects of the mock generation process and runs on the Perlmutter system at NERSC.

Architecture

The pipeline consists of five core modules:

data_loader - Halo catalog loading and filtering with MPI slab decomposition
galaxy_generator - Galaxy population modeling using rgrspit_diffsky
hdf5_writer - Parallel HDF5 output operations for data storage
mpi_setup - MPI and JAX initialization for distributed computing
utils - Common utility functions for path validation and filename generation

The production management system adds:

production_config - YAML configuration validation and hierarchical inheritance
production_manager - SQLite job tracking and SLURM array orchestration
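
The hierarchical inheritance handled by production_config can be pictured as a recursive dictionary merge in which a production file overrides only the keys it sets. The sketch below illustrates that general technique under stated assumptions; it is not the module's actual code, and merge_config together with every key shown is hypothetical.

```python
from copy import deepcopy


def merge_config(base, override):
    """Return a new dict where `override` values replace or extend `base`.

    Nested dicts are merged key by key; any other value is replaced outright.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults and a production file that overrides two settings.
defaults = {
    "slurm": {"nodes": 1, "ntasks_per_node": 32, "time": "02:00:00"},
    "science": {"redshift": "z1.100"},
}
alpha = {"slurm": {"time": "04:00:00"}, "version": "v1.0"}

resolved = merge_config(defaults, alpha)
```

Because the merge works on copies, the defaults remain unchanged and can be reused as the parent of several productions.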

Configuration

Default Configuration Constants

The pipeline uses the following default configuration constants defined in covariance_mocks.__init__:

CURRENT_PHASE = "ph3000" - Default phase identifier
CURRENT_REDSHIFT = "z1.100" - Default redshift string
CURRENT_Z_OBS = 1.1 - Observational redshift value
LGMP_MIN = 10.0 - Minimum log10 halo mass threshold
SIMULATION_BOX = "AbacusSummit_small_c000" - Default simulation box

These can be overridden by modifying the constants or passing different values to the pipeline functions.
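
Passing values per call is usually easier to track than editing module-level constants. As a minimal sketch of that pattern, the block below groups the documented defaults in a frozen dataclass and overrides one field for a single run. MockSettings and its field names are hypothetical and do not reflect the signatures of the pipeline functions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MockSettings:
    """Hypothetical container mirroring the documented default constants."""
    phase: str = "ph3000"
    redshift: str = "z1.100"
    z_obs: float = 1.1
    lgmp_min: float = 10.0
    simulation_box: str = "AbacusSummit_small_c000"


defaults = MockSettings()
# Override only what differs for this run; the defaults stay untouched.
high_mass_run = replace(defaults, lgmp_min=11.0)
```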

Data Processing

Halo Catalog Loading

The pipeline loads AbacusSummit halo catalogs and applies several processing steps:

Path Construction: Builds full paths to halo catalog directories
Mass Filtering: Applies minimum mass threshold (default: log10(M) >= 10.0)
Coordinate Transformation: Converts positions from [-Lbox/2, Lbox/2] to [0, Lbox]
Slab Decomposition: Distributes halos across MPI ranks by y-coordinate
Test Mode Support: Optional limitation to N halos for testing
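
The filtering, shifting, and slab assignment steps above can be sketched in plain Python as follows. This is an illustration rather than the data_loader implementation: the tuple layout, the function name select_slab_halos, the equal-width slabs, and the choice to apply the test-mode limit per rank are all assumptions, and production code would presumably operate on arrays instead of Python lists.

```python
def select_slab_halos(halos, lbox, rank, n_ranks, lgmp_min=10.0, n_gen=None):
    """Filter, shift, and slab-select halos for one MPI rank.

    `halos` is an iterable of (x, y, z, log10_mass) tuples with positions
    in [-lbox/2, lbox/2]. Returns the halos owned by `rank`, with positions
    shifted to [0, lbox].
    """
    slab_width = lbox / n_ranks
    selected = []
    for x, y, z, lgm in halos:
        if lgm < lgmp_min:  # mass filtering
            continue
        # Coordinate transformation: [-lbox/2, lbox/2] -> [0, lbox].
        x, y, z = x + lbox / 2, y + lbox / 2, z + lbox / 2
        # Slab decomposition along y; clamp so y == lbox lands in the last slab.
        owner = min(int(y // slab_width), n_ranks - 1)
        if owner == rank:
            selected.append((x, y, z, lgm))
    # Test mode: keep at most n_gen halos (assumed here to apply per rank).
    return selected if n_gen is None else selected[:n_gen]


# Illustrative halos in a box of side 500 split across four ranks.
halos = [
    (-250.0, -250.0, 0.0, 12.0),
    (0.0, 0.0, 0.0, 9.5),       # below the mass threshold
    (100.0, 249.0, -100.0, 11.0),
    (10.0, 250.0, 10.0, 10.0),  # sits exactly on the upper box edge in y
]
per_rank = [select_slab_halos(halos, 500.0, r, 4) for r in range(4)]
```

In this example rank 0 keeps one halo, rank 3 keeps two, and the halo below the mass threshold is dropped, so every surviving halo is owned by exactly one rank.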

Galaxy Generation

Galaxy population modeling uses the rgrspit_diffsky package:

Reproducible Random Seeds: Fixed random key (0) ensures consistent results
Halo Population: Populates halos with central and satellite galaxies
Stellar Mass Assignment: Assigns stellar masses based on halo properties
Subhalo Modeling: Includes synthetic subhalo population for satellites

Output Management

The pipeline supports both single-process and parallel HDF5 output:

Single Process:
- Direct HDF5 file creation
- All data written by a single rank

Parallel MPI:
- Collective I/O operations
- Coordinated writes across all ranks
- Efficient handling of large datasets
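
Coordinated writes generally require each rank to know which rows of the shared dataset it owns. A common approach, assumed here rather than confirmed for hdf5_writer, is to gather the per-rank row counts and take an exclusive prefix sum. The sketch below shows only that bookkeeping; write_slices is a hypothetical helper and the counts are made up.

```python
from itertools import accumulate


def write_slices(counts_per_rank):
    """Map per-rank row counts to non-overlapping (start, stop) slices."""
    stops = list(accumulate(counts_per_rank))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


# Hypothetical galaxy counts gathered from four ranks.
slices = write_slices([1200, 950, 0, 1310])
total_rows = slices[-1][1]  # size of the shared dataset
```

A rank with no galaxies gets an empty slice, which matters for collective I/O because every rank still has to take part in the write call.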

HPC Integration

Single Mock SLURM Job Submission

The pipeline runs on HPC environments with SLURM job scheduling. Example job script for single mock generation:

#!/bin/bash
#SBATCH --job-name=covariance_mocks
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=02:00:00
#SBATCH --output=mock_generation_%j.out

source scripts/load_env.sh

srun python scripts/generate_single_mock.py nersc /output/path

Production SLURM Array Jobs

For large-scale productions, the system uses SLURM array jobs:

#!/bin/bash
#SBATCH --job-name=production_mock_gen
#SBATCH --array=1-500
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --time=02:00:00
#SBATCH --output=production_logs/job_%A_%a.out

source scripts/load_env.sh

# Production manager handles job parameters based on array index
python scripts/run_production.py execute my_production "$SLURM_ARRAY_TASK_ID"
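
How the production manager turns an array index into job parameters is not described here. One plausible scheme, shown purely as an assumption, lays the jobs out on a grid and decodes the 1-based index. The function job_for_task, the phase names, and the second redshift are all hypothetical.

```python
import os


def job_for_task(task_id, realizations, redshifts):
    """Map a 1-based SLURM array index to a (realization, redshift) pair."""
    index = task_id - 1
    if not 0 <= index < len(realizations) * len(redshifts):
        raise ValueError(f"array index {task_id} is outside the job grid")
    realization = realizations[index // len(redshifts)]
    redshift = redshifts[index % len(redshifts)]
    return realization, redshift


# Hypothetical grid: 250 phases x 2 redshifts = 500 array tasks.
realizations = [f"ph{3000 + i}" for i in range(250)]
redshifts = ["z1.100", "z1.400"]

# SLURM sets SLURM_ARRAY_TASK_ID inside array jobs; fall back to 1 elsewhere.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
print(job_for_task(task_id, realizations, redshifts))
```

Rejecting out-of-range indices early keeps a mismatch between the --array range and the job grid from silently producing duplicate mocks.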

MPI Scaling

The pipeline scales across multiple nodes:

Slab Decomposition: Halos distributed by spatial coordinates
Independent Processing: Each rank processes its assigned halos
Collective Output: Coordinated parallel HDF5 writes
Memory Usage: Only loads data needed for each rank’s slab

Production Management

For large-scale mock generation productions (thousands of jobs), use the production management system, which follows a three-stage workflow.

Three-Stage Workflow

Stage 1: Initialize Production

# Create production structure and validate configuration
python scripts/run_production.py init alpha config/productions/alpha.yaml

This creates the production directory structure:

/productions/alpha_v1.0/
├── catalogs/           # Generated HDF5 catalogs
├── scripts/            # SLURM scripts (generated in Stage 2)
├── logs/               # Job execution logs
├── production.yaml     # Production configuration
└── production.db       # SQLite job tracking database
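
For orientation, the directory part of this layout can be reproduced with a few lines of pathlib. This is a sketch of the structure only, not the init command's implementation: create_production_tree is a hypothetical helper, and it does not write production.yaml or production.db.

```python
import tempfile
from pathlib import Path


def create_production_tree(root, name, version):
    """Create <root>/<name>_<version>/ with the documented subdirectories."""
    prod_dir = Path(root) / f"{name}_{version}"
    for sub in ("catalogs", "scripts", "logs"):
        (prod_dir / sub).mkdir(parents=True, exist_ok=True)
    return prod_dir


# Use a temporary root so the sketch is safe to run anywhere.
with tempfile.TemporaryDirectory() as root:
    prod_dir = create_production_tree(root, "alpha", "v1.0")
    created = sorted(p.name for p in prod_dir.iterdir())
```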

Stage 2: Generate SLURM Scripts (Optional)

# Generate scripts for inspection before submission
python scripts/run_production.py stage alpha

This creates SLURM job scripts in the scripts/ directory that can be reviewed before submission.

Stage 3: Submit Jobs

# Submit pre-generated scripts to SLURM
python scripts/run_production.py submit alpha

# Monitor progress
python scripts/run_production.py status alpha

# Retry failed jobs
python scripts/run_production.py retry alpha

Directory Structure

Productions use a clean directory layout:

Production naming: /productions/<name>_<version>/ (e.g., /productions/alpha_v1.0/)
No redundant prefixes: Directory names carry only the production name and version, with no extra prefix
Organized subdirectories: Separate directories for catalogs, scripts, logs, and metadata
Job tracking: SQLite database maintains job state and execution history
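
To make the job tracking concrete, the sketch below shows how status reporting and retries could be expressed against a SQLite table. The schema, column names, and status values are assumptions chosen for illustration; the actual layout of production.db may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real run would open production.db
conn.execute(
    "CREATE TABLE jobs ("
    " job_id INTEGER PRIMARY KEY,"
    " status TEXT NOT NULL DEFAULT 'pending',"
    " attempts INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany("INSERT INTO jobs (job_id) VALUES (?)", [(i,) for i in range(1, 6)])

# Record outcomes as jobs finish.
conn.execute("UPDATE jobs SET status = 'completed', attempts = 1 WHERE job_id IN (1, 2, 3)")
conn.execute("UPDATE jobs SET status = 'failed', attempts = 1 WHERE job_id = 4")

# A status report is a GROUP BY over the status column.
summary = dict(conn.execute("SELECT status, COUNT(*) FROM jobs GROUP BY status"))

# A retry resets failed jobs to pending so they can be resubmitted.
retried = conn.execute("UPDATE jobs SET status = 'pending' WHERE status = 'failed'").rowcount
conn.commit()
```

Keeping an attempts counter alongside the status makes it possible to cap retries for jobs that fail repeatedly.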

Troubleshooting

Common Issues

Environment Problems:
- Ensure source scripts/load_env.sh is run before execution
- Verify CONDA_ENV environment variable is set
- Check that all required modules are loaded

MPI Issues:
- Verify MPI implementation is available (OpenMPI/MPICH)
- Check that h5py is compiled with parallel HDF5 support
- Ensure consistent JAX configuration across ranks

Memory Issues:
- Large halo catalogs may require more memory per rank
- Consider reducing the number of MPI ranks per node
- Use test mode (n_gen parameter) for smaller datasets

File I/O Problems:
- Verify write permissions to output directory
- Check available disk space
- Ensure parallel file system supports concurrent writes
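
Several of the environment and MPI checks above can be scripted. The following diagnostic is a sketch, not part of the pipeline: check_environment is a hypothetical helper, and the use of mpi4py as the MPI binding is an assumption. The h5py.get_config().mpi attribute reports whether h5py was built against parallel HDF5.

```python
import importlib.util
import os


def check_environment():
    """Collect quick diagnostics for the environment and the parallel I/O stack."""
    report = {
        "CONDA_ENV": os.environ.get("CONDA_ENV"),  # None means unset
        "mpi4py_installed": importlib.util.find_spec("mpi4py") is not None,
        "jax_installed": importlib.util.find_spec("jax") is not None,
    }
    try:
        import h5py
        # h5py reports whether it was built against parallel (MPI) HDF5.
        report["h5py_parallel"] = bool(h5py.get_config().mpi)
    except ImportError:
        report["h5py_parallel"] = None  # h5py is not installed at all
    return report


for name, value in check_environment().items():
    print(f"{name}: {value}")
```

Running this on a compute node after sourcing scripts/load_env.sh gives a quick view of what the job itself will see.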

Performance Optimization

MPI Configuration:
- Use appropriate number of ranks per node based on memory requirements
- Consider NUMA topology for optimal performance
- Test different slab decomposition strategies

JAX Optimization:
- Enable GPU acceleration when available
- Configure JAX memory allocation settings
- Use appropriate precision settings (float32 vs float64)

I/O Configuration:
- Use parallel file systems (Lustre, GPFS)
- Configure HDF5 chunking and compression
- Consider collective I/O vs independent writes
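
The JAX memory and precision settings mentioned above are commonly controlled through environment variables read when JAX is first imported. The sketch below sets two of them from Python. The values are illustrative rather than recommendations for this pipeline, and whether mpi_setup already configures them is not stated in this guide.

```python
import os

# These must be set before `import jax` for them to take effect.
jax_env = {
    "JAX_ENABLE_X64": "True",                  # float64 instead of the float32 default
    "XLA_PYTHON_CLIENT_PREALLOCATE": "false",  # allocate GPU memory on demand
}
for name, value in jax_env.items():
    os.environ.setdefault(name, value)  # keep anything the job script already set
```

Using setdefault means values exported in the SLURM script take precedence, so every rank launched from the same script sees the same configuration.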