Estimating HPC Resources

Pau Erola, PhD

Learning Outcomes

  • Understand key HPC resource concepts
  • Know how to run representative benchmarks of your code
  • Be able to use simple command-line profiling tools
  • Convert benchmark timings into realistic resource requests

Prerequisites

  • Basic familiarity with the command line
  • A computational task you want to benchmark

Understanding HPC Resources

Key Definitions

Core-hours ≈ Wall-clock time × Number of CPU cores

  • Wall-clock time: Elapsed time (what a stopwatch measures)
  • CPU time: Active processor time (excludes I/O waits)
  • Task: One independent run (one simulation, dataset, etc.)
  • Job: What you submit to the scheduler (may contain many tasks)

Example

A program runs 2 hours using 8 cores = 16 core-hours

CPU Time vs Wall Time

  • Single core: CPU time ≈ Wall time
  • Multi-core: CPU time ≈ Wall time × cores
  • I/O-heavy: Wall time > CPU time
# Example output from /usr/bin/time -v
Elapsed (wall clock) time: 0:02:15
User time (seconds): 120.45
System time (seconds): 12.33
Maximum resident set size (kbytes): 2048000


Understanding this relationship helps optimize resource requests
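As a quick check, you can turn this output into a per-core utilization figure. A minimal Python sketch, reusing the example values above:

# Rough CPU utilization from the /usr/bin/time output above
wall_s = 2 * 60 + 15       # Elapsed (wall clock) time: 0:02:15
cpu_s = 120.45 + 12.33     # User time + System time
cores = 1                  # assuming a single-core run

utilization = cpu_s / (wall_s * cores)
print(f"CPU utilization: {utilization:.0%}")  # ~98%: CPU-bound, little I/O wait

Utilization well below 100% per core usually points to I/O or memory stalls rather than compute.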

Why Measure Rather Than Guess?


❌ Common Mistakes

  • Ignoring that HPC systems are (generally) faster: “My laptop takes 1 hour, so I need 100 hours for 100 runs”
  • Missing parallelization opportunities
  • Making no allowance for scheduler overhead

✅ Best Practice

  • Benchmark on target system
  • Use representative data
  • Measure actual resource usage
  • Account for variability

Scaling Laws

Scaling Patterns

Linear O(n)

  • 2 × data → 2 × time
  • Safe to extrapolate

Quadratic O(n²)

  • 2 × data → 4 × time
  • All-pairs comparisons

[Figure: scaling patterns, linear vs. quadratic growth]

Testing Your Scaling

# Simple scaling test
./my_program --data 2percent_sample.csv   # Measure time
./my_program --data 4percent_sample.csv   # Should take ~2x if linear
./my_program --data 8percent_sample.csv   # Should take ~4x if linear


If the 8% sample takes 10× longer than the 2% sample, you don’t have linear scaling (linear would be ~4×)!
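To put a number on it, you can estimate the scaling exponent k in time ∝ size^k from two runs. A minimal Python sketch with hypothetical timings:

# Estimate the scaling exponent k, where time ~ size^k
# (k ~ 1 means linear, k ~ 2 quadratic; timings here are hypothetical)
from math import log

size1, t1 = 2, 10.0    # 2% sample: 10 s
size2, t2 = 8, 100.0   # 8% sample: 100 s (the 10x case above)

k = log(t2 / t1) / log(size2 / size1)
print(f"Estimated scaling exponent: {k:.2f}")  # ~1.66: worse than linear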

Parallel Strong vs Weak Scaling

  • Strong scaling: keep the problem size fixed and add cores; ideally the same job finishes proportionally faster
  • Weak scaling: grow the problem size with the core count; ideally the wall time stays constant

Amdahl’s Law
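As a reminder, the standard statement of Amdahl’s law: if a fraction p of the runtime parallelizes perfectly, the speedup on N cores is

Speedup(N) = 1 / ((1 − p) + p / N)

so even with p = 0.95 the speedup can never exceed 1 / (1 − p) = 20×. This ceiling is why parallel efficiency falls off in the results below.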

Parallel Scaling Test

#!/bin/bash
dataset="my_data.csv"

for cores in 1 2 4 8 16; do
    echo "Testing with $cores cores..."
    /usr/bin/time -f "cores=$cores wall=%E cpu=%U+%S" \
        my_parallel_program --input "$dataset" --threads "$cores" \
        2>> scaling_results.log
done

Example Results

  • 1 core: 120s
  • 2 cores: 65s (92% efficiency)
  • 4 cores: 35s (86% efficiency)
  • 8 cores: 25s (60% efficiency)
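Efficiency here is speedup divided by core count. A short Python sketch reproducing these figures:

# Parallel efficiency = (t_1 / t_N) / N, using the timings above
t1 = 120.0                             # 1-core wall time (s)
timings = {2: 65.0, 4: 35.0, 8: 25.0}  # cores -> wall time (s)

for cores, t in timings.items():
    speedup = t1 / t
    print(f"{cores} cores: speedup {speedup:.2f}x, efficiency {speedup / cores:.0%}")

When efficiency drops sharply (here at 8 cores), extra cores mostly burn core-hours for little time saved.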

Practical Benchmarking

The Benchmarking Workflow

  1. Create representative test cases
  2. Test scaling (e.g. 2%, 4%, 8% data sizes)
  3. Extrapolate carefully with safety factors

Tip

  • Document environment (software versions, modules)

  • Run end-to-end tests (include all steps)

Essential Measurement Tools

GNU time

# Basic timing with detailed output
/usr/bin/time -v my_program input.dat

# Custom format for logging
/usr/bin/time -f 'wall=%E user=%U sys=%S maxRSS=%M' my_program input.dat


Key Outputs

  • Elapsed: Wall-clock time
  • User/System time: CPU time breakdown
  • Maximum resident set size: Peak memory (KB)

Example: Benchmark Script

Benchmarking my code with different data sizes

#!/bin/bash

for size in 2 4 8; do
    echo "Testing ${size}% data size..."
    # Note: double quotes, so ${size} expands inside the format string
    /usr/bin/time -f "size=${size}% wall=%E user=%U sys=%S maxRSS=%M" \
        my_simulation --input data_${size}pct.csv \
        --output results_${size}.dat \
        2>> benchmark_results.log
done

Allowing for sources of variability…

Particularly if your datasets are small, you may need to run the benchmark several times. For example, add an extra loop nested inside the size loop (and a run=$run field to the time format string):

    for run in {1..3}; do

Example: Image Processing

Scenario: Image processing pipeline

  1. Load images from disk
  2. Run object detection
  3. Save results

Goal: Process 10,000 images, estimate resources needed

Step 1: Create Test Datasets

# Create subsets of your full dataset
mkdir test_data_5pct test_data_10pct test_data_15pct

# Randomly sample (adjust numbers for your case)
shuf -n  500 full_image_list.txt >  test_data_5pct/image_list.txt
shuf -n 1000 full_image_list.txt > test_data_10pct/image_list.txt
shuf -n 1500 full_image_list.txt > test_data_15pct/image_list.txt

Tip

Use random sampling to ensure representative subsets

Step 2: Run Benchmarks

#!/bin/bash
module load languages/python/3.7.12 

for size in 5 10 15; do
    echo "=== Testing ${size}% dataset ==="
    for run in {1..3}; do
        echo "Run $run..."
        /usr/bin/time -f "${size}pct run$run: wall=%E user=%U sys=%S maxRSS_KB=%M" \
            python image_pipeline.py \
            --input test_data_${size}pct/ \
            --output results_${size}pct_run${run}/ \
            2>> timing_results.log
    done
done

Step 3: Analyze Results

# Extract timing data
grep "wall=" timing_results.log

# Results:
# 5pct run1: wall=0:02:15 user=2:01.45 sys=0:12.33 maxRSS_KB=2048000
# 5pct run2: wall=0:02:18 user=2:03.12 sys=0:13.01 maxRSS_KB=2051000
# ...
# 10pct run1: wall=0:04:45 user=4:15.22 sys=0:28.11 maxRSS_KB=2055000

Analysis:
5% data: ~135 seconds
10% data: ~285 seconds
15% data: ~430 seconds

Roughly linear scaling confirmed!
100% estimate: ~2,850 seconds ≈ 47.5 minutes

If it is not linear…

If the relationship is not linear, you can add a trendline to your chart in Excel and display the fitted equation.

[Figure: benchmark timings with a quadratic trendline]

If it is not linear…

A polynomial of order 2 or 3 is generally a good option for capturing non-linear trends.
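If you prefer to stay on the command line, NumPy can do the same fit. A minimal sketch with hypothetical non-linear timings:

# Fit an order-2 polynomial trendline and forecast the full-size run
# (sizes and timings here are hypothetical; substitute your measurements)
import numpy as np

sizes = np.array([5, 10, 15])      # % of full dataset
times = np.array([140, 320, 610])  # wall time (s)

coeffs = np.polyfit(sizes, times, deg=2)  # order-2 trendline, as in Excel
trend = np.poly1d(coeffs)
print(f"Forecast at 100% of the data: {trend(100):.0f} s")

Be cautious extrapolating far beyond the measured range: a higher-order polynomial can fit the test points well and still forecast poorly.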

[Figure: forecasting the runtime with all the data]

Benchmarking on HPC Systems

Interactive Benchmarking

Important

Always benchmark on compute nodes, not login nodes!

# Request an interactive session
salloc -p short -N 1 -c 4 --mem=8G --time=00:30:00

# Once allocated, run your benchmark
srun --cpu-bind=cores /usr/bin/time -v my_program test_data.csv

Batch Job Benchmarking

#!/bin/bash
#SBATCH --account=ABC012345
#SBATCH --partition=test
#SBATCH --job-name=benchmark
#SBATCH --output=benchmark_out_%j.txt
#SBATCH --error=benchmark_err_%j.txt
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=00:30:00

module load languages/python/3.7.12

export OMP_NUM_THREADS=8  # Match requested CPUs

echo "Starting benchmark at $(date)"
echo "Running on node: $(hostname)"

/usr/bin/time -v srun --cpu-bind=cores \
    python my_program.py --input test_10pct.csv --threads 8

Analyzing Completed Jobs

# Get job statistics
sacct -j JOBID --format=JobID,Elapsed,TotalCPU,MaxRSS,ReqCPUS,State -P

# Example output:
# JobID|Elapsed|TotalCPU|MaxRSS|ReqCPUS|State
# 12345|00:15:30|01:58:45|2048000K|8|COMPLETED

Resource Analysis:
Wall time: 15.5 minutes
CPU time: 118.75 minutes (≈96% utilization of 8 cores)
Peak memory: ~2 GB
Core-hours: (15.5 / 60) hours × 8 cores ≈ 2.07 core-hours
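The same conversion as a tiny Python sketch, with values copied from the sacct output above:

# Convert a finished job's wall time into core-hours
wall_minutes = 15.5   # Elapsed, from sacct
cores = 8             # ReqCPUS

core_hours = wall_minutes / 60 * cores
print(f"Core-hours consumed: {core_hours:.2f}")  # ~2.07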

Estimating GPU Time Requirements

The same benchmarking principles used for CPU-based tasks apply generally to GPU workloads. However:

  • GPUs often have longer warm-up times
  • Batch size, data transfer overhead, and GPU memory usage can significantly affect runtime
  • Run benchmarks with sufficiently large datasets to reflect realistic performance

Note: on Isambard-AI, using 1 GPU for 1 hour is counted as 0.25 hours of resource use.
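To illustrate handling warm-up, here is a minimal GPU timing sketch. It assumes PyTorch and a CUDA device, neither of which this course otherwise uses:

# Time a GPU workload with explicit warm-up (assumes PyTorch + CUDA)
import time
import torch

device = torch.device("cuda")
x = torch.randn(4096, 4096, device=device)

for _ in range(3):                 # warm-up: first calls pay one-off setup costs
    _ = x @ x
torch.cuda.synchronize()           # drain queued GPU work before timing

start = time.perf_counter()
for _ in range(10):
    _ = x @ x
torch.cuda.synchronize()           # make sure all timed work has finished
elapsed = time.perf_counter() - start

print(f"Mean per-iteration time: {elapsed / 10:.4f} s")
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")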

Converting to Resource Requests

From Benchmarks to Core-Hours

  1. Pick scaling model from test results
  2. Extrapolate to full problem size
  3. Choose optimal parallel configuration
  4. Calculate core-hours per run
  5. Multiply by number of runs
  6. Add safety factors

Worked Example: Climate Study

Scenario

  • Program: climate_sim
  • Test data: 20% of full grid (1 year sim)
  • Hardware: 8 cores, 75% parallel efficiency
  • Measured time: 45 minutes wall time
  • Scaling: Linear with grid size
  • Parameter sets: 50 different configurations

Resource Calculation

Note

Step-by-step calculation:

Full problem: 5× larger (20% → 100%)
Time per run: 45 min × 5 = 225 min = 3.75 hours
Core-hours per run: 3.75 × 8 = 30 core-hours

Base cost: 30 × 50 runs = 1,500 core-hours

Safety factors:

  • 1.4× (failures)
  • 2.0× (development)
  • 1.2× (scaling error)

Total request: 1,500 × (1.4 × 2.0 × 1.2) = 1,500 × 3.36 ≈ 5,040 core-hours
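The whole calculation as a short Python sketch, so you can swap in your own numbers:

# Reproduce the climate-study estimate above
test_wall_h = 45 / 60  # measured wall time on the 20% grid (hours)
scale = 5              # 20% -> 100%, assuming linear scaling
cores = 8
runs = 50              # parameter sets

core_hours_per_run = test_wall_h * scale * cores  # 30 core-hours
base = core_hours_per_run * runs                  # 1,500 core-hours

safety = 1.4 * 2.0 * 1.2  # failures x development x scaling error
print(f"Total request: {base * safety:,.0f} core-hours")  # ~5,040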

Questions?


For queries related to these training materials


For other HPC queries


Happy computing! 🚀