Lattice QCD Benchmarks
This is a set of lattice QCD benchmarks carried out on a variety of HPC clusters, including
- QPACE 4 (Fujitsu A64FX CPU cluster) at University of Regensburg (UR), Germany
- Fritz (Intel Ice Lake CPU cluster) at Regional Computer Center Erlangen (RRZE), Germany
- JUWELS Booster (NVIDIA A100 GPU cluster) at Jülich Supercomputing Centre (JSC), Germany
- Fugaku (Fujitsu A64FX CPU supercomputer with Tofu interconnect) at RIKEN Center for Computational Science (R-CCS), Japan
For support with running these codes, please contact simulations@punch4nfdi.de.
Bridge++
Bridge++ is a general-purpose code set for lattice QCD simulations that aims at readable, extensible, and portable code while maintaining practically high performance.
Benchmark results on the Fugaku supercomputer are published in arXiv:2303.05883. Here a node refers to a single Fujitsu A64FX CPU on Fugaku; each node runs 4 MPI processes in parallel.
Wilson Dirac
Benchmark of the hopping term of the Wilson-Dirac operator applied to a fermion field. We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP).
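For orientation, the hopping term $H$ has the standard Wilson form (textbook notation and normalization, not a transcription of the Bridge++ source):

$$
H(x,y) = \sum_{\mu=1}^{4} \left[ (1-\gamma_\mu)\, U_\mu(x)\, \delta_{x+\hat\mu,\,y} + (1+\gamma_\mu)\, U_\mu^\dagger(x-\hat\mu)\, \delta_{x-\hat\mu,\,y} \right],
$$

with gauge links $U_\mu$ and Dirac matrices $\gamma_\mu$; the full Wilson-Dirac operator is $D = 1 - \kappa H$ with hopping parameter $\kappa$.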
Results:
Domain wall
Benchmark of the domain-wall Dirac operator ($D^\dagger D$) applied to a fermion field. We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP).
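Here the benchmark applies the normal operator used in CG-type solvers to a five-dimensional fermion field (notation ours):

$$
\chi(x,s) = \big(D^\dagger D\, \psi\big)(x,s), \qquad s = 1,\dots,L_s,
$$

where $s$ labels the fifth dimension of extent $L_s$.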
Results:
Conjugate Gradient solver for domain-wall fermions
Benchmark of a Conjugate Gradient (CG) solver for domain-wall fermions. We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP). The lattice extension in the 5-direction is 8 for each benchmark.
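For reference, the iteration being timed is standard CG applied to the normal operator $D^\dagger D$. The following is a minimal, self-contained C++ sketch of the algorithm against a generic matrix-free operator; the `Field`/`LinOp` types and helper names are ours for illustration, not the Bridge++ API:

```cpp
#include <cstdio>
#include <functional>
#include <vector>

using Field = std::vector<double>;
using LinOp = std::function<void(const Field&, Field&)>;  // out = A * in

static double dot(const Field& x, const Field& y) {
  double s = 0.0;
  for (size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
  return s;
}

static void axpy(double a, const Field& x, Field& y) {  // y += a*x
  for (size_t i = 0; i < y.size(); ++i) y[i] += a * x[i];
}

// CG for A x = b, with A symmetric positive definite
// (for domain-wall fermions, A would be D^dagger D applied matrix-free).
int cg(const LinOp& A, const Field& b, Field& x, double tol, int maxit) {
  Field r = b, Ap(b.size());
  A(x, Ap);
  axpy(-1.0, Ap, r);               // r = b - A x
  Field p = r;
  double rr = dot(r, r);
  const double target = tol * tol * dot(b, b);
  for (int k = 0; k < maxit; ++k) {
    if (rr <= target) return k;    // converged: |r| <= tol |b|
    A(p, Ap);
    const double alpha = rr / dot(p, Ap);
    axpy(alpha, p, x);             // x += alpha p
    axpy(-alpha, Ap, r);           // r -= alpha A p
    const double rr_new = dot(r, r);
    const double beta = rr_new / rr;
    for (size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
    rr = rr_new;
  }
  return maxit;
}

int main() {
  // Toy usage: diagonal SPD operator A = diag(1..n).
  const size_t n = 1000;
  LinOp A = [](const Field& in, Field& out) {
    for (size_t i = 0; i < in.size(); ++i) out[i] = double(i + 1) * in[i];
  };
  Field b(n, 1.0), x(n, 0.0);
  const int iters = cg(A, b, x, 1e-10, 10000);
  std::printf("CG converged in %d iterations\n", iters);
}
```

In practice the operator application dominates the cost of each iteration, so the solver performance largely tracks the Dslash benchmarks above.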
Results:
Grid
The Grid lattice QCD framework comes with a series of tests and benchmarks, which are configured via command-line parameters; an example invocation is given after the list below. Configuration includes, e.g.,
- global lattice volume
- partitioning of the global volume amongst processing elements
- computation and communication options
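For example, a weak-scaling run of Benchmark_wilson on 8 PEs might be launched as `mpirun -np 8 ./Benchmark_wilson --grid 128.64.64.64 --mpi 2.2.2.1`, where `--grid` sets the global lattice volume and `--mpi` its partitioning amongst the PEs (exact flags and launcher depend on the Grid version and the cluster; treat this as an illustrative sketch).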
We define a processing element (PE) as follows:
- QPACE 4: one Fujitsu A64FX CPU chip (one compute node hosts one CPU chip)
- Fritz: two Intel Ice Lake CPU chips (one compute node hosts two CPU chips, arranged as a cache-coherent shared-memory system)
- JUWELS Booster: one NVIDIA A100 GPU chip (one compute node hosts four GPU chips interconnected by NVLink)
Benchmark_wilson
Benchmark of the hopping term of the Wilson-Dirac operator applied to a fermion field (Dslash). We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP).
Global lattice volumes:
Number of PEs | Weak scaling | Strong scaling |
---|---|---|
1 | 64 x 64 x 32 x 32 | - |
2 | 64 x 64 x 64 x 32 | - |
4 | 64 x 64 x 64 x 64 | - |
8 | 128 x 64 x 64 x 64 | 128 x 64 x 64 x 64 |
16 | 128 x 128 x 64 x 64 | 128 x 64 x 64 x 64 |
32 | 128 x 128 x 128 x 64 | 128 x 64 x 64 x 64 |
Results:
Benchmark_dwf
Benchmark of the performance-relevant part of the domain-wall Dirac operator applied to a fermion field. We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP). The lattice extension in the 5-direction is 16 for each benchmark.
Global lattice volumes:
Number of PEs | Weak scaling | Strong scaling |
---|---|---|
1 | 32 x 32 x 16 x 16 x 16 | - |
2 | 32 x 32 x 32 x 16 x 16 | - |
4 | 32 x 32 x 32 x 32 x 16 | - |
8 | 64 x 32 x 32 x 32 x 16 | 64 x 32 x 32 x 32 x 16 |
16 | 64 x 64 x 32 x 32 x 16 | 64 x 32 x 32 x 32 x 16 |
32 | 64 x 64 x 64 x 32 x 16 | 64 x 32 x 32 x 32 x 16 |
Results:
Test_dwf_mixedcg_prec
Solve time of the FP32/FP64 mixed-precision conjugate gradient (CG) solver for domain-wall fermions. We show the performance of weak and strong MPI scaling. The lattice extension in the 5-direction is 16 for each benchmark.
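Schematically, a mixed-precision CG wraps an inner FP32 solve in an outer FP64 defect-correction loop (a generic description of the technique, not a transcription of the Grid implementation). With $A = D^\dagger D$,

$$
e_k \approx A^{-1} r_k \;\;\text{(inner FP32 CG)}, \qquad x_{k+1} = x_k + e_k, \qquad r_{k+1} = b - A\,x_{k+1} \;\;\text{(outer FP64)},
$$

iterated until the FP64 residual reaches the target tolerance.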
Global lattice volumes:
Number of PEs | Weak scaling | Strong scaling |
---|---|---|
1 | 32 x 32 x 16 x 16 x 16 | - |
2 | 32 x 32 x 32 x 16 x 16 | - |
4 | 32 x 32 x 32 x 32 x 16 | - |
8 | 64 x 32 x 32 x 32 x 16 | 64 x 32 x 32 x 32 x 16 |
16 | 64 x 64 x 32 x 32 x 16 | 64 x 32 x 32 x 32 x 16 |
32 | 64 x 64 x 64 x 32 x 16 | 64 x 32 x 32 x 32 x 16 |
Results:
- Test_dwf_mixedcg_prec on QPACE 4
- Test_dwf_mixedcg_prec on Fritz
- Test_dwf_mixedcg_prec on JUWELS Booster
SIMULATeQCD
SIMULATeQCD targets lattice QCD calculations on multiple GPUs. It currently supports quenched and dynamical staggered quarks. Below we benchmark the Highly Improved Staggered Quarks (HISQ) Dslash operator.
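Schematically, the HISQ Dslash combines one-link hops on smeared links $V_\mu$ with three-link "Naik" hops $N_\mu$ (staggered phases $\eta_\mu$; improvement coefficients and normalization omitted):

$$
(D\chi)(x) = \sum_{\mu=1}^{4} \eta_\mu(x) \Big[ V_\mu(x)\,\chi(x{+}\hat\mu) - V_\mu^\dagger(x{-}\hat\mu)\,\chi(x{-}\hat\mu) + N_\mu(x)\,\chi(x{+}3\hat\mu) - N_\mu^\dagger(x{-}3\hat\mu)\,\chi(x{-}3\hat\mu) \Big].
$$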
We define a processing element (PE) as follows:
- Perlmutter: one NVIDIA A100 GPU chip (one compute node hosts four GPU chips)
Global lattice volumes:
Number of PEs | Weak scaling | Strong scaling |
---|---|---|
1 | 32 x 32 x 32 x 32 | - |
4 | 64 x 64 x 32 x 32 | 96 x 96 x 96 x 96 |
8 | 64 x 64 x 64 x 32 | 96 x 96 x 96 x 96 |
16 | 64 x 64 x 64 x 64 | 96 x 96 x 96 x 96 |
32 | 128 x 64 x 64 x 64 | 96 x 96 x 96 x 96 |
64 | 128 x 128 x 64 x 64 | 96 x 96 x 96 x 96 |
128 | 128 x 128 x 128 x 64 | 96 x 96 x 96 x 96 |
256 | 128 x 128 x 128 x 128 | 96 x 96 x 96 x 96 |
Results: