Lattice QCD Benchmarks
This is a set of lattice QCD benchmarks carried out on a variety of HPC clusters, including
- QPACE 4 (Fujitsu A64FX CPU cluster) at University of Regensburg (UR), Germany
- Fritz (Intel Ice Lake CPU cluster) at Regional Computer Center Erlangen (RRZE), Germany
- JUWELS Booster (NVIDIA A100 GPU cluster) at Jülich Supercomputing Centre (JSC), Germany
- Fugaku (Fujitsu A64FX CPU supercomputer with Tofu interconnect) at RIKEN Center for Computational Science (R-CCS), Japan
For support with running these codes, please contact simulations@punch4nfdi.de.
Bridge++
Bridge++ is a general-purpose code set for lattice QCD simulations that aims at readable, extensible, and portable code while maintaining practically high performance.
Benchmark results on the Fugaku supercomputer are published in arXiv:2303.05883. Here a node refers to a single Fujitsu A64FX CPU on Fugaku; each node runs 4 MPI processes in parallel.
Wilson Dirac
Benchmark of the hopping term of the Wilson-Dirac operator applied to a fermion field. We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP).
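For orientation, the hopping term $H$ has the standard Wilson form (textbook notation and normalization, not a transcription of the Bridge++ source):

$$
H(x,y) = \sum_{\mu=1}^{4} \left[ (1-\gamma_\mu)\, U_\mu(x)\, \delta_{x+\hat\mu,\,y} + (1+\gamma_\mu)\, U_\mu^\dagger(x-\hat\mu)\, \delta_{x-\hat\mu,\,y} \right],
$$

with gauge links $U_\mu$ and Dirac matrices $\gamma_\mu$; the full Wilson-Dirac operator is $D = 1 - \kappa H$ with hopping parameter $\kappa$.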
Results:
Domain wall
Benchmark of the domain-wall Dirac operator ($D^\dagger D$) applied to a fermion field. We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP).
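Here the benchmark applies the normal operator used in CG-type solvers to a five-dimensional fermion field (notation ours):

$$
\chi(x,s) = \big(D^\dagger D\, \psi\big)(x,s), \qquad s = 1,\dots,L_s,
$$

where $s$ labels the fifth dimension of extent $L_s$.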
Results:
Conjugate Gradient solver for domain-wall fermions
Benchmark of a Conjugate Gradient (CG) solver for domain-wall fermions. We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP). The lattice extension in the 5-direction is 8 for each benchmark.
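For reference, the iteration being timed is standard CG applied to the normal operator $D^\dagger D$. The following is a minimal, self-contained C++ sketch of the algorithm against a generic matrix-free operator; the `Field`/`LinOp` types and helper names are ours for illustration, not the Bridge++ API:

```cpp
#include <cstdio>
#include <functional>
#include <vector>

using Field = std::vector<double>;
using LinOp = std::function<void(const Field&, Field&)>;  // out = A * in

static double dot(const Field& x, const Field& y) {
  double s = 0.0;
  for (size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
  return s;
}

static void axpy(double a, const Field& x, Field& y) {  // y += a*x
  for (size_t i = 0; i < y.size(); ++i) y[i] += a * x[i];
}

// CG for A x = b, with A symmetric positive definite
// (for domain-wall fermions, A would be D^dagger D applied matrix-free).
int cg(const LinOp& A, const Field& b, Field& x, double tol, int maxit) {
  Field r = b, Ap(b.size());
  A(x, Ap);
  axpy(-1.0, Ap, r);               // r = b - A x
  Field p = r;
  double rr = dot(r, r);
  const double target = tol * tol * dot(b, b);
  for (int k = 0; k < maxit; ++k) {
    if (rr <= target) return k;    // converged: |r| <= tol |b|
    A(p, Ap);
    const double alpha = rr / dot(p, Ap);
    axpy(alpha, p, x);             // x += alpha p
    axpy(-alpha, Ap, r);           // r -= alpha A p
    const double rr_new = dot(r, r);
    const double beta = rr_new / rr;
    for (size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
    rr = rr_new;
  }
  return maxit;
}

int main() {
  // Toy usage: diagonal SPD operator A = diag(1..n).
  const size_t n = 1000;
  LinOp A = [](const Field& in, Field& out) {
    for (size_t i = 0; i < in.size(); ++i) out[i] = double(i + 1) * in[i];
  };
  Field b(n, 1.0), x(n, 0.0);
  const int iters = cg(A, b, x, 1e-10, 10000);
  std::printf("CG converged in %d iterations\n", iters);
}
```

In practice the operator application dominates the cost of each iteration, so the solver performance largely tracks the Dslash benchmarks above.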
Results:
Grid
The Grid lattice QCD framework comes with a series of tests and benchmarks, which are configured via command-line parameters; an example invocation is given after the list below. Configuration includes, e.g.,
- global lattice volume
- partitioning of the global volume amongst processing elements
- computation and communication options
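For example, a weak-scaling run of Benchmark_wilson on 8 PEs might be launched as `mpirun -np 8 ./Benchmark_wilson --grid 128.64.64.64 --mpi 2.2.2.1`, where `--grid` sets the global lattice volume and `--mpi` its partitioning amongst the PEs (exact flags and launcher depend on the Grid version and the cluster; treat this as an illustrative sketch).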
We define a processing element (PE) as follows:
- QPACE 4: one Fujitsu A64FX CPU chip (one compute node hosts one CPU chip)
- Fritz: two Intel Ice Lake CPU chips (one compute node hosts two CPU chips, arranged as a cache-coherent shared-memory system)
- JUWELS Booster: one NVIDIA A100 GPU chip (one compute node hosts four GPU chips interconnected by NVLink)
Benchmark_wilson
Benchmark of the hopping term of the Wilson-Dirac operator applied to a fermion field (Dslash). We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP).
Global lattice volumes:
Number of PEs | Weak scaling | Strong scaling |
---|---|---|
1 | 64 x 64 x 32 x 32 | - |
2 | 64 x 64 x 64 x 32 | - |
4 | 64 x 64 x 64 x 64 | - |
8 | 128 x 64 x 64 x 64 | 128 x 64 x 64 x 64 |
16 | 128 x 128 x 64 x 64 | 128 x 64 x 64 x 64 |
32 | 128 x 128 x 128 x 64 | 128 x 64 x 64 x 64 |
Results:
Benchmark_dwf
Benchmark of the performance-relevant part of the domain-wall Dirac operator applied to a fermion field. We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP). The lattice extension in the 5-direction is 16 for each benchmark.
Global lattice volumes:
Number of PEs | Weak scaling | Strong scaling |
---|---|---|
1 | 32 x 32 x 16 x 16 x 16 | - |
2 | 32 x 32 x 32 x 16 x 16 | - |
4 | 32 x 32 x 32 x 32 x 16 | - |
8 | 64 x 32 x 32 x 32 x 16 | 64 x 32 x 32 x 32 x 16 |
16 | 64 x 64 x 32 x 32 x 16 | 64 x 32 x 32 x 32 x 16 |
32 | 64 x 64 x 64 x 32 x 16 | 64 x 32 x 32 x 32 x 16 |
Results:
Test_dwf_mixedcg_prec
Solve time of the FP32/FP64 mixed-precision conjugate gradient (CG) solver for domain-wall fermions. We show the performance of weak and strong MPI scaling. The lattice extension in the 5-direction is 16 for each benchmark.
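Schematically, a mixed-precision CG wraps an inner FP32 solve in an outer FP64 defect-correction loop (a generic description of the technique, not a transcription of the Grid implementation). With $A = D^\dagger D$,

$$
e_k \approx A^{-1} r_k \;\;\text{(inner FP32 CG)}, \qquad x_{k+1} = x_k + e_k, \qquad r_{k+1} = b - A\,x_{k+1} \;\;\text{(outer FP64)},
$$

iterated until the FP64 residual reaches the target tolerance.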
Global lattice volumes:
Number of PEs | Weak scaling | Strong scaling |
---|---|---|
1 | 32 x 32 x 16 x 16 x 16 | - |
2 | 32 x 32 x 32 x 16 x 16 | - |
4 | 32 x 32 x 32 x 32 x 16 | - |
8 | 64 x 32 x 32 x 32 x 16 | 64 x 32 x 32 x 32 x 16 |
16 | 64 x 64 x 32 x 32 x 16 | 64 x 32 x 32 x 32 x 16 |
32 | 64 x 64 x 64 x 32 x 16 | 64 x 32 x 32 x 32 x 16 |
Results:
- Test_dwf_mixedcg_prec on QPACE 4
- Test_dwf_mixedcg_prec on Fritz
- Test_dwf_mixedcg_prec on JUWELS Booster
SIMULATeQCD
SIMULATeQCD targets lattice QCD calculations on multiple GPUs. It currently supports quenched and dynamical staggered quarks. Below we benchmark the Highly Improved Staggered Quarks (HISQ) Dslash operator.
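Schematically, the HISQ Dslash combines one-link hops on smeared links $V_\mu$ with three-link "Naik" hops $N_\mu$ (staggered phases $\eta_\mu$; improvement coefficients and normalization omitted):

$$
(D\chi)(x) = \sum_{\mu=1}^{4} \eta_\mu(x) \Big[ V_\mu(x)\,\chi(x{+}\hat\mu) - V_\mu^\dagger(x{-}\hat\mu)\,\chi(x{-}\hat\mu) + N_\mu(x)\,\chi(x{+}3\hat\mu) - N_\mu^\dagger(x{-}3\hat\mu)\,\chi(x{-}3\hat\mu) \Big].
$$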
We define a processing element (PE) as follows:
- Perlmutter: one NVIDIA A100 GPU chip (one compute node hosts four GPU chips)
Global lattice volumes:
Number of PEs | Weak scaling | Strong scaling |
---|---|---|
1 | 32 x 32 x 32 x 32 | - |
4 | 64 x 64 x 32 x 32 | 96 x 96 x 96 x 96 |
8 | 64 x 64 x 64 x 32 | 96 x 96 x 96 x 96 |
16 | 64 x 64 x 64 x 64 | 96 x 96 x 96 x 96 |
32 | 128 x 64 x 64 x 64 | 96 x 96 x 96 x 96 |
64 | 128 x 128 x 64 x 64 | 96 x 96 x 96 x 96 |
128 | 128 x 128 x 128 x 64 | 96 x 96 x 96 x 96 |
256 | 128 x 128 x 128 x 128 | 96 x 96 x 96 x 96 |
Results: