
Lattice QCD Benchmarks

This is a set of lattice QCD benchmarks carried out on a variety of HPC clusters.

For support with running these codes, please contact simulations@punch4nfdi.de.

Bridge++

Bridge++ is a general-purpose code set for lattice QCD simulations aiming at a readable, extensible, and portable code while keeping practical high performance.

Benchmark results on the Fugaku supercomputer are published in arXiv:2303.05883. Here a node refers to a single Fujitsu A64FX CPU; each node runs 4 MPI processes in parallel.

Wilson Dirac

Benchmark of the hopping term of the Wilson-Dirac operator applied to a fermion field. We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP).
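
For reference, the hopping term being timed can be written (in one common convention; Bridge++'s normalization may differ) as

$$
(D_{\mathrm{hop}}\,\psi)(x) \;=\; -\kappa \sum_{\mu=1}^{4} \Big[ (1-\gamma_\mu)\, U_\mu(x)\, \psi(x+\hat\mu) \;+\; (1+\gamma_\mu)\, U_\mu^\dagger(x-\hat\mu)\, \psi(x-\hat\mu) \Big],
$$

where $\kappa$ is the hopping parameter, $U_\mu(x)$ are the gauge links, and $\gamma_\mu$ are the Dirac matrices.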

Results:

Domain wall

Benchmark of the domain-wall Dirac operator (D†D) applied to a fermion field. We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP).
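
Schematically, in the Shamir formulation (conventions vary between codes), the five-dimensional domain-wall operator has the structure

$$
D_{\mathrm{dwf}}(s,s') \;=\; \big( D_W(-M_5) + 1 \big)\,\delta_{s s'} \;-\; P_-\,\delta_{s',s+1} \;-\; P_+\,\delta_{s',s-1},
\qquad P_\pm = \tfrac{1}{2}\,(1 \pm \gamma_5),
$$

where $D_W(-M_5)$ is the four-dimensional Wilson operator with negative mass $-M_5$, and the quark mass $m_f$ enters through the terms coupling the boundaries $s = 0$ and $s = L_s - 1$. The benchmark applies the normal operator $D^\dagger D$, which is the combination a CG solver inverts.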

Results:

Conjugate Gradient solver for domain-wall fermions

Benchmark of a Conjugate Gradient (CG) solver for domain-wall fermions. We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP). The lattice extension in the 5-direction is 8 for each benchmark.
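
As a point of reference, here is a minimal sketch of the CG algorithm being timed, written for a generic symmetric positive-definite operator standing in for D†D; the field and operator types are illustrative stand-ins, not Bridge++ API:

```cpp
// Minimal CG sketch: solve A x = b for a symmetric positive-definite A
// (here A plays the role of D^dagger D for domain-wall fermions).
#include <cstddef>
#include <functional>
#include <vector>

using Field    = std::vector<double>;                     // stand-in fermion field
using Operator = std::function<void(const Field&, Field&)>;  // y = A x

static double dot(const Field& a, const Field& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Returns the number of iterations used to reach relative residual `tol`.
int cg(const Operator& A, const Field& b, Field& x, double tol, int maxiter) {
    const std::size_t n = b.size();
    Field r(n), p(n), Ap(n);
    A(x, Ap);
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];  // initial residual
    p = r;
    double rr = dot(r, r);
    const double target = tol * tol * dot(b, b);
    for (int k = 0; k < maxiter; ++k) {
        if (rr <= target) return k;                           // converged
        A(p, Ap);
        const double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];                             // update solution
            r[i] -= alpha * Ap[i];                            // update residual
        }
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return maxiter;
}
```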

Results:

Grid

The Grid lattice QCD framework comes with a series of tests and benchmarks, configured by command-line parameters. Configuration includes, e.g., the global lattice volume (--grid), the MPI rank decomposition (--mpi), and the thread count (--threads); see the sketch below.
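
As an illustration, here is a minimal sketch of how a Grid benchmark picks up this configuration (assuming Grid's public API; the real benchmarks do considerably more):

```cpp
// A typical weak-scaling invocation on 16 ranks could look like
//   mpirun -np 16 ./Benchmark_wilson --grid 128.128.64.64 --mpi 2.2.2.2 --threads 8
#include <Grid/Grid.h>

using namespace Grid;

int main(int argc, char** argv) {
    Grid_init(&argc, &argv);              // parses --grid, --mpi, --threads, ...
    Coordinate latt = GridDefaultLatt();  // global lattice volume from --grid
    Coordinate mpi  = GridDefaultMpi();   // MPI rank decomposition from --mpi
    GridCartesian grid(latt, GridDefaultSimd(Nd, vComplex::Nsimd()), mpi);
    // ... set up gauge and fermion fields on `grid` and time the Dslash ...
    Grid_finalize();
    return 0;
}
```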

We define a processing element (PE) as follows:

Benchmark_wilson

Benchmark of the hopping term of the Wilson-Dirac operator applied to a fermion field (Dslash). We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP).

Global lattice volumes:

| Number of PEs | Weak scaling         | Strong scaling     |
|---------------|----------------------|--------------------|
| 1             | 64 x 64 x 32 x 32    | -                  |
| 2             | 64 x 64 x 64 x 32    | -                  |
| 4             | 64 x 64 x 64 x 64    | -                  |
| 8             | 128 x 64 x 64 x 64   | 128 x 64 x 64 x 64 |
| 16            | 128 x 128 x 64 x 64  | 128 x 64 x 64 x 64 |
| 32            | 128 x 128 x 128 x 64 | 128 x 64 x 64 x 64 |
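
The weak-scaling column keeps the local volume per PE fixed: every row corresponds to 64 x 64 x 32 x 32 = 4,194,304 lattice sites per PE (e.g., 128 x 128 x 128 x 64 = 134,217,728 sites on 32 PEs, i.e., again 4,194,304 per PE). The strong-scaling column instead holds the global volume fixed at 128 x 64 x 64 x 64, so the per-PE volume shrinks as PEs are added. The tables for the other benchmarks below follow the same construction.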

Results:

Benchmark_dwf

Benchmark of the performance-relevant part of the domain-wall Dirac operator applied to a fermion field. We show the performance of weak and strong MPI scaling in single precision (SP) and double precision (DP). The lattice extension in the 5-direction is 16 for each benchmark.

Global lattice volumes:

| Number of PEs | Weak scaling            | Strong scaling         |
|---------------|-------------------------|------------------------|
| 1             | 32 x 32 x 16 x 16 x 16  | -                      |
| 2             | 32 x 32 x 32 x 16 x 16  | -                      |
| 4             | 32 x 32 x 32 x 32 x 16  | -                      |
| 8             | 64 x 32 x 32 x 32 x 16  | 64 x 32 x 32 x 32 x 16 |
| 16            | 64 x 64 x 32 x 32 x 16  | 64 x 32 x 32 x 32 x 16 |
| 32            | 64 x 64 x 64 x 32 x 16  | 64 x 32 x 32 x 32 x 16 |

Results:

Test_dwf_mixedcg_prec

Solve time of an FP32/FP64 mixed-precision conjugate gradient (CG) solver for domain-wall fermions. We show the performance of weak and strong MPI scaling. The lattice extension in the 5-direction is 16 for each benchmark.
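
For context, below is a minimal sketch of the defect-correction pattern that FP32/FP64 mixed-precision CG solvers are typically built on; all names and signatures are illustrative, and Grid's actual solver (a reliable-update variant) differs in detail:

```cpp
// Defect correction: keep the residual and solution updates in FP64, while the
// bulk of the CG iterations run cheaply in FP32 on the residual equation.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using FieldD    = std::vector<double>;
using FieldF    = std::vector<float>;
using OperatorD = std::function<void(const FieldD&, FieldD&)>;  // y = A x in FP64
// Approximate inner solve of A e = r in FP32 to a loose tolerance:
using InnerSolveF = std::function<void(const FieldF& r, FieldF& e)>;

static double dot64(const FieldD& a, const FieldD& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

void mixed_cg(const OperatorD& A, const InnerSolveF& inner_solve,
              const FieldD& b, FieldD& x, double tol, int max_outer) {
    const std::size_t n = b.size();
    FieldD r(n), Ax(n);
    FieldF r32(n), e32(n);
    const double bnorm = std::sqrt(dot64(b, b));
    for (int outer = 0; outer < max_outer; ++outer) {
        A(x, Ax);                                              // true FP64 residual
        for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ax[i];
        if (std::sqrt(dot64(r, r)) <= tol * bnorm) return;     // FP64 convergence test
        for (std::size_t i = 0; i < n; ++i) r32[i] = static_cast<float>(r[i]);
        std::fill(e32.begin(), e32.end(), 0.0f);
        inner_solve(r32, e32);                                 // cheap FP32 iterations
        for (std::size_t i = 0; i < n; ++i)
            x[i] += static_cast<double>(e32[i]);               // FP64 correction
    }
}
```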

Global lattice volumes:

| Number of PEs | Weak scaling            | Strong scaling         |
|---------------|-------------------------|------------------------|
| 1             | 32 x 32 x 16 x 16 x 16  | -                      |
| 2             | 32 x 32 x 32 x 16 x 16  | -                      |
| 4             | 32 x 32 x 32 x 32 x 16  | -                      |
| 8             | 64 x 32 x 32 x 32 x 16  | 64 x 32 x 32 x 32 x 16 |
| 16            | 64 x 64 x 32 x 32 x 16  | 64 x 32 x 32 x 32 x 16 |
| 32            | 64 x 64 x 64 x 32 x 16  | 64 x 32 x 32 x 32 x 16 |

Results:

SIMULATeQCD

SIMULATeQCD is a multi-GPU framework for lattice QCD calculations. It currently supports quenched and dynamical staggered quarks. Below we benchmark the Highly Improved Staggered Quarks (HISQ) Dslash operator.
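
Schematically, a staggered Dslash has the one-link form

$$
(D\,\chi)(x) \;=\; \tfrac{1}{2} \sum_{\mu=1}^{4} \eta_\mu(x) \Big[ V_\mu(x)\, \chi(x+\hat\mu) \;-\; V_\mu^\dagger(x-\hat\mu)\, \chi(x-\hat\mu) \Big] \;+\; \text{(Naik term)},
$$

where $\eta_\mu(x)$ are the staggered phases and, for HISQ, $V_\mu$ are smeared links; the Naik term adds analogous three-link hops with a separate smeared link field. SIMULATeQCD's normalization may differ from this sketch.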

We define a processing element (PE) as follows:

Global lattice volumes:

| Number of PEs | Weak scaling          | Strong scaling    |
|---------------|-----------------------|-------------------|
| 1             | 32 x 32 x 32 x 32     | -                 |
| 4             | 64 x 64 x 32 x 32     | 96 x 96 x 96 x 96 |
| 8             | 64 x 64 x 64 x 32     | 96 x 96 x 96 x 96 |
| 16            | 64 x 64 x 64 x 64     | 96 x 96 x 96 x 96 |
| 32            | 128 x 64 x 64 x 64    | 96 x 96 x 96 x 96 |
| 64            | 128 x 128 x 64 x 64   | 96 x 96 x 96 x 96 |
| 128           | 128 x 128 x 128 x 64  | 96 x 96 x 96 x 96 |
| 256           | 128 x 128 x 128 x 128 | 96 x 96 x 96 x 96 |

Results: