==================================================================
=== ===
=== GENESIS Distributed Memory Benchmarks ===
=== ===
=== QCD2 ===
=== ===
=== Conjugate Gradient iteration in SU(3) lattice gauge ===
=== theory with Kogut-Susskind fermions ===
=== ===
=== Original author: John Merlin ===
=== Modified by : Ivan Wolton ===
=== PARMACS macros : Vladimir Getov ===
=== Department of Electronics and Computer Science ===
=== University of Southampton ===
=== Southampton SO9 5NH, U.K. ===
=== fax.:+44-703-593045 e-mail:icw@uk.ac.soton.ecs ===
=== vsg@uk.ac.soton.ecs ===
=== ===
=== Copyright: SNARC, University of Southampton ===
=== ===
=== Last update: June 1993; Release: 2.2 ===
=== ===
==================================================================
1. Description
--------------
This benchmark consists of solving a large, sparse system of linear
equations using conjugate gradient iteration. The equations are derived
from a lattice gauge theory simulation using dynamical Kogut-Susskind
fermions. Conjugate gradient methods form the core of several important
algorithms for lattice gauge theory with fermions. Supercomputer
performance is essential for such problems because the inclusion of dynamical
fermions increases the computational effort required by several orders
of magnitude over the 'quenched' approximation. (The quenched
approximation is used in the QCD1 benchmark.)
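As a rough illustration of the numerical core, a generic conjugate gradient
iteration for a real symmetric positive-definite system A x = b can be
sketched in C as below. The routine and parameter names are placeholders,
and the benchmark's own solver, which works on the complex fermion matrix,
differs in detail.

/* Generic conjugate gradient sketch for A x = b, A symmetric positive
 * definite.  apply_a is a caller-supplied matrix-vector product standing in
 * for the sparse fermion-matrix multiplication; all names are illustrative. */
#include <stdlib.h>
#include <math.h>

int cg_solve(void (*apply_a)(const double *x, double *ax, int n),
             const double *b, double *x, int n, double tol, int maxit)
{
    double *r  = malloc(n * sizeof *r);    /* residual r = b - A x */
    double *p  = malloc(n * sizeof *p);    /* search direction     */
    double *ap = malloc(n * sizeof *ap);   /* work vector A p      */
    double rr = 0.0, rr_new, alpha, beta, pap;
    int i, it;

    apply_a(x, ap, n);
    for (i = 0; i < n; i++) {
        r[i] = b[i] - ap[i];
        p[i] = r[i];
        rr  += r[i] * r[i];
    }
    for (it = 0; it < maxit && sqrt(rr) > tol; it++) {
        apply_a(p, ap, n);
        pap = 0.0;
        for (i = 0; i < n; i++) pap += p[i] * ap[i];
        alpha = rr / pap;
        rr_new = 0.0;
        for (i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * ap[i];
            rr_new += r[i] * r[i];
        }
        beta = rr_new / rr;
        rr   = rr_new;
        for (i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
    }
    free(r); free(p); free(ap);
    return it;                             /* iterations performed */
}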
Simulations are defined on four-dimensional lattices which are discrete
approximations to continuum space-time. The basic variables are 3 by 3
complex matrices. Four such matrices are associated with every lattice site.
The benchmark takes the common approach of updating the variables on
all even sites, and then on all odd sites, on alternate steps. Updating
a site variable requires a number of matrix multiplications and involves
matrices from neighbouring sites. Almost all the arithmetic operations
are vectorizable. However, achieving this vectorization incurs an overhead
from internal shifts of neighbouring matrices, which can become a
significant part of the execution time.
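The elementary arithmetic kernel behind these updates is the multiplication
of 3 by 3 complex matrices. Purely as an illustration (the benchmark's own
data layout and routine names differ), such a multiply looks like:

/* Multiply two 3x3 complex matrices: c = a * b.  Illustrative only. */
typedef struct { double re, im; } complex_t;

void mat3_mul(const complex_t a[3][3], const complex_t b[3][3],
              complex_t c[3][3])
{
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double re = 0.0, im = 0.0;
            for (int k = 0; k < 3; k++) {
                re += a[i][k].re * b[k][j].re - a[i][k].im * b[k][j].im;
                im += a[i][k].re * b[k][j].im + a[i][k].im * b[k][j].re;
            }
            c[i][j].re = re;
            c[i][j].im = im;
        }
}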
The parallel version of the program distributes the spatial dimensions of
the lattice over a cuboidal process grid. Communications involve both
the shifting of matrices from neighbouring processors and a global
summation followed by a broadcast.
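The benchmark implements these communications with PARMACS macros. Purely as
an analogy for readers more familiar with MPI, the two patterns correspond
roughly to a send/receive of lattice faces between neighbouring processes
and a global reduction; the sketch below is not the benchmark's code.

#include <mpi.h>

/* Shift one face of the local lattice to the neighbour in one direction
 * while receiving the matching face from the opposite neighbour. */
void shift_faces(double *sendbuf, double *recvbuf, int count,
                 int dest, int source)
{
    MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, dest,   0,
                 recvbuf, count, MPI_DOUBLE, source, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* Global summation followed by a broadcast of the result, e.g. for the
 * inner products needed by the conjugate gradient iteration. */
double global_sum(double local)
{
    double total;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;
}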
2. Operating Instructions
-------------------------
Changing problem size and number of processors:
----------------------------------------------
The problem is based on a 4-dimensional space-time lattice of size:
N = NX**3 * NT.
For the purposes of the benchmark, NX & NT are specified as integer powers
of 2, so that: NX = 2**LOGNX, NT = 2**LOGNT
In the parallel version of the program the number of processors (NP) over
which the spatial dimensions of the lattice are distributed is determined
by the parameter LOGP, which is the log to base 2 of the required number
of processors, i.e. NP = 2**LOGP.
The specified number of processors is configured as a 3D grid internally
within the program.
NP = NPX * NPY * NPZ
where NPX, NPY and NPZ are all powers of two and NPX >= NPY >= NPZ.
The local lattice size on each processor is then:
n = NT * (NX/NPX) * (NX/NPY) * (NX/NPZ)
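The following small C program shows how the global lattice size, the process
grid and the local lattice size follow from LOGNX, LOGNT and LOGP. The even
split of LOGP over the three grid dimensions is only an assumption for
illustration; the program's internal rule for choosing NPX, NPY and NPZ may
differ.

#include <stdio.h>

int main(void)
{
    int LOGNX = 3, LOGNT = 2, LOGP = 3;        /* e.g. the 4 * 8**3 case     */

    int NX = 1 << LOGNX, NT = 1 << LOGNT;
    long N = (long)NX * NX * NX * NT;          /* global lattice: NX**3 * NT */

    int NP  = 1 << LOGP;                       /* number of processors       */
    int lpx = (LOGP + 2) / 3;                  /* split LOGP so that         */
    int lpy = (LOGP - lpx + 1) / 2;            /* NPX >= NPY >= NPZ,         */
    int lpz =  LOGP - lpx - lpy;               /* all powers of two          */
    int NPX = 1 << lpx, NPY = 1 << lpy, NPZ = 1 << lpz;

    long n = (long)NT * (NX / NPX) * (NX / NPY) * (NX / NPZ);  /* per node   */

    printf("N = %ld sites, NP = %d = %d x %d x %d, local n = %ld sites\n",
           N, NP, NPX, NPY, NPZ, n);
    return 0;
}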
Suggested Problem Sizes:
------------------------
It is recommended that the benchmark be run with four standard problem
sizes: 4 * 8**3, 8 * 16**3, 16 * 32**3 and 16 * 64**3.
The input parameters and total memory requirement for array storage for
each problem size are given in the following table:
Problem Size LOGNT LOGNX Approx Memory(Mbyte)
4 * 8**3 2 3 2
8 * 16**3 3 4 32
16 * 32**3 4 5 512
16 * 64**3 4 6 4096
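The figures in the table correspond to roughly 1 Kbyte of array storage per
lattice site. This factor is read off the table rather than taken from the
source, but it allows a quick estimate for other parameter choices:

#include <stdio.h>

int main(void)
{
    int LOGNT = 3, LOGNX = 4;                          /* the 8 * 16**3 case */
    long sites   = (1L << LOGNT) * (1L << (3 * LOGNX));
    double mbyte = sites / 1024.0;                     /* ~1 Kbyte per site  */
    printf("%ld sites -> approx %.0f Mbyte\n", sites, mbyte);
    return 0;
}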
Compiling and Running the Benchmark:
------------------------------------
1) Choose problem size and number of processors, edit the include file
qcd2.inc to set the appropriate parameters.
2) To compile and link the benchmark type: `make' for the distributed
version or `make slave' for the single-node version.
3) If any of the parameters in the include files are changed,
the code has to be recompiled. The makefile will automatically
recompile only the affected files.
4) On some systems it may be necessary to allocate the appropriate
resources before running the benchmark, e.g. on the iPSC/860
to reserve a cube of 8 processors, type: getcube -t8
5) To run either the sequential or the distributed version of the benchmark,
type: qcd2
The progress of the benchmark execution can be monitored via
the standard output, whilst a permanent copy of the benchmark output
is written to a file called 'result'.
6) If the run is successful and a permanent record is required, the
file 'result' should be copied to another file before the next run
overwrites it.
3. Hints for Optimisation (Blockshift versus indirect addressing)
-----------------------------------------------------------------
Two routines are provided for the shift operation, blockshift and
shiftvec. Blockshift shifts coherent blocks corresponding to a given
lattice direction. The problem is that the block length is rather small
for the t & x directions, so with large vector start-up costs the vector
efficiency is poor. Hence in these directions it is more efficient to use
shiftvec, the indirect-addressing version of the shift routine.
The shift routines are called from the routine dvec, which by default
uses shiftvec in the t and x directions and blockshift in the y & z
directions. For best performance with smaller lattices, blockshift should
be used in the y direction.
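As an illustration of the two strategies (names and signatures are
illustrative only, not the benchmark's routines): blockshift-style code moves
whole contiguous blocks, whereas shiftvec-style code gathers through a
precomputed index table, giving one long loop that vectorizes well even when
the natural block length is short.

/* Block-oriented shift: copy contiguous blocks.  Efficient when the block
 * length is large (y and z directions on large lattices). */
void blockshift_sketch(const double *src, double *dst,
                       int nblocks, int blocklen, int stride)
{
    for (int b = 0; b < nblocks; b++)
        for (int i = 0; i < blocklen; i++)
            dst[b * blocklen + i] = src[b * stride + i];
}

/* Indirect-addressed shift: gather through an index table built once in
 * advance.  The single long loop suits short block lengths (t and x). */
void shiftvec_sketch(const double *src, double *dst,
                     const int *index, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[index[i]];
}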
4. Accuracy Check
-----------------
The output results are best characterised by the total energy per
lattice point (output in column 3). The program can be considered to
have run successfully if the following two conditions are met.
1) The total energy should be constant to 5 decimal places for each
iteration (a small variation in the 6th decimal place is allowable).
2) This constant value should be close to 3.0.
Unfortunately it is difficult to be more precise, as the fermion and gauge
fields are initialised by a random number generator; consequently the exact
value of the total energy depends on the number of processors and on the
problem size.
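A sketch of this check in C is given below; the tolerances are an
interpretation of the two conditions above rather than values fixed by the
benchmark.

#include <math.h>

/* energy[] holds the column-3 values from 'result', one per iteration. */
int check_energy(const double *energy, int niter)
{
    for (int i = 0; i < niter; i++) {
        if (fabs(energy[i] - energy[0]) > 1.0e-5)  /* constant to 5 places */
            return 0;
        if (fabs(energy[i] - 3.0) > 0.1)           /* "close to 3.0"       */
            return 0;
    }
    return 1;                                      /* run looks successful */
}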
$Id: ReadMe,v 1.2 1994/04/20 16:50:10 igl Rel igl $
Submitted by Mark Papiani,
last updated on 10 Jan 1995.