==================================================================
=== ===
=== GENESIS Distributed Memory Benchmarks ===
=== ===
=== QCD2 ===
=== ===
=== Conjugate Gradient iteration in SU(3) lattice gauge ===
=== theory with Kogut-Susskind fermions ===
=== ===
=== Original author: John Merlin ===
=== Modified by : Ivan Wolton ===
=== PARMACS macros : Vladimir Getov ===
=== Department of Electronics and Computer Science ===
=== University of Southampton ===
=== Southampton SO9 5NH, U.K. ===
=== fax.:+44-703-593045 e-mail:icw@uk.ac.soton.ecs ===
=== vsg@uk.ac.soton.ecs ===
=== ===
=== Copyright: SNARC, University of Southampton ===
=== ===
=== Last update: June 1993; Release: 2.2 ===
=== ===
==================================================================
1. Description
--------------
This benchmark consists of solving a large, sparse system of linear
equations using conjugate gradient iteration. The equations are derived
from a lattice gauge theory simulation using dynamical Kogut-Susskind
fermions. Conjugate gradient methods form the core of several important
algorithms for lattice gauge theory with fermions. Supercomputer
performance is essential for such problems because the inclusion of dynamical
fermions increases the computational effort required by several orders
of magnitude over the 'quenched' approximation. (The quenched
approximation is used in the QCD1 benchmark.)
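As a rough illustration of the numerical core, a generic conjugate gradient
iteration for a real symmetric positive-definite system A x = b can be
sketched in C as below. The routine and parameter names are placeholders,
and the benchmark's own solver, which works on the complex fermion matrix,
differs in detail.

/* Generic conjugate gradient sketch for A x = b, A symmetric positive
 * definite.  apply_a is a caller-supplied matrix-vector product standing in
 * for the sparse fermion-matrix multiplication; all names are illustrative. */
#include <stdlib.h>
#include <math.h>

int cg_solve(void (*apply_a)(const double *x, double *ax, int n),
             const double *b, double *x, int n, double tol, int maxit)
{
    double *r  = malloc(n * sizeof *r);    /* residual r = b - A x */
    double *p  = malloc(n * sizeof *p);    /* search direction     */
    double *ap = malloc(n * sizeof *ap);   /* work vector A p      */
    double rr = 0.0, rr_new, alpha, beta, pap;
    int i, it;

    apply_a(x, ap, n);
    for (i = 0; i < n; i++) {
        r[i] = b[i] - ap[i];
        p[i] = r[i];
        rr  += r[i] * r[i];
    }
    for (it = 0; it < maxit && sqrt(rr) > tol; it++) {
        apply_a(p, ap, n);
        pap = 0.0;
        for (i = 0; i < n; i++) pap += p[i] * ap[i];
        alpha = rr / pap;
        rr_new = 0.0;
        for (i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * ap[i];
            rr_new += r[i] * r[i];
        }
        beta = rr_new / rr;
        rr   = rr_new;
        for (i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
    }
    free(r); free(p); free(ap);
    return it;                             /* iterations performed */
}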
Simulations are defined on four-dimensional lattices which are discrete
approximations to continuum space-time. The basic variables are 3 by 3
complex matrices. Four such matrices are associated with every lattice site.
The benchmark takes the common approach of updating the variables on
all even sites, and then on all odd sites, on alternate steps. Updating
a site variable requires a number of matrix multiplications and involves
matrices from neighbouring sites. Almost all the arithmetic operations
are vectorizable. However, achieving this vectorization incurs an overhead
from internal shifts of neighbouring matrices, which can become a
significant part of the execution time.
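The elementary arithmetic kernel behind these updates is the multiplication
of 3 by 3 complex matrices. Purely as an illustration (the benchmark's own
data layout and routine names differ), such a multiply looks like:

/* Multiply two 3x3 complex matrices: c = a * b.  Illustrative only. */
typedef struct { double re, im; } complex_t;

void mat3_mul(const complex_t a[3][3], const complex_t b[3][3],
              complex_t c[3][3])
{
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double re = 0.0, im = 0.0;
            for (int k = 0; k < 3; k++) {
                re += a[i][k].re * b[k][j].re - a[i][k].im * b[k][j].im;
                im += a[i][k].re * b[k][j].im + a[i][k].im * b[k][j].re;
            }
            c[i][j].re = re;
            c[i][j].im = im;
        }
}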
The parallel version of the program distributes the spatial dimensions of
the lattice over a cuboidal process grid. Communications involve both
the shifting of matrices from neighbouring processors and a global
summation followed by a broadcast.
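The benchmark implements these communications with PARMACS macros. Purely as
an analogy for readers more familiar with MPI, the two patterns correspond
roughly to a send/receive of lattice faces between neighbouring processes
and a global reduction; the sketch below is not the benchmark's code.

#include <mpi.h>

/* Shift one face of the local lattice to the neighbour in one direction
 * while receiving the matching face from the opposite neighbour. */
void shift_faces(double *sendbuf, double *recvbuf, int count,
                 int dest, int source)
{
    MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, dest,   0,
                 recvbuf, count, MPI_DOUBLE, source, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* Global summation followed by a broadcast of the result, e.g. for the
 * inner products needed by the conjugate gradient iteration. */
double global_sum(double local)
{
    double total;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;
}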
2. Operating Instructions
-------------------------
Changing problem size and number of processors:
----------------------------------------------
The problem is based on a 4-dimensional space-time lattice of size:
N = NX**3 * NT.
For the purposes of the benchmark, NX & NT are specified as integer powers
of 2, so that: NX = 2**LOGNX, NT = 2**LOGNT
In the parallel version of the program the number of processors (NP) over
which the spatial dimensions of the lattice are distributed is determined
by the parameter LOGP, which is the log to base 2 of the required number
of processors, i.e. NP = 2**LOGP.
The specified number of processors is configured as a 3D grid internally
within the program.
NP = NPX * NPY * NPZ
where NPX, NPY and NPZ are all powers of two and NPX >= NPY >= NPZ.
The local lattice size on each processor is then:
n = NT * (NX/NPX) * (NX/NPY) * (NX/NPZ)
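The following small C program shows how the global lattice size, the process
grid and the local lattice size follow from LOGNX, LOGNT and LOGP. The even
split of LOGP over the three grid dimensions is only an assumption for
illustration; the program's internal rule for choosing NPX, NPY and NPZ may
differ.

#include <stdio.h>

int main(void)
{
    int LOGNX = 3, LOGNT = 2, LOGP = 3;        /* e.g. the 4 * 8**3 case     */

    int NX = 1 << LOGNX, NT = 1 << LOGNT;
    long N = (long)NX * NX * NX * NT;          /* global lattice: NX**3 * NT */

    int NP  = 1 << LOGP;                       /* number of processors       */
    int lpx = (LOGP + 2) / 3;                  /* split LOGP so that         */
    int lpy = (LOGP - lpx + 1) / 2;            /* NPX >= NPY >= NPZ,         */
    int lpz =  LOGP - lpx - lpy;               /* all powers of two          */
    int NPX = 1 << lpx, NPY = 1 << lpy, NPZ = 1 << lpz;

    long n = (long)NT * (NX / NPX) * (NX / NPY) * (NX / NPZ);  /* per node   */

    printf("N = %ld sites, NP = %d = %d x %d x %d, local n = %ld sites\n",
           N, NP, NPX, NPY, NPZ, n);
    return 0;
}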
Suggested Problem Sizes:
------------------------
It is recommended that the benchmark be run with four standard problem
sizes: 4 * 8**3, 8 * 16**3, 16 * 32**3 and 16 * 64**3.
The input parameters and total memory requirement for array storage for
each problem size are given in the following table:
Problem Size LOGNT LOGNX Approx Memory(Mbyte)
4 * 8**3 2 3 2
8 * 16**3 3 4 32
16 * 32**3 4 5 512
16 * 64**3 4 6 4096
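The figures in the table correspond to roughly 1 Kbyte of array storage per
lattice site. This factor is read off the table rather than taken from the
source, but it allows a quick estimate for other parameter choices:

#include <stdio.h>

int main(void)
{
    int LOGNT = 3, LOGNX = 4;                          /* the 8 * 16**3 case */
    long sites   = (1L << LOGNT) * (1L << (3 * LOGNX));
    double mbyte = sites / 1024.0;                     /* ~1 Kbyte per site  */
    printf("%ld sites -> approx %.0f Mbyte\n", sites, mbyte);
    return 0;
}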
Compiling and Running the Benchmark:
------------------------------------
1) Choose problem size and number of processors, edit the include file
qcd2.inc to set the appropriate parameters.
2) To compile and link the benchmark type: `make' for the distributed
version or `make slave' for the single-node version.
3) If any of the parameters in the include files are changed,
the code has to be recompiled. The makefile will automatically
recompile only the affected files.
4) On some systems it may be necessary to allocate the appropriate
resources before running the benchmark, e.g. on the iPSC/860
to reserve a cube of 8 processors, type: getcube -t8
5) To run either the sequential or the distributed version of the benchmark,
type: qcd2
The progress of the benchmark execution can be monitored via
the standard output, whilst a permanent copy of the benchmark output
is written to a file called 'result'.
6) If the run is successful and a permanent record is required, the
file 'result' should be copied to another file before the next run
overwrites it.
3. Hints for Optimisation (Blockshift versus indirect addressing)
-----------------------------------------------------------------
Two routines are provided for the shift operation, blockshift and
shiftvec. Blockshift shifts coherent blocks corresponding to a given
lattice direction. The problem is that the block length is rather small
for the t & x directions, so with large vector start-up costs the vector
efficiency is poor. Hence in these directions it is more efficient to use
shiftvec, the indirect-addressing version of the shift routine.
The shift routines are called from the routine dvec, which by default
uses shiftvec in the t and x directions and blockshift in the y & z
directions. For best performance with smaller lattices, blockshift should
be used in the y direction.
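As an illustration of the two strategies (names and signatures are
illustrative only, not the benchmark's routines): blockshift-style code moves
whole contiguous blocks, whereas shiftvec-style code gathers through a
precomputed index table, giving one long loop that vectorizes well even when
the natural block length is short.

/* Block-oriented shift: copy contiguous blocks.  Efficient when the block
 * length is large (y and z directions on large lattices). */
void blockshift_sketch(const double *src, double *dst,
                       int nblocks, int blocklen, int stride)
{
    for (int b = 0; b < nblocks; b++)
        for (int i = 0; i < blocklen; i++)
            dst[b * blocklen + i] = src[b * stride + i];
}

/* Indirect-addressed shift: gather through an index table built once in
 * advance.  The single long loop suits short block lengths (t and x). */
void shiftvec_sketch(const double *src, double *dst,
                     const int *index, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[index[i]];
}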
4. Accuracy Check
-----------------
The output results are best characterised by the total energy per
lattice point (output in column 3). The program can be considered to
have run successfully if the following two conditions are met.
1) The total energy should be constant to 5 decimal places for each
iteration (a small variation in the 6th decimal place is allowable).
2) This constant value should be close to 3.0.
Unfortunately it is difficult to be more precise, as the fermion and gauge
fields are initialised by a random number generator; consequently the exact
value of the total energy depends on the number of processors and on the
problem size.
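A sketch of this check in C is given below; the tolerances are an
interpretation of the two conditions above rather than values fixed by the
benchmark.

#include <math.h>

/* energy[] holds the column-3 values from 'result', one per iteration. */
int check_energy(const double *energy, int niter)
{
    for (int i = 0; i < niter; i++) {
        if (fabs(energy[i] - energy[0]) > 1.0e-5)  /* constant to 5 places */
            return 0;
        if (fabs(energy[i] - 3.0) > 0.1)           /* "close to 3.0"       */
            return 0;
    }
    return 1;                                      /* run looks successful */
}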
$Id: ReadMe,v 1.2 1994/04/20 16:50:10 igl Rel igl $
Submitted by Mark Papiani,
last updated on 10 Jan 1995.