==================================================================
=== ===
=== GENESIS Distributed Memory Benchmarks ===
=== ===
=== LPM1 ===
=== ===
=== Local Particle-Mesh Device Simulation ===
=== ===
=== Author: Roger Hockney ===
=== Department of Electronics and Computer Science ===
=== University of Southampton ===
=== Southampton SO9 5NH, U.K. ===
=== fax.:+44-703-593045 e-mail:rwh@uk.ac.soton.ecs ===
=== ===
=== Copyright: SNARC, University of Southampton ===
=== ===
=== Last update: June 1993; Release: 2.2 ===
=== ===
==================================================================
1. Description
--------------
This benchmark is the simulation of an electronic device
using a particle-mesh (PM) method, often also called a
particle-in-cell (PIC) simulation. In each timestep the
electric and magnetic fields on an (LMAX x MMAX) mesh are
advanced explicitly in time using Maxwell's equations, and
the particles (electrons) are advanced in the fields using
Newton's equations.
The benchmark is described as local because the time scale
is such that the fields may be computed explicitly, using
fields only local to each mesh point. Four benchmark cases
are provided (NBEN3=1,2,3,4), giving four problem sizes
described by the size factor alpha=1,2,4,8 and mesh numbers
(75*alpha,33). The number of particles at the end of the
run of 1 picosecond is given empirically by
628*alpha**1.172
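
As an illustration only, the following stand-alone sketch (not part
of the benchmark distribution; the program and variable names are
ours) evaluates the mesh dimensions and the empirical final particle
count for the four cases:

      PROGRAM SIZES
C     Illustrative sketch, not part of the benchmark: print the mesh
C     size and the empirical final particle count for each case.
      INTEGER NBEN3, IALPHA, LMAX, MMAX, NPART
      DO 10 NBEN3 = 1, 4
         IALPHA = 2**(NBEN3-1)
         LMAX = 75*IALPHA
         MMAX = 33
         NPART = NINT(628.0*REAL(IALPHA)**1.172)
         WRITE(*,*) 'case', NBEN3, ':', LMAX, 'x', MMAX,
     &              ' mesh,', NPART, ' particles'
   10 CONTINUE
      END
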
As the number of mesh-points increases for the same physical
dimension, the time-step must be reduced to satisfy the CFL
stability criterion. This effect has an important influence
on the meaning of the performance metrics. The performance
is expressed in several different metrics (and units) for
comparison purposes. As well as the traditional Speedup and
Efficiency, we give the Temporal (tstep/s), Simulation
(sim-ps/s), and Benchmark (Mflop/s(LPM1)) performance, which
are much more meaningful and useful measures.
Parallelisation is by one-dimensional domain decomposition,
in the first coordinate. Each processor is responsible for
a slab of space, and stores the mesh-points and coordinates
of particles in its region of space. During each timestep,
particle coordinates are transferred between processors as
the particles move from region to region.
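
In outline, each timestep on each processor therefore looks roughly
like the sketch below (illustrative only; the real loop is in
lpm1bk.f and the PARMACS message-passing is in node.u, and the
boundary-field exchange shown is our assumption, implied by the
explicit local field update on a decomposed mesh):

      PROGRAM OUTLIN
C     Structural sketch of one processor's timestep loop; the steps
C     are comments only, the real code is in lpm1bk.f and node.u.
      INTEGER ISTEP, NSTEP
      NSTEP = 1000
      DO 100 ISTEP = 1, NSTEP
C        (1) advance E and B on the local slab of the mesh
C            (explicit Maxwell step, local mesh points only)
C        (2) exchange boundary mesh values with the neighbouring
C            slabs (assumed; the PARMACS calls live in node.u)
C        (3) advance the locally stored particles in the fields
C            (Newton's equations)
C        (4) transfer particles that have moved out of the local
C            slab to the neighbouring processor
  100 CONTINUE
      END
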
Error Check
-----------
Because the simulation uses random numbers, the multi-processor
calculation cannot be expected to give identical results to the
uni-processor calculation. However, the percentage differences
in the particle number, NP, and the average B-field, BAV, in the
last timestep should not exceed a few percent.
Calculations are accepted if the differences are less than 10%.
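
A minimal sketch of this acceptance test (the function and argument
names below are ours, not the benchmark's): NP1 and BAV1 are the
1-processor reference values, NPP and BAVP the multi-processor
values from the last timestep.

      LOGICAL FUNCTION ACCEPT(NP1, NPP, BAV1, BAVP)
C     Accept the run if the final particle number and the average
C     B-field each differ from the 1-processor reference by < 10%.
      INTEGER NP1, NPP
      REAL BAV1, BAVP, DNP, DBAV
      DNP  = 100.0*ABS(REAL(NPP - NP1))/REAL(NP1)
      DBAV = 100.0*ABS(BAVP - BAV1)/ABS(BAV1)
      ACCEPT = (DNP .LT. 10.0) .AND. (DBAV .LT. 10.0)
      END
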
Temporal Performance
--------------------
Temporal performance is the inverse of the execution time,
here expressed in units of timestep per second (tstep/s).
This is the fundamental metric of performance, because it
is in absolute units and one can guarantee that the code with
the highest temporal performance executes in the least time.
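
That is, if a run of NSTEP timesteps takes TWALL wall-clock seconds,
the temporal performance is simply (a sketch, with names of our
choosing):

      REAL FUNCTION RTEMP(NSTEP, TWALL)
C     Temporal performance in timesteps per second (tstep/s),
C     e.g. 2000 steps in 50 s gives 40 tstep/s.
      INTEGER NSTEP
      REAL TWALL
      RTEMP = REAL(NSTEP)/TWALL
      END
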
Speedup and Efficiency
----------------------
Speedup, Sp, has the traditional definition of the ratio of the
1-processor to the p-processor execution time, and Efficiency, Ep, is
Speedup per processor. Because Speedup is a relative
measure, the program with the highest Speedup may not
execute in the least time! Be warned.
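
For completeness, a sketch of the two relative measures (our names;
T1 and TP are the 1-processor and p-processor execution times for
the same case):

      SUBROUTINE SPDEFF(T1, TP, NPROC, SP, EP)
C     Speedup Sp = T1/Tp and Efficiency Ep = Sp/p for a run on
C     NPROC processors; both are relative, not absolute, measures.
      REAL T1, TP, SP, EP
      INTEGER NPROC
      SP = T1/TP
      EP = SP/REAL(NPROC)
      END
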
Simulation Performance
----------------------
This metric measures the amount of simulated time computed
in one real wall-clock second. It is the most meaningful
metric for a simulation, because it is what the user actually
wishes to maximise. For this benchmark, the units are
simulated picosecond per second (sim-ps/s). In this metric
larger problems with more mesh points appear slower (as in fact
they are), even though they generate more Speedup and
Mflop/s! This metric also includes the fact that problems
with a smaller space step often must use a smaller timestep,
and therefore take more timesteps to cover the same amount
of simulated time.
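
Equivalently, the simulation performance is the temporal performance
multiplied by the timestep expressed in simulated picoseconds
(a sketch, our names; DTPS is the timestep in sim-ps):

      REAL FUNCTION RSIM(NSTEP, DTPS, TWALL)
C     Simulation performance in simulated picoseconds per wall-clock
C     second (sim-ps/s): simulated time covered divided by run time.
      INTEGER NSTEP
      REAL DTPS, TWALL
      RSIM = REAL(NSTEP)*DTPS/TWALL
      END
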
Benchmark Performance
---------------------
This metric is calculated from the nominal number of
floating-point operations needed to perform the benchmark
on a single processor. For the one-picosecond benchmark
setup here, the average number of floating-point operations
per timestep is defined to be:
F_b(alpha) = 46*75*33*alpha + 58*628*alpha**1.172
where the size factor alpha=1,2,4,8 for cases NBEN3=1,2,3,4.
The first term above is the work to update the fields on the
mesh, and the second term is the work to move the particles.
Then the benchmark performance is
R_b(alpha,p) = F_b(alpha)/Tp(alpha,p)
Performance calculated in this way has the units
Mflop/s(LPM1). Different parallel implementations may,
in fact, perform more or fewer operations than the above, but
they are only credited with the number given by the formula.
Because F_b is fixed for all codes, we can guarantee that the
code with the highest benchmark performance executes in the
least time.
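
As a worked example: for alpha=1,
F_b = 46*75*33 + 58*628 = 113850 + 36424 = 150274 flop per timestep,
i.e. about 0.15 Mflop. The sketch below (our names) assumes that
Tp(alpha,p) in the formula above is the measured execution time per
timestep, which is what the Mflop/s units require:

      REAL FUNCTION RBENCH(IALPHA, TPSTEP)
C     Benchmark performance in Mflop/s(LPM1): nominal flop count for
C     one timestep divided by the measured time per timestep (s).
      INTEGER IALPHA
      REAL TPSTEP, FB
      FB = 46.0*75.0*33.0*REAL(IALPHA)
     &   + 58.0*628.0*REAL(IALPHA)**1.172
      RBENCH = 1.0E-6*FB/TPSTEP
      END
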
Operating Instructions
----------------------
To compile and link the benchmark type: `make' for the distributed
version or `make slave' for the single-node version.
To run the compiled program (e.g. on an Intel iPSC), type:
getcube -t4 ! to allocate cube
lpm1 ! to run benchmark.
In some systems the allocate command may not be necessary.
Then answer one question:
(1) Number of nodes for MIMD run: at maximum, this is equal to
the number of nodes allocated by getcube (4 in the above example).
This is the number of nodes (processors) to be used in the
calculation. The value may be: 1, 2, ... , maximum nodes (here 4).
Note: For every problem size, the 1-processor calculation must be
performed once, to obtain the reference time for the Speedup measure.
The timing results are stored in the four check result files:
res1p.size1, ... , res1p.size4.
We recommend that your first run uses 1 processor; otherwise the
Speedup will be printed as zero. You do not have to rerun the
1-processor case when you change the number of processors allocated.
The results for the four problem sizes, cases 1,2,3 and 4, and
different numbers of processors are put automatically in
different output files, with notation (for example):
lpm1c3p25 - output for lpm1 benchmark, case 3 for 25 processors
If you wish to put the files elsewhere, a prompt tells you
when to do so with a Unix cp command.
Files
-----
lpm1.u - host program, contains PARMACS for host.
node.u - node main program and all communication interface
routines, therefore all node PARMACS calls are here.
benctl.f - benchmark control, may be changed to modify
output, but usually left alone. No PARMACS here.
lpm1bk.f - body of benchmark code. Not to be touched.
res1p.size1 - correct results on one processor for standard
size problem, case 1, (75x33) mesh.
res1p.size2 - results for case 2 problem (150x33) mesh.
res1p.size3 - results for case 3 problem (300x33) mesh.
res1p.size4 - results for case 4 problem (600x33) mesh.
secowa.f - LPM1 program second timer, which calls
timer.f - the standard benchmark system timer
header.f - standard header information
setdat.f - puts date on results
setdtl.f - compiler and system details
lpm1c4p100 - etc, output files generated by program
$Id: ReadMe,v 1.2 1994/04/20 17:33:35 igl Rel igl $
Submitted by Mark Papiani,
last updated on 10 Jan 1995.