Titanium v2.411 has been installed on the NPACI datastar.sdsc.edu machine (the replacement for NPACI Blue Horizon), in the following locations:

  /usr/local/apps/titanium/dist/bin/tcbuild     -- 32-bit
  /usr/local/apps/titanium/dist64/bin/tcbuild   -- 64-bit

This email contains valuable information about the machine and about using Titanium on it. Please read it and save it as a guide for later use.

Access
------
Everyone who previously held a Blue Horizon account should already have access to DataStar (although you may need to contact the consultants for a password reset), and we should be able to add DataStar accounts for any new users who want them.

The machine usage and configuration are described here:

  http://www.npaci.edu/DataStar/

Executive summary of the hardware:
----------------------------------
DataStar is a Power4/Federation IBM SP machine - the hardware successor to the Power3/Colony SP (a.k.a. seaborg). All the system software is comparable, but the CPU and network hardware is superior in almost every way (the only exception being that seaborg has more total CPUs).

DataStar has a mix of 8-way nodes with 16 GB of memory and 32-way nodes with 128-256 GB. Each Power4 CPU runs at 1.6 GHz and delivers 6.8 GFlops peak performance (compare to seaborg's 375 MHz / 1.5 GFlops per CPU). Each Power4 CPU has a two-way associative L1 cache (32 KB) and a four-way associative L2 cache (1.4 MB), and the CPUs on a node share an 8-way associative L3 cache (16 MB per processor). The Federation network features a brand-new switch that delivers measurably improved communication bandwidth and latency.

Titanium computational performance:
-----------------------------------
To give you some idea what all this fancy hardware could mean for your Titanium application performance, here's a microbenchmark timing of the cost of randomly indexing into a Titanium array with no strength-reduction on gasnet-lapi-smp - a common operation that appears in many applications, e.g. A[B[i]] (a sketch of this pattern appears at the end of this section):

                         Seaborg       DataStar
  1-d random access:     14.408 ns      3.736 ns
  2-d random access:     27.959 ns      7.648 ns
  3-d random access:     47.523 ns     12.008 ns

That's about a 4x performance improvement!! (and all you have to do to get it is recompile!)

As another example, small Titanium array copies (a bottleneck in Tong's code) show similar and even larger improvement ratios:

                                           Seaborg     DataStar
  small contiguous local 1-d array copy:   0.611 us    0.176 us
  small contiguous local 2-d array copy:   1.113 us    0.278 us
  small contiguous local 3-d array copy:   1.490 us    0.384 us
  non-contiguous local 1-d array copy:     6.608 us    2.332 us
  non-contiguous local 2-d array copy:     9.366 us    3.062 us
  non-contiguous local 3-d array copy:    10.841 us    3.430 us

Many other computational operations I compared showed a similar improvement ratio of about 3x-4x on DataStar over seaborg. Clearly the zippy new Power4 processors really pay off in terms of achievable computational performance - I'm eager to see how much speedup our full applications can get. With all this beefy new CPU power at our disposal, it seems a realistic goal to get the heart code well below 1 second per timestep.
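For the curious, here's a rough sketch of the 1-d random-access pattern referred to above (a hypothetical illustration, not the actual microbenchmark source - the names N, A, and B are made up):

  // hypothetical Titanium sketch of the A[B[i]] access pattern
  final int N = 1000000;            // made-up problem size
  RectDomain<1> D = [0 : N-1];
  int [1d] A = new int[D];
  int [1d] B = new int[D];
  // ... fill B with random indices in [0, N-1] ...
  long sum = 0;
  foreach (p in D) {
    sum += A[B[p]];                 // indirect index - defeats strength-reduction
  }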
Backends available:
-------------------
The recommended backends on this machine are gasnet-lapi-uni and gasnet-lapi-smp. These backends supersede sp3 (which is not supported on DataStar) and should provide the best performance on the SP. They're also compatible with native code using MPI (e.g. FFTW-MPI), and will still provide the best possible LAPI-based performance for communication outside the MPI code. This means you have every reason to try them out now.

All application writers should switch immediately to gasnet-lapi-uni (if you use one thread per task - which includes any app calling FFTW-MPI) or gasnet-lapi-smp (if you use multiple threads per task, e.g. TI_THREADS="8 8"). The only thing you should need to change to use these backends is the --backend switch to tcbuild - everything else should be unchanged from sp3 or mpi-cluster-*. gasnet-lapi-* provides a slight improvement in small-message latency and a huge improvement in bandwidth over mpi-cluster-*.

If you were previously using the mpi-* backends and want to run your pure-Titanium program on gasnet-lapi-*, you'll need to update your poe command as follows:

  poe -msg_api lapi

If you want to use both LAPI and MPI (i.e. gasnet-lapi-* and FFTW-MPI), you need to specify:

  poe -msg_api lapi,mpi

That way both layers are available for use (note this also limits you to 8 tasks per node, since mpi and lapi each need their own network "window" and the seaborg nodes only have 16 windows). Note that there are some tricky potential deadlocks when mixing LAPI (Titanium communication) with MPI (libraries like FFTW-MPI) - the easiest solution is to put Ti.barrier() calls around the sections of code that invoke MPI (see the P.S. below for a sketch).

Other miscellaneous recommendations:
------------------------------------
DataStar has gobs of physical memory, so if your application is memory hungry then you should probably be using 64-bit mode (as opposed to adding more processors just to get more memory). 32-bit mode limits your total memory usage to 2 GB per OS process (due to VM space limitations), but 64-bit mode allows a single process to address and use the entire physical memory. Best of all, the only thing required to use 64-bit mode is to recompile using the 64-bit version of tcbuild (see the pathname at the top of this email) and ensure you are linking 64-bit versions of any external libraries (e.g. FFTW). As of AIX 5.2 the kernel is fully 64-bit, which means that some system operations may even be faster in 64-bit mode than in 32-bit mode. So give it a try!

The DataStar install of the Titanium translator was compiled with IBM C++, which we've never used before. The entire regression suite passes and the compiler appears stable, but please report any compilation problems. Incidentally, Titanium compilations run MUCH faster on DataStar, thanks to the faster scalar processors.

Interactive jobs are currently limited to a single node - consult the DataStar webpage given above for sample batch scripts and plenty of explanation on submitting batch jobs.

Enjoy..

Dan
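P.S. To make the LAPI+MPI deadlock workaround concrete, the idea is simply to fence any native MPI section with barriers, along these lines (a hypothetical sketch - mpiFFTStep() stands in for whatever native method wraps your FFTW-MPI calls):

  Ti.barrier();      // quiesce all Titanium/LAPI communication first
  mpiFFTStep();      // native code that performs MPI operations (e.g. FFTW-MPI)
  Ti.barrier();      // hold everyone here until the MPI section completes

The barriers ensure no Titanium-level LAPI traffic is in flight while MPI owns the network, which is what avoids the deadlock.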