Titanium v2.411 has been installed on the NPACI datastar.sdsc.edu machine (the replacement for NPACI Blue Horizon), in the following locations:

  /usr/local/apps/titanium/dist/bin/tcbuild     -- 32-bit
  /usr/local/apps/titanium/dist64/bin/tcbuild   -- 64-bit

This email contains valuable information about the machine and about using Titanium on it. Please read it and save it as a guide for later use.

Access
------
Everyone who previously held a Blue Horizon account should already have access to DataStar (although you may need to contact the consultants for a password reset), and we should be able to add DataStar accounts for any new users who want them.

The machine usage and configuration are described here:

  http://www.npaci.edu/DataStar/

Executive summary of the hardware:
----------------------------------
DataStar is a Power4/Federation IBM SP machine - the hardware successor to the Power3/Colony SP (a.k.a. seaborg). All the system software is comparable, but the CPU and network hardware is superior in almost every way (the only exception being that seaborg has more total CPUs).

DataStar has a mix of 8-way nodes with 16 GB of memory and 32-way nodes with 128-256 GB. Each Power4 CPU runs at 1.6 GHz and delivers 6.8 GFlops peak performance (compare to seaborg's 375 MHz / 1.5 GFlops per CPU). Each Power4 CPU has a two-way associative L1 cache (32 KB) and a four-way associative L2 cache (1.4 MB), and the CPUs on a node share an 8-way associative L3 cache (16 MB per processor). The Federation network features a brand-new switch that delivers measurably improved communication bandwidth and latency.

Titanium computational performance:
-----------------------------------
To give you some idea what all this fancy hardware could mean for your Titanium application performance, here's a microbenchmark timing of the cost of randomly indexing into a Titanium array with no strength-reduction on gasnet-lapi-smp - a common operation that appears in many applications, e.g. A[B[i]] (a sketch of this pattern appears at the end of this section):

                         Seaborg       DataStar
  1-d random access:     14.408 ns      3.736 ns
  2-d random access:     27.959 ns      7.648 ns
  3-d random access:     47.523 ns     12.008 ns

That's about a 4x performance improvement!! (and all you have to do to get it is recompile!)

As another example, small Titanium array copies (a bottleneck in Tong's code) show similar and even larger improvement ratios:

                                           Seaborg     DataStar
  small contiguous local 1-d array copy:   0.611 us    0.176 us
  small contiguous local 2-d array copy:   1.113 us    0.278 us
  small contiguous local 3-d array copy:   1.490 us    0.384 us
  non-contiguous local 1-d array copy:     6.608 us    2.332 us
  non-contiguous local 2-d array copy:     9.366 us    3.062 us
  non-contiguous local 3-d array copy:    10.841 us    3.430 us

Many other computational operations I compared showed a similar improvement ratio of about 3x-4x on DataStar over seaborg. Clearly the zippy new Power4 processors really pay off in terms of achievable computational performance - I'm eager to see how much speedup our full applications can get. With all this beefy new CPU power at our disposal, it seems a realistic goal to get the heart code well below 1 second per timestep.
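For the curious, here's a rough sketch of the 1-d random-access pattern referred to above (a hypothetical illustration, not the actual microbenchmark source - the names N, A, and B are made up):

  // hypothetical Titanium sketch of the A[B[i]] access pattern
  final int N = 1000000;            // made-up problem size
  RectDomain<1> D = [0 : N-1];
  int [1d] A = new int[D];
  int [1d] B = new int[D];
  // ... fill B with random indices in [0, N-1] ...
  long sum = 0;
  foreach (p in D) {
    sum += A[B[p]];                 // indirect index - defeats strength-reduction
  }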
Backends available:
-------------------
The recommended backends on this machine are gasnet-lapi-uni and gasnet-lapi-smp. These backends supersede sp3 (which is not supported on DataStar) and should provide the best performance on the SP. They're also compatible with native code using MPI (e.g. FFTW-MPI), and will still provide the best possible LAPI-based performance for communication outside the MPI code. This means you have every reason to try them out now.

All application writers should switch immediately to gasnet-lapi-uni (if you use one thread per task - which includes any app calling FFTW-MPI) or gasnet-lapi-smp (if you use multiple threads per task, e.g. TI_THREADS="8 8"). The only thing you should need to change to use these backends is the --backend switch to tcbuild - everything else should be unchanged from sp3 or mpi-cluster-*. gasnet-lapi-* provides a slight improvement in small-message latency and a huge improvement in bandwidth over mpi-cluster-*.

If you were previously using the mpi-* backends and want to run your pure-Titanium program on gasnet-lapi-*, you'll need to update your poe command as follows:

  poe -msg_api lapi

If you want to use both LAPI and MPI (i.e. gasnet-lapi-* and FFTW-MPI), you need to specify:

  poe -msg_api lapi,mpi

That way both layers are available for use (note this also limits you to 8 tasks per node, since mpi and lapi each need their own network "window" and the seaborg nodes only have 16 windows). Note that there are some tricky potential deadlocks when mixing LAPI (Titanium communication) with MPI (libraries like FFTW-MPI) - the easiest solution is to put Ti.barrier() calls around the sections of code that invoke MPI (see the P.S. below for a sketch).

Other miscellaneous recommendations:
------------------------------------
DataStar has gobs of physical memory, so if your application is memory hungry then you should probably be using 64-bit mode (as opposed to adding more processors just to get more memory). 32-bit mode limits your total memory usage to 2 GB per OS process (due to VM space limitations), but 64-bit mode allows a single process to address and use the entire physical memory. Best of all, the only thing required to use 64-bit mode is to recompile using the 64-bit version of tcbuild (see the pathname at the top of this email) and ensure you are linking 64-bit versions of any external libraries (e.g. FFTW). As of AIX 5.2 the kernel is fully 64-bit, which means that some system operations may even be faster in 64-bit mode than in 32-bit mode. So give it a try!

The DataStar install of the Titanium translator was compiled with IBM C++, which we've never used before. The entire regression suite passes and the compiler appears stable, but please report any compilation problems. Incidentally, Titanium compilations run MUCH faster on DataStar, thanks to the faster scalar processors.

Interactive jobs are currently limited to a single node - consult the DataStar webpage given above for sample batch scripts and plenty of explanation on submitting batch jobs.

Enjoy..

Dan
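P.S. To make the LAPI+MPI deadlock workaround concrete, the idea is simply to fence any native MPI section with barriers, along these lines (a hypothetical sketch - mpiFFTStep() stands in for whatever native method wraps your FFTW-MPI calls):

  Ti.barrier();      // quiesce all Titanium/LAPI communication first
  mpiFFTStep();      // native code that performs MPI operations (e.g. FFTW-MPI)
  Ti.barrier();      // hold everyone here until the MPI section completes

The barriers ensure no Titanium-level LAPI traffic is in flight while MPI owns the network, which is what avoids the deadlock.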