Tracing Titanium application communication ========================================= Titanium's GASNet backend include features that can be used to trace communication and find communication bottlenecks in your application on the gasnet-* backends using the new gasnet_trace tool. *** Step 1: compile with the tcbuild-trace version of the compiler and pass -g: tcbuild-trace -g --backend gasnet-lapi-smp arrayCopyTest.ti *** Step 2: Run your program as usual, but use the tcrun -trace flag: setenv TI_THREADS "2 2" tcrun -trace -n 2 arrayCopyTest Alternatively, if you use poe or lapipoe to start jobs, you can emulate the -trace flag by setting the additional GASNET_TRACEFILE and GASNET_TRACEMASK variables shown below (the first controls the name of the resulting trace files, the second controls which types of events end up in the trace): setenv TI_THREADS "2 2" setenv GASNET_TRACEFILE "trace.%" setenv GASNET_TRACEMASK GPBHWXIN setenv GASNET_TRACELOCAL 0 lapipoe 2 1 arrayCopyTest The optional variable GASNET_TRACELOCAL suppresses trace generation of local accesses through global pointers, which can greatly reduce the size and overhead of trace file generation (although it also obviously hides those accesses from the subsequent analysis). When your application runs, you'll see output which begins similarly to this: llsubmit: Processed command file through Submit Filter: "/usr/common/nsg/etc/subfilter". GASNet reporting enabled - tracing and statistical output directed to ti_trace-arrayCopyTest-2-753892-1 GASNet reporting enabled - tracing and statistical output directed to ti_trace-arrayCopyTest-2-753892-0 ----------------------------------------------------------------------- WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING This application was built from a Titanium compiler installation configured with global, forced debugging mode. This usually has a SERIOUS impact on performance, so you should NOT trust any performance numbers obtained from this program run!!! WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING ----------------------------------------------------------------------- Tic: s05413 is OS process 1 of 2 (Ti procs 2..3 of 4) Tic: s05413 is OS process 0 of 2 (Ti procs 0..1 of 4) ... regular program output... Note the important warning about performance - you should not trust any absolute performance numbers obtained from running code built with tcbuild-trace (because all optimizations are disabled, and there is a noticeable cost associated with tracing), however the trace information can still tell you where and how often your program is communicating. Also note the lines telling you that trace output is directed to the files ti_trace-* in the current directory (one per OS process). Note these trace files can grow quite large with long program runs, so ensure you have sufficient disk quota and consider cutting down the problem size when possible. These trace files are regular text files that you can look at in any editor, although they probably contain more detailed information than you'd generally like to see directly. We provide a gasnet_trace tool which summarizes the information in an easy-to-read format. *** Step 4: run gasnet_trace on the generated trace files I'd recommend a set of options like this: $ gasnet_trace -t -f -filter LOCAL ti_trace-* This will dump output to stdout summarizing the communication which took place during the program run represented by the trace. For the purposes of Titanium tracing, you probably always want -filter LOCAL (which filters out wide pointer accesses to memory which is actually local, showing only the true communication), but if you want to see where your program is using wide pointers instead of local pointers, you can remove that option. -f and -t respectively ensure the output includes a complete source file name and is broken out by Titanium thread (for pthreaded jobs). You can see all the gasnet_trace options using -h: $ gasnet_trace -h GASNet trace file summarization script, v1.0 Usage: gasnet_trace [options] trace-file(s) Options: -h -? -help See this message. -o [filename] Output results to file. Default is STDOUT. -report [r1][r2].. One or more capital letters to indicate which reports to generate: P(PUT), G(GET), and/or B(BARRIER). Default: all reports. -sort [f1],[f2]... Sort output by one or more fields: TOTAL, AVG, MIN, MAX, CALLS, TYPE, or SRC. (for GETS/PUTS, TOTAL, AVG, MIN, and MAX refer to message size: for BARRIERS, to time spent in barrier). Default: sort by SRC (source file/line). -filter [t1],[t2].. Filter out output by one or more types: LOCAL, GLOBAL, WAIT, WAITNOTIFY. -t -[no]thread Output detailed information for each thread. -i -[no]internal Show internal events (such as the initial and final barriers) which do not correspond to user source code. -f -[no]full Show the full source file name. Sample output is shown below - note there are currently three different reports (reporting on puts, gets, and barriers) and each line reports information about a given file name / line number in your original program that caused the communication, the number of times it happened, and statistics about message size (for put/get) or barrier stalling time (for barriers). We're actively working on improving gasnet_trace at the moment (so suggestions and bug reports are appreciated - https://mantis.lbl.gov/bugzilla/), but it already provides enough functionality to be of interest and use to our applications users. The trace includes some other information of potential interest to end users (notably tracking memory allocation, collectives and heap sizes) - but we don't currently summarize that in gasnet_trace yet. However, you can view this information by opening thr trace files in any text editor (they are ordinary, self-explanatory text files). GET REPORT: SOURCE LINE TYPE MSG:(min max avg total) CALLS =============================================================================== arrayCopyTest.ti 70 GLOBAL 8 B 128.00 K 56.72 K 1.88 M 34 Thread 0 8 B 128.00 K 56.72 K 964.28 K 17 Thread 2 8 B 128.00 K 56.72 K 964.28 K 17 arrayCopyTest.ti 75 GLOBAL 2.00 K 128.00 K 80.33 K 482.00 K 6 Thread 3 2.00 K 128.00 K 80.33 K 482.00 K 6 arrayCopyTest.ti 81 GLOBAL 8 B 128.00 K 30.16 K 482.56 K 16 Thread 1 8 B 128.00 K 43.84 K 482.28 K 11 Thread 3 8 B 256 B 58 B 288 B 5 arrayCopyTest.ti 87 GLOBAL 8 B 256 B 58 B 576 B 10 Thread 0 8 B 256 B 58 B 288 B 5 Thread 2 8 B 256 B 58 B 288 B 5 PUT REPORT: SOURCE LINE TYPE MSG:(min max avg total) CALLS =============================================================================== arrayCopyTest.ti 63 GLOBAL 40 B 40 B 40 B 480 B 12 Thread 2 40 B 40 B 40 B 240 B 6 Thread 3 40 B 40 B 40 B 240 B 6 arrayCopyTest.ti 64 GLOBAL 40 B 40 B 40 B 480 B 12 Thread 2 40 B 40 B 40 B 240 B 6 Thread 3 40 B 40 B 40 B 240 B 6 arrayCopyTest.ti 70 GLOBAL 40 B 40 B 40 B 1.09 K 28 Thread 2 40 B 40 B 40 B 1.09 K 28 arrayCopyTest.ti 75 GLOBAL 8 B 128.00 K 26.70 K 1.69 M 65 Thread 1 8 B 256 B 58 B 288 B 5 Thread 3 8 B 128.00 K 28.92 K 1.69 M 60 arrayCopyTest.ti 81 GLOBAL 8 B 128.00 K 46.88 K 1.69 M 37 Thread 1 8 B 128.00 K 54.20 K 1.69 M 32 Thread 3 8 B 256 B 58 B 288 B 5 arrayCopyTest.ti 87 GLOBAL 8 B 256 B 58 B 576 B 10 Thread 1 8 B 256 B 58 B 288 B 5 Thread 3 8 B 256 B 58 B 288 B 5 BARRIER REPORT: SOURCE LINE TYPE MSG:(min max avg total) CALLS =============================================================================== arrayCopyTest.ti 31 WAIT 110.0 us 18.3 ms 1.7 ms 39.8 ms 24 Thread 0..1 146.0 us 207.0 us 164.2 us 985.0 us 6 Thread 2..3 110.0 us 18.3 ms 3.2 ms 18.9 ms 6 arrayCopyTest.ti 31 NOTIFYWAIT 13.0 us 39.0 us 28.7 us 688.0 us 24 Thread 0..1 13.0 us 27.0 us 20.7 us 124.0 us 6 Thread 2..3 35.0 us 39.0 us 36.7 us 220.0 us 6 arrayCopyTest.ti 36 WAIT 153.0 us 16.4 ms 5.7 ms 137.7 ms 24 Thread 0..1 153.0 us 226.0 us 166.5 us 999.0 us 6 Thread 2..3 8.0 ms 16.4 ms 11.3 ms 67.8 ms 6 arrayCopyTest.ti 36 NOTIFYWAIT 18.0 us 135.0 us 37.3 us 896.0 us 24 Thread 0..1 18.0 us 25.0 us 23.2 us 139.0 us 6 Thread 2..3 29.0 us 135.0 us 51.5 us 309.0 us 6 arrayCopyTest.ti 46 WAIT 17.0 us 7.3 ms 878.8 us 21.1 ms 24 Thread 0..1 185.0 us 856.0 us 435.3 us 2.6 ms 6 Thread 2..3 17.0 us 7.3 ms 1.3 ms 7.9 ms 6 arrayCopyTest.ti 46 NOTIFYWAIT 19.0 us 68.0 us 33.2 us 796.0 us 24 Thread 0..1 19.0 us 31.0 us 23.5 us 141.0 us 6 Thread 2..3 31.0 us 68.0 us 42.8 us 257.0 us 6 arrayCopyTest.ti 51 WAIT 118.0 us 342.0 us 175.2 us 4.2 ms 24 Thread 0..1 150.0 us 245.0 us 171.8 us 1.0 ms 6 Thread 2..3 118.0 us 342.0 us 178.7 us 1.1 ms 6 arrayCopyTest.ti 51 NOTIFYWAIT 13.0 us 40.0 us 27.0 us 648.0 us 24 Thread 0..1 13.0 us 23.0 us 18.8 us 113.0 us 6 Thread 2..3 30.0 us 40.0 us 35.2 us 211.0 us 6 arrayCopyTest.ti 57 WAIT 129.0 us 340.0 us 184.7 us 4.4 ms 24 Thread 0..1 146.0 us 183.0 us 156.2 us 937.0 us 6 Thread 2..3 129.0 us 340.0 us 213.2 us 1.3 ms 6 arrayCopyTest.ti 57 NOTIFYWAIT 12.0 us 31.0 us 23.9 us 574.0 us 24 Thread 0..1 12.0 us 24.0 us 18.7 us 112.0 us 6 Thread 2..3 27.0 us 31.0 us 29.2 us 175.0 us 6 arrayCopyTest.ti 63 WAIT 16.0 us 2.4 ms 282.2 us 13.5 ms 48 Thread 0..1 295.0 us 2.4 ms 547.6 us 6.6 ms 12 Thread 2..3 16.0 us 18.0 us 16.8 us 201.0 us 12 arrayCopyTest.ti 63 NOTIFYWAIT 13.0 us 130.0 us 58.5 us 2.8 ms 48 Thread 0..1 13.0 us 105.0 us 48.3 us 580.0 us 12 Thread 2..3 59.0 us 130.0 us 68.8 us 825.0 us 12 arrayCopyTest.ti 64 WAIT 16.0 us 533.0 us 195.3 us 9.4 ms 48 Thread 0..1 285.0 us 533.0 us 366.8 us 4.4 ms 12 Thread 2..3 16.0 us 99.0 us 23.9 us 287.0 us 12 arrayCopyTest.ti 64 NOTIFYWAIT 12.0 us 159.0 us 60.4 us 2.9 ms 48 Thread 0..1 12.0 us 78.0 us 48.7 us 584.0 us 12 Thread 2..3 61.0 us 159.0 us 72.1 us 865.0 us 12 arrayCopyTest.ti 70 WAIT 16.0 us 136.5 ms 1.7 ms 1.6 s 932 Thread 0..1 140.0 us 136.5 ms 1.9 ms 432.2 ms 233 Thread 2..3 16.0 us 38.9 ms 1.6 ms 367.7 ms 233 arrayCopyTest.ti 70 NOTIFYWAIT 12.0 us 168.0 us 41.4 us 38.6 ms 932 Thread 0..1 12.0 us 114.0 us 31.1 us 7.2 ms 233 Thread 2..3 25.0 us 168.0 us 51.7 us 12.0 ms 233 arrayCopyTest.ti 72 WAIT 111.0 us 9.0 ms 1.1 ms 21.9 ms 20 Thread 0..1 156.0 us 403.0 us 234.4 us 1.2 ms 5 Thread 2..3 111.0 us 9.0 ms 2.0 ms 9.8 ms 5 arrayCopyTest.ti 72 NOTIFYWAIT 15.0 us 67.0 us 27.1 us 542.0 us 20 Thread 0..1 15.0 us 67.0 us 29.6 us 148.0 us 5 Thread 2..3 24.0 us 27.0 us 24.6 us 123.0 us 5 arrayCopyTest.ti 76 WAIT 112.0 us 357.0 us 204.4 us 4.1 ms 20 Thread 0..1 146.0 us 357.0 us 205.0 us 1.0 ms 5 Thread 2..3 112.0 us 299.0 us 203.8 us 1.0 ms 5 arrayCopyTest.ti 76 NOTIFYWAIT 28.0 us 59.0 us 33.7 us 674.0 us 20 Thread 0..1 28.0 us 40.0 us 32.4 us 162.0 us 5 Thread 2..3 29.0 us 59.0 us 35.0 us 175.0 us 5 arrayCopyTest.ti 78 WAIT 121.0 us 251.0 us 159.1 us 3.2 ms 20 Thread 0..1 146.0 us 251.0 us 176.6 us 883.0 us 5 Thread 2..3 121.0 us 210.0 us 141.6 us 708.0 us 5 arrayCopyTest.ti 78 NOTIFYWAIT 18.0 us 44.0 us 28.8 us 576.0 us 20 Thread 0..1 18.0 us 23.0 us 20.4 us 102.0 us 5 Thread 2..3 27.0 us 44.0 us 37.2 us 186.0 us 5 arrayCopyTest.ti 82 WAIT 85.0 us 228.0 us 152.2 us 3.0 ms 20 Thread 0..1 143.0 us 173.0 us 160.4 us 802.0 us 5 Thread 2..3 85.0 us 228.0 us 144.0 us 720.0 us 5 arrayCopyTest.ti 82 NOTIFYWAIT 29.0 us 90.0 us 42.7 us 854.0 us 20 Thread 0..1 40.0 us 45.0 us 42.8 us 214.0 us 5 Thread 2..3 29.0 us 90.0 us 42.6 us 213.0 us 5 arrayCopyTest.ti 84 WAIT 108.0 us 226.0 us 164.8 us 3.3 ms 20 Thread 0..1 154.0 us 218.0 us 180.0 us 900.0 us 5 Thread 2..3 108.0 us 226.0 us 149.6 us 748.0 us 5 arrayCopyTest.ti 84 NOTIFYWAIT 18.0 us 52.0 us 31.2 us 624.0 us 20 Thread 0..1 18.0 us 21.0 us 19.0 us 95.0 us 5 Thread 2..3 30.0 us 52.0 us 43.4 us 217.0 us 5 arrayCopyTest.ti 88 WAIT 112.0 us 211.0 us 170.3 us 3.4 ms 20 Thread 0..1 150.0 us 202.0 us 175.4 us 877.0 us 5 Thread 2..3 112.0 us 211.0 us 165.2 us 826.0 us 5 arrayCopyTest.ti 88 NOTIFYWAIT 24.0 us 94.0 us 41.3 us 826.0 us 20 Thread 0..1 25.0 us 61.0 us 39.2 us 196.0 us 5 Thread 2..3 24.0 us 94.0 us 43.4 us 217.0 us 5 arrayCopyTest.ti 91 WAIT 78.0 us 273.0 us 158.1 us 3.2 ms 20 Thread 0..1 149.0 us 258.0 us 176.0 us 880.0 us 5 Thread 2..3 78.0 us 273.0 us 140.2 us 701.0 us 5 arrayCopyTest.ti 91 NOTIFYWAIT 12.0 us 52.0 us 30.3 us 606.0 us 20 Thread 0..1 12.0 us 21.0 us 17.6 us 88.0 us 5 Thread 2..3 31.0 us 52.0 us 43.0 us 215.0 us 5 Titanium also provides the ti.lang.Trace class, which enables programmatic modification of the tracing settings for the gasnet-* backends: /* ti.lang.Trace provides a language-level interface to the runtime's tracing & statistical collection system, which may only be available on some configurations */ public class Trace { /* output an arbitrary message into the trace log, if it exists */ public static native void print(String s); /* fetch the mask of currently enabled trace/stat events, in a configuration-dependent format */ public static native String getTraceMask(); public static native String getStatsMask(); /* set the mask for the currently enabled trace/stat events, in a configuration-dependent format Note the effective trace mask setting may be shared between several Titanium threads */ public static native void setTraceMask(String mask); public static native void setStatsMask(String mask); /* fetch/set whether or not local tracing is enabled */ public static native boolean getTraceLocal(); public static native void setTraceLocal(boolean newval); }