Overview
AMLAPI is an implementation of the AM-2 active message specification for the IBM SP using LAPI. AMLAPI was originally written by Simon Yau at UC Berkeley as part of a class project. The main use of this implementation is as the run-time communication layer for the UC Berkeley Titanium compiler. The version presented here is a modification to Simon's library in an effort to improve Titanium communication performance. The modifications were performed by myself and Dan Bonachea of UC Berkeley.
Short Message Performance
Titanium relies heavily on AM-2 short and medium length messages. AMLAPI implements request and reply handlers using the general LAPI active message call (LAPI_Amsend). When the first packet for a message arrives at the target, the header handler mallocs space for the message and informs the dispatcher where to place the data; it also registers a completion handler to be called when the entire message arrives. In addition, the header handler places a token on the AM bundle task queue for this message. Upon return from the header handler, the dispatcher acknowledges the original packet and arranges for all incoming data for this message to be written to the supplied buffer. Once the entire message arrives, the completion handler runs (in a special completion handler thread) and marks the token on the task queue as "ready". The AM request and reply handlers are executed during a call to AM_Poll, which examines the first element of the queue and, if it is marked "ready", executes the corresponding user handler. AM request and reply handlers must run in the context of the user application thread(s) attached to the bundle, so the completion handler cannot execute the user handler directly. Request handlers are required to issue an AM reply, causing another LAPI_Amsend call in the opposite direction.
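A short sketch may make this slow path concrete. This is not the AMLAPI source: the handler prototypes follow the IBM LAPI documentation, while am_token_t, token_alloc, and bundle_enqueue are hypothetical stand-ins for AMLAPI's internal token and task queue.

    #include <stdlib.h>
    #include <lapi.h>

    /* Hypothetical stand-ins for AMLAPI internals. */
    typedef struct am_token {
        volatile int ready;     /* set once the whole message has arrived */
        void        *payload;   /* buffer the LAPI dispatcher fills in    */
        ulong        nbytes;
    } am_token_t;

    extern am_token_t *token_alloc(void);
    extern void        bundle_enqueue(am_token_t *tok);

    /* Runs in the LAPI completion-handler thread after the last packet
     * has been deposited into tok->payload.                             */
    static void completion_handler(lapi_handle_t *hndl, void *param)
    {
        ((am_token_t *)param)->ready = 1;
    }

    /* Runs in the LAPI dispatcher when the first packet arrives. */
    void *header_handler(lapi_handle_t *hndl, void *uhdr, uint *uhdr_len,
                         ulong *msg_len, compl_hndlr_t **comp_h, void **uinfo)
    {
        am_token_t *tok = token_alloc();
        tok->ready   = 0;
        tok->nbytes  = *msg_len;
        tok->payload = malloc(*msg_len);   /* space for the message body */

        bundle_enqueue(tok);       /* AM_Poll will find the token later  */

        *comp_h = completion_handler;
        *uinfo  = tok;             /* handed to the completion handler   */
        return tok->payload;       /* where the dispatcher writes data   */
    }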
Clearly, this presents substantial overhead for small messages. The main performance gain we achieved was to pack short and medium length messages into the argument structure that is delivered to the header handler. The header handler requires this data, so it must be included in the first (1KB) packet sent to the target. This argument structure is user defined, but is limited to 864 bytes because of the 1KB packet size and other overhead. For messages of this size and smaller, the header handler can immediately place the token on the bundle task queue and mark it ready for processing. It returns a NULL pointer to the dispatcher, indicating that no additional data need be collected, and does not have to register a completion handler. AM_Poll is then free to execute the user request or reply handler as soon as possible. We added another optimization to allow the header handler to execute the user handler directly, if it is executing in an application thread (as opposed to one of the special LAPI threads). This is only possible for AM reply handlers because they are not allowed to issue communication calls; LAPI communication calls cannot be performed in LAPI header handlers because of deadlock conditions.
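A sketch of the corresponding fast path, reusing the hypothetical am_token_t and bundle_enqueue from the sketch above (token_from_uhdr and token_mark_ready are likewise illustrative stand-ins):

    #include <lapi.h>

    typedef struct am_token am_token_t;
    extern void        bundle_enqueue(am_token_t *tok);
    extern am_token_t *token_from_uhdr(void *uhdr, uint uhdr_len);
    extern void        token_mark_ready(am_token_t *tok);

    void *header_handler_fast(lapi_handle_t *hndl, void *uhdr, uint *uhdr_len,
                              ulong *msg_len, compl_hndlr_t **comp_h, void **uinfo)
    {
        /* The whole message arrived inside the (<= 864 byte) argument
         * structure, so no buffer and no completion handler are needed. */
        am_token_t *tok = token_from_uhdr(uhdr, *uhdr_len);
        token_mark_ready(tok);      /* runnable the moment AM_Poll sees it */
        bundle_enqueue(tok);

        /* If this header handler is running in an application thread
         * rather than a LAPI service thread, a reply handler could even
         * be executed right here: reply handlers issue no communication,
         * so there is no deadlock risk.                                  */

        *comp_h = NULL;             /* skip the completion-handler thread */
        *uinfo  = NULL;
        return NULL;                /* no additional data to collect      */
    }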
The other main optimization was to re-implement the bundle task queue without locking.
AMLAPI does not allow parallel access to the bundle (only AM_Seq mode), so the task queue has exactly one producer thread and one consumer thread. The queue is implemented as a linked list in which the producer and consumer are prevented from updating the same data through the use of a "firewall" element in the list. See the discussion in amlapi_task_queue.c for an explanation.
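The following is a minimal sketch of the idea, not the code in amlapi_task_queue.c; the names are illustrative, and as in the library, correctness depends on the single-producer/single-consumer restriction and the volatile next pointer.

    #include <stdlib.h>

    /* Hypothetical element type; the real queue holds AMLAPI tokens. */
    typedef struct task task_t;

    typedef struct qnode {
        struct qnode * volatile next;
        task_t                 *task;
    } qnode_t;

    static qnode_t *head;   /* touched only by the consumer (AM_Poll)       */
    static qnode_t *tail;   /* touched only by the producer (header handler);
                               always points at the empty firewall node      */

    void queue_init(void)
    {
        head = tail = calloc(1, sizeof(qnode_t));  /* empty: head == firewall */
    }

    /* Producer: fill the current firewall node, then publish it by linking
     * a fresh firewall behind it. The volatile store to ->next is the only
     * update the consumer can observe.                                      */
    void enqueue(task_t *t)
    {
        qnode_t *fw = calloc(1, sizeof(qnode_t));
        tail->task = t;
        tail->next = fw;
        tail = fw;
    }

    /* Consumer: the firewall's ->next is NULL, so the consumer never
     * advances into the node the producer may still be writing.             */
    task_t *dequeue(void)
    {
        qnode_t *n = head;
        if (n->next == NULL)
            return NULL;        /* only the firewall remains: queue empty */
        task_t *t = n->task;
        head = n->next;
        free(n);
        return t;
    }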
The following graph shows the latency for an AM-2 ping code in which messages of the given size are sent to the target node and an acknowledgment is sent back via the reply handler. Note the increase in latency from about 60-65 microseconds to over 100 as the message size grows beyond the 864 byte limit. The additional 40 microseconds is attributed to the fact that a completion handler must be used when the message is not entirely contained in the header handler argument structure. In this case, when the entire message has arrived, the LAPI dispatcher signals the LAPI completion handler thread. The completion handler simply marks the request as "ready" so the bundle can execute its request handler. Additional testing with pure LAPI programs (as opposed to AMLAPI programs) verifies a 35-40 microsecond overhead when a completion handler must be scheduled, even if it does no work. It is known that the completion handler thread is created at system contention scope and therefore maps directly to an AIX kernel thread; that is, it does not share a kernel thread with other application threads at process contention scope. One possibility is that the completion handler thread is contending for the same CPU as the user thread(s). The application tasks ran on dedicated 16 CPU SMP nodes, so there was no contention with other applications. Further, each task used only four threads: the user thread, the LAPI notification thread, the LAPI completion handler thread, and the LAPI retransmission thread, so there should be no contention among threads for CPUs. In one test we set the pthread concurrency level to 10, but this did not change the performance; of course, the AIX pthreads implementation is free to ignore this hint. Further investigation is required to explain, and possibly remove, this overhead. Note that even the best performance of 50-60 microseconds is substantial in comparison to what can be achieved using LAPI_Put or LAPI_Get. See this page for additional information on LAPI performance using Put and Get operations. Finally, note that the best and most consistent performance is obtained when LAPI runs in polling mode rather than interrupt mode.
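The pure-LAPI check can be sketched as two header-handler variants that differ only in whether a completion handler is registered; any latency difference between them is the cost of scheduling the completion handler thread. The prototypes follow the IBM LAPI documentation; got_ping and use_completion are just the experiment's scaffolding, and the origin-side timing loop (a LAPI_Amsend round trip driven by LAPI_Probe) is omitted.

    #include <lapi.h>

    static volatile int got_ping;   /* target-side arrival flag        */
    static int use_completion;      /* the variable under test         */

    static void empty_completion(lapi_handle_t *hndl, void *param)
    {
        got_ping = 1;               /* deliberately does no other work */
    }

    void *ping_header(lapi_handle_t *hndl, void *uhdr, uint *uhdr_len,
                      ulong *msg_len, compl_hndlr_t **comp_h, void **uinfo)
    {
        if (use_completion) {
            *comp_h = empty_completion; /* forces a completion-handler dispatch */
            *uinfo  = NULL;
        } else {
            *comp_h = NULL;
            got_ping = 1;           /* complete immediately in the dispatcher */
        }
        return NULL;                /* payload fits in uhdr; nothing to collect */
    }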
The graph below shows bandwidth curves for AMLAPI on the NERSC IBM SP. Using LAPI_Put and LAPI_Get in polling mode, we can achieve better than 300 MB/sec on messages of about 128K.
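For comparison, a one-sided transfer involves none of the handler machinery. A sketch of a single completion-counted LAPI_Put follows (prototypes per the IBM documentation; the address exchange, e.g. via LAPI_Address_init, and all error checking are omitted):

    #include <lapi.h>

    /* remote_buf must be a valid address on the target task, obtained
     * beforehand (for example with LAPI_Address_init).                 */
    void put_once(lapi_handle_t hndl, uint target, void *remote_buf,
                  void *local_buf, ulong nbytes)
    {
        lapi_cntr_t cmpl_cntr;
        LAPI_Setcntr(hndl, &cmpl_cntr, 0);

        LAPI_Put(hndl, target, nbytes, remote_buf, local_buf,
                 NULL,            /* no target-side counter    */
                 NULL,            /* no origin counter         */
                 &cmpl_cntr);     /* incremented on completion */

        /* In polling mode this wait also drives the dispatcher. */
        LAPI_Waitcntr(hndl, &cmpl_cntr, 1, NULL);
    }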
Code Modifications by Dan Bonachea
- Bug fix in AMLAPI_ExecuteTaskFromBundle: the nbytes variable was not set.
- Modified code to allow a bundle vm_segment to contain an entire 32-bit address space. This allows Titanium to use the AM_Xfer calls for large messages rather than having to rely on short and medium messages.
- Removed extraneous malloc calls for the token and argv structures within the header handler.
- Removed a deadlock condition in AM_Poll.
- Removed unnecessary locks.
- Added "volatile" qualifier to the single producer/single consumer task queue data structure.
- Added some error checking.
Code Modifications by Mike
- Packing short and medium AM messages into the header handler argument structure to reduce latency.
- Allowing reply handlers to be run directly by the header_handler when it is executing in a user (as opposed to a special LAPI) thread.
- Re-implementation of the bundle task queue as a linked list, such that a single producer and a single consumer can access and modify the list simultaneously without locking.
- The linked list task queue implementation allows the header handler to pass the address of the message's token directly to the completion handler, which no longer has to search the list.
- AM_Poll now searches the entire task queue for the first "ready" entry rather than only looking at the first element.
- Implemented a buffer manager for fast allocation and de-allocation of memory regions for medium sized messages.
- Added LAPI_Probe calls to the AM_Poll and AM_WaitSema functions to make LAPI progress while waiting for an event to occur (see the sketch after this list). Previously, in LAPI polling mode, progress was made only by the LAPI re-transmission thread when its timer popped every 400000 microseconds.
- Put AMLAPI into LAPI polling mode by default, and added an AMLAPI function to change the mode.
- General bug fixes and code cleanup.
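The LAPI_Probe change amounts to a wait loop of the following shape (a sketch; event_ready is a hypothetical stand-in for whatever condition AM_Poll or AM_WaitSema is waiting on):

    #include <lapi.h>

    extern int event_ready(void);   /* hypothetical wait condition */

    /* LAPI_Probe enters the LAPI dispatcher, so in polling mode incoming
     * packets are processed here instead of waiting up to 400000 us for
     * the retransmission thread's timer to pop.                          */
    void wait_for_event(lapi_handle_t hndl)
    {
        while (!event_ready())
            LAPI_Probe(hndl);       /* transfer control to the dispatcher */
    }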
Source Code
Click AMLAPI.tgz for the latest version (1.4) of AMLAPI for Titanium.
LAPI Notes
LAPI will create three additional threads during initialization for each LAPI task:
- The Notification Thread. This thread is only active if LAPI is put into Interrupt mode. The thread is created when LAPI registers a callback function with the HAL layer. The callback is activated in Interrupt mode when one or more packets arrive in the switch adaptor receive FIFO; it attempts to make progress on existing send and receive messages by entering the LAPI dispatcher.
- The Retransmission Thread. This thread is created when LAPI registers a timeout callback with the HAL layer. It executes every 400000 microseconds, checking for unacknowledged messages and issuing re-transmissions; it also attempts to make progress by entering the LAPI dispatcher. In reality, the dispatcher deals with most re-transmission events; this thread handles the case where the dispatcher has not been entered in a long time. This might happen if LAPI is in polling mode and no communication calls have been made, or if there are switch problems and messages are not getting through. If there is no progress on a message after TIMEOUT seconds, the retransmission thread terminates the program.
- The Completion Handler Thread. This thread is created by LAPI_Init to run active message completion handler functions.
Since AMLAPI puts LAPI in polling mode by default, the Notification Thread will never run. The Retransmission Thread runs only every 400000 microseconds, so it should probably not have a processor reserved for it. The Completion Handler Thread runs only for large messages (those that cannot be packed into the header handler argument struct); further, the completion handler is very light-weight, as it simply sets a variable in a task queue structure. Given this, you should reserve at most one processor for LAPI overhead.
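For reference, switching between the two modes is a one-line LAPI_Senv call. The wrapper name below is hypothetical, since this note does not give the actual AMLAPI entry point; INTERRUPT_SET with value 0 selects polling mode and 1 selects interrupt mode, per the LAPI_Senv documentation.

    #include <lapi.h>

    /* Hypothetical wrapper around the mode switch AMLAPI exposes. */
    void amlapi_set_interrupt_mode(lapi_handle_t hndl, int on)
    {
        LAPI_Senv(hndl, INTERRUPT_SET, on);
    }

    /* Default after initialization: polling mode, so the Notification
     * Thread stays idle.                                               */
    void amlapi_default_mode(lapi_handle_t hndl)
    {
        amlapi_set_interrupt_mode(hndl, 0);
    }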