Technology: Web Services Archives
Replay Reconsidered
Posted by gpilz on April 25, 2008 at 9:35 AM
WS-ReliableMessaging describes a protocol that allows SOAP messages to be delivered reliably between distributed applications in the presence of software component, system, or network failures. One issue that has long bedeviled WS-RM is how to support reliable responses to so-called "anonymous clients". The OASIS WS-RX Technical Committee created the WS-MakeConnection specification to deal with this issue. Another, alternate solution is the use of the "replay model". This article describes the technical defects of this model.
It is assumed that readers of this article are familiar with the basic principles and operation of the WS-ReliableMessaging protocol. If you are less than familiar with WS-RM, the Wikipedia entry on WS-ReliableMessaging is a good place to get started.
Core Dilemma
The core dilemma behind this issue is that "anonymous clients" (I prefer the term "non-addressable clients" because I don't like to conflate the concepts of addressability with those of identity) can only communicate synchronously yet WS-RM, by its nature, potentially renders all communications asynchronous. Uh huh. Let's break that down a bit.
Non-addressable clients are hosted on computers that, for reasons of network topology (e.g. NATs), security (e.g. firewalls), or whatever, cannot accept connections from systems outside their network. Although you can't connect to these machines from the outside, they themselves can create outbound connections. SOAP supports non-addressable clients by leveraging HTTP to take advantage of this fact. Non-addressable SOAP clients create an outbound connection to a server, send the request message over this connection, then read the corresponding response from that same connection (this response channel is sometimes referred to as "the HTTP back-channel"). This is why non-addressable clients operate synchronously. They have to use the connection they created to read the server's response because, by definition, it is impossible for the server to connect to them and send the response (as would happen in an asynchronous exchange). For readers accustomed to thinking in terms of synchronous communication this all seems par for the course, but wait, there's more.
WS-RM is built on the concepts of acknowledgements and retransmissions. One node (client, server, whatever) sends a message to another and waits for an acknowledgement. If it doesn't receive one it assumes the message didn't get through and sends it again. So, regardless of when you think you are going to receive a message and which connection you think you are going to receive that message over, something may go wrong (the connection might break) and WS-RM will retransmit the message at a later time over a different connection. This doesn't present a problem for non-addressable clients on the request side (where they control the creation of new connections) but it is a problem on the response side. Suppose you are a server in the process of sending a reliable response to a non-addressable client and the connection goes down. Obviously you are never going to get an acknowledgement for that response message so, as a WS-RM node, it is your responsibility to resend it. But how are you supposed to do that? You can't connect to the client and resend the response because the client is not addressable.
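To make the acknowledge-and-retransmit loop concrete, here is a minimal sketch in Python. It is not taken from any WS-RM implementation; the names ReliableSource and flaky_transport are hypothetical, and the transport callable simply stands in for "open a new connection, send the message, and report whether an acknowledgement came back".

```python
from dataclasses import dataclass, field


@dataclass
class ReliableSource:
    """Toy sending node: keeps every message until it is acknowledged."""
    sequence_id: str
    unacked: dict = field(default_factory=dict)  # message number -> payload
    next_number: int = 1

    def send(self, payload, transport):
        number = self.next_number
        self.next_number += 1
        self.unacked[number] = payload
        self._transmit(number, transport)
        return number

    def retransmit_pending(self, transport):
        # Resend everything that has not been acknowledged yet.
        for number in sorted(self.unacked):
            self._transmit(number, transport)

    def _transmit(self, number, transport):
        acked = transport(self.sequence_id, number, self.unacked[number])
        if acked:
            del self.unacked[number]


def flaky_transport(failures):
    """Hypothetical transport that loses the first `failures` transmissions."""
    remaining = {"count": failures}

    def transport(sequence_id, number, payload):
        if remaining["count"] > 0:
            remaining["count"] -= 1
            return False  # connection broke: no acknowledgement
        return True

    return transport
```

A client can always call retransmit_pending because it is the one opening connections. A server holding an unacknowledged response for a non-addressable client has nothing comparable to call, which is exactly the gap described above.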
Replay Redux
As I said earlier, the OASIS WS-RX Technical Committee created the WS-MakeConnection specification as a means of addressing this problem. WS-MakeConnection is a very important piece of technology as I will explain in a later article. Another solution that predates the work of the WS-RX TC is the use of "replays". The best description of the replay model is this whitepaper by WSO2. Although that whitepaper describes the use of replay in the context of WS-RM 1.0, some implementations (most notably Microsoft® Windows Communication Foundation (WCF)) have extended this solution to include WS-RM 1.1. Replay takes advantage of the fact that non-addressable clients can create new outbound connections and uses the retransmission of a (possibly acknowledged) request to solicit the retransmission of the corresponding response. On the surface this seems like a reasonable approach but, as I will show, there are a number of serious technical issues around its implementation and use.
Abstraction Layer Violations
One of the most serious issues with the implementation of the replay model is that it requires the RMS to be aware of the message exchange pattern of the messages it processes. To understand why this is so we need to review the normal processing sequence for an RMS. An RMS receives a message from the higher-level Application Source (AS). The RMS then transmits the request message to the RMD. Since the RMS is responsible for retransmitting the request message it must store that message (in memory and/or on disk) until it receives an acknowledgment from the RMD. When the acknowledgment is received the RMS can "forget" about the message. Not so when replay is in effect. Because the replay model uses request messages as a prompt for lost response messages, the RMS must store requests until the corresponding response has been received, even after the request itself has been acknowledged. But wait, what if there is no response message? What if the request message is the sole message in a one-way exchange? We obviously can't have the RMS storing these one-way messages forever, so the RMS needs to know whether the message it is processing is part of a request-response exchange or a one-way message.
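The difference in the RMS's "when can I forget this message?" rule can be sketched as follows. This is only an illustration under my own assumptions: StoredRequest and both predicate functions are hypothetical names, and a real RMS would track far more state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredRequest:
    message_number: int
    # None means the RMS was never told the exchange pattern.
    expects_response: Optional[bool]
    acknowledged: bool = False
    response_received: bool = False


def can_forget_plain(msg):
    """Plain WS-RM: an acknowledgment from the RMD is enough."""
    return msg.acknowledged


def can_forget_with_replay(msg):
    """Replay: the stored request doubles as a prompt for a lost response."""
    if msg.expects_response is None:
        # Without the exchange pattern the RMS cannot decide at all.
        raise ValueError("RMS must know the message exchange pattern")
    if not msg.acknowledged:
        return False
    return msg.response_received or not msg.expects_response
```

Note that can_forget_plain never looks at expects_response, while can_forget_with_replay cannot work without it. That extra field is the abstraction layer violation in miniature.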
OK, why is this such a big deal? To understand why this is an issue we need to think about the basic architecture of SOAP and the composability of web service specifications. One of SOAP's big claims is that you can add additional facilities (like reliability) in a way that is transparent to both the application and to any other facilities. Underlying this assertion is the notion that most SOAP stacks will implement some form of the chain of responsibility pattern. This means that the only parts of the SOAP processing pipeline that should be aware of the exchange pattern being used are the initiator and the ultimate receiver. Requiring the handler that implements WS-RM to know the exchange pattern in effect for the messages it handles runs counter to this entire architecture. Does that mean you couldn't hack around this problem in some way? Of course you could! But these kinds of hacks are likely to work only in specific instances (e.g. when the WS-RM processor and the initiator share the same process space) and will, ultimately, lead to a SOAP stack that is buggy and fragile (or should I say "buggier and more fragile"?).
Request-Response Correlation
Another problem with implementing the replay model is the fact that the server-side WS-RM handler must maintain the correlation between the requests and responses it has processed; something it isn't normally required to do. If it doesn't do this it won't know which response to retransmit when it receives a replayed request. This correlation information must exist in the request and response messages using WS-Addressing's wsa:MessageID and wsa:RelatesTo header elements (I've never heard anyone propose any other way of doing it) thus creating a dependency between WS-RM and WS-Addressing where none existed before. Entries in this "correlation table" (speaking abstractly) can only be removed when the server-side WS-RM handler receives an acknowledgment for the response. Obviously you don't want this table to keep growing forever, so you can't create entries for requests that won't have a response. As with the client side, the server-side WS-RM handler must now know the exchange pattern in effect for each request it receives. The abstraction layer violation that exists on the client side exists on the server side as well. On top of this you have the additional, per-message (both request and response) overhead of referencing and updating the correlation information.
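Speaking less abstractly, the "correlation table" might look something like the sketch below. The class and method names are hypothetical, the keys are meant to be wsa:MessageID values from the requests, and the expects_response flag is the exchange-pattern knowledge the handler would somehow have to obtain.

```python
class ReplayCorrelationTable:
    """Toy server-side table mapping a request's MessageID to its response."""

    def __init__(self):
        self._responses = {}  # request wsa:MessageID -> stored response

    def record(self, request_message_id, response, expects_response):
        # Only request-response exchanges get an entry; an entry for a
        # one-way request would never be removed.
        if expects_response:
            self._responses[request_message_id] = response

    def on_replayed_request(self, request_message_id):
        # Returns the response to retransmit, or None if nothing is stored.
        return self._responses.get(request_message_id)

    def on_response_acknowledged(self, request_message_id):
        # The entry can go once the response itself has been acknowledged.
        self._responses.pop(request_message_id, None)

    def __len__(self):
        return len(self._responses)
```

Every request and every response touches this table, which is the per-message overhead mentioned above.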
No Advertisement or Agreement
Web services are rooted in the concept of design by contract. Services indicate that clients may (or are required to) use standards such as WS-Addressing or WS-Security through the use of WS-Policy assertions in their WSDL documents. The replay model has no WS-Policy assertions to indicate its use, nor are there any other mechanisms defined that would allow a client to determine if a service does or doesn't support the use of replay. Considering the problems described above, it shouldn't come as a surprise that most web service stacks do not implement the replay model. So, given that there are stacks that don't support replay and taking into consideration that those that do may do so on an optional basis, it seems that the only way to know whether replay is going to work for you, as a client, is to call or email the administrator of the service and ask. If there are no alarm bells going off in your head at this moment, you haven't spent enough time in IT operations. "Interoperation by alignment of externally invisible configuration settings" has been shown to be operationally unscalable.
This problem exists on the flip-side as well. How does a service know whether a client intends to use replays? The article referred to above defines some rules whereby the server can use a combination of various values in the wsrm:CreateSequence message to infer that replay is in effect. To be clear, though, replay is an extension to WS-RM and it might not be the only extension to use that particular combination of values. Inferring the use of an extension through the values of particular, general purpose elements is risky and likely to cause interoperability problems. It would have been much better if replay defined an extension to the CreateSequence message and/or a unique SOAP header to signal to the server that the client intended to use replay.
Limited Applicability
If you've been following the conversation so far you've noticed that the replay model is only necessary for reliable request/response exchanges between a non-addressable client and a service. It is not needed for reliable one-way exchanges from a non-addressable client because there is no reliable response to worry about. But what about other kinds of patterns? A common paradigm in distributed computing is "publish and subscribe". Suppose a non-addressable client wants to subscribe to a series of event notifications that need to be delivered reliably? The exchange pattern might be termed "request-response-response-response ...". Even if we assume that the subscription request is carried reliably (it might not be), it's obvious that the replay model will not help the publishing service retry lost notification messages. How would the client even know that it hadn't received a notification message? There are also situations in which a client might engage in a non-reliable request/reliable response exchange with a server. Since the request message is not processed by the server's WS-RM layer, the request-to-response mapping necessary for the replay model to work will not exist, and replay will not work. Additionally, since the request message is not filtered by the WS-RM layer, any replayed requests will be dispatched to the application.
Some of the above stuff is pretty advanced and it's hard to imagine how any of it would work with or without reliability (sending a series of notification messages to a non-addressable client?). I wouldn't have brought it up if there weren't a way of addressing the "reliable response to a non-addressable client" issue that also addresses all of these exchange patterns; (you guessed it) WS-MakeConnection.
Summary
This (rather lengthy) article has presented some of the technical issues with implementing and using the replay model. There are other, non-technical issues, including the fact that the replay model has not been approved by any recognized standards organization and actually violates the WS-RM standard, that should give pause to anyone attempting to use this approach to solving the problem of reliably responding to non-addressable clients. As is apparent, we at BEA think that the WS-MakeConnection protocol not only addresses the reliable request/response scenarios in a way that is far less problematic than replay, it also addresses a number of other scenarios of interest to our customers.
The Straight Stuff on WS-ReliableMessaging
Posted by gpilz on May 18, 2006 at 11:57 AM
Eagerly awaited yet widely misunderstood, Web Services Reliable
Messaging (WS-ReliableMessaging) “describes a protocol that
allows messages to be delivered reliably between distributed
applications in the presence of software component, system, or
network failures”. Version 1.0 of the specification [1]
was published by Microsoft, IBM, BEA and TIBCO. At the time of this
writing, version 1.1 is under development by the OASIS
Web Services Reliable Exchange (WS-RX) Technical Committee.
Although the basic purpose of this specification is clear enough,
there are a number of misconceptions about its fundamental nature.
Where's the Boeuf?
Like most of the WS-* specifications, the surprising thing about
WS-RM (to further abbreviate WS-ReliableMessaging) is how little it,
well, specifies. This sounds negative, but it is actually one
of the strengths of the WS-* specifications (see Secure,
Reliable, Transacted Web Services [2] for
an explanation of why discrete, minimal specifications are a good
thing). WS-RM specifies a SOAP protocol for sending a message and
getting an acknowledgment when that message is received. WS-RM also
specifies the resending of messages that have not been acknowledged.
That's it.
To understand why such a simple concept has value we need to think
about the problem that WS-RM is designed to address. Suppose we have
a piece of software that communicates with another piece of software
to achieve some business function; sending an invoice, for example.
This interaction is inherently asynchronous. That is, we don't expect
to get an immediate response to the invoice because we know that
there is a certain amount of processing (perhaps involving people)
that needs to occur before the invoice is accepted. Furthermore, we
know that these two systems won't communicate with one another
directly. Instead the messages they exchange will be routed through a
series of intermediaries such as service hubs, security gateways,
etc. As the sender, one of our immediate and primary concerns is
whether the invoice made it through all those hops and reached its
intended recipient. Previous solutions to this problem, such as the
RosettaNet Implementation Framework, have tended to bake the solution into
the application protocol. WS-RM solves this problem in a generic
manner that allows the solution to be applied to multiple protocols.
A Little More Detail
To really understand the limits and
capabilities of WS-RM it's necessary to know a little bit more about
how it works. Figure 1 below is lifted directly from the
WS-ReliableMessaging specification [3]. It
illustrates two entities, an Application Source (AS) and an
Application Destination (AD) communicating via WS-RM. The AS sends a
message to the RM Source (RMS). The RMS then transmits the message to
the RM Destination (RMD) using the WS-RM protocol. Once it has
received the message, the RMD delivers the message to the Application
Destination (AD).

As part of the protocol the RMD sends
acknowledgments of the messages it has received back to the RMS. For
its part the RMS is responsible for holding onto messages and
retransmitting them until it receives an acknowledgment from the RMD.
The interactions between an RMS and RMD occur within the context of
protocol-level sessions termed “Sequences”. Any message
sent from the RMS to the RMD can be uniquely identified by the ID of
the Sequence in which it was sent and the number of the message
within that Sequence.
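As a rough illustration of how a message is identified by its Sequence ID and message number, the sketch below has a toy RMD record what it has received per Sequence and summarize it as ranges of message numbers. The class and method names are hypothetical, and the (lower, upper) tuples only mimic the idea of acknowledging ranges; they are not the actual WS-RM acknowledgment XML.

```python
from collections import defaultdict


class ReliableDestination:
    """Toy RMD: remembers which message numbers arrived in each Sequence."""

    def __init__(self):
        self._received = defaultdict(set)  # Sequence ID -> message numbers

    def receive(self, sequence_id, message_number):
        # A message is identified by (Sequence ID, message number).
        self._received[sequence_id].add(message_number)
        return self.acknowledged_ranges(sequence_id)

    def acknowledged_ranges(self, sequence_id):
        """Collapse received message numbers into (lower, upper) ranges."""
        ranges = []
        for n in sorted(self._received[sequence_id]):
            if ranges and n == ranges[-1][1] + 1:
                ranges[-1] = (ranges[-1][0], n)  # extend the current range
            else:
                ranges.append((n, n))  # gap: start a new range
        return ranges
```

A gap between two ranges tells the RMS which message numbers it still has to retransmit.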
To relate this model to our invoice delivery example; our
invoicing system is the Application Source and our customer's
purchasing system is the Application Destination. The RMS and RMD
nodes are components of our respective WS-RM implementations.
The Asynchronous Sweet Spot
Returning to our invoice delivery example, you will remember that
we described an asynchronous interaction. Although WS-RM adds some
value to synchronous interactions, it provides the most value in the
case of asynchronous interactions. It's not difficult to see why
this is so. In general, synchronous operations specify fairly short
timeouts (minutes rather than hours). Any failure between the
requesting system and the providing system will be discovered in
short order. On the other hand, suppose the agreement between us and
our customer specifies a maximum turnaround time for invoices of 72
hours. If our invoice fails to reach our customer's system it's
possible that it will take three days to discover this. The value of
receiving timely acknowledgments (or an exception if the message is
never acknowledged) is obvious.
What's In a Name?
Some of the confusion around WS-RM stems from the word “reliable”.
WS-RM doesn't do anything to increase the intrinsic reliability of
your software or its underlying infrastructure. WS-RM smooths over
temporary network and service outages but if, for example, you lose
your network for 24 hours there is nothing WS-RM can magically do to
get your messages to their destination. This is enormously important
because it means that using WS-RM does not free you from having to
write the exception handling logic to deal with the cases where your
messages do not reach their destination. WS-RM allows you to simplify
this logic since you can be certain that, prior to triggering an
exception, WS-RM has already made every attempt to send the message.
In addition to this, WS-RM specifies that messages are
acknowledged when they are received by the RMD, not when they
are delivered to the AD. At the point when an RMS receives an
acknowledgment for a message you cannot be absolutely certain that
the AD got that message. However, this line of thinking quickly gets
digressive. Even if you did know that the message had been delivered
to the AD, you couldn't be sure that the AD didn't subsequently crash
before it could save that message, etc. One party of a distributed
interaction can never be absolutely certain of what is happening to
the other party in that interaction unless it receives some form of
signal from that party. WS-RM could
have defined an entire set of acknowledgments (e.g. message
received, message delivered, message validated, message persisted),
but it's not clear that the need for this kind of “acknowledgment
framework” is great enough to justify the added complexity.
What WS-RM does is solve the most pressing issue, namely “did
this message make it through to the 'other side'”? It does this
in the simplest way possible by acknowledging the message when it is
received by the RMD. Barring events such as an OS crash, disk failure,
etc., applications can reasonably expect that the RMD will eventually
deliver the message to the AD.
Sessions
A common misconception about WS-RM is that it is “TCP at the
SOAP level” [4]. Although there are
similarities between WS-RM and TCP, they are more different than they
are alike. Foremost amongst these differences is the level of session
support provided by the two technologies. One of TCP's main purposes
is to provide a “communications session” between two
applications. This is not one of the goals of WS-RM. While it
is true that a WS-RM Sequence is a kind of session, the purpose of a
Sequence is to provide a scope for the messages and acknowledgments
exchanged between the RMS and RMD. The WS-RM specification says
nothing about exposing Sequences to the AS or AD, nor are
there any guarantees about the behavior an AS or AD may expect from a
Sequence. There isn't even a guarantee of a one-to-one relationship
between applications (AS-AD pairs) and Sequences. Some vendors' WS-RM
architectures implement the RMS and the RMD as independent gateways
that multiplex several “application sessions” over a
single Sequence. The bottom line is that, if your WS-RM
implementation exposes the Sequence and you use that Sequence as an
application-level session, don't be surprised if your code won't
interoperate with code that uses a different WS-RM implementation.
Assurances (and the Lack Thereof)
One of the more difficult aspects of the WS-RM specification is
its treatment of “delivery assurances”. Roughly speaking
a delivery assurance is a contract to deliver a message to the AD
only when certain conditions have been met. For example, “exactly
once” refers to an assurance wherein the RMD will deliver a
particular message (defined by its Sequence ID and message number)
to the AD once and only once. Duplicate messages (the result of retry
attempts, network hiccups, lost acknowledgments, etc.) will be
dropped by the RMD. “In order” is an assurance wherein
the RMD delivers messages to the AD in the same order that the AS
sent them to the RMS.
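To show what these two assurances mean in practice, here is a small sketch of an RMD-side delivery policy for a single Sequence that combines "exactly once" with "in order". It is an illustration only: the class name is hypothetical, the deliver callable stands in for the AD, and the sketch assumes message numbers start at 1 and increase by one.

```python
class OrderedExactlyOnceDelivery:
    """Toy RMD-side delivery: drop duplicates, hand messages to the AD in order."""

    def __init__(self, deliver):
        self._deliver = deliver  # callable standing in for the AD
        self._next = 1           # next message number owed to the AD
        self._held = {}          # out-of-order messages awaiting earlier ones

    def on_message(self, message_number, payload):
        if message_number < self._next or message_number in self._held:
            return  # duplicate: already delivered or already buffered
        self._held[message_number] = payload
        # Release every message that is now contiguous with what was delivered.
        while self._next in self._held:
            self._deliver(self._held.pop(self._next))
            self._next += 1
```

Fed messages 2, 1, 2, 3, 1 in that arrival order, this policy delivers 1, 2, 3 to the AD once each.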
By its nature the WS-RM protocol is well suited for supporting “at
least once”, “exactly once”, as well as ordered and
non-ordered versions of both of these assurances. However, for
reasons we don't have the time to go into, WS-RM treats any
designation of these assurances as a local contract between the AD
and the RMD. Within the limits of the WS-RM specification, neither
the AS nor the RMS is capable of discovering the details of this
contract either at runtime or via the service description.
Appendix: References
[1] http://specs.xmlsoap.org/ws/2005/02/rm/ws-reliablemessaging.pdf
[2] http://www-128.ibm.com/developerworks/webservices/library/ws-securtrans/index.html
[3] http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=ws-rx#technical
[4] http://blogs.msdn.com/shycohen/archive/2006/02/20/535717.aspx