Gilbert Pilz's Blog

Technology: Web Services Archives



Replay Reconsidered

Posted by gpilz on April 25, 2008 at 9:35 AM | Permalink | Comments (4)

WS-ReliableMessaging describes a protocol that allows SOAP messages to be delivered reliably between distributed applications in the presence of software component, system, or network failures. One issue that has long bedeviled WS-RM is how to support reliable responses to so-called "anonymous clients". The OASIS WS-RX Technical Committee created the WS-MakeConnection specification to deal with this issue. Another, alternate solution is the use of the "replay model". This article describes the technical defects of this model.

It is assumed that readers of this article are familiar with the basic principles and operation of the WS-ReliableMessaging protocol. If you are less than familiar with WS-RM, the Wikipedia entry on it is a good place to get started.

Core Dilemma

The core dilemma behind this issue is that "anonymous clients" (I prefer the term "non-addressable clients" because I don't like to conflate the concepts of addressability with those of identity) can only communicate synchronously yet WS-RM, by its nature, potentially renders all communications asynchronous. Uh huh. Let's break that down a bit.

Non-addressable clients are hosted on computers that, for reasons of network topology (e.g. NATs), security (e.g. firewalls), or whatever, cannot accept connections from systems outside their network. Although you can't connect to these machines from the outside, they themselves can create outbound connections. SOAP supports non-addressable clients by leveraging HTTP to take advantage of this fact. Non-addressable SOAP clients create an outbound connection to a server, send the request message over this connection, then read the corresponding response from that same connection (this response channel is sometimes referred to as "the HTTP back-channel"). This is why non-addressable clients operate synchronously. They have to use the connection they created to read the server's response because, by definition, it is impossible for the server to connect to them and send the response (as would happen in an asynchronous exchange). For readers accustomed to thinking in terms of synchronous communication this all seems par for the course, but wait, there's more.

WS-RM is built on the concepts of acknowledgements and retransmissions. One node (client, server, whatever) sends a message to another and waits for an acknowledgement. If it doesn't receive one it assumes the message didn't get through and sends it again. So, regardless of when you think you are going to receive a message and which connection you think you are going to receive that message over, something may go wrong (the connection might break) and WS-RM will retransmit the message at a later time over a different connection. This doesn't present a problem for non-addressable clients on the request side (where they control the creation of new connections) but it is a problem on the response side. Suppose you are a server in the process of sending a reliable response to a non-addressable client and the connection goes down. Obviously you are never going to get an acknowledgement for that response message so, as a WS-RM node, it is your responsibility to resend it. But how are you supposed to do that? You can't connect to the client and resend the response because the client is not addressable.
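The acknowledge-or-retransmit loop described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class name `RetransmittingSender`, the `transmit` callbacks, and the in-memory dictionary are all hypothetical, and nothing here is taken from the WS-RM specification or from any real stack.

```python
from dataclasses import dataclass, field


@dataclass
class RetransmittingSender:
    """Toy WS-RM sender: holds each message until it is acknowledged."""

    unacked: dict = field(default_factory=dict)  # message number -> payload
    next_number: int = 1

    def send(self, payload, transmit):
        """Number the message, remember it, and make a first attempt."""
        number = self.next_number
        self.next_number += 1
        self.unacked[number] = payload
        transmit(number, payload)
        return number

    def acknowledge(self, number):
        """An acknowledgement arrived, so the message can be forgotten."""
        self.unacked.pop(number, None)

    def retransmit_pending(self, transmit):
        """Resend every unacknowledged message, possibly over a new connection."""
        for number, payload in sorted(self.unacked.items()):
            transmit(number, payload)


sender = RetransmittingSender()
log = []


def broken_connection(number, payload):
    log.append(("lost", number))  # the message never arrives, so no ack


def new_connection(number, payload):
    log.append(("delivered", number))
    sender.acknowledge(number)  # the receiver acknowledges straight away


sender.send("invoice #1", broken_connection)
sender.retransmit_pending(new_connection)
```

The sketch works because the sender is free to call `retransmit_pending` with a fresh connection whenever it likes. A server responding to a non-addressable client has no such callback available to it, which is exactly the problem described above.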

Replay Redux

As I said earlier, the OASIS WS-RX Technical Committee created the WS-MakeConnection specification as a means of addressing this problem. WS-MakeConnection is a very important piece of technology as I will explain in a later article. Another solution that predates the work of the WS-RX TC is the use of "replays". The best description of the replay model is a whitepaper by WSO2. Although that whitepaper describes the use of replay in the context of WS-RM 1.0, some implementations (most notably Microsoft® Windows Communication Foundation (WCF)) have extended this solution to include WS-RM 1.1. Replay takes advantage of the fact that non-addressable clients can create new outbound connections and uses the retransmission of a (possibly acknowledged) request to solicit the retransmission of the corresponding response. On the surface this seems like a reasonable approach but, as I will show, there are a number of serious technical issues around its implementation and use.

Abstraction Layer Violations

One of the most serious issues with the implementation of the replay model is that it requires the RMS to be aware of the message exchange pattern of the messages it processes. To understand why this is so we need to review the normal processing sequence for an RMS. An RMS receives a message from the higher-level Application Source (AS). The RMS then transmits the request message to the RMD. Since the RMS is responsible for re-transmitting the request message it must store that message (in memory and/or on disk) until it receives an acknowledgment from the RMD. When the acknowledgment is received the RMS can "forget" about the message. Not so when replay is in effect. Because the replay model uses request messages as a prompt for lost response messages, the RMS must store requests until the corresponding response has been received, even after the request itself has been acknowledged. But wait, what if there is no response message? What if the request message is the sole message in a one-way exchange? We obviously can't have the RMS storing these one-way messages forever, so the RMS needs to know whether the message it is processing is part of a request-response exchange or a one-way message.
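To make the difference in retention rules concrete, here is a minimal Python sketch of an RMS message store. Everything in it is an assumption made for illustration: the names `StoredRequest` and `ReplayAwareStore`, the `replay` switch, and the `expects_response` flag are hypothetical and do not come from the whitepaper or from any implementation.

```python
from dataclasses import dataclass, field


@dataclass
class StoredRequest:
    payload: str
    expects_response: bool  # the RMS has to be told the exchange pattern
    acked: bool = False
    response_received: bool = False


@dataclass
class ReplayAwareStore:
    """Toy RMS store showing when a request may be forgotten."""

    replay: bool
    messages: dict = field(default_factory=dict)  # message number -> StoredRequest

    def store(self, number, payload, expects_response):
        self.messages[number] = StoredRequest(payload, expects_response)

    def on_ack(self, number):
        msg = self.messages.get(number)
        if msg is not None:
            msg.acked = True
            self._maybe_forget(number)

    def on_response(self, number):
        msg = self.messages.get(number)
        if msg is not None:
            msg.response_received = True
            self._maybe_forget(number)

    def _maybe_forget(self, number):
        msg = self.messages[number]
        if not msg.acked:
            return
        # Plain WS-RM: the acknowledgment alone is enough to forget the message.
        # Replay: a request that expects a response must also wait for it.
        if self.replay and msg.expects_response and not msg.response_received:
            return
        del self.messages[number]


store = ReplayAwareStore(replay=True)
store.store(1, "getQuote request", expects_response=True)
store.store(2, "one-way notification", expects_response=False)
store.on_ack(1)
store.on_ack(2)
still_held = sorted(store.messages)  # only the request-response message remains
```

Note that the `expects_response` argument is the whole point: without it the store cannot tell message 1 from message 2, and it would either hold one-way messages forever or drop requests it still needs for replay.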

OK, why is this such a big deal? To understand why this is an issue we need to think about the basic architecture of SOAP and the composability of web service specifications. One of SOAP's big claims is that you can add additional facilities (like reliability) in a way that is transparent to both the application and to any other facilities. Underlying this assertion is the notion that most SOAP stacks will implement some form of the chain of responsibility pattern. This means that the only parts of the SOAP processing pipeline that should be aware of the exchange pattern being used are the initiator and the ultimate receiver. Requiring the handler that implements WS-RM to know the exchange pattern in effect for the messages it handles runs counter to this entire architecture. Does that mean you couldn't hack around this problem in some way? Of course you could! But these kinds of hacks are likely to work only in specific instances (e.g. when the WS-RM processor and the initiator share the same process space) and will, ultimately, lead to a SOAP stack that is buggy and fragile (or should I say "buggier and more fragile"?).

Request-Response Correlation

Another problem with implementing the replay model is the fact that the server-side WS-RM handler must maintain the correlation between the requests and responses it has processed; something it isn't normally required to do. If it doesn't do this it won't know which response to retransmit when it receives a replayed request. This correlation information must exist in the request and response messages using WS-Addressing's wsa:MessageID and wsa:RelatesTo header elements (I've never heard anyone propose any other way of doing it) thus creating a dependency between WS-RM and WS-Addressing where none existed before. Entries in this "correlation table" (speaking abstractly) can only be removed when the server-side WS-RM handler receives an acknowledgment for the response. Obviously you don't want this table to keep growing forever, so you can't create entries for requests that won't have a response. As with the client side, the server-side WS-RM handler must now know the exchange pattern in effect for each request it receives. The abstraction layer violation that exists on the client side exists on the server side as well. On top of this you have the additional, per-message (both request and response) overhead of referencing and updating the correlation information.
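The "correlation table" can be pictured with a small Python sketch. It is an abstraction rather than a description of any real server: the class `ReplayingRmd`, its method names, and the `expects_response` parameter are hypothetical, and the sketch leaves out the duplicate filtering by Sequence and message number that a real WS-RM handler would also perform.

```python
class ReplayingRmd:
    """Toy server-side WS-RM handler that supports replayed requests."""

    def __init__(self, application):
        self._application = application  # callable: request body -> response body
        # Request wsa:MessageID -> response that was sent with a matching wsa:RelatesTo.
        self._responses = {}

    def handle_request(self, message_id, body, expects_response=True):
        if message_id in self._responses:
            # Replayed request: resend the stored response, do not re-dispatch.
            return self._responses[message_id]
        response = self._application(body)
        if not expects_response:
            return None  # one-way message: no table entry, or the table grows forever
        self._responses[message_id] = response
        return response

    def on_response_acknowledged(self, message_id):
        """Entries can only be dropped once the response has been acknowledged."""
        self._responses.pop(message_id, None)

    def pending_count(self):
        return len(self._responses)


dispatched = []


def quote_service(body):
    dispatched.append(body)
    return "response to " + body


rmd = ReplayingRmd(quote_service)
first = rmd.handle_request("urn:uuid:req-1", "getQuote")
replayed = rmd.handle_request("urn:uuid:req-1", "getQuote")  # client lost the response
rmd.on_response_acknowledged("urn:uuid:req-1")
```

Even in this toy form the two costs mentioned above are visible: every request and response touches the table, and the handler needs `expects_response` from somewhere in order to keep the table bounded.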

No Advertisement or Agreement

Web services are rooted in the concept of design by contract. Services indicate that clients may (or are required to) use standards such as WS-Addressing or WS-Security through the use of WS-Policy assertions in their WSDL documents. The replay model has no WS-Policy assertions to indicate its use, nor are there any other mechanisms defined that would allow a client to determine if a service does or doesn't support the use of replay. Considering the problems described above, it shouldn't come as a surprise that most web service stacks do not implement the replay model. So, given that there are stacks that don't support replay and taking into consideration that those that do may do so on an optional basis, it seems that the only way to know whether replay is going to work for you, as a client, is to call or email the administrator of the service and ask. If there are no alarm bells going off in your head at this moment, you haven't spent enough time in IT operations. "Interoperation by alignment of externally invisible configuration settings" has been shown to be operationally unscalable.

This problem exists on the flip-side as well. How does a service know whether a client intends to use replays? The article referred to above defines some rules whereby the server can use a combination of various values in the wsrm:CreateSequence message to infer that replay is in effect. To be clear, though, replay is an extension to WS-RM and it might not be the only extension to use that particular combination of values. Inferring the use of an extension through the values of particular, general purpose elements is risky and likely to cause interoperability problems. It would have been much better if replay defined an extension to the CreateSequence message and/or a unique SOAP header to signal to the server that the client intended to use replay.

Limited Applicability

If you've been following the conversation so far you've noticed that the replay model is only necessary for reliable request/response exchanges between a non-addressable client and a service. It is not needed for reliable one-way exchanges from a non-addressable client because there is no reliable response to worry about. But what about other kinds of patterns? A common paradigm in distributed computing is "publish and subscribe". Suppose a non-addressable client wants to subscribe to a series of event notifications that need to be delivered reliably. The exchange pattern might be termed "request-response-response-response ...". Even if we assume that the subscription request is carried reliably (it might not be), it's obvious that the replay model will not help the publishing service retry lost notification messages. How would the client even know that it hadn't received a notification message? There are also situations in which a client might engage in a non-reliable request/reliable response exchange with a server. Since the request message is not processed by the server's WS-RM layer, the request-to-response mapping necessary for the replay model to work will not exist, and replay will not work. Additionally, since the request message is not filtered by the WS-RM layer, any replayed requests will be dispatched to the application.

Some of the above stuff is pretty advanced and it's hard to imagine how any of it would work with or without reliability (sending a series of notification messages to a non-addressable client?). I wouldn't have brought it up if there weren't a way of addressing the "reliable response to a non-addressable client" issue that also addresses all of these exchange patterns; (you guessed it) WS-MakeConnection.

Summary

This (rather lengthy) article has presented some of the technical issues with implementing and using the replay model. There are other, non-technical issues, including the fact that the replay model has not been approved by any recognized standards organization and actually violates the WS-RM standard, that should give pause to anyone attempting to use this approach to solve the problem of reliably responding to non-addressable clients. As is apparent, we at BEA think that the WS-MakeConnection protocol not only addresses the reliable request/response scenarios in a way that is far less problematic than replay, it also addresses a number of other scenarios of interest to our customers.



The Straight Stuff on WS-ReliableMessaging

Posted by gpilz on May 18, 2006 at 11:57 AM | Permalink | Comments (1)

Eagerly awaited yet widely misunderstood, Web Services Reliable Messaging (WS-ReliableMessaging) “describes a protocol that allows messages to be delivered reliably between distributed applications in the presence of software component, system, or network failures”. Version 1.0 of the specification [1] was published by Microsoft, IBM, BEA and TIBCO. At the time of this writing, version 1.1 is under development by the OASIS Web Services Reliable Exchange (WS-RX) Technical Committee. Although the basic purpose of this specification is clear enough, there are a number of misconceptions about its fundamental nature.

Where's the Boeuf?

Like most of the WS-* specifications, the surprising thing about WS-RM (to further abbreviate WS-ReliableMessaging) is how little it, well, specifies. This sounds negative, but it is actually one of the strengths of the WS-* specifications (see Secure, Reliable, Transacted Web Services [2] for an explanation of why discrete, minimal specifications are a good thing). WS-RM specifies a SOAP protocol for sending a message and getting an acknowledgment when that message is received. WS-RM also specifies the resending of messages that have not been acknowledged. That's it.

To understand why such a simple concept has value we need to think about the problem that WS-RM is designed to address. Suppose we have a piece of software that communicates with another piece of software to achieve some business function: sending an invoice, for example. This interaction is inherently asynchronous. That is, we don't expect to get an immediate response to the invoice because we know that there is a certain amount of processing (perhaps involving people) that needs to occur before the invoice is accepted. Furthermore, we know that these two systems won't communicate with one another directly. Instead the messages they exchange will be routed through a series of intermediaries such as service hubs, security gateways, etc. As the sender, one of our immediate and primary concerns is whether the invoice made it through all those hops and reached its intended recipient. Previous solutions to this problem, such as the RosettaNet Implementation Framework, have tended to bake the solution into the application protocol. WS-RM solves this problem in a generic manner that allows the solution to be applied to multiple protocols.

A Little More Detail

To really understand the limits and capabilities of WS-RM it's necessary to know a little bit more about how it works. Figure 1 below is lifted directly from the WS-ReliableMessaging specification [3]. It illustrates two entities, an Application Source (AS) and an Application Destination (AD) communicating via WS-RM. The AS sends a message to the RM Source (RMS). The RMS then transmits the message to the RM Destination (RMD) using the WS-RM protocol. Once it has received the message, the RMD delivers the message to the Application Destination (AD).

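[Figure 1: The WS-ReliableMessaging model, showing the Application Source (AS), RM Source (RMS), RM Destination (RMD), and Application Destination (AD)]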

As part of the protocol the RMD sends acknowledgments of the messages it has received back to the RMS. For its part the RMS is responsible for holding onto messages and retransmitting them until it receives an acknowledgment from the RMD. The interactions between an RMS and RMD occur within the context of protocol-level sessions termed “Sequences”. Any message sent from the RMS to the RMD can be uniquely identified by the ID of the Sequence in which it was sent and the number of the message within that Sequence.
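A short Python sketch may help make the identification and acknowledgment bookkeeping concrete. It assumes that message numbers within a Sequence start at 1 and increase by 1, and it reports what has been received as inclusive (lower, upper) ranges, which is the general shape WS-RM acknowledgments take on the wire. The names `MessageKey`, `acknowledgement_ranges`, and `unacknowledged` are hypothetical and are used only for this illustration.

```python
from typing import NamedTuple


class MessageKey(NamedTuple):
    """Uniquely identifies a message: the Sequence plus its number within it."""

    sequence_id: str
    message_number: int


def acknowledgement_ranges(received):
    """RMD side: collapse received message numbers into inclusive ranges."""
    ranges = []
    for n in sorted(set(received)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], n)  # extend the current range
        else:
            ranges.append((n, n))  # a gap: start a new range
    return ranges


def unacknowledged(last_sent, ranges):
    """RMS side: which message numbers still have to be held and retransmitted."""
    acked = {n for lower, upper in ranges for n in range(lower, upper + 1)}
    return [n for n in range(1, last_sent + 1) if n not in acked]


key = MessageKey("urn:uuid:sequence-a", 4)       # hypothetical Sequence ID
ranges = acknowledgement_ranges([3, 1, 2, 5, 5])  # message 4 was lost, 5 arrived twice
to_resend = unacknowledged(6, ranges)             # the RMS has sent messages 1 to 6
```

With these inputs the RMD reports two ranges, 1 to 3 and 5 to 5, and the RMS concludes that messages 4 and 6 are the ones it must keep retransmitting.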

To relate this model to our invoice delivery example: our invoicing system is the Application Source and our customer's purchasing system is the Application Destination. The RMS and RMD nodes are components of our respective WS-RM implementations.

The Asynchronous Sweet Spot

Returning to our invoice delivery example, you will remember that we described an asynchronous interaction. Although WS-RM adds some value to synchronous interactions, it provides the most value in the case of asynchronous interactions. It's not difficult to see why this is so. In general, synchronous operations specify fairly short timeouts (minutes rather than hours). Any failure between the requesting system and the providing system will be discovered in short order. On the other hand, suppose the agreement between us and our customer specifies a maximum turnaround time for invoices of 72 hours. If our invoice fails to reach our customer's system it's possible that it will take three days to discover this. The value of receiving timely acknowledgments (or an exception if the message is never acknowledged) is obvious.

What's In a Name?

Some of the confusion around WS-RM stems from the word “reliable”. WS-RM doesn't do anything to increase the intrinsic reliability of your software or its underlying infrastructure. WS-RM smooths over temporary network and service outages but if, for example, you lose your network for 24 hours there is nothing WS-RM can magically do to get your messages to their destination. This is enormously important because it means that using WS-RM does not free you from having to write the exception handling logic to deal with the cases where your messages do not reach their destination. WS-RM allows you to simplify this logic since you can be certain that, prior to triggering an exception, WS-RM has already made every attempt to send the message.

In addition to this, WS-RM specifies that messages are acknowledged when they are received by the RMD, not when they are delivered to the AD. At the point when an RMS receives an acknowledgment for a message you cannot be absolutely certain that the AD got that message. However, this line of thinking quickly gets digressive. Even if you did know that the message had been delivered to the AD, you couldn't be sure that the AD didn't subsequently crash before it could save that message, etc. One party of a distributed interaction can never be absolutely certain of what is happening to the other party in that interaction unless it receives some form of signal from that party. WS-RM could have defined an entire set of acknowledgments (e.g. message received, message delivered, message validated, message persisted), but it's not clear that the need for this kind of “acknowledgment framework” is great enough to justify the added complexity. What WS-RM does is solve the most pressing issue, namely “did this message make it through to the 'other side'”? It does this in the simplest way possible by acknowledging the message when it is received by the RMD. Barring events such as an OS crash, disk failure, etc., applications can reasonably expect that the RMD will eventually deliver the message to the AD.

Sessions

A common misconception about WS-RM is that it is “TCP at the SOAP level” [4]. Although there are similarities between WS-RM and TCP, they are more different than they are alike. Foremost amongst these differences is the level of session support provided by the two technologies. One of TCP's main purposes is to provide a “communications session” between two applications. This is not one of the goals of WS-RM. While it is true that a WS-RM Sequence is a kind of session, the purpose of a Sequence is to provide a scope for the messages and acknowledgments exchanged between the RMS and RMD. The WS-RM specification says nothing about exposing Sequences to the AS or AD, nor are there any guarantees about the behavior an AS or AD may expect from a Sequence. There isn't even a guarantee of a one-to-one relationship between applications (AS-AD pairs) and Sequences. Some vendors' WS-RM architectures implement the RMS and the RMD as independent gateways that multiplex several “application sessions” over a single Sequence. The bottom line is that, if your WS-RM implementation exposes the Sequence and you use that Sequence as an application-level session, don't be surprised if your code won't interoperate with code that uses a different WS-RM implementation.

Assurances (and the Lack Thereof)

One of the more difficult aspects of the WS-RM specification is its treatment of “delivery assurances”. Roughly speaking a delivery assurance is a contract to deliver a message to the AD only when certain conditions have been met. For example, “exactly once” refers to an assurance wherein the RMD will deliver a particular message (defined by its Sequence ID and message number) to the AD once and only once. Duplicate messages (the result of retry attempts, network hiccups, lost acknowledgments, etc.) will be dropped by the RMD. “In order” is an assurance wherein the RMD delivers messages to the AD in the same order that the AS sent them to the RMS.
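As a rough illustration of how an RMD might honor “exactly once” combined with “in order” for a single Sequence, consider the following Python sketch. It is one possible approach under simplifying assumptions (one instance per Sequence, message numbers starting at 1, everything held in memory); the class name `InOrderExactlyOnceRmd` and the `deliver` callback are hypothetical and the specification does not prescribe this design.

```python
class InOrderExactlyOnceRmd:
    """Toy RMD for one Sequence: drops duplicates and delivers in order."""

    def __init__(self, deliver):
        self._deliver = deliver  # callable that hands a payload to the AD
        self._next = 1           # lowest message number not yet delivered
        self._held = {}          # received out of order, waiting for a gap to fill

    def receive(self, number, payload):
        if number < self._next or number in self._held:
            return  # duplicate of a delivered or held message: drop it
        self._held[number] = payload
        # Deliver as many consecutive messages as are now available.
        while self._next in self._held:
            self._deliver(self._held.pop(self._next))
            self._next += 1


delivered = []
rmd = InOrderExactlyOnceRmd(delivered.append)

# Message 2 arrives early, and messages 1 and 2 are both retransmitted.
for number, payload in [(2, "b"), (1, "a"), (2, "b"), (3, "c"), (1, "a")]:
    rmd.receive(number, payload)
```

After this run the AD has seen "a", "b", and "c" once each, in that order. Dropping the hold-back dictionary and delivering on arrival would turn the same class into an unordered “exactly once” variant.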

By its nature the WS-RM protocol is well suited for supporting “at least once”, “exactly once”, as well as ordered and non-ordered versions of both of these assurances. However, for reasons we don't have the time to go into, WS-RM treats any designation of these assurances as a local contract between the AD and the RMD. Within the limits of the WS-RM specification, neither the AS nor the RMS is capable of discovering the details of this contract either at runtime or via the service description.

Appendix: References

[1] http://specs.xmlsoap.org/ws/2005/02/rm/ws-reliablemessaging.pdf

[2] http://www-128.ibm.com/developerworks/webservices/library/ws-securtrans/index.html

[3] http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=ws-rx#technical

[4] http://blogs.msdn.com/shycohen/archive/2006/02/20/535717.aspx



