Recently, we were assigned to assist another company in a development and maintenance role. They had an existing web application, written in Java and used by a large number of users. The application was up and running, and they needed it modified to interface with an external system, but they had nobody on board who knew Java well enough, and the previous developers were no longer available to help. That’s where we came in: we provided the skills needed to add the feature.

For us to interface with it, the external system exposed a web service (written on the .NET platform) which we had to call in a remote-method fashion. That is, certain transactions on the web site had to call a web service method and await its reply before saving the transaction data into the database. We couldn’t perform these web service requests asynchronously, a point I’ll come back to later.

To add to the pressure, we had four days to complete the code to interface with the web service (as well as an additional, but minor, feature), plus two weeks of “transition time”, during which the new web application would go live straight into production, with live data, but users would not be charged for the transactions they made during that period.

There were three of us on the team: me, Butch, and Miguel. We decided right off the bat to use Spring’s remoting support for calling the external web service. We got everything up and running quickly on our development server and waited for the client’s team to ready the production system for the turnover. However, Murphy decided to pay us a visit: there was a planned database switchover, in which the live production site would be pointed at a standby database instance while the live database was upgraded, and things went south quite quickly.
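(For the curious, the Spring side of this boils down to a JAX-RPC port proxy factory bean. Below is a rough sketch of the setup; the service names, URL, and interface are made up rather than taken from the actual partner service, and in the real application the factory bean was declared in the Spring configuration rather than built by hand like this.)

    import java.net.URL;

    import org.springframework.remoting.jaxrpc.JaxRpcPortProxyFactoryBean;

    // Hypothetical client-side view of the partner's service; not the real interface.
    interface PartnerService {
        String submitTransaction(String payload);
    }

    // Rough sketch of the Spring JAX-RPC proxy setup.
    public class PartnerClientSetup {
        public static PartnerService createProxy() throws Exception {
            JaxRpcPortProxyFactoryBean factory = new JaxRpcPortProxyFactoryBean();
            factory.setWsdlDocumentUrl(new URL("http://partner.example.com/Service.asmx?WSDL")); // made-up URL
            factory.setNamespaceUri("http://partner.example.com/");
            factory.setServiceName("PartnerService");
            factory.setPortName("PartnerServiceSoap");
            factory.setServiceInterface(PartnerService.class); // calls go through a dynamic JAX-RPC Call
            factory.afterPropertiesSet();
            return (PartnerService) factory.getObject();
        }
    }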

Anyway, we eventually got the whole shebang up and running the next day, with Murphy still breathing down our necks. This time, we found the application dying: within an hour or two of use, access to the application would slowly grind to a halt until nobody could reach it at all, and the application server (OC4J in this instance) had to be restarted.

We set out to figure out this particularly nasty bit of news and why it was happening. About halfway into the transition time (and after roughly two to three app server restarts a day, which was disastrous), we finally found out what was going on.

It turns out that Axis 1.x, which we were using as the JAX-RPC provider under the hood of the Spring JAX-RPC proxy factory bean, uses a fire-and-forget HTTP transport by default. Under high concurrent load, that default Axis HTTP transport eventually exhausts all available connections: a lot of the finished HTTP transactions sit idling in the CLOSE_WAIT or TIME_WAIT states, and they don’t transition to fully closed quickly enough for their sockets to be reused.

To give an idea of the connection load we were seeing: the web application was handling about 20,000 to 30,000 transactions a day between 9am and 6pm, with traffic peaking from 10am to 12pm; little to no traffic occurs outside those hours. Although this isn’t spectacular (web sites that have been Slashdotted can see traffic an order of magnitude or two greater), the situation was exacerbated by latency: each call to the external web service took one to two seconds to finish, and every web transaction made two web service method calls.
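Put roughly into numbers, that load looks like the sketch below. The 50% peak-window share and the TIME_WAIT duration are my assumptions; the transaction counts and call latencies are the figures above.

    // Back-of-envelope socket arithmetic. The 50% peak-window share and the
    // 240-second TIME_WAIT figure are assumptions on my part; the rest comes
    // from the traffic numbers above.
    public class SocketMath {
        public static void main(String[] args) {
            int txPerDay = 30000;            // upper end of the daily transaction count
            int callsPerTx = 2;              // two web service calls per web transaction
            double peakShare = 0.5;          // assume half the day's traffic lands in the 10am-12pm peak
            int peakWindowSeconds = 2 * 3600;
            int timeWaitSeconds = 240;       // typical Windows TIME_WAIT of that era (assumption)

            double callsPerSecond = (txPerDay * callsPerTx * peakShare) / peakWindowSeconds;
            double socketsParked = callsPerSecond * timeWaitSeconds;

            System.out.println("peak SOAP calls per second:  " + callsPerSecond); // roughly 4
            System.out.println("sockets parked in TIME_WAIT: " + socketsParked);  // roughly 1,000
        }
    }

A thousand-odd sockets parked in TIME_WAIT at any moment, on top of whatever accumulates in CLOSE_WAIT and the users’ own inbound connections, doesn’t leave much headroom on a Windows box of that vintage, where the ephemeral port range by default tops out around port 5,000.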

(Aside: since Butch and I were more conversant with monitoring and administering Linux servers than the Windows server the web application was installed on, and since our only interaction with that server was a VNC connection to the machine where it was colocated, we had a tough time piecing together enough data to actually reach that conclusion. Really. My favorite tool on Windows is now PERFMON, even though I don’t run Windows at all. But I digress.)

Apparently, Axis 1.x also supports an alternative HTTP transport that uses the Commons HttpClient library, which pools HTTP connections. Telling Axis to use the Commons HttpClient sender instead is well documented here, so I won’t go into it. However, for reasons beyond me at the moment, the remote web service (served by IIS, and possibly sitting behind an ISA server) refused the Commons HttpClient connections because of chunked encoding (i.e. Commons HttpClient wanted to send the HTTP request body in chunks, and told the server so by issuing “Transfer-Encoding: chunked” in the request headers). As I found out from here, there is a workaround, which I implemented by subclassing Spring’s JaxRpcPortProxyFactoryBean and overriding postProcessJaxRpcCall:

    public void postProcessJaxRpcCall(Call call, MethodInvocation method) {
        super.postProcessJaxRpcCall(call, method);
        // Tell Axis's Commons HttpClient sender not to use chunked
        // transfer encoding on the request, which the IIS-hosted
        // service was refusing.
        Hashtable ht = new Hashtable();
        ht.put("chunked", "false");
        call.setProperty("HTTP-Request-Headers", ht);
    }

I also used the subclass to work around a particular concurrency bug in Spring’s JAX-RPC proxy bean that we were hitting.

So that basically saved the day. We did some stress testing before putting it into production, observing the number of established TCP connections. Without the Commons HttpClient sender, the graph of network activity rose linearly, with occasional dips as the load testing machine’s activity dropped off (its own threads were competing with each other for CPU time). With the Commons HttpClient sender in place, the graph was quite flat, reaching a natural peak and never moving above it (of course, our concurrent load was constant, hence the ceiling).
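Our load tester was a separate tool running on its own machine, but to give an idea of the shape of the test, a bare-bones fixed-concurrency loop along the following lines would do; the URL and class name here are made up.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Minimal fixed-concurrency load loop: a constant number of threads each hit
    // the same transaction URL over and over, so the offered load stays flat.
    public class FixedLoadTester {
        public static void main(String[] args) throws Exception {
            final URL target = new URL("http://test-server/app/transaction"); // made-up URL
            int threads = 50; // constant concurrency, hence the flat ceiling in the graphs

            for (int i = 0; i < threads; i++) {
                new Thread(new Runnable() {
                    public void run() {
                        while (true) {
                            try {
                                HttpURLConnection conn = (HttpURLConnection) target.openConnection();
                                InputStream in = conn.getInputStream();
                                byte[] buf = new byte[4096];
                                while (in.read(buf) != -1) {
                                    // drain the response so the connection is released cleanly
                                }
                                in.close();
                            } catch (Exception e) {
                                // a failed request just gets noted; the thread keeps offering load
                                System.err.println("request failed: " + e);
                            }
                        }
                    }
                }).start();
            }
        }
    }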

The moral of the story? When you’re about to put a site live, Murphy will be knocking.
