…generate a buzz in Raleigh and Zurich.
This post will recount two parallel stories in Rashomon style that transpired last week. One version is from the perspective of a user in Shanghai, China and the other is b2b2dot0’s “behind the scenes” perspective in both our US and European headquarters.
The Shanghai perspective
On Saturday morning January 10th at about 7AM local Shanghai time, user JChan wanted to check up on an order that he had placed with the US earlier that week. He proceeded to do what he has done scores of times over the past 4 months, he logged into our b2b2dot0 service and asked to see the details of a specific order as they were stored in our client’s SAP system. JChan’s first mouse click was on the “most recently placed web order” link on his home page (#1 in the following screenshot).
Since it was taking longer than usual to get a response from the system, JChan wanted to make good use of his waiting time and his second click was to innocently close the “What’s New” window (click #2 in the above screen shot) to free up a little screen real estate for himself. A few seconds later, the system returned all the information that JChan had requested and he soon thereafter logged off of the system to carry on with the rest of his day. Since it was a Saturday, I can only hope it was to spend time with his family.
Before we move on to the b2b2dot0 perspective of this session, a quick word about Internet response times in China. They simply aren’t as fast as they are in the rest of the world and the local population (especially those that travel to the west) have re-calibrated their expectations accordingly.
We can thank the recent 2008 Beijing Olympic games for exposing the state of the Chinese internet to the rest of the world. Apparently the “Great Chinese Firewall”
intrudes on all internet traffic in and out of the country, imposing a
heavy tax on performance.
The b2b2dot0 perspective on JChan’s two mouse clicks.
Shortly after JChan’s second mouse click, our production server monitors generated an exception alert notification with the following subject line:
[EXC PRODUCTION] purchasing#show (Mongrel::TimeoutError) “Mongrel timed out this thread: shutdown”
The keywords to focus on here are “TimeoutError” and “shutdown”. Neither of these are comforting words to see as we in Raleigh were about ready to head out for our Friday evening festivities and since our client had already closed up shop for the weekend.
The good news is that our architecture anticipates an occasional mongrel overload and was able to gracefully recover for JChan. So once we determined that JChan didn’t experience anything out of the ordinary, we started the “post mortem” to understand what happened and see if we could prevent its recurrence in the future.
In true CSI fashion, we recreated JChan’s experience from our application logs. This is what Adrian Zehnder (from Switzerland) was able to tell us from a reconstruction of JChan’s session:
– 17:10:38 EST: user jchan logs in
– 17:10:44 EST: user jchan clicks to display sales doc 190054182: it
triggers a web sv call
– 17:10:44 EST: mongrel 3001-2 sends the usual 1st ping and receives
the usual 401 response
– 17:10:45 EST: mongrel 3001-2 sends the usual 2nd ping which never
receives a response. This must have caused the mongrel to hang.
– 17:10:47 EST: user jchan closes the news section
– 17:11:33 EST: user jchan clicks to display sales doc 190054182: it
triggers a web sv call (which is handled by mongrel 3002-1 with no
– 17:22:42 EST: after some other stuff, user jchan logs out
– 18:14:00 EST Justin receives a notification from nagios about the
– 19:17:54 EST user jchan gets a web service exception for the call
from 17:10:45 EST, because mongrel timed out the thread. This triggers
a notification which fails because the server hangs ???
– 19:18:00 EST another exception is logged under user jchan which
triggers the notification (about the failing of the prior notification)
that we receive
– I’m not sure whether mongrel 3002-1 works, because there’s nothing
Adrian goes on to ask:
1. Why is there no web service timeout after 60s somewhere around
17:11:44 EST ?
2. Why is Justin not notified before 18:14 ?
3. Does the fact that user jchan closes the news section 3s after
clicking the recent web order link for 190054182 have any impact ?
4. What is happening at 19:17:54 ? Did one of you guys terminate the
5. Why do we get notified at 19:18:00 ? Is mongrel 3002-1 working again
and processing the notification ? I don’t understand how that would
These questions spawned quite a few internal b2b2dot0 conversations and some process improvements. We also logged a requirement to close the “hole” on the home page that allowed a second mouse click before the results of the first click are returned to the user.
Thanks to JChan’s two mouseclicks in Shanghai, we had an exciting evening in Raleigh, and an action packed weekend in Zurich. We can only hope that he went on to extend what was a productive start to his Saturday, to the rest of his weekend.
We’re just happy that JChan accomplished his goal on our website, our client was never disturbed, and we learned a few things to make our service even more robust in the future!
Just another day in the life of a Software-as-a-Service provider.