RIM's Excuse For BlackBerry Outage Finally Emerges

from the too-little-too-late? dept

Research In Motion has delivered an explanation of what caused the BlackBerry outage earlier this week, sort of. It says an insufficiently tested software upgrade set off a series of errors at its network operations center, which processes all the emails for BlackBerry devices in North America, and then its “failover process”, which is supposed to switch things to a backup system, didn’t work properly. The company says that it has plenty of capacity and resources to deal with its volume of messages and growing user base, and that it will better test its upgrades in the future. However, that explanation, and the long time it took to come out, doesn’t wash with some observers, who say there are enough holes in the story that it doesn’t add up. In particular, RIM’s contention that it was upgrading its software on a Tuesday night, rather than over a weekend, has raised some red flags. And if a scheduled upgrade was behind the problem, shouldn’t that have been immediately obvious to the company, and shouldn’t its PR team have spread the news quickly? The real damage from this episode won’t be the outage itself, but rather the fallout from how RIM deals with it. On that front, things already aren’t looking so good.



Comments on “RIM's Excuse For BlackBerry Outage Finally Emerges”

20 Comments
Anonymous Coward says:

It doesn't matter.

No matter what you do, occasional screwups will happen. I’m an anal retentive software developer who believes in testing first and foremost, yet there are things which can get past testing and QA simply because real-world stress is often different from what your testing process can anticipate. Edge case combinations of issues are the bane of all software/hardware developers because they cannot properly test for such things up front in all cases.

It is very possible that they released a minor patch to fix something and that caused a cascade failure when released into the wild. An unexpected and untestable side effect of the “minor” patch screwed up many other things. This is a very common issue when it comes to “minor” items that blow up in your face.

I’m not trying to say RIM doesn’t have to properly answer “WHY” this happened, I’m just stating that at this “point” they could still be doing in-depth data analysis and simply posting the “general” result of what they have found so far. I’ve done live patches in the past and while I’ve never had a cascade failure I’ve always known it was possible. (100K’ish subscriber level, not anywhere near the level of RIM.) I know that some day, some time, in some way I will miss some small detail and cause a cascade failure. It WILL happen.

So, understanding that there is never going to be 100% uptime, an outage of this type is deplorable yet a reality of large systems engineering. RIM “could” be a bit more upfront about what has gone wrong but on the other hand they very well could be scratching their heads over just what “really” caused the problem.

Now, personally, understanding such things, I would prefer that a company is up front about the reason for the down time. As a technically inclined person, and more importantly, one of the folks who would be asked to justify usage of XYZ system over others, I would not want this sort of generic response, which doesn’t make a lot of sense, to be common. I would want straight answers about the problems and what they are doing to fix them. That’s more important than denying there is a problem.

KB

Fred Flint says:

Just Saving a Few Bucks

Like most large Canadian companies (like banks), I’m sure RIM regularly fires their experienced and competent staff because such people are expensive.

I’m also sure RIM hires students fresh from school, supposedly because they are most up-to-date on the technology and of course, they work real, real cheap.

Of course, as soon as the students start figuring out what the hell they’re doing, they want more money and of course, they get fired and a new flock of fresh-faced students gets hired.

This is a dirty little I.S. secret that’s been true for many, many years.

From years of observation, my best guess about the outage is that it was caused by some new blockhead student who didn’t know a bit from a byte but decided to “fix” something anyway.

Peter says:

re: post # 6 & 7

Do some more research on how this whole BlackBerry thing works. BES (BlackBerry Enterprise Server) is just the facility to allow connectivity to your local (behind the firewall) resources.

As for post 7… RIM is, in my experience, one of the more reliable technology companies out there. I don’t manage any other systems that are as reliable and low maintenance as theirs.

And no, I don’t work for them or have any particular investment. Just a very happy customer.

Fred Flint says:

Re: re: post # 6 & 7

“As for post 7… RIM is, in my experience, one of the more reliable technology companies out there. I don’t manage any other systems that are as reliable and low maintenance as theirs.”

Well, your argument convinced me.

I guess RIM doesn’t “downsize” the well-paid, experienced staff and hire a bunch of inexpensive bozos, fresh from college.

I guess their system didn’t go down unexpectedly for no honestly explained reason.

I guess someone actually did provide for a reliable back-up system and someone actually did institute effective change controls and other basic I.S. common sense activities but, dammit, they just didn’t work.

I hate devastating arguments like yours. They make me feel so uninformed and inexperienced.

Peter says:

Re: Re: re: post # 6 & 7

Fred… Apologies for apparently stating my opinion as fact. My intent wasn’t to make anyone feel uninformed or inexperienced, but simply to offer a counter point to the inevitable corporate bashing that occurs here. You obviously have some insight into the inner workings of RIM that none of the other posters here could hope to match. I defer herewith to your superiority.

For clarification, in referring to “any other systems” I realize I was not being accurate. There are other systems that are as reliable and low maintenance… they are, however, few and far between.

Fred Flint says:

Re: Re: Re: re: post # 6 & 7

Peter,

I appreciate your response and I will admit I sometimes go berserk when I experience the cavalier attitude of corporations and other large business entities when it comes to Information Systems.

There are some pretty simple, well-known procedures to follow that will limit unscheduled downtime to something like 0.1 percent. That used to be the target for most mainframe shops and they met the target regularly – or they got fired.

Not so, these days. For instance, my cable ISP seems to simply turn off their service, accidentally or on purpose, any time they feel like it; no warning, no apologies, no refund. It happens a lot. Hooray for monopolies!

It is unfortunate when arrogance, greed and stupidity cause senior management to sacrifice dedication and professionalism on the altar of The Bottom Line.

Worse, they usually adversely affect The Bottom Line, then blame it on the I.S. staff.
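
For a sense of scale, the 0.1 percent target mentioned above can be turned into hours with simple arithmetic. This is a minimal sketch, assuming a 365-day year; the function name `downtime_budget_hours` is made up for illustration and does not come from the thread.

```python
# Assumption: a non-leap year of 365 days.
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_budget_hours(downtime_fraction, period_hours=HOURS_PER_YEAR):
    """Return the hours of unscheduled downtime a given fraction allows."""
    return downtime_fraction * period_hours


# 0.1 percent expressed as a fraction is 0.001.
budget = downtime_budget_hours(0.001)
print(f"0.1% downtime budget: {budget:.2f} hours per year")
```

On those assumptions the budget comes to a little under nine hours of unscheduled downtime per year.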

pickford says:

RIM and BB are fighting an uphill battle. There are devices that do exactly what BB does, only better. I am a network admin and we have more BB issues than we do Treo issues. Personally, I would rather have a WM OS on my Treo 700 and use Exchange’s built-in ActiveSync than install a buggy, costly middleman like BES.

Anonymous Coward says:

Re: Re:

I’ve never seen a WM device push emails as fast as a BB. Are you talking about using MS exchange server? Or the provider’s email service? Either way, the BES has only failed me this one time in four years and I was still able to access my email thru the web using Opera-mini on my device so I didn’t miss out on much.

Nick Rao says:

BlackBerry Outage

Deploying software updates in the middle of the week would be classified as “worst in class”. Inadequate testing is another sign of an organization that does not have a solid process for developing and testing code. If these folks are the gatekeepers of critical corporate communications worldwide, then corporate clients must demand information on plans to address the systemic issues, not just an “oops, we’ll do better next time”. OBTW, these types of process problems take months, if not years, to resolve.

pickford says:

@April 21st, 5:31PM – I am talking about M$ Exchange. I was told when I got my WM based Treo that it would not push as fast as BES. Wrongo. Recently, on a business trip with the director (who uses BB), a mass email was sent out and our devices alerted us of it at the exact same time. Also, while we were out there, an automated process I have set up in case of BES failing went off. So I received a message letting me know that the BES service had gone down and failed to restart. I then, on my Treo, in the car, logged into my Exchange server and got the service running.

Factor into that the fact that RIM charges for upgrades to BES, while Windows updates ActiveSync free of charge.

Also, syncing with a desktop is MUCH easier and error free with AS than with Desktop Manager.

MO says:

RIM Architecture

Folks:

Any system that goes down for 10 hours (April ’07) and then 6 hours (yesterday) is poorly architected.

For all the BBY fanatics, remember there are added layers of complexity (mail -> mailserver -> BES -> RIM -> PDA) and that is not sound, despite the professed advantages.

The “Evil Empire” got it right with ActiveSync (mail -> mailserver -> PDA). When all the Crackberry addicts were wondering what the hell was going on I was getting my e-mail just fine.

And then there’s the aborted failover attempt… guess they didn’t give that process its due dilly, eh?

If RIM had any clue they would fire whoever was responsible for the “failed upgrade” (yeah, right) and lame-ass failover.
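
The point about added layers can be made concrete with a rough availability calculation: when every system in a chain must be up for mail to arrive, the chain's availability is the product of the individual figures. This is only a sketch. The 99.9% per-system figure is a hypothetical number chosen for illustration, not anything RIM or Microsoft has published, and the systems are assumed to fail independently.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every system must be up at once."""
    return prod(availabilities)


# Hypothetical assumption: each system is up 99.9% of the time, independently.
SYSTEM = 0.999

# mail -> mailserver -> PDA: one intermediate system (the mail server).
direct = series_availability([SYSTEM])

# mail -> mailserver -> BES -> RIM -> PDA: three intermediate systems.
relayed = series_availability([SYSTEM, SYSTEM, SYSTEM])

print(f"direct path availability:  {direct:.4%}")
print(f"relayed path availability: {relayed:.4%}")
```

Under these assumptions the longer chain is always somewhat less available than the shorter one. How much that matters in practice depends on the real per-system figures, which this thread does not provide.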
