RIM's Excuse For BlackBerry Outage Finally Emerges

from the too-little-too-late? dept

Research In Motion has delivered an explanation of what caused the BlackBerry outage earlier this week -- sort of. It says an insufficiently tested software upgrade set off a series of errors at its network operations center, which processes all the emails for BlackBerry devices in North America, and that its "failover process", which is supposed to switch things to a backup system, then didn't work properly. The company says that it has plenty of capacity and resources to deal with its volume of messages and growing user base, and that it will test its upgrades better in the future. However, that explanation -- and the long time it took to come out -- doesn't wash with some observers, who say the story has too many holes to add up. In particular, RIM's contention that it was upgrading its software on a Tuesday night, rather than over a weekend, has raised some red flags. And if a scheduled upgrade was behind the problem, shouldn't that have been immediately obvious to the company, and shouldn't its PR team have spread the news quickly? The real damage from this episode won't come from the outage itself, but from the fallout over how RIM deals with it. On that front, things already aren't looking so good.
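
For readers wondering what a "failover process" amounts to, here is a minimal sketch of the general idea, not RIM's actual system. RIM has published no implementation details, so every name here (`Node`, `is_healthy`, `pick_active`, `failover_drill`) is hypothetical. It shows a primary site handing off to a backup when a health check fails, plus the sort of drill that could be run before an upgrade to confirm the backup would actually take over.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    """A message-processing site; `is_healthy` is a hypothetical health probe."""
    name: str
    is_healthy: Callable[[], bool]


def pick_active(primary: Node, backup: Node) -> Node:
    """Prefer the primary; fall back to the backup only if the primary is down."""
    if primary.is_healthy():
        return primary
    if backup.is_healthy():
        return backup
    # Both sites are down: an upgrade error plus a failed failover.
    raise RuntimeError("no healthy node available")


def failover_drill(backup: Node) -> bool:
    """Simulate a dead primary and report whether the backup would take over."""
    dead_primary = Node("primary (simulated outage)", lambda: False)
    try:
        return pick_active(dead_primary, backup) is backup
    except RuntimeError:
        return False
```

The point of a drill like `failover_drill` is that a backup path nobody exercises tends to be discovered broken only when it is needed, which is roughly what RIM's own account describes.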


Reader Comments

  1.  
    Joel Coehoorn, Apr 20th, 2007 @ 7:10pm

    I imagine the problem was an automatically applied update to a Windows server on Patch Tuesday, the day each month on which Microsoft releases new updates.

     


  2.  
    Anonymous Coward, Apr 20th, 2007 @ 8:04pm

    It doesn't matter.

    No matter what you do, occasional screwups will happen. I'm an anal retentive software developer who believes in testing first and foremost, yet there are things which can get past testing and QA simply because real world stress is different than your testing process can anticipate in many cases. Edge case combinations of issues are the bane of all software/hardware developers because they can not properly test for such things up front in all cases.

    It is very possible that they released a minor patch to fix something and that caused a cascade failure when released into the wild. An unexpected and untestable side effect of the "minor" patch screwed up many other things. This is a very common issue when it comes to "minor" items that blow up in your face.

    I'm not trying to say RIM doesn't have to properly answer "WHY" this happened, I'm just stating that at this "point" they could still be doing in-depth data analysis and simply posting the "general" result of what they have found so far. I've done live patches in the past and while I've never had a cascade failure I've always known it was possible. (100K'ish subscriber level, not anywhere near the level of RIM.) I know that some day, some time, in some way I will miss some small detail and cause a cascade failure. It WILL happen.

    So, understanding that there is never going to be 100% uptime, an outage of this type is deplorable yet a reality of large systems engineering. RIM "could" be a bit more upfront about what has gone wrong but on the other hand they very well could be scratching their heads over just what "really" caused the problem.

    Now, personally, understanding such things, I would prefer that a company is up front about the reason for the down time. As a technically inclined person, and more importantly, one of the folks who would be asked to justify usage of XYZ system over others, I would not want this sort of generic response which doesn't make a lot of sense to be common. I would want straight answers to the problems and what they are doing to fix them, that's more important than denying there is a problem.

    KB

     


  3.  
    Ju1c3, Apr 20th, 2007 @ 8:21pm

    I am glad to hear that it was their problem, and not something I did. I am a network consultant, so I inevitably got the question about the BlackBerrys a few times. Had my head scratching there for a few...

     


  4.  
    Bobshaker, Apr 20th, 2007 @ 10:12pm

    Don't own a Blackberry, don't intend to, don't care. yay.

     


  5.  
    Anonymous Coward, Apr 20th, 2007 @ 10:49pm

    anal retentive?

    You do know that anal retentive means "full of crap"?

     


  6.  
    Anonymous Coward, Apr 20th, 2007 @ 11:27pm

    I'd like to know why companies with a BlackBerry Enterprise Server still have to have their emails routed through RIM. There's no reason why the BlackBerrys can't use their normal data connection and just sync with the BES.

     


  7.  
    Fred Flint, Apr 21st, 2007 @ 6:51am

    Just Saving a Few Bucks

    Like most large Canadian companies (like banks), I'm sure RIM regularly fires their experienced and competent staff because such people are expensive.

    I'm also sure RIM hires students fresh from school, supposedly because they are most up-to-date on the technology and of course, they work real, real cheap.

    Of course, as soon as the students start figuring out what the hell they're doing, they want more money and of course, they get fired and a new flock of fresh-faced students gets hired.

    This is a dirty little I.S. secret that's been true for many, many years.

    From years of observation, my best guess about the outage is that it was caused by some new blockhead student who didn't know a bit from a byte but decided to "fix" something anyway.

     


  8.  
    Peter, Apr 21st, 2007 @ 9:42am

    re: post # 6 & 7

    Do some more research on how this whole Blackberry thing works. BES (Blackberry Enterprise Server) is just the facility to allow connectivity to your local (behind the firewall) resources.

    As for post 7... RIM is, in my experience, one of the more reliable technology companies out there. I don't manage any other systems that are as reliable and low maintenance as theirs.

    And no, I don't work for them or have any particular investment. Just a very happy customer.

     


  9.  
    pickford, Apr 21st, 2007 @ 11:44am

    RIM and BB are fighting an uphill battle. There are devices that do exactly what BB does, only better. I am a network admin and we have more BB issues than we do Treo issues. Personally, I would rather have a WM OS on my Treo 700 and use Exchange's built-in ActiveSync than install a buggy, costly middle man like BES.

     


  10.  
    Anonymous Coward, Apr 21st, 2007 @ 5:31pm

    Re:

    I've never seen a WM device push emails as fast as a BB. Are you talking about using MS exchange server? Or the provider's email service? Either way, the BES has only failed me this one time in four years and I was still able to access my email thru the web using Opera-mini on my device so I didn't miss out on much.

     


  11.  
    James, Apr 22nd, 2007 @ 3:56am

    Hmmmm

    Looks like some SonyEQ system admins went to work for RIM.

     


  12.  
    Ryan, Apr 22nd, 2007 @ 5:03am

    Looks to me that they should focus on why the fail-over plan did not work as well. The issues happen, but the fail-over should take over.

     


  13.  
    Fred Flint, Apr 22nd, 2007 @ 6:01am

    Re: re: post # 6 & 7

    "As for post 7... RIM is, in my experience, one of the more reliable technology companies out there. I don't manage any other systems that are as reliable and low maintenance as theirs."

    Well, your argument convinced me.

    I guess RIM doesn't "downsize" the well-paid, experienced staff and hire a bunch of inexpensive bozos, fresh from college.

    I guess their system didn't go down unexpectedly for no honestly explained reason.

    I guess someone actually did provide for a reliable back-up system and someone actually did institute effective change controls and other basic I.S. common sense activities but, dammit, they just didn't work.

    I hate devastating arguments like yours. They make me feel so uninformed and inexperienced.

     


  14.  
    Nick Rao, Apr 22nd, 2007 @ 6:02am

    Blackberry Outage

    Deploying software updates in the middle of the week would be classified as "worst in class". Inadequate testing is another example of an organization that does not have a solid process for developing and testing code. If these folks are the gatekeepers of critical corporate communications worldwide, then corporate clients must demand information on plans to address the systemic issues, not just an "oops, we'll do better next time". OBTW, these types of process problems take months, if not years, to resolve.

     


  15.  
    Anonymous Coward, Apr 22nd, 2007 @ 8:34am

    Re: Hmmmm

    "Looks like some SonyEQ system admins went to work for RIM."

    that's what i was thinking. haha

     


  16.  
    Peter, Apr 22nd, 2007 @ 7:29pm

    Re: Re: re: post # 6 & 7

    Fred... Apologies for apparently stating my opinion as fact. My intent wasn't to make anyone feel uninformed or inexperienced, but simply to offer a counter point to the inevitable corporate bashing that occurs here. You obviously have some insight into the inner workings of RIM that none of the other posters here could hope to match. I defer herewith to your superiority.

    For clarification, in referring to "any other systems" I realize I was not being accurate. There are other systems that are as reliable and low maintenance... they are, however, few and far between.

     


  17.  
    Scribble, Apr 23rd, 2007 @ 8:03am

    Don't Rush QA!

    Okay, folks - you want it FAST or you want it GOOD! You can't have it both ways! This looks to me like an example of "QA's holding up the release again". Don't blame us when you release the patch before we're done testing it.

     


  18.  
    pickford, Apr 23rd, 2007 @ 12:16pm

    @April 21st, 5:31PM - I am talking about M$ Exchange. I was told when I got my WM based Treo that it would not push as fast as BES. Wrongo. Recently on a business trip with the director (who uses BB) a mass email was sent out and our devices alerted us of it at the same exact time. Also, while we were out there, an automated process I have set up in case of BES failing went off. So I received a message letting me know that the BES service had gone down and failed to restart. I then, on my Treo, in the car, logged into my Exchange server and got the service running.

    Factor into that the fact that RIM charges for upgrades to BES, while Windows updates ActiveSync free of charge.

    Also, syncing with a desktop is MUCH easier and error free with AS than with Desktop Manager.

     


  19.  
    Fred Flint, Apr 24th, 2007 @ 7:33am

    Re: Re: Re: re: post # 6 & 7

    Peter,

    I appreciate your response and I will admit I sometimes go berserk when I experience the cavalier attitude of corporations and other large business entities when it comes to Information Systems.

    There are some pretty simple, well-known procedures to follow that will limit unscheduled downtime to something like 0.1 percent. That used to be the target for most mainframe shops and they met the target regularly - or they got fired.

    Not so, these days. For instance, my cable ISP seems to simply turn off their service, accidentally or on purpose, any time they feel like it; no warning, no apologies, no refund. It happens a lot. Hooray for monopolies!

    It is unfortunate when arrogance, greed and stupidity cause senior management to sacrifice dedication and professionalism on the altar of The Bottom Line.

    Worse, they usually adversely affect The Bottom Line, then blame it on the I.S. staff.

     


  20.  
    MO, Feb 12th, 2008 @ 7:24am

    RIM Architecture

    Folks: Any system that goes down for 10 hours (April '07) and then 6 hours (yesterday) is poorly architected. For all the BBY fanatics, remember there are added layers of complexity (mail -> mailserver -> BES -> RIM -> PDA) and that is not sound, despite the professed advantages. The "Evil Empire" got it right with ActiveSync (mail -> mailserver -> PDA). When all the Crackberry addicts were wondering what the hell was going on I was getting my e-mail just fine. And then there's the aborted failover attempt... guess they didn't give that process its due dilly, eh? If RIM had any clue they would fire whoever was responsible for the "failed upgrade" (yeah, right) and lame-ass failover.

     


