RIM's Excuse For BlackBerry Outage Finally Emerges
from the too-little-too-late? dept
Research In Motion has delivered an explanation of what caused the BlackBerry outage earlier this week -- sort of. It says an insufficiently tested software upgrade set off a series of errors at its network operations center, which processes all the emails for BlackBerry devices in North America, and then its "failover process", which is supposed to switch things to a backup system, didn't work properly. The company says that it has plenty of capacity and resources to deal with its volume of messages and growing user base, and that it will better test its upgrades in the future. However, that explanation -- and the long time it took to come out -- doesn't wash with some observers, who say there are enough holes in the story that it doesn't add up. In particular, RIM's contention that it was upgrading its software on a Tuesday night, rather than over a weekend, has raised some red flags. And if a scheduled upgrade was behind the problem, shouldn't that have been immediately obvious to the company, and shouldn't its PR team have spread the news quickly? The real damage from this episode won't be the outage itself, but rather the fallout from how RIM deals with it. On that front, things already aren't looking so good.

Reader Comments
[ reply to this | link to this | view in thread ]
It doesn't matter.
It is very possible that they released a minor patch to fix something and that caused a cascade failure when released into the wild. An unexpected and untestable side effect of the "minor" patch screwed up many other things. This is a very common issue when it comes to "minor" items that blow up in your face.
I'm not trying to say RIM doesn't have to properly answer "WHY" this happened, I'm just stating that at this "point" they could still be doing in-depth data analysis and simply posting the "general" result of what they have found so far. I've done live patches in the past and while I've never had a cascade failure I've always known it was possible. (100K'ish subscriber level, not anywhere near the level of RIM.) I know that some day, some time, in some way I will miss some small detail and cause a cascade failure; it WILL happen.
So, understanding that there is never going to be 100% uptime, an outage of this type is deplorable yet a reality of large systems engineering. RIM "could" be a bit more upfront about what has gone wrong but on the other hand they very well could be scratching their heads over just what "really" caused the problem.
Now, personally, understanding such things, I would prefer that a company is up front about the reason for the down time. As a technically inclined person, and more importantly, one of the folks who would be asked to justify usage of XYZ system over others, I would not want this sort of generic response which doesn't make a lot of sense to be common. I would want straight answers to the problems and what they are doing to fix them, that's more important than denying there is a problem.
KB
[ reply to this | link to this | view in thread ]
anal retentive?
[ reply to this | link to this | view in thread ]
Just Saving a Few Bucks
I'm also sure RIM hires students fresh from school, supposedly because they are most up-to-date on the technology and of course, they work real, real cheap.
Of course, as soon as the students start figuring out what the hell they're doing, they want more money and of course, they get fired and a new flock of fresh-faced students gets hired.
This is a dirty little I.S. secret that's been true for many, many years.
From years of observation, my best guess about the outage is that it was caused by some new blockhead student who didn't know a bit from a byte but decided to "fix" something anyway.
[ reply to this | link to this | view in thread ]
re: post # 6 & 7
As for post 7... RIM is, in my experience, one of the more reliable technology companies out there. I don't manage any other systems that are as reliable and low maintenance as theirs.
And no, I don't work for them or have any particular investment. Just a very happy customer.
[ reply to this | link to this | view in thread ]
Re:
[ reply to this | link to this | view in thread ]
Hmmmm
[ reply to this | link to this | view in thread ]
Re: re: post # 6 & 7
Well, your argument convinced me.
I guess RIM doesn't "downsize" the well-paid, experienced staff and hire a bunch of inexpensive bozos, fresh from college.
I guess their system didn't go down unexpectedly for no honestly explained reason.
I guess someone actually did provide for a reliable back-up system and someone actually did institute effective change controls and other basic I.S. common sense activities but, dammit, they just didn't work.
I hate devastating arguments like yours. They make me feel so uninformed and inexperienced.
[ reply to this | link to this | view in thread ]
Blackberry Outage
[ reply to this | link to this | view in thread ]
Re: Hmmmm
that's what i was thinking. haha
[ reply to this | link to this | view in thread ]
Re: Re: re: post # 6 & 7
For clarification, in referring to "any other systems" I realize I was not being accurate. There are other systems that are as reliable and low maintenance... they are, however, few and far between.
[ reply to this | link to this | view in thread ]
Don't Rush QA!
[ reply to this | link to this | view in thread ]
Factor into that the fact that RIM charges for upgrades to BES, while Windows updates ActiveSync free of charge.
Also, syncing with a desktop is MUCH easier and more error-free with ActiveSync than with Desktop Manager.
[ reply to this | link to this | view in thread ]
Re: Re: Re: re: post # 6 & 7
I appreciate your response and I will admit I sometimes go berserk when I experience the cavalier attitude of corporations and other large business entities when it comes to Information Systems.
There are some pretty simple, well-known procedures to follow that will limit unscheduled downtime to something like 0.1 percent. That used to be the target for most mainframe shops and they met the target regularly - or they got fired.
Not so, these days. For instance, my cable ISP seems to simply turn off their service, accidentally or on purpose, any time they feel like it; no warning, no apologies, no refund. It happens a lot. Hooray for monopolies!
It is unfortunate when arrogance, greed and stupidity cause senior management to sacrifice dedication and professionalism on the altar of The Bottom Line.
Worse, they usually adversely affect The Bottom Line, then blame it on the I.S. staff.
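For a sense of scale, the 0.1 percent unscheduled-downtime target mentioned above can be turned into hours. This is only a rough sketch of the arithmetic: the function name `downtime_budget_hours` is made up for illustration, and it assumes a 365-day year and a 30-day month.

```python
# Hypothetical helper (not from any real library): converts an
# unscheduled-downtime target, given as a percentage, into the hours
# of outage that target allows over a period.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_budget_hours(downtime_percent: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime permitted by the given percentage."""
    return period_hours * downtime_percent / 100.0


if __name__ == "__main__":
    yearly = downtime_budget_hours(0.1)
    monthly = downtime_budget_hours(0.1, period_hours=30 * 24)
    print(f"0.1% of a year    = {yearly:.2f} hours")
    print(f"0.1% of 30 days   = {monthly * 60:.1f} minutes")
```

Under those assumptions, a 0.1 percent target leaves a budget of under nine hours of unscheduled downtime for an entire year.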
[ reply to this | link to this | view in thread ]
RIM Architecture