Using Spreadsheets In Bioinformatics Can Corrupt Data, Changing Gene Names Into Dates

from the careful,-now dept

A few years back, people were rather disturbed to find out about the famous Excel bug, whereby the multiplication of two numbers in Microsoft’s spreadsheet gave the wrong number. It turns out there are other circumstances in which Excel (and, to be fair, presumably other spreadsheets) can give incorrect results, but they are unlikely to be encountered in typical everyday tasks. However, in the specialized world of bioinformatics, which uses computers to analyze data about genes and related areas, careless use of spreadsheets can throw up a significant numbers of errors, as this paper in BMC Bioinformatics explains:

Use of one of the research community’s most valuable and extensively applied tools for manipulation of genomic data can introduce erroneous names. A default date conversion feature in Excel (Microsoft Corp., Redmond, WA) was altering gene names that it considered to look like dates. For example, the tumor suppressor DEC1 [Deleted in Esophageal Cancer 1] was being converted to ‘1-DEC.’

Here we have the interesting interaction of two very different fields, where the name of a gene involved in esophageal cancer, DEC1, was interpreted by Excel to mean the date, 1 December. As the paper points out, these kinds of substitution errors are already to be found in key public databases:

DEC1, a possible target for cancer therapy, was incorrectly rendered, and it could potentially be missed in downstream data analysis. The same type of error can infect, and propagate through, the major public data resources. For example, this type of error occurs several times in even the immaculately curated LocusLink database.

As that notes, a gene that might be relevant for treating cancer could well be missed because of this incorrect conversion to a date by Excel. Although it is unlikely that any serious harm has been caused by this — yet — it’s a useful reminder of the dangers of depending a little too heavily on the results of software without checking for corruption of this kind.

Follow me @glynmoody on Twitter or identi.ca, and +glynmoody on Google+

Filed Under: , , , ,

Rate this comment as insightful
Rate this comment as funny
You have rated this comment as insightful
You have rated this comment as funny
Flag this comment as abusive/trolling/spam
You have flagged this comment
The first word has already been claimed
The last word has already been claimed
Insightful Lightbulb icon Funny Laughing icon Abusive/trolling/spam Flag icon Insightful badge Lightbulb icon Funny badge Laughing icon Comments icon

Comments on “Using Spreadsheets In Bioinformatics Can Corrupt Data, Changing Gene Names Into Dates”

Subscribe: RSS Leave a comment
85 Comments
Donglebert The Needlessly Unready says:

Re: Re: Re:

If you open excel, and then try to open a text or csv file, the Import Wizard will start and give the option to define each column’s data type before import. It won’t convert it.

The wizard doesn’t run if you click on a file and select to open it in Excel, or if you set your exporting application to directly open Excel, or if you paste the data into a sheet. It will then autocorrect the data into a date field that, at it’s heart, is a formatted number. If you convert that back to text, you’ll see the number.

Editor-In-Chief says:

This problem occurs in many different data analysis situations using Microsoft spreadsheets

I first came across this problem around 2001 with loading sub-contract data for analysis. it just so happened that some of the text data would be interpreted as floating point numbers, thereby rendering the data analysis completely wrong.

The data had to be pre-processed to insert “” around the relevant fields before loading the data into the spreadsheet. This did stop the spreadsheet conversion routines from touching that specific data. Unfortunately, the data was supplied in varying formats so that a uniform method couldn’t be developed at the time in the timeframe required.

David Oliver Graeme Samuel Offenbach

Lawrence D’Oliveiro says:

Spreadsheets Are Undebuggable

As an experienced software developer, I just have to roll my eyes in disbelief at the number of people who trust complicated spreadsheets to give the right answers without the ability to check that they actually do.

Even important economic research is being called into question because of spreadsheet errors.

Eldakka (profile) says:

Re: So instead of spreadsheets...

It’s not so much the spreadsheet that’s the issue, it’s how people are using it. If the data was imported in the correct way it wouldn’t have been an issue. However, most of the people doing this sort of data analysis (geneticists, biologists, economists) aren’t Computer Science people, therefore they may not even be aware that this type of issue can occur.

It’s the same as climate scientists stating “If you aren’t a climate scientist, you have no authority to comment on our work.” But the problem is, most of climate science (computer modelling, statistical analysis) isn’t “Climate” Science, it’s a mix of Computer Science (programming for computer modelling), Fluid Dynamics (also for computer modelling, you’ve got to accurately model fluid dynamics to have an accurate atmospheric model), Statistical analysis (all the climate chronologies that are done are based on Statistics) and so on.

When a non-Computer Scientist programs an advanced model, without input from a physicist/Engineer (for fluid dynamics), and picks and uses statistical methods without an in-depth knowledge of statistics to know which method to use when, you are going to have problems.

However, all these users (Climate Scientists, Biologist, Engineer and what have you) go: “Wonderful, I don’t need all these specialists (Computer Scientist, programmer, Engineer, Climate Scientist, Biologist), I can just use an off the shelf spreadsheet, whip up a few formulas in it, and presto!”.

A Computer Scientist might go: “I’ll just write my own program to do this climate modelling thing, I’ve got a fluid dynamics text book, I’ll just read that and implement it in code.” The code may be beautiful, concise, with the formula’s implemented flawlessly…until a fluid dynamics specialist looks at it and goes “why did you use that formula/paradigm there? That’s a rather exotic case, superheated plasmas, it’s not very good when you are modelling a layer of air that’s at -45C…”

Of course, if you don’t have an in-depth knowledge of spreadsheets (hell, a civil servant who lives, breathes, finance and the spreadsheets that are daily created probably knows how to use a spreadsheet program better than a ‘scientist’, whether that’s an Engineer, Computer Scientist, Economist, Climate Scientist what have you), or Fluid Dynamics, or programming or Computer Science, then you’ll probably end up with a balls-up.

Unless you have expertise in the other field, you should engage an expert in that field to at least advise on what you are doing. If a Biologist needs to manipulate a lot of genes with a computer…consult a computer scientist on the best way to do what you want done, and have them implement it for you if necessary…and if the implementation requires expertise from another field, say fluid dynamics, then the computer scientist should probably consult a fluid dynamacist, or statistician or whatever else is necessary.

Anonymous Coward says:

Re: So instead of spreadsheets...

“Clearly spreadsheets are efficient”

I do not understand this statement. Are you saying that the software in question is efficient at producing erroneous output? How can this be considered efficient when the user then has to expend countless hours finding and correcting the introduced errors. I find this to be grossly inefficient. You would be better off writing your own scripts.

nasch (profile) says:

Re: Re: Re:2 So instead of spreadsheets...

It was an example of how inefficient a poorly designed application is, when doing it yourself is potentially better.

So you’re saying making mistakes using Excel (because that’s what happened here) is inefficient, and it would be a better use of time for that same person (who is a scientist, not a software developer, and can’t use Excel properly) to write their own software from scratch?

Rekrul says:

This is a perfect example of Microsoft’s attitude of thinking they know what’s best for the user. Instead of allowing the user to set such options, their software just goes ahead and does it.

It’s the same thinking that lead to them adding a virus distribution mechanism to Windows and calling it AutoRun/AutoPlay.

Woadan (profile) says:

Re: Re:

The likelihood is that MS responded to many Excel community requests to make data entry easier for some sorts of data, such as dates, and so Excel now “senses” a date has been entered and converts the data to its default date setting.

Excel is used by many organizations for many purposes. You need to at least be aware of how the software functions. Otherwise why are you using it?

Rekrul says:

Re: Re: Re:

The likelihood is that MS responded to many Excel community requests to make data entry easier for some sorts of data, such as dates, and so Excel now “senses” a date has been entered and converts the data to its default date setting.

That should be an option.

I’m sure that there are probably quite a few people who wouldn’t mind having all their figures rounded up or down to make things simpler. Should the software just go ahead and do that for everyone?

Michael (profile) says:

Re: Re:

This is a perfect example of Microsoft’s attitude of thinking they know what’s best for the user. Instead of allowing the user to set such options, their software just goes ahead and does it.

Yay for Microsoft bashing. However, this is the DEFAULT behavior of excel when pasting in data from the clipboard that is in certain common formats. Someone actually familiar with the software would have know this was a possibility and could have taken lots of different steps to avoid it – the simplest being to set the format of the column to text before pasting (all of three clicks), or properly using the import features of the software to identify data types correctly.

This particular “issue” is not a software problem.

If I had to select the individual data type and formatting of everything I ever pasted into Excel, I would never get anything done.

Anonymous Coward says:

Re: Re: Re:

“Yay for Microsoft bashing.”

Well deserved bashing in this case.

Rather than make excuses and describe workarounds for a poorly designed software application, one might simply use a more sophisticated suite from a different source that is more suited to their purpose. I realize that in many corporate environments obtaining such special treatment is an uphill battle, so good luck with that.

nasch (profile) says:

Re: Re: Re: Re:

Rather than make excuses and describe workarounds for a poorly designed software application

It’s not a workaround, it’s using the software correctly. Maybe the default behavior isn’t right for these users, but that doesn’t mean the software is flawed (in this particular way). It’s not “Excel for Bioinformatics” after all. If a user doesn’t know how to import data correctly, doesn’t learn how before doing it, and ends up with incorrect data, why is that Microsoft’s fault?

Anonymous Coward says:

Re: Re: Re:2 Incorrect data

If a user doesn’t know how to import data correctly, doesn’t learn how before doing it, and ends up with incorrect data, why is that Microsoft’s fault?

The data is mangled by a “feature” that is on-by-default in every new document, and as I understand it, the user is never proactively informed that the feature exists. I understand the appeal of a tool that is immediately ready for use on first install (after Click Once gets done downloading the multi-GB installer for 15+ minutes ;)), but presenting an interface that looks like it will always “do the right thing” is an attractive nuisance. It lures users into assuming Excel will do the right thing, without any hint that its idea of right differs from what the user intends.

A dialog on first import saying “Excel has normalized some date-like strings.” could have alerted the user that the import mangled the results. The burden would then be on the user to go back and reimport the data correctly. A fancier version might offer links to documentation that explains what kind of normalization Excel enables and options to re-run the import in place with those normalizations disabled. A very fancy version could even import the data in non-normalized form, but provide an option to have Excel interact with it (render, search, transform, etc.) in normalized or as-imported form.

nasch (profile) says:

Re: Re: Re:3 Incorrect data

A dialog on first import saying “Excel has normalized some date-like strings.” could have alerted the user that the import mangled the results.

Sadly, users rarely read dialogs. Reading documentation is probably even less likely. Given that they’re not going to get much help from the user, software designers have to choose what is the best default behavior for the greatest number of users. Perhaps you disagree with the default behavior, but either way it’s going to be wrong for someone.

Michael (profile) says:

Re: Re: Re:3 Incorrect data

A dialog on first import saying “Excel has normalized some date-like strings.” could have alerted the user that the import mangled the results.

I’m not sure if you have noticed that software has really stopped popping up dialogs like this, but while they used to be standard practice, sometime in the late 80’s software companies realized that their customers HATE THOSE THINGS MORE THAN GETTING SOMETHING WRONG NOW AND THEN.

Avatar28 (profile) says:

Re: Re: Re:2 Re:

It’s not “Excel for Bioinformatics” after all.

Exactly. People need to remember that Excel isn’t marketed for that particular use. The vast VAST majority of users of this software benefit from these sorts of changes. Remember, most will be using it for things like financial data and stuff like that. For those people importing a date that displays in a different format can create issues. In the US we would write Dec 1 but in Europe 1 Dec is more common. I can pretty much guarantee that displaying gene abbreviations correctly is a niche use that NO designer ever had cross their mind. I’d also be surprised if the other packages on the market didn’t do the same thing.

Michael (profile) says:

Re: Re: Re: Re:

These options are easily turned off.

They are on by default because that is what most of the users want most of the time. They have gotten years of feedback, complaints, and requests to come up with the defaults – they didn’t arbitrarily decide that automatic date detection should be on and automatic fraction detection should be off.

However, the included documentation and the ability to control all of these things – just in case you don’t want them on for a specific purpose.

If they had gone your way, you would type in a list of numbers, highlight them, and wonder why it didn’t show you the total.

MrTroy (profile) says:

Re: Re: Re:

Microsoft does a lot of bad things in a lot of software, and don’t get me started on the APIs… but this is just a case of Microsoft providing what most of its customers want.

As mentioned above, the target audience for Excel was never bio-informatics… and yet they are still catered for because steps can be taken to import data correctly, instead of blindly pasting stuff around, not checking the results, then blaming the software for getting it wrong.

There are lots of suggestions that Excel should have warned that it had converted data to a date… but nobody has considered how much they would be pissed off when Excel issued a warning every time they entered a date, or a number – that’s just training your users to ignore warnings.

There are lessons to be learned from this story, but blaming it all on Microsoft is choosing not to learn.

John Fenderson (profile) says:

Re: Re: Re: Re:

“but this is just a case of Microsoft providing what most of its customers want.”

I’m aware of that, but catering to what most of its customers want means that they produce software which is dumbed down and allows for stupid errors to be easily overlooked. This effect is one of the reasons I avoid Microsoft software as much as I can.

“blaming it all on Microsoft is choosing not to learn.”

Don’t misunderstand me — I’m not blaming it all on Microsoft at all. This type of problem is hard to resolve in a way that pleases everyone. I’m just saying that Microsoft’s approach is one that consistently fails to meet my needs.

Anonymous Coward says:

Re: Re: Re: Re:

Warning that it has performed inconsistent conversions on a column would be useful – it is very likely to mean that either it has munged something that happened to look date-like or number-like (phone numbers, SKU codes, etc.), or there was malformed data in the column which you’ll need to fix.

MrTroy (profile) says:

Re: Re:

Why do you think Excel isn’t a database? And even if it isn’t a sufficiently generic database, there’s no guarantee that the front-end for any other custom-built database won’t have any data entry issues.

Why do you even think you know what the requirements for the job are? Presumably being able to share the document around is one of the requirements, and databases aren’t usually very good at that.

Also, from the article: For example, this type of error occurs several times in even the immaculately curated LocusLink database.… use of spreadsheets doesn’t preclude the use of databases.

Donglebert The Needlessly Unready says:

Re: Re: Re:

True, there’s no guarantee. But anyone writing a database has to define the field data types, whereas Excel by default makes assumptions for you.

And no, Excel is not a database application. It’s a spreadsheet application that can do some database type work.

Excel could be used for this sort of data work, but Woadan is entirely correct that a database would be the better solution.

Michael (profile) says:

Re: Re: Re: Re:

Database:
a collection of pieces of information that is organized and used on a computer

Excel most certainly falls into the category of database applications – although I would argue that Excel is a database, a user interface, a calculation engine, and programming platform, a reporting tool, and an analysis toolset.

But anyone writing a database has to define the field data types, whereas Excel by default makes assumptions for you.

Absolutely not true. I have worked with databases in the past that were entirely made up of un-typed data, lots of them I have worked with have effectively been defined as nothing but character data or blobs, and Excel makes default assumptions, but they are only defaults and you can override them very easily and it stores everything as typed data.

Woadan is entirely correct that a database would be the better solution

This is the kind of thing that gets me furious. Without any specifications, you cannot possibly make this determination. For all we know, the data we are talking about is a single table with 10 columns and 300 rows. Excel is a great solution for that. Frankly, a “database” would be pretty useless for this “problem” as it was not the data storage engine that set the type and reformatted the values – that was the user interface.

The best solution here would be a user that knows how to use whatever software they were using to import data – and setting the type properly when they did it.

Donglebert The Needlessly Unready says:

Re: Re: Re:2 Re:

I didn’t say that Excel couldn’t be a database, I said it was not explicitly a database application. That’s why Microsoft also produce Access – which is a database app, and SQL server, which is a rdms.

Excel is a spreadsheet app. Yes, like some database apps, it holds data in a tabular format. It also typically displays data in a tabular format, which most databases don’t do (unless you run queries – an area where Excel is also weak (but, then, it’s a spreadsheet app, not a db app). It can’t cope with significantly large amounts of data. In day to day use it relies on the same formulas being duplicated again and again with different inputs. It doesn’t create indexes effectively.

A pen and paper can be a database too. So can Word. So can Photoshop. That doesn’t mean they’re database apps.

And, if you come across databases where a field that requires a specific datatype has not been defined as that data type, then the developer didn’t know what they were doing or the spec they were working from was flawed.

And we’re talking genome data here. It’s not unreasonable to assume that the dataset was very large. Otherwise, this story wouldn’t be a story.

Richard (profile) says:

Once upon a time

Once upon a time there was a spreadsheet that was different. The formulae were clearly visible (they weren’t hidden in the cells). The data items had names – not cell addresses. In short it was to existing spreadsheets as high level languages are to assembly language.

I bought a copy (one of my very few software purchases). Twenty-one years later I still use it. It is still hugely better than excel and its clones. It was (is) called Improv.

The fact that it did not take over is a tragedy of market failure. The fact that I have increasing difficulty keeping it going (confined now to a virtual xp box) is a tragedy of copyright.

MrTroy (profile) says:

Re: Once upon a time

I feel your pain, but things aren’t so bad in the rest of the world now. I know they were just two examples, but you can specify names for cells and cell ranges in most spreadsheeting software now, and at least Excel allows showing formulas instead of results across all cells. It’s entirely possible that you could use Excel (or a different competing product) in the same way that you’re used to using Improv.

JP Jones (profile) says:

Re: Once upon a time

Um, Excel can do all of that. You can show formulas instead of results, you can assign cells and ranges specific names, and has an actual (if basic) programming language behind its macros.

The thing is that Excel is designed for simple stuff. It can do more complex things but it isn’t really designed for it. Excel is great for small forms with a couple of automatically formatted outputs, or a home budget, or keeping track of documents in a small office. It is not designed for advanced statistical analysis of gene structures.

Office is designed with several specialized tools for specific purposes rather than one swiss army knife that tries to do everything. You can make a presentation in Word, a simple spreadsheet in Access, and write a report in Powerpoint. It works. It just doesn’t do it well.

There was an easy way to avoid this, which a simple Google search would have revealed…click the upper left box to select the entire sheet, and change the drop down box from “General” to “Text.” Problem solved.

I personally use Excel extensively in my own job and am very familiar with both its strengths and its limitations. Is it perfect? Heck, no…I’d love for the ability to write formulas in multiple lines, sort of like how VBA is formatted, to help keep track of coding in a complex sheet when macros aren’t an option (usually due to business rules). But overall the quick formatting of data is useful for prototyping and reformatting data, creating useful rosters and graphs, and other simple tasks when your boss asks for a quick analysis of something.

This is like complaining that using the flat side of a power drill to hammer a nail isn’t very easy. Why not use a hammer? Well, I want to use my power drill, screw hammers.

Sorry. It’s not the drill’s fault you don’t know how to use a hammer.

Kyle says:

Re: Re:

A good drop in replacement for Excel and Access can be found at

https://www.libreoffice.org/

But as a number of posters above have said, nothing beats inputting the correct information and, just as important, make sure cell-formatting is done correctly.

I suppose any spreadsheet / database program will fall flat if these two criteria aren’t met.

I remember the good old days of DBASE III. Currently, they can be found at

http://www.dbase.com/

and they are sitting at version 9

Just my – stating the obvious – 2c worth. 🙂

Anonymous Coward says:

This is a perfect example of Microsoft’s attitude of thinking they know what’s best for the user.

this is exactly the situation, and it’s not just microsoft but pretty much all software are certain they know infinitely more about what the user should be doing than the user. for instance, doubling spacing after a sentence-ending period. most websites refuse that even though at times it really aids readability. momma knows what’s right for you.

another real pain to me is headers when sorting. about half the time excel declares the top cell to be a header and doesn’t include it in the sort. if you aren’t vigilant about checking, you may get an error there.

super annoying to have to do handstands to use the stupid software. if that auto-dating feature (which annoys me, too) causes a serious error, i hope the offended party sues the shit out of microsoft.

Anonymous Coward says:

Maybe they should have checked through the formatting on the spreadsheet. Maybe they should have used a different statistical analysis software. Maybe they shouldn’t have assume that using a basic consumer level program in a basic manner would be fine for statistical analysis.

We knew better than this in 2nd year stats classes in college. In 3rd year we were using alternative stat programs with more built in functionality. This isn’t a failure of Microsoft. This is people not being careful, not doing work how they were supposed to, being cheap, and trying to find a scapegoat for their sloppy work.

Anonymous Coward says:

In general, the less options you present to the user, the better.

some years ago i was employed using the design software catia v4, which was very amendable to make the environment the most useful to a designer that it could be.

one day a fairly new user complained to me that he wished there weren’t so many things that could be adjusted. i told him i agreed exactly and that all machines should be set up the way i like mine.

Michael (profile) says:

Re: Re:

he wished there weren’t so many things that could be adjusted

Damned if you do, damned if you don’t. In my experience it has always been that each user has their handful of things that should be configurable and “everything else is just clutter that should be removed”. And guess what, it was a different handful of things for each user.

Anonymous Coward says:

The problem is that Excel converts the *value* of the cell to the numeric equivalent that represents the date, and also converts the cell format to Date. So if you format as text, you fix one problem, but not the other. “DEC1” becomes 41974, for example.

yes, but it’s even worse than that. if excel can’t discern a year, it will insert this year into that date. so the spreadsheet you make that error on this year will equate that cell to 41974, but if you go back to a 2013 table, it won’t be that number for dec1.

Anonymous Coward says:

It works fine IF you format the columns BEFORE the import.

no, it doesn’t.  i just tried it.

now, it does if i set the format of a cell or group of cells to be TEXT and enter the value into the one-line input field up top or copy/paste from a text document one item of information into that field.

problem is if i copy/paste a large amount of info, say 30 football players’ info, from a document, i can’t put it all in that one-line entry field.  i have to pick a location in the sheet and dump all that info with respect to the place i picked to put it.  when i do that it changes the format of those TEXT cells to CUSTOM and changes those 5-10 football players’ heights to 10-May.  if i force that format back to either GENERAL or TEXT, i get 41769.

it doesn’t work unless i handpick some cells to format and transfer the information one item at a time and be careful to select the correct cell and dump the info into that upper entry field. ONE HEIGHT AT A TIME.

let me spell it: S U C K S.

i tried mass dumping three players’ heights and weights into that one-line upper entry field, and it took it, but put all those heights and weights of three players into that one cell, not an array of cells like i need.

like i say, S U C K S.

nasch (profile) says:

Re: Re:

no, it doesn’t. i just tried it.

now, it does if i set the format of a cell or group of cells to be TEXT and enter the value into the one-line input field up top or copy/paste from a text document one item of information into that field.

I don’t have Excel to test it, but are you only testing with copy/paste, or are you importing from a file? Because the latter has all kinds of options that the former doesn’t.

Anonymous Coward says:

i found another way the transfer works fine.

if i have info in a spreadsheet carefully formatted so that anything that excel would misread as a date is handled as TEXT, then i can select a bunch of cells, copy/paste that info anywhere in another spreadsheet, and it will reformat the cells to be TEXT in the new spreadsheet as needed.  i don’t have to pre-format those cells.  so that’s not too bad.

it evens transfers the justification changes i had made in the original spreadsheet.  in other words, it brings the formatting with the data.

transfers from any other source than another excel spreadsheet probably have to be very carefully handled, but if i painstakingly set up and excel spreadsheet, i can transfer without losing formatting.

Zonker says:

It looks like you’re trying to insert some dates into your spreadsheet.

Would you like help?

* Convert all text into date format.

* Leave all my gene names alone, you fool!

* Yes, I desperately in need of a date! Do you have any attractive single friends who have low enough standards to go on a date with me? (Forever alone!)

[ ] Don’t show me this tip again

Add Your Comment

Your email address will not be published. Required fields are marked *

Have a Techdirt Account? Sign in now. Want one? Register here

Comment Options:

Make this the or (get credits or sign in to see balance) what's this?

What's this?

Techdirt community members with Techdirt Credits can spotlight a comment as either the "First Word" or "Last Word" on a particular comment thread. Credits can be purchased at the Techdirt Insider Shop »

Follow Techdirt

Techdirt Daily Newsletter

Ctrl-Alt-Speech

A weekly news podcast from
Mike Masnick & Ben Whitelaw

Subscribe now to Ctrl-Alt-Speech »
Techdirt Deals
Techdirt Insider Discord
The latest chatter on the Techdirt Insider Discord channel...
Loading...