Why Excel causes errors in gene studies and how to fix

A new report shows that Excel is causing errors to creep into genetic studies because of data conversion problems.

Excel tries to figure out the type of information you are typing or importing into a worksheet. Those rules work most of the time, but not always.  If Excel converts a cell wrongly, it changes the value with no simple ‘Undo’ available.  That means the results of any analysis will be wrong and, if there’s enough errors, then the whole genetics study could be wrong.

When you’re talking about hundreds or thousands of rows/records being imported then it’s easy to overlook mistakes.  See this report from 2004 which shows Excel conversion errors carrying through to published reports.

Source:  Zeeberg, Riss et al.

What’s going on?

In genetics they have short names for the substances they work with.  ‘Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase‘ is known by the catchier title MARCH1.

That’s great until you import data into Excel, which sees a value ‘March1’ and converts it to a date.  Even the exact date Excel gives isn’t certain because it depends on the date format set in Windows.  ‘MARCH1’ could become 1st March or March 2001.

There’s also RIKEN Identifiers (no?, me neither) which can be mistaken for large numbers.  ‘1234E5’ is converted to a large number by Excel.

Here’s some real examples of ‘bad’ conversion done in Excel taken from the abstract to the report.

The reports authors have published code to check studies and verify their analysis.

According to the report, Excel, LibreOffice Calc and Apache OpenOffice Calc all can mistakenly convert incoming values.  Interestingly. Google Sheets does not.

The Solution

The problem isn’t with Excel, it’s how people are importing data into a workbook. Gene data is often shared in plain text .CSV (comma delimited files). When a .CSV is opened, Excel does it’s best to convert the data into Excel style cells (Number, Date etc.) based on the content of each field in the .CSV.

But that’s NOT how Excel users should import data, never has been. Always, always import using the tools on the Data tab. Get External Data | From Text or, in more recent Excel’s Get & Transform.

Either way, you can control how Excel formats each column (field) of the imported data. In particular, ensuring the gene data is treated as Text.

Gene names changed to stop confusing Excel

Excel’s General problem that messes up what you type

Great text importing tricks for Excel

Import PDF’s into Excel and refresh later

Excel’s much better data import features