Scientists Rename Human Genes To Stop Microsoft Excel From Misreading Them as Dates (theverge.com) 217
There are tens of thousands of genes in the human genome: minuscule twists of DNA and RNA that combine to express all of the traits and characteristics that make each of us unique. Each gene is given a name and alphanumeric code, known as a symbol, which scientists use to coordinate research. But over the past year or so, some 27 human genes have been renamed, all because Microsoft Excel kept misreading their symbols as dates. From a report: The problem isn't as unexpected as it first sounds. Excel is a behemoth in the spreadsheet world and is regularly used by scientists to track their work and even conduct clinical trials. But its default settings were designed with more mundane applications in mind, so when a user inputs a gene's alphanumeric symbol into a spreadsheet, like MARCH1 -- short for "Membrane Associated Ring-CH-Type Finger 1" -- Excel converts that into a date: 1-Mar. This is extremely frustrating, even dangerous, corrupting data that scientists have to sort through by hand to restore. It's also surprisingly widespread and affects even peer-reviewed scientific work. One study from 2016 examined genetic data shared alongside 3,597 published papers and found that roughly one-fifth had been affected by Excel errors.
You know (Score:3, Insightful)
Re:You know (Score:5, Insightful)
On each computer the data touches, even temporarily. I think it would be great if you could share or load presets for Excel easily with a single import file. A registry file could do it for Windows. But it's something that has to be remembered. Every. Single. Time. And if you jump from device to device, that gets painful fast.
Re: (Score:3)
Re:You know (Score:5, Insightful)
Addendum: You can also type ="MARCH1" without formatting.
This just comes down to people not knowing how to properly use computing tools.
Re:You know (Score:5, Insightful)
Yes, because when you're churning through large data sets, you're going to be typing the data in line by line. CSV imports are going to be way more common. Try opening a CSV file with a value of MAR1. Try reformatting it before it gets corrupted.
Re:You know (Score:4, Insightful)
...Try opening a CSV file with a value of MAR1. Try reformatting it before it gets corrupted.
And it is merciless on leading zeros in CSV files. I have hated Excel's data arrogance since the dBase days. "Did you edit this DBF in Excel? The dates are all corrupted."
Re: (Score:3)
There are lots of people too who complain that a database isn't reasily readable in Excel, as if they think Excel is just another view into a database (or those for whom Excel is the one and only computing tool). This much data should never go into Excel.
Re:You know (Score:4, Insightful)
Yes, because when you're churning through large data sets...
If you are churning through large datasets why on earth are you using CSV and Excel? CSV is convenient for small datasets but it is a massively inefficient data format and Excel itself is terrible for analyzing large scientific datasets because it requires so much duplication of formulae and is incredibly slow compare to python or compiled code.
Re: You know (Score:3)
âzLargeâoe might be as few as 1000 rows. Still well within Excel capabilities but where you cant really manually fix all the data.
Re:You know (Score:5, Interesting)
You've got a grant to sequence and study a gene; are you going to spend that money writing custom Python code? What will you use to generate publication-quality charts and graphs from the datasets? Excel is ubiquitous, and basically free.
People have requested the ability to limit data conversion on CSV import for decades (why, for example, can I change the delimiters for importing data, but there is no 'Do Not Convert to a Date' option?). MS has a deeply entrenched hatred of third-party compatibility.
Excel was extended years ago to allow for 2^20 rows and columns, but still warns you if you try to plot more than 1,024 points of data that 'this could take a while'.
Re: (Score:2)
Re: (Score:3)
Is this why when I buy a camera it has "RAW mode" with the actual CCD output available? Because the simple output of the available instruments isn't important without first torturing and corrupting it with shoddy software translations?
Re:You know (Score:5, Informative)
Re:You know (Score:5, Insightful)
Practically everything under the sun can output data in CSV format. That's the source format, not what they're "using" it in. CSV imported to Excel because the massive number of people with basic spreadsheet skills and the equally massive amount of freely available scientific macros and templates.
Is it the best tool for the job? Probably not. But in most cases it is good enough, especially for anything "quick and dirty".
Re: (Score:3, Interesting)
CSV is for interoperability and reportability regardless of how inefficient or stupid it is. We get software that simply can't read Excel format and Excel cannot read their format either. That or we hit the Excel row limit and it breaks their analysis. We have to use something else to port it due to the end-user not caring that it's dumb. They want Excel. They hate HTML, PDFs, XML, JSON, small datasets, or whatever. We do what we have to do to meet the requirements and get paid. Bad requirements dri
Re: (Score:2)
Re: (Score:2)
Re: (Score:3)
Re: (Score:2)
Alternatively, you could cut and paste a bunch of junk from some other source, in that case the typical approach is to again provide meta-data about the columns (e.g. set the formatting for any columns that might get interpreted incorrectly).
And the simplest approach, if you just want to eliminate all of the rules, is to set all cells
Re: (Score:2)
I don't even know how to find the import function. When I double-click on a CSV or use the open menu, it just opens immediately without the dialog. I'm not sure when this changed.
Re:You know (Score:4, Insightful)
1 - Double click
Use this method when you know that your data will not be interpreted incorrectly
2 - Import and provide info about the coumns
Use this method when you are not certain that your data will be interpreted correctly
3 - An alternative to #2 is to format the columns and then cut and paste the data after formatting.
Something I do frequently is paste the data first so I know which columns are which, set the formatting on any columns I am concerned with, then paste again and the data is refreshed with proper interpretation.
The method used must match the use case.
Changing the interpretation rules to accommodate a minority use case will create more frequent issues with the majority use case.
Re: (Score:2)
This import wizard only happens when you open from the file menu AND it detects the file as delimited (at least on my system - Excel 16/Mac) otherwise it seems to open the file using defaults.
It DOES NOT present options if:
- you open the file from the standard OS handling for the file type or launch it from another app (eg. open attachment in email)
- you open the file using a context-click "Open With" from the OS
- the file is incorrectly delimited for the extension (tab delimited but the extension is CSV in
Re: (Score:3)
3. Google how to import a CSV file properly (i.e. import the file instead of just opening it) before complaining that Excel corrupted your data.
Re: (Score:2)
When you import a file into Excel, there are a series of dialogs where you provide meta-data about the columns (e.g. what type of data it is). This eliminates the problem.
It does not. There are literally hundreds of edge cases of where data will be mishandled even if you set it to text on import. Hell my favourite use the tableau web interface and export the data as a CVS file and import it into excel. The ability to handle USA dates will be broken if you used internet explorer to export the data, but not if you use Firefox due to something as simple as the locale and metadata generated.
Re: (Score:2)
Or, you could use Excel's Power Query to import all your files. It's been built into Excel since 2010. You can tell it exactly how to import your CSV or other files. Drop columns you don't want. Add new custom columns based on the data. Set data types for each column. And a hundred other things can be done and automated.
They practically built a SQL engine into Excel and no one is aware it's there! All your data can be loaded into data models, do table joins between them, build table relationships so you
Re:You know (Score:5, Informative)
This is actually why I prefer using LibreOffice over Excel, even though I have Excel provided for me. Libre does a much better job of handling CSV/TSV files - always bringing up the import dialog with options instead of just reading the file with defaults.
It makes it much easier and more reliable when I need to export subsets of data from the database and get it to "office workers" who need to use it.
Re: (Score:2)
If you don't like the software, that's fine. You do you.
But one of them actually works correctly for the task I often do, the other doesn't. That's why I prefer one.
dominance and submission user interface [Re:You... (Score:3)
Addendum: You can also type ="MARCH1" without formatting.
This just comes down to people not knowing how to properly use computing tools.
Yes, Excel has what's known as a dominance and submission user interface." The user will do things exactly the way the master instructs them, they will do this exactly in every detail, they will be punished for any mistakes, they will not ask questions.
Re:dominance and submission user interface [Re:You (Score:5, Insightful)
If Excel were like that, it would not be so heavily used around the world.
Oh, you naive person... "If C++ were a horrible language allowing for copious memory errors, it would not be so heavily used around the world." "If leaded gasoline were a toxic substance poisoning people wherever it's burned, it would not be so heavily used around the world."
Re: dominance and submission user interface [Re:Yo (Score:2, Interesting)
I acknowledge a certain amount of network effect, but C++ is slipping now that it has better competitors, and Oracleâ(TM)s bluster has not stopped the rise of other databases. When your UI really abuses users, alternatives rise. Excel has largely avoided that, and I cannot attribute all of that to MS sales pushes.
Re: (Score:3)
Which I turned off yesterday because no matter what I typed, Excel thought it knew what I wanted and put something else in its place, even after I fully typed what I wanted.
I have gotten so tired of being harassed by anything Microsoft does (among others) that I turn off every "helpful" tidbit just so I can get my work done or do what I need. Even the latest incarnations of Firefox foisted on me at work gets in the way when
Re: (Score:2)
You could write 'MARCH1 as well which is an indication to Excel to always handle a field as text. Basically though this is impossible to do if Excel is not the source data since there's no option in excel to automatically import data with a comma at the front. Having a text field also doesn't help.
Also your formatting options only work on your instance (actually it may be instances across a common locale). Strict formatting gets reset if you go over a different locale and someone views the sheet, excel does
One more reason to hate Excel [Re:You know] (Score:2)
You can format it by hand, yes. And that would solve the problem if only you ever use the data on that spreadsheet, and you do it on every spreadsheet you ever use and never forget, even once. And assuming you never transfer your data to somebody else who uses it in some other spreadsheet.
Fucking Excel, I hate it. If I want a cell to be a date I should be the one to format it as a date, it shouldn't try to second guess me.
Re: (Score:2)
Re: (Score:2)
You are a very rare user. Excel second-guessing users is why it is so heavily used. Most users rely upon it heavily for correct data entry, the same way they rely upon Word to autofix their spelling.
Re:You know (Score:5, Informative)
These problems usually arise when someone copies data from one spreadsheet and pastes it in another. I worked on a project that had terrible problems with dates getting changed by a fixed offset due to different date standards, to the point where we were supposed to verify each date with a PDF (yeah, right). It seemed like a horrible bug in Excel, until I watched the project coordinator doing the copying and pasting one day, between two files that were set up to use different date standards.
Morals of the story, in order:
1) don't use spreadsheets for critical data
2) know how your tools work
3) don't copy and paste data between spreadsheets
Re: (Score:3)
Morals of the story, in order:
1) don't use spreadsheets for critical data 2) know how your tools work 3) don't copy and paste data between spreadsheets
That all sounds great, but it isn't how science works in the 21st century. For your first point, you need to be aware of what tools are the most common. Yes, there are many tools that are better than a spreadsheet for scientific data but if you have the tool and your collaborator doesn't, you'll just end up going back to a spreadsheet because you both have that. Similarly if you are submitting a manuscript for publication, you need to submit your data in a format that the journal accepts (and that their
Re: (Score:2)
Morals of the story, in order: 1) don't use spreadsheets for critical data 2) know how your tools work 3) don't copy and paste data between spreadsheets
The second though is also a matter of how other people use the tools you use, what version of the tool they use, and other factors beyond your control. People earlier in this thread mentioned ways to keep Excel from turning "MARCH1" into "3-1", and some people have likely done that for their own work. But then what happens when the spreadsheet is opened on another PC? Back to "3-1", of course... /p>
That's not true - it's a setting in each spreadsheet. One you format a cell it stays that way. Like has been said before - Know your tools. It's easy to fix, and easy to spot. Of course, if you insist on using CSV - you should be using comma quote delimited, which would also avoid the problem by turning it into a string.
Re: (Score:2)
This type of problem can't be eliminated by changing the rules because it just pushes the issue from one set of users over to the other set of users that wanted that interpretation.
Column meta-data is the way to solve the problem, and is the way it's been handled for decades.
Re:You know (Score:4, Funny)
Exactly. I have no idea why people keep saying you have to do it on every computer, the column meta-data is stored within the spreadsheet.
Perhaps because, you ignorant twat, a CSV file is not "a spreadsheet" nor does it encode this magic meta-data you think magically appears within them.
Re: (Score:3)
"so when a user inputs a gene's alphanumeric symbol into a spreadsheet,"
Re:You know (Score:4, Informative)
The workflow for a lot of this work will likely go like this:
If they're lucky it only breaks 4 times in the main lifetime of the data (not including long-term storage of course). We've all seen Microsoft Excel eat shit on its own data before, and it will continue to do so until the heat death of the universe. There are too many variables that cause too many incompatibilities to count on this fix working and sticking. This is why the community agreed it's best to just rename the genes so this won't happen going forward - genes get renamed all the time (for many other reasons) so this is not that big of a change of course.
Re: (Score:2)
Hi, I'm a working scientist. I know how it works. I also know about all the errors that get made because that's how it works.
Trust me, when I tell people not to use Excel, there's a *very* good reason for it.
Re: (Score:2)
Spreadsheets should have a "scientific mode" which stops this kind of thing. Engineers would find it very useful too, we often use spreadsheets for stuff.
Re: (Score:2)
This is a bug in Excel. When you copy to the clipboard the data gets copied as simple text and in the native format of the application. In this way the native application can include any data that helps identify what data is text, numbers and dates and what standard the date is in and correctly format the data when pasted. So this is a bug.
Re: (Score:2)
And turning it off would probably lead to more errors for the people entering actual dates elsewhere in the spreadsheet.
Re: (Score:3)
Not just errors. If I type 3/1 I absolutely want it understood as 3/1/20 by default. Saves me lots of typing.
Re: (Score:2)
no dingbat just format the cells and it saves it in this magical thing called a file
Re: (Score:2)
Re: (Score:2)
True, but you just need someone to forget it once and it will somewhat painful to correct. Changing the names is likely the best option.
Re: (Score:2)
yea but research scientist are too dense to even consider that possibility let alone figure it out. These are people who spend their entire career cataloging data, but not actually doing anything with it
Re:You know (Score:4, Insightful)
"You can turn that feature off in Excel to stop that from doing that."
You can also use a fucking database to store data.
Re: (Score:2)
fact they are using excel and not a ledger book is amazing
Re: (Score:2)
Re: (Score:2)
This.
One could argue that it's actually bug in Excel.
There are certainly times when the feature is desirable, but for those occasions explicitly doing a Format -> Date on a column or cell should be more than simple enough to use as the preferable way to go.
Re: (Score:2)
Problem is Use of Excel (Score:3)
Re: (Score:3)
Except you can't turn it off in the Excel file. You send that to someone and they send it back, congratulations your data has been fucked.
Even better if that person happens to open it on O365 Sharepoint service because then he doesn't even need to save the file, just opening it will do the required damage.
Re: (Score:3)
Re: (Score:2)
Nice job M$
Can turn you it off? If not why? (Score:2)
Re: (Score:3)
No, it isn't. It's a symptom of a failure to adequately separate data representation from data presentation.
Re: (Score:3, Insightful)
The problem is that for every person that despairs at Excel for screwing up their precise inputs, there are a dozen that would be upset at Excel for failing to understand what they meant rather than what they (or someone else typed).
In an ideal world, everyone wouldn't use Excel for everything.
Re: (Score:2)
No, this is not a failure of separation of representation from presentation. This is happening at the time of data entry. The whole point is that Excel is interpreting the user input in order to figure out what the representation should be. The original thing that the user types is not intended to be the representation. How that data is presented -- 11-Mar, Mar 11, 3-11, 11-3 -- is a separate thing entirely.
Re: (Score:2)
Re: (Score:2)
So why you blaming the application? You can blame the people who chose the application for their specific dataset, or you can blame the people who chose the particular file format for this particular data, or you can
Re: (Score:2)
Either with humans entering the data, or by importing/reading a CSV (or other delimiter of choice) file and NOT setting the column formatting to "text" instead of "general".
On a CSV import to Excel, the "general" setting is kinda like a variable in PHP - untyped, and it will try to type it as just about anything other than a simple string. So under that process, MARCH1 becoming March 1 almost makes sense.
And it ain't just genomes - courses I work with include a basic marketing course (MAR2011) which always
Re: (Score:2)
Representing MARCH1 by a date format isn't a problem with failure to adequately separate representation from presentation; it's a problem with Excel defaulting to a highly optimized data representation. If you look, you'll discover that a date like Mar-1 is actually stored as an integer representing the number of days since (IIRC) 01 JAN 1900, with the cell marked with formatting telling it how to display the date.
This is actually a very smart way of storing dates. It makes it very easy to do lots of th
Re: (Score:2)
It's understandable why MARCH1 would become Mar-1st (or any other date formation equivalent), but can't you turn this off? If you can't turn it off, this seems like a rather weird oversight, and something that wouldn't be hard to build in rather quickly as a setting.
You can turn it off on individual cells, but you can't tell Excel not to do it at all.
Re: (Score:2)
This is entirely backwards, IMO...as once Excel has done this, the original data which was imported into the cell that it decided to present in a different format than may have been intended can be lost. Arguably the only data that Excel should EVER be automatically formatting as anything other than plain text is content that can ALWAYS completely losslessly converted back into plain text if desired, producing the exact text that was originally imported into the cell. If a user desires a column (or field
Re: (Score:2)
I maintain the point, if data cannot be losslessly converted back into the original text that was entered for the cell, then it should default the format to plain text.
This means that it is fine for the software to try and automatically format the presentation of the data for whatever format it thinks is suitable when it appropriate, as long as the plain text representation remains available when it is desired. So if I enter "MARCH1" in a cell or column that I have not already expressly formatted, it mi
Genes? (Score:4, Interesting)
I wonder if a spreadsheet is the best application (Score:3, Insightful)
How about stop using a spreadsheet for what a database would likely do better?
Or at least learn how to use the spreadsheet correctly before using it for purposes like this.
Yes, both options require some effort in learning something outside of the investigatory domain.
Re: (Score:2)
I doubt that freeform data manipulation is better handled by a database. Databases like defined structures. Building one-off databases just for data manipulation is overkill.
Re: (Score:2)
That's why there are platforms like Knime to integrate and manipulate data.
Re: (Score:2)
How about stop using a spreadsheet for what a database would likely do better?
You don't know what they're using it for, so you have no reason to think it's better to use a database. I could easily see a spreadsheet using numerical data for calculations, with the column headers the name of the gene that the data corresponds to.
Re: (Score:2)
It may be the most use database format ever.
Took me about 10 seconds to find a solution. (Score:3)
If you proceed the value by a single quote, it treats it as a string.
Also you can format the cells to string.
People just seem to be afraid to look for solutions to their problems when computers are involved.
Re: (Score:3, Insightful)
The problem isn't that stuff gets mangled when people enter it manually; if they were typing in "MARCH1", they would notice when it was converted to Mar-1. The problem is they're importing big data files, probably in comma or tab separated text, and Excel is mangling stuff behind their backs. This is particularly bad with .csv files, since Excel will auto-open them without asking how the data is formatted internally.
Now you certainly can blame the people for using the wrong tool or using the tool they h
Re: (Score:2)
The current rules of interpreting ambiguous data work very well for the vast majority of users (business), but they aren't perfect and do require manual intervention sometimes (e.g. UPC's and UCC128's converting to scientific notation).
Re:Took me about 10 seconds to find a solution. (Score:4, Insightful)
If you proceed the value by a single quote, it treats it as a string.
How do I do that when I import the data from a database? Or an auto generated CSV file? Sure I can set it as text when I import (sometimes that works, sometimes it doesn't, there are many edge cases where this will fail to be the fix) but then as soon as I save the file and hand it to someone else it gets boned. Move the file to a PC with a different locale, a lot of the fields get reset thanks to excel being "helpful".
People just seem to be afraid to look for solutions to their problems when computers are involved.
Not sure if that's better or worse than those who fail to understand the problems in the first place and blame everything on the user. The reality is Excel has a lot of settings that are *app* dependent rather than *workbook* dependent, and while I may be in control of my own data it can quickly become mangled if I pass it on to someone else.
Flowchart in wrong direction (Score:2)
Re: (Score:2)
Re: (Score:2)
They have corrected their data representation to be compatible with the tool in question.
Reminds me... (Score:5, Funny)
Reminds me of the old joke:
Q: What do incels and excel have in common?
A: Both have trouble recognizing that something isn't a date.
Hmmm (Score:2)
Libre Office is also a PITA (Score:2)
Even if I pre format the cells to display as dates, when I type for example 1/4 into a cell the damned thing changes the three characters into a single ¼ character. If I wanted a ¼ character I'd find it for myself. So far I've not managed to find a way to turn this bloody stupid behaviour off.
If I type 7/4 it displays what I wanted i.e. 07/Apr.
I can get it to work by typing 01/4 but when 7/4 works for 7th April why does some smart-ass programmer have to make it not work the same for 1/4? I don't k
Re: (Score:2)
Even if I pre format the cells to display as dates, when I type for example 1/4 into a cell the damned thing changes the three characters into a single ¼ character. If I wanted a ¼ character I'd find it for myself. So far I've not managed to find a way to turn this bloody stupid behaviour off.
If I type 7/4 it displays what I wanted i.e. 07/Apr.
I can get it to work by typing 01/4 but when 7/4 works for 7th April why does some smart-ass programmer have to make it not work the same for 1/4? I don't know. No wonder software is so bloated these days.
Tools > Autocorrect Options
See? See? Microsoft products are a VIRUS! (Score:4, Insightful)
USE PROPER SOFTWARE (Score:2, Insightful)
Excel isn't a database (Score:3, Informative)
repeat after me. Excel is not a database.
I am baffled by people using excel for things that should be in a database. Makes no sense to me. Spreadsheets make sense for finance usage, but even then, most robust finance application use a database.
"Everybody else is doing it" is a horrible excuse for using the wrong tool for the job.
Re:Excel isn't a database (Score:5, Informative)
Except that it is these days. They added PowerQuery back in 2010 and it can literally do almost everything SQL server can do. You create queries that load your source data (from almost anywhere and any format), do a ton of conversions, filters, table joins (left, right, full outer, etc), and processing on your data to get it exactly as you need it, then it stores it in a data model (database) in the file. You can even go far past the million row limit in a sheet.
Then, you tell it column relationships between all these tables in the data model. And build new queries based on those tables, or pivot tables to analyze, etc. And then, output your final data to a sheet or charts for people to look at.
Re: (Score:2)
Re: (Score:2)
You just have to enclose all the alphanumeric columns in quotes. Which either requires some creative text processing or software that is aware when creating them.
Re:Mark field as... (Score:4, Informative)
I tested it. Excel will convert "MAR1" to a date just the same as it converts MAR1. This really is a bug in Excel.
Spreadsheets in general are bug filled horrors and shouldn't be used for important data.
Re:Mark field as... (Score:5, Informative)
Apologies if English isn't your first language, but you really need to read more carefully before you reply. I said I tested it. Excel converts "MAR1", in quotes, in a CSV file to a date, the same way it converts MAR1, without the quotes. Your statement is false. This is an example of one of the many, many bugs in Excel that have persisted for decades.
Re: (Score:2)
You just have to enclose all the alphanumeric columns in quotes.
Nope. Quotes do not define the field type in CSV files. Quotes control parsing behavior.
Re: (Score:2)
Which turns them all into strings
Nope. it turns strings into strings, numbers into strings, dates into dates, times into times, and who knows what for currency.
which is the asked for behavior.
At least you got one thing right.
I'm not sure how answering the question in a way that solves the problem is wrong :)
The part where you are ignorant and wrong.
Re:Sad (Score:5, Funny)
Re: (Score:3)
We need a new mod: Too true to be funny...