MATLAB can’t read plain text data out of a wet paper bag.
I’m working with someone, and they asked for some of my intermediate data. In the interests of what I thought would be maximum interoperability with whatever data analysis system they preferred, I gave them a .csv file.
It looked something like this:
"target_spacing","subject","sensitivity","direction_content"
2.0943951023932,"as",4.38700587042417,0.1
2.0943951023932,"cj",3.21032026169591,0.1
2.0943951023932,"jb",6.29391042968021,0.1
2.0943951023932,"je",-0.0328585668332463,0.1
2.0943951023932,"ko",-0.431840201907717,0.1
2.0943951023932,"mc",-4.49124425170629,0.1
2.0943951023932,"ml",-0.367323617983639,0.1
2.0943951023932,"nj",2.8780230556088,0.1
Simple enough, right? For example, if you were using R, and you wanted to do a scatterplot of “sensitivity” versus “target_spacing,” symbols coded by subject and colors coded by direction content, you do something like:
library(ggplot2)
data <- read.csv('dataset.csv')
qplot(data=data, target_spacing, sensitivity,
color=direction_content, pch=subject,
geom="point")
On the other hand, the MATLAB script I got back from this person looked more like this:
[ num, txt, data]=xlsread('datafile.csv', 'A2:Q192');
sublist= unique(txt(:, 5));
spacinglist=unique(num(:, 4));
colorlist=hot(length(dirlist)+3);
symlist={'o', 's' '*' 'x' 'd' '^' 'v' '>' '<' 'p' 'h' 'v' '>' '<' 'p' 'h'};
for s=1:length(sublist)
for d=1:length(dirlist)
for sp=1:length(spacinglist)
ind=find( num(:, 3)==dirlist(d) & num(:, 4)==spacinglist(sp) & strcmp(txt(:,5),sublist{s}));
if ~isempty(ind)
plot(spacinglist(sp), num(ind, 6), [symlist{s}], ...
'MarkerSize', 5, 'Color', colorlist(d, : ),'MarkerFaceColor', ...
colorlist(d, : ));
hold on
end
end
end
end
Well, this is fairly typical for MATLAB code as it is found in the wild Give people matrices and they use explicit numeric indices for everything. Aside from the difficulties of making marker shape and color covary with dimensions of the data, you have to open up the file in Excel and count along its columns to see what variable they think they’re plotting (and after some head-scratching it turns out they weren’t, actually, plotting what they thought.)
The factor-of-three reduction in code size on asking R to do the same thing MATLAB does is pretty typical too.
One of the very useful features of R is that you can assign names almost everywhere you would use an index. So, you never have to worry about whether column 4 is “target_spacing” or something else. You just say “target_spacing”.
For example, let’s say you have some nice rectilinear data, like this cross-tabulation of hair color, and eye color, and sex in a group of students:
data <- data(HairEyeColor)
> HairEyeColor
, , Sex = Male
Eye
Hair Brown Blue Hazel Green
Black 32 11 10 3
Brown 53 50 25 15
Red 10 10 7 7
Blond 3 30 5 8
, , Sex = Female
Eye
Hair Brown Blue Hazel Green
Black 36 9 5 2
Brown 66 34 29 14
Red 16 7 7 7
Blond 4 64 5 8
This is just a 3-D array, like Matlab’s 3-D arrays (Interestingly, Matlab only added multi-D arrays after someone got fed up with the lack of them and went off to write the Numeric package for Python.) And as an aside, NumPy and R have consistent rules for indexing in N-D (where N can be 1, 2, 3, or more), while MATLAB forgets about 1 dimensional arrays entirely, and as for consistency, utterly screwed it up.
Ahem. As I was saying, unlike an array in Matlab, arrays in R can have nice, human-interpretable names attached to their rows, columns, and slices. You can see them in the printout above, or get and set them explicitly with dimnames:
> dimnames(HairEyeColor)
$Hair
[1] "Black" "Brown" "Red" "Blond"
$Eye
[1] "Brown" "Blue" "Hazel" "Green"
$Sex
[1] "Male" "Female"
An array with dimnames, allows you to access elements by name, not number. So if you want to slice just the blond, brown-eyed people in this sample, you can just say:
> HairEyeColor['Blond', 'Brown',]
Male Female
3 4
That’s the same as writing HairEyeColor[4, 1,], only you can actually see what it’s trying to accomplish.
Now, I wish that you would be able to go a step further and write HairEyeColor[Eye='Brown',Hair='Blue',], and not worry about which order the dimensions come in, but R’s not perfect. Just useful. Actually, you can do that sort of thing with PANDAS, a Python library billing itself as “R’s data.frame on steroids.”
Meanwhile, if you pay an additional tithe to the Mathworks, you can get the Statistics toolbox, whose “dataset” class is more or less R’s data.frame with hyponatremia. (No ‘NA’, you see.)
Anyway, if you ever ask me to remember that “Female” is 1 (in this dataset) and “Hazel” is 3, well, look, remembering arbitrary correspondences between names and numbers is something humans are just really bad at and computers are very good at, OK? If you’re writing analysis scripts and you find yourself flipping back to the speadcheet to count columns… just don’t. Why would you do a job the computer should be doing for you?
Being able to refer to things by name makes your code more likely to work. For instance, in the previous example, a humanely designed system would be able to look at a statement you’ve messed up, like HairEyeColor[Eye='Brown',Hair='Blue',] and come back with "uh, there's no such thing as "blue hair" in this dataset." Which is miles better than coming back with the results you didn’t want.
Okay. Before I had the first gin and tonic and decided to cover a topic or two on the syllabus of Stuff That’s In Every Useful Programming Language Except MATLAB 101, I had this script someone sent me, that read in some data from a CSV file I’d sent them. And they were using numeric indices into the data because they had just loaded in the data as an array, using xlsread, which which doesn’t do anything useful about column headers. But you ought to be able to load each column of data into a separate field of a struct, use the column headers as struct field names, and refer to them by name that way, you know, and that’d be doing pretty good for MATLAB. So I was planning to tweak this code and send it back with a note about “here’s a nice way to do it better and let the computer take more of your mental load” (this person teaches a course on MATLAB for scientists, you see, so I want to slightly reduce the fucking brain damage that gets propagated out into the academic world.)
All you’d have to do is, instead of reading a CSV as a matrix, use the function that reads from a CSV file and uses the column headings to assign the fields in a struct. You know, that function. The one that does the single bleeding obvious thing to to with CSV files. You know, the CSV-reading function. I mean for all I rant about it, people get work done with MATLAB. It’s just impossible that Matlab can’t read the world’s most ubiquitous tabular data format usefully. Right?
Well, let’s try it.
The first thing I find is csvread. Aside from being deprecated according to the documentation, there’s another problem in that it only reads numeric data. Now, some of the columns in my file have things like a human observer’s initials, or categorical data that’s better expressed to humans with labels like “left” or “right” rather than trying to remember which one of those correspond to zero and 1. (R has a built in “factor” data type to handle categorically enumerable data, while MATLAB has bupkis.) So, csvread can’t cut it, because it only handles numeric data. Same problem with dlmread.
Next up we have xlsread. That’s what my collaborator used to begin with. Maybe it has an option to get the column names. Well, it won’t even read the file on my MacBook. Nor the Linux cluster we have in our lab. Ah, see, xlsread only reads a CSV file if it can farm its work out via a goddamn COM call to a motherfucking installed copy of Microsoft Excel, and it only knows how to do that on %$)@&#%%..% Windows. And, even if my computers met those conditions, xlsread doesn’t read a file with more than 2^16 rows. Man, I’ve got more than 2^16 rows sitting here just from asking people to look at things and press buttons. Lord help me if I ever have a real dataset.
CSV, you know, one of the world’s most ubiquitous, plain-text, human-readable file formats.
What next? There’s importdata which purports to DWIM the reading of tabular data. And there’s the “Data Import Wizard” which just turns out to be a wrapper for importdata.
Except importdata doesn’t handle the way quotes are used in CSV files. Even if that weren’t a problem, it doesn’t work at all. It detects that there’s a header row but it doesn’t actually give me the field names–why? Some experimentation reveals that it’s, again, completely incapable of handling non-numeric data in columns — even though it purports to put out separate ‘data’ and ‘textdata’ results! Here’s how ‘importdata’ mangles a perfectly straightforward file:
>> type testfile.txt
Height,Width,Depth,Weight,Label,Age,Speed
95.01,76.21,61.54,40.57,Charlie,20.28,1.53
23.11,45.65,79.19,93.55,Echo,19.87,74.68
60.68,1.85,92.18,91.69,Delta,60.38,44.51
48.60,82.14,73.82,41.03,Alpha,27.22,93.18
89.13,44.47,17.63,89.36,Romeo,19.88,46.60
>> [a delim nheaderlines] = importdata('testfile.txt')
a =
data: [5x2 double]
textdata: {6x7 cell}
delim =
,
nheaderlines =
1
>> a.textdata
ans =
Columns 1 through 6
'Height' 'Width' 'Depth' 'Weight' 'Label' 'Age'
'95.01' '76.21' '61.54' '40.57' 'Charlie' ''
'23.11' '45.65' '79.19' '93.55' 'Echo' ''
'60.68' '1.85' '92.18' '91.69' 'Delta' ''
'48.60' '82.14' '73.82' '41.03' 'Alpha' ''
'89.13' '44.47' '17.63' '89.36' 'Romeo' ''
Column 7
'Speed'
''
''
''
''
''
>> a.data
ans =
20.2800 1.5300
19.8700 74.6800
60.3800 44.5100
27.2200 93.1800
19.8800 46.6000
So, it detects the delimiter and the single header row, but it doesn’t give back column names…why? The ‘textdata’ is full of perfectly reasonable numeric strings that haven’t been converted into, y’know, numbers, but some of them have been blanked out. The ‘data’ pulled out a minority of the numeric data but gives you no idea which columns it pulled out for you, and that’s the best that I’ve seen so far.
The File exchange was not helpful. csv2struct was just a wrapper for xlsread (requires Excel on Windows, limited to 65,535 rows.) txt2mat claimed to be ultimately versatile and able to handle mixed-datatype CSV files, but rejected everything I gave to it, unless I threw enough regular expressions at its options that I might as well have written my own CSV parser.
So I ended up writing my own fucking CSV parser. And at several points I got waylaid by things like:
-
textscanwill skip over blank fields when they occur at the end of the line (which are common in actual data), and if there isn’t a consistent number of fields (that it doesn’t skip) per line, it will cheerfully forget line boundaries for you. So you need to do conversion line-at-a-time. -
There’s no good way to convert a cell array of strings to numbers,
str2doubletries, but it outputs NaNs whenever there’s an empty string, or anything else it can’t convert. So there’s no way to tell whether some converted value is NaN because the file said “NaN” versus whether some value is NaN because the file said “booger police.” See, the thing is, NaN is a value in the IEEE 754 system that is used to represent undefined values, invalid operations, and the like. The purpose of NaN is to signal problems that happened with your arithmetic. NaN is not a missing value marker, unless you really want something to to obscure places where your math is going wrong. (This is why R allows explicitly missing — NA, not NaN — values, in any vector — not just floats.) -
MATLAB’s version of
sscanfcan’t apply an alternate locale. If you ever interact with people outside the US you will see some CSV type files with conventions like: “,” for the decimal separator, “.” for a thousands place separator, and “;” for the field delimiter. That’s why the whole LOCALE facility in the C standard library exists. R provides the ability to set the locale for the purposes of reading such a file; whereas MATLAB’s documentation explicitly forbids setting the locale, even in a MEX function. -
Speaking of MEX functions, I might have gotten this done faster if I had gone that route and done the parser in C/flex/bison like a grownup, instead of expecting MATLAB to be any help at all in doing stuff like converting several strings to numbers.
So, as you see, reading a CSV file into MATLAB entails a whole lot of bullshit.
By comparison, here’s how you read that exact same file in R.
> x = read.csv("testfile.txt")
> x
Height Width Depth Weight Label Age Speed
1 95.01 76.21 61.54 40.57 Charlie 20.28 1.53
2 23.11 45.65 79.19 93.55 Echo 19.87 74.68
3 60.68 1.85 92.18 91.69 Delta 60.38 44.51
4 48.60 82.14 73.82 41.03 Alpha 27.22 93.18
5 89.13 44.47 17.63 89.36 Romeo 19.88 46.60
> class(x$Width)
[1] "numeric"
> class(x$Name)
[1] "factor"
You see? Ask R to read in a table, and it makes a good guess at the appropriate data types and headers, and you can refer to the components by their actual names. This stuff just isn’t so hard.
Richard said,
August 15, 2012 at 1:48 am
This isn’t related, but you might find it interesting the Matlab can no longer open a file in the editor while running a computation in R2012a. This is a step back from R2008b which could achieve this feat.
I’m referring to using the File->Open command as well, not typing edit foobar.m or similar during a computation. This means if I start a long-running computation and felt like doing some coding in the meantime I now have to use a different editor. Something I didn’t have to do four years ago. Now you must predict what you would like to work on and have it all open in advance if you don’t want to just sit and twiddle your thumbs while you wait.
I was frustrated by this and tried to submit a bug report about it through the Matlab bug reporting dialog, only to find it also can’t be used during a computation either (it opens and lets you type out your report but you get to the end and find it can’t populate a drop-box while matlab’s running).
Dorothy Bishop (@deevybee) said,
August 20, 2012 at 1:49 am
I would just cut and paste a file like this straight into Matlab. Just needs a few curly brackets around it and it is accessible. But I do only deal with fairly small files and I appreciate it’s non-optimal. Sometimes life seems too short and a quick fix is the answer.
crowding said,
August 20, 2012 at 4:22 am
Watch this and never ever ever ever mention the words “cut and paste” around me with regard to data.
Shit like cutting and pasting from Excel to Matlab got a group of patients exactly the wrong chemotherapy drugs in a human subjects trial.
Jordi Gutiérrez Hermoso said,
August 22, 2012 at 9:12 am
About NaN not being for missing values, actually, it is. I’ve had this debate too, thinking as you did that NaN can’t represent missing values, but after reading the damn IEEE 754 spec, it says that, yes, in fact, missing value is another interpretation of NaN, along with an erroneous calculation.
Fun fact, which I also learned relatively recently: there are many bit patterns that are NaN: any bit pattern that has all of the exponent bits lit and a nonzero mantissa is a NaN. R’s NA value is in fact a special NaN value, with a particular bit pattern in the mantissa (I think the bit pattern is some sort of internal R joke about someone’s birthday or something like that). In Octave, we do have NA values, with the same bit-pattern as R’s NA value, but the problem is that “isnan” also returns true for NA values, because, y’know, NA values *are* NaN (I believe R behaves in the same way).
This is why Alois Schögl has written a NaN toolbox for “stats” that skips all NaNs, because for *him*, NaNs only represent missing values (I’ve had many a flamewar with him about this).
As to your overall point with csv2struct, point taken. I’m trying to implement csv2struct in Octave now (without needing xlsread).
crowding said,
August 31, 2012 at 2:01 am
You can ‘interpret’ NaN as a missing value — as far as I’m aware you can ‘interpret’ the bitpattern of a NaN any damn way you please as the IEEE doesn’t say anything about what they are, only that some NaNs are produced by some operations. So you can interpret a certain NaN bit pattern as a signal to go down the pub and eat a sandwich and that’d fit the spec. But some NaNs are definitely produced by arithmetic errors, and without a language facility for telling error NaNs apart from whatever other interpretation you want to place on them, the only way you can interpret a NaN in a way that bears correspondence to reality is as “arithmetic error or some other interpretation.” Which I do not find useful.
In R is.nan(NA) is false; you can tell them apart. The implementation being based on a NaN bitpattern is an implementation detail (and it is implemented in other ways for logical and character arrays).
An appeal to IEEE 754 is only really useful if you’re going to really support IEEE 754; that includes being able to tell NaNs apart. (As long as we’re on IEEE 754, how about rounding modes, any hope there? Whoever does that would get at least a few of the hardcore numerical analyst crowd, if there are any of those people left.)
jordigh said,
August 31, 2012 at 5:47 am
You can tell them apart, with some care… is.na(NaN) is also true. Furthermore, telling them apart requires checking the bit pattern in software. It’s slower than doing “x != x”, which is a single hardware instruction (gcc translates it to ucomisd). It makes a difference of a factor about 10, if memory serves.
The IEEE 754 standard is quite clear that quiet NaNs’ bit pattern can be used to store “retrospective diagnostic information”, but checking this information can be slower than just checking if it’s NaN or not.
Chris said,
November 19, 2012 at 7:55 pm
[I didn't read the post, and I wrote some bullshit suggestion that doesn't work. I didn't even copy-and-paste the example data above and check whether my suggestion produced something useful.]
Maurits said,
November 20, 2012 at 3:02 am
[I didn't read the post, and I wrote some bullshit suggestion that doesn't work. I didn't even copy-and-paste the example data above and check whether my suggestion produced something useful.]
Bobby said,
November 20, 2012 at 1:45 pm
[I didn't read the post, and I wrote some bullshit suggestion that doesn't work. Also I was snarky in a "RTFM noob" sort of way, which is ironic because I didn't even copy-and-paste the example data above and check whether my suggestion produced something useful.]
crowding said,
November 20, 2012 at 6:53 pm
A few of you have ben writing in in tones varying from “faux-helpful” to “internet asshole” to suggest that “dataset” reads a CSV file. Aside from not being a part of MATLAB proper, rather being part of the “Statistics toolbox,” one of many shitty paid add-ons that Mathworks peddles, dataset DOES NOT read CSV files. In fact dataset’s failure at interpreting quoted strings was unfixable enough that it was better worth my time to write my own CSV parser. So you can knock it off.
If any of you had actually read this post you would see that an acceptable CSV reader needs to do a few things:
A CSV parser should handle both numeric and string data (and hopefully booleans and dates). And do so automatically. In particular, it should not force you to write a fragile
textscanformat string that you have to change every time the columns in your spreadsheet get rearranged. The point of using CSV is to be a flexible interchange format; one thing ‘flexible’ entails is that the number and order of variables in the file change as an analysis develops. I am not willing to write a new format string every time there is a minor change, when the default “the type is numeric if they all look like numbers” works 99% of the time. Especially not the way textscan does it, where using a format string disables checking for line breaks! So you're liable to get misaligned data that makes no sense.A CSV parser should read a leading header row and convert them into dataset variable names or struct field names.
A CSV parser should handle data with more than 2^16 rows, unlike
xlsread. For fuck's sake this isn't the damn 1980s, our computers have memory and we're trying to do science here.A CSV parser should interpret quotes. If you are going to call yourself a CSV parser you just have to correctly handle quotes. Quotes are a part of the CSV format. If you read comma-delimited data without handlig quotes, you are not a CSV parser. There is a particular way that the CSV format does quoting, and you can read about it in RFC 4180.
For example, if one variable column is named "subject", which is represented in the header line file as the word
"subject"inside quotes, then when I read the file, the variable name it produces should be "subject", not, asdatasetinsists with the above, "x0x22subject0x22".For another example, the first entry under the "subject" in the example data above is to be interpreted as
as, without quotes, not"as"without the quotes striped. The quotes there are part of the CSV format, not part of the data. Anything that claims to interpret a data format should be able to interpret the standard syntax of that data format. This is not a difficult thing tounderstand.
For a further example, if an entry under a field named "City" field contains the string "Eerie, Indiana", a CSV parser needs to be able to interpret that as a string containing a comma, and a CSV parser should not produce misaligned data or barf with a "requires the same number of delimiters on each line" just because the file correctly followed the CSV format. [Being forced to construct a format string with "%q" in some places and not in others is not acceptable; as described above it easily produces misaligned data.]
This is not a tall order; this is all just what R's read.csv or Pandas' read_csv give you out of the box.
Bobby said,
November 21, 2012 at 6:33 am
I can’t be bothered to check on the examples in the post I’m replying to. Freely available tools do a better job than the tools I use that cost thousands, and I find this somehow threatening to me.
Bobby said,
November 21, 2012 at 1:35 pm
I insist on a point by point rebuttal to a pile of insults!
crowding said,
November 21, 2012 at 9:07 pm
You have some amusingly wrong ideas, so you get toyed with for one comment.
That’s an amusing suggestion. I suppose that I could use sed and a particularly complicated regular expression to transform an industry standard RFC 4180 file into a subset of whatever fucking format
datasetis even capable of interpreting. And then you’d have to write the inverse transformation on the Matlab side, of course.The thing you are missing is that this is all just a a roundabout, incredibly stupid way of writing a parser, which, if you paid any fucking attention at all you’ll see I’ve already done you fucking idiot. Except that your way of doing it introduces an external dependency on UNIX, which, if you opened your seeing-eyes and pay the slightest bit of attention with your mind-brain you’ll remember I’m working with someone who uses MATLAB on its most popular platform, Windows. Because here in the adult world we collaborate with people, and use industry standard text data formats for interchange with people working on different platforms, using tools that can read those industry standard formats.
And that’s an interesting perspective you have there, on the relationship between computing and masculinity! Let’s see if it works.
One of the tropes associated with masculinity is that a man should be in control of things. Things that are in a subservient role to a man, such as a computer, should do his bidding. If the relationship between a man and his computer embodies masculinity, then the computer should do what computers do, in the way that the man tells it to. Follow?
One of the other tropes associated with masculinity is that a man knows how to use his tools. As in: a computer is a tool for eliminating repetetive effort. A computer is a tool to delegate trivial and error-prone mental labor to. A computer is not a master for you do bullshit unnecessary labor for.
So no, I don’t think adopting a subservient role to a fucking machine, repeatedly, mechanically doing for a computer what a computer ought to be doing for me, is particularly masculine. (If I use a more capable toolchain, it actually does it for me in one command — and MATLAB also does it for me, now that I’ve made it do so, you’ll recall if you have even half a second of short term memory.) Men get work done.
On the other hand, committing to rewriting your completely unnecessary
textscanformst string is exactly the kind of self-enslavement that would be done by the kind of UNIX weenie who mistakes knowing how to push buttons for getting work done, and has never learned to write a parser in his life.Men, in comparison, aren’t averse to trying tools that work better, or to correcting the defects in defective tools.