Here’s another way to crash MATLAB!

May 17, 2012 at 8:00 am (crap data structures, dumb memory management)

Now that we know MATLAB’s memory manager has some problems with actually cleaning up too many objects at once, we can have some fun with it!

function pushdown(N, crash)
    %an N of more than about 1000000 with crash=1 is interesting.
    a = {};
    for i = 1:N
        a = {rand() a};
    end

    if ~exist('crash', 'var') || ~crash
        %feed it to matlab's deallocator one piece at a time, or it will choke.
        for i = 1:N
            a = a{2};
        end
    end
end

First try running:

>> pushdown(1000000)

Then try:

>> pushdown(1000000, 1)

What do you get? I get a hard crash.

So not only can you blow the interpreter stack with deallocation, you can blow the C stack as well!

I mean, a system can get by for some purposes without garbage collection… but not when reference-counting is done this badly.

Permalink Leave a Comment

oh wow

May 15, 2012 at 12:54 am (crap data structures, dumb memory management, errors in error handling, thirty misfeature pileup)

You know how errors in destructors are transformed into warnings?

I was doing some comparisons various strategies for filling out arrays in MATLAB. I looked at the docs and noticed an implementation of a doubly linked list, and thought, why not include that and see how badly it performs?

Just pull the code for dlnode out of your own MATLAB help files:

 edit  ([docroot '/techdoc/matlab_oop/examples/@dlnode/dlnode.m']);

then copy and paste it right into a file named dlnode.m on your matlab path, and then try this:

tail = dlnode(0);
for i = 1:510
	new = dlnode(i);
	insertAfter(new, tail);
	tail = new;
end
clear tail;
clear new;

The output from this overflows the command window buffer so I won’t post the whole thing. Here’s an excerpt:

  In dlnode>dlnode.delete at 64
  In dlnode>dlnode.delete at 64
  In dlnode>dlnode.delete at 64
  In dlnode>dlnode.delete at 64
  In dlnode>dlnode.delete at 64
  In dlnode>dlnode.delete at 64
Warning: The following error was caught while executing 'dlnode' class
destructor:
Maximum recursion limit of 500 reached. Use set(0,'RecursionLimit',N)
to change the limit.  Be aware that exceeding your available stack space can
crash MATLAB and/or your computer. 
> In dlnode>dlnode.delete at 64
  In dlnode>dlnode.delete at 64
  In dlnode>dlnode.delete at 64
  In dlnode>dlnode.delete at 64
  In dlnode>dlnode.delete at 64
  In dlnode>dlnode.delete at 64
  In dlnode>dlnode.delete at 64

….Yeah, not only is the error you get transformed into a warning, but what an error it was! Linked chains of handle objects pile up destructors on the stack and overflow if too large of a group of handle objects goes out of scope at once. And a goddamn stack overflow gets exception-funneled down into a warning!

Your computer has multiple gigabytes of memory. You can not make a linked list more than 500 items long without breaking MATLAB. Oh, wow.

Well, you can increase the recursion limit, but not far before you get hard crashes.

So how badly does the Mathworks’ own example of a linked list perform, anyway? Well, building a linked list N items long turns out to be between O(N^2) and O(N^3). Wowsers. (For the data-structures-impaired among you, it ought to be O(N).) And it’s factors of ten slower than the next slowest way of building up a list of indeterminate length.

These all — the absurd slowness of object property access, the recursive piling up of destructors on the stack, the exception-swallowing — these are all knock-on effects of the Mathworks’ insane notion that they can do automatic memory management without a garbage collector and guarantee deterministic destruction and support arbitrarily connected reference graphs. Well, a system built under these constraints can do at least one of three things: either it will have accomplished what in the history of computer science has never before been accomplished, or its time and space complexity is going to blow up, or it won’t actually work as advertised.

They might as well have assumed P=NP while they were at it.

UPDATE: It gets worse! 500 was the limit in R2010a, but as of R2011B, there’s twice as much recursion on finalization, so you can only make a list 250 items long!

Permalink 1 Comment

Multiple output arguments

May 28, 2010 at 2:57 am (crap data structures, Uncategorized, unexpressive language)

There are many reasons to be told why multiple output arguments in Matlab were a poor design decision. But here is a start. In an unsuccessful attempt to find a code that’s less than O(N*M) in storage for a problem that in any implementation in a real language is O(N) in storage, (*) I saw that someone wants to know,

Is there a similar way to get position information out of a matrix.

I tried something like,
[row,col]=min(min(matrix))
but that returns only the final row that the element is in, not its exact location.

This poor fellow wants to find the indices of a global minimum of a matrix. Now, min by itself finds the column-wise minimum of a matrix, and in a second output returns the row indices. Unless it’s a matrix with a size of 1 in the first dimension, in which case it mysteriously changes behavior and returns the row-wise minimum of a matrix. If the size of the matrix is 1 in both the first and second dimensions, it changes behavior again and gives you the minimum along the third dimension. No external indication is given of which dimension of minimum min happens to be pursuing… but I digress. The questioner just wants a 2-d min!

But here’s the thing: while composing min twice will give the global minimum of a 2-D array, there is no way you can compose two functions in one expression that makes use of the second output. So there is a failure to generalize: [m,i] = min(x) finds a minimum in one dimension, and tells you where the minimum is. The generalization m = min(min(x)) finds a minimum across two dimensions, but there is no way to use this form to find where the minimum is.

But here’s thing: there’s no simple expression that does it. Best I could come up with is:

i = ind2sub(size(x), second_output_of(@min, x(:)))

That will work, leaving second_output_of to be implemented by anyone who perceives a gaping lack in the language at this point — you see, while composing functions with multiple outputs is an ability virtually mandated by the presence of multiple outputs and the mathematical nature of the problems you might want to solve in MATLAB, the ability to use the second output in an expression is just plain missing.

Well, here’s an even more arcane version that uses only built in functions:

i = ind2sub(size(x), find(x == min(x(:)))

This doubles both the time and space requirements as well as bypassing the second output of min() altogether! What a lesson — if you want a simple matlab expression to find the location of a min in 2-d, you can’t use the existing function (the second output of min) that finds the location of a min in 1-D! Instead, you have to hunt down find, ind2sub, and the incantation (:) and live with code that’s twice as slow. There is no simple expression without these incantations that does it; if you want the simple compositionality of applying min twice you are limited to using multiple statements with intermediary temporary variables:

[m,r] = min(x);

[m,c] = min(m);

i = [r(c) c];

While functions with single outputs have the full compositionality of algebra available to them (well, unless you want do something simple like subscripting the output of a function, ahem), second and subsequent outputs are not composable unless you first assign their outputs to intermediary variables. This is unacceptable. A requirement of intermediary variables is evil. There is a longer post planned on this, but you may search for clues why, here and here, before I go live with that.

This is by no means the only problem with multiple outputs in MATLAB. More later.

(*) the answer to my original problem was something in the extra add-on ($$$) Signal Processing Toolbox. To sort points along grid lines. Well, I’m not the one paying for my license, except via externalities and frustration.

Permalink 3 Comments

Programatically creating a struct

August 12, 2009 at 2:14 am (crap data structures, trouble with small numbers, unexpressive language)

Say you are parsing a file and you read a list of column names, then a list of values to put into those column names. Simple enough, obviously you are going to put them into a 1×1 struct array. Perhaps use the ‘struct’ function to do this. Given a cell array of field names snames, and a cell array of values to associate with those field names svals, we can just use struct() to cook up a struct:

arglist = {snames{:};svals{:}}
s = struct(arglist{:})

Goodness there’s a lot of matlab-ism’s packed into that. To work out how that works you need to know: that x{} extracts cell array contents into a “comma separated list,” which is not an actual list like a data structure you can access (there aren’t any of those) but a syntax that unpacks a cell array or struct field access (but not a regular array or function call) into a place where the parser otherwise expects expressions separated by commas; that putting comma-separated expressions like {x, y, z; a, b, c} orders an array in row-major order BUT : orders an array in column-major order, and so on. Let’s ignore the fact that reasonable goddamn languages like Python or R write the above simply as s = dict(zip(snames, svalues)) or s = svalues; names(s)<-snames, respectively, and press on. Did you notice the condition in which the above will fail and break horribly?

You do remember how struct() works, right? You give the field name, then the value, then the second field name, then the second value, the good old ersatz broken named arguments pattern. And the GREAT thing about the broken ersatz named arguments pattern is that every function that implements it does the input parsing just a LITTLE BIT differently. (After writing dozens of name-value parsers, each working a little bit differently, MathWorks went ahead and came up with an InputParser class which gives you all the flexibility to continue making every argument list parser work just a little different.)

Take struct(), for instance. (Please!). Just as an exercise, try making this struct by writing an invocation of struct():

s =
    a: 1
    b: 2

The answer to this exercise is:

s = struct('a', 1, 'b', 2)

But your data are complicated: Some of the structure fields are themselves cell arrays. Now try using struct to make this struct with a singleton cell array:

s =
    a: 1
    b: {[2]}

Got that? Let’s try this one with an empty cell array.

s =
    a: 1
    b: {}

And this one with multiple elements in a cell in a member of the structure:

s =
    a: 1
    b: {'foo', 'bar'}

If you’ve tried this and were surprised, read on.

%There being no cell2struct, MATLAB's builtin function struct() seems to
%be the only simple way to efficiently programatically initialize a
%structure. But the behavior of struct() is not simple. It has
%*astonishing* behavior on cell array arguments.
%
%>> struct('a', 1)
%ans =
%    a: 1
%>> struct('a', {1})
%ans =
%    a: 1
%>> struct('a', {1, 2})
%ans =
%1x2 struct array with fields:
%    a
%>> struct('a', {})
%ans =
%0x0 struct array with fields:
%    a
%
% There are two overlapping behaviors: the behavior when the value to put
% in a field is a cell array, and the value when it is any other type. This
% makes it tricky to use struct() to make scalar whose field values are
% themselves cell arrays:
%
% >> struct('a', {1, 2}, 'b', {3, 4, 5})
%??? Error using ==> struct
%Array dimensions of input 4 must match those of input 2 or be scalar.
%
% To acheive the intended effect, you must double-wrap such arguments:
%
%>> struct('a', {{1, 2}}, 'b', {{3, 4, 5}})
%ans =
%    a: {[1]  [2]}
%    b: {[3]  [4]  [5]}
%
% Further complicating this is that struct() does 'scalar expansion' of
% non-cell or singleton cell arguments
%
%>> struct('a', {1, 2}, 'b', {3}, 'c', [2 4 5])
%ans =
%1x2 struct array with fields:
%    a
%    b
%>> ans.b
%ans =
%     3
%ans =
%     3
%>> ans.c
%ans =
%     2    4    5
%ans =
%     2    4    5
%
%From this we can conclude that the only consistent, general way to use
%struct() is to wrap all value arguments in cell arrays.

All right! So now you know how to REALLY combine a list of names and a list of values into a structure.

In Python:

s = dict(zip(snames, svalues))

In R:

names(svalues) <- snames

And in MATLAB:

svals = num2cell(svals);

arglist = {snames{:};svals{:}};

s = struct(arglist{:});

Permalink 3 Comments

No Lists

July 28, 2009 at 12:16 am (crap data structures, dumb memory management, essential band-aid helper functions, unexpressive language)

MATLAB is a language supposedly intended to help the user process data. Mind-bogglingly, it does not provide lists, which are pretty fundamental to processing data in the real world. What’s more, it does not even include the possibility of implementing a reasonable facsimile of a list.

If you’ve been indoctrinated into MATLAB, you’ve probably been told about the evils of resizing an array inside of a loop, and how you must avoid that by preallocating. Lately MATLAB presents a helpful compiler warning that tells you if you appear to be resizing an array in a loop. See, if you have an array X that is 10 elements long, and you put a value into X(11), MATLAB has to allocate an entirely new array, with 11 elements, then copy the original 10 elements into the new memory space, copy your new value in, then reclaim the memory taken by the original 10 element array. This takes a bit of time. What’s worse, the longer your list gets the more data it has to copy and the more time that takes — so if you keep adding things onto the end of your array, it transforms an O(n) loop into an O(n2).

So to avoid that massive penalty you can preallocate — make an array with zeros() or similar that’s large enough to hold all your results, before you go and compute your results. The problem with this aside from the extra lines of code (which decreases readability by taking up the reader’s mind with bookkeeping, as well as making a potential entry point for bugs if you mess up calculating the exact size you need) is that a lot of the time it’s just impossible to predict how many slots you need in the array before you actually go and collect the data that needs to go in the array.

One such situation is reading from a log file. When I run an experiment the end product is a text log file that details what happened in the experiment, in chronological order. Since I can’t know at the beginning of the experiment how many trials it has — some trials are thrown out for various reasons and the experiment might just terminate — I can’t preallocate the array to hold trial data when I read in the log file for analysis. Nor can I predict how many events (e.g. fixation breaks, cue onsets) a trial has before I read it in.

Additionally, there are a whole class of algorithms that rely on having a list that you can add or remove entries on the beginning or end of — breadth first graph traversal, for instance, and open up an algorithms text to a random page to see more — and it basically becomes impossible to implement those algorithms in MATLAB.

It’s almost natural to look at the way MATLAB handles adding an n+1th element to an n-element array and wonder, “gee, why recopy all that data? Why not just allocate one new place for entry 11, and add a link from the 10 elements to the eleventh element so that you just make your 10 element list a little longer?” If you could do that, you’d have a ‘list’, it’s such a natural thing to think of that almost every high level language you could consider using — except MATLAB — provides lists as a fundamental datatype. And low level languages like C provide you the tools to construct your own lists and other data structures. MATLAB is rather unique in allowing neither solution within the language.

Do other languages provide lists?

A better question might be, “do any languages in use today not provide lists or the opportunity to build them?”

However, since the two alternatives I’d like to push scientific computing users toward are R and Python, let’s see what they offer.

  • Python offers a builtin list datatype, and appending operations in particular are much faster, on average, than MATLAB’s. Python also provides the collections library which offers several different implementations of lists fine tuned to particular tasks — of course, all are transparently usable with the same syntax as regular lists, and can be used wherever a list can be used. If you really want compact arrays with no overhead and don’t care about appending, Python also offers that as well as the multidimensional MATLAB-like arrays of the Numeric package.
  • R‘s vector and list data types are implemented with a fixed array, like MATLAB’s, but R is really a LISP under the hood, so it definitely has the flexibility to implement data structures in user space. The standard library package Containers gives you some nice implementations.

Do any other languages used today not provide lists or let you create them?

I can’t think of a single language I’ve used, other than MATLAB, and maybe exercises in school where you write out longhand all the state transitions for a Turing machine, that doesn’t. The 1977 version of FORTRAN, maybe doesn’t. I suppose this tells you that MATLAB met its original late-seventies design goal of being easier to use than FORTRAN. It is now 2009.

What do you mean, I can’t implement a list? Doesn’t MATLAB have like objects ‘n’ stuff?

Give it a try, good luck! The first problem you find is that MATLAB does not support references or pointers (more on this in a later post) which are necessary to build your own list. Perhaps you will try MATLAB’s new ‘handle objects’ or notice that ‘function handles’ are essentially references to their local workspace variables. What you will find is that instantiating a whole damn object or closure for every reference you want to take is incredibly slow — and even if you manage to implement it, the way that MATLAB invokes methods on closures and objects does something very weird with memory management that guarantees you will not meet the o(1) append/delete that you wanted (more on that in a later post too.)

I suppose you could always make like a FORTRAN77 programmer, allocate a gobsmacking huge global array, and roll your own malloc into it. Heh.

What about JAVA, doesn’t that provide lists? And can’t you access it from MATLAB?

Can JAVA variables hold MATLAB data types? Or even handles to MATLAB data types? No. Not even just as an opaque object. (Someone please help this guy!) So you can pass simple numeric values and strings, but cells or function handles or objects (even “handle objects”) can’t be organized in Java containers. So this approach doesn’t get you very far.

What’s the best I can do if I insist on using MATLAB, yet I need to collect values in a non-preallocated way?

Something you can do in MATLAB is make a cell array of cell arrays of cell arrays, This lets you have an ersatz pushdown stack, the closest you can get to a sequence container:

n = 0;
stack = {};

function push(x)
	stack = {x stack};
	n = n + 1;
end

function x = pop()
	x = stack{1};
	stack = stack{2};
end

function out = readout()
	out = cell(1, n);
	for i = n:-1:1
		out{i} = pop();
	end
end

So you accumulate values with push() and when you’re done you read off an array with readout(). Note that if you try to encapsulate this technique to instantiate stacks using handle objects, or handles to nested functions, you’re in for a nasty surprise as you will not find your speed advantage — a later post to this blog will explore why. I have written a helper function accumulate.m that works around this problem via a persistent struct (with the consequence that you will leak memory if you don’t clean up your lists). Here it is:

function [push, readout] = accumulate(dim)
%function [push, readout] = accumulate(DIM)
%
%Peter Meilstrup, 2008
%
%Records a list of data (e.g. streaming eye position or other
%streaming input data) in memory without the speed penalty of reallocating
%arrays in a loop.
%
%Returns two function handles, 'push', and 'readout', you push data in
%using the pirst, then at the end of your accumulation read it out using
%the second. The accumulated data is concatenated along the dimension DIM.
%
% Example:
%
% >> [push, readout] = linkedlist(2)
% push =
%     @linkedlist/doPush
% readout =
%     @linkedlist/doReadout
% >> push([1;2]);
% >> push([3 4;5 6]);
% >> push([0; 0]);
% >> readout()
% ans =
%      1     3     4     0
%      2     5     6     0

persistent lists;
persistent counter;

if isempty(lists)
    lists = struct();
    counter = 0;
end

if ~exist('dim', 'var')
    dim = 1;
end

elements = 0;
name = '';

push = @doPush;
readout = @doReadout;

    function doPush(what)
        if isempty(name)
            name = ['t' num2str(counter)];
            counter = counter +1;
            lists.(name) = {};
            elements = 0;
        end
        lists.(name) = {lists.(name) what};
        elements = elements + 1;
    end

    function out = doReadout()
        if ~isempty(name)
            l = lists.(name);
            lists = rmfield(lists, name);
            name = '';
        else
            l = {};
        end
        if ~isempty(l)
            out = cell(1, elements);
            for i = elements:-1:1
                out{i} = l{2};
                l = l{1};
            end
            out = cat(dim, out{:});
        else
            out = [];
        end
    end
end

ADDENDA:

Permalink 4 Comments

Follow

Get every new post delivered to your Inbox.