Discover MakerZone

MATLAB and Simulink resources for Arduino, LEGO, and Raspberry Pi

Learn more

Discover what MATLAB® can do for your career.

Opportunities for recent engineering grads.

Apply Today

Thread Subject:
Sorting cell arrays of strings

Subject: Sorting cell arrays of strings

From: Pete Boettcher

Date: 22 Dec, 2000 15:42:29

Message: 1 of 1


There was a discussion back a while about the horribly inefficient
sort function for cell arrays of strings. Included is a MEX file for
efficient sorting. The output seems to match sort.m for sane input, but
I don't guarantee anything... It DOES NOT return indices like sort
does... I'm not sure how to include this. Sorting is done by the C
library function qsort(), so mileage may vary with your C library.

There are two versions... if you uncomment #define MUCK_INTERNALLY,
cellsort will play all sorts of tricks inside the Matlab variable
structures. This is certainly not recommended by The MathWorks, and
blatantly ignores all the API functions. It is actually a very good
example of how you might use crosslinks as I described in my recent
post.

The "safe" version has to copy each string to a temporary buffer to
perform the comparison, hurting performance. The fast version does
absolutely no copying or memory shuffling. The safe version also
uses additional memory: exactly the size of the strings themselves
(total characters sorted * 2 bytes/char). The fast version saves this
memory, but still needs to allocate the cell array overhead of 100
bytes per string (this is unavoidable).

If you decide to try the fast version, don't blame me if your computer
explodes! (All standard disclaimers apply, and then some. During
development, Matlab crashed, it was unable to clear some variables,
etc. I can't guarantee these problems are fixed, though I believe
they are.)

Both versions were tested under Matlab 5.3, linux glibc 2.2. If
someone tries it under 6.0, please let me know if it works.

Cut and paste the code into the appropriately named file.

-Peter Boettcher

---begin cellsort.c---
/* cellsort.c: Efficiently sort a cell array of strings. Almost no
   memory shuffling is done... we only have to create another cell
   array for the output; the string data itself is shared between the
   two cell arrays. For the sort, only the pointers to the strings
   are moved around, and we don't do any string conversion. The C
   library qsort() is used. This is about as fast and memory
   efficient as is possible within the constraints of Matlab cell
   arrays and of qsort().
   
   Author: Peter Boettcher <boettcher@ll.mit.edu>
   Standard disclaimers apply. Especially if you use the "fast"
   version. Experimental code!!!
 */

#include "mex.h"
#include <stdlib.h>
#include <string.h>

#define MAX_LEN 50

/* uncomment this for faster, more efficient output, but be careful! */
/* #define MUCK_INTERNALLY */


#ifdef MUCK_INTERNALLY
#include "mxinternals.h"
/* This comparison function is in the form required by the
   libc qsort() function. It gets passed pointers to the elements
   in the array to be sorted, which themselves are mxArray*. So
   inA and inB are mxArray** types.

   This first version uses the apparent fact that strings are stored
   as arrays of 16-bit integers. It CIRCUMVENTS the MEX access functions
   like mxGetString, and thus may break on certain platforms or on
   certain Matlab versions. It works for me, but save your workspace
   before trying.

   Undefine MUCK_INTERNALLY above to use the safe method. */
int compfun(const void *inA, const void *inB)
{
  unsigned short int *ptrA, *ptrB;
  int i=0;
  int lena, lenb, minlen;
  mxArray *arrayA, *arrayB;

  /* Nasty dereferencing to extract the mxArray * */
  arrayA = *((mxArray **)inA);
  arrayB = *((mxArray **)inB);
  lena=mxGetNumberOfElements(arrayA);
  lenb=mxGetNumberOfElements(arrayB);

  /* Assume 16-bit characters */
  ptrA = mxGetData(arrayA);
  ptrB = mxGetData(arrayB);

  /* Custom comparison... uses ASCII values of chars, no
     locale or anything fancy. But it's fast and it works
     the same as Matlab's ridiculously slow sort.m */
  minlen = (lena < lenb) ? lena : lenb;

  /* Special case of empty string, to better match the output of sort.m */
  if(minlen == 0)
    return (lena < lenb) ? lenb : lena;

  while(i<minlen) {
    if(ptrA[i] == ptrB[i]) {
      i++;
      continue;
    }
    if(ptrA[i] < ptrB[i])
      return(-1);
    else
      return(1);
    i++;
  }
  
  if(lena == lenb)
    return(0);
  else if(lena < lenb)
    return(-1);
  else
    return(1);
}

#else /* MUCK_INTERNALLY */

/* Slower, safer version */
int compfun(const void *inA, const void *inB)
{
  char a[MAX_LEN];
  char b[MAX_LEN];

  mxGetString(*((mxArray **)inA), a, MAX_LEN);
  mxGetString(*((mxArray **)inB), b, MAX_LEN);
    
  return(strcmp(a, b));

}
#endif

#ifdef MUCK_INTERNALLY
/* Hack around inside the Matlab data structures. This is dangerous.
   Since this actually modifies memory inside the structures, you
   could end up with unclearable variables, or segfaulting or hanging
   Matlab. It seems to work though.

   This function duplicates a cell array of strings, but does not copy
   the actual string data. Unfortunately, the cell array structure
   means there is an overhead of 100 bytes PER STRING. This is
   unavoidable. However, only this overhead is needed for the output
   array. The strings themselves are not copied.

   To do this we have to be careful with the Matlab memory manager.
   If we use the correct mxCreateCharArray to create a zero-length
   string, then manually set the dimensions and data pointer, Matlab
   will be happy. We also need to hack around with the crosslink, so
   Matlab can figure out what to do if one of the arrays if
   modified. */

mxArray *smartCellStrDup(mxArray *in)
{
  mxArray *out;
  int rows, cols, i;
  int zerodim[2] = {0, 0};
  mxArray **mxptrin, **mxptrout;

  rows = mxGetM(in);
  cols = mxGetN(in);
  out = mxCreateCellMatrix(rows, cols);

  mxptrin = (mxArray **)mxGetData(in);
  mxptrout = (mxArray **)mxGetData(out);

  for(i=0; i<rows*cols; i++) {
    mxSetCell(out, i, mxCreateCharArray(2, zerodim));

    mxptrout[i]->data.number_array.pdata = mxptrin[i]->data.number_array.pdata;
    mxptrout[i]->rowdim = mxptrin[i]->rowdim;
    mxptrout[i]->coldim = mxptrin[i]->coldim;
    if(mxptrin[i]->crosslink) { /* Already linked. Add to circular list */
      mxptrout[i]->crosslink = mxptrin[i]->crosslink;
      mxptrin[i]->crosslink = mxptrout[i];
    } else { /* No links. Link going both ways */
      mxptrout[i]->crosslink = mxptrin[i];
      mxptrin[i]->crosslink = mxptrout[i];
    }
  }

  return(out);
}

#endif

/* the gateway function */
void mexFunction( int nlhs, mxArray *plhs[],
                  int nrhs, const mxArray *prhs[])
{
  mxArray **cell;
  int len;
  int i;

  if(nrhs < 1)
    mexErrMsgTxt("One input required.");
  
  if(!mxIsCell(prhs[0]))
    mexErrMsgTxt("First argument must be cell");

  if(mxGetNumberOfDimensions(prhs[0]) != 2)
    mexErrMsgTxt("First argument must be cellstr");

  for(i=0; i<mxGetNumberOfElements(prhs[0]); i++)
    if(!mxIsChar(mxGetCell(prhs[0], i)))
      mexErrMsgTxt("First argument must be a cell array of strings");
  /* Make a copy of the cell array to modify. Safe and slow or hack
     and fast */
#ifdef MUCK_INTERNALLY
  plhs[0] = smartCellStrDup(prhs[0]);
#else
  plhs[0] = mxDuplicateArray(prhs[0]);
#endif

  /* This certainly violates the MEX access rules for cell
     arrays. But qsort needs an array of things to work with.
     It appears that a cell array works just like any other
     Matlab array, but the elements are more mxArray pointers.
     So mxGetData will return a mxArray ** type, and as I've
     coded it below, cell[0] is the mxArray * for the first
     element in the cell array, cell[1] is the second etc.

     With this we just define a custom comparison function that
     operates on mxArray * elements, assuming strings in there */

  cell = (mxArray **)mxGetData(plhs[0]);
  len = mxGetNumberOfElements(plhs[0]);

  qsort(cell, len, sizeof(mxArray *), compfun);
}

---end cellsort.c---

---begin mxinternals.h---
#ifndef MXINTERNALS_H
#define MXINTERNALS_H 1


#define MXVARNORMAL (0)
#define MXVARPERSIST (1)
#define MXVARGLOBAL (2)
#define MXVARSUBEL (3)
#define MXVARTEMP (4)

#define MXLOGICALMASK 0x02
#define MXSCALARMASK 0x01

/* Blatant disregard for MATLAB API. This allows you
   to poke around an mxArray structure directly. Use
   with extreme caution! May not be portable across
   platforms or versions, may cause Matlab segfaults,
   may cause your computer to explode and eat all your data... */
struct mxArray_tag {
  char name[mxMAXNAM];
  int class;
  int vartype;
  mxArray *crosslink;
  int number_of_dims;
  int nelements_allocated;
  int dataflags;
  int rowdim;
  int coldim;
  union {
    struct {
      void *pdata;
      void *pimag_data;
      void *irptr;
      void *jcptr;
      int reserved;
      int nfields;
    } number_array;
  } data;
};

#endif

---end mxinternals.h---

Tags for this Thread

No tags are associated with this thread.

What are tags?

A tag is like a keyword or category label associated with each thread. Tags make it easier for you to find threads of interest.

Anyone can tag a thread. Tags are public and visible to everyone.

Contact us