Find common files in two directories

I've got a script that I'm using to make comparison plots using data from two different sets of simulation runs. The data is stored in mat files in different folders. I use uigetdir to select the folders and then I search through the mat files in each folder to find matching test cases so I can make comparison plots for each matching test case.
I have a working solution. However, I don't understand why I have to go through two stages of conversion to finally get a cell array of strings containing the unique test case numbers so that I can sort them into numerical order instead of "ASCII dictionary order".
Here is the relevant code excerpt:
%% Find matching test cases in the two directories
o = dir(fullfile(dirold, '*.mat')); % 'old'
n = dir(fullfile(dirnew, '*.mat')); % 'new'
[C, iold, inew] = intersect({o.name}, {n.name}); % find common test case files in 'old' and 'new' directories
% Convert C to a sortable array of indices so the comparison plots will be in order by case number
cases = regexp(C, 'case(\d*)', 'tokens'); % extract case numbers (cell array of cells?)
for (iC = 1:length(cases)) % TODO: why do I need to do this step?
    x(iC) = cases{iC}; %#ok<SAGROW>
end
for (iC = 1:length(cases)) % convert to cell array of strings
    y(iC) = x{iC}; %#ok<SAGROW>
end
% Re-sort in numerical order instead of 'ASCII dictionary order'
[~,iy] = sort(str2num(char(y))); %#ok<ST2NM>
%% For each test case that is in both directories, make some comparison plots
for (iCase = iy')
    fprintf('\nFound matching test case ''%s''.\n', C{iCase});
    od = load(fullfile(dirold, C{iCase})); % load 'old' data into 'od' struct
    nd = load(fullfile(dirnew, C{iCase})); % load 'new' data into 'nd' struct
    ...
    <make the plots>
end
My concern is this: why do I have to go through the two-step process of creating the intermediate 'x' and 'y' so that I can finally get a sortable cell array of strings? Is there a way to do this that is more straightforward and less confusing? I don't understand why this is necessary and future users of this code (including myself) won't understand it either.
Any help to simplify this (or at least clarify what is going on and why this mess is necessary) would be much appreciated.
Note: the reason I want to do the 'numeric' sort is so that I get plots for the test cases in the order 1, 2, ..., 9, 10, 11, ... instead of 10, 11, ... 19, 1, 21, 22, ..., 2, 3, 4, ..., etc. The mat files are named caseX_... where X is the test case number. By default, the dir command, and hence the intersect command, are sorting by "ASCII dictionary order" which is not what I want.
  1 Comment
dpb on 10 Nov 2015
Solve the problem by renaming the files to include leading zeros in the numeric portion of the file name. Do this by using the appropriate format string when creating the names...
fn=num2str(caseIndex,'case%03d');
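As a minimal sketch of why zero padding helps (the '_run.mat' suffix and the case numbers here are made up for illustration), names built with a fixed-width numeric field already sort in numerical order under a plain text sort:

```matlab
% Hypothetical case numbers and file name pattern, for illustration only.
caseNumbers = [1 2 10];
names = arrayfun(@(k) sprintf('case%03d_run.mat', k), caseNumbers, ...
                 'UniformOutput', false);

% A plain character-code sort now agrees with the numerical order.
sorted = sort(names);
disp(sorted(:));
```

sprintf is used here instead of num2str with a format string; both produce the same padded text.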


Accepted Answer

Guillaume on 10 Nov 2015
Edited: Guillaume on 11 Nov 2015
There is a reason your regexp returns a cell array of cell arrays of cell arrays. The outer cell array exists simply because your input C is a cell array: it is always the same size as C, and each cell holds the matches for the corresponding string in C.
The second level of cell array is because for a given single string there may be several matches. So the matches themselves have to be returned in a cell array. (For example if you request to match 'a..', there are two matches in the string 'abcdaef': {'abc', 'aef'})
But it's not matches that you've requested, it's tokens (usually called captures in other languages). That adds another level and is the reason for the inner cell array. There may be several tokens per match, so the tokens for a match also have to be wrapped up in a cell array. (For example, if you request to match 'a(.)(.)', there are two tokens per match, so the tokens are {{'b', 'c'}, {'e', 'f'}} and the matches are as above.)
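The three levels described above can be checked directly at the command line. This is only an illustrative sketch; the file names in C are hypothetical stand-ins for the output of intersect:

```matlab
str = 'abcdaef';

% 'match' on a single string: one cell per match.
m = regexp(str, 'a..', 'match');          % {'abc', 'aef'}

% 'tokens' on a single string: one cell per match, each holding a cell of tokens.
t = regexp(str, 'a(.)(.)', 'tokens');     % {{'b','c'}, {'e','f'}}

% 'tokens' on a cell array of strings adds the outer level, one cell per input.
C  = {'case1_a.mat', 'case10_a.mat'};     % hypothetical file names
tc = regexp(C, 'case(\d*)', 'tokens');

% Three levels of indexing are needed to reach the text of the case number.
disp(tc{2}{1}{1});
```

Indexing tc{2} selects the input string, the following {1} selects the first match, and the last {1} selects the first token of that match.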
In your case, you've only got one match and one token. You could actually get rid of these two levels of cell array.
To get rid of the cell array of tokens, simply ask for a match instead of tokens. There are many ways to build a regex. If you only want to capture a number preceded by a specific string, this would work:
cases = regexp(C, '(?<=case)\d+', 'match');
This matches one or more digits preceded by 'case' (using look-behind). That's one cell array level gone (the inner one).
To get rid of the cell array of matches, simply tell the regular expression engine you only want one match. This is done with the 'once' keyword:
cases = regexp(C, '(?<=case)\d+', 'match', 'once');
cases is then simply a string if C is a string, or a cell array of single strings if C is a cell array.
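With that flat cell array, the two conversion loops and the str2num(char(...)) call in the question could plausibly be replaced by a single str2double call. A sketch, assuming C is a row cell array of names as returned by intersect (the names below are hypothetical):

```matlab
% Hypothetical file names, listed in the character-code order a text sort gives.
C = {'case10_run.mat', 'case1_run.mat', 'case2_run.mat'};

% One char vector per file name: {'10', '1', '2'}.
cases = regexp(C, '(?<=case)\d+', 'match', 'once');

% str2double accepts a cell array of char vectors and returns a numeric array.
[~, iy] = sort(str2double(cases));

% iy(:).' forces a row vector so the for loop visits one index at a time.
for iCase = iy(:).'
    fprintf('Found matching test case ''%s''.\n', C{iCase});
end
```

A name that does not contain 'case' followed by digits yields an empty match, which str2double converts to NaN; sort places NaN values last.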
  2 Comments
Les Beckham on 11 Nov 2015
Thank you so much. This provides a very clear explanation of the nesting of cells in cells in a cell array that I found so mystifying.
I did have to do some trial and error to find why your suggested regex didn't work. I found that there was an extra parenthesis. This is what I ended up using:
cases = regexp(C, '(?<=case)\d+', 'match', 'once');
Guillaume on 11 Nov 2015
Yes, sorry, somehow an extra bracket found its way into the first regex. The final one, the same as what you ended up with, was correct.


More Answers (1)

Stephen23 on 10 Nov 2015
Edited: Stephen23 on 18 Apr 2021
You could download my FEX submission natsortfiles, which sorts filenames taking into account any numeric values in the filenames (rather than the ASCII order of their digits).
Using it is easy:
S = dir(..);
S = natsortfiles(S);
But as dpb already mentioned, the simplest and most robust approach is to use sufficient leading zeros so that a normal sort provides the correct order.
  1 Comment
Les Beckham on 11 Nov 2015
Thanks for the suggestion, Stephen. It doesn't quite apply to my specific problem, but I will definitely keep this in mind for possible use in the future. I actually might have been able to use this if I sorted the outputs from 'dir' using your utility before doing the 'intersect', but Guillaume provided a nice explanation that allowed me to fix my original approach with no outside dependencies.
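For reference, the sort-before-intersect idea can be sketched without any outside dependency by ordering one list numerically and then calling intersect with the 'stable' option, which keeps the order of its first input. The file names are hypothetical, and this is an untested alternative to the approach described above, not what was actually used:

```matlab
% Hypothetical directory listings (names only).
oldNames = {'case10_run.mat', 'case1_run.mat', 'case2_run.mat', 'case3_run.mat'};
newNames = {'case1_run.mat', 'case10_run.mat', 'case2_run.mat'};

% Put the 'old' names into numerical order by case number.
nums = str2double(regexp(oldNames, '(?<=case)\d+', 'match', 'once'));
[~, k] = sort(nums);
oldSorted = oldNames(k);

% 'stable' returns the common names in the order they appear in oldSorted.
C = intersect(oldSorted, newNames, 'stable');
disp(C(:));
```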

