How to extract specific columns and rows from a large CSV file?

Hi everybody, I have a few questions. I have some HUGE CSV files which I need to get into Matlab for analysis. I want to extract some specific columns and rows from a csv file from the BLS (Bureau of Labor Statistics): https://www.bls.gov/cew/data/files/2016/csv/2016_qtrly_singlefile.zip
I'd like to extract:
- area_fips (column 1)
- own_code (column 2)
- industry_code (column 3)
- year (column 6)
- qtr (column 7)
- month3_emplvl (column 12)
- lq_month3_emplvl (column 21)
Also, I would like to extract the information only for the rows containing the following area_fips codes (C3562, C4266, C1698, C4194 and C4186).
Thanks so much!!
  9 Comments
Marylin on 15 Jun 2017
Edited: Marylin on 15 Jun 2017
To see if I'd misspecified some columns, I tried to import only the first 3 columns with:
fmt=('%s%q%q%*[^\n]');
fid = fopen(filename,'r');
data=textscan(fid,'fmt','headerlines',1,'collectoutput',1,'delimiter',','); % NB: 'fmt' in quotes passes the literal text, not the format variable -- likely why the result is empty
fid=fclose(fid);
and I still got a data variable that is a 1x0 cell


Accepted Answer

dpb on 14 Jun 2017
Edited: dpb on 15 Jun 2017
Use textscan with the appropriate format string to read and skip the necessary columns. There are examples in the documentation on how to do that--the short answer is to use an asterisk ('*') in front of any field you want to skip.
It will be much simpler to read the full length of the file into memory and do the value selection from there--otherwise you have to parse every line on reading to decide whether to keep it or not. If even the subset of columns won't fit in memory at one time, read large chunks in a loop and pare each down to what is/isn't needed, building the desired result piecewise. Remember to preallocate a large section and fill it rather than dynamically reallocating every time; a sketch of that pattern follows.
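A minimal sketch of the preallocate-and-fill pattern (the block size and the 7-column width are illustrative assumptions, not values from the thread):
nAlloc = 1e6;               % guessed upper bound on rows to keep
keep   = cell(nAlloc,7);    % preallocate one large block up front
nKept  = 0;
% inside the chunked read loop, append each batch of matching rows:
%   m = size(match,1);
%   keep(nKept+1:nKept+m,:) = match;
%   nKept = nKept + m;
% after the loop, trim the unused tail:
keep = keep(1:nKept,:);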
For the revised file, the format is
fmt=['%s %q %q %*q %*q %q %q' repmat('%*q',1,4) '%f' repmat('%*q',1,8) '%f' '%*[^\n]'];
Checking counts, the above is:
  1. read a string (C 1)
  2. read two quoted strings (C 2,3)
  3. skip two quoted strings (C 4,5 X)
  4. read two quoted strings (C 6,7)
  5. skip four generic fields as quoted strings (C 8-11 X)
  6. read a float (C12)
  7. skip eight generic fields as quoted strings (C 13-20 X)
  8. read a float (C21)
  9. skip rest of line
I put your three lines in a file locally:
>> data=textscan(fid,fmt,'headerlines',1,'delimiter',',','collectoutput',1)
data =
{2x5 cell} [2x2 double]
>> data{:}
ans =
'01000' '0' '10' '2016' '1'
'C3562' '0' '10' '2016' '2'
ans =
1902460 1
1923565 1
>>
which looks to be correct columns and data for same...
To strip to the desired rows (note data{1} holds the string columns, with area_fips first):
ix=ismember(data{1}(:,1),CellArrayList); % CellArrayList is your list of wanted codes
data{1}=data{1}(ix,:); data{2}=data{2}(ix,:); % and keep only those rows
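For the codes in the question, that list would be
CellArrayList = {'C3562','C4266','C1698','C4194','C4186'};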
  7 Comments
dpb on 16 Jun 2017
Edited: dpb on 16 Jun 2017
At what point did you run out of memory? And just how much of the data in the file is records of interest? The above solution took "the easy way out" and selected the specific codes after the whole file was read.
Alternative way to go at it --
Read a chunk of records at a time and search each chunk for records of interest...10,000 records ought to fit comfortably, just as a round number. You could then write the records found on each iteration to a file to ensure you don't run out of memory during the process and have the required data segregated. Worry about the next step after that.
That just entails putting a numeric repetition count in the textscan call after the format string and then putting that into a
while ~feof(fid)
...
end
loop to continue until you run out of data; a minimal skeleton follows. You could also add a check to exit the loop once you've found the last of the wanted codes; no need to read a bunch more if there isn't anything else of interest to be found. That, of course, assumes the codes are grouped together and not randomly scattered throughout the file.
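Here's that skeleton, assuming fid is an already-open file id and fmt is the format string from the accepted answer:
N = 10000;                          % repetition count per read
fgetl(fid);                         % skip the header record once
while ~feof(fid)                    % read until out of data
    chunk = textscan(fid,fmt,N,'delimiter',',','collectoutput',1);
    % ...filter chunk with ismember for the wanted codes and
    %    accumulate (or write out) the matching rows here...
end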
Or, perhaps where we should have headed from the get-go: look at datastore and tall arrays. There's a whole section on handling large data <Large Files> in the documentation. I really figured it would fit once the superfluous columns were cut out, and I haven't ever had need to use those features so don't know them well...
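A hedged sketch of the datastore route (the variable names here assume the file's header row matches the BLS field names; adjust them to the actual header):
ds = datastore('2016.q1-q4.singlefile.csv'); % a TabularTextDatastore for a CSV
ds.SelectedVariableNames = {'area_fips','own_code','industry_code', ...
                            'year','qtr','month3_emplvl','lq_month3_emplvl'};
ds.ReadSize = 100000;                        % rows per chunk
wanted = {'C3562','C4266','C1698','C4194','C4186'};
result = table();
while hasdata(ds)                            % stream through the file in chunks
    t = read(ds);
    result = [result; t(ismember(t.area_fips,wanted),:)]; %#ok<AGROW>
end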
Marylin
Marylin on 16 Jun 2017
Edited: Marylin on 16 Jun 2017
Ok, I'll look at the datastore object. I think reading the file in chunks of 10000 rows at a time won't be a performant thing since 1. I don't really know how many rows there are in the file because I've never been able to open it... 2. this is the quarterly file and I know there were more than 3.5 million rows in the annual one... so 3.5*4 = roughly 14 million rows!!!
Thanks for your help! Have a good weekend!


More Answers (2)

dpb on 16 Jun 2017
Edited: dpb on 16 Jun 2017
Alternative solution: albeit not totally Matlab, it worked pretty well here, it seems--
Rather than trying to parse the humongous file when you only want a small fraction of it, I used grep to filter out the lines with the desired area codes and then just read the result. grep took only a few seconds, and textscan then read the filtered file in almost no time. Here's the command line for grep:
grep -E -f "fips.lst" 2016.q1-q4.singlefile.csv >fipsdata.csv
where the input file of wanted areas contained
C1698
C3562
C4186
C4194
C4266
where the list above is sorted, since the codes appear in numerical order in the file and that should make for the fastest search.
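If there's any chance those strings could also occur in other columns, anchoring each pattern to the start of the line restricts the match to the area_fips field, which comes first. A hedged variant of the pattern file, assuming the fields are double-quoted in the raw file (as the %q reads suggest):
^"C1698"
^"C3562"
^"C4186"
^"C4194"
^"C4266"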
Then the code we had before works just fine, except that there's no header line in the output file...
fmt=[repmat('%q',1,3) '%*q %*q %q %q' repmat('%*q',1,4) '%f' repmat('%*q',1,8) '%f' '%*[^\n]'];
fid=fopen('data\fipsdata.csv','r');
data=textscan(fid,fmt,'headerlines',0,'delimiter',',','collectoutput',1);
fid=fclose(fid); clear fid
Results are
>> whos data
Name Size Bytes Class Attributes
data 1x2 14115464 cell
>> data(1)
ans =
{40516x5 cell}
>> data(2)
ans =
[40516x2 double]
>> data{1}(1:4,:)
ans =
'C1698' '0' '10' '2016' '1'
'C1698' '0' '10' '2016' '2'
'C1698' '0' '10' '2016' '3'
'C1698' '0' '10' '2016' '4'
>> data{2}(1:4,:)
ans =
4378686 1
4504606 1
4483985 1
4497614 1
>>
It appears that the fields other than the FIPS_AREA code that were read here as strings are, in fact, numeric and could be converted or read as such. Not knowing what they really are (other than year, say), I don't know if that would be the best course of action, or whether to turn most of them into categorical variables, maybe? Anyway, if your analyses will be confined to relatively small groups of these IDs such as this, this may be a more fruitful way to attack reading the huge file than trying to deal with it all in Matlab. I'm sure we could work around it in one of the above manners, but this is pretty simple. I admit I've not tried the datastore or other route as yet...
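For instance, a minimal sketch of such a conversion on the cell output above (which columns to treat as categorical versus numeric is my assumption, not the thread's):
own = categorical(data{1}(:,2));    % own_code as a categorical variable
ind = categorical(data{1}(:,3));    % industry_code likewise
yr  = str2double(data{1}(:,4));     % year converted to numeric
qtr = str2double(data{1}(:,5));     % quarter converted to numeric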
This is the version of grep for Win32 I'm using <tcharron grep.html>; I don't yet have a 64-bit system, although I'm sure there are bound to be ports for it; I've not looked.
ADDENDUM
I'd forgotten CMD FINDSTR -- it actually has sufficient "oomph" to return the identical result in roughly the same time. I'm sure it's also built around a version of grep, of course, so that's not too surprising, but I didn't recall it had the switches it does...
C:\ML_R2014b\work\data> findstr -G:"fips.lst" 2016.q1-q4.singlefile.csv >lines2.dat
C:\ML_R2014b\work\data> dir line*.*
Volume in drive C is unlabeled    Serial number is BC9D:AAD0
Directory of c:\ml_r2014b\work\data\line*.*

lines.dat    7134516  6/16/17  14:55
lines2.dat   7134516  6/16/17  15:54
...

dpb on 17 Jun 2017
And, just for documentation, the above (essentially) written in Matlab using the previously-mentioned looping construct...it uses an auxiliary file (here 'fips.lst') for the desired codes to be extracted; the same file as read by either of the external text-search routines works--
fmt=[repmat('%q',1,3) '%*q %*q %q %q' repmat('%*q',1,4) '%f' repmat('%*q',1,8) '%f' '%*[^\n]'];
fips=sort(importdata('data\fips.lst')); % read the code list, sort ascending per database order
fid=fopen('data\2016.q1-q4.singlefile.csv','r'); % open the real file
fgetl(fid); % skip the header record
N=500000; % a sizable block size per each read
i=0; % counter for the group found
n=0; % info counter--accumulate number found
while ~feof(fid) % start, read 'til out of data
tic
data=textscan(fid,fmt,N,'headerlines',0,'delimiter',',','collectoutput',1);
ix=ismember(data{1}(:,1),fips); % find out if have any wanted
if any(ix) % and did, so save
i=i+1; % increment the cell index array
c{i,1}=data{1}(ix,:); % and keep the rows of interest only
c{i,2}=data{2}(ix,:);
n=n+sum(ix); % information statistics -- how many?
fprintf('found: %5d total %8d\n',sum(ix),n) % for progress reporting only
end
clear data % somewhat spurious while tweaking...
t=toc; % just see how long it takes to
fprintf('%.2f sec %8dK records\n',t,i*N/1000) % read a buffer and how many total read
end
fid=fclose(fid); clear fid % done with the file now
data{1,1}=vertcat(c{:,1}); data{1,2}=vertcat(c{:,2}); % return in cell arrays of n rows
The only other (yet untried) method would be the aforementioned datastore, but either of these last two is pretty straightforward as long as the number of records selected overall fits in memory reasonably easily.
On this 20-yo machine the grep solution outperformed the Matlab textscan loop by a little, but not terribly so, and one presumes the major effort is in the analysis after reading the data, not in reading different sets of data repetitively.
