How to extract specific columns and rows from a large CSV file?

Hi everybody, I have a few questions. I have some HUGE CSV files which I need to get into Matlab for analysis. I want to extract some specific columns and rows from a csv file from the BLS (Bureau of Labor Statistics): https://www.bls.gov/cew/data/files/2016/csv/2016_qtrly_singlefile.zip
I'd like to extract:
- area_fips (column 1)
- own_code (column 2)
- industry_code (column 3)
- year (column 6)
- qtr (column 7)
- month3_emplvl (column 12)
- lq_month3_emplvl (column 21)
Also, I would like to extract the information only for the rows containing the following area_fips codes (C3562, C4266, C1698, C4194 and C4186).
Thanks so much!!
  9 Comments
Marylin on 15 Jun 2017
Edited: Marylin on 15 Jun 2017
To see if I'd misspecified some columns, I tried to import only the first 3 columns with:
fmt=('%s%q%q%*[^\n]');
fid = fopen(filename,'r');
data=textscan(fid,'fmt','headerlines',1,'collectoutput',1,'delimiter',','); % NB: 'fmt' in quotes passes the literal text, not the format variable -- likely why the result is empty
fid=fclose(fid);
and I still got a data variable that is a 1x0 cell


Accepted Answer

dpb on 14 Jun 2017
Edited: dpb on 15 Jun 2017
Use textscan with the appropriate format string to read and skip the necessary columns. There are examples in the documentation on how to do that--the short answer is to use an asterisk ('*') in front of any field you want to skip.
It will be much simpler to read the full length of the file into memory and do the value selection from there--otherwise you have to parse every line on reading to decide whether to keep it or not. If even the subset of columns won't fit in memory at one time, read large chunks in a loop and pare each down to what is/isn't needed, building the desired result piecewise. Remember to preallocate a large section and fill it rather than dynamically reallocating every time; a sketch of that pattern follows.
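A minimal sketch of the preallocate-and-fill pattern (the block size and the 7-column width are illustrative assumptions, not values from the thread):
nAlloc = 1e6;               % guessed upper bound on rows to keep
keep   = cell(nAlloc,7);    % preallocate one large block up front
nKept  = 0;
% inside the chunked read loop, append each batch of matching rows:
%   m = size(match,1);
%   keep(nKept+1:nKept+m,:) = match;
%   nKept = nKept + m;
% after the loop, trim the unused tail:
keep = keep(1:nKept,:);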
For the revised file, the format is
fmt=['%s %q %q %*q %*q %q %q' repmat('%*q',1,4) '%f' repmat('%*q',1,8) '%f' '%*[^\n]'];
Checking counts, the above is:
  1. read a string (C 1)
  2. read two quoted strings (C 2,3)
  3. skip two quoted strings (C 4,5 X)
  4. read two quoted strings (C 6,7)
  5. skip four generic fields as quoted strings (C 8-11 X)
  6. read a float (C12)
  7. skip eight generic fields as quoted strings (C 13-20 X)
  8. read a float (C21)
  9. skip rest of line
I put your three lines in a file locally:
>> data=textscan(fid,fmt,'headerlines',1,'delimiter',',','collectoutput',1)
data =
{2x5 cell} [2x2 double]
>> data{:}
ans =
'01000' '0' '10' '2016' '1'
'C3562' '0' '10' '2016' '2'
ans =
1902460 1
1923565 1
>>
which looks to be correct columns and data for same...
To strip to the desired rows (note data{1} holds the string columns, with area_fips first):
ix=ismember(data{1}(:,1),CellArrayList); % CellArrayList is your list of wanted codes
data{1}=data{1}(ix,:); data{2}=data{2}(ix,:); % and keep only those rows
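For the codes in the question, that list would be
CellArrayList = {'C3562','C4266','C1698','C4194','C4186'};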
  7 Comments
dpb on 16 Jun 2017
Edited: dpb on 16 Jun 2017
At what point did you run out of memory? And just how much of the data in the file is records of interest? The above solution took "the easy way out" and selected the specific codes after the whole file was read.
Alternative way to go at it --
Read a chunk of records at a time and search each chunk for records of interest...10,000 records ought to fit comfortably, just as a round number. You could then write the records found on each iteration to a file to ensure you don't run out of memory during the process and have the required data segregated. Worry about the next step after that.
That just entails putting a numeric repetition count in the textscan call after the format string and then putting that into a
while ~feof(fid)
...
end
loop to continue until you run out of data; a minimal skeleton follows. You could also add a check to exit the loop once you've found the last of the wanted codes; no need to read a bunch more if there isn't anything else of interest to be found. That, of course, assumes the codes are grouped together and not randomly scattered throughout the file.
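Here's that skeleton, assuming fid is an already-open file id and fmt is the format string from the accepted answer:
N = 10000;                          % repetition count per read
fgetl(fid);                         % skip the header record once
while ~feof(fid)                    % read until out of data
    chunk = textscan(fid,fmt,N,'delimiter',',','collectoutput',1);
    % ...filter chunk with ismember for the wanted codes and
    %    accumulate (or write out) the matching rows here...
end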
Or, perhaps where we should have headed from the get-go: look at datastore and tall arrays. There's a whole section on handling large data <Large Files> in the documentation. I really figured it would fit once the superfluous columns were cut out, and I haven't ever had need to use those features so don't know them well...
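A hedged sketch of the datastore route (the variable names here assume the file's header row matches the BLS field names; adjust them to the actual header):
ds = datastore('2016.q1-q4.singlefile.csv'); % a TabularTextDatastore for a CSV
ds.SelectedVariableNames = {'area_fips','own_code','industry_code', ...
                            'year','qtr','month3_emplvl','lq_month3_emplvl'};
ds.ReadSize = 100000;                        % rows per chunk
wanted = {'C3562','C4266','C1698','C4194','C4186'};
result = table();
while hasdata(ds)                            % stream through the file in chunks
    t = read(ds);
    result = [result; t(ismember(t.area_fips,wanted),:)]; %#ok<AGROW>
end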
Marylin
Marylin on 16 Jun 2017
Edited: Marylin on 16 Jun 2017
Ok, I'll look at the datastore object. I think reading the file in chunks of 10000 rows at a time won't be a performant thing since 1. I don't really know how many rows there are in the file because I've never been able to open it... 2. this is the quarterly file and I know there were more than 3.5 million rows in the annual one... so 3.5*4 = roughly 14 million rows!!!
Thanks for your help! Have a good weekend!


More Answers (2)

dpb on 16 Jun 2017
Edited: dpb on 16 Jun 2017
Alternative solution: albeit not totally Matlab, it worked pretty well here, it seems--
Rather than trying to parse the humongous file when you only want a small fraction of it, I used grep to filter out the lines with the desired area codes and then just read the result. grep took only a few seconds, and textscan then read the filtered file in almost no time. Here's the command line for grep:
grep -E -f "fips.lst" 2016.q1-q4.singlefile.csv >fipsdata.csv
where the input file of wanted areas contained
C1698
C3562
C4186
C4194
C4266
where the list above is sorted, since the codes appear in numerical order in the file and that should make for the fastest search.
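If there's any chance those strings could also occur in other columns, anchoring each pattern to the start of the line restricts the match to the area_fips field, which comes first. A hedged variant of the pattern file, assuming the fields are double-quoted in the raw file (as the %q reads suggest):
^"C1698"
^"C3562"
^"C4186"
^"C4194"
^"C4266"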
Then the code we had before works just fine, except that there's no header line in the output file...
fmt=[repmat('%q',1,3) '%*q %*q %q %q' repmat('%*q',1,4) '%f' repmat('%*q',1,8) '%f' '%*[^\n]'];
fid=fopen('data\fipsdata.csv','r');
data=textscan(fid,fmt,'headerlines',0,'delimiter',',','collectoutput',1);
fid=fclose(fid); clear fid
Results are
>> whos data
Name Size Bytes Class Attributes
data 1x2 14115464 cell
>> data(1)
ans =
{40516x5 cell}
>> data(2)
ans =
[40516x2 double]
>> data{1}(1:4,:)
ans =
'C1698' '0' '10' '2016' '1'
'C1698' '0' '10' '2016' '2'
'C1698' '0' '10' '2016' '3'
'C1698' '0' '10' '2016' '4'
>> data{2}(1:4,:)
ans =
4378686 1
4504606 1
4483985 1
4497614 1
>>
It appears that the fields other than the FIPS_AREA code that were read here as strings are, in fact, numeric and could be converted or read as such. Not knowing what they really are (other than year, say), I don't know if that would be the best course of action, or whether to turn most of them into categorical variables, maybe? Anyway, if your analyses will be confined to relatively small groups of these IDs such as this, this may be a more fruitful way to attack reading the huge file than trying to deal with it all in Matlab. I'm sure we could work around it in one of the above manners, but this is pretty simple. I admit I've not tried the datastore or other route as yet...
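For instance, a minimal sketch of such a conversion on the cell output above (which columns to treat as categorical versus numeric is my assumption, not the thread's):
own = categorical(data{1}(:,2));    % own_code as a categorical variable
ind = categorical(data{1}(:,3));    % industry_code likewise
yr  = str2double(data{1}(:,4));     % year converted to numeric
qtr = str2double(data{1}(:,5));     % quarter converted to numeric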
This is the version of grep for Win32 I'm using <tcharron grep.html>; I don't yet have a 64-bit system, although I'm sure there are bound to be ports for it; I've not looked.
ADDENDUM
I'd forgotten CMD FINDSTR -- it actually has sufficient "oomph" to return the identical result in roughly the same time. I'm sure it's also built around a version of grep, of course, so that's not too surprising, but I didn't recall it had the switches it does...
C:\ML_R2014b\work\data> findstr -G:"fips.lst" 2016.q1-q4.singlefile.csv >lines2.dat
C:\ML_R2014b\work\data> dir line*.*
Volume in drive C is unlabeled    Serial number is BC9D:AAD0
Directory of c:\ml_r2014b\work\data\line*.*

lines.dat    7134516  6/16/17  14:55
lines2.dat   7134516  6/16/17  15:54
...

dpb on 17 Jun 2017
And, just for documentation, the above (essentially) written in Matlab using the previously-mentioned looping construct...it uses an auxiliary file (here 'fips.lst') for the desired codes to be extracted; the same file as read by either of the external text-search routines works--
fmt=[repmat('%q',1,3) '%*q %*q %q %q' repmat('%*q',1,4) '%f' repmat('%*q',1,8) '%f' '%*[^\n]'];
fips=sort(importdata('data\fips.lst')); % read the code list, sort ascending per database order
fid=fopen('data\2016.q1-q4.singlefile.csv','r'); % open the real file
fgetl(fid); % skip the header record
N=500000; % a sizable block size per each read
i=0; % counter for the group found
n=0; % info counter--accumulate number found
while ~feof(fid) % start, read 'til out of data
tic
data=textscan(fid,fmt,N,'headerlines',0,'delimiter',',','collectoutput',1);
ix=ismember(data{1}(:,1),fips); % find out if have any wanted
if any(ix) % and did, so save
i=i+1; % increment the cell index array
c{i,1}=data{1}(ix,:); % and keep the rows of interest only
c{i,2}=data{2}(ix,:);
n=n+sum(ix); % information statistics -- how many?
fprintf('found: %5d total %8d\n',sum(ix),n) % for progress reporting only
end
clear data % somewhat spurious while tweaking...
t=toc; % just see how long it takes to
fprintf('%.2f sec %8dK records\n',t,i*N/1000) % read a buffer and how many total read
end
fid=fclose(fid); clear fid % done with the file now
data{1,1}=vertcat(c{:,1}); data{1,2}=vertcat(c{:,2}); % return in cell arrays of n rows
The only other (yet untried) method would be the aforementioned datastore, but either of these last two is pretty straightforward as long as the number of records selected overall fits in memory reasonably easily.
On this 20-yo machine the grep solution outperformed the Matlab textscan loop by a little, but not terribly so, and one presumes the major effort is in the analysis after reading the data, not in reading different sets of data repetitively.
