detecting Excel files without Excel extension

4 views (last 30 days)
I need to be able to identify an Excel file without using xlsfinfo. It is a VERY slow function. Without it, my algorithm runs faster. Any ideas?
Background: Our test setup outputs text files that are named with the ".xls" extension. Back in Excel 2003 and before, if you named a tab delimited text file with "xls", Excel would open the file no problem and properly interpret it. Now, it is causing me headaches with importing data into MATLAB. MATLAB uses the ".xls" extension to check for wrong input to "textscan" and "importdata." Here is where it gets interesting. When you open a text file with the ".xls" extension and then save it, it is now an excel file and no longer tab delimited text file. Basically, I need to be able to tell the difference between text and Excel files without xlsfinfo because xlsfinfo is very slow.

Accepted Answer

Iain
Iain on 23 May 2013
You can try something like
fid = fopen(filename,'r');
fifty_bytes = fread(fid,50,'uint8'); fclose(fid);
Valid_characters = ['A':'Z' 'a':'z' ',./\ 0123456789' 9 10 13 '!"£$%^&*()[]{}-=_+ ... and any other valid characters you expect in your text files. ]
excel_file = false;
for i = 1:50
if any(fifty_bytes ~= Valid_characters)
excel_file = true;
end
end
  3 Comments
Jason Nicholson
Jason Nicholson on 27 May 2013
Edited: Jason Nicholson on 27 May 2013
I am going to try this method. I will post more on my findings.
Jason Nicholson
Jason Nicholson on 27 May 2013
Edited: Jason Nicholson on 27 May 2013
This method seems to work really well and it is blazing fast!!! The only issue is I need to right some code to deal with xlWebArchive's and xlHtml's (see here for more information).
function result = isBinary(file)
%Looks for nonreadable characters in the 1st 100 bytes
%
% result = isBinary(file)
% result is bolean, 0 or 1. result is true/1 if file is an Excel
% file. result is 0 if file is text.
%
% Open file
fileID = fopen(file,'r');
% Get 100 bytes of binary data
binaryData = fread(fileID,100,'uint8');
fclose(fileID);
% Valid ASCII Characters
validCharacters = [9 10 12 13 32:126 128 130:140 142 145:156 158:255];
nonPrintableCharacters = setdiff(binaryData,validCharacters);
% Output results
if isempty(nonPrintableCharacters)
result = true;
else
result = false;
end
end % end function

Sign in to comment.

More Answers (2)

José-Luis
José-Luis on 23 May 2013
Edited: José-Luis on 23 May 2013
You could always check the file signature. You would need to read the file as a binary file and look into specific locations, using hexadecimal values.
Here is a snippet that will give you a string of the hexadecimal values:
fid = fopen('/path/to/test.xls','r');
uid = fread(fid,4,'ushort','b');
fseek(fid,520,'bof');
ftype = fread(fid,4,'ushort','b');
fclose(fid);
s1 = sprintf('%X',uid)
s2 = sprintf('%X',ftype)
If the file is an office document, s1 should be:
D0 CF 11 E0 A1 B1 1A E1 without the spaces
It it is an excel spreasheet, then s2 should be:
FD FF FF FF nn xx where the last two values (nn and xx) can change within a certain range.
It would be a matter of using strcmp(). You could of also compare hexadecimal values directly with fi(). For your purpose, it sounds like it would be enough to check if it is an Office document.
Text files carry no file identifier.
  4 Comments
José-Luis
José-Luis on 24 May 2013
What are you getting? Maybe it is a little endian system? You could try with:
uid = fread(fid,4,'ushort','l');
Same thing for ftype.
Jason Nicholson
Jason Nicholson on 27 May 2013
Edited: Jason Nicholson on 27 May 2013
Conclusion: I have spent about 20 hours on this method. I am finding it unreliable. I have analyzed 44,920 Excel files on our servers dating back to 1996. I will find a different way to identify excel files. I have two methods in mind to try. I will post more but not in this reply.
Details: I wrote a vbscript to return the FileFormat ( how to get it here ) property of the Excel Workbook. I used the script to congregate files of like Excel filetype before I looked for hex patterns. I found the following file types identified by the script:
  • Enumeration Format
  • -4158 xlCurrentPlatformText
  • 16 xlExcel2
  • 18 xlAddIn
  • 29 xlExcel3
  • 33 xlExcel4
  • 39 xlExcel5
  • 42 xlUnicodeText
  • 43 xlExcel9795
  • 44 xlHtml
  • 45 xlWebArchive
  • 46 xlXMLSpreadsheet
  • 50 xlExcel12
  • 51 xlOpenXMLWorkbook
  • 52 xlOpenXMLWorkbookMacroEnabled
  • 56 xlExcel8
  • 63 ????
Once I had identified the Excel files, I used TridScan.exe to look for binary patterns in each of the individual Excel filetypes. For instance, I scanned 26,811 xlExcel8 files with TridScan to find the following signature:
  • xlExcel8 (xls, 26,811 files analyzed)
  • Position Bytes
  • 0 D0CF11E0A1B11AE100000000000000000000000000000000
  • 25 000300FEFF0900060000000000000000000000
  • 46 0000
  • 51 000000000000100000
  • 65 000000
  • 73 000000
  • 79 00
If xlExcel8 was the only type, using signatures to identify Excel file would be easy. However, there are at least 11 different patterns that are unique among different Excel filetypes. Here are the rest of the signatures:
  • xlExcel2 (xls, 13 files analyzed)
  • Position Bytes
  • 0 0900040007001000
  • 16 00000
  • 31 00
  • 39 00
  • 47 00
  • 55 00
  • 75 00
  • 123 00
  • xlExcel4 (xls, 51 files analyzed)
  • Position Bytes
  • 0 0904060000001000
  • 10 5C00
  • 13 00
  • 28 202020202020202020202020202020202020
  • 145 00
  • 149 00
  • 157 00
  • 161 00
  • xlExcel5 (xls, 1067 files analyzed)
  • Position Bytes
  • 0 D0CF11E0A1B11AE100000000000000000000000000000000
  • 25 000300FEFF0900060000000000000000000000
  • 45 000000
  • 50 00000000000000100000
  • 65 000000
  • 73 000000
  • 78 0000
  • xlExcel9795 (xls, 483 files analyzed)
  • Position Bytes
  • 0 D0CF11E0A1B11AE1000000000000000000000000000000003E000300FEFF0900060000000000000000000000
  • 46 0000
  • 51 000000000000100000
  • 65 000000
  • 73 000000
  • 79 00
  • 519 00
  • 526 0000
  • 530 0000
  • 533 00
  • 535 00
  • 539 00
  • 541 000000
  • 545 000000
  • 549 00
  • 551 00
  • 553 0000
  • xlWebArchive (xls, 111 files analyzed)
  • Position Bytes
  • 37 6F72
  • xlXMLSpreadsheet (xls, 5 files analyzed)
  • Position Bytes
  • 40 20
  • 60 22
  • 104 63
  • 252 66
  • 282 74
  • 284 22
  • 530 2F
  • 1190 6F
  • xlExcel12 (xlsb, 9 files analyzed)
  • Position Bytes
  • 0 504B030414000600080000002100
  • 20 0000
  • 24 00001300
  • 30 5B436F6E74656E745F54797065735D2E786D6C20A2
  • 53 28A00002 more zeros
  • 1353 0000
  • 1358 00
  • 1364 00
  • 1366 00
  • 1368 000000
  • 1372 00
  • 1378 000000
  • 1383 0000
  • 1386 00
  • 1406 00
  • 1408 0000000000000000000000000000000000
  • xlOpenXMLWorkbook (xls, 22 files analyzed)
  • Position Bytes
  • 0 504B030414000600080000002100
  • 20 0000
  • 24 00001300
  • 30 5B436F6E74656E745F54797065735D2E786D6C20A2
  • 53 28A00002 more zeros
  • 1147 000000
  • 1153 00
  • 1159 00
  • 1161 00
  • 1167 00
  • 1173 000000
  • 1178 0000
  • 1181 00
  • 1201 00
  • 1203 00000000000000000 more zeros
  • 1901 00
  • 1907 00
  • 1909 00
  • 1911 000000
  • 1915 00
  • 1922 0000
  • 1926 0000
  • 1929 00
  • 1964 00
  • 1966 0000000000000000000000000000000000000000000000
  • xlOpenXMLWorkbook (xlsx, 11014 files analyzed)
  • Position Bytes
  • 9 00
  • 20 0000
  • 25 00
  • 27 00
  • xlOpenXMLWorkbookMacroEnabled (xlsm, 479 files analyzed)
  • Position Bytes
  • 9 000000
  • 13 00
  • 20 0000
  • 25 00
  • 27 00
  • 55 00
  • 58 0000
  • 61 000000
  • 65 000000
  • 72 0000000000000000
  • 81 000000
  • 85 000000

Sign in to comment.


Jason Nicholson
Jason Nicholson on 27 May 2013
I wrote a function that uses the Excel ActiveX server. It is still slow like MATLAB's xlsfinfo. However, it gains a little speed because it does not traverse the sheets in the excel file like xlsfinfo does.
I will use this function to benchmark other methods like checking the 1st 50kbytes for unreadable characters.
function result = isExcelFile(file)
%isExcelFile checks if the file is an actual Excel format rather than text
%with an incorrect file extension.
%
% [result] = isExcelFile(file)
% result is bolean, 0 or 1. result is true/1 if file is an Excel
% file. result is 0 if file is text.
%
try
oExcel = actxserver('excel.application');
catch exception
error('Unable to Open Excel through ActiveX server. Search "Excel.Application object reference" on Google for more information');
end
% Do not display alerts. Instead use the default action of each alert
oExcel.DisplayAlerts = 0;
% Open the workbook
oWorkbook = oExcel.workbooks.Open(file,false,true);
% Get the Excel file Format
excelFileFormat = oWorkbook.FileFormat;
if (strcmpi(excelFileFormat,'xlCurrentPlatformText')==1 || strcmpi(excelFileFormat,'xlUnicodeText')==1 || strcmpi(excelFileFormat,'xlTextMac')==1 || strcmpi(excelFileFormat,'xlTextMSDOS')==1 || strcmpi(excelFileFormat,'xlTextPrinter')==1 || strcmpi(excelFileFormat,'xlTextWindows')==1)
result = false;
else
result = true;
% For debugging
%disp(excelFileFormat);
end
try
% Close Workbook without saving
oWorkbook.Close(false);
% Close and clear Excel Application
oExcel.Quit;
oExcel.delete;
catch exception
disp('Unable to close workbook or exit Excel application');
end
end % end isExcelFile funciton

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!