Numeric column not read as string with readtable on XLS file

38 views (last 30 days)
I have an XLS data file with a column that looks like a number, but it is actually a unique string. I cannot control the design of this file type.
I am explicitly defining that the column (by variable name) is a string after generating a readtable opts sctruct, with either detectImportOptions or sheetImportOptions. The imported data must be passed through a sprintf/compose because the result is unworkable.
The following are some unique strings directly copied from my XLS file.
9.26433554
5.28131573
5.28131575
5.28131574
5.28131785
5.28131784
7.21999049
7.21999624
7.21999662
7.22000179
7.22001452
7.22004156
7.22004157
Yet, when I read in the XLS file with readtable, again explicitly stating the column is a string, it returns this:
"9.26434"
"5.28132"
"5.28132"
"5.28132"
"5.28132"
"5.28132"
"7.21999"
"7.22"
"7.22"
"7.22"
"7.22001"
"7.22004"
"7.22004"
Notice that several of the numbers are no longer unique. This is incredibly frustrating, because I generated a similar script to work on another matched dataset, and it reads those values correctly, though they are XLSX.
I've tried compose the doubles into strings with the correct width but it isn't matching about 5% of the data that is there. I've tried generating a script with uiopen, and set UseExcel to true and false; nothing is working.
I'm running MATLAB R2021b in Windows 10 with Office 365 (16.*).
  2 Comments
Jacob Lynch August
Jacob Lynch August on 1 Dec 2021
I edited the original posting to include a sample of the data. Column D, History.UUT_ID, is the offender.

Sign in to comment.

Answers (2)

Walter Roberson
Walter Roberson on 1 Dec 2021
Excel stores "general numbers" in .xls files as binary floating point ("XNUM"), not as text.
Probably you could also use ActiveX to tell Excel to change the format.
Because the value is internally stored in floating point, when you tell MATLB to read it as string, it does string() on the value. Unfortunately there does not appear to be any way to control the precision when doing that :(
The situation is different, at least in theory, for .xlsx files: .xlsx files do not store values in binary floating point, and instead store numbers as text in a .XML file. With .xlsx files, it is at least possible in theory to unzip the .xlsx file and open the .xml and parse out the text of the number. Older versions of MATLAB did exactly that: from roughly R2013b to R2018a, readtable() for xlsx files was implemented as MATLAB code that pulled out the .xml text and used regexp() to pull out the pieces. But that is theory; in practice, for .xlsx files you get the same difficulty that forcing string output for what was coded as a numeric field results in string() being applied with no control over the number of digits.
This difficulty does not exist for fields stored as text -- though historically there have been releases where leading zeros were dropped from text fields that happened to look like numbers.

Star Strider
Star Strider on 1 Dec 2021
The readtable function has no problems with that variable (that appears to be ‘History.UUT_ID’) and reads it as a double. I imported it here, not offline on my computer, however I suspect it will import the same way, since I also have R2021b there. The ‘History.SENSOR_SN’ variable was imported as a cell array of strings, however that is easily converted using the str2double function —
T1 = readtable('https://www.mathworks.com/matlabcentral/answers/uploaded_files/819254/EData0001.xls', 'VariableNamingRule','preserve')
T1 = 13×10 table
History.CALIBRATION_TIMESTAMP History.SENSOR_SN History.SENSOR_PROD_CODE History.UUT_ID History.CAL_TIME_MIN History.SUCCESS_FLAG History.FUNCTION_NAME History.FAILURE_MESSAGE History.GROUP History.WORK_ID _____________________________ _________________ ____________________________________ ______________ ____________________ ____________________ _____________________ ________________________________________ ___________________ _______________ 30-Sep-2019 14:24:01 {'15068231' } {'8750WDMT2A1FNSA080CDDM4D1Q4PD' } 9.2643 144.68 1 {'FINGERPRINT'} {0×0 char } {'B' } 2.0534e+07 30-Sep-2019 20:54:54 {'13292629' } {'8750WDMR2A2FTHA060SA1M4C1D1Q5Q8' } 5.2813 59.15 0 {'CAL' } {'divisor is equal to zero↵at line 902'} {'¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿'} 2.0526e+07 30-Sep-2019 22:20:37 {'13292651' } {'8750WDMR2A2FTHA060SA1M4C1D1Q5Q8' } 5.2813 144.87 1 {'FINGERPRINT'} {0×0 char } {'¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿'} 2.0526e+07 30-Sep-2019 22:20:38 {'13292631' } {'8750WDMR2A2FTHA060SA1M4C1D1Q5Q8' } 5.2813 144.88 1 {'FINGERPRINT'} {0×0 char } {'¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿'} 2.0526e+07 01-Oct-2019 00:25:35 {'13292629' } {'8750WDMR2A2FTHA060SA1M4C1D1Q5Q8' } 5.2813 111.68 1 {'FINGERPRINT'} {0×0 char } {'¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿'} 2.0526e+07 01-Oct-2019 00:25:35 {'13292630' } {'8750WDMR2A2FTHA060SA1M4C1D1Q5Q8' } 5.2813 111.68 1 {'FINGERPRINT'} {0×0 char } {'¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿'} 2.0526e+07 05-Oct-2019 05:13:36 {'14960506' } {'8750WDMT2A2FPSA015CDHDA1M4C1D1Q4'} 7.22 25.62 0 {'VERIFY' } {'verify failed' } {'E' } NaN 06-Oct-2019 06:22:36 {'14960506' } {'8750WDMT2A2FPSA015CDHDA1M4C1D1Q4'} 7.22 26.52 0 {'CAL' } {'REPEATABILITY FAILURE' } {'E' } NaN 06-Oct-2019 06:46:55 {'14960506' } {'8750WDMT2A2FPSA015CDHDA1M4C1D1Q4'} 7.22 22.08 0 {'CAL' } {'REPEATABILITY FAILURE' } {'E' } NaN 07-Oct-2019 08:27:18 {'14960506.1'} {'8750WDMT2A2FPSA015CDHDA1M4C1D1Q4'} 7.22 87.65 1 {'FINGERPRINT'} {0×0 char } {'N' } 2.0801e+07 07-Oct-2019 15:20:33 {'14960505.1'} {'8750WDMT2A2FPSA015CDHDA1M4C1D1Q4'} 7.22 86.17 1 {'FINGERPRINT'} {0×0 char } {'N' } 2.0801e+07 08-Oct-2019 12:47:38 {'14960933.1'} {'8750WDMT2A2FPSA015CDHDA1M4C1D1Q4'} 7.22 109.3 1 {'FINGERPRINT'} {0×0 char } {'N' } 2.0801e+07 08-Oct-2019 12:47:40 {'14960932' } {'8750WDMT2A2FPSA015CDHDA1M4C1D1Q4'} 7.22 109.33 1 {'FINGERPRINT'} {0×0 char } {'E' } NaN
T1.('History.SENSOR_SN') = str2double(T1.('History.SENSOR_SN'))
T1 = 13×10 table
History.CALIBRATION_TIMESTAMP History.SENSOR_SN History.SENSOR_PROD_CODE History.UUT_ID History.CAL_TIME_MIN History.SUCCESS_FLAG History.FUNCTION_NAME History.FAILURE_MESSAGE History.GROUP History.WORK_ID _____________________________ _________________ ____________________________________ ______________ ____________________ ____________________ _____________________ ________________________________________ ___________________ _______________ 30-Sep-2019 14:24:01 1.5068e+07 {'8750WDMT2A1FNSA080CDDM4D1Q4PD' } 9.2643 144.68 1 {'FINGERPRINT'} {0×0 char } {'B' } 2.0534e+07 30-Sep-2019 20:54:54 1.3293e+07 {'8750WDMR2A2FTHA060SA1M4C1D1Q5Q8' } 5.2813 59.15 0 {'CAL' } {'divisor is equal to zero↵at line 902'} {'¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿'} 2.0526e+07 30-Sep-2019 22:20:37 1.3293e+07 {'8750WDMR2A2FTHA060SA1M4C1D1Q5Q8' } 5.2813 144.87 1 {'FINGERPRINT'} {0×0 char } {'¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿'} 2.0526e+07 30-Sep-2019 22:20:38 1.3293e+07 {'8750WDMR2A2FTHA060SA1M4C1D1Q5Q8' } 5.2813 144.88 1 {'FINGERPRINT'} {0×0 char } {'¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿'} 2.0526e+07 01-Oct-2019 00:25:35 1.3293e+07 {'8750WDMR2A2FTHA060SA1M4C1D1Q5Q8' } 5.2813 111.68 1 {'FINGERPRINT'} {0×0 char } {'¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿'} 2.0526e+07 01-Oct-2019 00:25:35 1.3293e+07 {'8750WDMR2A2FTHA060SA1M4C1D1Q5Q8' } 5.2813 111.68 1 {'FINGERPRINT'} {0×0 char } {'¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿'} 2.0526e+07 05-Oct-2019 05:13:36 1.4961e+07 {'8750WDMT2A2FPSA015CDHDA1M4C1D1Q4'} 7.22 25.62 0 {'VERIFY' } {'verify failed' } {'E' } NaN 06-Oct-2019 06:22:36 1.4961e+07 {'8750WDMT2A2FPSA015CDHDA1M4C1D1Q4'} 7.22 26.52 0 {'CAL' } {'REPEATABILITY FAILURE' } {'E' } NaN 06-Oct-2019 06:46:55 1.4961e+07 {'8750WDMT2A2FPSA015CDHDA1M4C1D1Q4'} 7.22 22.08 0 {'CAL' } {'REPEATABILITY FAILURE' } {'E' } NaN 07-Oct-2019 08:27:18 1.4961e+07 {'8750WDMT2A2FPSA015CDHDA1M4C1D1Q4'} 7.22 87.65 1 {'FINGERPRINT'} {0×0 char } {'N' } 2.0801e+07 07-Oct-2019 15:20:33 1.4961e+07 {'8750WDMT2A2FPSA015CDHDA1M4C1D1Q4'} 7.22 86.17 1 {'FINGERPRINT'} {0×0 char } {'N' } 2.0801e+07 08-Oct-2019 12:47:38 1.4961e+07 {'8750WDMT2A2FPSA015CDHDA1M4C1D1Q4'} 7.22 109.3 1 {'FINGERPRINT'} {0×0 char } {'N' } 2.0801e+07 08-Oct-2019 12:47:40 1.4961e+07 {'8750WDMT2A2FPSA015CDHDA1M4C1D1Q4'} 7.22 109.33 1 {'FINGERPRINT'} {0×0 char } {'E' } NaN
Col2 = T1.('History.SENSOR_SN');
Col4 = T1.('History.UUT_ID');
whos Col2 Col4
Name Size Bytes Class Attributes Col2 13x1 104 double Col4 13x1 104 double
Try that approach, and be certain that all the latest updates are installed. The readmatrix function likely also works with the same conversion (I didn’t experiment with it). I prefer the extra information readtable provides.
.
  3 Comments
Walter Roberson
Walter Roberson on 1 Dec 2021
Comparing floating point numbers for equality was always going to be problematic. You can't count on systems to retain last-digit accuracy when handling numbers represented as text.
For example MATLAB's "format long g" skips outputing the last digit that would be needed to definitely resolve to the last bit, so about 1 time in 3 a value output using "format long g" format will not read in exactly the same as the original number.
Star Strider
Star Strider on 1 Dec 2021
As Walter notes, exact equality is likely not an option. Tolerances are usually required, so depending on the application, ismembertol and uniquetol take this into account, and it is usually possible to include a tolerance in other instances, depending on the application. The discussion and examples in the documentation section on Group Scattered Data Using a Tolerance is particularly informative in that respect.
.

Sign in to comment.

Products


Release

R2021b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!