Copyright (c) 2016, The MathWorks, Inc.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the distribution.
* In all cases, the software is, and all modifications and derivatives
of the software shall be, licensed to you solely for use in conjunction
with MathWorks products and service offerings.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
timo
I haven't understood too much from the emails, but I liked the undertones of corporate talk (Intel). Very funny to me.
Cleve has given us a wonderful history of the Pentium division bug.
The fact that a chip can have a bug in its floating-point arithmetic is not surprising, as one of the articles pointed out. Intel made terrific products then, and still does. Where Intel went wrong was to rely on a probabilistic approach to estimate how serious the flaw was. Even if it's the right answer (which it isn't) to the question "how often will this occur?", the question itself is wrong. The right question is "will this seriously affect my application?", for which the right answer is "I don't know."
Even if the bug occurs "randomly" once every 9 billion divisions, the Intel white paper neglects to consider the fact that the result of one floating-point operation is most typically used as the input to the next. If an entire calculation takes 9 billion floating-point operations, and one of them fails catastrophically (leaving only a few correct significant digits), then it's likely that the whole calculation will be off. The assumption is flawed, but at least the white paper states clearly its assumption that these 9 billion divisions are independent.
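The point about dependent operations can be sketched in a few lines. This is an illustrative simulation, not the actual FDIV failure mode: the hypothetical helper `truncate_sig` simulates a faulty divide by keeping only a few significant digits, and the chain lengths and divisors are made up for the example.

```python
from math import floor, log10

def truncate_sig(x, digits=4):
    """Simulated faulty divide: keep only `digits` significant digits."""
    if x == 0:
        return 0.0
    scale = 10 ** (digits - 1 - floor(log10(abs(x))))
    return floor(x * scale) / scale

def chained_divide(x, divisors, fault_at=None):
    """Repeatedly divide x; each result feeds the next division."""
    for i, d in enumerate(divisors):
        x = x / d
        if i == fault_at:        # one faulty operation early in the chain...
            x = truncate_sig(x)  # ...loses most of its significant digits
    return x

divisors = [1.000001] * 1000
good = chained_divide(1.0, divisors)
bad = chained_divide(1.0, divisors, fault_at=0)

# The relative error of the whole chain is on the order of the single
# fault's error (about 1e-4 here), not the usual ~1e-16 rounding level.
print(abs(good - bad) / abs(good))
```

One inaccurate result near the start contaminates every operation downstream, which is exactly why "1 error per 9 billion divisions" understates the damage to a long dependent computation.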
The whole history of the Intel FDIV bug would have been quite different had Intel posted both the bug and a software workaround (even a naive one that could be improved later), and quickly provided that workaround to compiler developers. The lesson here is not to let the "blogosphere" run away with your story, but to be up-front. The best answer Intel could have given when asked if this is a serious flaw is "We don't know, but here's a workaround." An even better answer is to not let the question be asked at all, by stating the solution before your customers even know to ask.
There's a gem of numerical analysis in the Intel white paper, "Statistical Analysis of Floating Point Flaw in the Pentium Processor." It states that Gaussian elimination on an n-by-n matrix takes O(n^3) floating-point operations (which it does), but that it performs only O(n) divisions. Now, it *can* be written that way, by taking the reciprocal of the pivot and multiplying the pivot column by the reciprocal. However, that method is less accurate than dividing by the pivot, for two reasons: (1) in the typical case you get two rounding errors instead of one, and (2) one over a denormal number overflows to IEEE Infinity, but dividing one denormal by another is quite safe.
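The denormal hazard in reason (2) is easy to demonstrate. A minimal sketch, with made-up tiny values standing in for a denormal pivot and a denormal column entry:

```python
import math

pivot = 5e-324   # smallest positive subnormal double
entry = 1e-320   # another tiny entry in the pivot column

# Strategy 1: one reciprocal, then multiplies (O(n) divisions overall).
recip = 1.0 / pivot          # overflows: recip is IEEE Infinity
scaled_by_recip = entry * recip   # Infinity poisons the whole column

# Strategy 2: divide each entry by the pivot (more divisions overall).
scaled_by_div = entry / pivot     # finite: the ratio of two denormals

print(math.isinf(recip))          # the reciprocal overflowed
print(math.isfinite(scaled_by_div))  # direct division stays safe
```

Strategy 1 trades divisions for multiplications and gains speed, but the reciprocal both adds a rounding error and can overflow; strategy 2 costs more divisions and keeps the result representable.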
It was Cleve himself who pointed this out to me. I made the same mistake (initially) in the sparse backslash (MATLAB uses software that I wrote: UMFPACK and CHOLMOD). Have no fear ... the sparse backslash performs O(nnz(L)) floating-point divisions, not O(n). In the dense case, then, an accurate method for Gaussian elimination will perform O(n^2) divisions, not O(n). As a ratio to total work, sparse matrix factorization performs more divisions than dense matrix factorization (the flops-per-nnz(L) ratio is n for a dense matrix, but much lower than n for a sparse matrix).
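The O(n^2) division count for the accurate dense method can be checked with a simple tally. This is a counting sketch only, assuming plain elimination without pivoting, with one divide per subdiagonal entry in each pivot column:

```python
def count_divisions_dense_lu(n):
    """Divisions in dense Gaussian elimination when every entry below
    the pivot is divided by the pivot directly (the accurate method)."""
    divs = 0
    for k in range(n - 1):
        divs += n - 1 - k   # one divide per entry below pivot k
    return divs

# n(n-1)/2 divisions in total, i.e. O(n^2), against O(n^3) total flops.
print(count_divisions_dense_lu(10))   # 45 == 10*9/2
```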
What is more problematic about the analysis in the Intel white paper is that, for this problem at least, the assumption that the divisions are independent does not hold. An error in the first one will cause all subsequent floating-point operations to be inaccurate, since their inputs are all inaccurate. Yet the paper continues with this assumption in Figure 6-1, where the frequency of error is estimated as K*P1*P2*P3 problems per year, and where P1 is the mere 1-in-9-billion estimate for "independent" floating-point divisions.
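Even taken on its own terms, the independence model is not reassuring for long computations. An illustrative sketch (the division counts are arbitrary examples, and p is the white paper's 1-in-9-billion figure):

```python
p = 1.0 / 9e9   # the white paper's per-division failure estimate

def prob_at_least_one(n_divisions, p=p):
    """P(at least one faulty divide in n *independent* divisions)."""
    return 1.0 - (1.0 - p) ** n_divisions

# Under independence, the failure probability grows quickly with the
# length of the computation; one hit then corrupts the dependent chain.
for n in (1_000_000, 1_000_000_000, 9_000_000_000):
    print(f"{n:>13,} divisions -> {prob_at_least_one(n):.4f}")
```

At 9 billion independent divisions the model itself predicts a failure probability of about 1 - 1/e, roughly 63%, so "1 in 9 billion" is small per operation but not small per application.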
Thanks for your good work.