Questions about 64-bit stuff

Huy Truong
Huy Truong on 27 Jul 2015
Edited: Stephen23 on 28 Jul 2015
Here's my question based on what I understand from the book I'm reading. Hopefully, someone can understand my rough idea:
1) Say we have 64 bits. Each bit is either 0 or 1, so 64 bits have only 64 spots to store 0s and 1s. But 10^380 is a very long (and huge) number: we would need about 380 spots just to write it down, i.e., 10...000. So how can a computer store this number? I'm totally lost here.
2) The "uint64" data type means the computer requires 64 bits to store such a number, and the maximum integer it can store is 2^64 - 1. Compare that to "double", which also uses 64 bits but stores fractions, yet the largest number it can store is about 1.79x10^380. 10^380 is a very, very large number compared to 2^64. How can this be? I mean, why don't we just throw away (literally throw away) "uint64", since it uses the same amount of memory as "double", which can store even larger numbers?
Unless I'm crazy here, or misunderstand something. Someone please help explain.
Thanks.
  1 Comment
Walter Roberson
Walter Roberson on 27 Jul 2015
realmax() is 1.79769313486232e+308, not a number roughly equal to 10^380.


Accepted Answer

Jan
Jan on 27 Jul 2015
You can write down 10^380 with even fewer than 64 bits: 6 bytes are enough (1 per character; it does not matter here that MATLAB uses 2 bytes per char):
'1', '0', '^', '3', '8', '0'
But of course you have limited accuracy with 6 characters. You can also represent 11^381, but not 10.1^380. A similar effect occurs in the double format: you have one bit for the sign, some bits for the exponent and some for the mantissa. This way you get about 16 digits and numbers up to about 10^308. But you cannot store e.g. 18 valid digits in such a number, due to the limited precision.
In uint64 you can store integers up to 2^64-1 exactly. The greater range of the double format comes at the cost of the limited precision of the mantissa. So you can think of a double roughly as 1 sign bit + a 52-bit mantissa + an 11-bit exponent.
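For illustration, here is the trade-off sketched in Python (whose built-in float is, on typical platforms, the same IEEE 754 64-bit double; numpy's uint64 stands in for MATLAB's):

```python
import numpy as np

# A double keeps only ~16 significant decimal digits, so an 18-digit
# integer cannot survive the round trip through a float:
n = 123456789012345678            # 18 digits
print(float(n) == n)              # False: the low digits are rounded away

# A uint64, by contrast, stores every integer up to 2^64 - 1 exactly:
x = np.uint64(2**64 - 1)
print(int(x) == 2**64 - 1)        # True
```

Same 64 bits in both cases; the double spends some of them on the exponent, so it gains range and loses integer exactness.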

More Answers (3)

Image Analyst
Image Analyst on 27 Jul 2015
I think this should explain it all: https://en.wikipedia.org/wiki/IEEE_floating_point
  4 Comments
Steven Lord
Steven Lord on 27 Jul 2015
You might be interested in section 7 of the introduction chapter of Cleve's Numerical Computing with MATLAB.
The main incorrect assumption you're making is that double precision numbers are equally spaced (in terms of absolute difference) throughout the range covered by double precision. This IS true for uint64 (the uniform spacing is 1) but is NOT true for double.
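That growing spacing can be seen directly in Python (whose float is the same IEEE 754 double), using math.ulp, which returns the gap to the next representable value:

```python
import math

# The gap between adjacent representable doubles grows with magnitude:
print(math.ulp(1.0))     # 2.220446049250313e-16 (i.e. 2^-52)
print(math.ulp(1e16))    # 2.0 -- neighbouring doubles are 2 apart up here
```

For uint64 the corresponding "spacing" is a constant 1 across the whole range.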
Muthu Annamalai
Muthu Annamalai on 27 Jul 2015
@Huy Truong - you just answered the question, "What is the difference between floating point and fixed point numbers ?"
Dynamic range
Read some of the interesting MATLAB documentation on fixed-point and floating-point numbers.



Stephen23
Stephen23 on 27 Jul 2015
Edited: Stephen23 on 28 Jul 2015
The core difference is this:
  • floating point classes (e.g. double and single) split their total number of bits into three groups: the main part encodes the digits (or fraction), a smaller part encodes the magnitude, and one bit encodes the sign.
  • integer classes only encode the digits, and possibly the sign.
This means floating point numbers encode a value a bit like this:
X * ZZZZZZZZZZZZZZZ * 2^YYYYY
where X is the sign bit, the Z's are the digits (the fraction), and the Y's are the exponent (multiplier) bits. The advantage of doing this is that it is possible to encode a reasonably large range of magnitudes (the range of 2^YYYYY) with the same precision (however many Z digits there are). Note that it is not possible to represent all integers within that range!
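The X/Y/Z split above can be pulled apart with a few lines of Python (whose float is the same IEEE 754 double); the helper name double_fields is made up for this sketch:

```python
import struct

def double_fields(x):
    """Split a 64-bit double into its sign, exponent and fraction bit fields."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    sign     = bits >> 63              # 1 bit   (the X above)
    exponent = (bits >> 52) & 0x7FF    # 11 bits (the Y's)
    fraction = bits & ((1 << 52) - 1)  # 52 bits (the Z's)
    return sign, exponent, fraction

print(double_fields(-2.0))   # (1, 1024, 0): -1 * 1.0... * 2^(1024-1023)
```

The stored exponent is biased by 1023, which is how both very large and very small magnitudes fit in the same 11 bits.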
An integer can be much simpler:
XZZZZZZZZZZZZZZZZZZZZ
Why do we not "throw away" the integer classes? They encode precise integer values right up to their limits (note there are more Z digits for the same number of bits), so their memory usage can be much more efficient; and because many operations can be applied directly to the bits themselves, their operations can be faster.
  2 Comments
Huy Truong
Huy Truong on 27 Jul 2015
Edited: Huy Truong on 27 Jul 2015
1) "So their memory usage can be much more efficient". This is exactly what I don't understand: how is it more efficient? For example:
x = uint64(5) % uint64 data type
x =
5
>> y = 7 % double data type
y =
7
>> whos
  Name      Size            Bytes  Class     Attributes
  x         1x1                 8  uint64
  y         1x1                 8  double
As you can see, x and y both use 8 bytes, so how could it be more efficient? This example suggests that, at the very least, we could throw away "uint64".
2) "because many operations can be applied directly to the bits themselves their operations can be faster." I think I understand what you said. Do you mean, for example, that adding two doubles involves a more complicated mechanism inside the computer than adding two pure integers?
James Tursa
James Tursa on 27 Jul 2015
Edited: James Tursa on 27 Jul 2015
"... adding two double has a more complicated mechanism happening inside computer than adding two pure integers?"
Yes. When adding two doubles, the code has to check for special bit patterns first (NaN, inf, denormalized). If they are present, then special code to determine the result must be used. If normal bit patterns are present, then you need to handle the difference in exponents for the two numbers to get the mantissas to effectively "line-up" for the addition. Then you need to account for a possible difference in signs (one positive and the other negative). And the result might overflow into an inf pattern, or underflow into a denormalized pattern.
For adding two integer bit patterns, the bits are already "lined up", since there are no exponents to worry about, so a simple algorithm to add the bits works. And if 2's complement format is used (which is typical in modern computers), the exact same algorithm works for positive and negative operands. Overflow/underflow can be detected by examining the register overflow bit, and also depends on the signed/unsigned status of the operands. But overall this can be much less work than adding doubles (although micro-code for adding doubles is still pretty fast).
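The "line up the mantissas" step can be sketched in a few lines of Python using math.frexp/ldexp; this is a toy illustration of the alignment idea only, deliberately skipping the NaN/inf/denormal and rounding handling described above (the function name is made up for this sketch):

```python
import math

def add_doubles_sketch(a, b):
    """Toy float addition: align exponents, then add mantissas.
    Real hardware also handles NaN/inf/denormals, rounding and overflow."""
    ma, ea = math.frexp(a)            # a == ma * 2**ea, with 0.5 <= |ma| < 1
    mb, eb = math.frexp(b)
    e = max(ea, eb)
    # Shift the smaller operand's mantissa so both share the exponent e:
    ma, mb = ma * 2.0**(ea - e), mb * 2.0**(eb - e)
    return math.ldexp(ma + mb, e)     # (ma + mb) * 2**e

print(add_doubles_sketch(1.5, 0.25))  # 1.75
```

Integer addition needs none of this: the bit columns of two uint64 values already correspond one to one.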



Walter Roberson
Walter Roberson on 27 Jul 2015
What you are missing is that a 64 bit double cannot represent every number in the range up to 10^308. 64 bit doubles can only precisely represent some values in that range.
The smallest positive integer that a 64 bit double in IEEE 754 format cannot represent properly is 2^53 + 1.
Numbers represented in double are restricted to about 16 digits in accuracy. Once the values get above about 10^16 then the distance between adjacent representable numbers becomes larger than 1.
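Both claims are easy to check in Python (whose float is the same IEEE 754 double):

```python
# Above 2^53, a double can no longer distinguish consecutive integers:
print(float(2**53) == float(2**53 + 1))   # True -- both round to 2^53

# Above about 10^16, the spacing between doubles exceeds 1:
print(1e16 + 1 == 1e16)                   # True -- adding 1 is lost in rounding
```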
