Thread Subject:
Reducing precision to Float16

Subject: Reducing precision to Float16

From: Jenne Stamplecoskie

Date: 21 Feb, 2009 19:08:02

Message: 1 of 7

Hi,

I would like to be able to convert a number from single or double precision to float16. I know how to create a float structure, but not how to use it to convert my vector from double to float16. I do not need to keep it in that format for further operations, I only need to be able to reduce the precision of my vector. Any help would be appreciated.

Thank you,

Jenne

Subject: Reducing precision to Float16

From: Roger Stafford

Date: 21 Feb, 2009 21:24:01

Message: 2 of 7

"Jenne Stamplecoskie" <jkmiklos@hotmail.com> wrote in message <gnpjei$j5t$1@fred.mathworks.com>...
> Hi,
>
> I would like to be able to convert a number from single or double precision to float16. I know how to create a float structure, but not how to use it to convert my vector from double to float16. I do not need to keep it in that format for further operations, I only need to be able to reduce the precision of my vector. Any help would be appreciated.
>
> Thank you,
>
> Jenne

  The following are links to a definition of the "half precision" floating point format, in case you haven't seen them already. This format is said to be included in the proposed IEEE754r standard, though it is not part of the established IEEE754 standard. I would assume that this is the same as the Float16 format.

 http://en.wikipedia.org/wiki/Half_precision
 http://www.answers.com/topic/half-precision

  I am not quite sure what you mean by "reduce the precision of my vector". Do you wish to actually convert a matlab double to some 16-bit quantity whose format is that of half precision floating, or do you merely want to do the appropriate rounding to a double which would correspond to its having been converted in this manner though still represented in double? Presumably doubles that were too large or too small would be converted to inf's and zeros, respectively. Hopefully you are aware of the limited range of half precision floating point numbers, with a maximum of 65504 and a minimum normalized of 2^(-14).
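
As a rough check on those limits, the sketch below derives them from the half-precision layout (1 sign bit, 5 exponent bits, 10 stored significand bits). The variable names are illustrative only and do not come from any toolbox.

```matlab
% Half-precision (binary16) layout: 1 sign, 5 exponent, 10 fraction bits.
p    = 10;    % stored fraction bits
emin = -14;   % smallest normalized exponent
emax = 15;    % largest exponent

maxHalf      = (2 - 2^(-p)) * 2^emax;   % largest finite value
minNormal    = 2^emin;                  % smallest normalized value
minSubnormal = 2^(emin - p);            % smallest denormalized value

fprintf('max = %g, min normal = %g, min subnormal = %g\n', ...
        maxHalf, minNormal, minSubnormal);
```

The first two values match the 65504 and 2^(-14) quoted above; the third is the spacing of the denormalized values below 2^(-14).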

  With some considerable labor, the latter operation should be quite possible to do, though I don't see any evidence of it having been done in the matlab file exchange. If you want the actual 16-bit quantities, I am not sure what you wish to do with them. They could be stored as, say, uint16 integers, but what would you do with them afterwards?
 
Roger Stafford

Subject: Reducing precision to Float16

From: Jenne Stamplecoskie

Date: 22 Feb, 2009 21:02:01

Message: 3 of 7

"Roger Stafford" <ellieandrogerxyzzy@mindspring.com.invalid> wrote in message <gnprdh$a66$1@fred.mathworks.com>...
>
> I am not quite sure what you mean by "reduce the precision of my vector". Do you wish to actually convert a matlab double to some 16-bit quantity whose format is that of half precision floating, or do you merely want to do the appropriate rounding to a double which would correspond to its having been converted in this manner though still represented in double? Presumably doubles that were too large or too small would be converted to inf's and zeros, respectively. Hopefully you are aware of the limited range of half precision floating point numbers, with a maximum of 65504 and a minimum normalized of 2^(-14).
>
> With some considerable labor, the latter operation should be quite possible to do, though I don't see any evidence of it having been done in the matlab file exchange. If you want the actual 16-bit quantities, I am not sure what you wish to do with them. They could be stored as, say, uint16 integers, but what would you do with them afterwards?
>
> Roger Stafford

Thank you for your reply,

The appropriate rounding to half-precision is sufficient. I would like to test the feasibility of using 16-bit numbers for transmission on a hardware platform. I would have the ability to perform calculations in single or double precision, but then would be transmitting across a low-BW satellite link, and would like to see if Float16 is sufficient to transmit the data and have valid results when I perform the remaining simulations at the other end. Using single or double precision data format is no problem, as I would always have that available for calculations, but at a certain stage I would like to be able to round to the appropriate level.

Jenne

Subject: Reducing precision to Float16

From: Roger Stafford

Date: 23 Feb, 2009 01:27:01

Message: 4 of 7

"Jenne Stamplecoskie" <jkmiklos@hotmail.com> wrote in message <gnseg9$lml$1@fred.mathworks.com>...
> Thank you for your reply,
>
> The appropriate rounding to half-precision is sufficient. I would like to test the feasibility of using 16-bit numbers for transmission on a hardware platform. I would have the ability to perform calculations in single or double precision, but then would be transmitting across a low-BW satellite link, and would like to see if Float16 is sufficient to transmit the data and have valid results when I perform the remaining simulations at the other end. Using single or double precision data format is no problem, as I would always have that available for calculations, but at a certain stage I would like to be able to round to the appropriate level.
>
> Jenne

  For now, I will give a somewhat imperfect method of converting from a double to the rounded form that would be appropriate for a half-precision floating point number (which I presume is the same as a Float16 format.)

  Let x be a double in matlab. Then do this:

 [f,e] = log2(abs(x));
 if x~=0, while f < 1, f = 2*f; e = e-1; end, end
 y = sign(x)*round(f*2^10)*2^(e-10);

Then y will be x but rounded as though it were in half-precision (with the exceptions noted below.)
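
As a small worked example of the three lines above, here they are applied to x = 0.1, which has no exact half-precision representation. The variable relErr is added here purely for illustration.

```matlab
x = 0.1;
[f, e] = log2(abs(x));            % x = f*2^e, with 0.5 <= f < 1 in the usual case
if x ~= 0, while f < 1, f = 2*f; e = e - 1; end, end   % now 1 <= f < 2
y = sign(x) * round(f*2^10) * 2^(e-10);                % keep 10 fraction bits

relErr = abs(y - x) / abs(x);     % relative rounding error
fprintf('y = %.13f, relative error = %.3g\n', y, relErr);
% y equals 1638/16384, the nearest half-precision value to 0.1
```

For values in the normalized range the relative error stays at or below 2^(-11), since half of one unit in the last of 10 fraction bits is the worst case.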

  Note: The while-loop action here is necessitated by the fact that in the documentation for 'log2' it says "Argument F is an array of real values, usually in the range 0.5 <= abs(F) < 1." The word 'usually' is the fly in the ointment here and hence the while-loop. Actually f would rarely be outside this range but apparently Mathworks cannot guarantee that it wouldn't be in some unusual situation.

  There are three shortcomings with this code which is why I said "imperfect". The first is that it doesn't pay any attention to cases where x would be out of bounds, that is, past 65504.5 in magnitude. In these cases it should give an inf or -inf. That would be easy to fix.

  The second lies in the use of matlab's 'round' function above. The IEEE standard 754 (and presumably 754r) has rules for unbiased rounding as applied to cases where the number to be rounded lies exactly halfway between the two values it could be rounded to. As applied to matlab's 'round' that would require it to round something like 7.5 to 8 and 8.5 to 8 also, namely always choosing the integer that is even in such a halfway case, but matlab's 'round' doesn't do that; it rounds halfway cases away from zero. This could also be fixed up in the above code with a little effort.
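
The difference between the two tie rules can be sketched as follows. This is a minimal illustration, not a library function; the names t, r and tie are made up for the example.

```matlab
t = [7.5 8.5 2.5 -2.5 3.2];

rAway = round(t);                 % matlab's rule: halfway cases go away from zero

r   = round(t);                   % start from the ordinary result
tie = abs(t - fix(t)) == 0.5;     % exactly halfway between two integers
r(tie) = 2*round(t(tie)/2);       % replace ties by the nearest even integer

disp([t; rAway; r]);
```

Only the halfway entries differ between the two results; 3.2 rounds to 3 under either rule.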

  The third problem is the case when x is so small in magnitude that in half-precision it would be represented in the denormalized format. I see no way you could accomplish this with a number which actually remains in double format. That would have to be a part of whatever method is used to convert these properly rounded double numbers over to actual half-precision 16-bit numbers. The only thing the above code could do if improved would be to give zeros in all cases that should round to zero in half-precision, and that I haven't done here, though it is not hard to do.

  I am not sure what standards you wish to adhere to in the conversion you are seeking, so I haven't bothered with these three difficulties. And as I say, the denormalization problem cannot be solved entirely simply by rounding properly.

Roger Stafford

Subject: Reducing precision to Float16

From: Jenne Stamplecoskie

Date: 23 Feb, 2009 02:16:01

Message: 5 of 7

Roger,

Thank you. Your solution worked very well. I will try implementing the three suggestions you made to improve it.

Jenne Stamplecoskie

Subject: Reducing precision to Float16

From: Roger Stafford

Date: 24 Feb, 2009 06:21:01

Message: 6 of 7

"Jenne Stamplecoskie" <jkmiklos@hotmail.com> wrote in message <gnt0t1$pjq$1@fred.mathworks.com>...
> Roger,
> Thank you. Your solution worked very well. I will try implementing the three suggestions you made to improve it.
> Jenne Stamplecoskie

  Jenne, that code I sent you had too many deficiencies in it for me to be comfortable with it as a final product, so I decided to implement all the improvements I mentioned (and a few more.)

  Assume x is a matlab scalar double of any kind: zero, denormalized, normalized, infinity, or NaN. The following will round it, producing the scalar y, in a way that is appropriate for its use as a half-precision value even though it remains in 64-bit double format. The resulting exponent and significand (mantissa) are both in the proper ranges of values. The rounding of "mid-point" values has been made unbiased in conformity with IEEE 754 specs. Values that are too large for the half-precision format overflow to infinities and those that are too small become zeros. Values that will become denormalized in the half-precision format are now correctly rounded, which wasn't true of the previous code.

 % Enter with scalar x
 y = x;
 if isfinite(y) && y~=0
  s = sign(y); y = abs(y);
  if y >= 65520
   y = inf;   % at or past the midpoint between 65504 and 65536
  else
   [t,e] = log2(y); y = y/2^e;
   while y < 1, y = 2*y; e = e-1; end   % normalize so that 1 <= y < 2
   e2 = max(e,-14);   % clamp the exponent for denormalized results
   t = y*2^(10-e2+e);
   y = round(t);
   if y~=t && round(2*t)==2*t, y = 2*round(t/2); end   % ties go to even
   y = y*2^(e2-10);
  end
  y = s*y;
 end
 % Depart with properly rounded value in y

  Comments:
1. Infinities, NaNs and zeros are passed through unchanged. In particular, signed zeros are retained.
2. Any absolute value at or beyond 65520 is converted to an infinity. I was mistaken earlier about 65504.5 as a limit. It's too small.
3. As mentioned earlier, the 'while' part compensates for matlab's "usually" documentation uncertainty for 'log2'.
4. e2 provides for quantities to become half-precision denormalized and correctly rounded.
5. The line starting "if y~=t..." adjusts the 'round' function to produce unbiased rounding.
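
Since the original question concerned a vector, here is one way the scalar routine above might be vectorized. It is a sketch under the assumption that 'log2' returns 0.5 <= f < 1 for every finite nonzero input (the "usually" case in the documentation), so the while-loop is replaced by a single adjustment of e. The test vector and the names k, a, q, t, r and tie are illustrative.

```matlab
x = [0.1, 65519, 65520, 1e-6, 2^(-25), -3.14159, 0, Inf];

y = x;                              % zeros, infinities and NaNs pass through
k = isfinite(x) & x ~= 0;           % entries that need rounding
a = abs(x(k));

[f, e] = log2(a);                   % a = f.*2.^e, assumed 0.5 <= f < 1
e  = e - 1;                         % so a = (2*f).*2.^e with 1 <= 2*f < 2
e2 = max(e, -14);                   % clamp exponent for denormalized results
q  = 2.^(e2 - 10);                  % spacing of half-precision values near a

t   = a ./ q;                       % a in units of that spacing (exact)
r   = round(t);
tie = abs(t - floor(t)) == 0.5;     % exact halfway cases
r(tie) = 2*round(t(tie)/2);         % unbiased: ties go to the even neighbor

a = r .* q;
a(a > 65504) = Inf;                 % anything rounded past the largest finite value
y(k) = sign(x(k)) .* a;
```

With this test vector, 65519 rounds down to 65504 while 65520 overflows to Inf, 1e-6 lands on the denormalized value 17*2^(-24), and 2^(-25), being exactly halfway between 0 and 2^(-24), rounds to zero.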

  It should be emphasized that in order to convert to actual 16-bit half-precision numbers, there is much more processing to be done. The above only furnishes the appropriate sign, exponent, and significand for such a conversion. Putting them in place within 16 bits is quite another matter. In particular it requires a detailed understanding of the nature of the denormalized format, the normalized number hidden bit, and a few other oddities of the IEEE 754 (or 754r) format.
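
To give an idea of what that extra processing involves, the sketch below packs one already-rounded scalar into a 16-bit pattern. It is only an outline under simplifying assumptions: the input must already be exactly representable in half precision, the sign of a negative zero is lost, and every NaN maps to a single quiet-NaN pattern. The names signBit, expField and frac are made up for the example.

```matlab
y = -3.140625;                      % already rounded to half precision

signBit = double(y < 0);            % note: -0 is treated as +0 here
a = abs(y);

if isnan(a)
    expField = 31; frac = 512;      % all-ones exponent, nonzero fraction
elseif isinf(a)
    expField = 31; frac = 0;        % all-ones exponent, zero fraction
elseif a < 2^(-14)
    expField = 0;  frac = a*2^24;   % zero or denormalized: no hidden bit
else
    [f, e] = log2(a);               % a = f*2^e, assumed 0.5 <= f < 1
    expField = (e - 1) + 15;        % biased exponent of the 1.xxx form
    frac = (2*f - 1) * 2^10;        % drop the hidden leading 1
end

bits = uint16(signBit*32768 + expField*1024 + frac);
fprintf('%s\n', dec2hex(bits, 4));
```

The result could then be stored or transmitted as a uint16, as suggested earlier in the thread.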

Roger Stafford

Subject: Reducing precision to Float16

From: James Tursa

Date: 5 Mar, 2009 04:35:02

Message: 7 of 7

"Jenne Stamplecoskie" <jkmiklos@hotmail.com> wrote in message <gnpjei$j5t$1@fred.mathworks.com>...
> Hi,
>
> I would like to be able to convert a number from single or double precision to float16. I know how to create a float structure, but not how to use it to convert my vector from double to float16. I do not need to keep it in that format for further operations, I only need to be able to reduce the precision of my vector. Any help would be appreciated.
>
> Thank you,
>
> Jenne

FYI, I was inspired by this thread, so I wrote a function to do the IEEE 754r Half Precision conversion. You can find it here:

http://www.mathworks.com/matlabcentral/fileexchange/23173

This only converts bit patterns, it doesn't allow any arithmetic on the half precision bit patterns, but it sounds like that is all you need. For your specific problem, of course, it sounds like you might be able to invent your own format and not restrict yourself to IEEE 754r Half Precision. For example, if you know your data is restricted to a certain range, you could maybe use 4 bits for the exponent instead of 5 and use 11 bits for the mantissa instead of 10. Or if you don't care about very small numbers then you could use an exponent bias of 14 or 13 instead of 15. Or you could just scale your data and round it and send an int across. etc. etc. In any event, HTH.
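
The last of those alternatives, scaling the data and sending a 16-bit integer, might look like the sketch below. It assumes the data is known in advance to lie within plus or minus lim; the value of lim, the sample data and the variable names are all hypothetical.

```matlab
x   = [-1.5 0.25 0.999 2.0];        % sample data
lim = 2;                            % assumed known bound on abs(x)

scale = 32767 / lim;                % map [-lim, lim] onto the int16 range
q  = int16(round(x * scale));       % 16-bit integers to transmit
xr = double(q) / scale;             % values reconstructed at the receiver

maxErr = max(abs(xr - x));
fprintf('max abs error = %.3g (bound %.3g)\n', maxErr, 0.5/scale);
```

Unlike half precision, this gives a uniform absolute error of at most 0.5/scale across the whole range, which may or may not suit the data better than a uniform relative error.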

James Tursa
