As well as BigDecimal, decimals can have type Float or Double. Unlike BigDecimal, which has no size limit, Float and Double are fixed-size and thus more efficient in calculations. BigDecimal stores its value as base-10 digits, while Float and Double store theirs as binary digits. So although they are more efficient, the results of calculations will not always be as exact as in base 10, eg, 3.1f + 0.4f computes to 3.499999910593033 instead of 3.5.
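
The difference can be sketched in plain Java, whose Float, Double and BigDecimal classes are the ones Groovy uses. The class and method names are illustrative, and the widening to double is written out explicitly because Java would otherwise add two floats as floats.

```java
import java.math.BigDecimal;

class BinaryVersusDecimal {
    // Widen both floats to double before adding (assumed to mirror the arithmetic in the text).
    static double widenedSum(float a, float b) {
        return (double) a + (double) b;
    }

    // The same sum carried out on base-10 digits.
    static BigDecimal decimalSum(String a, String b) {
        return new BigDecimal(a).add(new BigDecimal(b));
    }

    public static void main(String[] args) {
        System.out.println(widenedSum(3.1f, 0.4f));   // slightly below 3.5
        System.out.println(decimalSum("3.1", "0.4")); // 3.5
    }
}
```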

We can force a decimal to have a specific type other than BigDecimal by giving a suffix (F for Float, D for Double):


We can enquire the minimum and maximum values for Floats and Doubles:
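
A minimal sketch in plain Java of the range constants (the class name is illustrative). Note that MIN_VALUE is the smallest positive value, not the most negative one.

```java
class FloatingRanges {
    // The most negative finite double is the negated maximum.
    static double mostNegativeDouble() {
        return -Double.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(Float.MAX_VALUE);      // largest finite float
        System.out.println(Double.MAX_VALUE);     // largest finite double
        System.out.println(Float.MIN_VALUE);      // smallest positive nonzero float
        System.out.println(Double.MIN_VALUE);     // smallest positive nonzero double
        System.out.println(mostNegativeDouble());
    }
}
```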


We can represent infinities by using some predefined constants (prefixed by either Float or Double):


If a nonzero Double literal is too large or too small, it's represented by Double.POSITIVE_INFINITY or Double.NEGATIVE_INFINITY or 0.0:
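
The same overflow and underflow behaviour can be sketched in plain Java. Java rejects out-of-range literals at compile time, so this sketch parses strings instead; the class and method names are illustrative.

```java
class OutOfRange {
    static double parse(String s) {
        return Double.parseDouble(s);
    }

    public static void main(String[] args) {
        System.out.println(parse("1e400"));   // Infinity
        System.out.println(parse("-1e400"));  // -Infinity
        System.out.println(parse("1e-400"));  // 0.0
        System.out.println(Double.MAX_VALUE * 2 == Double.POSITIVE_INFINITY); // true
    }
}
```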


Classes Float and Double can both be written uncapitalized, ie, float and double.


There's a special constant called Double.NaN (and Float.NaN), meaning "Not a Number", which is sometimes returned from math calculations. Once a NaN is introduced into a math calculation, the result will (usually) be NaN.
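
A small Java sketch of how NaN arises and propagates, with one of the exceptions behind the "(usually)" above. The class and method names are illustrative.

```java
class NanPropagation {
    // Any ordinary arithmetic involving NaN yields NaN.
    static double addOne(double x) {
        return x + 1.0;
    }

    public static void main(String[] args) {
        double nan = 0.0 / 0.0;                             // one way NaN arises
        System.out.println(Double.isNaN(nan));              // true
        System.out.println(Double.isNaN(addOne(nan)));      // true
        System.out.println(Double.isNaN(Math.sqrt(-1.0)));  // true
        System.out.println(Math.pow(nan, 0.0));             // 1.0, an exception to propagation
    }
}
```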

Conversions

The Float and Double classes, along with BigDecimal, BigInteger, Integer, Long, Short, and Byte, can all be converted to one another.

Converting numbers to integers may involve rounding or truncation:
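
A sketch in plain Java of truncation versus rounding (illustrative names; Groovy's own conversion syntax may differ). Casts truncate toward zero and saturate at the integer range, while Math.round goes to the nearest value.

```java
class ToInteger {
    static int truncate(double d) {
        return (int) d;        // drops the fractional part
    }

    static long nearest(double d) {
        return Math.round(d);  // rounds to the nearest long
    }

    public static void main(String[] args) {
        System.out.println(truncate(3.9));         // 3
        System.out.println(truncate(-3.9));        // -3
        System.out.println(nearest(3.5));          // 4
        System.out.println(truncate(1e20));        // 2147483647
        System.out.println(truncate(Double.NaN));  // 0
    }
}
```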


Converting from integers to float or double (may involve rounding):
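
A Java sketch of where the rounding shows up (illustrative names): a float has a 24-bit significand and a double a 53-bit one, so integers beyond those widths may not survive the conversion.

```java
class FromInteger {
    static float toFloat(int i) {
        return (float) i;
    }

    static double toDouble(long l) {
        return (double) l;
    }

    public static void main(String[] args) {
        // 2^24 + 1 is not representable as a float and rounds to 2^24.
        System.out.println(toFloat(16777217) == 16777216f);                    // true
        // 2^53 + 1 is not representable as a double and rounds to 2^53.
        System.out.println(toDouble(9007199254740993L) == 9007199254740992.0); // true
    }
}
```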


Converting from BigDecimal to float or double (may involve rounding):


We can convert a double to a float, but there's no Double() constructor accepting a float as an argument.


We can create a Float or Double from a string representation of the number, either base-10 or hex:
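
A minimal Java sketch of parsing both notations (illustrative names). In the hex form, the digits after "0x" are a hexadecimal significand and "p" introduces a power-of-two exponent.

```java
class ParseFloating {
    static double parse(String s) {
        return Double.parseDouble(s);
    }

    public static void main(String[] args) {
        System.out.println(parse("1.5"));                   // 1.5
        System.out.println(parse("1e3"));                   // 1000.0
        System.out.println(parse("0x1.8p1"));               // 3.0 (1.5 * 2^1)
        System.out.println(Float.parseFloat("0x1.0p-2"));   // 0.25
    }
}
```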


The string is first converted to a double, then if need be converted to a float.

Converting from double to BigDecimal is only exact when the double has an exact binary representation, eg, 0.5, 0.25. If a float is supplied, it's converted to a double first, then given to the BigDecimal constructor. The scale of the returned BigDecimal is the smallest value such that (10**scale * val) is an integer.


A more exact way to convert a double to a BigDecimal:
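
The original snippet isn't preserved here. One common approach, assumed for this sketch, is to go through the double's shortest decimal string instead of its binary expansion. The names are illustrative.

```java
import java.math.BigDecimal;

class DoubleToBigDecimal {
    // Captures every binary digit of the double.
    static BigDecimal binaryExact(double d) {
        return new BigDecimal(d);
    }

    // Goes through the shortest decimal string that identifies the double.
    static BigDecimal viaString(double d) {
        return new BigDecimal(Double.toString(d));
    }

    public static void main(String[] args) {
        System.out.println(binaryExact(0.1)); // a long expansion slightly above 0.1
        System.out.println(viaString(0.1));   // 0.1
        System.out.println(binaryExact(0.5)); // 0.5
    }
}
```

BigDecimal.valueOf(double) takes the same string-based route.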


We can convert a float or double to a unique string representation in base-10. There must be at least one digit to represent the fractional part, and beyond that as many, but only as many, more digits as are needed to uniquely distinguish the argument value from adjacent values of type float. (The returned string must be for the float value nearest to the exact mathematical value supplied; if two float representations are equally close to that value, then the string must be for one of them and the least significant bit of the mantissa must be 0.)


We can also convert a float or double to a hexadecimal string representation:
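
A short Java sketch of the hexadecimal form (illustrative names). Because the hex string spells out the binary significand, it converts back to the identical value.

```java
class HexStrings {
    // Converts to hex text and back again.
    static double roundTrip(double d) {
        return Double.parseDouble(Double.toHexString(d));
    }

    public static void main(String[] args) {
        System.out.println(Double.toHexString(1.0));  // 0x1.0p0
        System.out.println(Double.toHexString(3.0));  // 0x1.8p1
        System.out.println(Float.toHexString(0.5f));  // 0x1.0p-1
        System.out.println(roundTrip(0.1) == 0.1);    // true
    }
}
```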


Floating-Point Arithmetic

We can perform the same basic operations that integers and BigDecimal can:


For + - and *, anything with a Double or Float converts both arguments to a Double:


We can divide using floats and doubles:


We can perform mod on floats and doubles:


IEEEremainder resembles mod in some ways:
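
The two differ in how the quotient is rounded, as this Java sketch shows (illustrative names): % truncates the quotient toward zero, while IEEEremainder rounds it to the nearest integer, so the results can differ in sign and size.

```java
class Remainders {
    static double mod(double a, double b) {
        return a % b;
    }

    static double ieee(double a, double b) {
        return Math.IEEEremainder(a, b);
    }

    public static void main(String[] args) {
        System.out.println(mod(5.5, 2.0));   // 1.5  (5.5 - 2 * 2.0)
        System.out.println(ieee(5.5, 2.0));  // -0.5 (5.5 - 3 * 2.0)
        System.out.println(mod(-5.5, 2.0));  // -1.5 (takes the sign of the dividend)
    }
}
```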


We can perform other methods:


We can raise a float or double to a power:


We can test whether a float or double is a number and whether it's an infinite number:


We can test whether two floats or doubles have equal values using operators or methods:
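
A sketch of the two notions of equality as they behave in plain Java (illustrative names). This shows Java semantics only; Groovy's == operator is dispatched differently and may not match the primitive comparison here.

```java
class FloatingEquality {
    // Numeric comparison: zeros of either sign are equal, NaN equals nothing.
    static boolean primitiveEquals(double a, double b) {
        return a == b;
    }

    // Bit-pattern based comparison used by Double.equals().
    static boolean boxedEquals(double a, double b) {
        return Double.valueOf(a).equals(Double.valueOf(b));
    }

    public static void main(String[] args) {
        System.out.println(primitiveEquals(0.0, -0.0));              // true
        System.out.println(boxedEquals(0.0, -0.0));                  // false
        System.out.println(primitiveEquals(Double.NaN, Double.NaN)); // false
        System.out.println(boxedEquals(Double.NaN, Double.NaN));     // true
    }
}
```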


We can compare floats and doubles using the <=> operator, the compareTo() method, and the compare() static method:


Auto-incrementing and -decrementing work on floats and doubles:


Non-zero floats and doubles evaluate as true in boolean contexts:


Bitwise Operations

We can convert a float to the equivalent int bits, or a double to the equivalent long bits. For a float, bit 31 (mask 0x80000000) is the sign, bits 30-23 (mask 0x7f800000) are the exponent, and bits 22-0 (mask 0x007fffff) are the mantissa. For a double, bit 63 is the sign, bits 62-52 are the exponent, and bits 51-0 are the mantissa.
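
The float layout can be sketched in plain Java using those masks (the class and method names are illustrative). The stored exponent is biased by 127.

```java
class FloatFields {
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    static int biasedExponent(float f) {
        return (Float.floatToIntBits(f) & 0x7f800000) >>> 23;
    }

    static int mantissa(float f) {
        return Float.floatToIntBits(f) & 0x007fffff;
    }

    public static void main(String[] args) {
        float f = -2.5f; // -1.25 * 2^1
        System.out.println(sign(f));                          // 1
        System.out.println(biasedExponent(f));                // 128 (127 + 1)
        System.out.println(Integer.toHexString(mantissa(f))); // 200000 (the .25 fraction)
    }
}
```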


The methods floatToRawIntBits() and doubleToRawLongBits() act similarly, except that they preserve Not-a-Number (NaN) values. So if the argument is NaN, the result is the integer or long representing the actual NaN value produced from the last calculation, not the canonical Float.NaN value to which all the bit patterns encoding a NaN can be collapsed (ie, 0x7f800001 through 0x7fffffff and 0xff800001 through 0xffffffff).

The intBitsToFloat() and longBitsToDouble() methods act oppositely. In all cases, giving the integer resulting from calling Float.floatToIntBits() or Float.floatToRawIntBits() to the intBitsToFloat(int) method will produce the original floating-point value, except for a few NaN values. Similarly with doubles. These methods are the only operations that can distinguish between two NaN values of the same type with different bit patterns.
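
A Java sketch of the round trip and of NaN collapsing (illustrative names). It deliberately prints nothing for floatToRawIntBits(), since the raw pattern that survives can depend on the hardware.

```java
class BitRoundTrip {
    static float roundTrip(float f) {
        return Float.intBitsToFloat(Float.floatToIntBits(f));
    }

    // Turns a NaN bit pattern into a float and reads back its canonical bits.
    static int canonicalBits(int nanPattern) {
        return Float.floatToIntBits(Float.intBitsToFloat(nanPattern));
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(1.5f) == 1.5f);                       // true
        System.out.println(Integer.toHexString(canonicalBits(0x7f800001))); // 7fc00000
        System.out.println(Integer.toHexString(canonicalBits(0xffffffff))); // 7fc00000
    }
}
```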


As well as infinities and NaN, both Float and Double have other constants:


Floating-Point Calculations

There are two constants of type Double, Math.PI and Math.E, that can't be represented exactly, not even as a recurring decimal.

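The trigonometric functions behave as expected with the argument in radians, but a result that should mathematically be 0.0 often isn't exactly 0.0, because an argument such as Math.PI is itself only an approximation. For example, sine: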
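
A minimal Java sketch of the effect (illustrative names): the sine of the double nearest to pi is a tiny nonzero number, not zero.

```java
class SineNearZero {
    static double sineOfPi() {
        return Math.sin(Math.PI);
    }

    public static void main(String[] args) {
        System.out.println(Math.sin(0.0));    // 0.0
        System.out.println(sineOfPi());       // tiny but nonzero, on the order of 1e-16
        System.out.println(sineOfPi() == 0.0); // false
    }
}
```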


Other trig functions are:


Some logarithmic functions:


Math.ulp(d) returns the size of the units of the last place for doubles (the difference between the value and the next larger in magnitude).
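
A short Java sketch (illustrative names) showing that the ulp is the gap to the next larger value and that it grows with magnitude.

```java
class UlpDemo {
    // Gap between d and the next representable double above it.
    static double gapAbove(double d) {
        return Math.nextUp(d) - d;
    }

    public static void main(String[] args) {
        System.out.println(Math.ulp(1.0) == gapAbove(1.0)); // true
        System.out.println(Math.ulp(1e16));                 // 2.0
        System.out.println(Math.ulp(1.0f));                 // float ulp at 1, ie 2^-23
    }
}
```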


Accuracy of the Math methods is measured in terms of such ulps for the worst-case scenario. If a method always has an error less than 0.5 ulps, the method always returns the floating-point number nearest the exact result, and so is always correctly rounded. However, doing this and maintaining floating-point calculation speed together is impractical. Instead, for the Math class, a larger error bound of 1 or 2 ulps is allowed for certain methods. But most methods with more than 0.5 ulp errors are still required to be semi-monotonic, ie, whenever the mathematical function is non-decreasing, so is the floating-point approximation, and vice versa. Not all approximations that have 1 ulp accuracy meet the monotonicity requirements. sin, cos, tan, asin, acos, atan, exp, log, and log10 give results within 1 ulp of the exact result that are semi-monotonic.

Further Calculations

We can find the angle of the polar coordinates that correspond to rectangular (x,y) coordinates, using atan2. The result is within 2 ulps of the exact result, and is semi-monotonic.


We can perform the hyperbolic trigonometric functions:


We can convert between degrees and radians. The conversion is generally inexact.


We can calculate e**x - 1 in one call. For values of x near 0, the exact sum Math.expm1( x ) + 1d is much closer than Math.exp( x ) to the true result of e**x. The result will be semi-monotonic, and within 1 ulp of the exact result. Once the exact result of e**x - 1 is within 1/2 ulp of the limit value -1, -1d will be returned.

We can also calculate ln(1 + x) in one call. For small values of x, Math.log1p( x ) is much closer than Math.log(1d + x) to the true result of ln(1 + x). The result will be semi-monotonic, and within 1 ulp of the exact result.
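
The accuracy gain near zero can be sketched in Java (illustrative names). For tiny x both e**x - 1 and ln(1 + x) differ from x by only about x*x/2, so the distance from x is used here as a rough error measure; that shortcut is an assumption of this sketch, not part of the original page.

```java
class NearZeroAccuracy {
    static double errExpm1(double x) {
        return Math.abs(Math.expm1(x) - x);
    }

    static double errExpMinusOne(double x) {
        return Math.abs((Math.exp(x) - 1.0) - x);
    }

    static double errLog1p(double x) {
        return Math.abs(Math.log1p(x) - x);
    }

    static double errLogOnePlus(double x) {
        return Math.abs(Math.log(1.0 + x) - x);
    }

    public static void main(String[] args) {
        double x = 1e-10;
        // The one-call versions stay within roughly x*x/2 of x ...
        System.out.println(errExpm1(x));
        System.out.println(errLog1p(x));
        // ... while the two-step versions lose digits when 1 + x is rounded.
        System.out.println(errExpMinusOne(x));
        System.out.println(errLogOnePlus(x));
    }
}
```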


Scale binary scalb(x,y) calculates (x * 2**y) as if by a single correctly rounded multiplication, giving a more accurate result. If the exponent of the result would be larger than Float/Double.MAX_EXPONENT, an infinity is returned. If the result is subnormal, precision may be lost. When the result is non-NaN, the result has the same sign as x.
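
A minimal Java sketch of scalb, which multiplies its first argument by two raised to the second (the class and method names are illustrative).

```java
class ScaleBinary {
    static double scale(double x, int power) {
        return Math.scalb(x, power);
    }

    public static void main(String[] args) {
        System.out.println(scale(3.0, 4));           // 48.0 (3 * 2^4)
        System.out.println(scale(1.0, 2000));        // Infinity (exponent above Double.MAX_EXPONENT)
        System.out.println(Math.scalb(-1.0f, 200));  // -Infinity (exponent above Float.MAX_EXPONENT)
    }
}
```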


We have square root and cube root methods. For cbrt, the computed result must be within 1 ulp of the exact result.


We can find the ceiling and floor of doubles:


We can round doubles to the nearest long (or floats to the nearest integer). The calculation is Math.floor(a + 0.5d) as Long, or Math.floor(a + 0.5f) as Integer.
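
Ceiling, floor and rounding can be sketched together in plain Java (illustrative names). Note that ties round upward, toward positive infinity, for negative values too.

```java
class CeilFloorRound {
    static long roundDouble(double d) {
        return Math.round(d);
    }

    static int roundFloat(float f) {
        return Math.round(f);
    }

    public static void main(String[] args) {
        System.out.println(Math.ceil(1.2));     // 2.0
        System.out.println(Math.floor(-1.5));   // -2.0
        System.out.println(Math.ceil(-0.5));    // -0.0
        System.out.println(roundDouble(2.5));   // 3
        System.out.println(roundDouble(-2.5));  // -2
        System.out.println(roundFloat(2.5f));   // 3
    }
}
```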


Unlike the numerical comparison operators, max() and min() consider negative zero to be strictly smaller than positive zero. If one argument is positive zero and the other negative zero, max() returns positive zero and min() returns negative zero.
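
A Java sketch of the signed-zero handling (illustrative names). The sign of a zero is checked through its bit pattern, since == treats both zeros as equal.

```java
class SignedZeroMinMax {
    static boolean isNegativeZero(double d) {
        return Double.doubleToLongBits(d) == Double.doubleToLongBits(-0.0);
    }

    public static void main(String[] args) {
        System.out.println(0.0 == -0.0);           // true
        System.out.println(Math.max(0.0, -0.0));   // 0.0
        System.out.println(Math.min(0.0, -0.0));   // -0.0
    }
}
```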


Some other methods:


The pow() method returns the value of the first argument raised to the power of the second argument. If both arguments are integers, then the result is exactly equal to the mathematical result of raising the first argument to the power of the second argument if that result can in fact be represented exactly as a double value. Otherwise, special rules exist for processing zeros and infinities:
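
A few of these cases, sketched in plain Java (illustrative names; only a sample of the special rules is shown).

```java
class PowCases {
    static double pow(double base, double exponent) {
        return Math.pow(base, exponent);
    }

    public static void main(String[] args) {
        System.out.println(pow(2.0, 10.0));        // 1024.0 (exact for integer arguments)
        System.out.println(pow(0.0, -1.0));        // Infinity
        System.out.println(pow(-8.0, 1.0 / 3.0));  // NaN (negative base, non-integer exponent)
        System.out.println(pow(Double.NaN, 0.0));  // 1.0
    }
}
```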


More methods:


We can use copySign() to return a first argument with the sign of the second argument.


We can compute the hypotenuse without risk of intermediate overflow (or underflow). The computed result is within 1 ulp of the exact result. If one parameter is held constant, the results will be semi-monotonic in the other parameter.
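
A Java sketch comparing Math.hypot with the naive formula (illustrative names). Squaring large sides overflows to infinity in the naive version, while hypot avoids that intermediate overflow.

```java
class Hypotenuse {
    static double naive(double x, double y) {
        return Math.sqrt(x * x + y * y); // x * x may overflow
    }

    static double safe(double x, double y) {
        return Math.hypot(x, y);
    }

    public static void main(String[] args) {
        double x = 3e200, y = 4e200;
        System.out.println(naive(x, y)); // Infinity
        System.out.println(safe(x, y));  // approximately 5e200
    }
}
```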

We can get the exponent from the binary representation of a double or float:
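
A minimal Java sketch (illustrative names). The value returned is the unbiased power of two, and NaN or infinite arguments give MAX_EXPONENT + 1.

```java
class ExponentOf {
    static int exponent(double d) {
        return Math.getExponent(d);
    }

    public static void main(String[] args) {
        System.out.println(exponent(8.0));            // 3  (8 = 1.0 * 2^3)
        System.out.println(exponent(0.75));           // -1 (0.75 = 1.5 * 2^-1)
        System.out.println(Math.getExponent(1.0f));   // 0
        System.out.println(exponent(Double.NaN));     // 1024
    }
}
```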

We can return the floating point number adjacent to the first arg in the direction of the second arg:
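
A short Java sketch of nextAfter (illustrative names): the second argument only gives the direction, and the step taken is one ulp.

```java
class Neighbours {
    static double toward(double start, double direction) {
        return Math.nextAfter(start, direction);
    }

    public static void main(String[] args) {
        System.out.println(toward(1.0, 2.0) == 1.0 + Math.ulp(1.0)); // true
        System.out.println(toward(1.0, 0.0) < 1.0);                  // true
        System.out.println(toward(1.0, 1.0) == 1.0);                 // true (equal arguments)
    }
}
```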

The result is NaN if the argument is NaN for ulp, sin, cos, tan, asin, acos, atan, exp, log, log10, sqrt, cbrt, IEEEremainder, ceil, floor, rint, atan2, abs, max, min, signum, sinh, cosh, tanh, expm1, log1p, nextAfter, and nextUp.
But not so with pow, round, hypot, copySign, getExponent, and scalb.

There's another math library called StrictMath that's a mirror of Math, with exactly the same methods. However, some methods (eg, sin, cos, tan, asin, acos, atan, exp, log, log10, cbrt, atan2, pow, sinh, cosh, tanh, hypot, expm1, and log1p) follow stricter rules about what values must be returned: they must give the same results as the fdlibm reference algorithms, so results are identical on every platform. Other methods are stricter too. For example, whereas the Math.copySign method may treat some NaN sign arguments as positive and others as negative to allow greater performance, the StrictMath.copySign method requires all NaN sign arguments to be treated as positive values.
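
A Java sketch comparing the two libraries (illustrative names). Whether the two results differ at all depends on the platform, so the sketch only checks that they stay within a couple of ulps of each other.

```java
class StrictVersusFast {
    // Distance between Math.sin and StrictMath.sin, measured in ulps of the strict result.
    static double ulpsApart(double x) {
        double strict = StrictMath.sin(x);
        return Math.abs(Math.sin(x) - strict) / Math.ulp(strict);
    }

    public static void main(String[] args) {
        System.out.println(StrictMath.sin(1.0));
        System.out.println(Math.sin(1.0));
        System.out.println(ulpsApart(1.0));                // 0.0 on many platforms
        System.out.println(StrictMath.copySign(3.0, -0.0)); // -3.0
    }
}
```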
