Update README.md

Signed-off-by: David Rotermund <54365609+davrot@users.noreply.github.com>
David Rotermund 2023-12-01 15:04:45 +01:00 committed by GitHub
parent 3cd45a767d
commit 05373d2adb

@ -45,3 +45,74 @@ Two further examples for 8-bit numbers:
10101010 $=0 \star 2^0+1 \star 2^1+0 \star 2^2+1 \star 2^3 +0 \star 2^4+1 \star 2^5+0 \star 2^6+1 \star 2^7 = 170$
To represent negative numbers, a small trick is necessary: one specific bit codes for the sign of the number. In an $n$-bit number, one would for example reserve the $n$-th bit for the sign, while the remaining $n-1$ bits code, as usual, a binary number $z$. If the $n$-th bit is not set, the result is $+z$. If the $n$-th bit is set, the result is $-(2^{n-1})+z$ (this is the two's complement representation). For illustration, again a small table:
|dual system| decimal system|
| ------------- |:-------------:|
|01111111| +127|
|01111110| +126|
|$\vdots$| $\vdots$|
|00000010| +2|
|00000001| +1|
|00000000| +0|
|11111111| -1|
|11111110| -2|
|$\vdots$| $\vdots$|
|10000010| -126|
|10000001| -127|
|10000000| -128|
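
The rule behind this table can be checked with a few lines of Python. This is only a minimal sketch: the function name `from_twos_complement` is made up for this illustration, and the bit pattern is assumed to be given as a string of `0` and `1` characters with the sign bit first.

```python
def from_twos_complement(bits: str) -> int:
    """Interpret a string of '0'/'1' characters as a signed n-bit number."""
    n = len(bits)
    # The lower n-1 bits code the non-negative number z as usual.
    z = int(bits[1:], 2) if n > 1 else 0
    # If the leading (n-th) bit is set, subtract 2^(n-1).
    return z - 2 ** (n - 1) if bits[0] == "1" else z


print(from_twos_complement("01111111"))  # 127
print(from_twos_complement("11111111"))  # -1
print(from_twos_complement("10000000"))  # -128
```
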
Certain bit lengths have names:
|Name|Size|
| ------------- |:-------------:|
| 1 Byte| 8 Bit|
|1 Word| 16 Bit|
|1 Kilobyte| 1024 Byte|
|1 Megabyte| 1024 Kilobyte|
|1 Gigabyte| 1024 Megabyte|
|1 Terabyte| 1024 Gigabyte|
## Representation of Real Numbers and Numerical Errors
Real numbers are, by their nature, analogue quantities. Hence we would expect the handling of these numbers on digital computers not to be completely problem-free. Present digital computers usually represent real numbers as floating-point numbers.
$\mbox{floating-point number} = \mbox{mantissa} \cdot \mbox{base}^{\mbox{exponent}} $
The precision with which a real number can be represented is determined by the number of available bits. "Single precision" requires 4 bytes, "double precision" 8 bytes. The latter is the default in Matlab. The IEEE 754 double-precision format uses 1 bit for the sign of the number, 11 bits for the exponent, and 52 bits for the mantissa; together with an implicit leading 1, this gives 53 bits of precision. The base is fixed at 2 and is not stored. The exponent is stored with an offset (bias) of 1023 rather than with its own sign bit, so that for normalized numbers it can vary between $-1022$ and $+1023$. In the IEEE notation, the mantissa always represents a value in the interval $[1, 2[$. The $52$ stored bits are used to add up negative powers of 2, so that the mantissa is $1+\sum_{i=1}^{52} b_i 2^{-i}$, with $b_i=1$ if the $i$-th bit of the mantissa is set and $b_i=0$ otherwise.
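
To make the bit layout concrete, the following Python sketch takes a double apart into sign, exponent and mantissa. It assumes the IEEE 754 layout of 1 sign bit, 11 exponent bits (bias 1023) and 52 stored mantissa bits; the helper name `decode_double` is made up for this illustration, and special cases (zero, subnormal numbers, infinity, NaN) are deliberately not handled.

```python
import struct


def decode_double(x: float):
    """Split a normalized IEEE 754 double into (sign, exponent, mantissa)."""
    # Reinterpret the 8 bytes of the double as a 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                        # 1 bit
    stored_exponent = (bits >> 52) & 0x7FF   # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)        # 52 stored mantissa bits
    # mantissa = 1 + sum_i b_i 2^-i, where b_1 is the most significant bit
    mantissa = 1.0 + sum(
        ((fraction >> (52 - i)) & 1) * 2.0 ** (-i) for i in range(1, 53)
    )
    return sign, stored_exponent - 1023, mantissa


sign, exponent, mantissa = decode_double(-6.5)
print(sign, exponent, mantissa)  # 1 2 1.625, i.e. -1.625 * 2^2
```
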
## Range Error
The maximal range of the floating-point numbers is determined by the number of bits used to code for the exponent. A typical number for single precision is
$2^{\pm 127} \approx 10^{\pm 38}$
and for double precision
$2^{\pm 1023} \approx 10^{\pm 308} $
Via application of arithmetic operations on these numbers, the range can be exceeded. The error occurring in that case is named a range error. As an example we consider the Bohr radius in SI units
$a_0 = \frac{4\pi\varepsilon_0\hbar^2}{m_ee^2}\approx 5.3\times 10^{-11} \mbox{m} $
The quantity $\hbar$ is Planck's quantum of action divided by $2\pi$. Bohr's radius lies within the range of single-precision floating-point numbers. However, the same does not hold for the numerator $4\pi\varepsilon_0\hbar^2 \approx 1.24\times 10^{-78}\,\mbox{kg}\,\mbox{C}^2\,\mbox{m}$ and the denominator $m_ee^2 \approx 2.34\times 10^{-68}\,\mbox{kg}\,\mbox{C}^2$: neither can be represented as a single-precision floating-point number. Hence, calculating Bohr's radius with the given formula can be problematic. A simple solution to this problem is to use natural units, such as Bohr's radius for distances.
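
The underflow can be reproduced in Python. Since Python floats are doubles, this sketch emulates single precision by packing a value into 4 bytes and unpacking it again; the helper `to_single` is made up for this illustration, and the constants are approximate SI values hard-coded for the example.

```python
import math
import struct


def to_single(x: float) -> float:
    """Round a double to the nearest single-precision value and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Approximate SI values, hard-coded for illustration
epsilon_0 = 8.8541878128e-12  # vacuum permittivity
hbar = 1.054571817e-34        # Planck's constant divided by 2*pi
m_e = 9.1093837015e-31        # electron mass
e = 1.602176634e-19           # elementary charge

numerator = 4 * math.pi * epsilon_0 * hbar**2
denominator = m_e * e**2

# In double precision both intermediate values are unproblematic.
print(numerator, denominator)
# In single precision both are rounded to zero.
print(to_single(numerator), to_single(denominator))
# The final result itself fits into single precision.
print(to_single(numerator / denominator))
```

In single precision the formula would therefore amount to dividing zero by zero, although the result is perfectly representable.
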
An even bigger problem can be illustrated by the calculation of the factorial. The factorial is defined as
$n! = n\cdot(n-1)\cdot(n-2)\cdot\ldots\cdot 3\cdot 2\cdot 1 $
In Matlab, it can easily be verified with the function `factorial(n)` that the factorial for $n>170$ cannot be represented, even with double-precision numbers. A way out is provided by logarithms, since the logarithm of even a very large number is still moderately small, e.g. $\log_{10}(10^{100}) = 100$. It follows that
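
The same limit can be observed in Python, used here in place of Matlab for illustration. Python integers are exact, so `math.factorial` itself does not overflow; in this sketch the range error shows up only when the result is converted to a double.

```python
import math

# 170! is about 7.26e306 and still fits into a double.
print(float(math.factorial(170)))

# 171! is about 1.24e309 and exceeds the double-precision range.
try:
    float(math.factorial(171))
except OverflowError as err:
    print("171! is out of range:", err)
```
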
$\ln(n!) = \ln(n) + \ln(n-1) + \ldots + \ln(3) + \ln(2) + \ln(1) $
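
As a minimal sketch, the sum of logarithms can be written directly in Python. The function name `log_factorial_sum` is made up for this illustration; `math.fsum` is used only to keep the rounding error of the long sum small.

```python
import math


def log_factorial_sum(n: int) -> float:
    """ln(n!) computed as ln(1) + ln(2) + ... + ln(n)."""
    return math.fsum(math.log(k) for k in range(1, n + 1))


# 171! is too large for a double, but its logarithm is harmless.
print(log_factorial_sum(171))  # roughly 711.7
```
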
For larger $n$, however, the evaluation of this expression is too laborious. If $n>30$, one is advised to use Stirling's formula
$\ln(n!) = n\ln(n)-n+\frac{1}{2}\ln(2\pi n)+\ln\left(1+\frac{1}{12n}+\frac{1}{288n^2}+\ldots\right) $
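
Stirling's formula, truncated after the $1/(288n^2)$ term, might be coded as follows. This is a sketch only; the function name `log_factorial_stirling` is made up here, and `math.lgamma(n + 1)`, which also returns $\ln(n!)$, serves merely as a reference value for comparison.

```python
import math


def log_factorial_stirling(n: int) -> float:
    """ln(n!) from Stirling's formula with the first two correction terms."""
    correction = 1 + 1 / (12 * n) + 1 / (288 * n**2)
    return (
        n * math.log(n)
        - n
        + 0.5 * math.log(2 * math.pi * n)
        + math.log(correction)
    )


for n in (30, 171, 1000):
    # The difference to the reference value shrinks as n grows.
    print(n, log_factorial_stirling(n), math.lgamma(n + 1))
```
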
The factorial $n!$ can then be written as
$n! = \mbox{mantissa}\times 10^{\mbox{exponent}} $
To obtain the mantissa and the exponent, we take the logarithm to base 10 (reminder: $\log_{10}(x) = \ln(x)/\ln(10)$)
$\log_{10}(n!) = \log_{10}(\mbox{mantissa})+{\mbox{exponent}} $
We now associate the integer part of $\log_{10}(n!)$ with the exponent. The fractional part is associated with the mantissa, i.e. mantissa = $10^a$ with $a = \log_{10}(n!)-\mbox{exponent}$.
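
The split into mantissa and exponent could look like this in Python. The function name `factorial_sci` is made up for this sketch, and $\ln(n!)$ is taken from `math.lgamma(n + 1)` as a stand-in; Stirling's formula or the sum of logarithms could be used instead.

```python
import math


def factorial_sci(n: int):
    """Return (mantissa, exponent) such that n! = mantissa * 10**exponent."""
    log10_fact = math.lgamma(n + 1) / math.log(10)  # log10(n!) = ln(n!)/ln(10)
    exponent = math.floor(log10_fact)               # integer part
    mantissa = 10 ** (log10_fact - exponent)        # from the fractional part
    return mantissa, exponent


mantissa, exponent = factorial_sci(171)
print(f"171! = {mantissa:.4f}e{exponent}")  # 171! = 1.2410e309
```
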
From these examples we learn that range errors can usually be circumvented with a little creativity.