## Introduction

A \(16-bit\) floating-point number is stored roughly as below```
0 | 100 0000 00|00 0001
sign Mantissa Exponent
```

One bit for sign of the number, \(6\) bits for the exponent and \(9\) bits for significant numbers, mantissa. For more details on how the number is returned to base \(10\) , see here.

This is similar to showing numbers with scientific notation in base \(10\)

$$-123 = (-1) \times 0.123 \times 10^3 $$

$$0.000456 = (+1) \times 0.456 \times 10^{-3}$$

Looking at this style of saving a floating number, we can conclude below points

- The precision is finite and dependent on the size of the mantissa. So, if mantissa can only store \(3\) digits

$$x = 0.12345 \Rightarrow x = 0.123$$

- Multiplication happen by calculation of mantissa’s and exponents separately and then round-off

- Addition happens by bringing the small number to the same exponent as the large number then adding mantissa’s

Note that in the above example, the whole small number disappeared in the result.

## Implication in physics

In physics codes, we have to ensure the precision is enough to capture the small magnitude phenomena. For example in a system that maximum temperature is \(1000\degree C\), with single-precision numbers (8 digits), we cannot have a heat source at \(0.00001 \degree C\).In systems with different unit properties like pressure, density, and velocity, it becomes hard to track the precision of all properties. Therefore, they are usually normalized with reference values to measure their accuracy from 1.

For example, let’s have a look at Bernoulli’s equation in a horizontal pipeline:

$$p + \frac{1}{2} \rho v^2 = c$$

Where \(p\) is pressure, \(\rho\) is the density of the fluid, \(v\) is the velocity of the fluid and \(c\) a constant. Before coding, we can normalize the equation. Let's assume the pressure is in the order of atmospheric pressure, \(p_0\). A reference velocity can be found as \[ p_0 = \frac{1}{2} \rho v_{0}^2 \\ \text{ }\\ v_0 = \sqrt{ 2 p_0 / \rho} \] Dividing both side by \(p_0\) \[ \frac{p}{p_0} + \frac{1}{2} \frac{ \rho v^2}{p_0} = \frac{c}{p_0} \\ \text{ }\\ \frac{p}{p_0} + (\frac{ v}{v_0})^2 = c_2 \\ \text{ }\\ P + V^2 = c_2 \] So we code the equation with \(P\) and \(V\) which are close to \(1\) rather than absolute pressure and velocity which can be of different orders, for example, \(5 \times 10^5\) pa and \(5\) m/s respectively. Note that by normalization, we do not increase the accuracy but facilitate analyzing accuracy.## A trick to improve the precision

There are situations that during the simulation runtime the correction to a property is much smaller than its initial value. For example, see the equation below \[ p_{new} = p_{old} + \Delta p \] \(p\) is the pressure of a system, and \(\Delta p\) is the correction to pressure. This line might happen \(1000\) times in a simulation. If the pressure is \(1000,000\) pa but the corrections are in the order of \(1\) pa, the numbers which are stored in a single-precision variable look like below \[ 1000,001.2 \\ 1000,123.4 \\ \] and we cannot store \[ 1000,123.456 \] We are losing ~ 4 digits precision. Because during the \(1000\) steps of the simulation the first 4 digits are never changed. In this case, we can work with an alternative variable in the code \[ \bar{p} = p - 1000,000 \] So we code the program based on \(\bar{p}\) which stores below numbers \[ 1.2345678 \\ 123.45678 \] And when presenting the results, we convert them back to \(p\).## Round-off error accumulation

In algorithms where round-off error is accumulated the outcome deviates from what expected. In the below example, \(1/3\) is constantly added to a number then it is subtracted the same amount from the outcome. The result is different from the initial value.```
#include <iostream>
#include <iomanip>
int main()
{
float initial = 1.0;
float result = initial;
float one_third = 1.0/3.0;
int iterations = 100000;
for (auto i = 1 ;i< iterations;i++)
result = result + one_third;
for (auto i = 1 ;i< iterations;i++)
result = result - one_third;
std::cout << std::fixed << std::setprecision(8)
<< "initial= " << initial // 1.00000000
<< "\nresult = "<<result; // 1.00017393
}
```

It can be concluded that the error of each term added to the next term and we get an accumulation of the round-off error.

Note that if the sequence was in a way that we had subtraction of terms, the errors could cancel out each other.