Using Statistics in Delphi - Part I
In this Issue we begin a look, spread over several issues, at developing
Statistical Routines to use in your Delphi Applications. These will be
designed to use Open Array Parameters where possible, so that you can use
them with Standard Arrays or with the new Dynamic Arrays.
Double & Extended
Whilst the Delphi
Math unit supplies some nice Statistical Routines, they tend to
rely on arrays of Double, whereas I tend to keep things in
Extended. I've always liked working to as many decimal places
as possible and then rounding to the required number of decimal
places as the last step. However, calculations in Extended are normally
slower than in Double (though this could change with the advent of 64-bit
Architectures).
So we are going to
develop all our routines using Extended - though they could
be easily adapted to Double or Single routines - or
even nicely overloaded with Delphi 4!
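As a rough sketch of what such overloading might look like, the program below declares a Double and an Extended version of one summing routine under the same name. The name SumOf and the sample values are hypothetical, and the sketch assumes Delphi 4 or later on a target where Double and Extended are distinct types (as in 32-bit Delphi).

```pascal
program OverloadDemo;
{$APPTYPE CONSOLE}

// SumOf is a hypothetical name; both versions share it via "overload".
function SumOf (const B: array of Double): Double; overload;
var
  I: Integer;
begin
  Result := 0;
  for I := Low (B) to High (B) do
    Result := Result + B [I];
end;

function SumOf (const B: array of Extended): Extended; overload;
var
  I: Integer;
begin
  Result := 0;
  for I := Low (B) to High (B) do
    Result := Result + B [I];
end;

var
  D: array [0..2] of Double;
  E: array [0..2] of Extended;
begin
  D [0] := 1; D [1] := 2; D [2] := 3;
  E [0] := 1.5; E [1] := 2.5; E [2] := 3.5;
  Writeln (SumOf (D):0:2);  // calls the Double version: 1 + 2 + 3
  Writeln (SumOf (E):0:2);  // calls the Extended version: 1.5 + 2.5 + 3.5
end.
```

The compiler picks the version from the element type of the array passed in, so calling code does not need separate routine names for each floating point type.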
Whilst our routines
will be using Extended, please note that the accuracy of
the results will depend upon the accuracy of the original data.
Measures of Central Tendency
These are Statistics
that summarise data by giving us a single value that tells us something
about the "middle" of the data. We use this concept
frequently:
- Average Height
- Most Popular Film
- Average Intelligence
There are 3 common
measurements: Mean, Median & Mode.
Calculating the Mean
Often when we use
the term "average" we are referring to the Mean.
Since "average" is an imprecise term in English
usage it is normally avoided by Statisticians. Since there is also
more than one type of Mean, strictly speaking we are calculating
the Arithmetic Mean. However since Statisticians don't use
the Geometric Mean or the Harmonic Mean very often,
it is normally assumed that by Mean we imply the Arithmetic
Mean.
Simply put, the Mean
is the Sum of the Values divided by the Number of the Values.

Sample Mean: $\bar{x} = \frac{\sum x}{n}$

Population Mean: $\mu = \frac{\sum x}{N}$

The "funny"
big symbol in the above formulae is the Capital Greek Letter Sigma,
which is used as Mathematical Shorthand for "Sum All The
Values".
Notice that though
we use the same formula, we use a Greek Letter to indicate that
the Statistic comes from the whole population rather than
just a sample. In practice, we tend to work with samples
more than populations.
Now to convert this
to Delphi:
function SumEArray (const B: array of Extended): Extended;
// Returns the Sum of an Array of Extended (0 if the Array is Empty)
var
  I: Integer;
begin
  Result := 0;
  for I := Low (B) to High (B) do
    Result := Result + B [I];
end;

function ESBMean (const X: array of Extended): Extended;
// Returns the Arithmetic Mean of an Array of Extended
begin
  if High (X) < 0 then
    raise Exception.Create ('Array is Empty!');
  Result := SumEArray (X) / (High (X) - Low (X) + 1);
end;
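As a usage sketch, the console program below calls ESBMean with both a fixed-size array and a dynamic array, which is the reason for using Open Array Parameters. The program name, variable names and sample values are illustrative assumptions, and this copy of SumEArray starts from zero so that an empty array cannot cause a range error. It assumes Delphi 4 or later (or Free Pascal in Delphi mode).

```pascal
program MeanDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

function SumEArray (const B: array of Extended): Extended;
// Returns the Sum of an Array of Extended (0 if the Array is Empty)
var
  I: Integer;
begin
  Result := 0;
  for I := Low (B) to High (B) do
    Result := Result + B [I];
end;

function ESBMean (const X: array of Extended): Extended;
// Returns the Arithmetic Mean of an Array of Extended
begin
  if High (X) < 0 then
    raise Exception.Create ('Array is Empty!');
  Result := SumEArray (X) / (High (X) - Low (X) + 1);
end;

var
  Fixed: array [1..4] of Extended;  // standard (static) array
  Dyn: array of Extended;           // dynamic array
begin
  Fixed [1] := 2; Fixed [2] := 4; Fixed [3] := 6; Fixed [4] := 8;
  Writeln ('Static:  ', ESBMean (Fixed):0:2);  // (2 + 4 + 6 + 8) / 4 = 5.00

  SetLength (Dyn, 3);
  Dyn [0] := 1.5; Dyn [1] := 2.5; Dyn [2] := 5.0;
  Writeln ('Dynamic: ', ESBMean (Dyn):0:2);    // 9 / 3 = 3.00
end.
```

Inside the routine an open array is always indexed from 0, whatever the bounds of the array passed in, which is why the code relies on Low and High rather than fixed indexes.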
Calculating the Mode
When we use terms
like "Most Popular" then we are in fact referring to the
Mode. It is the most common value.
Unlike the Mean, the Mode does not always exist - two values may be
equally popular (i.e. the data is Bimodal) or every value may be
unique. The Mode is not as useful as the Mean, but when it does
exist it indicates that there is grouping in the data - we will
also be interested in cases where there is a significant
difference between the Mean and the Mode.
To calculate the Mode, we need our array to be sorted. Rather
than include a sort routine in the Mode calculation, we will
leave that up to the user. Though this is risky, sorting takes
a lot of time and many routines depend on a sorted array, so it
is better to sort once than inside every routine. There are many
good sort algorithms available and you can use whatever
best suits your needs.
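Since the routines that follow expect sorted input, here is a minimal Insertion Sort as one possible way to prepare the data. It is only a sketch: the name SortEArray is hypothetical, and it suits small arrays, whereas a QuickSort or similar would serve large ones better. It relies on Delphi's default short-circuit Boolean evaluation.

```pascal
program SortDemo;
{$APPTYPE CONSOLE}

procedure SortEArray (var X: array of Extended);
// Sorts an Array of Extended into ascending order (Insertion Sort)
var
  I, J: Integer;
  Temp: Extended;
begin
  for I := Low (X) + 1 to High (X) do
  begin
    Temp := X [I];
    J := I - 1;
    // Shift larger values one place up to make room for Temp
    while (J >= Low (X)) and (X [J] > Temp) do
    begin
      X [J + 1] := X [J];
      Dec (J);
    end;
    X [J + 1] := Temp;
  end;
end;

var
  Data: array [0..4] of Extended;
  I: Integer;
begin
  Data [0] := 7; Data [1] := 3; Data [2] := 9; Data [3] := 3; Data [4] := 1;
  SortEArray (Data);
  // Values are now in ascending order: 1, 3, 3, 7, 9
  for I := Low (Data) to High (Data) do
    Write (Data [I]:0:0, ' ');
  Writeln;
end.
```

Because the parameter is declared var, the caller's array is sorted in place, so the sorted data can then be passed on to any routine that needs it.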
The following uses
SameFloat from the Rounding Article.
function GetMode (const SortedX: array of Extended;
  var Mode: Extended): Boolean;
// Calculates the Mode of a Sorted Array of Extended and returns
// True if the Mode exists.
var
  I, Freq, HiFreq: Integer;
  Matched: Boolean;
begin
  if High (SortedX) < 0 then
    raise Exception.Create ('Array is Empty!')
  else if High (SortedX) = 0 then // Only a Single Value
  begin
    Mode := SortedX [0];
    Result := True;
  end
  else
  begin
    Mode := 0;
    Freq := 1;        // Frequency of current Value
    HiFreq := 0;      // Highest Frequency so far
    Matched := False; // If False HiFreq is Unique
    for I := 1 to High (SortedX) do
    begin
      if SameFloat (SortedX [I - 1], SortedX [I]) then
        Inc (Freq) // count the number of values
      else
      begin
        if Freq <> 1 then // now see if frequency is highest
        begin
          if Freq = HiFreq then // not unique
            Matched := True
          else if Freq > HiFreq then // new HiFreq
          begin
            Mode := SortedX [I - 1];
            HiFreq := Freq;
            Matched := False;
          end;
          Freq := 1;
        end;
      end;
    end;
    // Handle special End cases
    if HiFreq > 0 then // Last value might be HiFreq
    begin
      if Freq = HiFreq then
        Matched := True
      else if Freq > HiFreq then
      begin
        Mode := SortedX [High (SortedX)];
        Matched := False;
      end;
    end
    else if Freq > 1 then // Only the last value repeats (includes all values identical)
    begin
      HiFreq := Freq;
      Mode := SortedX [High (SortedX)]; // the repeated value is the last one
      Matched := False;
    end;
    // Mode exists if HiFreq is Unique
    Result := (HiFreq > 0) and not Matched;
  end;
end;
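GetMode relies on SameFloat, which is not reproduced in this article. If you do not have the Rounding Article to hand, a minimal stand-in might look like the sketch below. The signature and the Tolerance value are assumptions for illustration, not the Rounding Article's version.

```pascal
program SameFloatDemo;
{$APPTYPE CONSOLE}

const
  Tolerance = 1e-12;  // assumed tolerance, not taken from the Rounding Article

function SameFloat (const X1, X2: Extended): Boolean;
// Stand-in: True if the two values differ by no more than Tolerance
begin
  Result := Abs (X1 - X2) <= Tolerance;
end;

begin
  Writeln (SameFloat (0.1 + 0.2, 0.3));  // TRUE: any rounding error is far below Tolerance
  Writeln (SameFloat (1.0, 1.001));      // FALSE: the values differ by 0.001
end.
```

Comparing within a tolerance rather than with = matters here because values that should be equal can differ in their last few bits after calculation, which would otherwise split one group of equal values into several.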
Median
The Median
is a way of measuring the exact middle when the values are listed
in sorted order.

So if the number of values is odd we take the middle value (e.g.
with 11 values we take value 6, as there are 5 lower values and 5
higher values). If it is even we take the middle two values and
find their Mean.
Like Mean,
Median has the advantage of always existing for numerical
data. Like Mode, Median requires the Array to be sorted.
function GetMedian (const SortedX: array of Extended): Extended;
// Returns the Median for a Sorted Array of Extended.
var
  N: Integer;
begin
  N := High (SortedX) + 1;
  if N <= 0 then
    raise Exception.Create ('Array is Empty!')
  else if N = 1 then // Only a Single Value
    Result := SortedX [0]
  else if Odd (N) then // Handle Odd Number of Values
    Result := SortedX [N div 2]
  else // Handle Even Number of Values
    Result := (SortedX [N div 2 - 1] + SortedX [N div 2]) / 2;
end;
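To see the odd and even cases side by side, the following sketch runs GetMedian on two small arrays that are already sorted. The program name, variable names and sample values are illustrative assumptions.

```pascal
program MedianDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

function GetMedian (const SortedX: array of Extended): Extended;
// Returns the Median for a Sorted Array of Extended.
var
  N: Integer;
begin
  N := High (SortedX) + 1;
  if N <= 0 then
    raise Exception.Create ('Array is Empty!')
  else if N = 1 then // Only a Single Value
    Result := SortedX [0]
  else if Odd (N) then // Handle Odd Number of Values
    Result := SortedX [N div 2]
  else // Handle Even Number of Values
    Result := (SortedX [N div 2 - 1] + SortedX [N div 2]) / 2;
end;

var
  OddData: array [0..4] of Extended;
  EvenData: array [0..3] of Extended;
begin
  // Five values: the median is the third value
  OddData [0] := 1; OddData [1] := 3; OddData [2] := 3;
  OddData [3] := 7; OddData [4] := 9;
  Writeln (GetMedian (OddData):0:2);   // 3.00

  // Four values: the median is the Mean of the middle two
  EvenData [0] := 2; EvenData [1] := 4; EvenData [2] := 6; EvenData [3] := 8;
  Writeln (GetMedian (EvenData):0:2);  // (4 + 6) / 2 = 5.00
end.
```

Note that the open array is indexed from 0, so for an odd N the middle value sits at index N div 2, which is value number (N div 2) + 1 when counting from 1.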
Conclusion
Next Issue we will
continue our look at Statistics with measures of Dispersion
such as Standard Deviation and Variance.