LeoStatistic software for data presentation, statistical analysis, marketing and prediction.
Statistics. Statistics is the most fundamental scientific discipline. Statistics, not philosophy, is the very base of the scientific method. It is not intuitively obvious, but statistics is directly connected to natural selection and to the development of instinctive responses in any living being. The everyday life of the average person does not offer many examples of statistical method, save for gambling. This is a result of the highly organized, multifaceted modern life we live, in which countless factors mesh together seamlessly.
Behind almost any statement you can find reasoning that was originally backed up by statistical analysis. Sometimes the interpretation of this analysis is wrong, as it was for the ancient statement that force is the reason for movement at a constant speed. Statistically this idea was very broadly supported by the undeniable observation that a cart has to be pushed to keep it moving. Millions of observations were summed up into a wrong physical law but a right statistical statement. Other observations, like the steady forward movement of a stone thrown by hand, contradicted the idea that every object must be pushed to keep it moving. It took the genius of Newton to summarize all the known facts in three simply formulated laws. The role of statistics is to supply the researcher with an initial compression of numerous observations into a relatively short statement.
Let's consider how statistics can be used to analyze radar measurements of car speeds. The first thing to do is to build a histogram of the distribution of the measured speeds. Then, if the shape of the histogram happens to be bell-like, the average value of the speed and its standard deviation can be calculated:
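In the usual notation these two quantities can be written as below. The symbols $v_i$ (an individual measured speed), $\bar{v}$ and $\sigma$ are assumed here for illustration, and the $n-1$ denominator (the sample form of the standard deviation) is likewise an assumption:

```latex
\bar{v} = \frac{1}{n}\sum_{i=1}^{n} v_i,
\qquad
\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(v_i - \bar{v}\right)^2}
```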
where n is the number of individual measurements. These two values are already very useful for making educated predictions about the speed at which a car could hit a pedestrian, innocent or not so innocent, who is foolish enough to run across the street here.
Analyzing the data in more detail, one can notice that the average speed of cars varies systematically with the time of day. During the morning rush hours it could be lower than in other periods. One can try to fit these data with a curve on a chart of speed against time with the help of some periodic formula. Another, more reasonable approach could be to create an additional parameter, the frequency with which different cars are measured, which reflects the intensity of traffic, and to fit a curve between these two parameters. Most likely this can be done with a much simpler, non-periodic formula. Such analysis can significantly improve the predictability of a car's speed at any given moment of the day.
But this is not our only option for improving the analysis. Besides the speeds of cars and the time of measurement, we can also record the weather conditions, the color of the car and its plate number, which associates any given car with full information about its technical specifications. From a picture of the driver we can also obtain data on his or her race, age, gender and education. So for the one value of primary interest, the speed of the car, we can collect associated data on many, even dozens, of arguments: the other parameters of the event. Quite possibly some of them, like the color of the car, will have no statistically significant influence on the speed (or, who knows, will), while others, like the intensity of precipitation, will have a really strong effect. Statistics, with the matrix of correlations for example, will help to establish the mutual influence of all the parameters.
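The matrix of correlations mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the Pearson correlation coefficient, the function names are hypothetical, and the records are invented for the example rather than measured.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical observations, invented only to show the shape of the result
observations = {
    "speed_kmh": [62.0, 58.0, 45.0, 40.0, 55.0, 35.0],
    "cars_per_min": [3.0, 5.0, 12.0, 15.0, 6.0, 18.0],
    "rain_mm_per_h": [0.0, 0.0, 1.0, 4.0, 0.5, 6.0],
}

matrix = correlation_matrix(observations)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Values close to 1 or -1 point to a strong linear connection between two parameters, values close to 0 to the absence of one.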
We can also build a model, either in the form of a single mathematical formula that includes all the arguments, or as an algorithm that produces the most probable value depending on the numerical values of all the measured parameters. It is important to note, though, that there is no universal statistical method magically applicable to any set of data.
LeoStatistic implements the most useful statistical methods for data analysis and
modeling. These methods are briefly described here in general terms, leaving
the specific "how to do it" to other pages.
Distribution of one variable.
One can divide the domain of the variable into smaller subintervals, so-called
bins, and count how many cases fall into each of them. Then, for each bin, draw
a rectangle whose width is the width of the bin and whose height equals the
number of cases. Such a picture, called a histogram, represents the probability
distribution for the variable to be found in any given interval. There are
numerous theoretical representations of such distributions, including two of
the most popular, the T-probability (Student) and Poisson distributions. These
can be calculated and displayed on the screen along with their correlation
coefficients with the histogram. The value of the correlation for a perfect fit
is 1.0, and it decreases as the mismatch between the theoretical curve and the
histogram grows. The conditional distribution of a variable can be built too:
just build a histogram only for those records that match given conditions on
other parameters. This option is especially fruitful for revealing a detailed,
even tiny, influence of one variable on another, which is especially applicable
in marketing for discovering non-functional dependencies.
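The binning and the conditional histogram described above can be sketched as follows. This is a minimal illustration, not LeoStatistic's implementation: equal-width bins are assumed, the function and field names are hypothetical, and the records are invented.

```python
def histogram(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width intervals on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # A value exactly on the top edge is counted in the last bin
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts

def conditional_histogram(records, variable, condition, low, high, bins):
    """Histogram of `variable` over only those records that satisfy `condition`."""
    selected = [r[variable] for r in records if condition(r)]
    return histogram(selected, low, high, bins)

# Hypothetical records, invented for illustration
records = [
    {"speed": 42.0, "raining": True},
    {"speed": 47.0, "raining": True},
    {"speed": 55.0, "raining": False},
    {"speed": 61.0, "raining": False},
    {"speed": 58.0, "raining": False},
    {"speed": 49.0, "raining": True},
]

all_counts = histogram([r["speed"] for r in records], 40.0, 70.0, 3)
rain_counts = conditional_histogram(
    records, "speed", lambda r: r["raining"], 40.0, 70.0, 3
)
print(all_counts, rain_counts)
```

Comparing the two lists of counts shows how the distribution of one variable shifts when a condition is placed on another.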
Approximation. Let us presume that experimental results can be described by the formula:
$y = f(x_1, x_2, \dots, x_n)$ where $y$ corresponds to the value data series and $x_1, x_2, \dots, x_n$ are the arguments.
During experimental research and data modeling one quite often meets the
situation where the structure of the formula describing the data is known from
basic principles, and the task is to find the coefficients that best fit the
data of the particular experiment. The standard method for calculating the
coefficients is the least squares method. In this method the fit is based on
the criterion of minimizing the sum of squares of the deviations between the
calculated and experimental values:
$$\mathrm{dev}(a_0, a_1, a_2, \dots, a_n) = \sum \left( y_t(a_0, a_1, a_2, \dots, a_n, x_1, x_2, \dots, x_n) - y_e \right)^2 \to \min$$
Here $y_t$ is the calculated value and $y_e$ the experimental one. The task is
to find the set of coefficients $a_0, a_1, a_2, \dots, a_n$ for which this
function has its minimum value. In the general case, for an arbitrary form of
the approximating formula, there is no analytical solution for the best set of
fitting coefficients. In LeoStatistic one can use one of numerous algorithms
for the numerical approximation of a free-format formula.
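To make the idea of numerical fitting concrete, here is a minimal sketch of one very simple derivative-free approach, a coordinate pattern search that tries a step along each coefficient and halves the step when nothing improves. It is chosen only for brevity and is not claimed to be one of the algorithms LeoStatistic uses; the model, the starting point and the data are hypothetical.

```python
import math

def sum_of_squares(model, coeffs, xs, ys):
    """Sum of squared deviations between calculated and experimental values."""
    return sum((model(x, coeffs) - y) ** 2 for x, y in zip(xs, ys))

def fit_by_pattern_search(model, start, xs, ys, step=1.0, tolerance=1e-9,
                          max_iterations=100000):
    """Minimize the sum of squares by trying +/- step on each coefficient in turn."""
    coeffs = list(start)
    best = sum_of_squares(model, coeffs, xs, ys)
    for _ in range(max_iterations):
        if step < tolerance:
            break
        improved = False
        for i in range(len(coeffs)):
            for delta in (step, -step):
                trial = list(coeffs)
                trial[i] += delta
                value = sum_of_squares(model, trial, xs, ys)
                if value < best:
                    coeffs, best, improved = trial, value, True
                    break
        if not improved:
            step /= 2.0  # no direction helped, so refine the step
    return coeffs, best

def exponential_model(x, coeffs):
    """Hypothetical free-format formula: y = a0 * exp(a1 * x)."""
    return coeffs[0] * math.exp(coeffs[1] * x)

# Synthetic "experimental" data generated from a0 = 2.0, a1 = 0.5 (illustration only)
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

fitted, residual = fit_by_pattern_search(exponential_model, [1.0, 0.0], xs, ys)
print(fitted, residual)
```

A search like this never increases the sum of squares, but it can be slow and can stop in a local minimum, which is why the choice of starting coefficients matters for free-format formulas.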
For the special situation in which the fitting equation has a quasipolynomial structure:
$$F(y) = a_0 + a_1 f_1(x_1, x_2, \dots, x_n) + a_2 f_2(x_1, x_2, \dots, x_n) + \dots + a_n f_n(x_1, x_2, \dots, x_n)$$
there is an analytical solution for the coefficients
$a_0, a_1, a_2, \dots, a_n$ that
corresponds to the best fit to the experimental data in the sense of least
squares deviation.
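For a formula that is linear in its coefficients, the standard analytical route is to solve the so-called normal equations. The sketch below assumes that $F$ is the identity (so $y$ itself is fitted), uses hypothetical function names, and solves the small linear system by Gaussian elimination. It illustrates the general technique, not LeoStatistic's own code.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    rows = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(rows[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (rows[i][n] - tail) / rows[i][i]
    return solution

def fit_quasipolynomial(basis, xs, ys):
    """Least-squares coefficients a0..an for y = a0 + a1*f1(x) + ... + an*fn(x)."""
    # One design row per observation: [1, f1(x), ..., fn(x)]
    design = [[1.0] + [f(x) for f in basis] for x in xs]
    m = len(basis) + 1
    # Normal equations: (D^T D) a = D^T y
    normal = [[sum(row[i] * row[j] for row in design) for j in range(m)]
              for i in range(m)]
    moment = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(m)]
    return solve_linear_system(normal, moment)

# Synthetic data generated from y = 1 + 2x + 3x^2 (illustration only)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
basis = [lambda x: x, lambda x: x * x]

coefficients = fit_quasipolynomial(basis, xs, ys)
print(coefficients)
```

Each basis function may be any expression of the arguments. For several arguments, `x` can be a tuple and each `f` can pick out the components it needs.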
One can also calculate the standard deviations $\sigma$ of the found coefficients and the correlation coefficient:
$$\mathrm{Cr.Cf} = 1 - \frac{1}{n} \sum \left( \frac{F(y_e) - F(y_t)}{\sigma_y} \right)^2$$
The best-known form of the quasipolynomial function is the polynomial equation:
$$y = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n$$
Nevertheless, the possible forms available for the analytical
solution are much broader, and some of them (including the simple polynomial) are
implemented in LeoStatistic. Although a universal method for approximating any
imaginable set of data does not exist, LeoStatistic
offers several different schemes, each with its own specific user
interface, that cover most common situations.
When the arguments
$x_1, x_2, \dots, x_n$ are
independent parameters, we can talk about multivariate regression;
LeoStatistic implements linear and parabolic forms of the fitting formula.
Near neighbors method.
This method is based on the presumption that we have no advance knowledge about
the mutual dependence between variables. One can assume that, for non-sporadic
data, the closer a point in multidimensional space is located to other points,
the more reason there is to suggest that their values will be approximately the
same. Another way to describe this method is to say that the most probable
value at a point in n-dimensional space is a weighted average of the values at
the closest points around it. LeoStatistic implements several schemes for this
calculation. The distance, dpi, in n-dimensional space between the probe point
and the i-th point is calculated by a formula in which the summing is done over
all n arguments.
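A minimal sketch of such an estimate is given below. Several choices in it are assumptions made for illustration and are not taken from LeoStatistic: the distance is Euclidean, the weights are inverse powers of the distance, only the `k` closest points are used, and all names and data are hypothetical.

```python
import math

def euclidean_distance(p, q):
    """Distance between two points, summing over all n arguments."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_neighbor_estimate(probe, points, values, k=3, power=2.0):
    """Weighted average of the values at the k points closest to `probe`.

    Weights are 1 / distance**power (an assumed weighting scheme).
    """
    by_distance = sorted(
        (euclidean_distance(probe, point), value)
        for point, value in zip(points, values)
    )
    nearest = by_distance[:k]
    # If the probe coincides with a known point, return that point's value
    if nearest[0][0] == 0.0:
        return nearest[0][1]
    weights = [1.0 / d ** power for d, _ in nearest]
    weighted_sum = sum(w * v for w, (_, v) in zip(weights, nearest))
    return weighted_sum / sum(weights)

# Hypothetical two-argument data, invented for illustration
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
values = [10.0, 20.0, 30.0, 100.0]

estimate = nearest_neighbor_estimate((0.5, 0.5), points, values, k=3)
print(estimate)
```

In practice the arguments usually need to be brought to comparable scales before distances are computed, otherwise the argument with the largest numerical range dominates the result.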
Scoring. In optimizing a marketing campaign, the common problem is to rank all potential clients by their expected response to a direct advertising mailing. Since mailing costs money, it makes sense to get more responses per advertising dollar. The common approach is to construct an algorithm that calculates and assigns a score to every client, for example normalized from 0 to 100% to represent the probability of a positive response. LeoStatistic has tools to solve this problem, such as building conditional distributions as well as directly creating a scoring algorithm.
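One very simple way to turn a conditional distribution into a score is to use the response rate observed in past mailings for the group a client belongs to. The sketch below shows only that idea. The field names, the grouping by age band, the zero score for unseen groups and the data are all hypothetical.

```python
from collections import defaultdict

def response_rates(history, key):
    """Percentage of positive responses for each value of `key` in past mailings."""
    sent = defaultdict(int)
    responded = defaultdict(int)
    for record in history:
        sent[record[key]] += 1
        if record["responded"]:
            responded[record[key]] += 1
    return {group: 100.0 * responded[group] / sent[group] for group in sent}

def rank_clients(clients, rates, key):
    """Return (score, name) pairs sorted from the highest score to the lowest.

    Clients from a group never seen before get a score of 0.0 (an assumption).
    """
    scored = [(rates.get(client[key], 0.0), client["name"]) for client in clients]
    return sorted(scored, reverse=True)

# Hypothetical results of a past campaign, invented for illustration
history = [
    {"age_band": "18-35", "responded": True},
    {"age_band": "18-35", "responded": False},
    {"age_band": "18-35", "responded": False},
    {"age_band": "18-35", "responded": False},
    {"age_band": "36-60", "responded": True},
    {"age_band": "36-60", "responded": True},
    {"age_band": "36-60", "responded": False},
    {"age_band": "36-60", "responded": False},
]
clients = [
    {"name": "Bob", "age_band": "18-35"},
    {"name": "Ann", "age_band": "36-60"},
]

rates = response_rates(history, "age_band")
ranking = rank_clients(clients, rates, "age_band")
print(ranking)
```

Mailing only to the clients at the top of such a ranking is what raises the number of responses per advertising dollar.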