Each row represents a single "data point." Each data
point is a single measurement of the dependent variable and its
associated independent variables. Each column represents a different
variable. In the retail store example,
each store is represented by a row in the dataset. Each variable
(gross margin, supplies-to-equipment ratio, market share, etc.)
occupies a different column.
There should be more data points than variables (more rows than
columns), although exceptions are possible. The final model
cannot have more terms (coefficients) than there are data points
in the dataset used for fitting.
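As a rough illustration of the limit on terms, the sketch below counts the terms in a full polynomial and compares that count with the number of data points. It assumes a complete polynomial of a given degree, including the constant term; a fitted model may well use only a subset of these terms. The function names are hypothetical and are not part of any MPR software.

```python
from math import comb

def full_polynomial_terms(n_variables, degree):
    """Number of terms (constant included) in a complete polynomial
    of the given degree in n_variables independent variables."""
    return comb(n_variables + degree, degree)

def enough_data(n_points, n_variables, degree):
    """True if there are at least as many data points as terms.
    Assumption: the model uses every term of the full polynomial."""
    return n_points >= full_polynomial_terms(n_variables, degree)

# Three independent variables, second-degree polynomial: 10 possible terms.
print(full_polynomial_terms(3, 2))
print(enough_data(8, 3, 2))
```

The count grows quickly with the number of variables and the degree, which is one reason to prefer many more rows than columns.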
If some data points have missing values, those data points must
either be removed, or the missing values must be filled in. A simple
way to fill in a missing value is to use the mean of that variable
over the remaining data points.
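The mean-fill approach can be sketched as follows. This is a minimal example using only the standard library; it assumes missing values are marked with None and that every column has at least one observed value. The store figures are made up for illustration.

```python
from statistics import mean

def fill_missing_with_mean(rows):
    """Return a copy of rows in which each None is replaced by the
    mean of the observed values in the same column."""
    filled = [list(row) for row in rows]
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        col_mean = mean(observed)  # assumes the column is not entirely missing
        for row in filled:
            if row[j] is None:
                row[j] = col_mean
    return filled

# Hypothetical data: one row per store; columns are
# gross margin (%) and market share (%).
stores = [
    [30.0, 10.0],
    [40.0, None],
    [50.0, 30.0],
]
filled_stores = fill_missing_with_mean(stores)
print(filled_stores[1])
```

The original list is left unchanged, so the filled and unfilled versions can be compared afterward.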
If the dataset is a time series, then there can be no missing data
points. That is, no measurements can have been skipped. If any
measurements are missing, they must be filled in as described above,
or by interpolating between the measurements before and after the gap.
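Interpolating across a gap in a time series might look like the sketch below. It assumes evenly spaced measurements, marks skipped measurements with None, and uses straight-line (linear) interpolation, which is only one possible choice. The first and last measurements must be present, since there is nothing to interpolate from otherwise.

```python
def interpolate_gaps(series):
    """Fill None entries by linear interpolation between the nearest
    known values before and after each gap."""
    if series[0] is None or series[-1] is None:
        raise ValueError("first and last measurements must be present")
    filled = list(series)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            start = i - 1          # last known value before the gap
            end = i
            while filled[end] is None:
                end += 1           # first known value after the gap
            step = (filled[end] - filled[start]) / (end - start)
            for k in range(start + 1, end):
                filled[k] = filled[start] + step * (k - start)
            i = end
        else:
            i += 1
    return filled

# Hypothetical monthly measurements with two skipped months.
monthly = [10.0, None, None, 16.0, 18.0]
print(interpolate_gaps(monthly))
```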
The data must be numerical. However, some qualitative variables
can be transformed into numbers. In the retail stores example, some
variables had a yes/no quality. These could be represented numerically
as a 1 or a 0. Similarly, male/female, treated/untreated, or any
other two-way distinction could be represented this way. Variables
coded in this way are called coded variables or dummy variables.
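Coding a two-way variable as 1 or 0 can be sketched in a few lines. The function name, the column name, and the labels below are hypothetical; which label is assigned to 1 is an arbitrary choice, as long as it is applied consistently.

```python
def encode_binary(values, one_label):
    """Code a two-way qualitative variable as 1 (one_label) or 0 (the other label)."""
    labels = set(values)
    if len(labels) > 2:
        raise ValueError("expected at most two distinct labels, got %r" % labels)
    return [1 if v == one_label else 0 for v in values]

# Hypothetical yes/no variable for four stores.
has_parking = ["yes", "no", "yes", "no"]
coded = encode_binary(has_parking, "yes")
print(coded)
```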
Sometimes the dependent variable can be a dummy variable. An example
is the wavemaker machine example, where
failure of the machine was coded as a 1, and nonfailure was coded
as 0. Then, when a prediction is made, the output should be rounded
off to 0 or 1. A different type of regression, called logistic
regression, is designed for this situation. MPR can be used in
place of logistic regression, and it retains its advantage
of being able to describe nonlinearities, including interactions.
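Rounding a prediction off to 0 or 1 can be sketched as below. A regression output is not confined to the range 0 to 1, so the sketch uses an explicit cutoff rather than a rounding function. The cutoff of 0.5, and the choice to assign a prediction of exactly 0.5 to class 1, are assumptions made for illustration. The raw predictions are made-up numbers, not output from a fitted model.

```python
def to_class(prediction, threshold=0.5):
    """Round a raw regression output off to 1 (failure) or 0 (nonfailure).
    Values at or above the threshold become 1; all others become 0."""
    return 1 if prediction >= threshold else 0

# Hypothetical raw model outputs, including values outside [0, 1].
raw_predictions = [-0.12, 0.31, 0.5, 0.87, 1.4]
classes = [to_class(p) for p in raw_predictions]
print(classes)
```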