Data Preprocessing


Normalize Inputs
ChaosHunter gives you the option to "normalize" (scale) all of the inputs and/or the output into the same range, as is often done in statistics. If you elect to scale either the inputs or the output, ChaosHunter applies a statistical Z-score to the data prior to calculating the formula. The Z-score adjusts each data value by subtracting the mean of that data series (the mean of that variable) from the value and then dividing the result by the standard deviation of that series. This "scales" the data into a narrower range, which often results in better formulas.
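
As a rough illustration of the Z-score described above, here is a minimal Python sketch. It is not ChaosHunter's own code. The function name z_scores and the sample values are made up for the example, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since this page does not say which one is used.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Scale one data series (one variable) with the Z-score."""
    m = mean(values)
    s = pstdev(values)  # assumption: population standard deviation
    if s == 0:
        raise ValueError("a constant series cannot be Z-scored")
    # subtract the series mean, then divide by the series standard deviation
    return [(v - m) / s for v in values]

# made-up sample series
prices = [10.0, 12.0, 14.0, 16.0, 18.0]
scaled = z_scores(prices)
print(scaled)
```

After scaling, the series has a mean of zero and a standard deviation of one, whatever units the original values were in.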

 

On the other hand, scaled data is harder to understand because the numbers will not look natural to you. If you scaled the output, the result of the formula has to be unscaled to get back to the original units. If you scaled the inputs, new input values have to be scaled each time they are entered into the formula. The ChaosHunter Runtime module takes care of scaling and unscaling.
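
The scaling and unscaling steps can be sketched as follows. This is only an illustration of the idea, not the Runtime module's implementation. The names fit_stats, scale, unscale and formula are hypothetical, the data is invented, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def fit_stats(values):
    """Return (mean, standard deviation) of the original data series."""
    return mean(values), pstdev(values)  # assumption: population std

def scale(value, stats):
    m, s = stats
    return (value - m) / s

def unscale(z, stats):
    m, s = stats
    return z * s + m

# invented training data for one input and the output
train_input = [100.0, 110.0, 120.0, 130.0, 140.0]
train_output = [1.0, 2.0, 3.0, 4.0, 5.0]
in_stats = fit_stats(train_input)
out_stats = fit_stats(train_output)

def formula(z_in):
    # hypothetical stand-in for a formula that was found on scaled data
    return z_in

# a new input is scaled with the ORIGINAL mean and standard deviation,
# and the scaled result is converted back to the output's own units
new_input = 125.0
z_out = formula(scale(new_input, in_stats))
prediction = unscale(z_out, out_stats)
print(prediction)
```

The point of the sketch is that the mean and standard deviation are computed once from the original data and then reused for every new value.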
 
However, there may be other preprocessing you want to do yourself. For example, if the values of a particular input grow steadily over time, you may want to represent that input as a change or percent change from the previous value.
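
A change or percent change series could be prepared along these lines before the data is loaded. The function names and the sample series are made up for this sketch.

```python
def changes(values):
    """Difference between each value and the previous one."""
    return [cur - prev for prev, cur in zip(values, values[1:])]

def percent_changes(values):
    """Percent change from the previous value (previous value must not be 0)."""
    return [100.0 * (cur - prev) / prev for prev, cur in zip(values, values[1:])]

# invented series that keeps growing over time
index_level = [100.0, 110.0, 121.0, 133.1]
diffs = changes(index_level)
pct = percent_changes(index_level)
print(diffs)
print(pct)
```

Note that the first row has no previous value, so each result is one item shorter than the original series.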

Coding Monotonic Inputs
Suppose you want an input value that is the code for which of your company’s manufacturing plants made a particular part. The plants are East Coast, Southern, Northern, and West Coast. You might be tempted to make this your input:
 
1=East Coast
2=Southern
3=Northern
4=West Coast
 
This would be OK if there were only two plants, but with four plants the increasing code values do not correspond to an increasing quantity of anything, so the numbers would mislead the model.
 
If, however, you were trying to represent the fact that some plants are more capable than others, and the capabilities were listed most to least or least to most, then your coding is OK.
 
Otherwise, you might want to use four variables, one for each plant, each of which is either one or zero: one for the plant that made the part and zero for the others. Of course, if you have too many plants, then you run the risk of using too many inputs.
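
The one-variable-per-plant coding might be produced like this. The function name one_of_n is hypothetical. The plant names are the ones from the example above.

```python
PLANTS = ["East Coast", "Southern", "Northern", "West Coast"]

def one_of_n(plant):
    """Return four 0/1 inputs, one per plant, with a 1 for the given plant."""
    if plant not in PLANTS:
        raise ValueError(f"unknown plant: {plant!r}")
    return [1 if plant == p else 0 for p in PLANTS]

row = one_of_n("Northern")
print(row)  # [0, 0, 1, 0]
```

Each part then contributes four input columns instead of one, which is why a long list of plants quickly leads to too many inputs.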

Using a Zip Code as an Input
For the same reason, you can't use thousands of zip codes as separate one-or-zero inputs. But you can't use the zip code number itself either, because a larger zip code number does not mean more of anything. Think about what you are trying to represent with the zip code. Suppose it is probable economic status, since you know some zip codes cover wealthy areas and others cover poor areas. Translate your zip codes into a variable such as this:
 
1=poor
2=middle class
3=upper middle class
4=wealthy
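
One way to do this translation is a simple lookup table, sketched below. The zip codes and their classifications are invented placeholders. You would have to supply a real table from your own knowledge of the areas. The function name zip_to_input is also hypothetical.

```python
# the coding from the list above
STATUS_CODE = {"poor": 1, "middle class": 2, "upper middle class": 3, "wealthy": 4}

# hypothetical lookup table: these zip codes and labels are placeholders
ZIP_TO_STATUS = {
    "00001": "poor",
    "00002": "middle class",
    "00003": "upper middle class",
    "00004": "wealthy",
}

def zip_to_input(zip_code):
    """Translate a zip code into the 1-4 economic status input."""
    status = ZIP_TO_STATUS.get(zip_code)
    if status is None:
        return None  # unknown zip code: leave the value missing
    return STATUS_CODE[status]

print(zip_to_input("00003"))  # 3
```

Because the categories run from poor to wealthy, the increasing numbers do represent an increasing concept, so a single input is enough.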

 

Missing Data

In the Select Inputs screen, you can click the appropriate radio button to have ChaosHunter either skip rows in your data file that contain blank or missing values, or include those rows by replacing each missing value with the average value of its column.
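
The two choices can be pictured with the following sketch. It only illustrates the two strategies and is not ChaosHunter's code. The function names and data are made up, and missing values are represented here by None, which is an assumption about how the data was read in.

```python
from statistics import mean

def skip_incomplete_rows(rows):
    """Keep only the rows that have no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row)]

def fill_with_column_means(rows):
    """Replace each missing value with the mean of the known values in its column."""
    columns = list(zip(*rows))
    # raises statistics.StatisticsError if a column has no known values at all
    means = [mean(v for v in col if v is not None) for col in columns]
    return [[means[i] if v is None else v for i, v in enumerate(row)]
            for row in rows]

# invented data file with two columns and two missing values
rows = [[1.0, 10.0], [None, 20.0], [3.0, None], [5.0, 30.0]]
skipped = skip_incomplete_rows(rows)
filled = fill_with_column_means(rows)
print(skipped)
print(filled)
```

Skipping rows keeps only real measurements but shrinks the data set. Filling with averages keeps every row at the cost of introducing values that were never observed.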