Chapter 12 Frequently Asked Questions


Q. I got better results with the Evolution method than the Swarm method. Why then do you even have both methods?

A. First, make sure you are comparing the two methods on out-of-sample (evaluation) data, not in-sample (training) data. Beyond that, we don't think either method is always better than the other, which is why we offer both. Sometimes one does better; sometimes the other does.
 

Q. Do I have to normalize all of my variables before I feed them to ChaosHunter?

A. If by "normalize" you mean scaling all the inputs into the same range, as is often done in statistics (e.g. Z-score), the answer is no, because there is an option in ChaosHunter that can do that for you automatically.
 
However, there may be other preprocessing you want to do yourself. For example, if the values of a particular input are constantly growing over time, you may want to represent that input as a change or percent change from the previous value. See Data Preprocessing for more techniques.
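
As an illustration of the percent change idea above, here is a minimal Python sketch. The function name percent_change and the sample values are our own assumptions for the example; they are not part of ChaosHunter, and you would do this step in your own spreadsheet or script before loading the data.

```python
def percent_change(values):
    """Return the percent change of each value from the previous one.

    The first value has no predecessor, so the result has one fewer element.
    """
    changes = []
    for prev, curr in zip(values, values[1:]):
        if prev == 0:
            raise ValueError("percent change is undefined when the previous value is 0")
        changes.append(100.0 * (curr - prev) / prev)
    return changes


# Made-up series that trends upward over time (illustration only).
prices = [100.0, 110.0, 121.0, 108.9]
print(percent_change(prices))
```

The transformed series stays in a similar range no matter how far the raw values have drifted, which is the point of the preprocessing.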

Q. Can you give an example of not coding inputs with monotonic values unless they represent monotonic concepts?

A. Suppose you want an input value that is the code for which of your company’s manufacturing plants made a particular part. The plants are East Coast, Southern, Northern, and West Coast. You might be tempted to make this your input:
 
1=East Coast
 
2=Southern
 
3=Northern
 
4=West Coast
 
This would be OK if there were only two plants, but with four, the increasing code values don't correspond to any increasing concept in the input.
 
If, however, you were trying to represent the fact that some plants are more capable than others, and the capabilities were listed most to least or least to most, then your coding is OK.
 
Otherwise you might want to use four variables, one for each plant, which are either one or zero. Of course, if you have too many plants, then you run the risk of including too many inputs.
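
The "one variable per plant" coding described above might be sketched in Python as follows. The names PLANTS and one_hot_plant are hypothetical and only serve the example.

```python
# The four plants from the example, in a fixed order.
PLANTS = ["East Coast", "Southern", "Northern", "West Coast"]


def one_hot_plant(plant):
    """Return four 0/1 inputs, with a 1 only in the position of the given plant."""
    if plant not in PLANTS:
        raise ValueError(f"unknown plant: {plant!r}")
    return [1 if plant == name else 0 for name in PLANTS]


print(one_hot_plant("Northern"))  # [0, 0, 1, 0]
```

Each plant gets its own input, so no plant is treated as "greater than" another.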

Q. In view of the question above about monotonic inputs, can I use a person's zip code as an input? Certainly I can't use one input that is 0 or 1 for each zip area.

A. Of course you can't use thousands of inputs. But you can't use the zip code number either. Think about what you are trying to represent with the zip code. Suppose it is probable economic status, since you know some zip codes cover wealthy areas and others represent poverty areas. Translate your zip codes into a variable such as this:
 
1=poor
 
2=middle class
 
3=upper middle class
 
4=wealthy
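
A rough Python sketch of such a translation is shown below. The lookup table is entirely hypothetical: the zip codes are placeholders, and you would have to build the real table from your own knowledge or data. The names ZIP_TO_STATUS and zip_to_input are also our own assumptions.

```python
# The monotonic coding from the answer above.
STATUS_CODES = {
    "poor": 1,
    "middle class": 2,
    "upper middle class": 3,
    "wealthy": 4,
}

# Hypothetical lookup table with placeholder zip codes (not real data).
ZIP_TO_STATUS = {
    "00001": "poor",
    "00002": "middle class",
    "00003": "upper middle class",
    "00004": "wealthy",
}


def zip_to_input(zip_code):
    """Translate a zip code into the 1-4 economic status code.

    Returns None for a zip code that is not in the table, so the caller
    can treat it as a missing value.
    """
    status = ZIP_TO_STATUS.get(zip_code)
    if status is None:
        return None
    return STATUS_CODES[status]


print(zip_to_input("00003"))  # 3
```

Because the codes now rise with the concept they represent, a single input is enough.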

Q. I get great training results but when I feed in some out-of-sample patterns the results are pretty poor. Have any ideas why? I have 40 inputs and 15 training patterns.

A. Yes, we do. When you use too many inputs and too few training patterns, you are really asking for over-fitting trouble! Mathematically, if you had as few as 15 inputs you could exactly fit the data even with a linear model. The old statistical rule of thumb ("use AT LEAST 10 times as many patterns as you have inputs") works as a minimum for our non-linear models. More is better, to a point.
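
To make the mathematical point concrete, here is a small Python sketch, separate from ChaosHunter, that fits a linear model with 15 inputs to 15 patterns whose targets are pure random noise. It solves for the weights with ordinary Gaussian elimination; the function name and the random data are assumptions made for the demonstration.

```python
import random


def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every row below the pivot.
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


random.seed(0)
n_patterns = n_inputs = 15
inputs = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_patterns)]
targets = [random.uniform(-1, 1) for _ in range(n_patterns)]  # pure noise

weights = solve_linear_system(inputs, targets)
predictions = [sum(w * v for w, v in zip(weights, row)) for row in inputs]
max_training_error = max(abs(p - t) for p, t in zip(predictions, targets))
print(max_training_error)  # tiny: limited only by floating-point rounding
```

The model reproduces the noise essentially perfectly, yet it has learned nothing that would carry over to new patterns. For the 40 inputs in the question, the rule of thumb calls for at least 400 training patterns.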

Q. What is the point where more is not better?

A. If you use so many patterns that you have clusters of the same or close patterns, then you aren't adding anything, and large clusters may be biasing the formula.
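
One crude way to look for such clusters, sketched in Python, is to round every value and count how many patterns become identical. The function name, the rounding precision, and the sample patterns are all assumptions for illustration. Rounding is only a rough stand-in for "close": two nearby values can still land on opposite sides of a rounding boundary.

```python
from collections import Counter


def cluster_sizes(patterns, decimals=1):
    """Count patterns that are identical after rounding each value."""
    keys = [tuple(round(value, decimals) for value in pattern) for pattern in patterns]
    return Counter(keys)


# Made-up patterns: the first three are nearly the same.
patterns = [[0.50, 1.20], [0.51, 1.21], [0.49, 1.19], [3.00, 4.00]]
sizes = cluster_sizes(patterns)
print(max(sizes.values()))  # 3
```

A few very large counts suggest that many of your patterns are repeating the same information.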
 
If you are doing financial predictions, more data means going back further in time when the markets were different. If you do that, then you are creating a formula with patterns which may be irrelevant or even erroneous in today’s markets.

Q. Is there a way to know if over-training has occurred?

A. The only way to know this for sure is to test on out-of-sample data.
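
The idea can be sketched in Python as follows. This is not how ChaosHunter itself evaluates formulas; the function names, the 20% holdout, and the deliberately over-fitted "memorizing" formula are assumptions chosen to make the symptom obvious.

```python
def split_in_and_out_of_sample(rows, out_fraction=0.2):
    """Hold out the last part of the data (in file order) as out-of-sample."""
    n_out = int(round(len(rows) * out_fraction))
    cut = len(rows) - n_out
    return rows[:cut], rows[cut:]


def mean_absolute_error(formula, rows):
    """Average absolute difference between the formula's output and the target."""
    return sum(abs(formula(x) - y) for x, y in rows) / len(rows)


# Made-up data: each row is (input, target) with target = 2 * input.
rows = [(float(x), 2.0 * x) for x in range(10)]
training, out_of_sample = split_in_and_out_of_sample(rows)

# An extreme over-fitter: it memorizes the training targets and
# answers 0.0 for any input it has never seen.
memory = dict(training)


def memorizing_formula(x):
    return memory.get(x, 0.0)


print(mean_absolute_error(memorizing_formula, training))       # 0.0
print(mean_absolute_error(memorizing_formula, out_of_sample))  # 17.0
```

A near-perfect training error next to a much larger out-of-sample error is the signature of over-training.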

Q. What do I do if some of my data fields are missing?

A. On the Select Inputs screen, click the appropriate radio button to have ChaosHunter either skip rows in your data file that contain blank or missing values, or include those rows by replacing each missing value with the average value of its column.
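
If you prefer to clean the file yourself before loading it, the two choices can be mimicked with a short Python sketch like the one below. This only mirrors the behavior described above and is not ChaosHunter's own code; the function names are hypothetical, and missing values are assumed to be represented as None.

```python
def skip_incomplete_rows(rows):
    """Keep only the rows that have no missing (None) values."""
    return [row for row in rows if all(value is not None for value in row)]


def fill_with_column_means(rows):
    """Replace each missing (None) value with the mean of its column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        present = [row[c] for row in rows if row[c] is not None]
        if not present:
            raise ValueError(f"column {c} has no values to average")
        means.append(sum(present) / len(present))
    return [
        [means[c] if value is None else value for c, value in enumerate(row)]
        for row in rows
    ]


# Made-up data with one missing value in each column.
rows = [[1.0, 10.0], [None, 20.0], [3.0, None]]
print(skip_incomplete_rows(rows))    # [[1.0, 10.0]]
print(fill_with_column_means(rows))  # [[1.0, 10.0], [2.0, 20.0], [3.0, 15.0]]
```

Skipping rows discards information, while filling with averages keeps every row at the cost of inventing values, so which option is better depends on how much data you can afford to lose.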