Adjusting General Models
The discussion below describes steps you can take to improve model results. Click on the links for detailed instructions for specific ChaosHunter Screens.
General In the ChaosHunter, the larger you make any of the parameters, or the more of anything that you select, the more difficult your problem becomes and the harder it may be to solve. On the other hand, selecting too little of anything may leave the tool too restricted to get good answers. Let experience be your guide, because building good models is more of an art than a science. Below we give you our ideas based on our own experiments to get you started on learning the art of model building with ChaosHunter.
Selecting Data It is best if the data you use is highly representative of the problem itself. You want to include all variables that may have an effect on the output for which you desire a formula, and you want to include many different combinations of those variables in the data.
If you want to select one set of data for building the formula and a subsequent set for testing the formula, make sure that both sets are representative. If you do not, your formula may not work well on the test data.
Selecting Inputs The ChaosHunter will itself choose which of the inputs you have selected will participate in the formula. However, you can assist the process by making a judicious selection of the inputs to begin with. If you choose a huge set of variable columns as inputs, the ChaosHunter may have to work very hard and very long to find the best ones. If you already have an idea which variables will be most effective in a formula, then select only those as inputs. The fewer you select, the easier the job will become. However, the ChaosHunter may still reject some of your inputs for the formula, and you cannot control that.
On the other hand, if you only have a few inputs for your model, you might select some technical indicators to apply to your existing inputs in order to improve your results. See Example 8 for details on how to let ChaosHunter discover input relationships.
Selecting the Output You probably already know what variable you want to be the output. This is the variable that you want the formula to compute as closely as possible. However, if your goal is to produce trading signals, then the output selected should be a price time series (usually the open or the close of the time bar). You may only select one output, but of course you can make multiple models with the data, each with a different output.
To Scale or Not to Scale If you elect to scale either inputs or the output, the ChaosHunter applies a statistical Z-score to the data prior to calculating the formula. The Z-score adjusts each data value by subtracting the mean of that data series (the mean of that variable) from the data and then dividing by the standard deviation. This “scales” the data into a more narrow range, which often results in better formulas.
On the other hand, scaled data is harder to understand because the numbers will not appear natural to you. Unscaling will be necessary if you scaled the output as well as the inputs. Inputs will have to be scaled each time they are entered into the formula. The ChaosHunter Runtime module takes care of scaling and unscaling.
Note: Scaling of inputs does not apply to technical indicators and technical indicator time series. They are never scaled.
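As an illustration of the Z-score scaling and unscaling described above, here is a minimal Python sketch. The function names are hypothetical, and the use of the population standard deviation is an assumption; the ChaosHunter Runtime module may compute the standard deviation differently.

```python
from statistics import mean, pstdev

def zscore_fit(values):
    """Return the mean and standard deviation of one data series.
    Assumption: population standard deviation (pstdev) is used."""
    return mean(values), pstdev(values)

def scale(value, mu, sigma):
    """Z-score: subtract the mean, then divide by the standard deviation."""
    return (value - mu) / sigma

def unscale(z, mu, sigma):
    """Reverse the Z-score to recover a value in natural units."""
    return z * sigma + mu

closes = [10.0, 12.0, 14.0, 16.0, 18.0]   # made-up sample column
mu, sigma = zscore_fit(closes)
scaled = [scale(v, mu, sigma) for v in closes]
restored = [unscale(z, mu, sigma) for z in scaled]
```

The same mean and standard deviation must be kept and reused whenever new inputs are scaled or a scaled output is converted back, which is why the sketch returns them separately.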
Selecting Optimization Strategy Evolution Strategy is based on genetic methods and Darwinian theory. Essentially, the computer applies time-proven techniques of selective breeding to find combinations of variables, functions, and constants that yield progressively better models to fit your data.
Swarm optimization optimizes the model with a method based on observations of how birds and insects swarm. The idea is that the swarm moves itself toward more optimal solutions.
Usually, Evolution Strategy works best, especially if there are a large number of variables or you have chosen a large number of functions. Swarm optimization tends to work better under the opposite conditions, or when you have run Evolution Strategy for a long time and not much progress is being made.
Selecting Population The larger the population size you choose, the greater the chance that effective models are found. Larger populations, however, require much more time to process, so that new populations are created much more slowly. Finding a good balance is important, and the most effective population sizes we have found are from 100 to 500. If you have a large data set (say over 20,000 rows) then smaller populations may be in order. If you have a lot of time to run your models, you might try populations of 1000 or 2000.
Selecting Random Number The random number choice allows you to select different random seeds. No random number seed has any more probability than any other of producing the best solution. Often if optimization doesn't produce good results, then stopping and restarting from the beginning with a new seed can be helpful.
Selecting Stopping Criteria It will usually be better to continue optimizing until at least 200 generations have passed without improvement. If you are very familiar with your problem, and you are running with slightly newer data, you might have observed that some number of generations (N) seems to work well, in which case you might choose to stop after that number of generations. In our modeling, we prefer to select neither, and simply watch the progress, stopping the optimization manually when we feel that progress is no longer being made.
You should understand that finding an analytic model with traditional functions can take a great deal of time. In our work, optimizing for 2 to 6 hours is not uncommon, and even longer for difficult problems.
At the end of the day, the best advice we can give you is to stop optimization when you feel the model is good enough as measured by the various criteria, or when you feel reasonable progress is no longer possible. Experience is the best teacher here.
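The "generations without improvement" stopping rule discussed above can be sketched as follows. This is only an illustration of the counting logic, not ChaosHunter's implementation; the function name, the `patience` parameter, and the assumption that higher fitness is better are all hypothetical.

```python
def run_until_stalled(fitness_per_generation, patience=200):
    """Consume one best-fitness value per generation and stop once
    `patience` consecutive generations bring no improvement.
    Returns (best fitness seen, number of generations run)."""
    best = float("-inf")
    stalled = 0
    generations = 0
    for fitness in fitness_per_generation:
        generations += 1
        if fitness > best:          # improvement: reset the counter
            best = fitness
            stalled = 0
        else:                       # no improvement this generation
            stalled += 1
            if stalled >= patience:
                break
    return best, generations

# Made-up history: improvement stops at generation 4.
history = [0.1, 0.3, 0.3, 0.5] + [0.5] * 300
best, generations = run_until_stalled(history, patience=200)
```

With this history the loop runs 204 generations: the last improvement occurs at generation 4, followed by 200 generations without progress.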
Selecting Optimization Goal Functions There are five goal functions at the top for curve fitting, and two at the bottom for financial trading models. Click here for a technical description of these goals.
R-Squared is a good general purpose goal for all types of data where you are trying to fit curves. That is because in all cases your goal is to get close to 1.
The Mean Squared Error goal is useful when you want the closest fit you can get between actual and predicted values. It tends to work on shrinking the largest errors first. However, it is difficult to judge by looking at the mean squared errors whether the fit is good enough.
The Correlation goal is useful when you are not as much concerned about actual vs predicted as you are with whether the predictions generally move in the same direction as the actuals move.
Maximize % same sign is really a classification goal, useful for making models to classify data in one of two categories instead of curve fitting. It assumes that one category is characterized by zero or positive actual answers, and the other is characterized by negative ones.
Maximize number within tolerance is best used when your goal is to get all predictions somewhat close to the correct answer, rather than getting some very close while others aren’t close enough. With this goal the optimizer stops working on results that are already within tolerance, so that it can concentrate on those that are not.
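To make the five curve-fitting goals above concrete, the following Python sketch computes each one from lists of actual and predicted values. These are the standard textbook definitions and are an assumption here; ChaosHunter's exact formulas are given in its technical description of the goals. The function names are hypothetical, and treating a zero prediction as belonging to the positive category mirrors the rule stated for actual values.

```python
from math import sqrt

def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def correlation(actual, predicted):
    """Pearson correlation: do predictions move with the actuals?"""
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)

def pct_same_sign(actual, predicted):
    """Percent of rows where both values fall in the same category:
    zero-or-positive versus negative."""
    same = sum((a >= 0) == (p >= 0) for a, p in zip(actual, predicted))
    return 100.0 * same / len(actual)

def count_within_tolerance(actual, predicted, tolerance):
    return sum(abs(a - p) <= tolerance for a, p in zip(actual, predicted))

actual = [1.0, -2.0, 3.0, -4.0]       # made-up data
predicted = [1.5, -1.0, 2.0, 1.0]
```

On this sample the last prediction has the wrong sign and a large error, so the same-sign score is 75% and three of the four rows fall within a tolerance of 1.0, while the mean squared error (6.8125) is dominated by that single row.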
Buy/sell cutoff is for building trading models, where you are interested in making the most profitable trade timing models. The idea is to create a formula such that whenever the formula is greater than some number x, a buy trade will take place on the next bar (row of data in the time series). Whenever the formula is less than or equal to some number y, a sell trade will take place on the next bar. The numbers x and y are found by the optimizer in the Threshold range you set. For example, if the threshold range is 5, x and y will both be greater than or equal to -5 and less than or equal to 5.
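The buy/sell cutoff rule can be sketched in Python as below. The function and parameter names are hypothetical, and checking the buy condition before the sell condition is an assumption made for this sketch; the text above does not say what happens if a value satisfies both.

```python
def cutoff_signals(formula_values, buy_above, sell_at_or_below):
    """Return one signal per bar. The signal on bar i is decided by the
    formula value on bar i-1, because trades take place on the next bar."""
    signals = [None]                      # first bar has no previous value
    for value in formula_values[:-1]:
        if value > buy_above:             # formula > x: buy on next bar
            signals.append("buy")
        elif value <= sell_at_or_below:   # formula <= y: sell on next bar
            signals.append("sell")
        else:
            signals.append(None)          # between the cutoffs: no trade
    return signals

def in_threshold_range(x, y, threshold_range):
    """Both cutoffs must lie within plus or minus the Threshold range."""
    return all(-threshold_range <= c <= threshold_range for c in (x, y))

values = [0.5, 2.0, -1.5, 0.0]            # made-up formula outputs
signals = cutoff_signals(values, buy_above=1.0, sell_at_or_below=-1.0)
```

Here the value 2.0 on the second bar produces a buy on the third bar, and the value -1.5 on the third bar produces a sell on the fourth.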
Use buy/sell true/false instead if your formula is supposed to produce true/false results (see the description of the XOR example for more details on true/false results).
Shares, commissions, and Smooth Equity Curve are discussed in the topic Building Trading Models.
Selecting Max Equation Size The equation size is the maximum number of operations and operands, including constants, that can be used in the final formula (equation). Setting this is where you may have to use some judgment. If you are convinced the problem can be solved by a short formula, then an equation size of 20 to 25 may be sufficient. Otherwise, a number from 40 to 60 will work better, because the ChaosHunter will actually be working on several formulas at once if it has enough space here. However, a larger max equation size can slow down progress, so we rarely recommend a size greater than 120.
Selecting Max Same Symbols Here again some judgment may be necessary. If this parameter is set to 3, then no input can appear in the formula more than three times. Setting the number too large can slow down evolution. In our experiments we have found that 3 to 6 is usually sufficient.
Selecting Max Constants If you feel constants may be important in your model, you should probably use about 10 max constants. The actual maximum number of constants in the formula will be Max Same Symbols times Max Constants. So if Max Same Symbols is 3, and Max Constants is 10, you could see up to 30 constants that take on no more than 10 distinct values. In other words, there could be up to 3 clones of each constant in the formula.
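The arithmetic above can be checked with a small Python sketch. The function names are hypothetical and the code only restates the two limits described in this topic; it is not part of ChaosHunter.

```python
from collections import Counter

def max_total_constants(max_same_symbols, max_constants):
    """Upper bound on constants appearing in a formula."""
    return max_same_symbols * max_constants

def constants_within_limits(constants, max_same_symbols, max_constants):
    """True if the constants use at most `max_constants` distinct values
    and no value is repeated more than `max_same_symbols` times."""
    counts = Counter(constants)
    return (len(counts) <= max_constants
            and all(n <= max_same_symbols for n in counts.values()))

total = max_total_constants(3, 10)   # 3 clones of each of 10 values
```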
Selecting Constant Range If you use scaling of your inputs and output, constants between plus and minus 5 should work well. If you do not scale at all, select some number in the range of what the output variable might be. Different constant ranges may lead the optimizer to solve the problem differently.
Selecting Max Lookback Period This parameter is only used if you have opted to use technical indicators, and you have selected the inputs to be used as parameters to those technical indicators. The lookback will be the most likely max distance back in the time series that data can be found to affect the output. For price time series used in trading models, we like 20 to 30. If you are looking for more frequent trades, a smaller number such as 5 may work better.
Note that if the optimizer selects a lookback of x in any technical indicator, the formula cannot calculate until at least row x.
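The effect of a lookback on the first computable row can be seen with a simple moving average, used here only as a stand-in for a technical indicator. The function name is hypothetical, and the sketch assumes the lookback window includes the current row.

```python
def simple_moving_average(values, lookback):
    """Average of the last `lookback` values. Rows that do not yet have
    enough history are returned as None."""
    out = []
    for i in range(len(values)):
        if i + 1 < lookback:                      # fewer than `lookback` rows so far
            out.append(None)
        else:
            window = values[i + 1 - lookback : i + 1]
            out.append(sum(window) / lookback)
    return out

closes = [10.0, 11.0, 12.0, 13.0, 14.0]           # made-up prices
sma = simple_moving_average(closes, lookback=3)
```

With a lookback of 3, the first two rows have no value and the indicator (and therefore any formula that uses it) first produces a result on row 3.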
Selecting Operations This is perhaps the toughest group of settings to decide upon.
We feel you can almost always get good results using operations in the Neural Category if your inputs are either scaled or naturally occur in the range plus or minus 3.
No matter what else you use, the operations in the Arithmetic and Algebra categories can sometimes be helpful.
The Boolean and Relational categories should only be used when your variables are binary (true/false), as in the XOR example.
Statistical and Polynomial categories may be added if your problem defies solution with the other categories.
We sometimes use the Trigonometry operations of Sine and Cosine when we want to make more chaotic functions. The Ln and Log functions are helpful sometimes when you are not scaling.
Using Chaos Input Most definitions of chaotic functions provide for an input which is the last value the formula produced. Turn this option on if you have a time series where you believe that new outputs are affected by previous ones, or that you are indeed dealing with chaos. We feel that price time series are almost always so affected. We have no guidance for the initial value except to say that it might be of experimental value.
For the current row of data, the program uses the value computed by the formula on the previous row of data. On the very first row (row #1) the chaos variable is set to an initial value (0 by default). The formula outcome on row #1 is calculated using input data from row #1 and the initial value (0) for the chaos variable. On the second row (row #2) the formula uses inputs from row #2 and the calculated formula value from row #1 for the chaos variable, and so on. Note that it can take quite a few rows of data before the ChaosVar takes on meaningful values (it is not possible to determine how many rows are needed for any particular case or even in general, but we guess at least 10 to 50). If you look at an out-of-sample test, the ChaosVar starts with its initial value. That means it can take quite a few rows of data until the formula is really "up to speed" with respect to the ChaosVar. Out-of-sample testing is really most accurate if the formula has had previous rows to include in the calculation of ChaosVar. So when used for trading, as an example, trading signals will be more accurate if the out-of-sample set is not evaluated by itself. See Building Trading Models for more details on using the Chaos Input in trading models.
Using Basic Technical Indicators These should only be used in a time series, especially price time series. Experienced traders will be able to choose their own favorites.
Selecting Optimization and Out-of-sample Sets If you have plenty of data rows, you may want to build your model on one set of rows (the optimization set), and test the model on another set of rows (the out-of-sample set). In the ChaosHunter, the optimization set will be some number of rows at the physical beginning of the data, and the out-of-sample set will be the remainder of the rows. This arrangement suits time series, in which the later rows are the newer ones.
After the model is built on the optimization set, you can execute (fire) the model on the optimization set, the remaining out-of-sample rows, or both. The results on the out-of-sample set may be different depending on whether it is executed alone or with the optimization set. These differences will occur if the formula contains either the chaos variable (chaosvar) or technical indicators, which both look back in the time series.
If both sets are executed together, we assume it is one continuous time series so that there is no delay in calculating the formula when the out-of-sample set begins, as the lookback can occur in the optimization set. The value of the chaosvar is obtained from the optimization set. If you are plotting an equity curve, it grows continuously from the optimization set.
If the out-of-sample set is executed by itself, the formula cannot be calculated until there are enough previous rows to accommodate the lookback of the technical indicators. The chaosvar starts with its initial value, not a value obtained from the optimization set. Equity curves start at zero.
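The difference between firing the out-of-sample set together with the optimization set and firing it alone can be illustrated for the chaos variable with a short Python sketch. The `fire` function and the example formula are hypothetical stand-ins, not ChaosHunter code; the only behavior taken from the text is that the chaos variable holds the previous row's formula value and starts at an initial value (0 by default).

```python
def fire(formula, rows, initial_chaos=0.0):
    """Evaluate `formula(x, chaosvar)` row by row, feeding each output
    back in as the chaos variable for the next row."""
    outputs = []
    chaosvar = initial_chaos
    for x in rows:
        y = formula(x, chaosvar)
        outputs.append(y)
        chaosvar = y                  # previous output becomes next input
    return outputs

def example_formula(x, chaosvar):
    """Made-up formula that blends the input with the previous output."""
    return 0.5 * x + 0.5 * chaosvar

rows = [2.0, 4.0, 6.0, 8.0]           # made-up single-input data
split = 2                             # first 2 rows: optimization set

together = fire(example_formula, rows)          # one continuous series
alone = fire(example_formula, rows[split:])     # out-of-sample by itself
```

Fired together, the out-of-sample rows produce 4.25 and 6.125 because the chaos variable carries over from the optimization set. Fired alone, the same rows produce 3.0 and 5.5 because the chaos variable restarts at 0.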