*Out-Of-Sample Testing*

When I introduced the momentum system on this site I did not use any form of optimization. My parameters were based on my understanding of what the system was measuring; how the markets moved and changed behavior (including correlations between asset classes); and what the strengths and weaknesses of the system might be – and hence how we might accommodate a variety of assets and market conditions, assuming we would rather accept lower returns with low risk than chase potentially higher returns with correspondingly higher risk. I claimed (without any proof or backup) that these parameters **should** be “robust”.

Somehow (I can’t imagine why 🙂 ) it was difficult for members to accept my unproven claim of robustness, and so I got talked into trying to provide some proof. We began this series of tests thanks to help from Ernie and Herb, who have provided the tools for more rigorous automated back-testing.

As we have stated many times, optimizing the parameters used in investment strategies is a dangerous exercise and often leads to systems that disappoint going forward. This is why we spend so much time emphasizing the importance of robustness and playing down the need for absolute optimization – we need some room for the inevitable times when our “optimal” parameters are not optimal.

The classical way of testing **any** system is to test it using Out-of-Sample data, i.e. using assets and time frames different from the initial data set used to develop the system. This is particularly important if parameters have been **optimized** using a specific set of historical data.

Although we have tried to be relatively unbiased in that we have consistently used the Rutherford portfolio asset list as our diversified benchmark representation of global market asset classes, we have inevitably migrated towards an optimization of parameters for this specific portfolio over the period 2006-present for which historical data is available. In Part 3 of the study we attempted to simulate Out-of-Sample tests by randomly shuffling sections of the raw data to generate pseudo price profiles for the assets – and this provides a reasonable alternative to using real Out-of-Sample data when available data records are short. However, it is not ideal.

In this Part 6 of the study we move to a new portfolio of assets with records going back ~20 years. Since ETFs had not been “invented” 20 years ago, we use Mutual Fund data; many of these Funds are now essentially duplicated by “equivalent” ETFs.

The assets chosen for the study were:

VTSMX – Vanguard Total Stock Market Index

VTMGX – Vanguard Developed Markets Index

VEIEX – Vanguard Emerging Markets Index

VBLTX – Vanguard Long-Term Bond Index

VBIIX – Vanguard Intermediate-Term Bond Index

PYGFX – Payden Global Fixed Income

VGSIX – Vanguard REIT Index

FIRAX – Fidelity Advisor International Real Estate

PSPFX – US Global Investors Global Resources (Commodities)

VGPMX – Vanguard Precious Metals and Mining

VFISX – Vanguard Short-Term Treasury

These assets essentially represent all the major asset classes, as do the Rutherford assets. The primary difference is the inclusion of an intermediate-term US Bond Fund (VBIIX) rather than the inflation-protected bond ETF (TIP) included in the Rutherford. Our advantage here is that we have data back to at least 1994 for most (though not quite all) assets. We can therefore run a 20-year test that includes both the 2000 bear market following the tech bubble and the 2008 bear market following the financial crisis.

Let’s first split the 20-year time frame into two 10-year periods – January 1995 to December 2004 and January 2005 to Present (August 2015) – and look at the performance curves using the reference “Kipling” model with ROC1 = 60 trading days (87 calendar days), ROC2 = 100 trading days (145 calendar days) and Volatility = 14 day mean variance. Overall ranking is calculated from a 30%/50%/20% weighting of these three measures. VFISX is used as the momentum ranking filter.
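To make the scoring concrete, here is a minimal per-asset sketch of such a weighted score. This is illustrative only: the post does not define “mean variance”, so I read it here as the mean absolute deviation of daily returns, and the way the three components are combined in the actual Kipling workbook may differ (e.g. it may rank each component across assets before weighting).

```python
def momentum_score(prices, roc1=60, roc2=100, vol_window=14,
                   weights=(0.30, 0.50, 0.20)):
    """Illustrative weighted momentum score for one asset.

    prices: list of daily closes, most recent last.
    ROC1/ROC2 are rates of change over 60 and 100 trading days;
    "volatility" is taken here as the mean absolute deviation of the
    last 14 daily returns (an assumption). Volatility enters with a
    negative sign: lower volatility ranks better.
    """
    r1 = prices[-1] / prices[-1 - roc1] - 1.0       # ~60-trading-day ROC
    r2 = prices[-1] / prices[-1 - roc2] - 1.0       # ~100-trading-day ROC
    window = prices[-(vol_window + 1):]
    daily = [b / a - 1.0 for a, b in zip(window, window[1:])]
    mean = sum(daily) / len(daily)
    vol = sum(abs(d - mean) for d in daily) / len(daily)
    return weights[0] * r1 + weights[1] * r2 - weights[2] * vol
```

Under the ranking filter, any asset whose score falls below that of VFISX would be excluded from selection.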

The first period (Jan 1995 – Dec 2004) is **totally Out-of-Sample**:

In the above figure, VTSMX is plotted (red line) as a reference (not a suitable benchmark, since it does not reflect a globally diversified portfolio). As can be seen, the momentum model does not (on average) beat the performance of VTSMX (Total US Equity market) over this period. However, it does avoid the significant 43% drawdown experienced in the US equity markets over the two-year 2000-2002 period – primarily due to the application of the VFISX ranking filter. Although the momentum model does not “beat” VTSMX in terms of total return, its performance, considering volatility (~15%) and drawdown (~21%), is probably far more acceptable to most investors – even though there would almost certainly have been a level of unhappiness during the 1995-2000 technology bubble. [With the benefit of hindsight, performance could probably have been improved by including a technology index (Nasdaq 100) fund in the asset mix – but that’s hindsight we didn’t have.]

If we now look at the 2005 – Present period we see the following performance:

Now we see that the “Kipling” model handily beats the return of the VTSMX throughout the period with similar volatility (16.7%) and drawdown (19%) to that observed in the earlier period – i.e. volatility and drawdown are consistent. As a result of the rank filtering we also avoid the 49% drawdown generated in US equities during the 2008 financial crisis.

However, we have to remember that this is the same time period as was used to “optimize” the parameters for the Rutherford portfolio – therefore we might expect performance for the Fund portfolio to be closer to optimal over this time period. This suggests that the “optimized” parameters are not optimal for the 1995-2004 period – which is exactly what we should expect, and why we look for robust parameters.

Looking at performance over the entire 20-year period we get the following picture:

The above plot may not look too impressive, particularly in the early years, and should probably be plotted with returns on a logarithmic scale so as to compare percentage moves rather than emphasizing the exponential growth that results from compounding. Numerically, however, the Compound Annual Growth Rate (CAGR) for the first 10 years is still a respectable 9.95%; for the last 10 years it is 13.13%, and the 20-year value is 11.47% – so the picture is maybe a little misleading.
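For reference, CAGR and total return are linked by the standard compounding identity. A tiny helper (illustrative only – the quoted figures come from the simulations themselves, so small differences from this back-of-envelope identity are expected):

```python
def cagr(total_return_pct, years):
    """Compound Annual Growth Rate implied by a total % gain over `years`.

    For example, a 100% total return (a doubling) over 10 years
    compounds at roughly 7.2% per year.
    """
    return ((1.0 + total_return_pct / 100.0) ** (1.0 / years) - 1.0) * 100.0
```

E.g. `cagr(100, 10)` ≈ 7.18.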

Average total portfolio return is 862.6% (11.47% CAGR) but, as in the Rutherford back-tests reported in Part 5 of the study, there is a wide variance of 231.8%, resulting in a 95% probability of total returns lying anywhere between ~862.6 +/- 463.6% (2 SD), i.e. between ~400% and ~1,300% – quite a range, due to the effects of compounding over a 20-year period.
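The arithmetic behind that band is simply the mean plus/minus two standard deviations:

```python
mean_return, sd = 862.6, 231.8      # % total return and its SD, from the text
low = mean_return - 2 * sd          # ~399%  (the "~400%" above)
high = mean_return + 2 * sd         # ~1326% (the "~1,300%" above)
```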

So let’s see what happens if we apply the tranching model to the analysis. The following figure shows average total returns (left hand axis, solid lines) and P(0.9) Returns, the value representing a minimum return with a 90% probability that this value will be exceeded (right hand axis, dashed lines). These returns are plotted as a function of maximum tranche look-back period (as described in Part 5 of the study):

As we saw in Part 5 of the study we see a reduction in average total return as we increase the tranche look-back period. However, we also see a range of look-back periods, between ~5 and 15 days, where P(0.9) Returns are not significantly affected by tranching (although expected average return may be lower as the length of the look-back is increased). Separation periods of 1-3 trading days are to be preferred.
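Based on the description in Part 5, one plausible sketch of how tranched review dates are staggered is shown below. The function name and scheduling details are my own illustration, not the actual back-testing code:

```python
def tranche_review_days(n_tranches=11, separation=1,
                        review_period=21, n_reviews=12):
    """Hypothetical sketch of tranched review scheduling.

    Each tranche follows the same fixed review cycle (every
    `review_period` trading days), but successive tranches start
    `separation` days apart. The portfolio outcome is then the average
    over n_tranches staggered review-date sequences, rather than one
    "lucky" (or unlucky) single sequence.
    Returns a list of review-day indices for each tranche.
    """
    return [[t * separation + k * review_period for k in range(n_reviews)]
            for t in range(n_tranches)]
```

With 11 tranches at 1-day separation, the first reviews span 10 trading days – the tranche “look-back period” plotted in the figure.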

As before, we can compare the variance between standard “Kipling” tests – with no tranching – to a tranching model. Performance of the portfolio using an 11 tranche sequence with 1-day separation is shown below:

Comparing this with the no-tranche figure above we see that the variation in total returns is far “tighter”, but that this comes at the expense of the average total return, which is reduced to 773.6% (from 862.6%). We therefore need to ask ourselves *“Are we more comfortable accepting a system that has a 90% probability of generating a minimum 637% return with the possibility of an (average) 774% return, or do we prefer to accept a 90% probability that we generate at least 565% with the possibility of an (average) 863% return?”* The answer to this question will obviously be different for each investor, depending on their personal situation and tolerance for risk. This will determine whether the tranching model is a viable model for them and what parameters might be used. As Ernie has noted, using a tranching model is likely to increase trade frequency (and hence costs) a little, although tests indicate that this ranges from an average of ~2 trades per period with no tranching to a maximum of ~4 trades per period with many tranches. At the level of tranching we are looking at here (up to ~10 tranches) our studies show that we are probably looking at a maximum of ~3 trades per period.

For the sake of completeness the following show the results of the tranche tests over each 10-year period for comparison with the no-tranche plots.

**Conclusions:**

- The standard “Kipling” momentum ranking model, when applied to Out-of-Sample data using “optimized” parameters, provides acceptable performance with consistent profitability together with acceptable volatility and draw-downs. However, performance may not be optimal, since the parameters were not optimized for the Out-of-Sample period. This is **always** to be expected – the question remains *“Do the parameters used represent a robust set of parameters that generate acceptable results with minimum risk?”*
- Uncertainty due to the “luck” of the review date sequence can be reduced/minimized by using a tranching model in combination with the momentum ranking model. However, while this may reduce the probability of poor minimum returns, it may result in a lower maximum and average expected return.

While writing this post I received a message from Lowell asking why the momentum model did not match the performance of VTSMX in the 1995-2000 period – especially since VTSMX is available for inclusion in the portfolio.

I think there are at least 2 reasons for this:

- The model (as tested) requires two assets ranking higher than VFISX to be included in the portfolio – therefore the maximum allocation that can be assigned to VTSMX is 50% and we cannot beat the performance of VTSMX unless another asset is performing better. This is why an Index Fund such as VTSMX is not a good benchmark for a portfolio designed for diversification. It is only appropriate if the objective is to beat/match that index – in which case the entry/exit rules and portfolio makeup would probably be significantly different e.g. the index might simply be broken down into component sectors and bonds and other asset classes might be ignored.
- We know that momentum, by its nature, is sensitive to look-back period in detecting trends – the shorter the look-back period, the faster it will be to pick up a trend and also to exit the trend. However, even in a bullish market, there will be pullbacks and we may encounter whipsaws if we are using an inappropriate look-back period. If we were to use Gary Antonacci’s 12 month look-back period we might see better performance over this 1995-2000 period (I don’t know because I haven’t tried it) and over 40 years this might be a better choice of parameter – but, then again, performance in the 2005-2015 period might be inferior.
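The first point can be made concrete with a minimal sketch of the selection rule as described above (the function name, equal weighting, and top-n handling are illustrative assumptions, not the Kipling code itself):

```python
def allocate(ranked, filter_asset="VFISX", top_n=2):
    """Illustrative sketch of the selection/filter rule.

    `ranked`: tickers sorted best-first by momentum score.
    Only assets ranking above the short-term treasury fund (the
    filter asset) are eligible; unfilled slots default to the
    filter asset itself. Each slot gets an equal 1/top_n weight.
    """
    eligible = []
    for ticker in ranked:
        if ticker == filter_asset:
            break                      # everything below the filter is excluded
        eligible.append(ticker)
    picks = eligible[:top_n]
    weights = {t: 1.0 / top_n for t in picks}
    if len(picks) < top_n:
        # Remaining allocation parks in the "safe" filter asset
        weights[filter_asset] = (top_n - len(picks)) / top_n
    return weights
```

So even when VTSMX is the top-ranked asset it can never receive more than a 50% allocation under a top-2 rule, which is why the portfolio cannot beat VTSMX unless the second slot also outperforms.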

Out of interest, after receiving Lowell’s question, I decided to go back to my original **non-optimized** parameter settings of ROC1 = 63 trading days (91 calendar days), ROC2 = 126 trading days (182 calendar days) and Volatility = 10 day mean variance. I also went back to my original weights of 50%/30%/20% respectively since I’d always thought that weighting the short-term momentum period higher would be preferable in terms of the intermediate-term time horizon that I was most concerned about.

The results are shown below:

Average total returns are 1077%, compared with 863% for the tests using “optimized” parameter values (12.56% CAGR vs 11.06% CAGR). Volatility and draw-down remain about the same.

If I were to choose to reduce the variance due to check-date “luck” I might use a tranche model with a 10-trading-day look-back (2 weeks’ data) and 1-day separation between tranches:

This reduces my average expected return to 880.5% but gives me a 90% probability of making at least 692%. However, this is where we have to be a little careful since this is exactly the same P(0.9) return we might expect from using the basic no-tranche Kipling model that has a higher average expected return.

Finally, one additional observation. Some members may ask if there is an advantage in only selecting the top ranked asset rather than top 2. I ran a quick test using 2-day separation tranches and compared the average and P(0.9) returns for different look-back periods:

As we might expect, average total returns are reduced by increasing diversification (2 assets vs 1); note, however, that variance is also reduced, since the P(0.9) returns are higher for the 2-asset scenario than for the 1-asset case.

**Summary:**

Optimization is fine and can be a great learning experience, but don’t get too hung up on the optimized numbers. As we have seen in these studies, optimized parameters determined from limited data are not likely to be optimal under all market conditions – however, they might still be considered robust in that they should still generate acceptable, if not optimal, results.

I rest my case that my original un-optimized parameters are still robust – they are not optimal for the 2006-2015 period, but they still generate acceptable returns and they perform better over the 20-year 1995-2015 period than the “optimized” parameters.

David and Ernie

John Dishman says

David & Ernie,

Again, very informative! I wonder if you’ve seen the article by Tucker Balch, the Georgia Tech (my alma mater) professor who teaches a very popular online course on computer-aided investing. Here’s a link to his blog on backtesting:

http://augmentedtrader.com/2015/04/27/9-mistakes-quants-make-that-cause-backtests-to-lie/

A quote that stands out from this entry is:

“By the way, even though it could be called “out-of-sample” testing it is not a good practice to train over later data, say 2014, then test over earlier data, say 2008-2013. This may permit various forms of lookahead bias.”

I’m not sure what he means by “lookahead bias” in this context, but your results indicate, as you’ve stated, that optimization in one period doesn’t necessarily mean optimization in another period. Your work in going back to your “non-optimized” look back periods seems to be consistent with Balch’s statement. It appears your “non-optimized” (for the later period) parameters were more optimized for the earlier period. I wonder what would happen if you went back using the mutual funds you show for the earlier time frame (1995-2004) and did a true optimization of the look back periods. How different would those look back periods be from the current optimized values used in the Kipling 8.0 model? An interesting exercise but maybe too much work for the payoff achieved.

John D.

HedgeHunter says

John,

Thanks for the reference – anyone considering back-testing should be aware of the pitfalls outlined, and this is an excellent review of them. I was well aware of the “less than desirable” choice of using older data for the Out-of-Sample tests, but there wasn’t much choice since our focus on this site has been on current use of “modern” ETFs – and that data is limited. As you suggest, we could go back and use the 1995-2004 Fund data for optimization and then test the parameters generated there on 2005-2015 data – but this is getting a bit academic and would probably lead to the same conclusion, i.e. that parameters optimized on one set of data will not be optimal for a different set of data. So it’s probably not worth the time and effort – time is probably better spent looking for “ideas” that might offer an improvement to the basic concepts rather than tweaking the parameter values.

I totally agree with the author’s statements:

“Simple approaches that arise from a basic idea that makes intuitive sense lead to the best models. A strategy built from a handful of factors combined with simple rules is more likely to be robust and less sensitive to overfitting than a complex model with lots of factors.”

David

HedgeHunter says

A final comment on the issue of “lookahead bias”. This refers to the fact that optimization is based on the availability of forward-looking information that would not have been available at the time of the earlier tests – therefore might be biased. As I state above this is a “less than desirable” situation from an academic point of view – but difficult to quantify in terms of detracting from validity. It might be more of a concern to me if the tests using earlier data were strongly supportive of the “optimized” values. As it stands, I might conclude that knowing what the market will look like ahead of time doesn’t help me decide what I should do today 🙂

David

James Vitale says

David and Lowell:

Very interesting and thank you. Whew, didn’t quite absorb all of it – but quick question…..The results of the Rutherford portfolio back tests that you reported on October 13, 2014 had a 19.4% CAGR; 7.4% MaxDD for a 2 asset/equal weighted portfolio; test period was 6/30/06-6/28/15.

These results are significantly different than those reported above for the (almost) same time period. Should the backtest results from October 13, 2014 now be considered invalid??

https://itawealth.com/2014/10/13/rutherford-portfolio-back-tests/

Many thanks and good luck,

Jimmy V

HedgeHunter says

Jimmy,

No, the results of the manual, single-run back-test reported in the October 13 post are not invalid – they just represent one of the curves in the top quartile of curves similar to those shown in the second figure of the above post – a consequence of the check-date “luck” factor. Results are not directly comparable since the assets are slightly different and the parameters differ from the original test, but this is a great example of why we caution members to be careful, and skeptical, of back-tests presented for any system where an element of “luck” may be involved. In this case, the luck element is the check-date sequence – over which we have no control. Even if we say we will review every 33 days, does this sequence start on the 1st day of the month or the 30th? Outcomes will be different. This is not such an issue if we are testing a system with portfolio “switches” based on signals that may occur on any day – but for an “investment” system that reviews on a calendar schedule (remaining in “neglect” mode between reviews), it becomes an uncontrollable parameter that introduces uncertainty into final performance outcomes.

Bottom line: if using the Rutherford portfolio of assets it is reasonable to assume that one should generate a return of ~12.56% using the parameters used in the October 13 test (3rd figure from the bottom) – but, if lucky, it is not unreasonable to realize a shorter-term (10 yr) return of 19.4%.

I hope this makes sense. I’m sure that most of us would be quite happy to accept an average CAGR anywhere north of 10% – it is fruitless to worry about the accuracy of these returns to 2 decimal places – they are just numbers that come out of computer simulations – don’t get hung up on them unless they’re negative 🙂 .

David

James Vitale says

Dave: Understood and thanks for the reply.

Jimmy V

James Vitale says

David

Back at ya….. Given the results from the tests in this blog, would you now suggest that the maximum DD for the Rutherford portfolio is 30%+ and the average drawdown ~20%, vs. the 7.4% max DD noted in the Oct 2014 backtest?

Many thanks,

Jimmy V

HedgeHunter says

Jimmy,

Yes, assuming that the Rutherford ETF assets behave in a similar way to the funds (a good assumption) I’d plan for a 20% DD. We don’t have any SD data related to the drawdowns, but 7.4% looks very optimistic and anything north of 35% very pessimistic. This assumes a 33-day neglect mode is adopted.

Use of Option hedges and/or stops that trigger mid month MAY help reduce this risk – but that’s not for everyone, so, assuming a KISS scenario I’d work on the assumption that you’re likely to see a 20% DD over a 10 yr period. If this is a big concern then I’d certainly recommend putting an emergency circuit breaker in place – even if it negatively impacts the upside returns – that’s generally the trade-off.

David

James Vitale says

Dave

Good call; thanks

Jimmy V