Back Testing Issues: Remain Skeptical of Results

January 12, 2015 By Lowell Herr

Ernest Stokely was kind enough to spend time learning a new programming language in order to run back-tests on the ETFs used to populate ITA portfolios.

There is an abundance of noise or uncertainty when analyzing data for back-testing. Here are a few examples.

Actual Buy and Sell prices will not match back-testing Buy and Sell prices.
Variations in start and end dates will impact results.
Back-tests use end-of-day prices, while portfolio managers are buying and selling securities throughout the trading day.
The selection of the review period (33 days for the ITA portfolios) will impact results.
Limit orders placed by portfolio managers will lag the orders assumed in back-testing studies, since a limit order may not be filled for several days, if at all.
The inclusion and frequent use of TLT may be an outlier as it is highly unlikely TLT will repeat the performance it had over the past eight years. This is a specific example of how back-tested results can be misleading.

Ernie ran a number of tests that show us that we need to be careful not to “hang our hats” on one set of results. In other words, be skeptical of back-testing results.

I’ve extracted two paragraphs from Mr. Stokely’s article, which is no longer available.

“The conclusion of both noise tests is that Monte Carlo runs should be made when looking at the absolute return numbers from any single back-test of a given portfolio management technique. Serendipitous choices of date sequences can result in either highly inflated expectations of a given technique, or it can give unrealistically low expectations for that technique.

One of the things highlighted by this study is the sensitivity of the method to noisy changes in the parameters. Because of the compounding effect, noisy influences can become magnified. A side note of this point is that small fortuitous investment decisions can pay off handsomely in the long run through the compounding effect, a fact that is well known but was not fully appreciated by the author until this study.”

Anyone who has worked with the uncertainty of measurement understands exactly how the “noise” is compounded, particularly when values are multiplied or divided.
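
To make the compounding point concrete, here is a minimal Python sketch. It is not taken from Mr. Stokely's study; the mean return, the noise level, and the function names are assumptions chosen only for illustration. It compounds a series of noisy per-period returns many times and compares the relative spread of the final values for a short and a long horizon.

```python
import random
import statistics


def terminal_values(n_periods, mean_return=0.01, noise_sd=0.02, n_runs=2000, seed=42):
    """Compound n_periods noisy returns, n_runs times; return the final growth factors.

    mean_return and noise_sd are illustrative assumptions, not estimates.
    """
    rng = random.Random(seed)
    finals = []
    for _ in range(n_runs):
        growth = 1.0
        for _ in range(n_periods):
            growth *= 1.0 + rng.gauss(mean_return, noise_sd)
        finals.append(growth)
    return finals


def relative_spread(values):
    """Standard deviation divided by the mean (coefficient of variation)."""
    return statistics.stdev(values) / statistics.mean(values)


short_horizon = relative_spread(terminal_values(12))
long_horizon = relative_spread(terminal_values(96))
print(f"relative spread after 12 periods: {short_horizon:.3f}")
print(f"relative spread after 96 periods: {long_horizon:.3f}")
```

Because the per-period factors are multiplied, their relative uncertainties accumulate, so the longer horizon should show the wider relative spread even though the noise in each period is the same.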

As a counterpoint, remember that back-testing is about as good as it gets when one is trying to develop robust portfolio management models.


Comments

David Brown says

January 12, 2015 at 1:41 AM

Excellent work, thank you for putting the time in to complete this.
My takeaway: all of the Rutherford Monte Carlo results still beat buy and hold.

- HedgeHunter says
  
  January 12, 2015 at 1:22 PM
  
  David,
  
  I think you’ve got the right message – and this is what I mean when I keep referring to a “robust” system – you can torture it many ways and the outcome is still “acceptable”. In this particular case I think very acceptable with an average “expected return” of ~ 240% over 8 years 🙂
  
  David
  
  - Ernest Stokely says
    
    February 5, 2015 at 12:24 PM
    
    Totally agree with David. When I do a run now I am looking primarily at 3 things: how well behaved are the different curves (i.e., even if the mean looks nice I don’t want a method that has wild swings of the back test curves), how does the mean return look in reference to an index, and finally, what is the standard deviation of the return, telling me something about the volatility and lack of robustness?
    
    I would caution against putting too much emphasis on just the average return number of the Monte Carlo runs. Just remember … you could be riding all or part of one of those lower individual runs just by picking a few wrong days to balance your portfolio using the momentum method!!
    
    Ernie
    
    - Ernest Stokely says
      
      February 5, 2015 at 12:30 PM
      
      ps I forgot to say in any of my replies that I am using “long only” portfolios and have not implemented shorts. Could be done but I have not done so at this point.
      
      Ernie
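
The checks Ernie describes above (mean return against an index, standard deviation of the return, and the risk of riding one of the lower runs) can be computed from the final returns of a set of Monte Carlo runs. The following Python sketch is only an illustration: `summarize_runs` is a hypothetical helper, and the run values are made up rather than taken from the study. It looks only at final returns, so judging how well behaved the curves are would still require the full equity curves.

```python
import statistics


def summarize_runs(final_returns, index_return):
    """Summarize the final returns of many back-test runs against a benchmark.

    final_returns: total return of each Monte Carlo run, as fractions (2.4 = 240%).
    index_return: total return of the reference index over the same period.
    """
    mean_return = statistics.mean(final_returns)
    beating = sum(r > index_return for r in final_returns)
    return {
        "mean": mean_return,
        "stdev": statistics.stdev(final_returns),
        "worst": min(final_returns),
        "best": max(final_returns),
        "mean_excess_over_index": mean_return - index_return,
        "share_of_runs_beating_index": beating / len(final_returns),
    }


# Made-up numbers for illustration only; they are not results from the study.
runs = [2.9, 2.1, 2.6, 1.7, 2.4, 3.0, 2.2, 2.3]
summary = summarize_runs(runs, index_return=1.8)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Reporting the worst run next to the mean keeps the caution above in view: an investor experiences one path, not the average of all of them.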
      
Herbert Haynes says

January 12, 2015 at 6:02 AM

Excellent work, Mr. Stokely!

HedgeHunter says

January 12, 2015 at 6:23 AM

This is an excellent Study – I highly recommend that all Platinum members read it carefully and get an understanding for the strengths and weaknesses of backtests. This applies to all backtests – not just those presented on this site for our portfolios, but for all backtests – even for Gary Antonacci’s Dual Momentum backtests in his book.

Ernie has done a great job here – it takes me ~ 1 day to run a single backtest manually using the Ranking SS!

David

VernFighter says

January 12, 2015 at 6:34 AM

Great Pic Lowell!!

- Physlab says
  
  January 12, 2015 at 9:21 AM
  
  Vern,
  
  I tried to find a photograph that I thought fit the concern for back-testing results. Perhaps it was extreme, but at least interesting. As I recall, the photo was taken on a street in Budapest, Hungary.
  
  Lowell
  
Antonio Lapicca says

January 12, 2015 at 8:37 AM

Very interesting; it is nice that you have coded the ranking code in R. What libraries did you have to download? Also, would you mind sharing the code?

I had the same idea, but I am only halfway through the work. These are my notes (not as thorough as yours!):

http://1drv.ms/1BZd4qm

- Antonio Lapicca says
  
  January 14, 2015 at 12:38 AM
  
  I took my time reading the document Ernest Stokely wrote. This is great research!
  
  I must confirm that momentum works, but you have to select very carefully the portfolio of securities or ETFs to be ranked.
  
  The Monte Carlo sensitivity analysis is a wonderful way of looking at what to expect from future returns when applying a trading model.
  
  Thank-you for sharing
  
  - Ernest Stokely says
    
    February 5, 2015 at 12:12 PM
    
    Hi Antonio,
    Sorry for the delay. I have been head-down in the code and not checking ITA Wealth lately. I would be happy to share the code, but right now it is so hacked up from trying many models for portfolio selection and portfolio allocation, it would be pretty much useless to anyone else and an embarrassment to me to put the code out there in its present form. It would be no easy task for me to clean up the code enough that anyone else could use it, so for now I’m sorry, it’s not available. One final thing. If you don’t know R, you won’t be able to run the code even if it were cleaned. I don’t know of an easy way to hide R behind a GUI and make it easy for the non-R user to utilize. Maybe others on the thread can point me to some resources for doing that. So, that is another excuse for not disseminating the code. I sent a copy to Lowell, but I think he would agree it’s too much of a time investment to even learn the R environment so you can run the code I sent him.
    
    Yes, you are correct. Having several different portfolios with different brokerage houses, I have found that not all ETFs are created equal! Another thing I have substantiated (so far) is something HedgeHunter has pointed out on the site before: large baskets of ETFs perform worse than a few (< 12) well-chosen, uncorrelated assets to which one applies the momentum/portfolio selection methods. I hope to look at this in more detail at some point.
    
    The back tester has been an eye-opener for me! Now I know where these financial managers get their advertising material from. They find a back-testing curve that is produced by “serendipitous” selection of checkup dates and that’s the one they use. Nothing illegal about that. Just a bit unscrupulous. Also, I would never spend money now to buy time on one of the back test sites where you get a single curve. Sure, it will give you some ideas about your method, but you don’t know how that single sample curve stacks up with the variables you are dealing with in managing the portfolio (like when you do the periodic adjustment).
    
    Please see my reply to Bernard below for more musings.
    
    Ernie
    
  - Ernest Stokely says
    
    February 5, 2015 at 12:19 PM
    
    Answer to your first question:
    
    Libraries: zoo, TTR, tseries, gplots, fpc – I thought I was using quantmod but it’s not in the list. ??
    
    If you are using R, you know there is a back-tester in one of the libraries. It just didn’t give me the flexibility I was looking for and I wanted to first emulate 7.1.3 for validation and comparison purposes. So, I struck out on my own.
    
    Your results are very nice and much neater than my own! Nice work. It looks like you could wrap your current code in a loop that does the Monte Carlo runs if that is your main interest.
    
    Ernie
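
The idea of wrapping a back-test in a loop that does the Monte Carlo runs can be sketched as follows. This is Python rather than R, and it is not Ernie's code: the synthetic random-walk prices, the simple "hold the best trailing performer" rule, the 60-day look-back, and the 23-trading-day review period (roughly 33 calendar days) with a jitter of up to 5 days are all assumptions made for illustration.

```python
import random
import statistics


def make_prices(n_days, drift, vol, rng):
    """Synthetic daily closes from a random walk; stands in for downloaded price data."""
    prices = [100.0]
    for _ in range(n_days - 1):
        prices.append(prices[-1] * (1.0 + rng.gauss(drift, vol)))
    return prices


def backtest(prices_by_asset, check_days, lookback=60):
    """Hold the asset with the best trailing return, re-picking on each check day."""
    n_days = len(next(iter(prices_by_asset.values())))
    check_days = set(check_days)
    value, held = 1.0, None
    for day in range(lookback, n_days):
        if held is not None:
            # Earn today's return on the asset chosen at an earlier close.
            p = prices_by_asset[held]
            value *= p[day] / p[day - 1]
        if day in check_days or held is None:
            # Re-rank at today's close; the new choice applies from tomorrow.
            held = max(
                prices_by_asset,
                key=lambda a: prices_by_asset[a][day] / prices_by_asset[a][day - lookback],
            )
    return value - 1.0  # total return as a fraction


def jittered_schedule(n_days, period, max_jitter, rng, start=60):
    """Check days every `period` trading days, each shifted by a random offset."""
    days = []
    for nominal in range(start, n_days, period):
        shifted = nominal + rng.randint(-max_jitter, max_jitter)
        days.append(min(max(shifted, start), n_days - 1))
    return days


data_rng = random.Random(1)
prices = {
    "A": make_prices(1500, 0.0004, 0.010, data_rng),
    "B": make_prices(1500, 0.0003, 0.015, data_rng),
    "C": make_prices(1500, 0.0002, 0.006, data_rng),
}

# The Monte Carlo loop: same data, same rule, only the check days vary.
jitter_rng = random.Random(2)
results = [
    backtest(prices, jittered_schedule(1500, period=23, max_jitter=5, rng=jitter_rng))
    for _ in range(200)
]
print(f"mean {statistics.mean(results):.2%}, stdev {statistics.stdev(results):.2%}")
print(f"worst {min(results):.2%}, best {max(results):.2%}")
```

Only the schedule changes between runs, so any spread in the results comes from the choice of check days alone, which is the sensitivity the study highlights.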
    
Bernard Kruyne says

January 12, 2015 at 10:05 AM

Excellent work, Mr. Stokely. When I spoke with David F about backtesting he also mentioned R, but I did not realize at the time that you were already well on your way to mastering this programming language.

I share some of the apprehensions about backtesting that your work has so clearly identified. I thought that it might be caused by data issues. Interestingly, you did not mention data integrity in your Word document. That leads to my first question. What data source do you use and how does this data source deal with distributions and splits/reverse splits?

Surprisingly if I understand your document correctly, you use a formula to convert calendar days to trading days. I imagine that will still remain a source of inaccuracy, given the definition of a trading day.

When I discussed my back-testing approach, still using Excel and Yahoo data, with HedgeHunter, I thought that I could do a number of manual backtests in Excel, based on the trading days reported in Yahoo, in a time-efficient manner. I have since suspended that approach as it causes my PC to work too slowly and hence makes all activity frustratingly time intensive. Your example of mastering R to improve the back-testing will be my inspiration to learn Python, yet another language, which is necessary to access the data at Quantopian.com. I believe, from what I have read about Quantopian, that it offers a unique high-integrity data source (although the site is aimed more at algorithmic day traders). Furthermore, I believe that Quantopian allows an approach where the thoughts and skills of a growing group of similarly interested people can be mobilized / solicited. For us at ITA, it offers the possibility to compare yet another time-efficient back-testing method with the two already discussed here.

I have as yet not read any other Platinum ITA members refer to Quantopian and Python. If anyone on this site reads this and does have some experience with either, I am VERY interested to hear from you.

Similarly, in this forum there are several readers of the work of William J. Bernstein. In one of his recent books, Rational Expectations, he also touches on data and data integrity, confirming the lack of quality of Yahoo data and the availability FOR FREE of the Kenneth French data library ( http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html ). I am still reading the book, but intend to check the usability of these data when I complete it; perhaps this is an option to avoid learning a new language? Does anyone have experience with the Ken French data source to which Bernstein refers?

Like HedgeHunter, I have also escaped the uncomfortable South Ontario cold. I will make learning more about Quantopian and Python my activity program while on the Costa Blanca in Spain. Python must be easier to learn than Spanish and is also likely to have a better impact on my liquids intake while here. If and when I have any back-test results through that approach I will share them with this group.

BeDutch

- Ernest Stokely says
  
  February 5, 2015 at 11:57 AM
  
  Hi,
  Sorry for the long delay in a reply. Let me take your questions one at a time, as I understand them.
  
  First, R. I undertook this as a challenge for an old guy to learn a new language. My background in computer languages began with FORTRAN 63 and it more or less petered out with C++ and the rise of OOPLs. I am far, far from being anything even close to an accomplished R programmer, but one of many advantages of R is the copious amount of help material on the Internet. Many of my R statements are taken from forums on the language. As further background I wrote a parallel workbook in Excel similar to 7.1.3 but added some things I wanted to look at like MPT and the ability to manually choose the selected portfolio and then look at its metrics (beta, alpha, Sharpe ratio, expected return, etc.). I developed a mild hatred for VBA in that process. Dealing with R and the steep hill of learning the language is like a breath of fresh air compared with hacking VBA. It is a very powerful language … quirky, but powerful … and I recommend it to anyone with an interest in programming, especially programming financial models. Python is a sister language, and although I don’t know Python it apparently has many of the structures that R has.
  
  I can run in about 6 seconds a back test that takes almost a day with the current 7.1.3 back test setup, but in fairness to David I think he would agree that this workbook has been cobbled a bit to do back tests and if one were designing an Excel back tester, you wouldn’t do it the way it is implemented in 7.1.3. The difference in these speeds, btw, is not CPU time, it is the manual cutting and pasting and manual operations required in 7.1.3. For example, I read all of the data over a large time range into the program and write it to a file. I can then quickly access that file and not have to download from the Internet to access closing prices.
  
  Converting calendar days to trading days – I’m not sure I understand your point. The main thing I discovered when trying to validate the back tester against 7.1.3 was it makes a HUGE difference how you “sample” the market with the periodic updating! This was a surprise to me and I believe to David. It points out a couple of things: a) investing has a high degree of luck in it even with the best algorithm for trading and b) compounding of the investment means unusual gains or unusual losses get propagated forward in time making a bad trading loss difficult to recover from (said very clearly by Faber in his Ivy book, “Never lose!”). So, if you pick X calendar days to perform some task and you convert that to Y = 252/365 * X trading periods/days, I don’t think it matters for what we are doing here. I go in and add random noise to the day the portfolio is checked to create the curves in the report, so the exact day when a process is begun is not important. But, maybe I misunderstand your point.
  
  As for data, you raise a good point … to a point! 🙂 I am using Yahoo finance downloads, and yes, I have heard there are errors and that Yahoo sometimes updates some of the data sets. I am using adjusted closing prices for all computations. The reason I say “to a point” is I am more focused on looking at relative issues … relative returns, relative stability, relative robustness … and as long as I am using the same data set I don’t think it matters for my own purposes.
  
  I am no longer so concerned about reading the absolute return number from a back test. If the mean is above a chosen index fund, if the curves are well behaved, and the standard deviation is controlled, for me at least that is what matters in drawing conclusions about some trading algorithm.
  
  One final word. As I was puttering with the back tester and trying things I accidentally entered a set of parameters that amounted to a buy-and-hold strategy over the past 5 years for a U.S. equity ETF. At first I thought I stumbled upon the Holy Grail until I got a very good laugh at what was happening. If you pick the right sector in a long bull market you will make a LOT of money!!! But, wouldn’t we all like to know how to do that and still control risk?
  
  I hope this addresses your questions. Thanks for the post. Go for Python. I would like to hear about your experience. I will stick with R and keep trying to understand its intricacies like, e.g., its several data structures that are different from any language I have encountered.
  
  ErnieS
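
The calendar-to-trading-day conversion mentioned in the reply above, Y = 252/365 * X, is simple enough to write down directly. The Python sketch below is an illustration only; the function name is hypothetical, and rounding to a whole number of days is an assumption, since the reply does not say how fractions are handled.

```python
TRADING_DAYS_PER_YEAR = 252   # approximation used in the reply above
CALENDAR_DAYS_PER_YEAR = 365


def calendar_to_trading_days(calendar_days):
    """Approximate Y = 252/365 * X, rounded to a whole number of trading days."""
    return round(calendar_days * TRADING_DAYS_PER_YEAR / CALENDAR_DAYS_PER_YEAR)


for x in (33, 91, 182):
    print(x, "calendar days ~", calendar_to_trading_days(x), "trading days")
# 33 calendar days ~ 23 trading days
# 91 calendar days ~ 63 trading days
# 182 calendar days ~ 126 trading days
```

The rounding error is at most half a trading day, before allowing for holidays. If the check day is randomly shifted by several days anyway, that error is small by comparison, which is consistent with the view that the conversion does not matter much here.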
  
Lowell Herr says

February 5, 2015 at 12:34 PM

Ernie,

Thanks for the detailed responses. I wanted to extract and amplify on something you wrote above. Here goes.

“a) investing has a high degree of luck in it even with the best algorithm for trading and b) compounding of the investment means unusual gains or unusual losses get propagated forward in time making a bad trading loss difficult to recover from (said very clearly by Faber in his Ivy book, “Never lose!”)”

Take the Copernicus and, to a lesser degree, Pasture as portfolio examples. When these portfolios were launched, I assumed the market was quite high so I did not recommend investing all the available cash in the ETFs that make up the Strategic Asset Allocation plan as laid out in the Dashboard for each. Big mistake, as the market continued to march forward and up. It is going to be very difficult to recover the divergence between the portfolio IRR and the market IRR. When the market declines, these two cash-heavy portfolios gain ground on the benchmarks. But when the market moves up, and that is the general trend, both portfolios lag even further behind the market.

The same holds true for setting limit orders. The market can easily walk away from a limit order if the buy price is set too far below the current price.

Lowell
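
Both effects described above, the cash drag and the limit order that the market walks away from, can be illustrated with a few lines of Python. This is a simplified sketch under stated assumptions: the function names are hypothetical, a buy limit order is assumed to fill whenever the day's low is at or below the limit price, cash is assumed to earn nothing unless a rate is given, and the prices are invented.

```python
def first_fill_day(daily_lows, limit_price):
    """Index of the first day a buy limit order would fill, or None if it never does.

    Simplifying assumption: the order fills whenever the day's low is at or
    below the limit price.
    """
    for day, low in enumerate(daily_lows):
        if low <= limit_price:
            return day
    return None


def blended_return(invested_fraction, market_return, cash_return=0.0):
    """Return of a portfolio that is only partly invested in the market."""
    return invested_fraction * market_return + (1.0 - invested_fraction) * cash_return


# Hypothetical daily lows for a rising market, for illustration only.
lows = [50.6, 50.4, 50.1, 50.9, 51.5, 52.0]
print(first_fill_day(lows, limit_price=50.2))   # limit close to the market
print(first_fill_day(lows, limit_price=49.5))   # limit too far below: never fills

# A portfolio that is 60% invested, in an up market and in a down market.
print(blended_return(0.6, 0.10), blended_return(0.6, -0.10))
```

With 40% in cash the portfolio gives up part of a 10% gain but also avoids part of a 10% loss, so it gains on the benchmark in declines and lags it in advances.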

Antonio Lapicca says

February 6, 2015 at 3:21 AM

Ernest
may I recommend you have a look at these two links:
1) http://www.rstudio.com/
2) https://systematicinvestor.wordpress.com/

- Ernest Stokely says
  
  February 6, 2015 at 7:15 AM
  
  Thanks, Antonio. Yes, I did the development with RStudio. It’s a great package for development and I think it is a MUST for anybody who wants to learn R or do any kind of programming in R. It makes debugging and exploring “what-ifs” very easy. The 4 windows give one a command line console, an editor with the ability to run code line by line, a help/charting window, and a window that shows the value of the variables as they are being run. It’s a powerful development environment.
  
  Yes, I have communicated with David F. about David Varadi’s work with momentum. I have been so involved with the development of the back tester and using it to explore the momentum methods of 7.1.3, I haven’t had time to dig into Varadi’s work. There seem to be many methods for generating the momentum signals. It would be nice if there were a paper out there comparing them, the pros and cons of each. Do you know of such a document? HedgeHunter, if you are reading this post, can you comment on your experience with momentum algorithms?
  
  Ernie
  
Antonio Lapicca says

February 10, 2015 at 3:09 AM

Hello,

If anyone is interested, there is a free Stanford online course on Statistical Learning (with R) starting on 19 Jan 2015: an introductory-level course in supervised learning, with a focus on regression and classification methods.
https://class.stanford.edu/courses/HumanitiesandScience/StatLearning/Winter2015/about

VernFighter says

February 16, 2015 at 11:47 AM

Hello Ernie,
Was just wondering if you have run any backtests using the asset classes that show high absolute acceleration and are on the bottom of the 90/180 day ranking list? My thought would be to buy assets that are “on sale” but show positive movement that might have put in a floor? Thanks for all the work and contributions!

VernFighter

Ernest Stokely says

February 16, 2015 at 2:17 PM

Hi Vern,
No, I have not, but that is an interesting thought. Are you saying you take the momentum rank (as per 7.1.3) of, say, the bottom N assets and you pick the top M of those, where “top” are those in the N set having the highest acceleration?

Ernie
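
The two-step selection asked about above (bottom N by momentum rank, then the top M of those by acceleration) might look like the following Python sketch. It assumes the momentum and acceleration scores have already been computed elsewhere; the function name, the placeholder ETF labels, and the scores are hypothetical, and nothing here reproduces the 7.1.3 ranking itself.

```python
def bottom_then_accelerating(momentum, acceleration, n, m):
    """Take the n lowest-momentum assets, then keep the m with the highest acceleration.

    momentum, acceleration: dicts mapping asset name to a precomputed score.
    """
    bottom_n = sorted(momentum, key=momentum.get)[:n]
    return sorted(bottom_n, key=acceleration.get, reverse=True)[:m]


# Placeholder assets and made-up scores, for illustration only.
momentum = {
    "ETF_A": 0.12, "ETF_B": 0.08, "ETF_C": -0.03,
    "ETF_D": -0.06, "ETF_E": 0.01, "ETF_F": -0.01,
}
acceleration = {
    "ETF_A": -0.01, "ETF_B": 0.00, "ETF_C": 0.04,
    "ETF_D": -0.02, "ETF_E": 0.03, "ETF_F": 0.01,
}

print(bottom_then_accelerating(momentum, acceleration, n=4, m=2))
# ['ETF_C', 'ETF_E']
```

A stricter variant could also drop any candidate whose acceleration is not positive, to match the idea of buying assets that are "on sale" only once they show positive movement.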

- Lowell Herr says
  
  February 16, 2015 at 7:15 PM
  
  Ernie,
  
  I thought you ran a back-test on selecting those ETFs with the top absolute acceleration and the results were not all that great. Perhaps I am mistaken.
  
  Lowell
  
  - Ernest Stokely says
    
    February 17, 2015 at 7:19 AM
    
    Hi Lowell,
    Yes, that was what I saw when I used acceleration in place of the ROC-volatility metric for choosing the portfolio, then using equal weighting for the allocation. But Vern is suggesting picking the *bottom* N assets from the ROC-volatility metric step and then using acceleration to make the final selection. Is that correct, Vern?
    
    I would suggest postponing this experiment for a short while. I am working with HedgeHunter on attempting to construct a more realistic market return scenario than the last 8 years we are using. We are all rather skeptical about the results of backtesting in this sustained bull market where U.S. equities have dominated. It’s a very different segment of the stock market from anything in recent history and backtesting with this scenario may well lead us to some conclusions that are not valid once quantitative easing has stopped and the off-shore markets begin to sputter and come to life. The other thing that is weird about the recent market is the lack of diversity much of the time between the sectors. So, I suggest we postpone for a time further back tests until we see if we can artificially create a more realistic 8 year market segment. More on that later if anyone is interested.
    
- VernFighter says
  
  February 17, 2015 at 5:44 AM
  
  Yes Ernie, something like that…not sure if changing to a shorter timeframe such as 15 day/30 day might be better (in order to get the asset early) or letting time smooth out the selection process?
  
  - HedgeHunter says
    
    February 17, 2015 at 6:24 AM
    
    Vern,
    
    I think using something like 15/30 day momentum is too fast for a “simple” momentum “investment” strategy – there will be far too much “trading” and whipsaw action. As noted in my recent Rutherford portfolio review (https://itawealth.com/2015/02/16/rutherford-portfolio-review-february-15-2015/) TLT and VNQ would be at the bottom of the rankings rather than at the top (using the 91/182 day periods).
    
    As Bob W has reminded us, most investors prefer to keep things “simple” – so I think 15/30 is not a practical option for a “simple” system. However, for investors prepared to put a little more effort into risk management, 15/30 may be a possible option in setting an exit/protection signal. Of course, even simpler (risk management) options might be considered e.g. a Moving Average crossover or the “standard” ITARR (195 EMA) stop.
    
    I think using 15/30 and absolute acceleration as an entry would probably not meet the “average investor’s” objectives or even be effective (the latter is a “gut” feeling; I’ll leave it to Ernie to provide any evidence he might have).
    
    David
    
    - Ernest Stokely says
      
      February 17, 2015 at 7:32 AM
      
      Vern and David,
      I tend to agree with David, but see my post earlier above. I have played around with look-back values for the 7.1.3 momentum model (values currently used are 91-182-63, where the first two values are calendar days and the third is in trading days). One set of values I arbitrarily tried with the back tester was 30-90-25. I never went as short as 15 days. The 30-90-25 combo did not perform well at all. But I would say that is an anecdotal observation, and I have not tried to exhaustively test the look-back values.
      
      Ernie
      
      - VernFighter says
        
        February 17, 2015 at 3:04 PM
        
        No problem, thanks for the invaluable insight, and I look forward to the results of further back-tests!
        