## “Of Course Everyone Should Learn from the Past”, He Had Said

## False Tests of Discovery Over Chance

Here you are going to be reading about some of the basic principles of “hypothesis testing”. That's what statisticians call it. If that sounds too boring, you might consider that it is someone's life savings that are at stake, perchance yours. No kidding. This isn't an idle exercise. There will be a certain amount of specialization to suit our particular purposes, but the principles are applicable to diverse fields of inquiry. Such as what? Beyond finance, markets and economics... how about medicine? (And soon, for your entertainment and enlightenment, we'll specialize to painting houses!) In the development of pharmaceuticals, for example, the testing that goes on isn't called “backtesting”, but it is closely analogous to what we must pursue with financial markets. The hypothesis is that drug XYZ will alleviate a certain condition, and the company hopes to go forward with selling the drug. So prior to the company's managers making their decision to proceed, backwards in time from that, volunteers were given the drug and the results noted and analyzed.

In finance we don't put volunteers at risk when we're backtesting. But we put ourselves and our customers at risk if we yearn so hard for a wonder drug, so to speak, a magic potion that will put us on Easy Street, that we do our backtesting in a stupid way. Unlike the pharmaceutical companies, with computerized portfolio management we're pushing an “algorithm”. That's really just a giant formula that computes the position sizes that you should have for each security going forward, from one trading day to another (or from one month to another if you prefer). Our hypothesis is that if we use the algorithm, then the condition of our being poorer than we'd otherwise be will be alleviated, and so we should go forward with it.
In place of volunteers we have imaginary versions of ourselves using the algorithm in past years, and we ask the question “If those imaginary versions of us had done back then what we propose to do now, how would they have fared?” It sounds so straightforward. In a way, it is. Those who have thought about it a great deal can usually get it about right. But they who don't are legion. And you can find them everywhere: all over the Internet, and too often at the office of your financial planner or investment advisor, or even at a university. Indeed, the failed analysts are so numerous that one person of experience at doing some of what I do has notified me that the very word “backtest” is now utterly toxic, like garlic to a vampire. That would be thanks to inept practitioners who either didn't know what they were doing or didn't admit to the residual uncertainties that cannot be removed regardless of the analyst's skill and understanding. So then, does the fact that some backtesters have blotted their copybooks have you wanting to proceed with the greatest of care? Good.

The rub is that there's always the possibility that those imaginary versions of us, doing back then what we propose to do now, were just lucky: that whatever success they had with our algorithm would have been due entirely to chance (if somehow they had way back then gotten the idea of using the algorithm— they didn't— that we have only recently invented, after very possibly having been influenced in some way by what has happened since). This gets interesting. We'll deal with it.

## An Example

This won't be a practical example; it is almost ridiculous; it's really a parable. And so it is entirely made up out of whole cloth. It's about painting houses. Real painters know what to do and don't need this kind of help. But we don't want any previously-developed convictions about financial markets acting as intrusive thoughts, not as we begin to go about figuring out the tricky parts of backtesting. Hence painters, not brokers.
So suppose that you're a big-time painting contractor, and you have a lot of new work coming up, starting about now. One of your standard paints is in reality a two-part system: the basic paint plus an additive that is needed to make the paint cure or dry to a nicely hardened finish with good coverage. So you want your painters to put in the right amount of additive, and you've been keeping records of the different amounts of additive that your painters have used and their results. Let's say that the painters have been using between one and five measures of the additive per gallon, as they see fit (the measure could be some little thing like one of the smaller kitchen measuring cups). And your inspector has some way of measuring the hardness of the paint after it's had time to cure and dry, and also some scheme for grading the quality of the coverage, and so is able to come up with a combined hardness-coverage score as a measure of the quality of the result. Let's say that the quality scores are either “bad”, “acceptable” or “good”.

Suppose also that you the contractor have been collecting these data, the amounts of additive used and the quality scores, for quite some time— for years. And so you're finally getting around to deriving from the data the best amount of additive to use going forward, the opinions of your painters notwithstanding. Of course you're going to pick the number of measures per gallon that historically yielded the highest percentage of “good” quality scores and go with that. Right? What else? And so with that approach did you the painting contractor live happily ever after? We'll soon see.

## Actual Painters, 10; Contractor, 0

So did the houses all get painted very well, with our contractor not having to pay for redoing any of the paint jobs?
Sadly, no. Now the contractor had the idea that the basic purpose of the additive being supplied separately by the paint manufacturer was to give better shelf life to the paint. That is, had the additive come from the factory already in the paint, the paint would start to cure in the can if not used quickly. Indeed, that was actually part of the theory of use of the additive. And when he saw from the historical data that the best-performing number of measures of the additive produced a percentage of “good” ratings— we'll call that percentage P— that was only about 10% greater than the total percentage of “good” ratings in all of the dataset, he was not alarmed. He had always thought that while there would be an optimum amount to use, other variables that were beyond the control of his painters could affect the results (e.g., the condition of the siding, the nature of the residue from the prior paint job, the humidity, etc.) and could therefore fairly often cause an other-than-historically-optimal amount of additive, in use on any one particular house, to perform as well as or even better than the historically-optimal amount. That, plus the fact that in the modest 10% margin he smelled real money, in the form of reduced costs for repainting that would accumulate, caused him to go ahead and bet the farm on the historically-optimal number of additive measures.

But as it turned out, the credit belonged to his painters. Skilled as they were, they had all along been varying the amount of additive to suit the conditions, above all the temperature, so that every number of measures came out looking about equally good in the records and his 10% margin was nothing more than a chance fluctuation. Not understanding any of that, the greedy contractor— we have to call him “greedy” or this would not be a proper parable— fixed the number of measures to be used per gallon at the historically optimal number. Given the fluctuations, the thus-determined number could have been anything. But it happened that it was a number of measures of additive that was suitable for cool weather, and he imposed the use of it just prior to a busy summer month. He became bankrupt and when last seen was living in his mother's basement.
MORAL: If you're going to beat the professionals at their own game, first get your logic straight.

## Some Proper Procedures

Were there rules of hypothesis testing, of backtesting, of logic that the contractor should have followed? Hah! You could say that. The first rule is to remember to actually test your hypothesis.

## An Ex Ante Testing Regime: the “Walkthrough”

That's right: the contractor tested absolutely nothing. He used the history to set the number of measures of additive going forward, but he never tested what happens when someone proceeds that way. The astonishing thing is that many who promote some system or other of theirs for trading stocks do only the like of what the contractor did, yet brazenly call it backtesting. There are even online sites with associated brokerage accounts that encourage customers to use the site's software to “backtest” on their own (all the better to shirk blame), where the software only finds out what would have happened if particular parameter values had been used in the past. The customer is led to find parameter values that would have worked well in the past, if only back in the past the future good performance of those values could somehow have been known.

If the contractor had painted a few houses under his new policy and then stopped to check the results, that would have been a bit of an actual test. But to do a very good test he needn't have risked real paint jobs at all. By “test your hypothesis” we don't mean use real money, as the contractor effectively did going forward into the summer. No, no. We mean seeing what would have happened in the past had you used the scheme that you have just now defined back then. And it will be with a procedure that we'll call the “walkthrough”: at the end of each month in the historical record, use only the data prior to that month to find the seemingly optimal number of measures of additive, and then see how the jobs that used that number fared in the month that followed. He would find that things didn't go well the month following most of those prior months at those houses where that supposedly optimal amount of additive was used— for the reasons that are explained at the end of the previous page. Note that we said that the painters were good, but not perfect, and so there would be a sufficient number of painters who used the supposedly-optimal number of measures of additive even when some other number of measures would have been better, and who therefore got bad finishes, so that the contractor could have gotten enough data to see the error of his ways and to figure out what was really happening.

Certainly with the stock market we will always be able to see how we would have fared in the past with any specified program, and the walkthrough method that we have just defined will be one of our mainstays. We'll apply it to the determination of parameters that are embedded in our scheme just as the contractor should have applied it to the number of measures of additive.

## Refuting the Null Hypothesis

In “Discovery or Chance? Part B” we go on to apply the walkthrough method to our version of the momentum and relative strength approach to portfolio management, but now we introduce another, supplementary method of ensuring that we don't adopt unreliable schemes. Ultimately it's one that will help us understand how good the scheme that is derived from the walkthrough really is, before putting it into practice.
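For concreteness, the walkthrough just described can be sketched in a few lines of Python. Everything here is a hypothetical illustration, not the article's actual program: we assume a toy data layout in which the record is a list of months, each month a list of (measures, rating) pairs.

```python
def walkthrough(monthly_records):
    """Ex ante test: at each month boundary, pick the number of measures
    of additive with the best 'good' percentage in the PRIOR data only,
    then record how jobs that used that number fared the following month."""
    outcomes = []
    for t in range(1, len(monthly_records)):
        prior = [job for month in monthly_records[:t] for job in month]

        def good_pct(m):
            ratings = [r for mm, r in prior if mm == m]
            return ratings.count("good") / len(ratings) if ratings else 0.0

        best = max({mm for mm, _ in prior}, key=good_pct)
        # Next month's results at the houses that happened to use `best`:
        outcomes += [r for mm, r in monthly_records[t] if mm == best]
    return outcomes

# Toy record: 2 measures looked best in month one, yet fared badly next month.
history = [
    [(2, "good"), (3, "bad")],
    [(2, "bad"), (2, "bad"), (3, "good")],
]
print(walkthrough(history))  # → ['bad', 'bad']
```

Note that nothing after the month boundary ever leaks into the choice of `best`; that is the whole point of an ex ante test.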
Here we can get into some fairly difficult statistical science if we go very far with it. Moreover, it's pretty easy to see, with any use of it, including the analysis of the financial markets, that the null hypothesis idea that we are about to consider has some hair on it. You can worry about whether or not you're doing it right. There are indeed different conceptions and implementations of it in the academic literature on finance. But there is clearly a basic duty, to yourself if you are the backtester, to roundly refute the null hypothesis before trusting what a backtest seems to show.

So the idea is that for the painting contractor's scheme to have been viable based on the historical data, the data must have been such as to demonstrate in a statistical way that it's quite unlikely that the null hypothesis could also account for the outcome that the contractor found (recall that his outcome was simply that a particular number of measures of additive was associated with more “good” ratings than the others— that could hardly be regarded as unexpected... what else?— and that the margin over the percentage of good ratings among all of the paint jobs was a fairly meager 10%). If the null hypothesis can't be ruled out, then obviously you can't go forward with the implementation and use of a scheme with respect to which the null hypothesis is utterly antithetical. And the matter then arises— this is amusing— that we actually need to set some criterion concerning how roundly we should require the null hypothesis to be defeated before adopting and implementing our own hypothesis with respect to which it is antithetical.
It's amusing because after all of our hard work at avoiding unwarranted assumptions we will find ourselves called upon to pick that criterion more or less arbitrarily. Given that there would have been variances or fluctuations due to unknown causes, it's clearly a tall order to refute the null hypothesis with the contractor's historical data, in the face of the stated rather uniformly good performance of the painters and the fact that the largest margin of benefit that he found for one number of measures of additive over the average performance was a mere 10%. But there are mathematical ways to get the answer, whatever it is— both for the case that we stated of highly-skilled painters and for the case in which the painters instead often used the wrong amounts of additive.

## The Origin of the Null Hypothesis

Sir Ronald Aylmer Fisher (1890–1962) was a very notable English statistician, hence “Sir”, who put forth the use of a null hypothesis in hypothesis testing and figured out such things as how to calculate the odds that a given data set, such as our painting contractor's with its five additive-amount categories and three assigned finish-quality categories, is what it is due to chance alone. You can read about the null hypothesis idea here. It's the basic Fisher conception of it and of its use that we'll adopt.

We've already seen how the null hypothesis could plausibly be hard for the painting contractor to refute. Remember that the null hypothesis for him involves just supposing that any number of measures of additive used has the same probability of success. And we also learned that in reality the painters got approximately the same score distributions for each number of measures of additive, thanks to their skill in using the right amount of additive regardless of conditions.
Therefore we actually know without doing any math that assuming the probability of a painter getting a “good” outcome to be the same for every number of measures of additive would fit his data quite well— the null hypothesis would have been nearly impossible to refute. The good news is that for our own work with securities we'll be using a computationally simple and powerful method to attempt to refute the null hypothesis that's also easy to understand— because it's a bit like a simulation, just like the walkthrough method.

## Implementing the Null Hypothesis

Well, we'll soon be considering our second, more-pertinent example, about stocks and ETFs, in Part B of this article. That will be a welcome change. But when we get there we won't have just a few simple discrete categories to deal with. We'd like a way of refuting the null hypothesis (computing the odds that the data are what they are by mere chance) that is easily understandable yet flexible and easily applied to a range of circumstances, using much the same computer code each time. Enter Monte Carlo. Yes! The casino... not the French count who wasn't one of the Three Musketeers.

And we might as well also consider the alternative case of a painting contractor who, although not a smart cookie, is somehow smarter than some of his painters, some number of whom used the wrong amounts of the additive on numerous occasions. Even though his scheme was blind to the criticality of the proper number of measures and to the role of the temperature variable, if a number of his painters had been utterly terrible then the optimal amount of additive that he found from the historical data could actually have brought about a significant improvement. And had he used the proper walkthrough approach he likewise might have actually been met with improved results.
In such circumstances we still earnestly want to know the odds that the improvement could be due to chance. But it would be nice if, instead of proceeding at once with a calculation of that, we could first see in the walkthrough testing regime that we have already developed some reason to believe that it tends to set us up to automatically avoid adopting hypotheses whose seemingly good performance could be due to chance.

So, now with the walkthrough method applied to the contractor's data: if the verdict at the end of the single month that followed an optimization was that the putatively optimal amount of additive had done better, the odds of that being by chance would be something like one in two. But of course the contractor had many months of data. So if for two months in a row the contractor's optimum amount of additive produced better results, the odds of that being by chance fall to something like one in four (one-in-two times one-in-two; see the next paragraph for the qualification that would give us exactly one in four). Looking better. You get the picture. The math gets more complicated if the results weren't better for every month, which would of course be the expected circumstance, but the general tendency here is that the walkthrough method, as it involves repeating the test many times, naturally diminishes our concerns about good results having happened merely by chance... provided only that the results are chiefly good.

If the painting contractor had elected to use only a month's worth of historical data prior to each end-of-month, then each of the trials would have been independent, and if all that we were interested in was whether the next month's outcomes were better or not, then the little math problem of computing the odds that the cumulative results were due to chance is the same as the one for determining the odds that flipping a coin F times produces heads H times, where H may range substantially away from F/2. Thus the walkthrough scheme that we bestowed upon the painting contractor (and will also bestow upon all of our portfolio management programs) has a built-in tendency to indicate improbability of the null hypothesis if it shows substantial accumulated benefits over multiple trials. But we want odds, not just an indication of substantial benefits, because odds would amount to benchmarking our scheme's performance vis-à-vis the null hypothesis in a clear and meaningful way. And it is also insufficient for our purposes to classify the effect of the scheme in a simple binary way, as being beneficial or not beneficial.
We're concerned with gradations. So where do we turn for help with this matter of chance? Back to the casino, of course! And while we won't be doing anything quite as simple as flipping coins, the required computer programming is exceedingly easy. And again, the computer doesn't care about how many manipulations it has to do.
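Before we move on, the coin-flip arithmetic mentioned above is easy enough to check in a few lines of code. Here is a minimal Python sketch (the function name is ours, purely for illustration): the chance of getting at least H heads in F fair flips.

```python
from math import comb

def prob_at_least(heads, flips):
    """Probability of getting at least `heads` heads in `flips` fair coin flips."""
    return sum(comb(flips, h) for h in range(heads, flips + 1)) / 2 ** flips

print(prob_at_least(2, 2))   # two better months in a row: 0.25, i.e. one in four
print(prob_at_least(8, 10))  # 8 better months out of 10: 0.0546875, about 1 in 18
```

Even a modest run of mostly-better months quickly becomes hard for chance alone to explain, which is the built-in tendency of the walkthrough described above.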
So for the painting contractor's scheme, with or without the walkthrough, instead of just supposing that the null hypothesis is true and then trying to compute the odds that the results would be what they were on that basis, à la Fisher, we'll let chance speak for itself. We will scramble the data many, many times, N times (shuffling the recorded numbers of measures of additive among the paint jobs while leaving the finish ratings where they are), recording after each scrambling the results that would have been obtained if the historical record had instead been the scrambled version of itself. For our purposes it would suffice to record, for every scrambling, the highest percentage of good ratings that was obtained using any of the numbers of measures of additive.
Now then, let us first see what we could make of the contractor's own simple intention of finding the number of measures of additive that worked best in the past and immediately committing to using that thus-fixed parameter in the future, without prior testing via the walkthrough (we have already seen how foolhardy that could be). We simply look through our recorded results for all of the scramblings and tally up the number S of scramblings with which the highest percentage of good ratings equaled or exceeded the contractor's percentage P. The odds that a showing as good as his could have come about by chance alone are then estimated as S in N.

We have already seen that with the painters being, unknown to the contractor, very good at what they do, it would be exceedingly unlikely that the odds that the contractor's 10% margin of benefit was due to chance would be as low as 1 in 20, the usual criterion for significance. But if the painters were rather uniformly bad then that could indeed have happened. However, even if it had happened, the contractor should not have been much reassured, because we know that he chose his best number of measures of additive with hindsight only— never having tested whether hindsight works going forward, whether the historically best number of measures could continue to work well. That is why the walkthrough method is essential.
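A sketch of that scramble-and-tally test for the simple scheme, in Python; the flat list of (measures, rating) records and all of the names here are our own illustrative assumptions, not the contractor's real data.

```python
import random

def best_good_pct(measures, ratings):
    """Highest percentage of 'good' ratings achieved by any one number of measures."""
    def pct(m):
        rs = [r for mm, r in zip(measures, ratings) if mm == m]
        return rs.count("good") / len(rs)
    return max(pct(m) for m in set(measures))

def simple_scheme_odds(measures, ratings, n_scrambles=10_000, seed=0):
    """Estimate the odds (S in N) that chance alone could match or beat the
    best historical 'good' percentage, by shuffling which jobs got which
    number of measures while leaving the finish ratings where they are."""
    target = best_good_pct(measures, ratings)
    rng = random.Random(seed)
    shuffled = list(measures)
    s = 0
    for _ in range(n_scrambles):
        rng.shuffle(shuffled)
        if best_good_pct(shuffled, ratings) >= target:
            s += 1
    return s / n_scrambles
```

With the skilled painters' nearly identical rating distributions for every amount of additive, S/N comes out large and the null hypothesis survives, just as argued above; only a genuinely lopsided record drives S/N down toward the 1-in-20 neighborhood.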
It is our walkthrough method that tends to take care of that problem, as it tests, every month, the extent to which the historically best-performing number of measures of additive continues to produce good results. And how do we apply the Monte Carlo method to testing with the walkthrough implementation, subsequent to its application? It's simple. But first bear in mind that should implementation of the hypothesis with the proper, walkthrough approach produce bad results then we're done. There is no point in computing the odds that our bad results were due to chance, because we've demonstrated that a simulation of what we want to do in the future, one as pure as we could manage, didn't work in the past.

So let's now take the case of the contractor who has a number of bad painters, who wisely did the proper walkthrough, and who got outcomes that were better than the historical percentage of “good” results. What do we do? We just look at the recorded walkthrough results and thereby obtain the value P', the percentage of “good” ratings for the entire walkthrough. P' now takes the place of the P of the contractor's simple scheme, but it is drawn solely from the outcomes at those houses at which the painters happened to have used the putatively optimal number of measures of additive that was determined, month by month, from prior data in the historical record. We then form the list of those monthly optimal numbers of measures of additive, scramble that list, and apply the scrambled list to the actual historical record of finish ratings— tallying again the “good” ratings that would have occurred in each succeeding month had the scrambled value been used instead of the value that had been found during the first walkthrough, which was with unscrambled data.
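That scrambled-walkthrough tally can likewise be sketched in Python, again under assumed, illustrative data structures: months as lists of (measures, rating) pairs, and `choices[t]` standing for the putatively optimal amount found from the data prior to month t+1. The fraction of scrambles whose tally matches or beats the unscrambled one then serves as the estimate of the odds.

```python
import random

def walkthrough_good_pct(choices, monthly_records):
    """Given choices[t] = the amount picked from data prior to month t+1,
    tally the 'good' percentage among next-month jobs that used each choice."""
    ratings = [r
               for t, m in enumerate(choices)
               for mm, r in monthly_records[t + 1] if mm == m]
    return ratings.count("good") / len(ratings) if ratings else 0.0

def walkthrough_odds(choices, monthly_records, n_scrambles=10_000, seed=0):
    """Odds (S' in N) that a randomly reordered list of the monthly
    'optimal' amounts would score as well as the real walkthrough did."""
    p_prime = walkthrough_good_pct(choices, monthly_records)
    rng = random.Random(seed)
    shuffled = list(choices)
    s = 0
    for _ in range(n_scrambles):
        rng.shuffle(shuffled)
        if walkthrough_good_pct(shuffled, monthly_records) >= p_prime:
            s += 1
    return s / n_scrambles
```

Scrambling only the month-to-month assignment of the chosen amounts, while leaving the record of finish ratings untouched, is what isolates the question of whether the particular sequence of choices mattered or whether any sequence would have done as well.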
Then we compute an S' to take the place of the S in the discussion above about the contractor's simple scheme, by repeating the walkthrough for each of many additional scramblings, where S' is now the number of scramblings (with repeated walkthroughs) with which the computed average walkthrough percentage of “good” ratings was equal to or greater than the “good”-rating percentage P'. And we compute the odds as before: S' in N.

## Is This Really an Accepted Way to Proceed?

So what about our plan to refute the null hypothesis by computing the odds that the “good” rating percentage that the contractor scored with his scheme, or with the proper walkthrough version, could have been reached or exceeded due to chance? That probability is actually called the “p value”. If you doubted that what we're doing has any counterpart in medicine, you should review this nice little tutorial from the National Institutes of Health. Confidence intervals are also discussed in that article— they are an alternative to p values— and we shall on occasion also have confidence intervals to offer. See for example the panel on the right-hand side of the table that is just below the chart at the top of this page.

## We Should Go with the Worst?

If like the contractor we couldn't conceive of the walkthrough procedure, if we had taken his approach, we too would never in a million years have considered that it might be appropriate to do in the near future the opposite of what had worked best in the past. Let us first consider a terribly important matter that is glossed over if not utterly neglected in the discourse above: the virtual necessity of basing each of the walkthrough's successive optimizations on a trailing period of history of fixed length. Why is this ultimately necessary? It's because otherwise we would be testing a different scheme every month... what happens when we use 50 prior months to optimize the number of measures of additive, what happens when we use 51 prior months to optimize the number of measures, 52 months, etc.
Going from 50 to 52 isn't much of a difference but, with securities in particular, for which we might have, say, 20 years of relevant records, if at the start of our walkthrough we base the optimization on 2 years of data and at the end base it on 20 years of data, it's clear that we're not testing anything like the same scheme. And as a practical matter we could well worry about the early optimizations with 2 years of data being based only on short-term tendencies and the last of the optimizations being insufficiently responsive even to intermediate-term tendencies.

What would have happened if the contractor's painters were not good, if the contractor actually had the possibility of using the walkthrough method to improve his operations, and if he had used a fixed trailing period of, say, 6 months for each optimization? Suppose that, the right amount of additive being tied to the weather, the 6 prior months tended to point to an amount suited to seasons already past rather than to the month ahead. Then the contractor would have had chronically bad results using the 6-month trailing period to get the putatively optimal numbers of measures of additive. In fact he might get such consistently terrible results as to wonder if he shouldn't do the opposite and use the historically worst-performing number of measures instead.

We could sit in a chair and ordain that the stock market should exhibit something simple like momentum if we like, and we might be right, but for every one of us there's another who wants to tell us that the market can “get ahead of itself” and be due for a correction, or that it can become “oversold” and be due to rally— the antithesis of the persistence of momentum. And if those other guys are right then it might pay to do the opposite and assume that any show of strong momentum will be reversed.
However, that is not a question to settle from a chair; the accompanying “Chance or Discovery? Part B” article reports what was actually found when it was put to the test.

— Mike O'Connor

Comments or Questions: write to Mike. Your comment will not be made public unless you give permission. Corrections are appreciated.

Update Frequency: Infrequent, as this article is not about current market conditions or other ongoing affairs.