Go Ex Ante With Parameters and Refute the Null Hypothesis... or Else!
Hey! Is There a Momentum and Relative Strength Program That I Can Use?

Tests of Discovery Over Chance— Momentum

In this Part B we'll put into effect both the “walkthrough” method and the Monte Carlo procedure for computing p values that were introduced in Part A. And the application will be to a momentum/relative strength scheme for portfolio management that has support in academic research.

The built-in tendency of the walkthrough method to, in effect, automatically reject unreliable schemes is explained starting at the very end of page 2 of Part A, under the sub-subheading Walkthrough Implications for Odds Due to Chance. It does so by yielding a reduced value for the result that we seek to maximize, which for securities will be the risk-adjusted return. In this business, that's called a “haircut”: getting your pet program's would-have-been-rosy projections cut down to size. Your analyst has a duty to tone down his projections by taking into account the fact that optimal settings that worked best in the past cannot be trusted to work as well in the future. But the toning down must be via a mathematically reasonable procedure. The walkthrough method, by its very mathematical nature, automatically and implicitly accounts for the bulk of the shortfalls that we should expect.

But in reality we cannot leave the fixing of every parameter that specifies our program to the walkthrough procedure, because some of those parameters define the walkthrough procedure itself! And so the values of those that define the walkthrough procedure must be determined ex post, after the fact, at the end of the entire historical record by taking a backward look at it. Consequently there is a need for a second haircut, this one explicit, to atone for the fact that we can't be sure that the walkthrough-procedure-defining parameter values that worked best in the past will continue to work as well in the future (actually it's unlikely that they will). Herein a way of estimating the second haircut is provided and applied.

A Look at the Parameters

We could use a summary here of all of the parameters as their further exposition below is rather spread out over a number of paragraphs. Here are the two parameters that are to be determined by the walkthrough procedure:

  • n, the maximum number of securities held. If we have a list of candidate securities from which we intend to form a portfolio, then if we are believers in the concept of “relative strength”, which is explained below, we don't want to hold every security on that list; we only want to hold the best performers. So in that case n should be chosen to be less than the number N of securities on the list.
  • lp, the lookback period. This is the trailing interval of time over which the recent momentum of the security's price action is measured. “Momentum” is explained further below.

These parameters are both like the number of measures of additive to be used, in the parable of the painting contractor. We'll allow the walkthrough procedure to select the values that perform best on a trailing basis. That means that the walkthrough procedure will rid us of these parameters. Post-walkthrough we're left with our entire program only depending on and being specified by the parameters that define the walkthrough procedure. And here they are:

  • The upper and lower range limits of the n and lp parameters that are defined above. The lookback period lp naturally is open-ended. How far toward infinity should it be allowed to range, and shouldn't we consider the possibility that too short a lookback period could lead to volatile and low returns? Similarly, if we allow n to range as low as n=1, holding just one security, might that lead to too much volatility?
  • m, the characteristic period of the trailing exponential moving average (EMA) that is used by the walkthrough procedure to select the best-performing values of n and lp. The EMA and its use are explained below. The case of the painting contractor of Part A only making use of the data for the prior month to determine the number of measures of additive to use in the succeeding month, and then later the other case of him using 6 prior months... those intervals of 1 and 6 months are entirely analogous to m as used here.
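To make the roles of these parameters concrete, here is a minimal sketch of a walkthrough selection step. It is not the program behind the figures. The function name, the use of the per-period return as the score being averaged, and the smoothing constant alpha = 2/(m + 1) are all assumptions made for illustration; the article does not specify them.

```python
from typing import Dict, List, Tuple


def walkthrough(returns_by_candidate: Dict[int, List[float]],
                m: float) -> Tuple[List[int], float]:
    """Each period, trade the candidate whose trailing EMA score is best.

    returns_by_candidate maps a candidate setting (for example a lookback
    period lp) to the per-period returns that setting would have produced.
    ASSUMPTION: the EMA smoothing constant is alpha = 2 / (m + 1).
    ASSUMPTION: the score being averaged is the plain per-period return.
    """
    alpha = 2.0 / (m + 1.0)
    candidates = sorted(returns_by_candidate)
    n_periods = len(returns_by_candidate[candidates[0]])
    ema = {c: 0.0 for c in candidates}
    chosen: List[int] = []
    wealth = 1.0
    for t in range(n_periods):
        if t > 0:
            # The choice for period t uses only scores built from periods < t.
            best = max(candidates, key=lambda c: ema[c])
            chosen.append(best)
            wealth *= 1.0 + returns_by_candidate[best][t]
        # Only after period t has ended are its returns folded into the EMA.
        for c in candidates:
            ema[c] = alpha * returns_by_candidate[c][t] + (1.0 - alpha) * ema[c]
    return chosen, wealth
```

The point of the ordering inside the loop is that the selection never sees the return it is about to earn, which is what distinguishes a walkthrough from a hindsight choice. No position is taken in the first period because there is no trailing record yet.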

Note especially that the prospective ranges of n and lp can be substantial. If for example we start with an lp range of 12 months, 1-12, and then move the upper range limit of 12 up or down by 1 in the course of finding the optimal value for it, that's not a large percentage change in the range. Consequently in that circumstance we will find that the risk-adjusted return that we're optimizing ex post by adjusting range limits does not show much volatility with respect to such a modest adjustment of one of the range limits. If we contract the range to just a few values then we may however encounter volatility with respect to the range limit adjustments. And volatile or not we'd like to avoid greatly limiting the range as that would be expected to limit the sought ex ante effectiveness of the walkthrough procedure at implicitly administering an appropriate haircut and would complicate the job of determining the second haircut.

The walkthrough-defining parameters, m and the range limits, will be fixed by us for each portfolio by finding the values that maximize the risk-adjusted return over the entire period of record (and yes, that involves repeating the walkthrough procedure many times, with different defining parameters for each run). So we would like to think that the same values of the thus-derived walkthrough-defining parameters would also work optimally for any portfolio of a similar character (e.g., any other portfolio of equity ETFs). But... it isn't quite so. And by observing the failures of the optimal parameters for one portfolio to be quite as optimal when applied to another portfolio we can get a rough idea of an honest depth for a second haircut. Given the aforementioned subdued effects of variations of the range limits, we will find that the criticality in the risk-adjusted return is generally associated with the choice of m.
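The cross-portfolio comparison described above can be sketched as follows. This is a rough illustration only: the function name is hypothetical, the shortfall is measured as a fractional drop in the Sharpe ratio (one choice among several), and positive Sharpe ratios are assumed so that the ratio is meaningful.

```python
def cross_portfolio_shortfall(sharpe_by_portfolio):
    """Estimate how much is lost when one portfolio's best m is used on another.

    sharpe_by_portfolio maps a portfolio name to a dict {m: Sharpe ratio},
    each value coming from a full walkthrough run with that m.
    ASSUMPTION: all Sharpe ratios are positive.
    """
    # Ex post best m for each portfolio taken on its own.
    best_m = {p: max(s, key=s.get) for p, s in sharpe_by_portfolio.items()}
    shortfalls = {}
    for src, m_star in best_m.items():
        for dst, sharpe in sharpe_by_portfolio.items():
            if dst == src:
                continue
            # Fractional loss on dst from borrowing src's optimum.
            shortfalls[(src, dst)] = 1.0 - sharpe[m_star] / sharpe[best_m[dst]]
    return best_m, shortfalls
```

Fed with Sharpe ratios from actual runs, the resulting shortfalls give one rough gauge of an honest depth for the second haircut.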

WARNING: This is a rather lengthy article and it begins with the program deliberately initialized in a non-optimal way, so that we can discover things during the subsequent optimization such as the depth of the aforementioned second haircut that we must administer, and so that readers of this article who might have contrary opinions about how parameters should be set will have something to get their teeth into. And while the results that this article concludes with are credible and better than what is presented at the start, they are not really terrific (very good risk reduction, but not so hot with regard to long-term enhancement of returns). RB's New Program, which is not what is presented here, does not make orthodox use of momentum and it seems to be much more promising. It is presently undergoing testing. There may also soon be an RB program that leverages off of what was learned here about momentum, that would be suitable for small accounts holding just a few or even just one ETF.

A View of a Walkthrough-Administered Haircut

Figure 1 tells the tale of what could happen to us were we to proceed à la painting contractor, versus making full and proper use of our walkthrough procedure. It's from real price data on certain industrial sector ETFs and it shows the effect of the fact that a hindsight choice of the lookback period lp that would have produced the best cumulative return at the end of the period of record could of course not have been identified as such in real time during any of the earlier years.

Walkthrough Results

Figure 1: The upper panel is the cumulative return, what one dollar invested in a portfolio managed with our walkthrough-assisted program becomes as time goes by. The vertical line selects the date of the data of the lower panel, here the last date of record. The lower panel shows the cumulative returns that would have followed from constant use of any given one of the lookback periods of a 1-to-12-month range.

On the lower panel the cumulative return that we would have obtained had we fixed lp at 11 months is shown. “x=periods:11, y=2.40” is the not-so-user-friendly computer-generated label, and it means that with lp fixed at 11 months $1 would have become $2.40. But with our walkthrough procedure, with which the program at every step analyzes only data from the then past, it actually became only $1.26; our computer naturally was unable to figure out in advance that lp=11 would eventually, in retrospect, turn out to be the best choice. And during the same period of record a benchmark consisting of $1 divided equally among all of the ETFs, with frequent rebalancing so as to approximately maintain that allocation, would have become $1.97, much better than $1.26. We might have been encouraged, ill-advisedly, by having seen the $2.40 exceeding the $1.97, not realizing that the return that should have been expected due to choosing the best-performing value of lp would likely be more like $1.26. (For this walkthrough run n was kept constant at n=8 whereas the total number of industrial sector ETFs on the list was 17. And m was fixed at 2 months. With optimized ranges of n and lp values and an optimized m value our program does much better than shown on the top part of Figure 1.)
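For a sense of scale, the depth of this haircut can be put into a single number. The sketch below defines it as the fractional reduction in the final value of $1, which is an assumption made here for illustration; the article uses “haircut” qualitatively and does not commit to a formula.

```python
def haircut(hindsight_value: float, walkthrough_value: float) -> float:
    """Fractional reduction from the hindsight-best result to the walkthrough result.

    ASSUMPTION: the haircut is measured on the final value of $1 invested.
    """
    return 1.0 - walkthrough_value / hindsight_value


# Figure 1 values quoted in the text above.
hindsight_best = 2.40   # lp fixed at 11 months, chosen with hindsight
walkthrough_run = 1.26  # walkthrough procedure, past data only
benchmark = 1.97        # equal-weight rebalanced benchmark

figure_1_haircut = haircut(hindsight_best, walkthrough_run)
beats_benchmark = walkthrough_run > benchmark
```

With the Figure 1 numbers the hindsight projection loses a little under half of its final value, and the walkthrough result falls short of the benchmark.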

Mike O'Connor is a physicist who now develops and tests computerized systems for optimizing portfolio performance.

So we should disregard the cumulative return that would have been produced by the constant use of the hindsight-best choice for lp. It's irrelevant. And we could say the same were we to consider making hindsight choices of the very best-performing n, the maximum number of securities held (as is demonstrated in a different walkthrough haircut example that is presented below).

What would happen to us if we were to just proceed from here with lp=11, ignoring the logic of this article? No one could say. But what if after a time it seemed that lp=11 wasn't working? Wouldn't we then be called upon to re-run our test to find out if the best lp to use had changed? Oops! If we do that too often then we're doing something that's about half of the way toward adopting the walkthrough procedure. All that we would have to add would be the use of a trailing EMA instead of the entire historical record, all the better to be more responsive to shifts in market dynamics, and we'd be all the way there!

Momentum and Relative Strength— Two Peas in a Pod?

Now let's finally look at a stock market example of interest. We'll examine a fairly simple scheme that has elements that have been popularized by some advisors and pundits but that also has some support in academic literature on finance. It's so simple that we could worry that in using it we'd be quite a bit like the painting contractor supposing that professional painters don't know what they're doing— as though Wall Street wizards are stupid because they aren't using our really simple scheme. So we'll be very careful.

The Intuitive Start

Let's say that you have read about this thing called “momentum”, or the related idea which is called “relative strength”. Both of these involve the supposition that when the price of a stock or ETF has been going up vigorously, where the amount of vigor is calculated using some specified measure that is applied to the recent past, then it is more likely to continue going up than not. Relative strength is almost, but not quite, a corollary of the momentum hypothesis: it goes further and asserts that you should therefore own only the stocks or funds that are going up with the greatest vigor.

So in this article we're going to investigate the efficacy of momentum and relative strength trading as an example. It is not the best thing that we can come up with. But we'll discover some things about this approach, extend it, and conclude with a useful scheme for anyone who wants to follow suit.

Establishing Program Specifics

In order to clearly state a hypothesis, we need to define some things a bit better. We need something to serve as a proper measure of the extent to which a security that we paper-trade in the past using our scheme would have rewarded us via the “more likely to continue going up than not” claim. Let's use for that “Sharpe's ratio”, a kind of risk-adjusted return. It is described, in Sharpe's own words, here, about midway down in the left-hand column, in the block quote. So we'll use Sharpe's ratio to assess whether momentum and relative strength pay off or not. A high Sharpe's ratio would be a good payoff but we'll be interested in comparing the Sharpe's ratio that we get to that of a benchmark.
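As a reference point, here is a minimal sketch of a Sharpe ratio computed from per-period returns. It uses the common textbook form (mean excess return over cash, divided by the standard deviation of that excess return), which may differ in detail from the formulation that Sharpe himself gives. The function name and the annualization by the square root of the number of periods per year are assumptions.

```python
import math
import statistics


def sharpe_ratio(portfolio_returns, cash_returns, periods_per_year=12):
    """Annualized Sharpe ratio from matching per-period return series.

    ASSUMPTION: annualization multiplies by sqrt(periods_per_year),
    a common convention that the article does not specify.
    """
    excess = [p - c for p, c in zip(portfolio_returns, cash_returns)]
    spread = statistics.stdev(excess)  # sample standard deviation
    if spread == 0:
        raise ValueError("excess returns have zero volatility")
    return statistics.mean(excess) / spread * math.sqrt(periods_per_year)
```

The same function can be applied to the traded portfolio and to the benchmark, so that the two ratios are directly comparable.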

Given that we have a list of stocks from which we are choosing n of them to hold, we could construct a benchmark by compiling data on what would happen if available funds were allocated evenly to each of the securities on the list— almost a buy-and-hold approach, but with periodic rebalancing to ensure that the funds remained approximately equally allocated among the issues. We'll call that a “Rebalanced Benchmark” on charts and tables below.
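A rebalanced benchmark of this kind can be sketched in a few lines. The sketch assumes that equal weights are restored at the start of every period, so that each period's benchmark return is the plain average of the securities' returns; the rebalancing frequency actually used for the charts is not stated, and the function name is hypothetical.

```python
def rebalanced_benchmark(period_returns):
    """Growth of $1 held in equal weights, rebalanced every period.

    period_returns is a list with one entry per period; each entry is a
    list holding that period's return for every security on the list.
    ASSUMPTION: rebalancing happens every period at no cost.
    """
    wealth = 1.0
    path = []
    for returns in period_returns:
        wealth *= 1.0 + sum(returns) / len(returns)
        path.append(wealth)
    return path
```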

We could assess the momentum of the stock on a given day as simply the price on that day divided by the price, say, a year earlier. We need a name for that interval of time between the trailing day and the current day. It is our “lookback period”, denoted by lp. Using a simple price ratio like that has been a popular option, though obviously other fairly simple measures come to mind immediately— e.g., the difference between the price and its trailing 200-day simple moving average, and any other number of days could be used in place of the 200. The greater the simple price ratio, the higher the momentum. We'll stick with the simple price ratio.
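The simple price ratio is easy to state in code. This sketch assumes that prices are sampled once per period (say, monthly closes) and that lp is counted in those same periods; the function name is hypothetical.

```python
def momentum_ratio(prices, lp):
    """Latest price divided by the price lp periods earlier."""
    if lp < 1 or len(prices) <= lp:
        raise ValueError("need more than lp prices and lp >= 1")
    return prices[-1] / prices[-1 - lp]
```

For example, with monthly closes of 100, 102, 105 and 110, a 3-month lookback compares 110 with 100, while a 1-month lookback compares 110 with 105.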

But why determine price momentum with respect to changes over any particular lookback period? Why 6 months rather than 12 months, or 3 months instead of 5 months? Isn't that parameter a variable that is analogous to the measures-of-additive-per-gallon parameter that was of concern to the painting contractor? The contractor chose a particular number of measures of additive that maximized the average quality score. Shouldn't we choose the lookback period so as to maximize the Sharpe ratio of our returns? One would think so; we will (at least with this example). OK. So we'll consider a range of lookback periods. We could go from a one-month lookback period to a 12-month period and everything in between (because the computer never becomes tired of numbers or computations).

And in place of “you should only own the stocks or funds that are going up with the greatest vigor” let's say that you should only hold the top n stocks or ETFs on your list and should generally maintain equal dollar amounts in each of those— where the top performers are determined by the simple price ratios and where n is to be determined by the walkthrough procedure and may be substantially less than N, the total number of stocks or ETFs on your list.

But we should also take into consideration that we might fear being in even the top n securities during steep downturns. Shouldn't we then be in cash? So, let's specify our program so as to also require the price momentum ratio of the security to exceed the return ratio on cash. If there aren't n securities with a price momentum ratio that qualifies then we'll substitute cash to make up the difference, so it would be possible for us to have fewer than n securities in our portfolio or even none. We'll use the 13-week US Treasury bill bank discount rate for the return on cash, which is often streamed under the symbol ^IRX or $IRX. And note that this is not quite the same as treating cash as an additional security because we would only allow a security to have either a zero or 1/nth share of equity at the time of reallocation but we would allow cash to be as much as 100% of equity if no security had a price momentum ratio in excess of the return ratio on cash.
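The holding rule just described can be sketched as a small allocation function. The function and the "CASH" key are hypothetical names, and the cash return ratio is taken as a given input; converting the quoted Treasury bill discount rate into a return ratio over the lookback period is not shown.

```python
def allocate(momentum, cash_ratio, n):
    """Weights at reallocation time under the top-n-or-cash rule.

    momentum maps each ticker to its price momentum ratio over the lookback.
    A security is held only if its ratio exceeds cash_ratio; each held
    security gets 1/n of equity and whatever is left over goes to cash.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    qualified = [t for t, ratio in momentum.items() if ratio > cash_ratio]
    qualified.sort(key=lambda t: momentum[t], reverse=True)
    held = qualified[:n]
    weights = {t: 1.0 / n for t in held}
    weights["CASH"] = 1.0 - len(held) / n
    return weights
```

Cash is not treated as one more security here: a security's weight is either zero or 1/n, whereas cash can grow to the whole account when nothing qualifies.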

We plan to vary the lookback period over a wide range. That's because it's a parameter that will affect the outcome and so we want the best value. But what about n? That too is a parameter. If we have a list of 10 securities shouldn't we try, say, n between 1 and 10 and vary it too, along with the lookback period, to get the best Sharpe ratio, trying every possible combination of n and the lookback period? That's not too much for any self-respecting computer to handle. And this is not anything that we can avoid. We're supposed to be testing our hypothesis. So is our hypothesis that we should use an n that has historically performed subpar? Well, that would be strange. Let's use the best n. Therefore the backtesting must involve the walkthrough procedure varying the value of n to find the best performing value, just as for the lookback period lp.
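To give a feel for the size of that search, this sketch enumerates every (n, lp) pair for a list of N securities. The function name and the default range limits of 1 and 12 are illustrative; the range limits are themselves parameters that the article goes on to adjust.

```python
from itertools import product


def candidate_grid(N, max_lp=12, min_n=1, min_lp=1):
    """All (n, lp) combinations within the given range limits."""
    return list(product(range(min_n, N + 1), range(min_lp, max_lp + 1)))


grid = candidate_grid(10)  # a 10-security list gives 10 x 12 = 120 pairs
```

Each pair is a candidate whose trailing performance the walkthrough procedure has to track at every step.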

But some will say that no responsible investor would put all of the eggs in one basket, so that perhaps we should have a minimum n. We could almost call that a policy decision, an override of the math. But actually it's rational. To the extent that we have a lengthy enough period of record of data that could be considered to be representative of the market as it now is then we could let n range as low as 1 and perhaps automatically find with our backtesting that it was indeed unfavorable to have n be 1 so that a higher n would turn out to be optimal, from the math alone. So if we impose a limit on how low n can go, we have done what? We have introduced another parameter, this one of a different kind, a range limit kind, and the question arises regarding whether or not we should vary it to see if it's a critical parameter whose choice could drastically affect our outcomes. We could likewise even worry about whether it's OK for us to have arbitrarily picked 12 months as the upper limit of the range of candidate lookback periods.

And that's how we end up with the idea that parameter ranges should perhaps be constrained, or even expanded from whatever first comes to mind, and that optimization of the risk-adjusted return— we shall use the Sharpe ratio for that from now on— should be the way to ultimately set range limits.

Another Haircut Example

This time the failure is to submit n, the maximum number of securities held, to the rigors of the walkthrough procedure. Figure 2 tells the tale— the false promise in blue is the result of fixing n after the fact, ex post, so as to maximize the cumulative return.

Walkthrough Results

Figure 2: This chart shows what $1 would have become when invested in a certain list of country funds using the walkthrough method, two different ways. The blue trace is the result of the walkthrough method applied in a too-constrained manner with an ex-post-determined maximum of 5 securities held out of the 10 on the list; the cyan trace is the result of the walkthrough method with the maximum number of securities held allowed to range from 1 to 10. For both runs trading was conducted only at the close on the first trading day of each month— using the best performing lookback period out of the 1-to-12-month range, selected through the use of a trailing exponential moving average with a 2-month characteristic period.


Better Than Best?

On the way to putting together the dismal news about the terrible haircut of Figure 1, a surprise happened. With another one of the portfolios, the International one (defined below), and with the same chosen initial parameters of Figure 1, the walkthrough procedure yielded a negative haircut with regard to the 1-to-12-month range of the lookback period lp. That is, with the walkthrough procedure using that entire range the Sharpe ratio was higher than that which would have been obtained with the constant use of the hindsight-determined best-performing lp. This is the opposite of what is depicted on Figure 1.

We were perhaps expecting, given the parable of the painting contractor, to find that the walkthrough method would always give us our comeuppance for having dared to suppose that we could take on the professionals. Instead, in these particular circumstances there was a rather extraordinary gain from its use over what would have been realized with any fixed choice of the lookback period, as if the algorithm were effectively adaptive in that way. But such a gain generally does not occur. It's therefore good that this article leads with the example of Figure 1, which shows a wicked marking down of the projected returns.

We could theorize that for particular circumstances giving the program a choice of lookback periods causes it to abandon the longer periods for the shorter ones whenever an abrupt and very substantial move happens, say, right at the start of a deep plunge or a sudden recovery. For example, just after a plunge has started the shorter lookback periods will be performing better because they will have quickly indicated that the scheme should go into cash. Thus the scheme, selecting as it does the best-performing lookback period on a trailing basis, will put the portfolio into cash rather quickly, and vice versa for sudden rallies. Otherwise, during strong, steady rising markets the longer lookback periods will be performing better because they will not be generating whipsaws every time that there is a little pullback, so the program will then help us avoid getting hurt by whipsaw losses as it will automatically take the advice, so to speak, of the better-performing longer lookback periods. That adaptability is not there if a single, fixed lookback period is used. Such inner workings as that, if they persistently occur, would be likely to take the form of the program helping us to avoid participation in bear markets while keeping us invested most of the time in bull markets.

But whatever the Sharpe ratio outcome of the walkthrough procedure, our reliance will be on two things: the considerable extent to which the procedure faithfully simulates the use of our scheme in the past, and the second haircut that we shall administer after we examine the extent to which the walkthrough-procedure-defining parameters that are optimal for this portfolio remain optimal when applied to other portfolios. We will see that it would be utterly wrong to suppose that the subsequent Monte Carlo analysis could somehow compensate for a failure to do a true simulation. We will come closer to a true simulation further below, for Figure 3, when we also submit the determination of the maximum number of securities held to the walkthrough procedure.

The Roadmap to Recover from This

So the score is that we have two brutal markdowns (Figure 1, and Figure 2, which is a buzz cut from the blue to the cyan trace) versus the one markup that is mentioned directly above. It's unavoidable that we permit the markdowns (and markups if we're awarded them) to happen, as no program can be devised that can figure out in advance that setting n, the maximum number of securities held, to 5 as on Figure 2, or the lookback period lp to 11 as on Figure 1, would eventually work out to be the best.

But we have yet to conduct the optimization of the walkthrough-procedure-defining parameters— the range limits of n and lp and the characteristic period m of the trailing exponential moving average, that are used by the walkthrough procedure to select the best trailing n and lp values. The good news is that we'll find it to be appropriate to adjust the lower limit of the n range substantially upward from 1, the possible need of which is anticipated on the previous page, and when we do so there will be a resultant huge boost in performance. This development has implications regarding the very degree to which relative strength should be relied upon when the simple and popular price ratio is used to measure the amount of momentum, so the exercise will be quite meaningful even for those who have no strong interest in algorithms.

The general plan here is that we started above, and will continue below for a bit, with non-optimized initial values of the walkthrough-procedure-defining parameters. They are presented so that we can see what happens when we adjust them toward optimization— mainly so that we can thereby discover the depth of that second haircut that we must administer as a final step, as is explained above. So next we'll summarize the results that are obtained with those initial values of the walkthrough-procedure-defining parameters, and then we'll go on to the optimizations.

Initial Outcomes, Before Optimization

We will be working with the following lists of ETFs, three of them, and from each a portfolio will be formed. Inasmuch as our program dynamically adjusts the position sizes in each security, and the size may be zero, at any one time only a few or perhaps even none of the securities may be actually in the portfolio.

  • International: SPY, MDY, EWA, EWC, EWG, EWH, EWJ, EWW, EWS and EWU. These are the ticker symbols of 10 famous and very liquid ETFs of a developed-nation flavor that are very popular with investors of every stripe and which have all traded for roughly 20 years. As such they represent pure liquidity, for these funds all hold big- or mid-capitalization stocks that are traded heavily in their own countries as well as internationally.
  • Mostly-USA: DIA, EEM, IWM, IWV, IYR, QQQ, RSP and SPY. This list is really a rather random assortment of 8 of the most popular ETFs held by investors in the United States. IYR holds REITs and the like; EEM covers emerging markets; the others are diversified US equity funds but note that IWM holds small-capitalization stocks.
  • Sector: XBI, XES, XHB, XLB, XLE, XLF, XLI, XLK, XLP, XLU, XLV, XLY, XME, XOP, XPH, XRT and XSD. These 17 ticker symbols are of industry sector funds. Unfortunately some of these were launched as recently as 2007. Hence the historical records only show performance through part of one crisis, the Lehman Brothers/subprime-housing debacle of 2008, and the chart below for this portfolio begins about a year after prices reached their peak. A few others of the X series are of even more recent vintage and have been omitted.

Cumulative Returns, Traded and Benchmark

Figure 3 shows the unoptimized outcomes at a glance, for all three portfolios. On it the cyan trace represents our full-blown “complete walkthrough” procedure selecting the best-performing n and lp on a trailing basis; for the magenta trace n was constrained, set equal to the total number of ETFs on the given list. Do notice that the magenta trace for the International portfolio really looks like something that anyone would love to own as it shows hardly any significant declines but finishes with a market-beating result.

Walkthrough Results

Figure 3: This shows unoptimized returns of portfolios formed from the three lists above. Traded Portfolio results are shown both for a complete walkthrough (cyan) that included variation of the maximum number n of securities held over the full range of 1 to N where N is the total number of ETFs on the list, and also for a walkthrough (magenta) with the maximum number of securities held fixed at N. The characteristic time period m of the trailing exponential average was 2 months and lp, the lookback period, was allowed to range from 1 to 12 months.

What does setting the maximum number of securities held equal to the number of securities on the list mean? It means giving up on relative strength, that's what it means, because with that choice we only select securities to hold based on whether or not their trailing return ratios exceed that of holding cash, without a focus on just owning the securities with the very best trailing performance— hence the label “Traded Portfolio (without relative strength)” for the magenta trace.

We won't dwell much on Figure 3, as those results are not finalized, but we can say that the Traded Portfolio traces do for the most part avoid the debacles (except for the embarrassment of the “complete walkthrough” with the International portfolio during the 2000-2003 dot-com crash). And the “without relative strength” plots are generally much less volatile than the others.

To be sure, we could not accept that sort of overall performance relative to that of the benchmark that we see in Figure 3. However, we have more to do to finalize our program— we have to deal with ex post setting of the range limit parameters and of the characteristic period m of the trailing exponential moving average. The finishing steps will boost the expected performance substantially beyond what we have seen thus far.

Sharpe Ratios and p Values

Earlier in this article the Sharpe ratio was introduced as a suitable measure of portfolio performance, traded or not. And in Part A we reviewed the concept of the “p value” as the probability that results at least as good as ours would arise entirely by chance rather than from the successful exploitation of a systematic effect. We can compute the Sharpe ratio for any record, but obviously if our scheme produces a Sharpe ratio that is less than that of the benchmark there is no point in proceeding with an attempt to refute the null hypothesis because we already know that the null hypothesis has won.
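The Monte Carlo procedure of Part A is not reproduced here, but the last step of any such procedure can be sketched generically. Assume we already have a collection of Sharpe ratios produced by chance alone under the null hypothesis; how those are generated is left to Part A. The function name and the add-one correction, a common convention that keeps the estimate away from exactly zero, are assumptions.

```python
def empirical_p_value(observed, null_samples):
    """Share of chance-only results that are at least as good as ours.

    ASSUMPTION: null_samples holds Sharpe ratios simulated under the null
    hypothesis; the add-one correction is a convention, not the article's.
    """
    at_least_as_good = sum(1 for s in null_samples if s >= observed)
    return (at_least_as_good + 1) / (len(null_samples) + 1)
```

A small p value then says that chance alone rarely does as well as the scheme did.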

The last two columns of the table below pretty well tell the story of our research thus far. We are still paused just now short of finalization but will tend to that next. It should be kept in mind that the Sharpe ratio has a rate of return in excess of that of cash in the numerator and a measure of volatility in the denominator. Hence at times the portfolio that finishes highest in dollars, meaning highest in the overall rate of return, may be graded using the Sharpe ratio as being not as desirable as another that finishes not quite as high but with a history of less volatility.

RB         Portfolio      Fig.  n     Sharpe Ratio  p Value
Traded     International  3     10    0.69          <0.01
Traded     International  3     1-10  0.33          0.21
Benchmark  International  3     all   0.27
Traded     Mostly-USA     3     8     0.50          0.14
Traded     Mostly-USA     3     1-8   0.47          0.11
Benchmark  Mostly-USA     3     all   0.36
Traded     Sector         3     17    0.62          0.18
Traded     Sector         3     1-17  0.33          0.44
Benchmark  Sector         3     all   0.46
Table 1: These are the statistics prior to optimization of the walkthrough-procedure-defining parameters— based on initial values that will be changed. With all traded portfolios the value of n is the maximum number of securities held, and 1/nth of the account equity is allocated to each security held. Remaining equity will be held in cash if the number of securities whose trailing return ratios exceed that of cash does not equal or exceed n; otherwise the equity will be allocated to the n securities with the highest trailing return ratios. Interest is paid on cash. The characteristic time period m of the trailing exponential average was 2 months and lp, the lookback period, was allowed to range from 1 to 12 months.

On the table, the rows with n given as a range (cyan on Figure 3) pertain to our “complete walkthrough” runs, and the rows with n equal to the full number of securities on the list (magenta on Figure 3) pertain to our “without relative strength” runs. Note especially that the latter beat the former on the Sharpe ratio for every portfolio. That's telling us something, something that we'll bear in mind when it comes time to set the lower range limit of n, the maximum number of positions held. We can also see that the p values for our complete walkthrough runs are all telling us that they are not good enough (we'd like p<0.05).

Optimizing the Walkthrough-Defining Parameters

One of the unsettling things about what we're doing is that ideally we'd like to have a tremendously long historical record for each of the securities that we want to trade, with all of it being representative of the way that the market is currently working. But of course that can't be. We generally follow ETFs from inception, which is at most 20 years or so ago. One can even argue that 20 years could possibly be too long as the amazing thing called “high-frequency trading” didn't exist back then and seems to have the market by the throat now. So what do we do?

Well, to a degree it seems that if we take a scheme that we have developed and tested on one set of securities and then apply it to a different set of securities of a not-too-different character, we will be doing something that is a kind of substitute for having a very long but relevant period of record for the targeted securities, since in that way we would likewise be pitting the scheme against reams of data. But then we have to ponder the question of how similar in character the other securities would have to be. We can wonder whether or not our scheme should work with everything. We could very rationally suppose that a momentum and relative strength scheme that works for diversified equity funds in particular should not be expected to work as well even for stocks, much less for, say, sovereign debt, currencies, precious metals or cattle futures. Will we have to apply (gasp!) judgment?

As we go through the optimization process we'll be accumulating data on the three lists of securities: the International, the Mostly-USA and the Sector lists. They are all equities, ETFs at that.

Other Concerns

It was hardly possible to have avoided having some parameters that must be set ex post. Furthermore, even if you somehow don't have any scheme-defining parameters to set ex post via some sort of optimization, you still have a structure. Yes, we have simply picked a mathematical framework here. It's not a “genetic algorithm” or a “neural network”, is it? No, it's something else.

Picking a structure is possibly even more fateful and portentous than choosing a parameter value ex post: If you didn't try other structures, others did and you read about that and then decided to go with a particular choice, and that all happened after the data that you're backtesting with was originated— hence ex post. And there's always the chance that we, or they who chose those other frameworks, chose something that happened to have worked on the limited dataset on which backtesting was conducted, which scheme will never work as well again.

We have a rational basis for setting the walkthrough-procedure-defining parameters, and it's one with which we can get a rough idea of the reliability of the finished product. But the chosen structure is something that we cannot be reasonably called upon to vary in any systematic way, and so it is not within the reach of optimization.

The Idea in Some Detail

It's time to go ahead and find the best performing values of the range limits and the characteristic time period m of the trailing exponential average over the entire period of record, ex post, and then further look at the degrees of sensitivity of the Sharpe ratio to variations from those optimal values and also at the extent to which the thus-determined values assume different values when determined using other datasets.

The concept here is that if the Sharpe ratio for the entire period of record fluctuates radically as we vary a parameter that we have set ex post about its optimal value, such as a range limit or the characteristic time period of the trailing exponential average, then that means that we could generate a wide range of outcomes of the simulation (but not repeatably in practice) by changing the parameter values just a bit. And of course if we're stupid we'll go ahead and tweak the parameter values, notwithstanding the volatility, until we get a high Sharpe ratio— failing to understand that high historical volatility with respect to parameter values implies high volatility of future outcomes with fixed parameter values, and fooling ourselves into thinking that we have a scheme that really works (a bit like the painting contractor all over again).

But conversely, if there is only an insubstantial variation of the Sharpe ratio as we vary range limits and the characteristic time period over reasonable ranges then our concern is minimized. Yes, we want insensitivity or non-volatility of the Sharpe ratio with respect to the values of our dangling, loose-end parameters and that's one circumstance with regard to which we can rationalize the acceptance of ex post settings of such parameters. If we don't get that then it simply means that the second haircut will have to be correspondingly deep.

More ideally we'd like to select the loose-end parameters so as to maximize the Sharpe ratio for the entire current time period under study, nearly 20 years for our International portfolio, and do likewise for numerous other time periods of like substantial duration— with the very same securities if only that were possible. We would presumably get somewhat different optimal walkthrough-procedure-defining parameter values for each time period and could then substitute each of them for the optimal values for the current time period and see what damage the deviations would do to the Sharpe ratio of the current time period. It would go lower, or at least never higher to be sure, because we had maximized the Sharpe ratio for the current period. However, as we discussed above, we don't have many decades of relevant data with respect to any one set of securities. That not being feasible, the resort that is carried out in this article is, in place of data from different time periods that pertain to the same list of securities, the use of contemporary data from several lists of similar kinds of securities. We find the optimal walkthrough-procedure-defining parameter values for each list of securities, and note the extent to which the Sharpe ratios for each of the lists of securities fluctuate as the parameter values are adjusted about their optimal values. We can then, from the variations in the optimal values, get a rough idea of how much the projected Sharpe ratio for each list of securities should be reduced— our second haircut.
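One way to picture the cross-list comparison is sketched below. The article does not give a formula for the second haircut, so this is just one plausible reading: for each list, substitute the optimal setting found on each of the other lists and record the worst fractional loss of Sharpe ratio. The Sharpe ratios here are invented for illustration, and the names `sharpe_by_m` and `cross_list_shortfall` are hypothetical.

```python
# Invented Sharpe ratios by EMA period m for each list (illustration only,
# not results from the article).
sharpe_by_m = {
    "International": {1: 0.50, 2: 0.60, 3: 0.50, 4: 0.49, 5: 0.48},
    "Mostly-USA": {1: 0.58, 2: 0.62, 3: 0.66, 4: 0.64, 5: 0.63},
    "Sector": {1: 0.85, 2: 0.71, 3: 0.70, 4: 0.69, 5: 0.68},
}


def best_setting(surface):
    """Parameter value with the highest Sharpe ratio on one list."""
    return max(surface, key=surface.get)


def cross_list_shortfall(surfaces):
    """Worst fractional Sharpe loss per list when another list's optimum is used."""
    optima = {name: best_setting(s) for name, s in surfaces.items()}
    shortfall = {}
    for name, surface in surfaces.items():
        peak = surface[optima[name]]
        losses = [
            1.0 - surface[optima[other]] / peak
            for other in surfaces
            if other != name
        ]
        shortfall[name] = max(losses)
    return shortfall


shortfalls = cross_list_shortfall(sharpe_by_m)
```

Because each list's own optimum maximizes its Sharpe ratio, every shortfall is zero or positive, which matches the observation above that the ratio can only go lower. A rough haircut could then be chosen with the largest of these shortfalls in mind.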


Optimization Underway at Last

We'll now finally go about optimizing the walkthrough-procedure-defining parameters, the range limits and the characteristic period m of the trailing exponential average, that we use to select optimum n and lp parameter values in real time in the course of our walkthrough. What happens when these parameters' values are varied?

The Resolution for the Range Limits of n, the Maximum Number Held

Let's begin with varying the range limits of n, the maximum number of securities held— at first just for the International portfolio. For the complete walkthrough run with initial values, the results of which are shown on Figure 3, the range limits were the widest possible, 1 and 10, where there are 10 securities on the list.

We had considered that anyone managing funds might steer away from n being allowed to be very small. After all, a single ETF, even a country fund, might suffer a dramatic collapse (e.g., utter loss of computerized records, geopolitical catastrophe, etc.). And even if calculations were to show that allowing n to be very small produced superior results, it would be difficult to convince clients or customers that putting all of the eggs in so very few baskets was a good idea. So 3 was made the lower limit of the n range, leaving 10 as the upper limit. The Sharpe ratio improved, which improvement continued steadily as the lower limit was further raised until 5 was reached, after which the Sharpe ratio declined, albeit not very much. Note that if the lower limit is increased all the way to 10 then the run will be the same as for the magenta trace of Figure 3, which was still a quite good result.

That behavior with regard to the lower limit would appear to be rather consistent with the fact that fixing n at 5, with no variation, produced the best results, the stellar results (blue) of Figure 2 which were previously called into question. But the difference is that here, by moving the lower limit to 5, we are still allowing n to range over 5 to 10 securities, not fixing n at 5. And we also confirmed that the resultant Sharpe ratio was not dependent upon the lower limit for n in a volatile way in the vicinity of n=5.

We might think that setting the lower limit of the range back to 1 and the upper limit to, say, 2, which is the complement of the 3-to-10 range, might be expected to produce results somewhat worse than those for the complete walkthrough run (cyan) of Figure 3 with n allowed to vary over the full range of 1-to-10. It's not really necessarily so, but indeed that was the outcome.

But the really big picture is that with the lower limit of n fixed at any of the possible values other than one or two the Sharpe ratio would not have been bad, not with this dataset. So we could simply adopt the attitude that at least with these particular kinds of securities we have some leeway to move the limits of the n range to meet other money management objectives.

With the Sector fund portfolio the circumstances are rather different, with only rather high values of n seeming to work. It was found that with the lower limit of n raised continuously from 1, the peak value for the Sharpe ratio was reached at 13 or 14, with the ratio declining only a little and with little volatility with still higher values for the lower limit of n. However at lower values of the lower limit of n the ratio was rather volatile with respect to small changes in n and was often about half the peak value. So we would not be able to accept a low range limit for n for the Sector portfolio, and the lower volatility that is associated with the highest n values suggests that the use of a high lower limit may be appropriate. With the lower range limit set to 14, and with 17 remaining the upper limit, the Sharpe ratio was 0.67— somewhat better than for n fixed at 17 as on the table above and for the magenta trace on Figure 3 for the Sector portfolio. (But we're not through optimizing yet.)

Academic Opinion— It Changes

A type-II error occurs when the null hypothesis is in fact false and should be rejected, but the rule's performance is too low to do so. That is to say, the rule really does have predictive power, but its virtue remains undetected because it experienced bad luck during the back test. The data miner winds up leaving real TA [technical analysis] gold in the ground.

The ability of a hypothesis test to avoid making a type-II error is referred to as the test's power, not to be confused with the rule's predictive power. Here we are talking about the power of a statistical test to detect a false H0 [null hypothesis].


As pointed out by economist Peter Hansen, WRC [White's reality check] and, by extension, the MCP [Monte Carlo permutation] would both suffer loss of power (increased likelihood of making a type-II error) when the data-mined universe [our momentum scheme with various sets of parameter values] contains rules [our momentum scheme with particular parameter values] that are worse than the benchmark.

— David Aronson, “Evidence-Based Technical Analysis”

The professors have been working on these considerations for some time, as is implied by the quotation above from David Aronson's book. I have added the clarifications in square brackets which draw parallels between the meaning of the quotation and the content of this article. By “rule” he means something like our momentum scheme with a particular set of parameter values. But note that the academicians too are still hashing out the fundamentals. They do not, for the most part, engage in walkthrough simulations per se but use other procedures that may or may not be more rigorous and usually don't leave us with a ready-to-go adaptive system, whereas our walkthrough approach is adaptive as the selected parameters such as n and the lookback period are determined dynamically in real time. Note especially that the discussion above about the lower limit of n for the Sector portfolio illustrates how we have allowed for systematic exclusion of bad parameter values, “rules that are worse than the benchmark” in Aronson's language.

The Resolution for the Characteristic Period m of the EMA

A trailing exponential moving average (EMA) weights more recent data more heavily than older data, and in the runs presented above the characteristic period of the EMA of the monthly return was set to m=2 and used by the walkthrough procedure to determine the n and lp parameter values that were optimal on a trailing basis. For example, with m=2 and monthly data the natural logarithm of the month-over-month return ratio of the prior month will be weighted by a third as much as that of the current month, and for the month before the prior month the weight will be one-ninth of the weight for the current month, with the weight for each month going back in time reduced by a multiplicative factor of a third each month. With m=9 the weight for each monthly return is four-fifths of that for the following month, so that the roll-off is much slower. The multiplicative factor is computed as (m-1)/(m+1), where m is the characteristic period of the EMA, in accordance with the standard definition of an EMA of characteristic period m. So now we go on to consider the effect of trying characteristic periods m for the EMA other than 2 months.
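The weights quoted above follow directly from the factor (m-1)/(m+1). A small sketch, with hypothetical function names, that reproduces them relative to the weight of the current month:

```python
def ema_factor(m):
    """Month-to-month multiplicative roll-off for an EMA of period m."""
    return (m - 1) / (m + 1)


def relative_weights(m, months):
    """Weights relative to the current month, going back `months` months."""
    k = ema_factor(m)
    return [k ** age for age in range(months)]


weights_m2 = relative_weights(2, 3)  # current month, prior month, month before
weights_m9 = relative_weights(9, 2)  # current month, prior month
```

For m=2 the factor is 1/3, giving relative weights of 1, 1/3 and 1/9; for m=9 the factor is 8/10, that is, four-fifths.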

Note that although the computation of the EMA begins at the very start of the record, if the characteristic period of the EMA is m months then it will take something like m months before the EMA assumes a value that could not be substantially different had data earlier than the start of the record been available. Hence we do not determine positions in securities that are to be taken using such an EMA until m months have passed.
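The trailing selection and the warm-up rule can be sketched together. This is a simplified illustration rather than the program's actual code: it assumes that the monthly log return ratios that each candidate (n, lp) setting would have earned are already available, it starts the EMA at zero, and the names `walkthrough_choices` and `log_returns` are hypothetical.

```python
def walkthrough_choices(log_returns, m):
    """Pick, for each month, the setting with the best trailing EMA.

    log_returns maps a candidate (n, lp) setting to its list of monthly
    log return ratios. Only data from earlier months is used for each
    choice, and no choice is made until m months have passed.
    """
    k = (m - 1) / (m + 1)  # EMA roll-off factor
    ema = {setting: 0.0 for setting in log_returns}
    months = len(next(iter(log_returns.values())))
    choices = []
    for t in range(months):
        # Decide for month t using the EMA built from months before t.
        choices.append(max(ema, key=ema.get) if t >= m else None)
        # Then fold month t's outcome into each candidate's EMA.
        for setting, series in log_returns.items():
            ema[setting] = k * ema[setting] + (1 - k) * series[t]
    return choices


# Two hypothetical candidate settings over five months.
candidates = {
    (5, 4): [0.01, 0.01, 0.01, 0.01, 0.01],
    (10, 12): [-0.01, -0.01, -0.01, -0.01, -0.01],
}
picks = walkthrough_choices(candidates, m=2)
```

The first m entries of the result are `None`, reflecting the waiting period before positions are determined. Each later entry is the setting whose trailing EMA was highest using only earlier months.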

Given the discussion above that concluded that the lower limits of the n ranges should be adjusted upward, let's proceed first with the International portfolio with n confined to the range of 5 to 10. We do not really expect to find that the characteristic period of the EMA that maximized the Sharpe ratio with the full range of n of 1 to 10 will now still maximize the ratio with n varying from 5 to 10, and so we'll see.

Varying the characteristic period m from 1 month up to 6 months with n in the new and smaller range and the best performing lookback period still in the 1-to-12-month range produced Sharpe ratios that tended to be rather flat, varying little throughout the entire range of variation of m— except that the ratio for m=2 was about 20% higher than the ratios for 1 and 3, and there was a slow trailing off for m greater than 3 with the ratio not dropping below about 1.7 times that of the benchmark.

Turning now to the Sector portfolio (cf. Figure 5), with n now in the range 14-to-17, the findings regarding the Sharpe ratio dependencies on the choice of the characteristic period m of the EMA are qualitatively similar to the International portfolio results. For m in the range of 1 to 6, the Sharpe ratio was above 1.4 times the benchmark except for m=6, with which it was 1.3 times the benchmark. At m=1 it was 1.7 times the benchmark and 20% higher than the ratio at m=2. This behavior is again rather favorable, due to the general lack of volatility, especially if m is kept away from the low end of the initial range, and due to the ratio staying substantially higher than that of the benchmark. And the peak Sharpe ratio occurred at almost the same m value as with the International portfolio.

We'll keep in mind that variations as large as 20% have happened with these tests of the effect of varying m, the characteristic period of the EMA. They are bigger than other variations that were observed, and we'll act on the information further below. And of great interest is the fact that almost none of the variations of the Sharpe ratios that follow from changes in the m value were due to changes in the volatility of the excess return; the effects were all on the annual excess return— the numerator of the Sharpe ratio, not the denominator. We'll also take that into consideration when administering the second haircut.

So for the next step, investigating the range limits of the lookback period, for the International portfolio we'll continue to use the 5-to-10 range for n, the maximum number of securities held; for the Sector portfolio we'll continue to use 14 to 17 as the range. And for both we'll now set the characteristic period of the EMA at— surprise— 5 months. It's not the utter optimum for either the International or the Sector portfolio with the lookback period range limits that we've started working with. But we're going to find that it works well, close to optimal, with the adjusted range limits that we're about to settle on next. (And we'll get to the Mostly-USA portfolio for all of these settings a bit further below.)

The Resolution for the Range Limits of the Lookback Period lp

With those settings the lower range limit of the lookback period was first systematically varied upward with the upper limit fixed at 12; then the lower limit was fixed at 1 while the upper limit was first set to 1 and then systematically raised. We'll start with the International portfolio.

The Sharpe ratio increased rather smoothly and continuously as the lower limit was raised from 1 to 4, the value at which the ratio peaked, and then declined just as evenly as the lower limit was further raised to 12 months. The final run with the lower and upper limits of the lookback period fixed at 12 months produced the lowest Sharpe ratio of the series but it was nonetheless 1.7 times that of the Rebalanced Benchmark.

Then, as the upper limit was started at 1 and increased to 12, except for a dip of about 15% when the upper limit was at 3 months the Sharpe ratio was very flat, varying little from an average of about 1.5 times the benchmark except when the upper limit was 12 months. That produced a ratio of 1.8 times the benchmark.

But what about increasing the upper range limit beyond 12 months? With 1 remaining the lower limit, there was a slight increase in the Sharpe ratio out to 14 months and a slow and steady decline started thereafter.

Turning now to the Sector portfolio, with n in the range 14-to-17, as the lower lookback period range limit was increased from 1 to 12 months with the upper limit set at 12 months the Sharpe ratio followed exactly the same pattern as for the International portfolio: rising until reaching a peak when the lower limit was at 4 months and declining thereafter, steadily, and to a value of 1.6 times the benchmark when the lower limit was equal to the upper limit and was 12 months.

Proceeding further, still with the Sector portfolio, with the upper range limit starting at 1 and then increasing to 14 months, the Sharpe ratio of the traded portfolio increased rather continuously from 1.2 times the benchmark to 1.5 times it— but for an anomalous jump when the upper limit was increased to 2 months, with which range the Sharpe ratio was 1.6 times the benchmark.

So, in all, the tests of altering the lookback period range limits showed modest volatility of the Sharpe ratio with respect to the choice of limits, mostly steady changes, and no big advantage to altering the limits much from the 1-to-12-month range of the tests above. However, the indications are that using instead, say, 4-to-14 months may be a slightly better arrangement, especially since the higher lower limit of the lookback period range may tend to reduce the frequency of trading. So we'll adopt that range for the calculations for the final figure and table of this article, which appear on the next page.

The Resolution for the Mostly-USA ETF Portfolio

To avoid tedium the behavior of the Sharpe ratio with respect to various settings of the walkthrough-procedure-defining parameters for this portfolio was not reported on above as for the other portfolios but is described here and now and in an abbreviated way. It suffices to say that the same 4-14 month range for the lookback period lp and the same 5-month setting for the characteristic period m of the trailing EMA look to be quite appropriate for this portfolio as well. The Sharpe ratio tendencies with respect to variations of the range limits and the characteristic period were much the same as for the International and Sector portfolios. Also, setting the lower limit of the range of n, the maximum number of securities held, to 5 produced a maximized Sharpe ratio.

The Worst That Can Happen

In addition to implementing the walkthrough procedure, the p value calculations and taking the aforementioned care with the disposition of the walkthrough-procedure-defining parameters, is there anything else that we have done that might limit the risk that we take if we use this computerized portfolio management program? What's the worst that can happen? Well, a scheme that would have worked so very well in the past could in the future be found to work worse than buy-and-hold investing or being in cash. Yes, that can happen.


A more likely failure would be that the monthly trading that our program requires becomes merely ineffectual, that in time the null hypothesis rears its ugly, nihilistic head and turns the scheme into nothing better than a random number generator. With that scenario what should we then expect as our likely outcome? Well, with the scheme that we're analyzing we'd be holding securities in our account, long positions only, and from time to time some cash, meaning that we'd be less exposed to risk but also that we'd forfeit some of the gains, if any, that would follow from pure buy-and-hold investing.

Since in the long run the equities markets go up, our long-term outcome would likely involve a smaller rate of return but at less volatility. That might well mean that our long-term Sharpe ratio would be similar to that of the buy-and-hold approach, which is not too bad an outcome (but we'd have to watch the transaction costs, which could be more problematical due to the reduced return, and it would have been quite regrettable to have done all of the trading for no improvement). Note especially that the sufferable prospect that has just been described would come about only because our momentum program is specifically limited to long positions only; if we were to also do short sales during down markets then if our program were to degrade into randomness the long-term expectation for returns would be about nil, the short positions proving to be losers more often than not.

Other Costs of Trading

You may have also read this other article on this website on the subject of significant costs of trading that innately accrue if the trading is done in a stop-loss fashion— selling when the market declines and hoping to buy back at the right price on the way back up.

So it needs to be understood that while in this article we have neglected to account for commissions, other fees, slippage on executions and the bid-ask spread, we have not failed to include those other costs that accrue automatically with trading in a stop-loss style, a style that is not altogether divorced from what we're doing here with the simple momentum scheme based on trailing return ratios.

That is, our scheme has us selling when the market goes down and buying when it goes back up, at least vaguely like the examples of that other article on pure stop-loss trading. However, our mathematics automatically includes whatever costs of that kind accrue in their entirety, and where we have shown high Sharpe ratios we are handily beating them. It's not especially likely that the restriction that we have imposed with this article to trading only once a month assisted us in avoiding those other costs (the computer was programmed to trade at the close on the first trading day of every month if indicated). We normally expect the per-trade innate costs of stop-loss trading to be proportionately greater with a lower frequency of trading, so we cannot conclude that where we were successful it was made possible by our modest frequency of trading.

Notes and Finalized Results

We've tackled a lot of important considerations in this article. In no particular order, here are some important ones to keep in mind.

  • This article is about the general problem of how to test a dynamical portfolio management program to see whether it is reliable, that is, whether we can have confidence in the projected returns.
  • Apart from the testing principles that are put forth, all of the findings herein depend on the fact that we have considered a particular type of momentum scheme— one simply based on price ratios over a trailing lookback period. With each variation tested here no security was held if its price ratio was less than the like ratio for cash compounded with interest. Only long positions were taken, no short sales. With all of the calculations interest was paid on cash held, at the going rate implied by the 13-week Treasury bill discount rate, and the returns that were used for the Sharpe ratio calculations were returns net of those on cash. Trades were conducted only at the close on the first trading day of each month but the computations of the positions that were to be assumed at the close were based on opening prices on the first trading day of each month.
  • Other schemes would produce other results, so this article's findings don't represent the best that can be shown in support of the advisability of questing for good computerized ways of managing a portfolio of investments. Really the momentum approach was selected as an example because it has some popularity and some academic support. Another Retail Backtest program that is now in testing performs better, but this one has the advantage of a more modest frequency of trading and it would have easily and handily avoided the last two market plunges, the dot-com crash of 2000 and the Lehman Brothers/subprime-housing crisis of 2008.
  • It was found that the relative strength approach, which involves limiting the positions held to only the very best performers on a trailing basis, should be applied in moderation and that it's possible to legitimately tailor the degree of moderation to the chosen list of eligible securities somewhat, subject to some limitations, in a way that is responsive to the character of the securities as a class (e.g., International, Sector, etc.). In this article moderation was imposed by increasing the lower limit of the permitted range of n values, where n is the maximum number of securities held. As with the other findings, this one is specific to the momentum scheme that was tested; with other programs relative strength may or may not generally work well if not applied in moderation.
  • Should you make use of this program with your own investments? Investment advice is not offered here, just analyses with full disclosures about the results of hypothesis testing.
  • Note that the walkthrough method that is employed here, with its wonderful way of giving naïve projections of rosy returns a “haircut” (the blue trace of Figure 2 was superseded by the cyan trace), and the computation of the chance that the thus-shorn returns are due in reality to chance don't overtly deal with the problem of our having simply chosen a scheme of a particular structure, which is an ex post choice (and we hate those because they make us look like the hapless, clueless painting contractor of the first page of this article). To deal with that problem some have theorized about alternative hypothesis testing methods, near counterparts to which we could implement here by somehow coming up with a “universe” of other schemes having different structures that could hypothetically be tried during the walkthrough just as we instead tried just our own scheme with different parameterizations. In theory that might administer a bigger haircut, but it's not a feasible approach given that the number of such other schemes is truly infinite and given that it's hardly understandable that even partial use of such a procedure would not generate a “type-II” error as defined in the David Aronson quotation in the right-hand column of the previous page. In lieu of that, in this article the emphasis has been on running as true as possible a simulation of the performance of the program in the past.
  • The imperfections of the simulation take the form of unavoidable static ex post settings of the n- and lp-parameter range limits and of the characteristic period m of the EMA that are used during the walkthrough to select the n and lp parameter values with the best trailing performance. Those static ex post settings of the walkthrough-procedure-defining parameter values are then tested regarding the universality of their applicability by investigating the sensitivity of the Sharpe ratio to them and the sensitivity of their optimal values to the makeup of the list of eligible securities. That testing is the basis of the second haircut, which we will assume below to amount to a 20% reduction of the annual excess return, the return in excess of that on cash. The haircut was applied in such a way as to not reduce the volatility of the cumulative returns. Estimating the size of this second haircut and applying it is now called “suboptimization” in these pages of this website.
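The effect of the second haircut on the Sharpe ratio can be sketched as follows. The annualization used here (monthly mean times 12, monthly standard deviation times the square root of 12) is a common convention assumed for illustration and is not stated in this article. The function name and the sample returns are hypothetical.

```python
import math
import statistics


def annualized_sharpe(monthly_excess, haircut=0.0):
    """Sharpe ratio from monthly returns net of cash, with an optional haircut.

    The haircut scales only the annual excess return (the numerator);
    the volatility (the denominator) is left untouched.
    """
    annual_excess = statistics.mean(monthly_excess) * 12
    annual_vol = statistics.stdev(monthly_excess) * math.sqrt(12)
    return (1.0 - haircut) * annual_excess / annual_vol


# Hypothetical monthly returns in excess of cash.
sample = [0.02, -0.01, 0.03, 0.00, 0.01, -0.02]
raw = annualized_sharpe(sample)
shorn = annualized_sharpe(sample, haircut=0.20)
```

Since only the numerator is scaled, a 20% haircut of the annual excess return lowers the Sharpe ratio by the same 20%.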

Charts and a Table

And now let's see how our results were affected by the adjustments of the previous page to the characteristic period m of the trailing exponential average and also to the range limits for the lookback period lp and for n, the maximum number of securities held— and by administering that second haircut.

Walkthrough Results

Figure 4: Here are shown outcomes, from the top panel to the bottom, respectively for the same portfolios of Figure 3 on page 2 but now with optimized walkthrough-procedure-defining parameters. For all of the runs the lookback period was in the range of 4 to 14 months and the characteristic period of the trailing EMA that was used to dynamically select the lookback period and the maximum number of securities held was 5 months. The lower limits of the range of n, the maximum number of securities held, for the International, Mostly-USA and Sector portfolios were respectively 5, 5 and 14. The annual returns in excess of those on cash were reduced by 20% as a second haircut, as is discussed in the text above. Table 2 below summarizes the pertinent statistics for these runs.

RB         Portfolio      Fig.  n      Sharpe Ratio  p Value
Traded     International  7     5-10   0.51          <0.01
Benchmark  International  7     all    0.27
Traded     Mostly-USA     7     5-8    0.63          <0.01
Benchmark  Mostly-USA     7     all    0.36
Traded     Sector         7     14-17  0.77          0.01
Benchmark  Sector         7     all    0.46
Table 2: With all traded portfolios the value of n is the maximum number of securities held, and 1/nth of the account equity is allocated to each security held. Remaining equity will be held in cash if the number of securities whose trailing return ratios exceed that of cash does not equal or exceed n; otherwise the equity will be allocated to the n securities with the highest trailing return ratios. Interest is paid on cash. This table incorporates a 20% haircut of each of the annual excess returns, which amounts to like reductions of the Sharpe ratios.

The table above summarizes our results after finalization of the scheme. Note especially that all of the traded portfolios sport good Sharpe ratios that substantially exceed the benchmark. And the p values show hardly a hint of the null hypothesis.

It is mainly the fact that we set lower range limits for n, the maximum number of securities held, that were not low— we had used n=1 before— that caused our “complete walkthrough” results to improve, besting the Sharpe ratios of the benchmark and also of the walkthrough when no relative strength was used at all. In other words, moderation with regard to the application of relative strength seems to work better than either full use of it or utter abandonment of it. Other improvement happened as the natural result of adopting ex post lookback period range limits and a characteristic period for the trailing exponential average that were rather close to being simultaneously optimal for all three portfolios.

This research will be continued. Although the discourse above seems to be at least conceptually complete— every grim reality of hypothesis testing is dealt with in some way— there is clearly work yet to do if anyone wants to use the scheme to manage portfolios of substantially different composition, or if modification of the scheme is required to meet particular objectives. Interested parties are encouraged to write to see if they can be assisted.

— Mike O'Connor

Comments or Questions: write to Mike. Your comment will not be made public unless you give permission. Corrections are appreciated.

Update Frequency: Infrequent, as this article is for the purpose of showing certain principles of portfolio management program testing in action; it's not intended to show the current effectiveness of the program or state of the market. See the momentum items on the Performance menu for the current program performance.