How Random is the Market? Testing the Random Walk Hypothesis

A mainstay of academic research into the market is the Random Walk Hypothesis (RWH). This is the idea that market moves are random and follow a normal distribution that can be easily described using a concept borrowed from physics called Brownian Motion.

This makes the market mathematics manageable, but is it true? Is the market really random?

If it is, then there’s little point to trying to beat it. But if it isn’t, then there are repeatable patterns that can be algorithmically exploited.

Thankfully, the issue of randomness is very important for fields like cryptography, so it is well studied and there are statistical tests that we can apply to market data to investigate this.

We’re going to borrow a few standard tests for randomness and apply them to historical data to see just how random the markets really are.

Measuring Market Randomness

There are a host of randomness tests that have been developed over the years which look at binary sequences to determine whether or not a random process was used to generate these values. Notable test suites include the Diehard tests, TestU01, and the NIST Statistical Test Suite.

We could run a large battery of tests (maybe we’ll get to that in a future post) to test our market data, but for now, we’ll just select three tests to see how the RWH holds up: the runs test, the Discrete Fourier Transform test, and the Binary Matrix Rank test from the NIST suite.

Runs Test

If the market truly is random, then we shouldn’t see any dependence on previous prices; the market being up today should have no impact on what it will do tomorrow (and vice versa).

The runs test can help us look at this aspect of randomness. It works by counting the total number of positive and negative streaks (runs) in a sequence and checking whether that count is consistent with what a random process would produce.

We’ll take our prices and make all positive price changes into 1s and negative changes into 0s, and keep this binary vector as X. We’ll set n as the number of observations we have (e.g. n = len(X)). Then, to implement the runs test, we take the following steps (adapted from section 2.3 of the NIST Statistical Test Suite):

1. Compute the proportion of 1s in the binary sequence:

\pi = \frac{\sum_j X_j}{n}

2. Check the value \pi against the frequency test. It passes if: \mid \pi - 1/2 \mid < \tau, where \tau = \frac{2}{\sqrt{n}}. If the frequency test fails, we stop: the sequence is not random, and we set our P-value to 0. If it passes, we continue to step 3.

3. Compute our test statistic V_n where:

V_n = \sum_{k=1}^{n-1} r(k) + 1

where r(k) = 0 if X_k = X_{k+1}, otherwise r(k) = 1. So if we have the sequence [0, 1, 0, 0, 0, 1, 1], then this becomes: V_n = (1 + 1 + 0 + 0 + 1 + 0) + 1 = 4

4. Compute our P-value where:

p = erfc\bigg( \frac{ \mid V_n - 2n \pi (1 - \pi) \mid}{2 \pi (1-\pi) \sqrt{2n}} \bigg)

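Note that erfc is the complementary error function (given below, where \pi inside the square root is the usual constant rather than the proportion of 1s). Thankfully, this is available in Python with scipy.special.erfc(z):

erfc(z) = \frac{2}{\sqrt{\pi}} \int_z^{\infty} e^{-u^2} du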

With all of that, we can now use our P-value to determine whether or not our sequence is random. If our P-value is below our threshold (e.g. 5%), then we reject the null hypothesis, which means we have a non-random sequence on our hands.

import numpy as np
from scipy.special import erfc

def RunsTest(x):
  # Convert input to binary values
  X = np.where(x > 0, 1, 0)
  n = len(X)
  pi = X.sum() / n
  # Check frequency test
  tau = 2 / np.sqrt(n)
  if np.abs(pi - 0.5) >= tau:
    # Failed frequency test
    return 0
  r_k = X[1:] != X[:-1]
  V_n = r_k.sum() + 1
  num = np.abs(V_n - 2 * n * pi * (1 - pi))
  den = 2 * pi * (1 - pi) * np.sqrt(2 * n)
  return erfc(num / den)
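As a quick sanity check on step 3, the toy sequence from the worked example can be run through the same transition-counting logic. This is only a minimal sketch; count_runs is a hypothetical helper name used for illustration and is not part of the NIST suite.

```python
import numpy as np

def count_runs(bits):
  """Return V_n: the number of runs (streaks) in a binary sequence."""
  bits = np.asarray(bits)
  # r(k) is 1 wherever consecutive values differ, 0 otherwise
  r_k = bits[1:] != bits[:-1]
  return int(r_k.sum()) + 1

toy = [0, 1, 0, 0, 0, 1, 1]
print(count_runs(toy))  # 4 runs: [0], [1], [0, 0, 0], [1, 1]
```

Each change of value starts a new run, which is why the statistic is the number of transitions plus one.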

The NIST documentation gives us some test data to check that our function is working properly, so let’s drop that into our function and see what happens.

# eps from NIST doc
eps = '110010010000111111011010101000100010000101101' + \ 
x = np.array([int(i) for i in eps])

p = RunsTest(x)
H0 = p > 0.01
# NIST P-value = 0.500798
print("Runs Test\n"+"-"*78)
if H0:
  print(f"Fail to reject the Null Hypothesis (p={p:.3f}) -> random sequence")
else:
  print(f"Reject the Null Hypothesis (p={p:.3f}) -> non-random sequence.")
Runs Test
Fail to reject the Null Hypothesis (p=0.501) -> random sequence

We get the same P-value, so we can be confident that our implementation is correct. Note also that NIST recommends we have at least 100 samples in our data for this test to be valid (i.e. n \geq 100).

Discrete Fourier Transformation Test

Our next test is the Discrete Fourier Transformation (DFT) test.

This test computes a Fourier Transform on the data and looks at the peak heights. If there are too many high peaks, then it indicates we aren’t dealing with a random process. It would take us too far afield to dive into the specifics of Fourier Transforms, but check out this post if you’re interested in going deeper.

Let’s get to the NIST steps. Our inputs are the data (x) and a threshold, which is usually 95%.

1. We need to convert our time-series x into a sequence of 1s and -1s for positive and negative deviations. This new sequence is called \hat{x}.

2. Apply discrete Fourier Transform (DFT) to \hat{x}:

\Rightarrow S = DFT(\hat{x})

3. Calculate M = modulus(S') = \left| S' \right|, where S' is the first n/2 elements in S and the modulus yields the height of the peaks.

4. Compute the 95% peak height threshold value. If we are assuming randomness, then 95% of the values obtained from the test should not exceed T.

T = \sqrt{n\textrm{log}\frac{1}{0.05}}

5. Compute N_0 = \frac{0.95n}{2}, where N_0 is the theoretical number of peaks (95%) that are less than T (e.g. if n=10, then N_0 = \frac{10 \times 0.95}{2} = 4.75).

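6. Count N_1, the number of peaks in M that are actually less than T, and compute the normalized difference between the observed and expected counts:

d = \frac{N_1 - N_0}{\sqrt{n(0.95)(0.05)/4}}

7. Compute the P-value using the erfc function:

P = erfc \bigg( \frac{\left| d \right|}{\sqrt{2}} \bigg)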

Just like we did above, we’re going to compare our P-value to our reference level and see if we can reject the null hypothesis – that we have a random sequence – or not. Note too that it is recommended that we use at least 1,000 inputs (n \geq 1000) for this test.

def DFTTest(x, threshold=0.95):
  n = len(x)
  # Convert to binary values
  X = np.where(x > 0, 1, -1)
  # Apply DFT
  S = np.fft.fft(X)
  # Calculate Modulus
  M = np.abs(S[:int(n/2)])
  T = np.sqrt(n * np.log(1 / (1 - threshold)))
  N0 = threshold * n / 2
  N1 = len(np.where(M < T)[0])
  d = (N1 - N0) / np.sqrt(n * (1-threshold) * threshold / 4)
  # Compute P-value
  return erfc(np.abs(d) / np.sqrt(2))
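To see what a failure looks like, here is a minimal sketch that feeds a deliberately periodic sequence through the same steps. The function dft_p_value is a hypothetical, self-contained restatement of the test for illustration, and the "up, up, down, down" pattern is made-up data, not market data.

```python
import numpy as np
from scipy.special import erfc

def dft_p_value(x, threshold=0.95):
  """P-value of the spectral test for a sequence of +1/-1 values."""
  x = np.asarray(x)
  n = len(x)
  M = np.abs(np.fft.fft(x)[:n // 2])              # peak heights
  T = np.sqrt(n * np.log(1 / (1 - threshold)))    # 95% peak threshold
  N0 = threshold * n / 2                          # expected peaks below T
  N1 = np.sum(M < T)                              # observed peaks below T
  d = (N1 - N0) / np.sqrt(n * threshold * (1 - threshold) / 4)
  return erfc(np.abs(d) / np.sqrt(2))

# A strictly periodic "market": up, up, down, down, repeated 250 times
periodic = np.tile([1, 1, -1, -1], 250)
p = dft_p_value(periodic)
print(f"p = {p:.3g}")
```

All of the signal's energy lands in a single frequency, so almost every other peak sits below T. The observed count N_1 ends up far from the expected N_0, and the P-value collapses toward zero.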

NIST gives us some sample data to test our implementation here too.

# Test sequence from NIST
eps = '110010010000111111011010101000100010000101101000110000' + \
x = np.array([int(i) for i in eps])
p = DFTTest(x)
H0 = p > 0.01
print("DFT Test\n"+"-"*78)
if H0:
  print(f"Fail to reject the Null Hypothesis (p={p:.3f}) -> random sequence")
else:
  print(f"Reject the Null Hypothesis (p={p:.3f}) -> non-random sequence.")
DFT Test
Fail to reject the Null Hypothesis (p=0.646) -> random sequence

As in the NIST documentation, we fail to reject the null hypothesis.

Binary Matrix Rank Test

We’ll choose one last test out of the test suite – the Binary Matrix Rank Test.


1. Divide the sequence into 32 by 32 blocks. We’ll have N total blocks to work with and discard any data that doesn’t fit nicely into our 32×32 blocks. Each block will be a matrix consisting of our ordered data. A quick example will help illustrate. Say we have a set of 10 binary data points, X = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0], and we use 2×2 matrices (to make it easy) instead of 32×32. We’ll divide this data into two blocks and discard two data points. So we have two blocks (B_1 and B_2) that now look like:

B_1 = \begin{bmatrix}0 & 0 \\0 & 1\end{bmatrix} B_2 = \begin{bmatrix}1 & 0 \\1 & 0\end{bmatrix}

2. We determine the rank of each binary matrix. If you’re not familiar with the procedure, check out this notebook here for a great explanation. In Python, we can simply use the np.linalg.matrix_rank() function to compute it quickly.

3. Now that we have the ranks, we’re going to count the number of full rank matrices (if we have 32×32 matrices, then a full rank matrix has a rank of 32) and call this number F_m. Then we’ll get the number of matrices with rank one less than full rank which will be F_{m-1}. We’ll use N to denote the total number of matrices we have.

4. Now, we compute the Chi-squared value for our data with the following equation:

\chi^2 = \frac{(F_m-0.2888N)^2}{0.2888N} + \frac{(F_{m-1} - 0.5776N)^2}{0.5776N} + \frac{(N - F_m - F_{m-1} - 0.1336N)^2}{0.1336N}

5. Calculate the P-value using the Incomplete Gamma Function, Q\big(1, \frac{\chi^2}{2} \big), with a = 1 and x = \frac{\chi^2}{2}:

P = Q \bigg(1, \frac{\chi^2}{2} \bigg) = \frac{1}{\Gamma(a)} \int_x^{\infty} t^{a-1} e^{-t} dt = e^{-\frac{\chi^2}{2}}

SciPy makes this last bit easy with a simple function call to scipy.special.gammaincc().

Don’t be intimidated by this! It’s actually straightforward to implement.

from scipy.special import gammaincc

def binMatrixRankTest(x, M=32):
  X = np.where(x > 0, 1, 0)
  n = len(X)
  N = np.floor(n / M**2).astype(int)
  # Create blocks
  B = X[:N * M**2].reshape(N, M, M)
  ranks = np.array([np.linalg.matrix_rank(b) for b in B])
  F_m = len(np.where(ranks==M)[0])
  F_m1 = len(np.where(ranks==M - 1)[0])
  chi_sq = (F_m - 0.2888 * N) ** 2 / (0.2888 * N) \
    + (F_m1 - 0.5776 * N) ** 2 / (0.5776 * N) \
    + (N - F_m - F_m1 - 0.1336 * N) ** 2 / (0.1336 * N)
  return gammaincc(1, chi_sq / 2)
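One caveat on step 2: the NIST test defines rank over GF(2) (arithmetic modulo 2), whereas np.linalg.matrix_rank computes rank over the real numbers, and the two can disagree for binary matrices. The sketch below shows one way to compute the GF(2) rank with Gaussian elimination using XOR. The name gf2_rank is a hypothetical helper for illustration; it is not what produced the results reported in this post.

```python
import numpy as np

def gf2_rank(matrix):
  """Rank of a 0/1 matrix over GF(2), via Gaussian elimination with XOR."""
  A = np.array(matrix, dtype=np.uint8) % 2   # work on a copy
  n_rows, n_cols = A.shape
  rank = 0
  for col in range(n_cols):
    # Find a row at or below `rank` with a 1 in this column
    pivots = np.nonzero(A[rank:, col])[0]
    if pivots.size == 0:
      continue
    pivot = rank + pivots[0]
    A[[rank, pivot]] = A[[pivot, rank]]      # swap pivot row into place
    # Clear this column in every other row (XOR is addition mod 2)
    for r in range(n_rows):
      if r != rank and A[r, col]:
        A[r] ^= A[rank]
    rank += 1
    if rank == n_rows:
      break
  return rank

# The third row is the XOR of the first two, so the rows are dependent mod 2
B = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])
print(gf2_rank(B), np.linalg.matrix_rank(B))
```

For this matrix the GF(2) rank is 2 while the real-valued rank is 3. Both methods agree on the two blocks in the small NIST example used in this post, but they can diverge on larger blocks, so a GF(2) routine is the safer choice if you want ranks that follow the NIST definition.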

If our P-value is less than our threshold, then we have a non-random sequence. Let’s test it with the simple example given in the NIST documentation to ensure we implemented things correctly:

eps = '01011001001010101101'
X = np.array([int(i) for i in eps])
p = binMatrixRankTest(X, M=3)
H0 = p > 0.01
print("Binary Matrix Rank Test\n"+"-"*78)
if H0:
  print(f"Fail to reject the Null Hypothesis (p={p:.3f}) -> random sequence")
else:
  print(f"Reject the Null Hypothesis (p={p:.3f}) -> non-random sequence.")
Binary Matrix Rank Test
Fail to reject the Null Hypothesis (p=0.742) -> random sequence

And it works! Note that in this example, we have a much smaller data set, so we set M=3 for 9-element matrices. This test is also very data hungry. NIST recommends at least 38 matrices to test. If we’re using 32×32 matrices, then that means we’ll need 38 × 32 × 32 = 38,912 data points. That’s roughly 156 years of daily price data!

Only the oldest companies and commodities are going to have that kind of data available (and not likely for free). We’ll press on with this test anyway, but take the results with a grain of salt because we’re violating the data recommendations.
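For reference, the P-value from the small NIST example can be reproduced by hand from the chi-squared formula. The 20-bit sequence splits into two 3×3 blocks, one of full rank and one of rank 2, so N = 2 and F_m = F_{m-1} = 1. The sketch below plugs those counts in and also checks that the incomplete gamma call matches the closed form e^{-\chi^2/2}.

```python
import numpy as np
from scipy.special import gammaincc

# Rank counts for the 20-bit example with 3x3 matrices:
# two blocks, one of full rank (3) and one of rank 2.
N, F_m, F_m1 = 2, 1, 1

chi_sq = ((F_m - 0.2888 * N) ** 2 / (0.2888 * N)
          + (F_m1 - 0.5776 * N) ** 2 / (0.5776 * N)
          + (N - F_m - F_m1 - 0.1336 * N) ** 2 / (0.1336 * N))

p_gamma = gammaincc(1, chi_sq / 2)   # Q(1, chi^2 / 2)
p_exp = np.exp(-chi_sq / 2)          # closed form when a = 1

print(f"chi_sq = {chi_sq:.4f}")
print(f"gammaincc = {p_gamma:.3f}, exp = {p_exp:.3f}")
```

Both routes give the same 0.742 that the function returned above.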

Testing the RWH

With our tests in place, we can get some actual market data and see how well the RWH holds up. To do this properly, we’re going to need a lot of data, so I picked out some indices with a long history, a few old and important commodities, some of the oldest stocks out there, a few currency pairs, and Bitcoin just because.

Data from:

  • Dow Jones
  • S&P 500
  • Gold
  • Oil

We also want to run this against a baseline. For each of these, I’ll be benchmarking the results against NumPy’s binomial sampling algorithm, which should have a high degree of randomness.

I relied only on free sources so you can replicate this too, but more and better data is going to be found in paid subscriptions. I have defined a data_catalogue as a dictionary below which will contain symbols, data sources, and the like so our code knows where to go to get the data.

data_catalogue = {
    'DJIA': {
        'source': 'csv',
        'symbol': 'DJIA',
        'url': '^dji&i=d'
    },
    'S&P500': {
        'source': 'csv',
        'symbol': 'SPX',
        'url': '^spx&i=d'
    },
    'WTI': {
        'source': 'yahoo',
        'symbol': 'CL=F'
    },
    'Gold': {
        'source': 'yahoo',
        'symbol': 'GC=F'
    },
    'GBP': {
        'source': 'yahoo',
        'symbol': 'GBPUSD=X'
    },
    'BTC': {
        'source': 'yahoo',
        'symbol': 'BTC-USD'
    }
}

Now we’ll tie all of this together into a TestBench class. This will take our data catalogue, reshape it, and run our tests. The results are going to be collected for analysis, and I wrote a helper function to organize it into a large, Pandas dataframe for easy viewing.

import pandas as pd
import pandas_datareader as pdr
import yfinance as yf
from datetime import datetime

class TestBench:

  data_catalogue = data_catalogue

  test_names = ['runs-test', 'dft-test', 'bmr-test']

  def __init__(self, p_threshold=0.05, seed=101, 
               dftThreshold=0.95, bmrRows=32):
    self.seed = seed
    self.p_threshold = p_threshold
    self.dftThreshold = dftThreshold
    self.bmrRows = bmrRows
    self.years = [1, 4, 7, 10]
    self.trading_days = 250
    self.instruments = list(self.data_catalogue.keys())

  def getData(self):
    self.data_dict = {}
    for instr in self.instruments:
      try:
        data = self._getData(instr)
      except Exception as e:
        print(f'Unable to load data for {instr}')
        # Skip instruments that failed to load
        continue
      self.data_dict[instr] = data.copy()
    self.data_dict['baseline'] = np.random.binomial(1, 0.5, 
      size=self.trading_days * max(self.years) * 10)

  def _getData(self, instr):
    source = self.data_catalogue[instr]['source']
    sym = self.data_catalogue[instr]['symbol']
    if source == 'yahoo':
      return self._getYFData(sym)
    elif source == 'csv':
      return self._getCSVData(self.data_catalogue[instr]['url'])
    elif source == 'fred':
      return self._getFREDData(sym)

  def _getCSVData(self, url):
    data = pd.read_csv(url)
    close_idx = [i 
      for i, j in enumerate(data.columns) if j.lower() == 'close']
    assert len(close_idx) == 1, f"Can't match column names.\n{data.columns}"
    try:
      std_data = self._standardizeData(data.iloc[:, close_idx[0]])
    except Exception as e:
      raise ValueError(f"{url}")
    return std_data

  def _getYFData(self, sym):
    yfObj = yf.Ticker(sym)
    data = yfObj.history(period='max')
    std_data = self._standardizeData(data)
    return std_data

  def _getFREDData(self, sym):
    data = pdr.DataReader(sym, 'fred')
    data.columns = ['Close']
    std_data = self._standardizeData(data)
    return std_data

  def _standardizeData(self, df):
    # Converts data from different sources into np.array of price changes
    try:
      return df['Close'].diff().dropna().values
    except KeyError:
      return df.diff().dropna().values

  def runTests(self):
    self.test_results = {}
    for k, v in self.data_dict.items():
      self.test_results[k] = {}
      for t in self.years:
        self.test_results[k][t] = {}
        data = self._reshapeData(v, t)
        if data is None:
          # Insufficient data
          continue

        self.test_results[k][t]['runs-test'] = np.array(
          [self._runsTest(x) for x in data])
        self.test_results[k][t]['dft-test'] = np.array(
          [self._dftTest(x) for x in data])
        self.test_results[k][t]['bmr-test'] = np.array(
          [self._bmrTest(x) for x in data])

        print(f"Years = {t}\tSamples = {data.shape[0]}")

  def _reshapeData(self, X, years):
    d = int(self.trading_days * years) # Days per sample
    N = int(np.floor(X.shape[0] / d)) # Number of samples
    if N == 0:
      return None
    return X[-N*d:].reshape(N, -1)

  def _dftTest(self, data):
    return DFTTest(data, self.dftThreshold)

  def _runsTest(self, data):
    return RunsTest(data)

  def _bmrTest(self, data):
    return binMatrixRankTest(data, self.bmrRows)

  def tabulateResults(self):
    # Tabulate results
    table = pd.DataFrame()
    row = {}
    for k, v in self.test_results.items():
      row['Instrument'] = k
      for k1, v1 in v.items():
        row['Years'] = k1
        for k2, v2 in v1.items():
          pass_rate = sum(v2>self.p_threshold) / len(v2) * 100
          row['Test'] = k2
          row['Number of Samples'] = len(v2)
          row['Pass Rate'] = pass_rate
          row['Mean P-Value'] = v2.mean()
          row['Median P-Value'] = np.median(v2)
          table = pd.concat([table, pd.DataFrame(row, index=[0])])
    return table

We can initialize our test bench and call the getData() and runTests() methods to put it all together. The tabulateResults() method will give us a nice table for viewing.

When we run our tests, we have a print out for the number of years and full samples of data we have. You’ll notice that for some of these (e.g. Bitcoin) we just don’t have a great amount of data to go off of, but we’ll do our best with what we do have.

tests = TestBench()
tests.getData()
tests.runTests()
Years = 1	Samples = 129
Years = 4	Samples = 32
Years = 7	Samples = 18
Years = 10	Samples = 12
Years = 1	Samples = 154
Years = 4	Samples = 38
Years = 7	Samples = 22
Years = 10	Samples = 15
Years = 1	Samples = 21
Years = 4	Samples = 5
Years = 7	Samples = 3
Years = 10	Samples = 2
Years = 1	Samples = 20
Years = 4	Samples = 5
Years = 7	Samples = 2
Years = 10	Samples = 2
Years = 1	Samples = 18
Years = 4	Samples = 4
Years = 7	Samples = 2
Years = 10	Samples = 1
Years = 1	Samples = 10
Years = 4	Samples = 2
Years = 7	Samples = 1
Years = 10	Samples = 1
Years = 1	Samples = 100
Years = 4	Samples = 25
Years = 7	Samples = 14
Years = 10	Samples = 10

We have 129 years of Dow Jones data, which gives us 12 ten-year samples, and 154 years for the S&P 500 (the index doesn’t go back that far, but our data source provides monthly data going back to 1789). This is in contrast to most of our other instruments, which have two decades or less.
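The sample counts follow directly from how the series is chopped into non-overlapping windows. As a minimal sketch, assume a series of exactly 129 × 250 daily changes (a stand-in for the Dow Jones data; the real series length will differ slightly) and apply the same logic as _reshapeData:

```python
import numpy as np

trading_days = 250
# Stand-in for 129 years of daily price changes (the values are irrelevant here)
X = np.arange(129 * trading_days)

for years in [1, 4, 7, 10]:
  d = trading_days * years              # observations per sample
  N = X.shape[0] // d                   # number of complete samples
  samples = X[-N * d:].reshape(N, -1)   # keep the most recent N*d points
  print(f"Years = {years}\tSamples = {samples.shape[0]}")
```

Slicing from the end means any leftover observations are dropped from the oldest part of the history, so every sample is a full window and the most recent data is always used.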

To take a look at the results, we can run the tabulateResults() method, and do some pivoting to reshape the data frame for easier viewing.

table = tests.tabulateResults()
pivot = table.pivot_table(index=['Instrument', 'Years'], columns='Test')
samps = pivot['Number of Samples'].drop(['bmr-test', 'dft-test'], axis=1)
pivot.drop(['Number of Samples'], axis=1, inplace=True)
pivot['Number of Samples'] = samps

Let’s start with the baseline.

As expected, NumPy’s random number generator is pretty good, and it passes most of the tests without issue. The median P-values for the runs and DFT tests remain fairly high as well, although they are lower for the BMR test. Note also that the 1- and 4-year BMR tests didn’t return any values because we were unable to complete a single 32×32 matrix with such small sample sizes. Overall, the lack of data for the BMR test makes the results here dubious (we could recalculate it with a smaller matrix size, but we’d need to recalibrate all of the probabilities for these different matrices).

The DFT test showed randomness for most cases in our test set. For what it’s worth, the P-values for our DFT tests of all sizes remained fairly high regardless of the sample size.

The runs test provides the most varied and interesting results.

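import matplotlib.pyplot as plt

# Assumption: use Matplotlib's default color cycle for the line colors
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

plt.figure(figsize=(12, 8))
for i, instr in enumerate(tests.instruments):
  sub = table.loc[(table['Instrument']==instr) &
                  (table['Test']=='runs-test')]
  plt.plot(sub['Years'], sub['Pass Rate'], label=instr, 
           c=colors[i], marker='o')

plt.xlabel('Years')
plt.ylabel('Pass Rate (%)')
plt.title('Runs Test Pass Rate for all Instruments')
plt.legend()
plt.show()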

The runs test tends to produce less random results as the sample length grows. The notable exception is our WTI data, which passes more tests for randomness over longer time frames. However, if we look at our P-values, we do see them falling towards 0 (recall, our null hypothesis is that these are random processes).

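# Assumption: use Matplotlib's default color cycle for the line colors
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
plt.figure(figsize=(12, 8))
for i, instr in enumerate(table['Instrument'].unique()):
  sub = table.loc[(table['Instrument']==instr) &
                  (table['Test']=='runs-test')]
  plt.plot(sub['Years'], sub['Median P-Value'], label=instr, 
           c=colors[i], marker='o')
plt.title('Median P-Values for Runs Test for all Instruments')
plt.legend()
plt.show()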

We added the baseline to this plot to show that it remains high even as the time frame increases, whereas all other values become less random over time. We’re showing P-values here: the probability of seeing a result at least as extreme as the one observed if the process we’re testing really is random. In other words, the lower our values become, the harder it is to reconcile the data with a random process.
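It helps to know what these numbers should look like for a genuinely random source. Under the null hypothesis, P-values are roughly uniformly distributed between 0 and 1, so about 95% of samples should pass at the 5% level and the median should sit near 0.5. The sketch below checks this on simulated coin flips; runs_p_value is a hypothetical, self-contained restatement of the runs test, and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np
from scipy.special import erfc

def runs_p_value(bits):
  """Runs-test P-value for a 0/1 array, following the steps above."""
  n = len(bits)
  pi = bits.mean()
  if abs(pi - 0.5) >= 2 / np.sqrt(n):
    return 0.0                                  # failed the frequency pre-test
  V_n = np.sum(bits[1:] != bits[:-1]) + 1
  num = abs(V_n - 2 * n * pi * (1 - pi))
  den = 2 * pi * (1 - pi) * np.sqrt(2 * n)
  return erfc(num / den)

rng = np.random.default_rng(101)                # fixed seed for repeatability
p_values = np.array([runs_p_value(rng.integers(0, 2, size=1000))
                     for _ in range(2000)])

pass_rate = np.mean(p_values > 0.05) * 100
print(f"Pass rate at the 5% level: {pass_rate:.1f}%")
print(f"Median P-value: {np.median(p_values):.2f}")
```

This is the yardstick for the market results: a pass rate well below 95%, or a median P-value drifting toward 0, is what separates an instrument from the baseline.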

This downward sloping trend may provide evidence that supports the value of longer-term trading.

Jerry Parker, for example, has moved toward longer-term trend signals (e.g. >200 day breakouts) because the short term signals are no longer profitable in his system. Data is going to be limited, but it could be interesting to run this over multiple, overlapping samples as in a walk forward analysis to see if randomness in the past was lower during shorter time frames. Additionally, there are more statistical tests we could look at to try to tease this out.

Death of the Random Walk Hypothesis?

The evidence from these few tests is mixed. Some tests show randomness, others provide an element of predictability. Unfortunately, we can’t definitively say the RWH is dead (although I think it, and the theories it is based on, are more articles of academic faith than anything).

To improve our experiment we need more data and more tests. We also used a series of binary tests, although technically the RWH asserts that the changes in price are normally distributed, so statistical tests that look for these patterns could strengthen our methodology and lead to more robust conclusions.

If you’d like to see more of this, drop us a note and let us know what you think!

Why you need more data than you think in your Backtest

How many years does it take before you can be confident in a trading strategy?

Does one great year mean you have a tremendous strategy? Does one bad year mean you should pack it up and try something else? How soon can you tell that a system is flawed and needs changing?

These aren’t easy questions, but they’re incredibly important to any investor, whether you’re systematic or not!

While we can’t give hard and fast rules – because there aren’t any – we can outline a series of principles based on your trading and explain why we try to provide as much high-quality data as possible in our platforms.

How significant are your results?

Although it is not without its flaws, the Sharpe Ratio remains the standard measure of risk-adjusted returns. A high Sharpe means you have good returns against your baseline with little volatility, something most traders crave. A Sharpe of 0 indicates that your strategy isn’t providing any real value, and a negative Sharpe means you’re destroying value.

With that in mind, Rob Carver laid out an experiment asking how long he would have to run a daily strategy with a given Sharpe Ratio to determine whether it is profitable or just noise.

His results were surprising.

Results taken from Systematic Trading.

Lower Sharpe Ratio strategies took decades to distinguish from noise! Consider the implications.

Many retail traders don’t run proper backtests – even those trying to be systematic – so they jump into a strategy based on some vague ideas of what might/might not work from a guru, or message board, YouTube video, or who knows what. They start trading and maybe they’re doing well enough with a Sharpe Ratio of 0.4 after their first year. Does that mean they have an edge? Well, maybe! But they’re going to need to continue that for 32 more years before they can be sure!

The other thing to note is that strategies with higher Sharpe Ratios are easier to distinguish from noise. They stand out more, so if you have a system that turns in some excellent years (e.g. Sharpe > 1) then it’s likely that you’ve got something great.

If you’re trading daily signals, it seems that you’re going to want at least 20 years of history to properly test a strategy.
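A rough back-of-the-envelope version of this reasoning: if returns are assumed independent and normally distributed, an annualized Sharpe Ratio estimated over T years has a standard error of roughly 1/\sqrt{T}. Requiring the estimate to sit z standard errors above zero then gives T \approx (z / SR)^2. This is a textbook approximation and not the method behind the figures quoted above, so the numbers will not match exactly; years_to_significance is a hypothetical helper name.

```python
def years_to_significance(sharpe, z=1.96):
  """Approximate years of data needed for an annualized Sharpe ratio
  to sit z standard errors above zero (assumes i.i.d. normal returns)."""
  if sharpe <= 0:
    raise ValueError("sharpe must be positive")
  return (z / sharpe) ** 2

for sr in [0.25, 0.5, 1.0, 2.0]:
  print(f"Sharpe {sr:>4}: ~{years_to_significance(sr):.1f} years")
```

The inverse-square relationship is the key point: halving the Sharpe Ratio quadruples the amount of history needed, which is why low-Sharpe strategies take decades to separate from noise.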

Test in multiple regimes

Never confuse genius with luck and a bull market

John Bogle

Markets go through bull and bear phases, moving up, down, and sideways, with periods of high volatility and low volatility. If you build a strategy and only test it on one type of regime, then you’re setting yourself up for some nasty surprises when the market inevitably turns!

Extending your backtest and getting more historical data is key to ensuring that your strategy can hold up in these different environments.

What if you are trading something that doesn’t have a long history?

In a case like this, you have a few options:

  1. Find a comparable proxy that does have a long history to see how your model performs. Are there correlated instruments such as equities in the same industry (e.g. energy companies) or commodities with similar drivers (e.g. gold and silver)?
  2. Generate simulated data to test your ideas. This requires you to generalize from the statistics of the data you’re working with to simulate additional time series data to see how your strategy performs. An advantage of this approach is that you can tweak some of those statistics or create different trends and scenarios to build broader tests. Caution needs to be exercised in this approach because you may be fitting a system on something that has no link to reality.
  3. Don’t trade it.
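For the second option, a minimal sketch of simulating extra history might look like the following. It assumes daily log returns are independent and normally distributed with the same mean and standard deviation as the observed series, which is a strong simplification: it bakes in a random walk and discards any serial dependence in the real data. The function name and the short price history are hypothetical.

```python
import numpy as np

def simulate_prices(prices, n_days, n_paths=100, seed=0):
  """Simulate price paths whose daily log returns share the mean and
  standard deviation of the observed series (i.i.d. normal assumption)."""
  prices = np.asarray(prices, dtype=float)
  log_ret = np.diff(np.log(prices))
  mu, sigma = log_ret.mean(), log_ret.std(ddof=1)
  rng = np.random.default_rng(seed)
  sim_ret = rng.normal(mu, sigma, size=(n_paths, n_days))
  # Each path starts from the last observed price
  return prices[-1] * np.exp(np.cumsum(sim_ret, axis=1))

# Hypothetical short price history, for illustration only
history = np.array([100.0, 101.5, 100.8, 102.3, 103.0, 102.1, 104.2])
paths = simulate_prices(history, n_days=250, n_paths=5)
print(paths.shape)
```

Adjusting mu or sigma before simulating is one way to build the alternative scenarios mentioned above, for example a higher-volatility regime.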

Use High-Quality Data

For most of our tutorials, we rely on the yfinance Python API because it’s free, easy to use, and generally has reliable data. Except when it doesn’t.

In fact, a lot of free data sources are mostly reliable, but occasionally, you’ll run into big problems.

Take the first 5 years of this single-instrument backtest below:

The high-level statistics looked tremendous, too good to be true in fact. Plotting the equity curve shows why.

Early in the backtest, there’s a major jump in returns as a leveraged short position makes an absolute killing. Looking at the data, we see a (fictional) overnight drop of 94%.

Simply adjusting the starting point to begin after the anomaly shows that the strategy doesn’t add much above the baseline.

Data quality makes a big difference, but even paid sources aren’t perfect – although they tend to be much better.

To increase your confidence in your data quality, you could use multiple, independent sources to check for differences. With three sources, if two agree on a given price and one differs, take the value from the two that agree. If all three differ, average them. And if two of the three are missing a value, rely on the remaining source to fill it in.

It won’t ensure your data is flawless, but will greatly reduce the odds of a data error being introduced.
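The reconciliation rules above can be sketched in a few lines. This is an illustrative helper rather than a production data pipeline: consensus_price and its tolerance are hypothetical, and the case where exactly two sources are present but disagree is not covered by the rules above, so averaging them is an assumption.

```python
import math

def consensus_price(quotes, rel_tol=1e-6):
  """Combine up to three quotes for the same bar into one price.
  Missing quotes are passed as None; returns None if all are missing."""
  valid = [q for q in quotes if q is not None and not math.isnan(q)]
  if not valid:
    return None
  if len(valid) == 1:
    return valid[0]                      # only one source has a value
  # If any two sources agree, take their value
  for i in range(len(valid)):
    for j in range(i + 1, len(valid)):
      if math.isclose(valid[i], valid[j], rel_tol=rel_tol):
        return valid[i]
  # No agreement: fall back to the average of what is available
  return sum(valid) / len(valid)

print(consensus_price([101.2, 101.2, 95.0]))   # two agree
print(consensus_price([101.0, 102.0, 103.0]))  # all differ
print(consensus_price([None, 101.2, None]))    # single source
```

Applying this bar by bar across the aligned series gives a single cleaned price history, and logging the bars where sources disagree is a cheap way to find the anomalies worth inspecting by hand.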


Best practice dictates that only some of your data be used for fitting your parameters (e.g. tweaking lookback periods, stop loss levels, etc.) and the remaining be used for testing. The first portion is called in-sample data while the latter is out-of-sample.

The idea is that the out-of-sample data provides you a chance to see what your system is going to do on new data that it has not been calibrated to trade. The stats should be worse, but not significantly (unless you over-fit on your in-sample data). This is designed to give you a better estimate for future performance.

Unfortunately, this requires even more data to complete effectively.

Frequently you’ll see recommendations for 70-80% of your data to be used as in-sample data, with the remaining 20-30% as out-of-sample data.

Another way to deal with this is by using cross-validation techniques like walk forward optimization. This allows you to optimize and test on subsets of your data and choose the best.
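Both ideas are easy to sketch. The snippet below shows a chronological 70/30 split and a simple rolling walk-forward scheme; the function names, window sizes, and the stand-in price array are all assumptions chosen for illustration, and real walk-forward setups vary (for example, anchored rather than rolling training windows).

```python
import numpy as np

def in_out_split(data, in_sample_frac=0.7):
  """Chronological split: earliest data in-sample, most recent out-of-sample."""
  cut = round(len(data) * in_sample_frac)
  return data[:cut], data[cut:]

def walk_forward_windows(n_obs, train_size, test_size):
  """Yield (train_idx, test_idx) pairs that roll forward through time."""
  start = 0
  while start + train_size + test_size <= n_obs:
    train_idx = np.arange(start, start + train_size)
    test_idx = np.arange(start + train_size, start + train_size + test_size)
    yield train_idx, test_idx
    start += test_size               # step forward by one test window

prices = np.arange(2500)             # stand-in for 10 years of daily data
ins, oos = in_out_split(prices)
print(len(ins), len(oos))

windows = list(walk_forward_windows(len(prices), train_size=1000, test_size=250))
print(len(windows))
```

The split is never shuffled: the test data always comes after the data used for fitting, so the strategy is only ever evaluated on prices it could not have seen.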

Long Data Bias

Can we ever have enough data?

On the extreme end, we have funds that go back to the 1880s to better understand their strategies, or Renaissance Technologies, which collected price data from the 1700s.

How much is enough is going to depend on your goals and whether you’re really getting value from adding 1959 to your time series that already goes back to 1960. There is a law of diminishing returns that will eventually kick in for most investors.

Regardless, data is our raw material and we frequently need more of this resource than we think.

Your Free Data is Costing You Money

Garbage in. Garbage out. This old adage holds for all areas of decision sciences, including backtesting your investment strategies.

Years of working with financial data — and directly in the data industry — has revealed deep issues with many data providers, and especially the freely available sources. Errors are everywhere, which could make a great strategy look terrible, or worse, a losing strategy look highly profitable.

The Financial Data Pipeline

If you open up your favorite trading platform or brokerage account, you’re confronted with a series of quotes composed of red or green numbers updating by the second. For individual stocks and securities, they represent the price at which the most recent transaction was settled. For indices (such as shown below) they are the aggregation of the most recent transactions of all of the securities that make up the index.

These transactions are recorded by the exchanges and sold to data providers, who in turn offer the data along with their APIs or software packages to traders, institutions, and others. There are a lot of free data providers out there as well, which is often where most algorithmic traders start.

Pitfalls of Free Data Sources

Free data is a great place to start — we use free data sources in our tutorials because it’s easy and accessible for people — but we would never trust our hard-earned cash to a strategy operating on an algorithm that relies on free data. There are a few reasons for this.

Many free data sources have limited histories. For a good algorithmic approach, we want as much data as possible, which means going back in time as far as possible so that we’re able to test our approaches against a wide range of markets. 5 or 10 years of data just doesn’t cut it.

Free data sources may become obsolete or move to a premium model. If this happens, your algorithm is suddenly going to be cut off which could lead to missed trades. Most professional sources are loath to change their systems because their customers depend on consistent and reliable data feeds to build their businesses (this can be seen when sampling professional data systems and finding a lot of UIs that were clearly built for Windows 95…but they still work!).

Data inaccuracies are frequent. It’s hard to keep up with thousands of companies and their corporate changes, so stock splits, dividend payments, and the like which need to get propagated into historical data frequently get passed over. Additionally, rounding errors can compound the farther back in time you look.

Stock tickers often get re-purposed after de-listing and many free sources either don’t keep records of these de-listed stocks or only allow look-ups via the ticker. If this isn’t properly accounted for, then you could introduce survivorship bias into your backtests by only testing strategies against companies that have survived over the years. This has the effect of inflating your results and hiding risk.

Free-data stalwarts like Yahoo! Finance have gone through all of these issues, restricting data by changing business models; having APIs suddenly break with new updates; miscalculating dividends, splits, and the like; rounding payouts which leads to errors as data gets propagated into the past; and dropping de-listed stocks causing survivorship bias in backtests.

Professional Data Sources

This isn’t an ad for buying data from a vendor — triangulating multiple free data sources and making regular updates can help fix these issues, but that’s a lot of work that may be better spent doing research and running tests. It’s better to start with good, high quality data and build from there rather than spending heaps of time chasing down discrepancies, building scrapers, and patching APIs.

Let us handle that for you at Raposa Technologies where we’re building a platform to make quantitative investing easily available. We’ve done the hard work of vetting our data and vendors, giving you access to professional backtesting capabilities to build your own strategies that you can be confident in.