A mainstay of academic research into the market is the Random Walk Hypothesis (RWH). This is the idea that market moves are random and follow a normal distribution that can be easily described using a concept borrowed from physics called Brownian Motion.

This makes the market mathematics manageable, but is it true? Is the market really random?

If it is, then there’s little point to trying to beat it. But if it isn’t, then there are repeatable patterns that can be algorithmically exploited.

Thankfully, the issue of randomness is very important for fields like cryptography, so it is well studied and there are statistical tests that we can apply to market data to investigate this.

We’re going to borrow a few standard tests for randomness and apply them to historical data to see just how random the markets really are.

Measuring Market Randomness

There are a host of randomness tests that have been developed over the years which look at binary sequences to determine whether or not a random process was used to generate these values. Notably, we have test suites such as the Diehard Tests, TestU01, NIST tests and others that have been published over the years.

We could run a large battery of tests (maybe we’ll get to that in a future post) to test our market data, but for now, we’ll just select three tests from the NIST suite to see how the RWH holds up: the runs test, the Discrete Fourier Transform test, and the Binary Matrix Rank test.

Runs Test

If the market truly is random, then we shouldn’t see any dependence on previous prices; the market being up today should have no impact on what it will do tomorrow (and vice versa).

The runs test can help us look at this aspect of randomness. It works by counting the total number of runs (unbroken streaks of positive or negative moves) in a sequence and checking whether that count is consistent with what a random process would produce.

We’ll take our prices and make all positive price changes into 1s and negative changes into 0s, and keep this binary vector as X. We’ll set n as the number of observations we have (e.g. n = len(X)). Then, to implement the runs test, we take the following steps (adapted from section 2.3 of the NIST Statistical Test Suite):

  1. Compute the proportion of 1s in the binary sequence:
$$\pi = \frac{\sum_j X_j}{n}$$

2. Check the value π against the frequency test. It passes if ∣π−1/2∣<τ, where τ=2/n^(1/2). If the frequency test fails, we can stop: the sequence is not random, and we set our P-value to 0. If it passes, we continue to step 3.

3. Compute our test statistic Vn​ where:

$$V_n = \sum_{k=1}^{n-1} r(k) + 1$$

where r(k)=0 if X_k = X_{k+1}​, otherwise r(k)=1.

So if we have the sequence [0, 1, 0, 0, 0, 1, 1], then this becomes: Vn​=(1+1+0+0+1+0)+1=4

4. Compute our p-value where:

$$p = erfc\bigg( \frac{ \mid V_n - 2n \pi (1 - \pi) \mid}{2 \pi (1-\pi) \sqrt{2n}} \bigg)$$

Note that erfc is the complementary error function, $erfc(z) = \frac{2}{\sqrt{\pi}} \int_z^{\infty} e^{-u^2} du$. Thankfully, this is available in Python with scipy.special.erfc(z).


With all of that, we can now use our p-value to determine whether or not our sequence is random. If our P-value is below our threshold (e.g. 5%), then we reject the null hypothesis, which means we have a non-random sequence on our hands.
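Before the full function, it may help to see steps 1 to 3 on concrete inputs. The sketch below is only an illustration: `count_runs` is a helper name introduced here, not something from the NIST suite. It reproduces the count of 4 for the short sequence in step 3, then computes π, τ, and V_n for the 100-bit example sequence from the NIST documentation.

```python
import numpy as np

def count_runs(bits):
    """Return V_n: the number of changes between neighbours, plus one."""
    bits = np.asarray(bits)
    return int((bits[1:] != bits[:-1]).sum()) + 1

# Short sequence from step 3
toy = [0, 1, 0, 0, 0, 1, 1]
print(count_runs(toy))  # 4

# 100-bit example sequence from the NIST documentation
eps = '1100100100001111110110101010001000100001011010001100001000110100110001001100011001100010100010111000'
X = np.array([int(c) for c in eps])
n = len(X)
pi = X.sum() / n       # step 1: proportion of 1s
tau = 2 / np.sqrt(n)   # step 2: frequency-test tolerance
V_n = count_runs(X)    # step 3: total number of runs
print(n, pi, tau, V_n)  # 100 0.42 0.2 52
```

Here ∣π−1/2∣ = 0.08 is well inside τ = 0.2, so the sequence clears the frequency check and moves on to the p-value in step 4.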

import numpy as np
from scipy.special import erfc

def RunsTest(x):
  # Convert input to binary values
  X = np.where(x > 0, 1, 0)
  n = len(X)
  pi = X.sum() / n
  # Check frequency test
  tau = 2 / np.sqrt(n)
  if np.abs(pi - 0.5) >= tau:
    # Failed frequency test
    return 0
  r_k = X[1:] != X[:-1]
  V_n = r_k.sum() + 1
  num = np.abs(V_n - 2 * n * pi * (1 - pi))
  den = 2 * pi * (1 - pi) * np.sqrt(2 * n)
  return erfc(num / den)

The NIST documentation gives us some test data to check that our function is working properly, so let's drop that into our function and see what happens.

# eps from NIST doc
eps = '1100100100001111110110101010001000100001011010001100001000110100110001001100011001100010100010111000'
x = np.array([int(i) for i in eps])

p = RunsTest(x)
H0 = p > 0.01
# NIST P-value = 0.500798
print("Runs Test\n"+"-"*78)
if H0:
  print(f"Fail to reject the Null Hypothesis (p={p:.3f}) -> random sequence")
else:
  print(f"Reject the Null Hypothesis (p={p:.3f}) -> non-random sequence.")
Runs Test
Fail to reject the Null Hypothesis (p=0.501) -> random sequence

We get the same p-value, so we can be confident that our implementation is correct. Note also that NIST recommends we have at least 100 samples in our data for this test to be valid (i.e. n≥100).
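Market data arrives as prices rather than bits, so the conversion described at the start of this section matters. The snippet below is a minimal sketch on made-up prices (the values are purely illustrative). One design choice to be aware of: with the `x > 0` rule used in `RunsTest`, an unchanged close is coded as 0, the same as a down day.

```python
import numpy as np

# Hypothetical closing prices, for illustration only
prices = np.array([100.0, 101.0, 101.0, 99.0, 102.0])

changes = np.diff(prices)            # day-over-day price changes
bits = np.where(changes > 0, 1, 0)   # 1 = up day, 0 = down or flat day

print(changes)  # [ 1.  0. -2.  3.]
print(bits)     # [1 0 0 1]
```

If flat days are common in a data set, it may be worth dropping them before testing rather than lumping them in with the down days.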

Discrete Fourier Transformation Test

Our next test is the Discrete Fourier Transformation (DFT) test.

This test computes a Fourier Transform on the data and looks at the peak heights. If there are too many high peaks, it indicates we aren’t dealing with a random process. It would take us too far afield to dive into the specifics of Fourier Transforms, but check out this post if you’re interested in going deeper.

Let’s get to the NIST steps. Our inputs are the data (x) and a threshold, which is usually set to 95%.

  1. We need to convert our time-series x into a sequence of 1s and -1s for positive and negative deviations. This new sequence is called $\hat{x}$.
  2. Apply discrete Fourier Transform (DFT) to $\hat{x}$:
$$\Rightarrow S = DFT(\hat{x})$$

3. Calculate M=modulus(S′)=∣S′∣, where S′ is the first n/2 elements in S and the modulus yields the height of the peaks.

4. Compute the 95% peak height threshold value. If we are assuming randomness, then 95% of the values obtained from the test should not exceed T.

$$T = \sqrt{n\textrm{log}\frac{1}{0.05}}$$

5. Compute N_0 = 0.95n/2​, where N_0​ is the theoretical number of peaks (95%) that are less than T (e.g. if n=10, then N_0​=10×0.95/2​=4.75).

6. Compute N_1, which is the actual number of observed peaks in M that are less than T.

7. Compute d:

$$d = \frac{N_1 - N_0}{\sqrt{n(0.95)(0.05)/4}}$$

8. Compute the p-value using the erfc function:

$$P = erfc \bigg( \frac{\left| d \right|}{\sqrt{2}} \bigg)$$

Just like we did above, we're going to compare our p-value to our reference level and see if we can reject the null hypothesis - that we have a random sequence - or not. Note too that it is recommended that we use at least 1,000 inputs (n ≥ 1000) for this test.
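As a worked illustration of steps 4 to 8, take n = 100 and suppose, purely hypothetically, that we observe N_1 = 48 peaks below the threshold. The log is the natural logarithm.

$$T = \sqrt{100 \log 20} \approx 17.31, \qquad N_0 = \frac{0.95 \times 100}{2} = 47.5$$

$$d = \frac{48 - 47.5}{\sqrt{100(0.95)(0.05)/4}} = \frac{0.5}{\sqrt{1.1875}} \approx 0.459, \qquad P = erfc \bigg( \frac{0.459}{\sqrt{2}} \bigg) \approx 0.65$$

A P-value that high is nowhere near a 1% or 5% cutoff, so this hypothetical sequence would not be flagged as non-random.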

Let's turn to our Python implementation:

def DFTTest(x, threshold=0.95):
  n = len(x)
  # Convert to binary values
  X = np.where(x > 0, 1, -1)
  # Apply DFT
  S = np.fft.fft(X)
  # Calculate Modulus
  M = np.abs(S[:int(n/2)])
  T = np.sqrt(n * np.log(1 / (1 - threshold)))
  N0 = threshold * n / 2
  N1 = len(np.where(M < T)[0])
  d = (N1 - N0) / np.sqrt(n * (1-threshold) * threshold / 4)
  # Compute P-value
  return erfc(np.abs(d) / np.sqrt(2))

NIST gives us some sample data to test our implementation here too.

# Test sequence from NIST
eps = '1100100100001111110110101010001000100001011010001100001000110100110001001100011001100010100010111000'
x = np.array([int(i) for i in eps])
p = DFTTest(x)
H0 = p > 0.01
print("DFT Test\n"+"-"*78)
if H0:
  print(f"Fail to reject the Null Hypothesis (p={p:.3f}) -> random sequence")
else:
  print(f"Reject the Null Hypothesis (p={p:.3f}) -> non-random sequence.")
DFT Test
Fail to reject the Null Hypothesis (p=0.646) -> random sequence
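It can also be reassuring to confirm that the test rejects something that is obviously not random. The sketch below restates the steps in a compact, self-contained helper (`dft_p_value` is a name chosen here for illustration) and feeds it a made-up square wave of five up days followed by five down days.

```python
import numpy as np
from scipy.special import erfc

def dft_p_value(x, threshold=0.95):
    """Compact restatement of the DFT test steps described above."""
    n = len(x)
    X = np.where(np.asarray(x) > 0, 1, -1)
    M = np.abs(np.fft.fft(X)[: n // 2])            # peak heights
    T = np.sqrt(n * np.log(1 / (1 - threshold)))   # peak height threshold
    N0 = threshold * n / 2                         # expected peaks below T
    N1 = np.sum(M < T)                             # observed peaks below T
    d = (N1 - N0) / np.sqrt(n * threshold * (1 - threshold) / 4)
    return erfc(np.abs(d) / np.sqrt(2))

# Square wave with period 10: five 1s, then five 0s, repeated (n = 1000)
periodic = np.tile([1] * 5 + [0] * 5, 100)
p = dft_p_value(periodic)
print(f"p = {p:.2e}")
```

The p-value comes out vanishingly small. The reason is that a periodic signal concentrates its energy in a handful of very tall peaks, leaving nearly all of the remaining peaks below T. Because the statistic uses ∣d∣, the test flags a sequence with too few peaks above the threshold just as readily as one with too many.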

Binary Matrix Rank Test

We'll choose one last test out of the test suite - the Binary Matrix Rank Test. It looks for linear dependence among fixed-length chunks of the sequence.

1. Divide the sequence into 32 by 32 blocks. We'll have $N$ total blocks to work with and discard any data that doesn't fit nicely into our 32x32 blocks. Each block will be a matrix consisting of our ordered data. A quick example will help illustrate. Say we have a set of 10 binary data points, `X = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]`, and we use 2x2 matrices (to make it easy) instead of 32x32. We'll divide this data into two blocks and discard the last two data points. So we have two blocks (B1 and B2) that now look like:

$$B_1 = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} $$ $$B_2 = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix} $$

2. We determine the rank of each binary matrix. If you're not familiar with the procedure, check out this notebook here for a great explanation. In Python, we can simply use the np.linalg.matrix_rank() function to compute it quickly.

3. Now that we have the ranks, we're going to count the number of full rank matrices (if we have 32x32 matrices, then a full rank matrix has a rank of 32) and call this number F_m. Then we'll get the number of matrices with rank one less than full rank which will be F_{m-1}. We'll use N to denote the total number of matrices we have.

4. Now, we compute the Chi-squared value for our data with the following equation:

$$\chi^2 = \frac{(F_m - 0.2888N)^2}{0.2888N} + \frac{(F_{m-1} - 0.5776N)^2}{0.5776N} + \frac{(N - F_m - F_{m-1} - 0.1336N)^2}{0.1336N}$$

5. Calculate the P-value using the Incomplete Gamma Function, Q(1,χ^2/2​):

$$P = Q \bigg(1, \frac{\chi^2}{2} \bigg) = \frac{1}{\Gamma(1)} \int_{\chi^2/2}^{\infty} t^{1-1} e^{-t} \, dt = e^{-\frac{\chi^2}{2}}$$

Scipy makes this last bit easy with a simple function call to scipy.special.gammaincc().
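Since the first argument is always 1 here, the incomplete gamma function collapses to a plain exponential, which is easy to confirm numerically. The sketch below checks that identity and then runs steps 4 and 5 with illustrative counts (N = 2 blocks, one of full rank and one of rank one less), which are the counts that arise in the small NIST example.

```python
import numpy as np
from scipy.special import gammaincc

# Identity check: for a = 1, Q(1, x) is simply exp(-x)
x = 0.3
print(gammaincc(1, x), np.exp(-x))

# Steps 4 and 5 with illustrative counts
N, F_m, F_m1 = 2, 1, 1
chi_sq = ((F_m - 0.2888 * N) ** 2 / (0.2888 * N)
          + (F_m1 - 0.5776 * N) ** 2 / (0.5776 * N)
          + (N - F_m - F_m1 - 0.1336 * N) ** 2 / (0.1336 * N))
p = gammaincc(1, chi_sq / 2)
print(f"{chi_sq:.6f} {p:.6f}")  # 0.596953 0.741948
```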

Don't be intimidated by this! It's actually straightforward to implement.

from scipy.special import gammaincc

def binMatrixRankTest(x, M=32):
  X = np.where(x > 0, 1, 0)
  n = len(X)
  N = np.floor(n / M**2).astype(int)
  # Create blocks
  B = X[:N * M**2].reshape(N, M, M)
  ranks = np.array([np.linalg.matrix_rank(b) for b in B])
  F_m = len(np.where(ranks==M)[0])
  F_m1 = len(np.where(ranks==M - 1)[0])
  chi_sq = (F_m - 0.2888 * N) ** 2 / (0.2888 * N) \
    + (F_m1 - 0.5776 * N) ** 2 / (0.5776 * N) \
    + (N - F_m - F_m1 - 0.1336 * N) ** 2 / (0.1336 * N)
  return gammaincc(1, chi_sq / 2)
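One caveat on step 2: `np.linalg.matrix_rank` computes rank over the real numbers, whereas the NIST test is defined in terms of binary rank (arithmetic mod 2, where addition is XOR). The two can disagree for some 0/1 matrices. The helper below, `gf2_rank`, is a hypothetical name and a minimal sketch of Gaussian elimination mod 2; it is not code from NIST, and swapping it into `binMatrixRankTest` may change the p-values reported later in this post.

```python
import numpy as np

def gf2_rank(block):
    """Rank of a 0/1 matrix over GF(2), via Gaussian elimination with XOR."""
    m = np.array(block, dtype=np.uint8) % 2
    n_rows, n_cols = m.shape
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Rows at or below the current pivot position with a 1 in this column
        candidates = np.nonzero(m[rank:, col])[0]
        if candidates.size == 0:
            continue
        pivot = rank + candidates[0]
        m[[rank, pivot]] = m[[pivot, rank]]   # move the pivot row up
        below = np.nonzero(m[rank + 1:, col])[0] + rank + 1
        m[below] ^= m[rank]                   # clear the column below the pivot
        rank += 1
    return rank

A = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])
print(np.linalg.matrix_rank(A))  # 3 (rank over the reals)
print(gf2_rank(A))               # 2 (row 3 = row 1 XOR row 2)
```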

If our p-value is less than our threshold, then we have a non-random sequence. Let's test it with the simple example given in the NIST documentation to ensure we implemented things correctly:

eps = '01011001001010101101'
X = np.array([int(i) for i in eps])
p = binMatrixRankTest(X, M=3)
H0 = p > 0.01
print("Binary Matrix Rank Test\n"+"-"*78)
if H0:
  print(f"Fail to reject the Null Hypothesis (p={p:.3f}) -> random sequence")
else:
  print(f"Reject the Null Hypothesis (p={p:.3f}) -> non-random sequence.")
Binary Matrix Rank Test
Fail to reject the Null Hypothesis (p=0.742) -> random sequence

And it works! Note that in this example, we have a much smaller data set, so we set M=3 for 9-element matrices. This test is also very data hungry: NIST recommends at least 38 matrices. If we're using 32x32 matrices, then that means we'll need 38x32x32 = 38,912 data points. That's roughly 156 years of daily price data!

Only the oldest companies and commodities are going to have that kind of data available (and not likely for free). We'll press on with this test anyway, but take the results with a grain of salt because we're violating the data recommendations.
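To put the data recommendations for all three tests side by side, here is a small back-of-the-envelope sketch. It assumes 250 trading days per year, the same figure the test bench below uses; the dictionary name is just for illustration.

```python
TRADING_DAYS = 250  # trading days per year (assumption)

# Minimum sample sizes quoted in the text
min_points = {
    'runs-test': 100,
    'dft-test': 1000,
    'bmr-test': 38 * 32 * 32,
}

for name, n in min_points.items():
    print(f"{name}: {n:,} points = {n / TRADING_DAYS:.1f} years")
```

On these numbers, a one-year window of daily data is enough for the runs test, falls short of the DFT recommendation, and is far below what the Binary Matrix Rank test asks for.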

Testing the Random Walk Hypothesis on Market Data

With our tests in place, we can get some actual market data and see how well the RWH holds up. To do this properly, we’re going to need a lot of data, so I picked out some indices with a long history, a few old and important commodities, some of the oldest stocks out there, a few currency pairs, and Bitcoin just because.

Data from:

  • Dow Jones
  • S&P 500
  • Gold
  • Oil (WTI)
  • British pound (GBP/USD)
  • Bitcoin

One more thing to note: we also want to run this against a baseline. For each of these I’ll be benchmarking the results against NumPy’s binomial sampling algorithm, which should have a high degree of randomness.

I relied only on free sources so you can replicate this too, but more and better data is going to be found in paid subscriptions. I have defined a data_catalogue as a dictionary below which will contain symbols, data sources, and the like so our code knows where to go to get the data.

data_catalogue = {
    'DJIA': {
        'source': 'csv',
        'symbol': 'DJIA',
        'url': 'https://stooq.com/q/d/l/?s=^dji&i=d'
    },
    'S&P500': {
        'source': 'csv',
        'symbol': 'SPX',
        'url': 'https://stooq.com/q/d/l/?s=^spx&i=d'
    },
    'WTI': {
        'source': 'yahoo',
        'symbol': 'CL=F'
    },
    'Gold': {
        'source': 'yahoo',
        'symbol': 'GC=F'
    },
    'GBP': {
        'source': 'yahoo',
        'symbol': 'GBPUSD=X'
    },
    'BTC': {
        'source': 'yahoo',
        'symbol': 'BTC-USD'
    }
}

Now we’ll tie all of this together into a TestBench class. This will take our data catalogue, reshape it, and run our tests. The results are going to be collected for analysis, and I wrote a helper function to organize it into a large, Pandas dataframe for easy viewing.

import pandas as pd
import pandas_datareader as pdr
import yfinance as yf
from datetime import datetime

class TestBench:

  data_catalogue = data_catalogue

  test_names = ['runs-test', 'dft-test', 'bmr-test']

  def __init__(self, p_threshold=0.05, seed=101, 
               dftThreshold=0.95, bmrRows=32):
    self.seed = seed
    self.p_threshold = p_threshold
    self.dftThreshold = dftThreshold
    self.bmrRows = bmrRows
    self.years = [1, 4, 7, 10]
    self.trading_days = 250
    self.instruments = list(self.data_catalogue.keys())

  def getData(self):
    self.data_dict = {}
    for instr in self.instruments:
      try:
        data = self._getData(instr)
      except Exception as e:
        print(f'Unable to load data for {instr}: {e}')
        continue  # Skip instruments that failed to load
      self.data_dict[instr] = data.copy()
    self.data_dict['baseline'] = np.random.binomial(1, 0.5, 
      size=self.trading_days * max(self.years) * 10)

  def _getData(self, instr):
    source = self.data_catalogue[instr]['source']
    sym = self.data_catalogue[instr]['symbol']
    if source == 'yahoo':
      return self._getYFData(sym)
    elif source == 'csv':
      return self._getCSVData(self.data_catalogue[instr]['url'])
    elif source == 'fred':
      return self._getFREDData(sym)

  def _getCSVData(self, url):
    data = pd.read_csv(url)
    # Locate the closing price column regardless of capitalization
    close_idx = [i 
      for i, j in enumerate(data.columns) if j.lower() == 'close']
    assert len(close_idx) == 1, f"Can't match column names.\n{data.columns}"
    try:
      std_data = self._standardizeData(data.iloc[:, close_idx[0]])
    except Exception as e:
      raise ValueError(f"Unable to standardize data from {url}") from e
    return std_data

  def _getYFData(self, sym):
    yfObj = yf.Ticker(sym)
    data = yfObj.history(period='max')
    std_data = self._standardizeData(data)
    return std_data

  def _getFREDData(self, sym):
    data = pdr.DataReader(sym, 'fred')
    data.columns = ['Close']
    std_data = self._standardizeData(data)
    return std_data

  def _standardizeData(self, df):
    # Converts data from different sources into np.array of price changes
    try:
      return df['Close'].diff().dropna().values
    except KeyError:
      # Input is already a single series of closing prices
      return df.diff().dropna().values

  def runTests(self):
    self.test_results = {}
    for k, v in self.data_dict.items():
      self.test_results[k] = {}
      for t in self.years:
        self.test_results[k][t] = {}
        data = self._reshapeData(v, t)
        if data is None:
          # Insufficient data
          continue

        self.test_results[k][t]['runs-test'] = np.array([self._runsTest(x) 
          for x in data])
        self.test_results[k][t]['dft-test'] = np.array([self._dftTest(x) 
          for x in data])
        self.test_results[k][t]['bmr-test'] = np.array([self._bmrTest(x)
          for x in data])

        print(f"Years = {t}\tSamples = {data.shape[0]}")

  def _reshapeData(self, X, years):
    d = int(self.trading_days * years) # Days per sample
    N = int(np.floor(X.shape[0] / d)) # Number of samples
    if N == 0:
      return None
    return X[-N*d:].reshape(N, -1)

  def _dftTest(self, data):
    return DFTTest(data, self.dftThreshold)

  def _runsTest(self, data):
    return RunsTest(data)

  def _bmrTest(self, data):
    return binMatrixRankTest(data, self.bmrRows)

  def tabulateResults(self):
    # Tabulate results
    table = pd.DataFrame()
    row = {}
    for k, v in self.test_results.items():
      row['Instrument'] = k
      for k1, v1 in v.items():
        row['Years'] = k1
        for k2, v2 in v1.items():
          pass_rate = sum(v2>self.p_threshold) / len(v2) * 100
          row['Test'] = k2
          row['Number of Samples'] = len(v2)
          row['Pass Rate'] = pass_rate
          row['Mean P-Value'] = v2.mean()
          row['Median P-Value'] = np.median(v2)
          table = pd.concat([table, pd.DataFrame(row, index=[0])])

    return table
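The windowing done by `_reshapeData` is easy to misread, so here is a standalone sketch of the same idea on a toy array. The function name `reshape_windows` is introduced here for illustration only. It keeps the most recent complete windows and drops whatever is left over at the old end of the series.

```python
import numpy as np

def reshape_windows(X, days_per_sample):
    """Split a 1-D array into equal windows, dropping the oldest leftovers."""
    N = X.shape[0] // days_per_sample   # number of complete windows
    if N == 0:
        return None                     # not enough data for one window
    return X[-N * days_per_sample:].reshape(N, -1)

X = np.arange(10)               # stand-in for 10 days of price changes
print(reshape_windows(X, 3))    # three 3-day windows; the oldest value (0) is dropped
print(reshape_windows(X, 20))   # None
```

Each row of the result is then treated as one independent sample by the three tests, which is where the "Samples" counts in the printout come from.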

We can initialize our test bench and call the getData() and runTests() methods to put it all together. The tabulateResults() method will give us a nice table for viewing.

When we run our tests, we have a print out for the number of years and full samples of data we have. You’ll notice that for some of these (e.g. Bitcoin) we just don’t have a great amount of data to go off of, but we’ll do our best with what we do have.

tests = TestBench()
tests.getData()
tests.runTests()
Years = 1	Samples = 129
Years = 4	Samples = 32
Years = 7	Samples = 18
Years = 10	Samples = 12
Years = 1	Samples = 154
Years = 4	Samples = 38
Years = 7	Samples = 22
Years = 10	Samples = 15
Years = 1	Samples = 21
Years = 4	Samples = 5
Years = 7	Samples = 3
Years = 10	Samples = 2
Years = 1	Samples = 20
Years = 4	Samples = 5
Years = 7	Samples = 2
Years = 10	Samples = 2
Years = 1	Samples = 18
Years = 4	Samples = 4
Years = 7	Samples = 2
Years = 10	Samples = 1
Years = 1	Samples = 10
Years = 4	Samples = 2
Years = 7	Samples = 1
Years = 10	Samples = 1
Years = 1	Samples = 100
Years = 4	Samples = 25
Years = 7	Samples = 14
Years = 10	Samples = 10

We have 129 years of Dow Jones data, which gives us twelve 10-year samples, and 154 years for the S&P 500 (the index doesn’t go back that far, but our data source provides monthly data going back to 1789). This is in contrast to most of our other instruments, which have two decades of data or less.

To take a look at the results, we can run the tabulateResults() method, and do some pivoting to reshape the data frame for easier viewing.

table = tests.tabulateResults()
pivot = table.pivot_table(index=['Instrument', 'Years'], columns='Test')
samps = pivot['Number of Samples'].drop(['bmr-test', 'dft-test'], axis=1)
pivot.drop(['Number of Samples'], axis=1, inplace=True)
pivot['Number of Samples'] = samps

Let’s start with the baseline.

As expected, NumPy’s random number generator is pretty good, and it passes most of the tests without issue. The median P-values for the runs and DFT tests remain fairly high as well, although they are lower for the BMR test. Another thing to note, the 1 and 4 year BMR tests didn’t return any values because we were unable to complete a single 32×32 matrix with such small sample sizes. Overall, the lack of data for the BMR test makes the results here dubious (we could recalculate it with a smaller matrix size, but we’d need to recalibrate all of the probabilities for these different matrices).
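It helps to know what "passes most of the tests" should look like. Even a perfectly random source is expected to fail a 5% test roughly one time in twenty, so a pass rate near 95%, not 100%, is the benchmark. The simulation below is a rough sketch of that idea using the runs test; the helper name, the seed, and the number of simulated samples are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.special import erfc

def runs_p_value(bits):
    """Runs test p-value for a 0/1 array (same steps as RunsTest above)."""
    n = len(bits)
    pi = bits.sum() / n
    if abs(pi - 0.5) >= 2 / np.sqrt(n):
        return 0.0  # failed the frequency pre-test
    V_n = (bits[1:] != bits[:-1]).sum() + 1
    num = abs(V_n - 2 * n * pi * (1 - pi))
    den = 2 * pi * (1 - pi) * np.sqrt(2 * n)
    return erfc(num / den)

rng = np.random.default_rng(101)                 # arbitrary seed
samples = rng.integers(0, 2, size=(2000, 250))   # 2000 simulated one-year samples
p_values = np.array([runs_p_value(s) for s in samples])
pass_rate = (p_values > 0.05).mean() * 100
print(f"Pass rate at 5%: {pass_rate:.1f}%")
```

Pass rates for real instruments are best read against this kind of reference rather than against a perfect score.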

The DFT test showed randomness for most cases in our test set. For what it’s worth, the P-values for our DFT tests of all sizes remained fairly high regardless of the sample size.

The runs test provides the most varied and interesting results.

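import matplotlib.pyplot as plt

# Assumed palette: Matplotlib's default color cycle
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

plt.figure(figsize=(12, 8))
for i, instr in enumerate(tests.instruments):
  sub = table.loc[(table['Instrument']==instr) &
                  (table['Test']=='runs-test')]
  plt.plot(sub['Years'], sub['Pass Rate'], label=instr, 
           c=colors[i], marker='o')
plt.xlabel('Years')
plt.ylabel('Pass Rate (%)')
plt.title('Runs Test Pass Rate for all Instruments')
plt.legend()
plt.show()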

The pass rate for the runs test tends to fall as the sample window gets longer, meaning longer windows look less random. The notable exception is our WTI data, which passes more tests for randomness over time. However, if we look at our P-values, we do see them falling towards 0 (recall, our null hypothesis is that these are random processes).

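colors = plt.rcParams['axes.prop_cycle'].by_key()['color']  # Assumed palette: Matplotlib's default cycle
plt.figure(figsize=(12, 8))
for i, instr in enumerate(table['Instrument'].unique()):
  sub = table.loc[(table['Instrument']==instr) &
                  (table['Test']=='runs-test')]
  plt.plot(sub['Years'], sub['Median P-Value'], label=instr, 
           c=colors[i], marker='o')
plt.xlabel('Years')
plt.ylabel('Median P-Value')
plt.title('Median P-Values for Runs Test for all Instruments')
plt.legend()
plt.show()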

We added the baseline to this plot to show that it remains high even as the time frame increases, whereas all other values become less random over time. We’re showing P-values here: the probability of seeing a result at least this extreme if the process we’re testing really is random. In other words, the lower our values become, the stronger the evidence that we don’t have a random process on our hands.

This downward sloping trend may provide evidence that supports the value of longer-term trading.

Jerry Parker, for example, has moved toward longer-term trend signals (e.g. >200 day breakouts) because the short term signals are no longer profitable in his system. Data is going to be limited, but it could be interesting to run this over multiple, overlapping samples as in a walk forward analysis to see if randomness in the past was lower during shorter time frames. Additionally, there are more statistical tests we could look at to try to tease this out.

Death of the Random Walk Hypothesis?

The evidence from these few tests is mixed. Some tests show randomness, others provide an element of predictability. Unfortunately, we can’t definitively say the RWH is dead (although I think it, and the theories it is based on, are more articles of academic faith than anything).

To improve our experiment we need more data and more tests. We also used a series of binary tests, although technically the RWH asserts that the changes in price are normally distributed, so statistical tests that look for these patterns could strengthen our methodology and lead to more robust conclusions.
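As a hint of what such a test could look like, here is a minimal sketch using the Jarque-Bera normality test from SciPy. The data are synthetic stand-ins for daily returns, not market data, and the choice of a Student's t distribution with 3 degrees of freedom for the fat-tailed case is an assumption made purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed

# Synthetic stand-ins for daily returns (not market data)
normal_returns = rng.normal(0, 0.01, size=5000)
fat_tailed_returns = 0.01 * rng.standard_t(3, size=5000)

for name, r in [('normal', normal_returns), ('fat-tailed', fat_tailed_returns)]:
    stat, p = stats.jarque_bera(r)
    print(f"{name}: JB statistic = {stat:.1f}, p = {p:.3g}")
```

The fat-tailed series is rejected decisively, which is the kind of departure from normality this test is designed to pick up. Running the same check on actual price changes would be a natural next step.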

If you’d like to see more of this, drop us a note at hello@raposa.trade and let us know what you think!