Module 1 · Chapter 12 · Lesson 3

Essential Libraries: NumPy, Pandas, Statsmodels, Scikit-Learn

Data Handling with NumPy and Pandas

Mean reversion strategies depend on sound data handling. NumPy provides fast n-dimensional arrays and vectorized numerical functions. Pandas builds on NumPy with labeled data structures (Series and DataFrame) that suit financial time series.

Use NumPy for fast array calculations. For instance, determine a stock's daily returns.

python
import numpy as np

# Sample closing prices for AAPL over 5 days
aapl_prices = np.array([150.00, 151.50, 149.80, 152.10, 150.50])

# Calculate daily returns: (P_t / P_{t-1}) - 1
daily_returns = (aapl_prices[1:] / aapl_prices[:-1]) - 1
print("AAPL Daily Returns:", daily_returns)

Output (values rounded to four decimal places): AAPL Daily Returns: [ 0.01 -0.0112 0.0154 -0.0105]
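As a quick sanity check on the return formula, compounding the daily returns should recover the total return over the whole window. The sketch below reuses the same sample prices; the names `total_from_daily` and `total_direct` are illustrative and not part of the lesson's code.

```python
import numpy as np

aapl_prices = np.array([150.00, 151.50, 149.80, 152.10, 150.50])
daily_returns = aapl_prices[1:] / aapl_prices[:-1] - 1

# Compound the daily returns: product of (1 + r_t), minus 1
total_from_daily = np.prod(1 + daily_returns) - 1

# Total return computed directly from the first and last price
total_direct = aapl_prices[-1] / aapl_prices[0] - 1

# The two figures should agree up to floating-point error
print(np.isclose(total_from_daily, total_direct))
```

If the two numbers disagree, the return series was probably built with an off-by-one slice.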

Pandas DataFrames arrange tabular data. Each column represents a variable. Each row represents an observation. This layout suits financial data.

Make a DataFrame for several stock prices. Include dates as the index.

python
import pandas as pd

# Sample closing prices for AAPL and MSFT
data = {
    'AAPL': [150.00, 151.50, 149.80, 152.10, 150.50],
    'MSFT': [280.00, 282.50, 279.00, 281.80, 279.50]
}
dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'])

df = pd.DataFrame(data, index=dates)
print("\nStock Prices DataFrame:\n", df)

# Calculate daily returns for the DataFrame
df_returns = df.pct_change().dropna()
print("\nDaily Returns DataFrame:\n", df_returns)

Output:

Stock Prices DataFrame:
             AAPL    MSFT
2023-01-01  150.00  280.00
2023-01-02  151.50  282.50
2023-01-03  149.80  279.00
2023-01-04  152.10  281.80
2023-01-05  150.50  279.50

Daily Returns DataFrame:
                 AAPL      MSFT
2023-01-02  0.010000  0.008929
2023-01-03 -0.011221 -0.012389
2023-01-04  0.015354  0.010036
2023-01-05 -0.010519 -0.008162

Pandas handles missing data and performs rolling-window computations. Both matter for the moving averages and standard deviations that mean reversion signals rely on.
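For the missing-data side, a minimal sketch might look like the following. The gap on 2023-01-02 is invented for illustration, and whether to forward-fill or drop depends on the strategy.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'])
# Hypothetical price series with one missing observation
prices = pd.Series([150.00, np.nan, 149.80, 152.10], index=dates)

missing_count = prices.isna().sum()   # number of missing observations
filled = prices.ffill()               # carry the last known price forward
dropped = prices.dropna()             # or discard the missing rows entirely

print("Missing values:", missing_count)
print(filled)
```

Forward-filling keeps the index regular but produces an artificial zero return on the filled day. Dropping avoids that at the cost of an irregular index.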

Determine a 3-day rolling mean for AAPL prices.

python
df['AAPL_3d_MA'] = df['AAPL'].rolling(window=3).mean()
print("\nAAPL with 3-day Rolling Mean:\n", df)

Output (rolling mean rounded to two decimals):

AAPL with 3-day Rolling Mean:
             AAPL    MSFT  AAPL_3d_MA
2023-01-01  150.00  280.00         NaN
2023-01-02  151.50  282.50         NaN
2023-01-03  149.80  279.00      150.43
2023-01-04  152.10  281.80      151.13
2023-01-05  150.50  279.50      150.80
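The rolling standard deviation mentioned above can be combined with the rolling mean into a z-score, a common way to express how far a price sits from its recent average. This is a sketch under assumptions: the 3-day window is far too short for real use, and `z_score` is an illustrative name.

```python
import pandas as pd

prices = pd.Series([150.00, 151.50, 149.80, 152.10, 150.50])
window = 3

rolling_mean = prices.rolling(window=window).mean()
rolling_std = prices.rolling(window=window).std()   # sample std (ddof=1)

# Distance from the rolling mean, measured in rolling standard deviations
z_score = (prices - rolling_mean) / rolling_std
print(z_score)
```

The first `window - 1` entries are NaN because the window is not yet full. A mean reversion rule would typically act only when the absolute z-score exceeds some chosen threshold.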

Statistical Analysis with Statsmodels

Statsmodels offers statistical modeling tools. It performs regression analysis. It checks for stationarity, a central idea in mean reversion.

Use Statsmodels for Ordinary Least Squares (OLS) regression to estimate relationships between variables. For example, regress MSFT returns on AAPL returns.

python
import statsmodels.api as sm

# Use the daily returns calculated earlier
X = df_returns['AAPL'] # Independent variable
y = df_returns['MSFT'] # Dependent variable

# Add a constant to the independent variable for the intercept
X = sm.add_constant(X)

model = sm.OLS(y, X)
results = model.fit()
print("\nOLS Regression Results (MSFT vs AAPL returns):\n", results.summary())

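The summary output reports coefficients, R-squared, and p-values. A low p-value for the AAPL coefficient indicates a statistically significant relationship, which can inform pair trading strategies. With only four observations here, treat the output as a demonstration of the API, not as evidence.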

Test for stationarity using the Augmented Dickey-Fuller (ADF) test. Mean reversion needs stationary or cointegrated time series. A stationary series fluctuates around a fixed mean.

Consider a synthetic mean-reverting series.

python
from statsmodels.tsa.stattools import adfuller
import matplotlib.pyplot as plt

# Generate a synthetic mean-reverting series
np.random.seed(42)
series = np.zeros(100)
mean_val = 100
alpha = 0.1 # Reversion speed
volatility = 1.0

# Set a noisy starting value before the loop so the first step builds on it
series[0] = np.random.normal(mean_val, volatility * 5)

for i in range(1, 100):
    series[i] = series[i-1] + alpha * (mean_val - series[i-1]) + np.random.normal(0, volatility)

# Plot the series
plt.figure(figsize=(10, 4))
plt.plot(series)
plt.title("Synthetic Mean-Reverting Series")
plt.xlabel("Time")
plt.ylabel("Value")
plt.grid(True)
plt.show()

# Perform ADF test
adf_result = adfuller(series)
print(f"\nADF Statistic: {adf_result[0]:.2f}")
print(f"P-value: {adf_result[1]:.3f}")
print("Critical Values:")
for key, value in adf_result[4].items():
    print(f"   {key}: {value:.2f}")

if adf_result[1] < 0.05:
    print("The series is likely stationary (reject null hypothesis).")
else:
    print("The series is likely non-stationary (fail to reject null hypothesis).")

The ADF test's null hypothesis is that the series has a unit root, meaning it is non-stationary. A p-value below 0.05 rejects this null hypothesis at the 5% level, which is evidence of stationarity. Mean reversion strategies work well with stationary series.

Statsmodels also offers cointegration tests. The Engle-Granger two-step method tests whether a pair of series is cointegrated. Cointegration means two non-stationary series have a stationary linear combination. This forms the foundation of many pairs trading strategies.

Machine Learning for Pattern Recognition with Scikit-Learn

Scikit-learn provides machine learning algorithms. Apply them to detect mean-reverting patterns and to predict future price movements.

Use Scikit-learn for classification or regression tasks. For example, predict if a stock will return to its mean.

Consider a simple mean reversion signal. If a stock deviates significantly from its moving average, it might revert.

Create a feature: deviation from a 20-day moving average. Create a target: whether the stock closed higher or lower after 1 day.

python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# In a real scenario, you would fetch a longer price history from a data API (for example, via yfinance)
# For this example, create a synthetic dataset resembling stock prices
np.random.seed(42)
num_days = 252 # ~1 year of trading days
prices = 100 + np.cumsum(np.random.normal(0, 1, num_days)) + np.sin(np.linspace(0, 20, num_days)) * 5
prices_df = pd.DataFrame({'Close': prices})

# Calculate 20-day moving average
prices_df['MA20'] = prices_df['Close'].rolling(window=20).mean()

# Calculate deviation from MA
prices_df['Deviation'] = prices_df['Close'] - prices_df['MA20']

# Create a target variable: 1 if price increased next day, 0 otherwise
# Shift by -1 to compare each close with the next day's close
prices_df['Next_Day_Move'] = (prices_df['Close'].shift(-1) > prices_df['Close']).astype(int)

# Drop the last row: it has no next-day close, so the comparison above
# silently labels it 0. Then drop the rows where the 20-day rolling mean is NaN.
prices_df = prices_df.iloc[:-1].dropna()

# Define features (X) and target (y)
X = prices_df[['Deviation']]
y = prices_df['Next_Day_Move']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train a Logistic Regression model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("\nClassification Report for Logistic Regression:\n", classification_report(y_test, y_pred))

The classification report shows precision, recall, and F1-score. These metrics assess the model's ability to predict price movements. High precision for predicting a price increase (class 1) means fewer false positives.
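To make the precision and recall definitions concrete, here is a small worked example on hand-made labels. The labels are hypothetical and unrelated to the model above.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = price rose the next day, 0 = it did not
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 1, 1, 0, 0, 0, 1, 0]

# Predicted 1 at positions 0, 1, 2, 6: three are correct, one is a false positive
precision = precision_score(y_true, y_hat)   # TP / (TP + FP) = 3 / 4

# Actual 1 at positions 0, 2, 3, 6: three are found, one is missed
recall = recall_score(y_true, y_hat)         # TP / (TP + FN) = 3 / 4

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

Precision answers "when the model signals an up move, how often is it right?", while recall answers "of all up moves, how many did the model catch?".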

Scikit-learn also supports clustering algorithms. Use K-Means to group similar stocks. Find groups of assets that show similar mean-reverting behavior. This expands pair trading from two assets to baskets of assets.

python
from sklearn.cluster import KMeans

# Using the daily returns from the earlier example (AAPL, MSFT)
# In a real scenario, you'd use many more stocks and a longer history
# For demonstration, we'll use the existing df_returns and add a third synthetic stock
data_for_clustering = df_returns.copy()
data_for_clustering['GOOG'] = np.random.normal(0.001, 0.015, len(data_for_clustering)) # Synthetic GOOG returns

# Initialize KMeans with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)

# Fit the model and predict clusters
clusters = kmeans.fit_predict(data_for_clustering.T) # Transpose to cluster stocks, not days

print("\nStock Clusters (K-Means):\n")
for i, stock in enumerate(data_for_clustering.columns):
    print(f"Stock: {stock}, Cluster: {clusters[i]}")

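Output (cluster label numbers are arbitrary, and the GOOG returns are random, so your assignment may differ):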

Stock Clusters (K-Means):

Stock: AAPL, Cluster: 0
Stock: MSFT, Cluster: 0
Stock: GOOG, Cluster: 1

This example shows how stocks can be grouped. AAPL and MSFT returns might correlate more, placing them in one cluster. GOOG (synthetic, random) might fall into another. This informs portfolio construction and risk management for mean reversion.
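One way to check the correlation claim directly is a pairwise correlation matrix of the returns. The sketch below rebuilds the returns from the lesson's sample prices so that it runs on its own. Four observations are far too few for a reliable estimate, so the number is only illustrative.

```python
import pandas as pd

prices = pd.DataFrame({
    'AAPL': [150.00, 151.50, 149.80, 152.10, 150.50],
    'MSFT': [280.00, 282.50, 279.00, 281.80, 279.50],
})
returns = prices.pct_change().dropna()

# Pearson correlation between each pair of return columns
corr_matrix = returns.corr()
aapl_msft_corr = corr_matrix.loc['AAPL', 'MSFT']
print(corr_matrix.round(3))
```

With many stocks, the same matrix is a useful cross-check on whatever groups a clustering algorithm proposes.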

These libraries form the foundation of quantitative trading infrastructure. Master them for effective mean reversion strategy development.