Learning to Trade via Direct Reinforcement
Authors: John Moody & Matthew Saffell
Date: 2001
Presents an adaptive algorithm, Recurrent Reinforcement Learning (RRL), which differs from TD-Learning and Q-Learning.
I. Introduction
Trader goal: optimize some measure of trading performance.
Investment performance depends upon sequences of interdependent decisions. ⇒ Path dependent.
RRL is an adaptive policy search that can learn an investment strategy online.
In financial problems we can use a Direct Reinforcement approach to provide immediate feedback and optimize a strategy.
Frequently used class of performance criteria: measures of risk-adjusted investment returns.
RRL can balance the accumulation of returns with the avoidance of risk.
We formulate here the differential forms of the Sharpe ratio and Downside Deviation Ratio for efficient online learning with RRL.
A. Structure of trading systems
We consider agents that trade fixed position sizes in a single security.
Traders are assumed to take only long, neutral, or short positions of constant magnitude: \(F_t \in \{1, 0, -1\}\).
The price series is denoted \(z_t\).
Position \(F_t\) is established (or maintained) at the end of each time interval \(t\) ⇒ a trade is possible at the end of each time period.
Return \(R_t\) is realized at the end of the interval \((t-1, t]\) and includes:
The profit/loss resulting from held position \(F_{t-1}\)
The transaction cost incurred at t due to difference between \(F_{t-1}\) and \(F_t\).
Trader must have internal state information and must therefore be recurrent.
We use the following decision function:
\[\begin{equation}\begin{split}F_t & = F(\theta_t; F_{t-1}, I_t) \\ I_t & = (z_t,z_{t-1},z_{t-2},...; y_t, y_{t-1}, y_{t-2}, ...)\end{split}\end{equation}\label{eq1}\]
With:
\(\theta_t\) : learned system parameters at time t.
\(I_t\) : information set at time t.
\(z_t\) : price series.
\(y_t\) : external variables.
\[\begin{equation}F_t = sign(u \cdot F_{t-1} + v_1 \cdot r_t + v_2 \cdot r_{t-1} + \cdots + v_m \cdot r_{t-m} + w)\label{eq_2states_trader}\end{equation}\]
Continuous function generalization
We can use a continuously valued F() by replacing sign with tanh.
\(F_t \in \{1,0,-1\}\) is not differentiable. But we may still apply gradient optimization by considering differentiable pre-threshold outputs, or by replacing sign with tanh during learning and discretizing when trading.
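A minimal sketch of this decision function, including the tanh relaxation (function and variable names are ours, not the paper's):

```python
import numpy as np

def decide_position(theta, F_prev, r_window, discrete=True):
    """One step of the threshold trader above (sketch; names are ours).

    theta    -- (u, v, w): recurrence weight, return weights, bias
    r_window -- recent returns [r_t, r_{t-1}, ..., r_{t-m}]
    """
    u, v, w = theta
    pre = u * F_prev + np.dot(v, r_window) + w
    if discrete:
        return np.sign(pre)   # positions in {1, -1} (0 on an exact tie)
    return np.tanh(pre)       # differentiable relaxation used during learning
```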
Stochastic framework generalization
The model can be extended by introducing a noise variable in \(F()\): \[\begin{equation}F_t = F(\theta_t; F_{t-1}, I_t; \epsilon_t) \text{ with } \epsilon_t \sim p_\epsilon(\epsilon)\end{equation}\label{eq_Ft}\]
Noise level controls “exploration vs exploitation” behavior.
B. Profit and wealth for trading systems
Additive profits
Applicable if each trade is of a fixed size.
\(r_t = z_t - z_{t-1}\): price returns of risky asset.
\(r_t^f = z_t^f - z_{t-1}^f\) : price returns of risk-free asset (like T-Bills)
Transaction cost rate: \(\delta\)
Trading position size: \(\mu > 0\)
Additive profits accumulated over T periods:
\[P_T = \sum\limits_{t=1}^T R_t\]
Where: \[R_t = \mu \cdot \{r_t^f + F_{t-1} \cdot (r_t - r_t^f) - \delta \cdot |F_t - F_{t-1}|\}\]
Usually we consider \(P_0 = 0\) and \(F_T = F_0 = 0\)
When ignoring the risk-free rate of interest (i.e., \(r_t^f = 0\)), we have:
\[\begin{equation}R_t = \mu \cdot \{F_{t-1} \cdot r_t - \delta \cdot |F_t - F_{t-1}|\}\end{equation}\label{eq_Rt_add}\]
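A sketch of computing these additive returns in vectorized form (assuming \(r_t^f = 0\); names are ours):

```python
import numpy as np

def additive_returns(F, r, mu=1.0, delta=0.005):
    """Per-period additive returns R_t and profit P_T (sketch, r_f = 0).

    F -- positions F_0, ..., F_T (length T+1)
    r -- price changes r_1, ..., r_T (length T), r_t = z_t - z_{t-1}
    """
    F, r = np.asarray(F, float), np.asarray(r, float)
    # R_t = mu * (F_{t-1} * r_t - delta * |F_t - F_{t-1}|)
    R = mu * (F[:-1] * r - delta * np.abs(np.diff(F)))
    return R, R.sum()   # per-period returns and additive profit P_T
```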
Multiplicative profits
Applicable if fixed fraction of accumulated wealth \(\nu > 0\) is invested in each trade.
We use: \(r_t = \frac{z_t}{z_{t-1}} - 1\) and \(r_t^f = \frac{z_t^f}{z_{t-1}^f} - 1\)
If we assume no short sales and \(\nu = 1\), then the wealth at time T is:
\[W_T = W_0 \cdot \prod\limits_{t=1}^T(1+R_t)\]
Where: \((1+R_t) = \{1 + (1 - F_{t-1}) \cdot r_t^f + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)\)
When ignoring the risk-free rate of interest (i.e., \(r_t^f = 0\)), we have:
\[\begin{equation}(1+R_t) = \{1 + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)\end{equation}\label{eq_Rt_mult}\]
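A sketch of accumulating wealth under these multiplicative returns (assuming \(r_t^f = 0\) and \(\nu = 1\); names are ours):

```python
import numpy as np

def multiplicative_wealth(F, r, W0=1.0, delta=0.005):
    """Final wealth W_T under multiplicative returns (sketch, r_f = 0, nu = 1).

    F -- positions F_0, ..., F_T (length T+1)
    r -- fractional returns r_1, ..., r_T, with r_t = z_t / z_{t-1} - 1
    """
    F, r = np.asarray(F, float), np.asarray(r, float)
    # (1 + R_t) = (1 + F_{t-1} * r_t) * (1 - delta * |F_t - F_{t-1}|)
    growth = (1.0 + F[:-1] * r) * (1.0 - delta * np.abs(np.diff(F)))
    return W0 * np.prod(growth)
```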
C. Performance criteria
We consider performance criteria expressed as functions of the wealth: \(U(W_T)\)
Or more generally: \(U(W_T, \dots, W_1, W_0)\)
In both cases U can be expressed as a function of the trading returns: \(U(R_T,\dots,R_1,W_0)\), which we denote \(U_T\)
For trader optimization we are interested in the marginal increase in performance due to return \(R_t\):
\[\begin{equation}D_t \varpropto \Delta U_t = U_t - U_{t-1}\label{eq_Dt}\end{equation}\]
D. Differential Sharpe Ratio
Sharpe ratio is defined as: \(S_T = \frac{\text{Average}(R_t)}{\text{Std deviation}(R_t)} = \frac{\bar{R}}{(\frac 1T \sum_{t=1}^T R_t^2 - \bar{R}^2)^\frac 12}\)
With \(\bar{R} = \frac 1T \sum\limits_{t=1}^T R_t\)
For online learning we maintain exponential moving estimates of the first and second moments of the returns, with adaptation rate \(\eta\):
\[A_t = A_{t-1} + \eta \Delta A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1})\]
\[B_t = B_{t-1} + \eta \Delta B_t = B_{t-1} + \eta \cdot (R_t^2 - B_{t-1})\]
\[\begin{align*}{S_t}_{|\eta>0} & \approx {S_t}_{|\eta=0} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2) \\ & \approx S_{t-1} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2)\end{align*}\]
A zero adaptation rate (\(\eta = 0\)) corresponds to an infinite time average.
Thus expanding about \(\eta = 0\) corresponds to “just turning on” the adaptation.
We define the Differential Sharpe ratio as:
\[D_t = {\frac{dS_t}{d\eta}}_{|\eta=0} = \frac{B_{t-1} \cdot \Delta A_t - \frac 12 A_{t-1} \Delta B_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}\]
\(D_t\) describes the influence of trading return \(R_t\) on the Sharpe Ratio \(S_t\).
Since \(S_{t-1}\) doesn't depend on \(R_t\), when we take \(S_t\) as utility function we get:
\[\frac{dU_t}{dR_t} = \frac{dS_t}{dR_t} \approx \eta \frac{dD_t}{dR_t} \\ \text{with: } \frac{dD_t}{dR_t} = \frac{B_{t-1} - A_{t-1} \cdot R_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}\]
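A sketch of one online update of the moving moments and the resulting \(D_t\) (the `eps` guard is our addition for numerical stability):

```python
def differential_sharpe_step(R_t, A_prev, B_prev, eta=0.01, eps=1e-12):
    """One online update of the differential Sharpe ratio (sketch)."""
    dA = R_t - A_prev          # Delta A_t
    dB = R_t**2 - B_prev       # Delta B_t
    var = max(B_prev - A_prev**2, eps)
    D_t = (B_prev * dA - 0.5 * A_prev * dB) / var**1.5
    A_t, B_t = A_prev + eta * dA, B_prev + eta * dB
    return D_t, A_t, B_t
```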
The problem is that the variance term (through \(R_t^2\)) makes no distinction between upside and downside risk. Assuming \(A_{t-1} > 0\), the largest improvement occurs at \(R_t^* = \frac{B_{t-1}}{A_{t-1}}\); returns larger than \(R_t^*\) actually decrease \(D_t\).
⇒ The Sharpe ratio criterion penalizes larger gains.
E. Downside risk
Variance is increasingly considered an inadequate risk measure due to the issue mentioned above.
Other options are:
Downside Deviation (DD)
Second Lower Partial Moment (SLPM)
Nth Lower Partial Moment
Sterling Ratio, defined as \(\text{Sterling Ratio} = \frac{\text{Annualized Average Return}}{\text{Max. Draw-Down}}\), where the maximum drawdown is measured over a standard period (usually 1 to 3 years).
The maximum drawdown is cumbersome to minimize, thus we focus on the Downside Deviation (which tracks the Sterling Ratio effectively).
\[DDR_T = \frac{\text{Average}(R_t)}{DD_T} \text{, with } DD_T = \left(\frac 1T \sum\limits_{t=1}^T \min(R_t, 0)^2\right)^{\frac 12}\]
The moving estimates used for online learning are:
\[A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1}) \\ DD_t^2 = DD_{t-1}^2 + \eta \cdot (\min(R_t,0)^2 - DD_{t-1}^2)\]
\[DDR_t \approx DDR_{t-1} + \eta {\frac{dDDR_t}{d\eta}}_{|\eta=0} + O(\eta^2)\]
\[\begin{align*}D_t & \equiv {\frac{dDDR_t}{d\eta}}_{|\eta=0} \\ & = \frac{R_t - \frac 12 A_{t-1}}{DD_{t-1}} \text{, when } R_t > 0 \\ & = \frac{DD_{t-1}^2 \cdot (R_t - \frac 12 A_{t-1}) - \frac 12 A_{t-1} \cdot R_t^2}{DD_{t-1}^3} \text{, when } R_t \le 0\end{align*}\]
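A sketch of one online update of these quantities (the `eps` guard is our addition for numerical stability):

```python
def differential_ddr_step(R_t, A_prev, DD2_prev, eta=0.01, eps=1e-12):
    """One online update of the differential Downside Deviation Ratio (sketch).

    DD2_prev is the moving estimate of the squared downside deviation.
    """
    DD = max(DD2_prev, eps) ** 0.5
    if R_t > 0:
        D_t = (R_t - 0.5 * A_prev) / DD
    else:
        D_t = (DD2_prev * (R_t - 0.5 * A_prev) - 0.5 * A_prev * R_t**2) / DD**3
    A_t = A_prev + eta * (R_t - A_prev)
    DD2_t = DD2_prev + eta * (min(R_t, 0.0)**2 - DD2_prev)
    return D_t, A_t, DD2_t
```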
III. Learning to Trade
Reinforcement learning adjusts the parameters of a system to maximize the expected payoff that is generated due to the actions of the system.
This is accomplished through trial-and-error exploration of the environment and of the space of strategies.
Supervised learning is effective for the structural credit assignment problem, but not for temporal credit assignment.
Structural credit assignment: assign credits to the parameters of a problem.
Temporal credit assignment: assign credits to the individual actions taken over time.
⇒ Reinforcement learning tries to solve both problems at the same time.
A. Recurrent Reinforcement Learning
Given a trading system model \(F_t(\theta)\), the goal is to adjust the parameters \(\theta\) in order to maximize \(U_T\)
For traders of form \(\eqref{eq1}\) and trading returns of form \(\eqref{eq_Rt_add}\) or \(\eqref{eq_Rt_mult}\) the gradient of \(U_T\) with respect to the parameters \(\theta\) of the system after a sequence of T periods is:
\[\begin{equation}\frac{dU_T(\theta)}{d\theta} = \sum\limits_{t=1}^T \frac{dU_T}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_batch_rrl}\end{equation}\]
\[\begin{equation}\Delta \theta = \rho \frac{dU_T(\theta)}{d\theta}\label{eq2_batch_rrl}\end{equation}\]
Note that the quantities \(dF_t/d\theta\) are total derivatives, so we need an approach similar to Back-Propagation Through Time (BPTT), thus: \[\begin{equation}\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}} \frac {dF_{t-1}}{d\theta}\label{eq3_batch_rrl}\end{equation}\]
We assume here differentiability of \(F_t\). For long/short traders with thresholds the reinforcement signal can be backpropagated through the pre-thresholded outputs (similar to Adaline learning rule).
Previous equations \(\eqref{eq1_batch_rrl}\), \(\eqref{eq2_batch_rrl}\) and \(\eqref{eq3_batch_rrl}\) constitute the batch RRL algorithm.
For online learning, we keep only the term of the gradient that depends on the most recently realized return \(R_t\), and update the parameters at each time step:
\[\begin{equation}\frac{dU_t(\theta)}{d\theta} \approx \frac{dU_t}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_online_rrl}\end{equation}\]
\[\begin{equation}\Delta \theta_t = \rho \frac{dU_t(\theta_t)}{d\theta_t}\label{eq2_online_rrl}\end{equation}\]
Using a differential performance criterion \(D_t\) as in \(\eqref{eq_Dt}\), the online update becomes:
\[\begin{equation}\begin{split}\Delta\theta_t & = \rho \frac{dD_t(\theta_t)}{d\theta_t}\\ & \approx \rho \frac{dD_t}{dR_t}\left\{\frac{dR_t}{dF_t}\frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta_{t-1}}\right\}\end{split}\end{equation}\]
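Putting the pieces together, a minimal sketch of one online RRL step for a tanh trader with additive returns and the differential Sharpe ratio as utility (all names and the `eps` guard are ours):

```python
import numpy as np

def online_rrl_step(theta, grad_prev, F_prev, r_window, r_t,
                    A_prev, B_prev, rho=0.1, eta=0.01,
                    mu=1.0, delta=0.005, eps=1e-12):
    """One online RRL update (sketch).

    theta     -- parameter vector [u, v_0, ..., v_m, w]
    grad_prev -- dF_{t-1}/dtheta carried from the previous step
    r_window  -- recent returns [r_t, r_{t-1}, ..., r_{t-m}]
    """
    u = theta[0]
    F_t = np.tanh(u * F_prev + np.dot(theta[1:-1], r_window) + theta[-1])

    # Recurrent gradient: dF_t/dtheta = (1 - F_t^2)(dx/dtheta + u * dF_{t-1}/dtheta)
    dx_dtheta = np.concatenate(([F_prev], r_window, [1.0]))
    grad_t = (1.0 - F_t**2) * (dx_dtheta + u * grad_prev)

    # Trading return and its partial derivatives w.r.t. F_t and F_{t-1}
    R_t = mu * (F_prev * r_t - delta * abs(F_t - F_prev))
    s = np.sign(F_t - F_prev)
    dR_dFt, dR_dFprev = -mu * delta * s, mu * (r_t + delta * s)

    # Differential Sharpe ratio sensitivity dD_t/dR_t
    var = max(B_prev - A_prev**2, eps)
    dD_dR = (B_prev - A_prev * R_t) / var**1.5

    # Gradient ascent step, then update of the moving moments
    theta = theta + rho * dD_dR * (dR_dFt * grad_t + dR_dFprev * grad_prev)
    A_t = A_prev + eta * (R_t - A_prev)
    B_t = B_prev + eta * (R_t**2 - B_prev)
    return theta, grad_t, F_t, A_t, B_t
```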
B. Value functions and Q-Learning
Provide proper references for this section?
IV. Empirical Results
Using 3 test cases:
Artificial price series (using the Sharpe ratio)
Half-hourly US Dollar/British Pound (USDGBP) exchange rate (using the Downside Deviation Ratio)
Comparison of RRL and Q-Learning on the monthly S&P 500 stock index.
A. Trader Simulation
Using an RRL trader taking {long, short} positions, with a structure similar to \(\eqref{eq_2states_trader}\).
Experiments demonstrate that:
RRL is an effective means of learning trading strategies
Trading frequency is reduced as expected as transaction costs increase.
A.1 Data
\[\begin{equation}\begin{split}p(t) & = p(t-1) + \beta(t-1) + k \epsilon(t)\\\beta(t) & = \alpha \beta(t-1) + \nu(t)\end{split}\end{equation}\]
Where \(\alpha\) and k are constants and \(\epsilon(t)\) and \(\nu(t)\) are normal random deviates with zero mean and unit variance: \(\epsilon(t) \sim \mathcal{N}(0,1)\) and \(\nu(t) \sim \mathcal{N}(0,1)\)
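A sketch of this generator (the values of `alpha`, `k`, and the seed are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def artificial_prices(T=10_000, alpha=0.9, k=3.0, seed=0):
    """Generate the trending artificial series p(t) above (sketch)."""
    rng = np.random.default_rng(seed)
    p = np.zeros(T)
    beta = 0.0                                                # beta(0)
    for t in range(1, T):
        p[t] = p[t - 1] + beta + k * rng.standard_normal()    # uses beta(t-1)
        beta = alpha * beta + rng.standard_normal()           # update to beta(t)
    return p
```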
A.2 Simulated Trading Results
Input at time t constructed from the previous 8 returns.
RRL trader initialized randomly
Trader adapted using real-time recurrent learning to optimize the differential Sharpe ratio.
Transaction cost fixed at 0.5% during learning and trading
Transient effects of initial learning visible during the first 2000 time steps.
In these simulations the 10000 samples are partitioned in:
1000 samples training set
9000 samples test set.
Traders are first optimized on the training data set for 100 epochs
Then adapted online on the test data set.
In 100 experiments, positive Sharpe ratios are always obtained.
And, as expected, trading frequency is reduced as transaction costs increase.
B. US Dollar/British Pound Foreign exchange trading system
Using half-hourly USDGBP exchange rate data.
Training a {long,short,neutral} trading system.
The trading system incurs transaction costs through the bid-ask spread.
Training to maximize the differential Downside Deviation Ratio.
The system is initially trained on 2000 data points, then produces signals for two weeks (480 points); the training window is then shifted forward to include the just-tested 480 points, and the system is retrained.
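A sketch of this rolling train/test protocol (`fit` and `signal` are hypothetical placeholders for the actual training and signal-generation routines):

```python
def walk_forward(data, fit, signal, train_len=2000, test_len=480):
    """Rolling train/test protocol (sketch): train on `train_len` points,
    trade the next `test_len` out-of-sample, slide the window, retrain."""
    signals, start = [], 0
    while start + train_len + test_len <= len(data):
        model = fit(data[start:start + train_len])               # retrain on window
        test = data[start + train_len:start + train_len + test_len]
        signals.extend(signal(model, test))                      # trade out-of-sample
        start += test_len                                        # slide the window
    return signals
```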
Using an EMA Sharpe ratio with a time constant of 0.01.
C. S&P 500 / T-Bill Asset Allocation
Provide proper references for this section?
V. Learn the Policy or Learn the Value ?
Provide proper references for this section?
VI. Conclusions
Provide proper references for this section?