Learning to Trade via Direct Reinforcement
Authors: John Moody & Matthew Saffell
Date: 2001
Presents an adaptive algorithm, Recurrent Reinforcement Learning (RRL), which differs from TD-Learning and Q-Learning.
I. Introduction
Trader goal: optimize some measure of trading performance.
Investment performance depends upon sequences of interdependent decisions. ⇒ Path dependent.
RRL is an adaptive policy search that can learn an investment strategy online.
In financial problems we can use a Direct Reinforcement approach to provide immediate feedback and optimize a strategy.
Frequently used class of performance criteria: measures of risk-adjusted investment returns.
RRL can balance the accumulation of returns with the avoidance of risk.
We formulate here the differential forms of the Sharpe ratio and Downside Deviation Ratio for efficient online learning with RRL.
A. Structure of trading systems
We consider agents that trade fixed position sizes in a single security.
Traders are assumed to take only long, neutral, or short positions of constant magnitude: \(F_t \in \{1, 0, -1\}\).
The price series is denoted \(z_t\).
Position \(F_t\) is established (or maintained) at the end of each time interval \(t\) ⇒ a trade is possible at the end of each time period.
Return \(R_t\) is realized at the end of the interval \((t-1, t]\) and includes:
The profit/loss resulting from held position \(F_{t-1}\)
The transaction cost incurred at t due to difference between \(F_{t-1}\) and \(F_t\).
Trader must have internal state information and must therefore be recurrent.
We use the following decision function:
\[\begin{equation}\begin{split}F_t & = F(\theta_t; F_{t-1}, I_t) \\ I_t & = (z_t,z_{t-1},z_{t-2},...; y_t, y_{t-1}, y_{t-2}, ...)\end{split}\end{equation}\label{eq1}\]
With:
\(\theta_t\) : learned system parameters at time t.
\(I_t\) : information set at time t.
\(z_t\) : price series.
\(y_t\) : external variables.
\[\begin{equation}F_t = sign(u \cdot F_{t-1} + v_1 \cdot r_t + v_2 \cdot r_{t-1} + \cdots + v_m \cdot r_{t-m} + w)\label{eq_2states_trader}\end{equation}\]
Continuous function generalization
We can use a continuously valued F() by replacing sign with tanh.
\(F_t \in \{1,0,-1\}\) is not differentiable. But we may still apply gradient optimization by considering differentiable pre-threshold outputs, or by replacing sign with tanh during learning and discretizing when trading.
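A minimal sketch of this decision function, including the tanh relaxation (function and variable names are ours, not the paper's):

```python
import numpy as np

def decide_position(theta, F_prev, r_window, discrete=True):
    """One step of the threshold trader above (sketch; names are ours).

    theta    -- (u, v, w): recurrence weight, return weights, bias
    r_window -- recent returns [r_t, r_{t-1}, ..., r_{t-m}]
    """
    u, v, w = theta
    pre = u * F_prev + np.dot(v, r_window) + w
    if discrete:
        return np.sign(pre)   # positions in {1, -1} (0 on an exact tie)
    return np.tanh(pre)       # differentiable relaxation used during learning
```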
Stochastic framework generalization
The model can be extended by introducing a noise variable in \(F()\): \[\begin{equation}F_t = F(\theta_t; F_{t-1}, I_t; \epsilon_t) \text{ with } \epsilon_t \sim p_\epsilon(\epsilon)\end{equation}\label{eq_Ft}\]
Noise level controls “exploration vs exploitation” behavior.
B. Profit and wealth for trading systems
Additive profits
Applicable if each trade is of a fixed size.
\(r_t = z_t - z_{t-1}\): price returns of risky asset.
\(r_t^f = z_t^f - z_{t-1}^f\) : price returns of risk-free asset (like T-Bills)
Transaction cost rate: \(\delta\)
Trading position size: \(\mu > 0\)
Additive profits accumulated over T periods:
\[P_T = \sum\limits_{t=1}^T R_t\]
Where: \[R_t = \mu \cdot \{r_t^f + F_{t-1} \cdot (r_t - r_t^f) - \delta \cdot |F_t - F_{t-1}|\}\]
Usually we consider \(P_0 = 0\) and \(F_T = F_0 = 0\)
When ignoring the risk-free rate of interest (i.e., \(r_t^f = 0\)), we have:
\[\begin{equation}R_t = \mu \cdot \{F_{t-1} \cdot r_t - \delta \cdot |F_t - F_{t-1}|\}\end{equation}\label{eq_Rt_add}\]
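A sketch of computing these additive returns in vectorized form (assuming \(r_t^f = 0\); names are ours):

```python
import numpy as np

def additive_returns(F, r, mu=1.0, delta=0.005):
    """Per-period additive returns R_t and profit P_T (sketch, r_f = 0).

    F -- positions F_0, ..., F_T (length T+1)
    r -- price changes r_1, ..., r_T (length T), r_t = z_t - z_{t-1}
    """
    F, r = np.asarray(F, float), np.asarray(r, float)
    # R_t = mu * (F_{t-1} * r_t - delta * |F_t - F_{t-1}|)
    R = mu * (F[:-1] * r - delta * np.abs(np.diff(F)))
    return R, R.sum()   # per-period returns and additive profit P_T
```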
Multiplicative profits
Applicable if fixed fraction of accumulated wealth \(\nu > 0\) is invested in each trade.
We use: \(r_t = \frac{z_t}{z_{t-1}} - 1\) and \(r_t^f = \frac{z_t^f}{z_{t-1}^f} - 1\)
If we assume no short sales and \(\nu = 1\), then the wealth at time T is:
\[W_T = W_0 \cdot \prod\limits_{t=1}^T(1+R_t)\]
Where: \((1+R_t) = \{1 + (1 - F_{t-1}) \cdot r_t^f + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)\)
When ignoring the risk-free rate of interest (i.e., \(r_t^f = 0\)), we have:
\[\begin{equation}(1+R_t) = \{1 + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)\end{equation}\label{eq_Rt_mult}\]
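A sketch of accumulating wealth under these multiplicative returns (assuming \(r_t^f = 0\) and \(\nu = 1\); names are ours):

```python
import numpy as np

def multiplicative_wealth(F, r, W0=1.0, delta=0.005):
    """Final wealth W_T under multiplicative returns (sketch, r_f = 0, nu = 1).

    F -- positions F_0, ..., F_T (length T+1)
    r -- fractional returns r_1, ..., r_T, with r_t = z_t / z_{t-1} - 1
    """
    F, r = np.asarray(F, float), np.asarray(r, float)
    # (1 + R_t) = (1 + F_{t-1} * r_t) * (1 - delta * |F_t - F_{t-1}|)
    growth = (1.0 + F[:-1] * r) * (1.0 - delta * np.abs(np.diff(F)))
    return W0 * np.prod(growth)
```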
C. Performance criteria
We consider performance criteria expressed as functions of the wealth: \(U(W_T)\)
Or more generally: \(U(W_T, \dots, W_1, W_0)\)
In both cases U can be expressed as a function of the trading returns: \(U(R_T,\dots,R_1,W_0)\), which we denote \(U_T\)
For trader optimization we are interested in the marginal increase in performance due to return \(R_t\):
\[\begin{equation}D_t \varpropto \Delta U_t = U_t - U_{t-1}\label{eq_Dt}\end{equation}\]
D. Differential Sharpe Ratio
Sharpe ratio is defined as: \(S_T = \frac{\text{Average}(R_t)}{\text{Std deviation}(R_t)} = \frac{\bar{R}}{(\frac 1T \sum_{t=1}^T R_t^2 - \bar{R}^2)^\frac 12}\)
With \(\bar{R} = \frac 1T \sum\limits_{t=1}^T R_t\)
For online learning we maintain exponential moving estimates of the first and second moments of the returns, with adaptation rate \(\eta\):
\[A_t = A_{t-1} + \eta \Delta A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1})\]
\[B_t = B_{t-1} + \eta \Delta B_t = B_{t-1} + \eta \cdot (R_t^2 - B_{t-1})\]
\[\begin{align*}{S_t}_{|\eta>0} & \approx {S_t}_{|\eta=0} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2) \\ & \approx S_{t-1} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2)\end{align*}\]
A zero adaptation rate (\(\eta = 0\)) corresponds to an infinite time average.
Thus expanding about \(\eta = 0\) corresponds to “just turning on” the adaptation.
We define the Differential Sharpe ratio as:
\[D_t = {\frac{dS_t}{d\eta}}_{|\eta=0} = \frac{B_{t-1} \cdot \Delta A_t - \frac 12 A_{t-1} \Delta B_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}\]
\(D_t\) describes the influence of trading return \(R_t\) on the Sharpe Ratio \(S_t\).
Since \(S_{t-1}\) doesn't depend on \(R_t\), when we take \(S_t\) as utility function we get:
\[\frac{dU_t}{dR_t} = \frac{dS_t}{dR_t} \approx \eta \frac{dD_t}{dR_t} \\ \text{with: } \frac{dD_t}{dR_t} = \frac{B_{t-1} - A_{t-1} \cdot R_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}\]
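A sketch of one online update of the moving moments and the resulting \(D_t\) (the `eps` guard is our addition for numerical stability):

```python
def differential_sharpe_step(R_t, A_prev, B_prev, eta=0.01, eps=1e-12):
    """One online update of the differential Sharpe ratio (sketch)."""
    dA = R_t - A_prev          # Delta A_t
    dB = R_t**2 - B_prev       # Delta B_t
    var = max(B_prev - A_prev**2, eps)
    D_t = (B_prev * dA - 0.5 * A_prev * dB) / var**1.5
    A_t, B_t = A_prev + eta * dA, B_prev + eta * dB
    return D_t, A_t, B_t
```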
The problem is that the variance term (through \(R_t^2\)) makes no distinction between upside and downside risk. Assuming \(A_{t-1} > 0\), the largest improvement occurs at \(R_t^* = \frac{B_{t-1}}{A_{t-1}}\); returns larger than \(R_t^*\) actually decrease \(D_t\).
⇒ The Sharpe ratio criterion penalizes larger gains.
E. Downside risk
Variance is increasingly considered an inadequate risk measure due to the issue mentioned above.
Other options are:
Downside Deviation (DD)
Second Lower Partial Moment (SLPM)
Nth Lower Partial Moment
Sterling Ratio, defined as \(\text{Sterling Ratio} = \frac{\text{Annualized Average Return}}{\text{Max. Draw-Down}}\), where the maximum drawdown is measured over a standard period (usually 1 to 3 years).
The maximum drawdown is cumbersome to minimize, thus we focus on the Downside Deviation (which tracks the Sterling Ratio effectively).
\[DDR_T = \frac{\text{Average}(R_t)}{DD_T} \text{, with } DD_T = \left(\frac 1T \sum\limits_{t=1}^T \min(R_t, 0)^2\right)^{\frac 12}\]
The moving estimates used for online learning are:
\[A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1}) \\ DD_t^2 = DD_{t-1}^2 + \eta \cdot (\min(R_t,0)^2 - DD_{t-1}^2)\]
\[DDR_t \approx DDR_{t-1} + \eta {\frac{dDDR_t}{d\eta}}_{|\eta=0} + O(\eta^2)\]
\[\begin{align*}D_t & \equiv {\frac{dDDR_t}{d\eta}}_{|\eta=0} \\ & = \frac{R_t - \frac 12 A_{t-1}}{DD_{t-1}} \text{, when } R_t > 0 \\ & = \frac{DD_{t-1}^2 \cdot (R_t - \frac 12 A_{t-1}) - \frac 12 A_{t-1} \cdot R_t^2}{DD_{t-1}^3} \text{, when } R_t \le 0\end{align*}\]
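A sketch of one online update of these quantities (the `eps` guard is our addition for numerical stability):

```python
def differential_ddr_step(R_t, A_prev, DD2_prev, eta=0.01, eps=1e-12):
    """One online update of the differential Downside Deviation Ratio (sketch).

    DD2_prev is the moving estimate of the squared downside deviation.
    """
    DD = max(DD2_prev, eps) ** 0.5
    if R_t > 0:
        D_t = (R_t - 0.5 * A_prev) / DD
    else:
        D_t = (DD2_prev * (R_t - 0.5 * A_prev) - 0.5 * A_prev * R_t**2) / DD**3
    A_t = A_prev + eta * (R_t - A_prev)
    DD2_t = DD2_prev + eta * (min(R_t, 0.0)**2 - DD2_prev)
    return D_t, A_t, DD2_t
```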
III. Learning to Trade
Reinforcement learning adjusts the parameters of a system to maximize the expected payoff that is generated due to the actions of the system.
This is accomplished through trial-and-error exploration of the environment and of the space of strategies.
Supervised learning is effective for the structural credit assignment problem, but not for temporal credit assignment.
Structural credit assignment: assign credits to the parameters of a problem.
Temporal credit assignment: assign credits to the individual actions taken over time.
⇒ Reinforcement learning tries to solve both problems at the same time.
A. Recurrent Reinforcement Learning
Given a trading system model \(F_t(\theta)\), the goal is to adjust the parameters \(\theta\) in order to maximize \(U_T\)
For traders of form \(\eqref{eq1}\) and trading returns of form \(\eqref{eq_Rt_add}\) or \(\eqref{eq_Rt_mult}\) the gradient of \(U_T\) with respect to the parameters \(\theta\) of the system after a sequence of T periods is:
\[\begin{equation}\frac{dU_T(\theta)}{d\theta} = \sum\limits_{t=1}^T \frac{dU_T}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_batch_rrl}\end{equation}\]
\[\begin{equation}\Delta \theta = \rho \frac{dU_T(\theta)}{d\theta}\label{eq2_batch_rrl}\end{equation}\]
Note that the quantities \(dF_t/d\theta\) are total derivatives, so we need an approach similar to Back-Propagation Through Time (BPTT), thus: \[\begin{equation}\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}} \frac {dF_{t-1}}{d\theta}\label{eq3_batch_rrl}\end{equation}\]
We assume here differentiability of \(F_t\). For long/short traders with thresholds the reinforcement signal can be backpropagated through the pre-thresholded outputs (similar to Adaline learning rule).
Previous equations \(\eqref{eq1_batch_rrl}\), \(\eqref{eq2_batch_rrl}\) and \(\eqref{eq3_batch_rrl}\) constitute the batch RRL algorithm.
For online learning, we keep only the term of the gradient that depends on the most recently realized return \(R_t\), and update the parameters at each time step:
\[\begin{equation}\frac{dU_t(\theta)}{d\theta} \approx \frac{dU_t}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_online_rrl}\end{equation}\]
\[\begin{equation}\Delta \theta_t = \rho \frac{dU_t(\theta_t)}{d\theta_t}\label{eq2_online_rrl}\end{equation}\]
Using a differential performance criterion \(D_t\) as in \(\eqref{eq_Dt}\), the online update becomes:
\[\begin{equation}\begin{split}\Delta\theta_t & = \rho \frac{dD_t(\theta_t)}{d\theta_t}\\ & \approx \rho \frac{dD_t}{dR_t}\left\{\frac{dR_t}{dF_t}\frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta_{t-1}}\right\}\end{split}\end{equation}\]
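Putting the pieces together, a minimal sketch of one online RRL step for a tanh trader with additive returns and the differential Sharpe ratio as utility (all names and the `eps` guard are ours):

```python
import numpy as np

def online_rrl_step(theta, grad_prev, F_prev, r_window, r_t,
                    A_prev, B_prev, rho=0.1, eta=0.01,
                    mu=1.0, delta=0.005, eps=1e-12):
    """One online RRL update (sketch).

    theta     -- parameter vector [u, v_0, ..., v_m, w]
    grad_prev -- dF_{t-1}/dtheta carried from the previous step
    r_window  -- recent returns [r_t, r_{t-1}, ..., r_{t-m}]
    """
    u = theta[0]
    F_t = np.tanh(u * F_prev + np.dot(theta[1:-1], r_window) + theta[-1])

    # Recurrent gradient: dF_t/dtheta = (1 - F_t^2)(dx/dtheta + u * dF_{t-1}/dtheta)
    dx_dtheta = np.concatenate(([F_prev], r_window, [1.0]))
    grad_t = (1.0 - F_t**2) * (dx_dtheta + u * grad_prev)

    # Trading return and its partial derivatives w.r.t. F_t and F_{t-1}
    R_t = mu * (F_prev * r_t - delta * abs(F_t - F_prev))
    s = np.sign(F_t - F_prev)
    dR_dFt, dR_dFprev = -mu * delta * s, mu * (r_t + delta * s)

    # Differential Sharpe ratio sensitivity dD_t/dR_t
    var = max(B_prev - A_prev**2, eps)
    dD_dR = (B_prev - A_prev * R_t) / var**1.5

    # Gradient ascent step, then update of the moving moments
    theta = theta + rho * dD_dR * (dR_dFt * grad_t + dR_dFprev * grad_prev)
    A_t = A_prev + eta * (R_t - A_prev)
    B_t = B_prev + eta * (R_t**2 - B_prev)
    return theta, grad_t, F_t, A_t, B_t
```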
B. Value functions and Q-Learning
Provide proper references for this section?
IV. Empirical Results
Using 3 test cases:
Artificial price series (using the Sharpe ratio)
Half-hourly US Dollar/British Pound (USDGBP) exchange rate (using the Downside Deviation Ratio)
Comparison of RRL and Q-Learning on the monthly S&P 500 stock index.
A. Trader Simulation
Using an RRL trader taking {long, short} positions, with a structure similar to \(\eqref{eq_2states_trader}\).
Experiments demonstrate that:
RRL is an effective means of learning trading strategies
Trading frequency is reduced as expected as transaction costs increase.
A.1 Data
\[\begin{equation}\begin{split}p(t) & = p(t-1) + \beta(t-1) + k \epsilon(t)\\\beta(t) & = \alpha \beta(t-1) + \nu(t)\end{split}\end{equation}\]
Where \(\alpha\) and k are constants and \(\epsilon(t)\) and \(\nu(t)\) are normal random deviates with zero mean and unit variance: \(\epsilon(t) \sim \mathcal{N}(0,1)\) and \(\nu(t) \sim \mathcal{N}(0,1)\)
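A sketch of this generator (the values of `alpha`, `k`, and the seed are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def artificial_prices(T=10_000, alpha=0.9, k=3.0, seed=0):
    """Generate the trending artificial series p(t) above (sketch)."""
    rng = np.random.default_rng(seed)
    p = np.zeros(T)
    beta = 0.0                                                # beta(0)
    for t in range(1, T):
        p[t] = p[t - 1] + beta + k * rng.standard_normal()    # uses beta(t-1)
        beta = alpha * beta + rng.standard_normal()           # update to beta(t)
    return p
```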
A.2 Simulated Trading Results
Input at time t constructed from the previous 8 returns.
RRL trader initialized randomly
Trader adapted using real-time recurrent learning to optimize the differential Sharpe ratio.
Transaction cost fixed at 0.5% during learning and trading
Transient effects of initial learning visible during the first 2000 time steps.
In these simulations the 10000 samples are partitioned in:
1000 samples training set
9000 samples test set.
Traders are first optimized on the training data set for 100 epochs
Then adapted online on the test data set.
In 100 experiments, positive Sharpe ratios are always obtained.
And, as expected, trading frequency is reduced as transaction costs increase.
B. US Dollar/British Pound Foreign exchange trading system
Using half-hourly USDGBP exchange rate data.
Training a {long,short,neutral} trading system.
The trading system incurs transaction costs through the bid-ask spread.
Training to maximize the differential Downside Deviation Ratio.
The system is initially trained on 2000 data points, then produces signals for two weeks (480 points); the training window is then shifted forward to include the just-tested 480 points, and the system is retrained.
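A sketch of this rolling train/test protocol (`fit` and `signal` are hypothetical placeholders for the actual training and signal-generation routines):

```python
def walk_forward(data, fit, signal, train_len=2000, test_len=480):
    """Rolling train/test protocol (sketch): train on `train_len` points,
    trade the next `test_len` out-of-sample, slide the window, retrain."""
    signals, start = [], 0
    while start + train_len + test_len <= len(data):
        model = fit(data[start:start + train_len])               # retrain on window
        test = data[start + train_len:start + train_len + test_len]
        signals.extend(signal(model, test))                      # trade out-of-sample
        start += test_len                                        # slide the window
    return signals
```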
Using an EMA Sharpe ratio with a time constant of 0.01.
C. S&P 500 / T-Bill Asset Allocation
Provide proper references for this section?
V. Learn the Policy or Learn the Value ?
Provide proper references for this section?
VI. Conclusions
Provide proper references for this section?