# Learning to Trade via Direct Reinforcement

Authors: John Moody & Matthew Saffell

Date: 2001

Presents an adaptive algorithm, Recurrent Reinforcement Learning (RRL), which differs from TD-Learning and Q-Learning.

## I. Introduction

- Trader goal: optimize some measure of trading performance.
- Investment performance depends upon sequences of interdependent decisions. ⇒ Path dependent.
- RRL is an adaptive policy search algorithm that can learn an investment strategy online.
- In financial problems we can use a Direct Reinforcement approach, which provides immediate feedback to optimize a strategy.
- Frequently used class of performance criteria: measures of risk-adjusted investment returns.
- RRL can balance the accumulation of returns with the avoidance of risk.
- We formulate here the differential forms of the Sharpe ratio and Downside Deviation Ratio for efficient online learning with RRL.

## II. Trading systems and performance criteria

### A. Structure of trading systems

- We consider agents that trade **fixed position sizes** in a single security.
- Traders are assumed to take only long, neutral, or short positions of constant magnitude: \(F_t \in \{1, 0, -1\}\).
- The price series is denoted \(z_t\).
- Position \(F_t\) is established (or maintained) **at the end** of each time interval t ⇒ a trade is possible at the end of each time period.
- Return \(R_t\) is realized at the end of the interval \((t-1, t]\) and includes:
  - The profit/loss resulting from the held position \(F_{t-1}\).
  - The transaction cost incurred at t due to the difference between \(F_{t-1}\) and \(F_t\).

- The trader must have internal state information and must therefore be **recurrent**.
- We use the following decision function:

\[\begin{equation}\begin{split}F_t & = F(\theta_t; F_{t-1}, I_t) \\ I_t & = (z_t,z_{t-1},z_{t-2},\ldots; y_t, y_{t-1}, y_{t-2}, \ldots)\end{split}\end{equation}\label{eq1}\]

- With:
  - \(\theta_t\) : learned system parameters at time t.
  - \(I_t\) : information set at time t.
  - \(z_t\) : price series.
  - \(y_t\) : external variables.

- Simple {long, short} trader example with m+1 autoregressive inputs:

\[\begin{equation}F_t = sign(u \cdot F_{t-1} + v_0 \cdot r_t + v_1 \cdot r_{t-1} + \cdots + v_m \cdot r_{t-m} + w)\label{eq_2states_trader}\end{equation}\]

- Where \(r_t = z_t - z_{t-1}\) are the price returns of \(z_t\) and the parameters are \(\theta = \{u, v_i, w\}\).

- This is a discrete-action, **deterministic** trader.
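The {long, short} decision rule above can be sketched in a few lines of Python. This is only an illustration: the function name `two_state_trader`, the initial position `f0`, and the choice to break ties toward a long position are assumptions, not part of the paper. The list `v` holds the m+1 return weights, ordered from \(r_t\) to \(r_{t-m}\).

```python
def two_state_trader(returns, u, v, w, f0=1.0):
    """Apply F_t = sign(u*F_{t-1} + sum_i v_i*r_{t-i} + w) along a return series."""
    m = len(v) - 1          # v holds m+1 weights
    positions = []
    f_prev = f0             # assumed initial position
    for t in range(m, len(returns)):
        # Most recent return first: r_t, r_{t-1}, ..., r_{t-m}
        window = [returns[t - i] for i in range(m + 1)]
        pre = u * f_prev + sum(vi * ri for vi, ri in zip(v, window)) + w
        f_prev = 1.0 if pre >= 0 else -1.0   # tie broken toward long (assumption)
        positions.append(f_prev)
    return positions


# With u = 0 the trader simply follows the sign of the last return.
print(two_state_trader([0.1, -0.2, 0.3], u=0.0, v=[1.0], w=0.0))
# A positive u makes the previous position "sticky".
print(two_state_trader([0.5, -0.2, -0.2], u=0.3, v=[1.0], w=0.0))
```

The second call illustrates why the recurrent term matters: a small adverse return is not enough to flip a position that is already held.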

#### Continuous function generalization

- We can use a continuously valued F() by replacing sign with tanh.
- The discrete output \(F_t \in \{1,0,-1\}\) is not differentiable, but we may still apply gradient optimization by considering differentiable pre-threshold outputs, or by replacing sign with tanh during learning and discretizing when trading.

#### Stochastic framework generalization

- The model can be extended by introducing a noise variable \(\epsilon_t\) in F(): \[\begin{equation}F_t = F(\theta_t; F_{t-1}, I_t; \epsilon_t) \text{ with } \epsilon_t \sim p_\epsilon(\epsilon)\end{equation}\label{eq_Ft}\]
- Noise level controls “exploration vs exploitation” behavior.

### B. Profit and wealth for trading systems

- Trading systems are optimized by maximizing a performance function U().

#### Additive profits

- Applicable if each trade is of fixed size.
- \(r_t = z_t - z_{t-1}\): price returns of the risky asset.
- \(r_t^f = z_t^f - z_{t-1}^f\): price returns of a risk-free asset (like T-Bills).
- Transaction cost rate: \(\delta\)
- Trading position size: \(\mu > 0\)
- Additive profits accumulated over T periods:

\[P_T = \sum\limits_{t=1}^T R_t\]

- Where: \[R_t = \mu \cdot \{r_t^f + F_{t-1} \cdot (r_t - r_t^f) - \delta \cdot |F_t - F_{t-1}|\}\]
- Usually we consider \(P_0 = 0\) and \(F_T = F_0 = 0\)
- When ignoring the risk-free rate of interest (i.e. \(r_t^f = 0\)), we have:

\[\begin{equation}R_t = \mu \cdot \{F_{t-1} \cdot r_t - \delta \cdot |F_t - F_{t-1}|\}\end{equation}\label{eq_Rt_add}\]

- The wealth of the trader is defined as: \(W_T = W_0 + P_T\).
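As a small worked example of the additive case with \(r_t^f = 0\), the sketch below computes \(R_t\) and \(P_T\) from a price list and a position list. The function name `additive_returns` and the sample numbers are illustrative assumptions.

```python
def additive_returns(prices, positions, mu=1.0, delta=0.0):
    """R_t = mu * (F_{t-1} * r_t - delta * |F_t - F_{t-1}|), with r_t = z_t - z_{t-1}."""
    returns = []
    for t in range(1, len(prices)):
        r_t = prices[t] - prices[t - 1]
        cost = delta * abs(positions[t] - positions[t - 1])
        returns.append(mu * (positions[t - 1] * r_t - cost))
    return returns


prices = [100.0, 101.0, 99.0, 102.0]
positions = [0, -1, 1, 0]            # F_0 = F_T = 0
rets = additive_returns(prices, positions, mu=1.0, delta=0.1)
profit = sum(rets)                   # P_T, so W_T = W_0 + profit
print(rets, profit)
```

Note that the position held over the interval is \(F_{t-1}\), while the cost at t is charged on the change to \(F_t\).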

#### Multiplicative profits

- Applicable if a fixed fraction of accumulated wealth \(\nu > 0\) is invested in each trade.
- We use: \(r_t = \frac{z_t}{z_{t-1}} - 1\) and \(r_t^f = \frac{z_t^f}{z_{t-1}^f} - 1\)
- If we assume no short sale and \(\nu = 1\) then the wealth at T is:

\[W_T = W_0 \cdot \prod\limits_{t=1}^T(1+R_t)\]

- Where: \((1+R_t) = \{1 + (1 - F_{t-1}) \cdot r_t^f + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)\)
- When ignoring the risk-free rate of interest (i.e. \(r_t^f = 0\)), we have:

\[\begin{equation}(1+R_t) = \{1 + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)\end{equation}\label{eq_Rt_mult}\]
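The multiplicative case with \(r_t^f = 0\) and \(\nu = 1\) can be illustrated the same way. The function name `multiplicative_wealth` and the sample data are assumptions made for this sketch.

```python
def multiplicative_wealth(prices, positions, delta=0.0, w0=1.0):
    """W_T = W_0 * prod_t (1 + F_{t-1} * r_t) * (1 - delta * |F_t - F_{t-1}|)."""
    wealth = w0
    for t in range(1, len(prices)):
        r_t = prices[t] / prices[t - 1] - 1.0          # relative price return
        growth = 1.0 + positions[t - 1] * r_t
        cost = 1.0 - delta * abs(positions[t] - positions[t - 1])
        wealth *= growth * cost
    return wealth


# Long for two periods, then flat; the cost is paid only when the position changes.
print(multiplicative_wealth([100.0, 110.0, 99.0], [1, 1, 0], delta=0.01))
```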

### C. Performance criteria

- We consider performance criteria as a function of the wealth: \(U(W_T)\)
- Or more generally: \(U(W_T, \dots, W_1, W_0)\)
- In both cases U can be expressed as a function of the trading returns: \(U(R_T,\dots,R_1,W_0)\), which we denote as \(U_T\).
- For trader optimization we are interested in the **marginal increase in performance due to return \(R_t\)**:

\[\begin{equation}D_t \varpropto \Delta U_t = U_t - U_{t-1}\label{eq_Dt}\end{equation}\]

- We call \(D_t\) the differential performance criterion.
- Note that \(U_{t-1}\) doesn't depend on \(R_t\)

### D. Differential Sharpe Ratio

- Sharpe ratio is defined as: \(S_T = \frac{\text{Average}(R_t)}{\text{Std deviation}(R_t)} = \frac{\bar{R}}{(\frac 1T \sum_{t=1}^T R_t^2 - \bar{R}^2)^\frac 12}\)
- With \(\bar{R} = \frac 1T \sum\limits_{t=1}^T R_t\)

The **Differential Sharpe Ratio** is obtained by considering exponential moving averages of the returns (\(A_t\)) and of the squared returns (\(B_t\)), and expanding to first order in the adaptation rate \(\eta\):

\[A_t = A_{t-1} + \eta \Delta A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1})\] \[B_t = B_{t-1} + \eta \Delta B_t = B_{t-1} + \eta \cdot (R_t^2 - B_{t-1})\]

- Then we have \(S_t = \frac{A_t}{\sqrt{B_t - A_t^2}}\), and:

\[\begin{align*}{S_t}_{|\eta>0} & \approx {S_t}_{|\eta=0} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2) \\ & \approx S_{t-1} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2)\end{align*}\]

- A zero adaptation rate corresponds to an infinite time average.
- Thus expanding about \(\eta=0\) corresponds to “**just turning on**” the adaptation.
- We define the **Differential Sharpe ratio** as:

\[D_t = {\frac{dS_t}{d\eta}}_{|\eta=0} = \frac{B_{t-1} \cdot \Delta A_t - \frac 12 A_{t-1} \Delta B_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}\]

- \(D_t\) describes the influence of trading return \(R_t\) on the Sharpe Ratio \(S_t\).
- Since \(S_{t-1}\) doesn't depend on \(R_t\), when we take \(S_t\) as utility function we get:

\[\frac{dU_t}{dR_t} = \frac{dS_t}{dR_t} \approx \eta \frac{dD_t}{dR_t} \\ \text{with: } \frac{dD_t}{dR_t} = \frac{B_{t-1} - A_{t-1} \cdot R_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}\]

- Problem: because risk is measured by the variance of \(R_t\), there is no distinction between upside and downside risk. Assuming \(A_{t-1} > 0\), \(D_t\) is largest at \(R_t^* = \frac{B_{t-1}}{A_{t-1}}\), where \(\frac{dD_t}{dR_t} = 0\).
- ⇒ The Sharpe ratio criterion penalizes gains larger than \(R_t^*\).
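The recursive updates of \(A_t\), \(B_t\) and the resulting \(D_t\) could be tracked online with a small helper such as the one below. The class name `DifferentialSharpe`, the initial values, and the guard that returns 0 while the variance estimate is not positive are assumptions of this sketch.

```python
class DifferentialSharpe:
    """Tracks the EMAs A_t, B_t and returns the differential Sharpe ratio D_t."""

    def __init__(self, eta=0.01, a0=0.0, b0=0.0):
        self.eta = eta
        self.a = a0   # A_{t-1}: EMA of returns
        self.b = b0   # B_{t-1}: EMA of squared returns

    def update(self, r):
        """Return D_t for the new return r, then advance A and B."""
        delta_a = r - self.a
        delta_b = r * r - self.b
        var = self.b - self.a ** 2
        # Guard (assumption): D_t is undefined while the variance estimate is not positive.
        if var <= 0:
            d = 0.0
        else:
            d = (self.b * delta_a - 0.5 * self.a * delta_b) / var ** 1.5
        self.a += self.eta * delta_a
        self.b += self.eta * delta_b
        return d


def d_at(r):
    # D_t for a single return r, starting from A = 0.1 and B = 0.05
    return DifferentialSharpe(a0=0.1, b0=0.05).update(r)


# With A = 0.1 and B = 0.05 the maximizing return is R* = B / A = 0.5:
# returns above R* score lower than R* itself.
print(d_at(0.4), d_at(0.5), d_at(0.6))
```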

### E. Downside risk

- Variance is increasingly considered an inadequate risk measure because of the issue mentioned above.
- Other options are:
  - Downside Deviation (DD)
  - Second Lower Partial Moment (SLPM)
  - Nth Lower Partial Moment
  - Sterling Ratio, defined as: \(\text{Sterling Ratio} = \frac{\text{Annualized Average Return}}{\text{Max. Draw-Down}}\), where the maximum drawdown is relative to a standard period (usually 1 to 3 years).

- Maximum drawdown is cumbersome to minimize, thus we focus on the Downside Deviation (which tracks the Sterling Ratio effectively).

- Downside Deviation defined as: \(DD_T = \left( \frac 1T \sum\limits_{t=1}^T min(R_t,0)^2\right)^\frac 12\)

- We then define our utility function, the **Downside Deviation Ratio** (DDR):

\[DDR_T = \frac{\text{Average}(R_t)}{DD_T}\]

- Next we define the exponential moving average (EMA) of returns and \(DD_t^2\):

\[A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1}) \\ DD_t^2 = DD_{t-1}^2 + \eta \cdot (min(R_t,0)^2 - DD_{t-1}^2)\]

- And we use the exponential moving average version of the DDR: \(DDR_t = \frac{A_t}{DD_t}\)

- We consider the first order expansion in adaptation rate \(\eta\):

\[DDR_t \approx DDR_{t-1} + \eta {\frac{dDDR_t}{d\eta}}_{|\eta=0} + O(\eta^2)\]

- And we can then define the
**Differential Downside Deviation Ratio**:

\[\begin{align*}D_t & \equiv {\frac{dDDR_t}{d\eta}}_{|\eta=0} \\ & = \frac{R_t - \frac 12 A_{t-1}}{DD_{t-1}} \text{, when } R_t > 0 \\ & = \frac{DD_{t-1}^2 \cdot (R_t - \frac 12 A_{t-1}) - \frac 12 A_{t-1} \cdot R_t^2}{DD_{t-1}^3} \text{, when } R_t \le 0\end{align*}\]
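The two branches of the differential DDR translate directly into code. In this sketch the function names `differential_ddr` and `ema_update` are illustrative, and `dd_prev` is assumed to be strictly positive.

```python
def differential_ddr(r, a_prev, dd_prev):
    """D_t for return r given A_{t-1} and DD_{t-1} (dd_prev must be > 0)."""
    if r > 0:
        return (r - 0.5 * a_prev) / dd_prev
    numerator = dd_prev ** 2 * (r - 0.5 * a_prev) - 0.5 * a_prev * r ** 2
    return numerator / dd_prev ** 3


def ema_update(r, a_prev, dd2_prev, eta):
    """Advance the EMAs of the returns (A_t) and of the squared downside returns (DD_t^2)."""
    a = a_prev + eta * (r - a_prev)
    dd2 = dd2_prev + eta * (min(r, 0.0) ** 2 - dd2_prev)
    return a, dd2


# Positive returns enter linearly, so larger gains always score higher.
print(differential_ddr(0.3, a_prev=0.1, dd_prev=0.2))
print(differential_ddr(-0.1, a_prev=0.1, dd_prev=0.2))
```

Unlike the differential Sharpe ratio, the branch for \(R_t > 0\) is increasing in \(R_t\), and the two branches agree at \(R_t = 0\).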

## III. Learning to Trade

- Reinforcement learning adjusts the parameters of a system to maximize the expected payoff that is generated due to the actions of the system.
- This is accomplished through trial-and-error exploration of the environment and of the space of strategies.

- Supervised learning is effective for the **structural credit assignment** problem, but not for **temporal credit assignment**.
  - Structural credit assignment: assign credit to the parameters of a problem.
  - Temporal credit assignment: assign credit to the individual actions taken over time.
- ⇒ Reinforcement learning tries to solve both problems at the same time.

### A. Recurrent Reinforcement Learning

- Given a trading system model \(F_t(\theta)\), the goal is to adjust the parameters \(\theta\) in order to maximize \(U_T\)
- For traders of form \(\eqref{eq1}\) and trading returns of form \(\eqref{eq_Rt_add}\) or \(\eqref{eq_Rt_mult}\) the gradient of \(U_T\) with respect to the parameters \(\theta\) of the system after a sequence of T periods is:

\[\begin{equation}\frac{dU_T(\theta)}{d\theta} = \sum\limits_{t=1}^T \frac{dU_T}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_batch_rrl}\end{equation}\]

- The system can be optimized in batch mode: repeatedly computing the value of \(U_T\) on forward passes and adjusting the parameters by using gradient ascent (with learning rate \(\rho\)):

\[\begin{equation}\Delta \theta = \rho \frac{dU_T(\theta)}{d\theta}\label{eq2_batch_rrl}\end{equation}\]

- Note that the quantities \(dF_t/d\theta\) are total derivatives, so we need an approach similar to **Back-Propagation Through Time** (BPTT), thus: \[\begin{equation}\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}} \frac {dF_{t-1}}{d\theta}\label{eq3_batch_rrl}\end{equation}\]

- We assume here differentiability of \(F_t\). For long/short traders with thresholds the reinforcement signal can be backpropagated through the pre-thresholded outputs (similar to Adaline learning rule).
- The previous equations \(\eqref{eq1_batch_rrl}\), \(\eqref{eq2_batch_rrl}\) and \(\eqref{eq3_batch_rrl}\) constitute the **batch RRL algorithm**.
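A minimal sketch of the batch gradient follows, under several assumptions that are not fixed by the notes: a single tanh unit as the differentiable trader, additive returns with \(r_t^f = 0\), the plain Sharpe ratio \(S_T\) as \(U_T\), and an initial position of 0. The function name `rrl_sharpe_gradient` and the parameter layout `theta = [u, v_0..v_m, w]` are illustrative.

```python
import math


def rrl_sharpe_gradient(theta, r, m, delta=0.0, mu=1.0):
    """Return (S_T, dS_T/dtheta) for theta = [u, v_0..v_m, w] over the returns r."""
    u, n = theta[0], len(theta)
    f_prev, df_prev = 0.0, [0.0] * n       # assumed start: F = 0, dF/dtheta = 0
    R, terms = [], []
    for t in range(m, len(r)):
        x = [f_prev] + [r[t - i] for i in range(m + 1)] + [1.0]
        f = math.tanh(sum(p * xi for p, xi in zip(theta, x)))
        # Total derivative: dF_t/dtheta = (1 - F_t^2) * (x_t + u * dF_{t-1}/dtheta)
        df = [(1.0 - f * f) * (xi + u * dp) for xi, dp in zip(x, df_prev)]
        s = (f > f_prev) - (f < f_prev)    # sign(F_t - F_{t-1})
        R.append(mu * (f_prev * r[t] - delta * abs(f - f_prev)))
        dR_dF = -mu * delta * s
        dR_dFprev = mu * (r[t] + delta * s)
        terms.append([dR_dF * a + dR_dFprev * b for a, b in zip(df, df_prev)])
        f_prev, df_prev = f, df
    T = len(R)
    A = sum(R) / T
    B = sum(v * v for v in R) / T
    var = B - A * A
    sharpe = A / math.sqrt(var)
    grad = [0.0] * n
    for R_t, term in zip(R, terms):
        dS_dR = (B - A * R_t) / (T * var ** 1.5)   # dS_T/dR_t
        for k in range(n):
            grad[k] += dS_dR * term[k]
    return sharpe, grad


# One gradient-ascent step with learning rate rho.
r = [0.01, -0.02, 0.015, 0.03, -0.01, 0.02, -0.025, 0.005, 0.01, -0.015]
theta = [0.3, 0.5, -0.2, 0.1]              # u, v_0, v_1, w  (m = 1)
rho = 0.05
sharpe, grad = rrl_sharpe_gradient(theta, r, m=1, delta=0.001)
theta = [p + rho * g for p, g in zip(theta, grad)]
```

The recursion on `df` is the forward-mode counterpart of the BPTT relation: the stored derivative from the previous step is reused instead of unrolling the sequence backwards.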

- There are 2 ways to extend this batch algorithm into a stochastic framework:
  - Exploration of the strategy space can be induced by incorporating a noise variable \(\epsilon_t\) (as in \(\eqref{eq_Ft}\)). In that case:
    - The trade-off between **exploration** of the strategy space and **exploitation** of a learned policy can be controlled by the magnitude of the noise variance \(\sigma_\epsilon\).
    - The noise magnitude can be annealed over time to arrive at a good strategy.

  - A simple online stochastic optimization can be obtained by considering only the term in \(\eqref{eq1_batch_rrl}\) that depends on the most recently realized return \(R_t\) (during the forward pass):

\[\begin{equation}\frac{dU_t(\theta)}{d\theta} \approx \frac{dU_t}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_online_rrl}\end{equation}\]

- The parameters are then updated online:

\[\begin{equation}\Delta \theta_t = \rho \frac{dU_t(\theta_t)}{d\theta_t}\label{eq2_online_rrl}\end{equation}\]

- This algorithm performs stochastic optimization since the system parameters are varied during each forward pass through the training data.

- The stochastic online analog to \(\eqref{eq3_batch_rrl}\) is: \[\begin{equation}\frac{dF_t}{d\theta_t} \approx \frac{\partial F_t}{\partial \theta_t} + \frac{\partial F_t}{\partial F_{t-1}} \frac {dF_{t-1}}{d\theta_{t-1}}\label{eq3_online_rrl}\end{equation}\]

- Equations \(\eqref{eq1_online_rrl}\), \(\eqref{eq2_online_rrl}\) and \(\eqref{eq3_online_rrl}\) constitute the **stochastic (or adaptive) RRL algorithm**.

- ⇒ This is a reinforcement algorithm closely related to recurrent supervised algorithms such as **Real Time Recurrent Learning (RTRL)** and **Dynamic Backpropagation**.

- When considering differential performance criteria \(D_t\) as described in \(\eqref{eq_Dt}\), the stochastic update equations become:

\[\begin{equation}\begin{split}\Delta\theta_t & = \rho \frac{dD_t(\theta_t)}{d\theta_t}\\ & \approx \rho \frac{dD_t}{dR_t}\{\frac{dR_t}{dF_t}\frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta_{t-1}}\}\end{split}\end{equation}\]

- Note that for financial data adding a noise variable \(\epsilon_t\) doesn't provide any significant advantage since the input data already contain significant noise.
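One step of the adaptive update with the differential Sharpe ratio might look like the sketch below. It assumes a single tanh unit, additive returns with \(r_t^f = 0\), and no update while the variance estimate is not positive. The function name `online_rrl_step` and the `state` dictionary layout are hypothetical.

```python
import math


def online_rrl_step(theta, state, inputs, rho=0.1, eta=0.01, delta=0.0, mu=1.0):
    """One adaptive RRL update; inputs = [r_t, ..., r_{t-m}], theta = [u, v_0..v_m, w]."""
    f_prev, df_prev = state["F"], state["dF"]
    A, B = state["A"], state["B"]
    x = [f_prev] + list(inputs) + [1.0]
    f = math.tanh(sum(p * xi for p, xi in zip(theta, x)))
    # Recursive total derivative, reusing the stored dF_{t-1}/dtheta_{t-1}
    df = [(1.0 - f * f) * (xi + theta[0] * dp) for xi, dp in zip(x, df_prev)]
    s = (f > f_prev) - (f < f_prev)        # sign(F_t - F_{t-1})
    r_t = inputs[0]
    R = mu * (f_prev * r_t - delta * abs(f - f_prev))
    var = B - A * A
    # dD_t/dR_t; skipped (set to 0) until the variance estimate is positive (assumption)
    dD_dR = (B - A * R) / var ** 1.5 if var > 0 else 0.0
    dR_dF = -mu * delta * s
    dR_dFprev = mu * (r_t + delta * s)
    grad = [dD_dR * (dR_dF * a + dR_dFprev * b) for a, b in zip(df, df_prev)]
    new_theta = [p + rho * g for p, g in zip(theta, grad)]
    new_state = {"F": f, "dF": df,
                 "A": A + eta * (R - A), "B": B + eta * (R * R - B)}
    return new_theta, new_state


# A single step from an arbitrary starting state (m = 0, so theta = [u, v_0, w]).
theta = [0.0, 0.0, 0.0]
state = {"F": 1.0, "dF": [0.0, 0.0, 0.0], "A": 0.1, "B": 0.05}
theta, state = online_rrl_step(theta, state, [0.2], rho=1.0, eta=0.01, delta=0.01)
print(theta, state["A"], state["B"])
```

In this example the only gradient signal comes from the transaction cost: the update pushes \(F_t\) back toward the previously held long position.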

### B. Value functions and Q-Learning

## IV. Empirical Results

- Using 3 test cases:
  - Artificial price series (using the Sharpe ratio)
  - Half-hourly US Dollar/British Pound (USDGBP) exchange rate (using the Downside Deviation Ratio)
  - Comparison of RRL and Q-Learning on the monthly S&P 500 stock index.

### A. Trader Simulation

- Using RRL trader taking {long, short} positions, with a state similar to \(\eqref{eq_2states_trader}\).
- Experiments demonstrate that:
  - RRL is an effective means of learning trading strategies.
  - Trading frequency is reduced, as expected, as transaction costs increase.

#### A.1 Data

- Generating log price series as random walks with autoregressive trend processes:

\[\begin{equation}\begin{split}p(t) & = p(t-1) + \beta(t-1) + k \epsilon(t)\\\beta(t) & = \alpha \beta(t-1) + \nu(t)\end{split}\end{equation}\]

- Where \(\alpha\) and k are constants and \(\epsilon(t)\) and \(\nu(t)\) are normal random deviates with zero mean and unit variance: \(\epsilon(t) \sim \mathcal{N}(0,1)\) and \(\nu(t) \sim \mathcal{N}(0,1)\)

- Artificial prices then defined as: \(z(t) = exp\left(\frac{p(t)}{R}\right)\), where \(R = max(p(t)) - min(p(t))\).

- Experiments were done with 10000 samples and \(\alpha = 0.9\) and \(k = 3\)
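The generating process above can be reproduced with the standard library alone. The starting values \(p(0) = 0\) and \(\beta(0) = 0\), the seed, and the function name `artificial_prices` are assumptions of this sketch.

```python
import math
import random


def artificial_prices(n=10000, alpha=0.9, k=3.0, seed=0):
    """Random-walk log prices with an autoregressive trend, rescaled and exponentiated."""
    rng = random.Random(seed)
    p, beta = [0.0], 0.0                   # assumed initial values p(0) = beta(0) = 0
    for _ in range(n - 1):
        # p(t) = p(t-1) + beta(t-1) + k * eps(t)
        p.append(p[-1] + beta + k * rng.gauss(0.0, 1.0))
        # beta(t) = alpha * beta(t-1) + nu(t)
        beta = alpha * beta + rng.gauss(0.0, 1.0)
    R = max(p) - min(p)
    return [math.exp(x / R) for x in p]    # z(t) = exp(p(t) / R)


z = artificial_prices()
print(len(z), min(z), max(z))
```

Dividing by the range R means the log prices span exactly one unit, so the highest price is e times the lowest.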

#### A.2 Simulated Trading Results

- Input at time t constructed from the previous 8 returns.
- RRL trader initialized randomly
- Trader adapted using real-time recurrent learning to optimize the differential Sharpe ratio.
- Transaction cost fixed at 0.5% during learning and trading
- Transient effects of initial learning visible during the first 2000 time steps.
- In these simulations the 10000 samples are partitioned in:
  - 1000 samples training set
  - 9000 samples test set.

- Traders are first optimized on the training data set for 100 epochs
- Then adapted online on the test data set.

- In 100 experiments, positive Sharpe ratios are always obtained.
- And, as expected, trading frequency is reduced as transaction costs increase.

### B. US Dollar/British Pound Foreign exchange trading system

- Using half-hourly USDGBP security.
- Training a {long,short,neutral} trading system.
- Trading system incurring transaction costs from the bid-ask spread.
- Training to maximize the differential Downside Deviation Ratio.
- System initially training on 2000 data points, then producing signals for 2 weeks (480 points), then the training window is shifted (to include the just tested 480 points) and the system is re-trained.
- Using an EMA Sharpe Ratio with time constant of 0.01.
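The train/test schedule described above can be expressed as a generator of index ranges. This sketch assumes a fixed-length training window that slides forward by the test size; an expanding window would be another reading of the description. The function name `walk_forward_windows` is illustrative.

```python
def walk_forward_windows(n_points, train_size=2000, test_size=480):
    """Yield (train_range, test_range) index pairs for a sliding train/test schedule."""
    start = 0
    while start + train_size + test_size <= n_points:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size                 # the tested points join the next training window


for train, test in walk_forward_windows(3000):
    print(train.start, train.stop, test.start, test.stop)
```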

### C. S&P 500 / T-Bill Asset Allocation

- RRL trader using a single tanh unit and regularized using quadratic weight decay during training (regularization parameter: 0.01)

- Sensitivity of input \(x_i\) is defined as: \(S_i = \frac{|\frac{dF}{dx_i}|}{max_j |\frac{dF}{dx_j}|}\)
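As an illustration of the sensitivity measure, the sketch below estimates \(dF/dx_i\) by central finite differences and normalizes by the largest magnitude. The function name `input_sensitivities`, the step size `eps`, and the example weights are assumptions. For a single tanh unit the result reduces to \(|w_i| / \max_j |w_j|\).

```python
import math


def input_sensitivities(f, x, eps=1e-6):
    """S_i = |dF/dx_i| / max_j |dF/dx_j|, with derivatives from central differences."""
    grads = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grads.append(abs(f(hi) - f(lo)) / (2 * eps))
    top = max(grads)                       # assumes at least one non-zero derivative
    return [g / top for g in grads]


weights = [0.5, -1.0, 0.25]                # hypothetical weights of a single tanh unit


def trader(x):
    return math.tanh(sum(w * xi for w, xi in zip(weights, x)))


print(input_sensitivities(trader, [0.1, 0.2, -0.3]))
```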