# Learning to Trade via Direct Reinforcement

Authors: John Moody & Matthew Saffell
Date: 2001

Presenting an adaptive algorithm, Recurrent Reinforcement Learning (RRL), which differs from TD-Learning and Q-Learning.

## I. Introduction

• Investment performance depends upon sequences of interdependent decisions ⇒ it is path-dependent.
• RRL is an adaptive policy-search algorithm that can learn an investment strategy online.
• In financial problems we can use the Direct Reinforcement approach to provide immediate feedback for optimizing a strategy.
• Frequently used class of performance criteria: measures of risk-adjusted investment returns.
• RRL can balance the accumulation of returns with the avoidance of risk.
• We formulate here the differential forms of the Sharpe ratio and Downside Deviation Ratio for efficient online learning with RRL.

## II. Trading systems and performance criteria

### A. Structure of trading systems

• We consider agents that trade fixed position sizes in a single security.
• Traders are assumed to take only long, neutral, or short positions of constant magnitude: $F_t \in \{1, 0, -1\}$.
• The price series is denoted $z_t$.
• Position $F_t$ is established (or maintained) at the end of each time interval $t$ ⇒ a trade is possible at the end of each time period.
• Return $R_t$ is realized at the end of the interval $(t-1, t]$ and includes:
1. The profit/loss resulting from held position $F_{t-1}$
2. The transaction cost incurred at t due to difference between $F_{t-1}$ and $F_t$.
• Trader must have internal state information and must therefore be recurrent.
• We use the following decision function:

$$\begin{split}F_t & = F(\theta_t; F_{t-1}, I_t) \\ I_t & = (z_t,z_{t-1},z_{t-2},\ldots; y_t, y_{t-1}, y_{t-2}, \ldots)\end{split}\label{eq1}$$

• With:
• $\theta_t$ : learned system parameters at time t.
• $I_t$ : information set at time t.
• $z_t$ : price series.
• $y_t$ : external variables.
• Simple {long, short} trader example with m+1 autoregressive inputs:

$$F_t = \operatorname{sign}(u \cdot F_{t-1} + v_0 \cdot r_t + v_1 \cdot r_{t-1} + \cdots + v_m \cdot r_{t-m} + w)\label{eq_2states_trader}$$

• Where $r_t = z_t - z_{t-1}$ are the price returns of $z_t$ and the parameters are $\theta = \{u, v_i, w\}$.
• This is a discrete-action, deterministic trader.
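A minimal sketch of this decision function in Python (a sketch under illustrative names, assuming numpy; `theta` packs the parameters $u$, $v_i$, $w$ from the equation above):

```python
import numpy as np

def trade_decision(theta, F_prev, returns_window):
    """One step of the {long, short} threshold trader sketched above.

    theta: dict with scalars 'u', 'w' and weight vector 'v' (one entry
        per lagged return in returns_window).
    F_prev: previous position in {-1, +1}.
    returns_window: array [r_t, r_{t-1}, ..., r_{t-m}] of past price returns.
    """
    activation = theta["u"] * F_prev + theta["v"] @ returns_window + theta["w"]
    # np.sign(0) = 0; a real trader would break the tie, e.g. by holding F_prev
    return np.sign(activation)
```

Replacing `np.sign` with `np.tanh` gives the continuous trader discussed next.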

#### Continuous function generalization

• We can use a continuously valued $F(\cdot)$ by replacing sign with tanh.
• The discrete output $F_t \in \{1,0,-1\}$ is not differentiable, but we may still apply gradient optimization by considering differentiable pre-threshold outputs, or by replacing sign with tanh during learning and discretizing the output when trading.

#### Stochastic framework generalization

• The model can be extended by introducing a noise variable into $F(\cdot)$: $$F_t = F(\theta_t; F_{t-1}, I_t; \epsilon_t) \text{ with } \epsilon_t \sim p_\epsilon(\epsilon)\label{eq_Ft}$$
• Noise level controls “exploration vs exploitation” behavior.

### B. Profit and wealth for trading systems

• Trading systems are optimized by maximizing a performance function $U(\cdot)$.

• Assuming each trade is of fixed size:
• $r_t = z_t - z_{t-1}$: price returns of the risky asset.
• $r_t^f = z_t^f - z_{t-1}^f$ : price returns of the risk-free asset (like T-Bills).
• Transaction cost rate: $\delta$
• Trading position size: $\mu > 0$
• Additive profits accumulated over T periods:

$P_T = \sum\limits_{t=1}^T R_t$

• Where: $R_t = \mu \cdot \{r_t^f + F_{t-1} \cdot (r_t - r_t^f) - \delta \cdot |F_t - F_{t-1}|\}$
• Usually we consider $P_0 = 0$ and $F_T = F_0 = 0$
• When ignoring the risk-free rate of interest (i.e., $r_t^f = 0$), we have:

$$R_t = \mu \cdot \{F_{t-1} \cdot r_t - \delta \cdot |F_t - F_{t-1}|\}\label{eq_Rt_add}$$

• The wealth of the trader is defined as: $W_T = W_0 + P_T$.
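A sketch of the additive profit computation under these conventions (ignoring the risk-free rate; function and variable names are illustrative):

```python
import numpy as np

def additive_profit(z, F, mu=1.0, delta=0.005):
    """Accumulated additive profit P_T = sum_t R_t, with transaction costs.

    z: price series of length T+1 (z_0, ..., z_T).
    F: positions of length T+1, with F_0 = F_T = 0 by convention.
    """
    r = np.diff(z)                                    # r_t = z_t - z_{t-1}
    R = mu * (F[:-1] * r - delta * np.abs(np.diff(F)))
    return R.sum()
```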

#### Multiplicative profits

• Applicable if fixed fraction of accumulated wealth $\nu > 0$ is invested in each trade.
• We use: $r_t = \frac{z_t}{z_{t-1}} - 1$ and $r_t^f = \frac{z_t^f}{z_{t-1}^f} - 1$
• If we assume no short sale and $\nu = 1$ then the wealth at T is:

$W_T = W_0 \cdot \prod\limits_{t=1}^T(1+R_t)$

• Where: $(1+R_t) = \{1 + (1 - F_{t-1}) \cdot r_t^f + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)$
• When ignoring the risk-free rate of interest (i.e., $r_t^f = 0$), we have:

$$(1+R_t) = \{1 + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)\label{eq_Rt_mult}$$
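The multiplicative counterpart (again with $r_t^f = 0$ and $\nu = 1$; a sketch following the same conventions as the additive version above):

```python
import numpy as np

def multiplicative_wealth(z, F, W0=1.0, delta=0.005):
    """Final wealth W_T = W_0 * prod_t (1 + R_t) under multiplicative profits."""
    r = z[1:] / z[:-1] - 1.0                          # r_t = z_t / z_{t-1} - 1
    growth = (1.0 + F[:-1] * r) * (1.0 - delta * np.abs(np.diff(F)))
    return W0 * np.prod(growth)
```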

### C. Performance criteria

• We consider performance criteria that are functions of the wealth: $U(W_T)$
• Or more generally: $U(W_T, \dots, W_1, W_0)$
• In both cases U can be expressed as a function of trading returns: $U(R_T,\dots,R_1,W_0)$, which we denote as $U_T$
• For trader optimization we are interested in the marginal increase in performance due to return $R_t$:

$$D_t \varpropto \Delta U_t = U_t - U_{t-1}\label{eq_Dt}$$

• We call $D_t$ the differential performance criterion.
• Note that $U_{t-1}$ does not depend on $R_t$.

### D. Differential Sharpe Ratio

• Sharpe ratio is defined as: $S_T = \frac{\text{Average}(R_t)}{\text{Std deviation}(R_t)} = \frac{\bar{R}}{(\frac 1T \sum_{t=1}^T R_t^2 - \bar{R}^2)^\frac 12}$
• With $\bar{R} = \frac 1T \sum\limits_{t=1}^T R_t$
• The Differential Sharpe Ratio is obtained by considering exponential moving averages of the returns and of the squared returns, and expanding to first order in the adaptation rate $\eta$:

$$A_t = A_{t-1} + \eta \Delta A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1})$$
$$B_t = B_{t-1} + \eta \Delta B_t = B_{t-1} + \eta \cdot (R_t^2 - B_{t-1})$$

• Then we have $S_t = \frac{A_t}{\sqrt{B_t - A_t^2}}$, and:

\begin{align*}{S_t}_{|\eta>0} & \approx {S_t}_{|\eta=0} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2) \\ & \approx S_{t-1} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2)\end{align*}

• A zero adaptation rate corresponds to an infinite time average.
• Thus expanding about $\eta=0$ corresponds to “just turning on” the adaptation.
• We define the Differential Sharpe ratio as:

$D_t = {\frac{dS_t}{d\eta}}_{|\eta=0} = \frac{B_{t-1} \cdot \Delta A_t - \frac 12 A_{t-1} \Delta B_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}$

• $D_t$ describes the influence of trading return $R_t$ on the Sharpe Ratio $S_t$.
• Since $S_{t-1}$ doesn't depend on $R_t$, when we take $S_t$ as utility function we get:

$$\frac{dU_t}{dR_t} = \frac{dS_t}{dR_t} \approx \eta \frac{dD_t}{dR_t}, \quad \text{with } \frac{dD_t}{dR_t} = \frac{B_{t-1} - A_{t-1} \cdot R_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}$$

• Problem: since the risk term is based on the variance of $R_t$, there is no distinction between upside and downside risk. Assuming $A_{t-1} > 0$, the largest possible improvement occurs at $R_t^* = \frac{B_{t-1}}{A_{t-1}}$
• ⇒ The Sharpe ratio criterion thus penalizes larger gains.
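A sketch of one online update of the Differential Sharpe Ratio, following the formulas above (variable names are mine; `eps` guards against a zero-variance denominator):

```python
def differential_sharpe(R_t, A_prev, B_prev, eta=0.01, eps=1e-12):
    """Return D_t and the updated EMAs (A_t, B_t).

    A is the EMA of returns, B the EMA of squared returns.
    """
    dA = R_t - A_prev                              # Delta A_t
    dB = R_t**2 - B_prev                           # Delta B_t
    denom = (B_prev - A_prev**2) ** 1.5 + eps      # (B - A^2)^(3/2)
    D_t = (B_prev * dA - 0.5 * A_prev * dB) / denom
    return D_t, A_prev + eta * dA, B_prev + eta * dB
```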

### E. Downside risk

• Variance is increasingly considered an inadequate risk measure due to the issue just mentioned.
• Other options are:
1. Downside Deviation (DD)
2. Second Lower Partial Moment (SLPM)
3. Nth Lower Partial Moment
4. Sterling Ratio, defined as: $\text{Sterling Ratio} = \frac{\text{Annualized Average Return}}{\text{Max. Draw-Down}}$, where the maximum drawdown is measured relative to a standard period (usually 1 to 3 years).
• Maximum drawdown is cumbersome to minimize, so we focus on the Downside Deviation (which tracks the Sterling Ratio effectively).
• The Downside Deviation is defined as: $DD_T = \left( \frac 1T \sum\limits_{t=1}^T \min(R_t,0)^2\right)^\frac 12$
• We then define our utility function, the Downside Deviation Ratio (DDR):

$DDR_T = \frac{\text{Average}(R_t)}{DD_T}$

• Next we define the exponential moving average (EMA) of returns and $DD_t^2$:

$$A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1})$$
$$DD_t^2 = DD_{t-1}^2 + \eta \cdot (\min(R_t,0)^2 - DD_{t-1}^2)$$

• And we use the exponential moving average version of the DDR: $DDR_t = \frac{A_t}{DD_t}$
• We consider the first order expansion in adaptation rate $\eta$:

$DDR_t \approx DDR_{t-1} + \eta {\frac{dDDR_t}{d\eta}}_{|\eta=0} + O(\eta^2)$

• And we can then define the Differential Downside Deviation Ratio:

\begin{align*}D_t & \equiv {\frac{dDDR_t}{d\eta}}_{|\eta=0} \\ & = \frac{R_t - \frac 12 A_{t-1}}{DD_{t-1}} \text{, when } R_t > 0 \\ & = \frac{DD_{t-1}^2 \cdot (R_t - \frac 12 A_{t-1}) - \frac 12 A_{t-1} \cdot R_t^2}{DD_{t-1}^3} \text{, when } R_t \le 0\end{align*}
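A sketch of the corresponding online update, mirroring `differential_sharpe` above (names are mine; `eps` avoids division by zero before any losses have been seen):

```python
def differential_ddr(R_t, A_prev, DD2_prev, eta=0.01, eps=1e-12):
    """Return D_t and the updated EMAs (A_t, DD2_t).

    A is the EMA of returns, DD2 the EMA of squared downside returns.
    """
    DD_prev = (DD2_prev + eps) ** 0.5
    if R_t > 0:
        D_t = (R_t - 0.5 * A_prev) / DD_prev
    else:
        D_t = (DD2_prev * (R_t - 0.5 * A_prev)
               - 0.5 * A_prev * R_t**2) / DD_prev**3
    A_t = A_prev + eta * (R_t - A_prev)
    DD2_t = DD2_prev + eta * (min(R_t, 0.0) ** 2 - DD2_prev)
    return D_t, A_t, DD2_t
```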

## III. Learning to trade

• Reinforcement learning adjusts the parameters of a system to maximize the expected payoff generated by the actions of the system.
• This is accomplished through trial-and-error exploration of the environment and of the space of strategies.
• Supervised learning is effective for the structural credit assignment problem, but not for temporal credit assignment.
• Structural credit assignment: assign credits to the parameters of a problem.
• Temporal credit assignment: assign credits to the individual actions taken over time.
• ⇒ Reinforcement learning tries to solve both problems at the same time.

### A. Recurrent Reinforcement Learning

• Given a trading system model $F_t(\theta)$, the goal is to adjust the parameters $\theta$ in order to maximize $U_T$
• For traders of form $\eqref{eq1}$ and trading returns of form $\eqref{eq_Rt_add}$ or $\eqref{eq_Rt_mult}$ the gradient of $U_T$ with respect to the parameters $\theta$ of the system after a sequence of T periods is:

$$\frac{dU_T(\theta)}{d\theta} = \sum\limits_{t=1}^T \frac{dU_T}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_batch_rrl}$$

• The system can be optimized in batch mode: repeatedly computing the value of $U_T$ on forward passes and adjusting the parameters by using gradient ascent (with learning rate $\rho$):

$$\Delta \theta = \rho \frac{dU_T(\theta)}{d\theta}\label{eq2_batch_rrl}$$

• Note that the quantities $dF_t/d\theta$ are total derivatives, so we need an approach similar to Back-Propagation Through Time (BPTT), thus: $$\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}} \frac {dF_{t-1}}{d\theta}\label{eq3_batch_rrl}$$
• We assume here differentiability of $F_t$. For long/short traders with thresholds the reinforcement signal can be backpropagated through the pre-thresholded outputs (similar to Adaline learning rule).
• Previous equations $\eqref{eq1_batch_rrl}$, $\eqref{eq2_batch_rrl}$ and $\eqref{eq3_batch_rrl}$ constitute the batch RRL algorithm.
• There are 2 ways to extend this batch algorithm into a stochastic framework:
1. Exploration of strategy space can be induced by incorporating a noise variable $\epsilon_t$ (as in $\eqref{eq_Ft}$). In that case:
1. trade-off between exploration of the strategy space and exploitation of a learned policy can be controlled by the amplitude of the noise variance $\sigma_\epsilon$.
2. The noise magnitude can be annealed over time to arrive at a good strategy.
2. A simple online stochastic optimization can be obtained by considering only the term in $\eqref{eq1_batch_rrl}$ that depends on the most recently realized return $R_t$ (during the forward pass):

$$\frac{dU_t(\theta)}{d\theta} \approx \frac{dU_t}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_online_rrl}$$

• The parameters are then updated online:

$$\Delta \theta_t = \rho \frac{dU_t(\theta_t)}{d\theta_t}\label{eq2_online_rrl}$$

• This algorithm performs stochastic optimization, since the system parameters are varied during each forward pass through the training data.
• The stochastic online analog of $\eqref{eq3_batch_rrl}$ is: $$\frac{dF_t}{d\theta_t} \approx \frac{\partial F_t}{\partial \theta_t} + \frac{\partial F_t}{\partial F_{t-1}} \frac {dF_{t-1}}{d\theta_{t-1}}\label{eq3_online_rrl}$$
• Equations $\eqref{eq1_online_rrl}$, $\eqref{eq2_online_rrl}$ and $\eqref{eq3_online_rrl}$ constitute the stochastic (or adaptive) RRL algorithm.
• ⇒ This is a reinforcement algorithm closely related to recurrent supervised algorithms such as Real-Time Recurrent Learning (RTRL) and Dynamic Backpropagation.
• When considering differential performance criteria $D_t$ as described in $\eqref{eq_Dt}$, the stochastic update equations become:

$$\begin{split}\Delta\theta_t & = \rho \frac{dD_t(\theta_t)}{d\theta_t}\\ & \approx \rho \frac{dD_t}{dR_t}\left\{\frac{dR_t}{dF_t}\frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta_{t-1}}\right\}\end{split}$$

• Note that for financial data, adding a noise variable $\epsilon_t$ provides no significant advantage, since the input data already contain substantial noise.
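Putting the pieces together, here is a compact sketch of the stochastic RRL algorithm for a {long, short} tanh trader maximizing the differential Sharpe ratio. Everything here (names, warm-up length, hyperparameter defaults) is illustrative, not from the paper:

```python
import numpy as np

def train_rrl_online(r, m=8, eta=0.01, rho=0.1, mu=1.0, delta=0.005):
    """Online RRL. r: array of price returns. theta packs (u, v_1..v_m, w)
    as one weight vector over the input x_t = [F_{t-1}, r_t, ..., r_{t-m+1}, 1]."""
    theta = np.zeros(m + 2)
    dF_prev = np.zeros(m + 2)        # dF_{t-1}/dtheta, carried recurrently
    F_prev, A, B = 0.0, 0.0, 0.0
    positions = []
    for t in range(m, len(r)):
        x = np.concatenate(([F_prev], r[t - m + 1:t + 1][::-1], [1.0]))
        F = np.tanh(theta @ x)       # continuous position; sign(F) to trade discretely
        R = mu * (F_prev * r[t] - delta * abs(F - F_prev))
        # dD_t/dR_t for the differential Sharpe ratio (uses A, B from t-1)
        dD_dR = (B - A * R) / ((B - A**2) ** 1.5 + 1e-12)
        # dR_t/dF_t and dR_t/dF_{t-1} from the additive return formula
        dR_dF = -mu * delta * np.sign(F - F_prev)
        dR_dFprev = mu * (r[t] + delta * np.sign(F - F_prev))
        # recurrent total derivative dF_t/dtheta (online analog of BPTT)
        dF = (1 - F**2) * (x + theta[0] * dF_prev)
        if t > m + 20:               # short warm-up so the EMAs are meaningful
            theta += rho * dD_dR * (dR_dF * dF + dR_dFprev * dF_prev)
        A += eta * (R - A)           # EMA updates come after D_t is computed
        B += eta * (R**2 - B)
        F_prev, dF_prev = F, dF
        positions.append(F)
    return theta, np.array(positions)
```

Swapping `dD_dR` for the derivative of the differential DDR would give the downside-risk variant used in the FX experiment below.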

### B. Value functions and Q-Learning

Provide proper references for this section?

## IV. Empirical Results

• Using 3 test cases:
1. Artificial prices series (using Sharpe ratio)
2. Half-hourly US Dollar/British Pound (USDGBP) exchange rate (using the Downside Deviation Ratio)
3. Comparison of RRL and Q-Learning on the monthly S&P 500 stock index.

• Using an RRL trader taking {long, short} positions, with a structure similar to $\eqref{eq_2states_trader}$.
• Experiments demonstrate that:
1. RRL is an effective means of learning trading strategies
2. Trading frequency is reduced, as expected, as transaction costs increase.

### A. Trader simulation

#### A.1 Data

• Generating log price series as random walks with autoregressive trend processes:

$$\begin{split}p(t) & = p(t-1) + \beta(t-1) + k \epsilon(t)\\\beta(t) & = \alpha \beta(t-1) + \nu(t)\end{split}$$

• Where $\alpha$ and k are constants and $\epsilon(t)$ and $\nu(t)$ are normal random deviates with zero mean and unit variance: $\epsilon(t) \sim \mathcal{N}(0,1)$ and $\nu(t) \sim \mathcal{N}(0,1)$
• Artificial prices are then defined as: $z(t) = \exp\left(\frac{p(t)}{R}\right)$, where $R = \max(p(t)) - \min(p(t))$.
• Experiments were done with 10,000 samples, $\alpha = 0.9$, and $k = 3$.
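A sketch of this price generator (the function name and seed handling are mine):

```python
import numpy as np

def make_artificial_prices(T=10000, alpha=0.9, k=3.0, seed=0):
    """Log-price random walk p(t) with an AR(1) trend beta(t), as above."""
    rng = np.random.default_rng(seed)
    p = np.zeros(T)
    beta = 0.0
    for t in range(1, T):
        p[t] = p[t - 1] + beta + k * rng.standard_normal()   # uses beta(t-1)
        beta = alpha * beta + rng.standard_normal()          # beta(t)
    R = p.max() - p.min()
    return np.exp(p / R)                                     # z(t) = exp(p(t)/R)
```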

#### A.2 Simulated trading results

• Input at time t constructed from the previous 8 returns.
• Trader adapted using real-time recurrent learning to optimize the differential Sharpe ratio.
• Transaction cost fixed at 0.5% during learning and trading
• Transient effects of initial learning visible during the first 2000 time steps.
• In these simulations the 10,000 samples are partitioned into:
1. a 1000-sample training set
2. a 9000-sample test set.
• Traders are first optimized on the training data set for 100 epochs
• Then adapted online on the test data set.
• In 100 experiments, positive Sharpe ratios are always obtained.
• And, as expected, trading frequency is reduced as transaction costs increase.

### B. US Dollar/British Pound Foreign exchange trading system

• Using the half-hourly USDGBP exchange rate series.
• Training a {long,short,neutral} trading system.
• Training to maximize the differential Downside Deviation Ratio.
• The system is initially trained on 2000 data points, then produces signals for two weeks (480 points); the training window is then shifted forward (to include the just-tested 480 points) and the system is re-trained.
• Using an EMA Sharpe ratio with a time constant of 0.01.

### C. S&P 500 / T-Bill Asset Allocation

Provide proper references for this section?
• RRL trader using a single tanh unit and regularized using quadratic weight decay during training (regularization parameter: 0.01)
• The sensitivity of input $i$ is defined as: $S_i = \frac{\left|\frac{dF}{dx_i}\right|}{\max_j \left|\frac{dF}{dx_j}\right|}$
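A quick finite-difference sketch of this sensitivity measure (the trained trader `F` is assumed to be a callable on an input vector; all names are illustrative):

```python
import numpy as np

def input_sensitivities(F, x, h=1e-4):
    """S_i = |dF/dx_i| / max_j |dF/dx_j|, via central differences at x."""
    grads = np.zeros(len(x))
    for i in range(len(x)):
        e = np.zeros(len(x))
        e[i] = h
        grads[i] = (F(x + e) - F(x - e)) / (2.0 * h)
    return np.abs(grads) / np.max(np.abs(grads))
```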

## V. Learn the Policy or Learn the Value?

Provide proper references for this section?

## VI. Conclusions

Provide proper references for this section?