Learning to Trade via Direct Reinforcement

Authors: John Moody & Matthew Saffell
Date: 2001

Presenting an adaptive algorithm, Recurrent Reinforcement Learning (RRL), which differs from TD-Learning and Q-Learning.

  • Trader goal: optimize some measure of trading performance.
  • Investment performance depends upon sequences of interdependent decisions. ⇒ Path dependent.
  • RRL is an adaptive policy search that can learn an investment strategy online.
  • In financial problems we can use a Direct Reinforcement approach, because immediate feedback is available to optimize a strategy.
  • Frequently used class of performance criteria: measures of risk-adjusted investment returns.
  • RRL can balance the accumulation of returns with the avoidance of risk.
  • We formulate here the differential forms of the Sharpe ratio and Downside Deviation Ratio for efficient online learning with RRL.
  • We consider agents that trade fixed position sizes in a single security.
  • Traders are assumed to take only long, neutral, or short positions of constant magnitude: \(F_t \in \{1, 0, -1\}\).
  • Price series denoted \(z_t\).
  • Position \(F_t\) is established (or maintained) at the end of each time interval t ⇒ a trade is possible at the end of each time period.
  • Return \(R_t\) is realized at the end of the interval \((t-1, t]\) and includes:
    1. The profit/loss resulting from held position \(F_{t-1}\)
    2. The transaction cost incurred at t due to difference between \(F_{t-1}\) and \(F_t\).
  • Trader must have internal state information and must therefore be recurrent.
  • We use the following decision function:

\[\begin{equation}\begin{split}F_t & = F(\theta_t; F_{t-1}, I_t) \\ I_t & = (z_t,z_{t-1},z_{t-2},...; y_t, y_{t-1}, y_{t-2}, ...)\end{split}\end{equation}\label{eq1}\]

  • With:
    • \(\theta_t\) : learned system parameters at time t.
    • \(I_t\) : information set at time t.
    • \(z_t\) : price series.
    • \(y_t\) : external variables.
  • Simple {long, short} trader example with m+1 autoregressive inputs:

\[\begin{equation}F_t = \text{sign}(u \cdot F_{t-1} + v_0 \cdot r_t + v_1 \cdot r_{t-1} + \cdots + v_m \cdot r_{t-m} + w)\label{eq_2states_trader}\end{equation}\]

  • Where \(r_t = z_t - z_{t-1}\) are the price returns of \(z_t\) and the parameters are \(\theta = \{u, v_i, w\}\).
  • This is a discrete-action, deterministic trader.
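The decision rule above can be sketched in a few lines of Python. This is only an illustration: the weight values, the tie-breaking rule at zero and the helper names (`sign`, `two_state_trader`) are assumptions, not taken from the paper.

```python
from typing import Sequence


def sign(x: float) -> int:
    """Return +1 for x >= 0 and -1 otherwise (tie broken toward long: an assumption)."""
    return 1 if x >= 0 else -1


def two_state_trader(prev_position: int, returns: Sequence[float],
                     u: float, v: Sequence[float], w: float) -> int:
    """F_t = sign(u * F_{t-1} + sum_i v_i * r_{t-i} + w).

    `returns` holds the most recent returns, newest first (r_t, r_{t-1}, ...),
    and `v` holds one weight per return in the same order.
    """
    if len(returns) != len(v):
        raise ValueError("need exactly one weight per input return")
    activation = u * prev_position + sum(vi * ri for vi, ri in zip(v, returns)) + w
    return sign(activation)


# Roll the trader forward over a short, made-up price series.
prices = [100.0, 101.0, 100.5, 102.0, 101.0, 103.0]
rets = [b - a for a, b in zip(prices, prices[1:])]   # r_t = z_t - z_{t-1}
u, v, w = 0.1, [1.0, 0.2], 0.0                        # illustrative weights, not learned
position = 1                                          # assumed initial position
positions = []
for t in range(1, len(rets)):
    position = two_state_trader(position, [rets[t], rets[t - 1]], u, v, w)
    positions.append(position)
```

The term `u * prev_position` is what makes the trader recurrent: the previous position feeds back into the current decision.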

Continuous function generalization

  • We can use a continuously valued F() by replacing sign with tanh.
  • \(F_t \in \{1,0,-1\}\) is not differentiable. But we may still apply gradient optimization by considering differentiable pre-threshold outputs or replacing sign with tanh during learning and discretizing when trading.
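A minimal sketch of this idea, assuming a tanh output during learning and a simple threshold when trading. The `dead_zone` parameter used to produce the neutral state is a hypothetical choice for illustration, not something specified in the notes.

```python
import math


def soft_position(activation: float) -> float:
    """Differentiable position in (-1, 1), used while learning."""
    return math.tanh(activation)


def soft_position_grad(activation: float) -> float:
    """Derivative of tanh: d tanh(a) / da = 1 - tanh(a)^2."""
    return 1.0 - math.tanh(activation) ** 2


def discretize(position: float, dead_zone: float = 0.0) -> int:
    """Map a soft position to {1, 0, -1} when trading.

    `dead_zone` is a hypothetical threshold: soft positions with
    |position| <= dead_zone are treated as neutral.
    """
    if position > dead_zone:
        return 1
    if position < -dead_zone:
        return -1
    return 0
```

For example, `discretize(soft_position(0.1), dead_zone=0.2)` gives the neutral position, while the gradient from `soft_position_grad` stays available for learning.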

Stochastic framework generalization

  • The model can be extended by introducing a noise variable in F(): \[\begin{equation}F_t = F(\theta_t; F_{t-1}, I_t; \epsilon_t) \text{ with } \epsilon_t \sim p_\epsilon(\epsilon)\end{equation}\label{eq_Ft}\]
  • Noise level controls “exploration vs exploitation” behavior.
  • Trading systems are optimized by maximizing a performance function U().

Additive profits

  • If each trade is of fixed size.
  • \(r_t = z_t - z_{t-1}\): price returns of risky asset.
  • \(r_t^f = z_t^f - z_{t-1}^f\) : price returns of risk-free asset (like T-Bills)
  • Transaction cost rate: \(\delta\)
  • Trading position size: \(\mu > 0\)
  • Additive profits accumulated over T periods:

\[P_T = \sum\limits_{t=1}^T R_t\]

  • Where: \[R_t = \mu \cdot \{r_t^f + F_{t-1} \cdot (r_t - r_t^f) - \delta \cdot |F_t - F_{t-1}|\}\]
  • Usually we consider \(P_0 = 0\) and \(F_T = F_0 = 0\)
  • When ignoring the risk-free rate of interest (i.e. \(r_t^f = 0\)), we have:

\[\begin{equation}R_t = \mu \cdot \{F_{t-1} \cdot r_t - \delta \cdot |F_t - F_{t-1}|\}\end{equation}\label{eq_Rt_add}\]

  • The wealth of the trader is defined as: \(W_T = W_0 + P_T\).
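As a small worked example of the additive case without the risk-free rate, the sketch below computes \(R_t\) and \(P_T\) for a made-up price path. The prices, positions and cost value are illustrative assumptions, and `additive_returns` is a hypothetical helper name.

```python
from typing import List, Sequence


def additive_returns(prices: Sequence[float], positions: Sequence[int],
                     mu: float = 1.0, delta: float = 0.0) -> List[float]:
    """R_t = mu * (F_{t-1} * r_t - delta * |F_t - F_{t-1}|) for t = 1..T, with r_t^f = 0.

    prices[t] is z_t and positions[t] is F_t, both indexed t = 0..T.
    """
    if len(prices) != len(positions):
        raise ValueError("prices and positions must have the same length")
    out = []
    for t in range(1, len(prices)):
        r_t = prices[t] - prices[t - 1]                       # price return
        cost = delta * abs(positions[t] - positions[t - 1])   # transaction cost
        out.append(mu * (positions[t - 1] * r_t - cost))
    return out


prices = [100.0, 102.0, 101.0, 104.0]
positions = [0, -1, 1, 0]            # F_0 = F_T = 0
returns = additive_returns(prices, positions, mu=1.0, delta=0.5)
profit = sum(returns)                # P_T
wealth = 1000.0 + profit             # W_T = W_0 + P_T, with an assumed W_0 = 1000
```

Here the returns are `[-0.5, 0.0, 2.5]`: the first period only pays the cost of opening the short, the second earns 1.0 from the short but pays 1.0 to flip to long, and the third earns 3.0 minus the cost of closing.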

Multiplicative profits

  • Applicable if fixed fraction of accumulated wealth \(\nu > 0\) is invested in each trade.
  • We use: \(r_t = \frac{z_t}{z_{t-1}} - 1\) and \(r_t^f = \frac{z_t^f}{z_{t-1}^f} - 1\)
  • If we assume no short sale and \(\nu = 1\) then the wealth at T is:

\[W_T = W_0 \cdot \prod\limits_{t=1}^T(1+R_t)\]

  • Where: \((1+R_t) = \{1 + (1 - F_{t-1}) \cdot r_t^f + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)\)
  • When ignoring the risk-free rate of interest (i.e. \(r_t^f = 0\)), we have:

\[\begin{equation}(1+R_t) = \{1 + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)\end{equation}\label{eq_Rt_mult}\]
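The multiplicative case with \(r_t^f = 0\) and \(\nu = 1\) can be sketched the same way. The numbers below are made up for illustration and `multiplicative_wealth` is a hypothetical helper name.

```python
from typing import Sequence


def multiplicative_wealth(prices: Sequence[float], positions: Sequence[int],
                          w0: float = 1.0, delta: float = 0.0) -> float:
    """W_T = W_0 * prod_t (1 + F_{t-1} * r_t) * (1 - delta * |F_t - F_{t-1}|).

    Uses relative returns r_t = z_t / z_{t-1} - 1 and ignores the risk-free rate.
    prices[t] is z_t and positions[t] is F_t, both indexed t = 0..T.
    """
    if len(prices) != len(positions):
        raise ValueError("prices and positions must have the same length")
    wealth = w0
    for t in range(1, len(prices)):
        r_t = prices[t] / prices[t - 1] - 1.0
        growth = 1.0 + positions[t - 1] * r_t
        cost = 1.0 - delta * abs(positions[t] - positions[t - 1])
        wealth *= growth * cost
    return wealth


# Go long after the first period, then close: two trades at a 1% cost rate.
final_wealth = multiplicative_wealth([100.0, 100.0, 110.0], [0, 1, 0], w0=1.0, delta=0.01)
```

In this example the 10% price gain is reduced by two 1% transaction costs, so the final wealth is 0.99 × 1.1 × 0.99 ≈ 1.078.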

  • We consider performance criteria as a function of the wealth: \(U(W_T)\)
  • Or more generally: \(U(W_T, \dots, W_1, W_0)\)
  • In both cases U can be expressed as a function of the trading returns: \(U(R_T,\dots,R_1,W_0)\), which we denote as \(U_T\)
  • For trader optimization we are interested in the marginal increase in performance due to return \(R_t\):

\[\begin{equation}D_t \varpropto \Delta U_t = U_t - U_{t-1}\label{eq_Dt}\end{equation}\]

  • We call \(D_t\) the differential performance criterion.
  • Note that \(U_{t-1}\) doesn't depend on \(R_t\)
  • Sharpe ratio is defined as: \(S_T = \frac{\text{Average}(R_t)}{\text{Std deviation}(R_t)} = \frac{\bar{R}}{(\frac 1T \sum_{t=1}^T R_t^2 - \bar{R}^2)^\frac 12}\)
  • With \(\bar{R} = \frac 1T \sum\limits_{t=1}^T R_t\)
  • The Differential Sharpe Ratio is obtained by considering exponential moving estimates of the first and second moments of the returns (\(A_t\) and \(B_t\)) and expanding to first order in the adaptation rate \(\eta\):

\[A_t = A_{t-1} + \eta \Delta A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1})\] \[B_t = B_{t-1} + \eta \Delta B_t = B_{t-1} + \eta \cdot (R_t^2 - B_{t-1})\]

  • Then we have \(S_t = \frac{A_t}{\sqrt{B_t - A_t^2}}\), and:

\[\begin{align*}{S_t}_{|\eta>0} & \approx {S_t}_{|\eta=0} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2) \\ & \approx S_{t-1} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2)\end{align*}\]

  • A zero adaptation rate corresponds to an infinite time average.
  • Thus expanding about \(\eta=0\) will correspond to “just turning on” the adaptation.
  • We define the Differential Sharpe ratio as:

\[D_t = {\frac{dS_t}{d\eta}}_{|\eta=0} = \frac{B_{t-1} \cdot \Delta A_t - \frac 12 A_{t-1} \Delta B_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}\]

  • \(D_t\) describes the influence of trading return \(R_t\) on the Sharpe Ratio \(S_t\).
  • Since \(S_{t-1}\) doesn't depend on \(R_t\), when we take \(S_t\) as utility function we get:

\[\frac{dU_t}{dR_t} = \frac{dS_t}{dR_t} \approx \eta \frac{dD_t}{dR_t} \\ \text{with: } \frac{dD_t}{dR_t} = \frac{B_{t-1} - A_{t-1} \cdot R_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}\]
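The differential Sharpe ratio and the moving-average updates can be checked numerically. The sketch below is an illustration with made-up values for \(A_{t-1}\), \(B_{t-1}\), \(R_t\) and \(\eta\); the function names are assumptions.

```python
import math
from typing import Tuple


def differential_sharpe(r_t: float, a_prev: float, b_prev: float,
                        eta: float) -> Tuple[float, float, float]:
    """Return (D_t, A_t, B_t) after observing the trading return R_t."""
    delta_a = r_t - a_prev               # Delta A_t
    delta_b = r_t * r_t - b_prev         # Delta B_t
    variance = b_prev - a_prev * a_prev
    if variance <= 0.0:
        raise ValueError("B_{t-1} - A_{t-1}^2 must be positive")
    d_t = (b_prev * delta_a - 0.5 * a_prev * delta_b) / variance ** 1.5
    return d_t, a_prev + eta * delta_a, b_prev + eta * delta_b


def d_differential_sharpe_d_return(r_t: float, a_prev: float, b_prev: float) -> float:
    """dD_t/dR_t = (B_{t-1} - A_{t-1} * R_t) / (B_{t-1} - A_{t-1}^2)^(3/2)."""
    return (b_prev - a_prev * r_t) / (b_prev - a_prev * a_prev) ** 1.5


def ema_sharpe(a: float, b: float) -> float:
    """S = A / sqrt(B - A^2)."""
    return a / math.sqrt(b - a * a)


# Illustrative values: mean 0.01, standard deviation 0.02, so S_{t-1} = 0.5.
a_prev, b_prev, eta, r_t = 0.01, 0.0005, 0.001, 0.03
d_t, a_t, b_t = differential_sharpe(r_t, a_prev, b_prev, eta)
s_first_order = ema_sharpe(a_prev, b_prev) + eta * d_t   # S_{t-1} + eta * D_t
s_exact = ema_sharpe(a_t, b_t)                           # S_t from the updated averages
```

With these values \(D_t = 1\), and the first-order estimate `s_first_order` agrees with `s_exact` up to a term of order \(\eta^2\).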

  • Problem: because risk is measured by the variance of the returns, the Sharpe ratio makes no distinction between upside and downside risk. Setting \(\frac{dD_t}{dR_t} = 0\) and assuming \(A_{t-1} > 0\), the largest improvement in \(D_t\) occurs at \(R_t^* = \frac{B_{t-1}}{A_{t-1}}\).
  • ⇒ The Sharpe ratio criterion penalizes gains larger than \(R_t^*\).
  • Variance is increasingly regarded as an inadequate risk measure because of this issue.
  • Other options are:
    1. Downside Deviation (DD)
    2. Second Lower Partial Moment (SLPM)
    3. Nth Lower Partial Moment
    4. Sterling Ratio, defined as: \(\text{Sterling Ratio} = \frac{\text{Annualized Average Return}}{\text{Max. Draw-Down}}\), where the maximum drawdown is measured over a standard period (usually 1 to 3 years).
  • Maximum drawdown is cumbersome to minimize, thus we focus on the Downside Deviation (which tracks the Sterling Ratio effectively).
  • Downside Deviation defined as: \(DD_T = \left( \frac 1T \sum\limits_{t=1}^T \min(R_t,0)^2\right)^\frac 12\)
  • We then define our utility function, the Downside Deviation Ratio (DDR):

\[DDR_T = \frac{\text{Average}(R_t)}{DD_T}\]

  • Next we define the exponential moving average (EMA) of returns and \(DD_t^2\):

\[A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1}) \\ DD_t^2 = DD_{t-1}^2 + \eta \cdot (min(R_t,0)^2 - DD_{t-1}^2)\]

  • And we use the exponential moving version of the DDR: \(DDR_t = \frac{A_t}{DD_t}\)
  • We consider the first order expansion in adaptation rate \(\eta\):

\[DDR_t \approx DDR_{t-1} + \eta {\frac{dDDR_t}{d\eta}}_{|\eta=0} + O(\eta^2)\]

  • And we can then define the Differential Downside Deviation Ratio:

\[\begin{align*}D_t & \equiv {\frac{dDDR_t}{d\eta}}_{|\eta=0} \\ & = \frac{R_t - \frac 12 A_{t-1}}{DD_{t-1}} \text{, when } R_t > 0 \\ & = \frac{DD_{t-1}^2 \cdot (R_t - \frac 12 A_{t-1}) - \frac 12 A_{t-1} \cdot R_t^2}{DD_{t-1}^3} \text{, when } R_t \le 0\end{align*}\]
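The two branches of the Differential Downside Deviation Ratio can be sketched and checked against the first-order expansion. The values below are made up for illustration and the function names are assumptions.

```python
import math
from typing import Tuple


def differential_ddr(r_t: float, a_prev: float, dd_prev: float) -> float:
    """Differential Downside Deviation Ratio D_t for a new return R_t."""
    if dd_prev <= 0.0:
        raise ValueError("DD_{t-1} must be positive")
    if r_t > 0.0:
        return (r_t - 0.5 * a_prev) / dd_prev
    numerator = dd_prev ** 2 * (r_t - 0.5 * a_prev) - 0.5 * a_prev * r_t ** 2
    return numerator / dd_prev ** 3


def update_ddr_stats(r_t: float, a_prev: float, dd_prev: float,
                     eta: float) -> Tuple[float, float]:
    """Exponential moving updates of A_t and DD_t (DD_t^2 is averaged, then rooted)."""
    a_t = a_prev + eta * (r_t - a_prev)
    dd_sq = dd_prev ** 2 + eta * (min(r_t, 0.0) ** 2 - dd_prev ** 2)
    return a_t, math.sqrt(dd_sq)


# Illustrative values: A_{t-1} = 0.01 and DD_{t-1} = 0.02, so DDR_{t-1} = 0.5.
a_prev, dd_prev, eta = 0.01, 0.02, 0.001
d_gain = differential_ddr(0.03, a_prev, dd_prev)    # R_t > 0 branch
d_loss = differential_ddr(-0.04, a_prev, dd_prev)   # R_t <= 0 branch
a_t, dd_t = update_ddr_stats(-0.04, a_prev, dd_prev, eta)
ddr_exact = a_t / dd_t
ddr_first_order = a_prev / dd_prev + eta * d_loss
```

With these values a gain of 0.03 gives \(D_t = 1.25\) and a loss of 0.04 gives \(D_t = -3.25\); unlike the differential Sharpe ratio, \(D_t\) here keeps growing linearly with positive returns.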

  • Reinforcement learning adjusts the parameters of a system to maximize the expected payoff that is generated due to the actions of the system.
  • This is accomplished through trial-and-error exploration of the environment and the space of strategies.
  • Supervised learning is effective for the structural credit assignment problem, but not for temporal credit assignment.
  • Structural credit assignment: assign credits to the parameters of a problem.
  • Temporal credit assignment: assign credits to the individual actions taken over time.
  • ⇒ Reinforcement learning tries to solve both problems at the same time.
  • Given a trading system model \(F_t(\theta)\), the goal is to adjust the parameters \(\theta\) in order to maximize \(U_T\)
  • For traders of form \(\eqref{eq1}\) and trading returns of form \(\eqref{eq_Rt_add}\) or \(\eqref{eq_Rt_mult}\) the gradient of \(U_T\) with respect to the parameters \(\theta\) of the system after a sequence of T periods is:

\[\begin{equation}\frac{dU_T(\theta)}{d\theta} = \sum\limits_{t=1}^T \frac{dU_T}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_batch_rrl}\end{equation}\]
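As an illustration, the two return derivatives that appear in this gradient can be written out for the additive return without the risk-free rate, \(R_t = \mu \cdot \{F_{t-1} \cdot r_t - \delta \cdot |F_t - F_{t-1}|\}\). This is a sketch that assumes \(F_t \neq F_{t-1}\), since the absolute value is not differentiable at zero:

\[\frac{dR_t}{dF_t} = -\mu \cdot \delta \cdot \text{sign}(F_t - F_{t-1})\] \[\frac{dR_t}{dF_{t-1}} = \mu \cdot r_t + \mu \cdot \delta \cdot \text{sign}(F_t - F_{t-1})\]

The first term only carries the transaction cost of changing the position, while the second also carries the profit or loss earned by the position held over the interval.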

  • The system can be optimized in batch mode: repeatedly computing the value of \(U_T\) on forward passes and adjusting the parameters by using gradient ascent (with learning rate \(\rho\)):

\[\begin{equation}\Delta \theta = \rho \frac{dU_T(\theta)}{d\theta}\label{eq2_batch_rrl}\end{equation}\]

  • Note that the quantities \(dF_t/d\theta\) are total derivatives, so we need an approach similar to Back-Propagation Through Time (BPTT): \[\begin{equation}\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}} \frac {dF_{t-1}}{d\theta}\label{eq3_batch_rrl}\end{equation}\]
  • We assume here differentiability of \(F_t\). For long/short traders with thresholds the reinforcement signal can be backpropagated through the pre-thresholded outputs (similar to Adaline learning rule).
  • Previous equations \(\eqref{eq1_batch_rrl}\), \(\eqref{eq2_batch_rrl}\) and \(\eqref{eq3_batch_rrl}\) constitute the batch RRL algorithm.
  • There are 2 ways to extend this batch algorithm into a stochastic framework:
    1. Exploration of strategy space can be induced by incorporating a noise variable \(\epsilon_t\) (as in \(\eqref{eq_Ft}\)). In that case:
      1. trade-off between exploration of the strategy space and exploitation of a learned policy can be controlled by the amplitude of the noise variance \(\sigma_\epsilon\).
      2. The noise magnitude can be annealed over time to arrive at a good strategy.
    2. A simple online stochastic optimization can be obtained by considering only the term in \(\eqref{eq1_batch_rrl}\) that depends on the most recently realized return \(R_t\) (during the forward pass):

\[\begin{equation}\frac{dU_t(\theta)}{d\theta} \approx \frac{dU_t}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_online_rrl}\end{equation}\]

  • The parameters are then updated online:

\[\begin{equation}\Delta \theta_t = \rho \frac{dU_t(\theta_t)}{d\theta_t}\label{eq2_online_rrl}\end{equation}\]

  • This algorithm performs stochastic optimization since the system parameters are varied during each forward pass through the training data.
  • The stochastic online analog to \(\eqref{eq3_batch_rrl}\) is: \[\begin{equation}\frac{dF_t}{d\theta_t} \approx \frac{\partial F_t}{\partial \theta_t} + \frac{\partial F_t}{\partial F_{t-1}} \frac {dF_{t-1}}{d\theta_{t-1}}\label{eq3_online_rrl}\end{equation}\]
  • Equations \(\eqref{eq1_online_rrl}\), \(\eqref{eq2_online_rrl}\) and \(\eqref{eq3_online_rrl}\) constitute the stochastic (or adaptive) RRL algorithm.
  • ⇒ This is a reinforcement algorithm closely related to recurrent supervised algorithms such as Real Time Recurrent Learning (RTRL) and Dynamic Backpropagation
  • When considering differential performance criteria \(D_t\) as described in \(\eqref{eq_Dt}\), the stochastic update equations become:

\[\begin{equation}\begin{split}\Delta\theta_t & = \rho \frac{dD_t(\theta_t)}{d\theta_t}\\ & \approx \rho \frac{dD_t}{dR_t}\{\frac{dR_t}{dF_t}\frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta_{t-1}}\}\end{split}\end{equation}\]

  • Note that for financial data adding a noise variable \(\epsilon_t\) doesn't provide any significant advantage since the input data already contain significant noise.
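The online update equations can be put together in a short sketch. It rests on several assumptions that are not fixed by the notes: a single tanh unit as trader, additive returns with no risk-free rate, the differential Sharpe ratio as criterion, the initial values of \(A\) and \(B\), a small floor on the variance for numerical safety, and all hyperparameter values. The function name `online_rrl` is hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def sgn(x: float) -> int:
    return (x > 0) - (x < 0)


def online_rrl(returns: Sequence[float], m: int = 4, rho: float = 0.01,
               eta: float = 0.01, mu: float = 1.0, delta: float = 0.001,
               seed: int = 0) -> Tuple[List[float], List[float]]:
    """One online pass over `returns`; gives back (theta, realized trading returns)."""
    rng = random.Random(seed)
    n = m + 3                                    # parameters: u, v_0..v_m, w
    theta = [rng.uniform(-0.1, 0.1) for _ in range(n)]
    f_prev, df_prev = 0.0, [0.0] * n             # F_{t-1} and dF_{t-1}/dtheta
    a, b = 0.0, 1e-4                             # assumed initial moving averages
    realized = []
    for t in range(m, len(returns)):
        # Inputs: previous position, the m+1 latest returns, and a bias term.
        x = [f_prev] + [returns[t - i] for i in range(m + 1)] + [1.0]
        f = math.tanh(sum(th * xi for th, xi in zip(theta, x)))
        # Recurrent total derivative dF_t/dtheta (theta[0] is u).
        slope = 1.0 - f * f
        df = [slope * (xi + theta[0] * dp) for xi, dp in zip(x, df_prev)]
        # Additive trading return and its derivatives w.r.t. F_t and F_{t-1}.
        s = sgn(f - f_prev)
        r_trade = mu * (f_prev * returns[t] - delta * abs(f - f_prev))
        dr_df = -mu * delta * s
        dr_dfprev = mu * (returns[t] + delta * s)
        # dD_t/dR_t for the differential Sharpe ratio (variance floored: an assumption).
        var = max(b - a * a, 1e-12)
        dd_dr = (b - a * r_trade) / var ** 1.5
        # Stochastic gradient ascent step on theta.
        theta = [th + rho * dd_dr * (dr_df * g + dr_dfprev * gp)
                 for th, g, gp in zip(theta, df, df_prev)]
        # Update the moving averages and the recurrent state.
        a += eta * (r_trade - a)
        b += eta * (r_trade * r_trade - b)
        f_prev, df_prev = f, df
        realized.append(r_trade)
    return theta, realized


# Usage on made-up noisy returns with a small positive drift.
rng = random.Random(1)
market = [0.001 + 0.01 * rng.gauss(0.0, 1.0) for _ in range(500)]
theta, realized = online_rrl(market)
```

Because the Sharpe ratio is scale invariant, the size of the gradient depends on the scale of the returns, so the learning rate `rho` would need tuning for real data. The position here stays continuous in (-1, 1); discretizing it for trading is left out of the sketch.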
  • Using 3 test cases:
    1. Artificial prices series (using Sharpe ratio)
    2. Half-hourly US Dollar/British Pound (USDGBP) exchange rate (using Downside Deviation Ratio)
    3. Comparison of RRL and Q-Learning on the monthly S&P 500 stock index.
  • Using RRL trader taking {long, short} positions, with a state similar to \(\eqref{eq_2states_trader}\).
  • Experiments demonstrate that:
    1. RRL is an effective means of learning trading strategies
    2. Trading frequency is reduced as expected as transaction costs increase.

A.1 Data

  • Generating log price series as random walks with autoregressive trend processes:

\[\begin{equation}\begin{split}p(t) & = p(t-1) + \beta(t-1) + k \epsilon(t)\\\beta(t) & = \alpha \beta(t-1) + \nu(t)\end{split}\end{equation}\]

  • Where \(\alpha\) and k are constants and \(\epsilon(t)\) and \(\nu(t)\) are normal random deviates with zero mean and unit variance: \(\epsilon(t) \sim \mathcal{N}(0,1)\) and \(\nu(t) \sim \mathcal{N}(0,1)\)
  • Artificial prices are then defined as: \(z(t) = \exp\left(\frac{p(t)}{R}\right)\), where \(R = \max(p(t)) - \min(p(t))\).
  • Experiments were done with 10000 samples and \(\alpha = 0.9\) and \(k = 3\)
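The data-generating process above can be sketched as follows. The starting values \(p(0) = 0\) and \(\beta(0) = 0\) and the random seed are assumptions, and `artificial_prices` is a hypothetical helper name.

```python
import math
import random
from typing import List


def artificial_prices(n: int = 10000, alpha: float = 0.9, k: float = 3.0,
                      seed: int = 0) -> List[float]:
    """Random-walk log prices with an autoregressive trend, rescaled and exponentiated."""
    rng = random.Random(seed)
    p, beta = 0.0, 0.0                               # assumed initial values
    log_prices = []
    for _ in range(n):
        p = p + beta + k * rng.gauss(0.0, 1.0)       # p(t) = p(t-1) + beta(t-1) + k * eps(t)
        beta = alpha * beta + rng.gauss(0.0, 1.0)    # beta(t) = alpha * beta(t-1) + nu(t)
        log_prices.append(p)
    span = max(log_prices) - min(log_prices)         # R = max(p) - min(p)
    return [math.exp(lp / span) for lp in log_prices]


z = artificial_prices()
```

Dividing by the range \(R\) before exponentiating means the highest price in the series is exactly \(e\) times the lowest, whatever the sample path.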

A.2 Simulated Trading Results

  • Input at time t constructed from the previous 8 returns.
  • RRL trader initialized randomly
  • Trader adapted using real-time recurrent learning to optimize the differential Sharpe ratio.
  • Transaction cost fixed at 0.5% during learning and trading
  • Transient effects of initial learning visible during the first 2000 time steps.
  • In these simulations the 10000 samples are partitioned in:
    1. 1000 samples training set
    2. 9000 samples test set.
  • Traders are first optimized on the training data set for 100 epochs
  • Then adapted online on the test data set.
  • In 100 experiments, positive Sharpe ratios are always obtained.
  • And, as expected, trading frequency is reduced as transaction costs increase.
  • Using half-hourly USDGBP security.
  • Training a {long,short,neutral} trading system.
  • Trading system incurring transaction costs from the bid-ask spread
  • Training to maximize the differential Downside Deviation Ratio.
  • System initially training on 2000 data points, then producing signals for 2 weeks (480 points), then the training window is shifted (to include the just tested 480 points) and the system is re-trained.
  • Using an EMA Sharpe Ratio with time constant of 0.01.
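The rolling retraining scheme described above (train on 2000 points, trade the next 480, shift, retrain) can be sketched as an index generator. Reading "shifted" as a fixed-length sliding window is an assumption, and `walk_forward_windows` is a hypothetical helper name.

```python
from typing import Iterator, Tuple


def walk_forward_windows(n_points: int, train_size: int = 2000,
                         test_size: int = 480) -> Iterator[Tuple[range, range]]:
    """Yield (train_indices, test_indices) pairs.

    After each round the training window slides forward by `test_size`,
    so it ends with the points that were just traded out of sample.
    """
    start = 0
    while start + train_size + test_size <= n_points:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size


windows = list(walk_forward_windows(3000))
```

With 3000 points this gives two rounds: train on indices 0 to 1999 and trade 2000 to 2479, then train on 480 to 2479 and trade 2480 to 2959.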
  • RRL trader using a single tanh unit and regularized using quadratic weight decay during training (regularization parameter: 0.01)
  • The sensitivity of input i is defined as: \(S_i = \frac{|\frac{dF}{dx_i}|}{\max_j |\frac{dF}{dx_j}|}\)
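For a single tanh unit the sensitivities are easy to compute by hand, since \(\frac{dF}{dx_i} = (1 - F^2) \cdot w_i\). The sketch below assumes the form \(F = \tanh(w \cdot x + b)\) and uses made-up weights; `input_sensitivities` is a hypothetical helper name.

```python
import math
from typing import List, Sequence


def input_sensitivities(weights: Sequence[float], bias: float,
                        x: Sequence[float]) -> List[float]:
    """S_i = |dF/dx_i| / max_j |dF/dx_j| for F = tanh(w . x + bias)."""
    f = math.tanh(sum(w * xi for w, xi in zip(weights, x)) + bias)
    grads = [abs((1.0 - f * f) * w) for w in weights]   # |dF/dx_i|
    top = max(grads)
    if top == 0.0:
        raise ValueError("all input gradients vanish; sensitivities are undefined")
    return [g / top for g in grads]


sens = input_sensitivities([0.5, -2.0, 1.0], bias=0.0, x=[0.1, 0.2, 0.3])
```

Under this single-unit assumption the common factor \(1 - F^2\) cancels, so the sensitivities reduce to the normalized absolute weights (here 0.25, 1.0 and 0.5) and do not depend on the input point.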