====== Learning to Trade via Direct Reinforcement ======
Authors: John Moody & Matthew Saffell\\
Date: 2001

Presenting an adaptive algorithm, Recurrent Reinforcement Learning (RRL), which differs from TD-Learning and Q-Learning.

===== I. Introduction =====
* Trader goal: optimize some measure of trading performance.
* Investment performance depends upon sequences of interdependent decisions => path dependent.
* RRL is an adaptive policy search that can learn an investment strategy online.
* In financial problems we can use a Direct Reinforcement approach to provide immediate feedback to optimize a strategy.
* Frequently used class of performance criteria: measures of risk-adjusted investment returns.
* RRL can balance the accumulation of returns with the avoidance of risk.
* We formulate here the differential forms of the Sharpe ratio and Downside Deviation Ratio for efficient online learning with RRL.

===== II. Trading systems and performance criteria =====

==== A. Structure of trading systems ====
* We consider agents that trade **fixed position sizes** in a single security.
* Traders are assumed to take only long, neutral, or short positions of constant magnitude: \(F_t \in \{1, 0, -1\}\).
* The price series is denoted \(z_t\).
* Position \(F_t\) is established (or maintained) **at the end** of each time interval t => a trade is possible at the end of each time period.
* Return \(R_t\) is realized at the end of the interval \((t-1, t]\) and includes:
  - the profit/loss resulting from the position \(F_{t-1}\) held over the interval;
  - the transaction cost incurred at t due to the difference between \(F_{t-1}\) and \(F_t\).
* The trader must have internal state information and must therefore be **recurrent**.
* We use the following decision function: \[\begin{equation}\begin{split}F_t & = F(\theta_t; F_{t-1}, I_t) \\ I_t & = (z_t,z_{t-1},z_{t-2},...; y_t, y_{t-1}, y_{t-2}, ...)\end{split}\label{eq1}\end{equation}\]
* With:
  * \(\theta_t\): learned system parameters at time t.
  * \(I_t\): information set at time t.
  * \(z_t\): price series.
  * \(y_t\): external variables.
* Simple {long, short} trader example with m+1 autoregressive inputs: \[\begin{equation}F_t = sign(u \cdot F_{t-1} + v_1 \cdot r_t + v_2 \cdot r_{t-1} + \cdots + v_m \cdot r_{t-m} + w)\label{eq_2states_trader}\end{equation}\]
* Where \(r_t = z_t - z_{t-1}\) are the price returns of \(z_t\) and the parameters are \(\theta = \{u, v_i, w\}\).
* This is a discrete-action, **deterministic** trader (a code sketch follows after the generalizations below).

=== Continuous function generalization ===
* We can use a continuously valued F() by replacing sign with tanh.
* \(F_t \in \{1,0,-1\}\) is not differentiable, but we may still apply gradient optimization by considering differentiable pre-threshold outputs, or by replacing sign with tanh during learning and discretizing when trading.

=== Stochastic framework generalization ===
* The model can be extended by introducing a noise variable in F(): \[\begin{equation}F_t = F(\theta_t; F_{t-1}, I_t; \epsilon_t) \text{ with } \epsilon_t \sim p_\epsilon(\epsilon)\label{eq_Ft}\end{equation}\]
* The noise level controls the "exploration vs exploitation" behavior.
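A minimal Python sketch of the {long, short} recurrent trader \(\eqref{eq_2states_trader}\) and its tanh relaxation. The class name ''TwoStateTrader'', the random-initialization scale, and the default seed are illustrative assumptions, not taken from the paper.

<code python>
import numpy as np

class TwoStateTrader:
    """Sketch of the {long, short} recurrent trader:
    F_t = sign(u*F_{t-1} + v_0*r_t + ... + v_m*r_{t-m} + w)."""

    def __init__(self, m, seed=0):
        rng = np.random.default_rng(seed)
        self.u = rng.normal(scale=0.1)               # recurrent weight on F_{t-1}
        self.v = rng.normal(scale=0.1, size=m + 1)   # weights on r_t, ..., r_{t-m}
        self.w = rng.normal(scale=0.1)               # bias
        self.F_prev = 0.0                            # internal state F_{t-1}

    def decide(self, recent_returns, discrete=True):
        """recent_returns: array-like [r_t, r_{t-1}, ..., r_{t-m}] of length m+1."""
        pre = self.u * self.F_prev + np.dot(self.v, recent_returns) + self.w
        # sign for the discrete-action trader; tanh for the continuous relaxation
        F_t = float(np.sign(pre)) if discrete else float(np.tanh(pre))
        self.F_prev = F_t
        return F_t

# Example: trader = TwoStateTrader(m=8); position = trader.decide(np.zeros(9))
</code>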
==== B. Profit and wealth for trading systems ====
* Trading systems are optimized by maximizing a performance function U().

=== Additive profits ===
* Applicable if each trade is of fixed size.
* \(r_t = z_t - z_{t-1}\): price returns of the risky asset.
* \(r_t^f = z_t^f - z_{t-1}^f\): price returns of the risk-free asset (like T-Bills).
* Transaction cost rate: \(\delta\).
* Trading position size: \(\mu > 0\).
* Additive profits accumulated over T periods: \[P_T = \sum\limits_{t=1}^T R_t\]
* Where: \[R_t = \mu \cdot \{r_t^f + F_{t-1} \cdot (r_t - r_t^f) - \delta \cdot |F_t - F_{t-1}|\}\]
* Usually we consider \(P_0 = 0\) and \(F_T = F_0 = 0\).
* When ignoring the risk-free rate of interest (e.g. \(r_t^f = 0\)), we have: \[\begin{equation}R_t = \mu \cdot \{F_{t-1} \cdot r_t - \delta \cdot |F_t - F_{t-1}|\}\label{eq_Rt_add}\end{equation}\]
* The wealth of the trader is defined as \(W_T = W_0 + P_T\).

=== Multiplicative profits ===
* Applicable if a fixed fraction of accumulated wealth \(\nu > 0\) is invested in each trade.
* We use: \(r_t = \frac{z_t}{z_{t-1}} - 1\) and \(r_t^f = \frac{z_t^f}{z_{t-1}^f} - 1\).
* If we assume no short sales and \(\nu = 1\), then the wealth at T is: \[W_T = W_0 \cdot \prod\limits_{t=1}^T(1+R_t)\]
* Where: \((1+R_t) = \{1 + (1 - F_{t-1}) \cdot r_t^f + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)\)
* When ignoring the risk-free rate of interest (e.g. \(r_t^f = 0\)), we have: \[\begin{equation}(1+R_t) = \{1 + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)\label{eq_Rt_mult}\end{equation}\]

==== C. Performance criteria ====
* We consider performance criteria that are functions of the wealth: \(U(W_T)\).
* Or more generally: \(U(W_T, \dots, W_1, W_0)\).
* In both cases U can be expressed as a function of the trading returns: \(U(R_T,\dots,R_1,W_0)\), which we denote \(U_T\).
* For trader optimization we are interested in the **marginal increase in performance due to return \(R_t\)**: \[\begin{equation}D_t \varpropto \Delta U_t = U_t - U_{t-1}\label{eq_Dt}\end{equation}\]
* We call \(D_t\) the differential performance criterion.
* Note that \(U_{t-1}\) does not depend on \(R_t\).

==== D. Differential Sharpe Ratio ====
* The Sharpe ratio is defined as: \(S_T = \frac{\text{Average}(R_t)}{\text{Std deviation}(R_t)} = \frac{\bar{R}}{(\frac 1T \sum_{t=1}^T R_t^2 - \bar{R}^2)^\frac 12}\)
* With \(\bar{R} = \frac 1T \sum\limits_{t=1}^T R_t\).
* The **Differential Sharpe Ratio** is obtained by considering exponential moving averages of the returns and squared returns and expanding to first order in the adaptation rate \(\eta\): \[A_t = A_{t-1} + \eta \Delta A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1})\] \[B_t = B_{t-1} + \eta \Delta B_t = B_{t-1} + \eta \cdot (R_t^2 - B_{t-1})\]
* Then we have \(S_t = \frac{A_t}{\sqrt{B_t - A_t^2}}\), and: \[\begin{align*}{S_t}_{|\eta>0} & \approx {S_t}_{|\eta=0} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2) \\ & \approx S_{t-1} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2)\end{align*}\]
* A zero adaptation rate corresponds to an infinite time average.
* Thus expanding about \(\eta=0\) corresponds to "**just turning on**" the adaptation.
* We define the **Differential Sharpe Ratio** as: \[D_t = {\frac{dS_t}{d\eta}}_{|\eta=0} = \frac{B_{t-1} \cdot \Delta A_t - \frac 12 A_{t-1} \Delta B_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}\]
* \(D_t\) describes the influence of the trading return \(R_t\) on the Sharpe ratio \(S_t\) (a code sketch of this online computation follows at the end of this subsection).
* Since \(S_{t-1}\) does not depend on \(R_t\), when we take \(S_t\) as the utility function we get: \[\frac{dU_t}{dR_t} = \frac{dS_t}{dR_t} \approx \eta \frac{dD_t}{dR_t} \\ \text{with: } \frac{dD_t}{dR_t} = \frac{B_{t-1} - A_{t-1} \cdot R_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}\]
* Problem: because the variance uses \(R_t^2\), there is no distinction between upside and downside risk.
* Moreover, since \(D_t\) is quadratic in \(R_t\), the largest improvement occurs at \(R_t^* = \frac{B_{t-1}}{A_{t-1}}\) (assuming \(A_{t-1} > 0\)).
* => The Sharpe ratio criterion penalizes larger gains.
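A minimal sketch of one online step of the differential Sharpe ratio computation described above. The function name, the default \(\eta\), and the ''eps'' guard against a zero denominator are assumptions, not from the paper.

<code python>
def differential_sharpe_step(R_t, A_prev, B_prev, eta=0.01, eps=1e-12):
    """One online step of the differential Sharpe ratio D_t.

    A_prev, B_prev: EMA estimates of the first and second moments of returns
    (A_{t-1}, B_{t-1}); eta: adaptation rate. Returns (D_t, A_t, B_t).
    """
    dA = R_t - A_prev                      # Delta A_t
    dB = R_t ** 2 - B_prev                 # Delta B_t
    var = max(B_prev - A_prev ** 2, eps)   # guard against a zero denominator
    D_t = (B_prev * dA - 0.5 * A_prev * dB) / var ** 1.5
    A_t = A_prev + eta * dA                # EMA updates of the moments
    B_t = B_prev + eta * dB
    return D_t, A_t, B_t
</code>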
==== E. Downside risk ====
* Variance is increasingly considered an inadequate risk measure because of the issue mentioned above.
* Other options are:
  - Downside Deviation (DD)
  - Second Lower Partial Moment (SLPM)
  - Nth Lower Partial Moment
  - The Sterling Ratio, defined as: \(\text{Sterling Ratio} = \frac{\text{Annualized Average Return}}{\text{Max. Draw-Down}}\), where the maximum drawdown is computed relative to a standard period (usually 1 to 3 years).
* The maximum drawdown is cumbersome to minimize, thus we focus on the Downside Deviation (which tracks the Sterling Ratio effectively).
* The Downside Deviation is defined as: \(DD_T = \left( \frac 1T \sum\limits_{t=1}^T min(R_t,0)^2\right)^\frac 12\)
* We then define our utility function, the **Downside Deviation Ratio** (DDR): \[DDR_T = \frac{\text{Average}(R_t)}{DD_T}\]
* Next we define the exponential moving average (EMA) of the returns and of \(DD_t^2\): \[A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1}) \\ DD_t^2 = DD_{t-1}^2 + \eta \cdot (min(R_t,0)^2 - DD_{t-1}^2)\]
* And we use the exponential moving version of the DDR: \(DDR_t = \frac{A_t}{DD_t}\)
* We consider the first-order expansion in the adaptation rate \(\eta\): \[DDR_t \approx DDR_{t-1} + \eta {\frac{dDDR_t}{d\eta}}_{|\eta=0} + O(\eta^2)\]
* And we can then define the **Differential Downside Deviation Ratio** (a code sketch follows this subsection): \[\begin{align*}D_t & \equiv {\frac{dDDR_t}{d\eta}}_{|\eta=0} \\ & = \frac{R_t - \frac 12 A_{t-1}}{DD_{t-1}} \text{, when } R_t > 0 \\ & = \frac{DD_{t-1}^2 \cdot (R_t - \frac 12 A_{t-1}) - \frac 12 A_{t-1} \cdot R_t^2}{DD_{t-1}^3} \text{, when } R_t \le 0\end{align*}\]
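A minimal sketch of one online step of the Differential Downside Deviation Ratio, following the piecewise formula above. The function name, the default \(\eta\), and the ''eps'' guard are illustrative assumptions.

<code python>
def differential_ddr_step(R_t, A_prev, DD2_prev, eta=0.01, eps=1e-12):
    """One online step of the Differential Downside Deviation Ratio D_t.

    A_prev: EMA of returns (A_{t-1}); DD2_prev: EMA of min(R_t, 0)^2 (DD_{t-1}^2);
    eta: adaptation rate. Returns (D_t, A_t, DD2_t).
    """
    DD_prev = max(DD2_prev, eps) ** 0.5    # DD_{t-1}, guarded against zero
    if R_t > 0:
        D_t = (R_t - 0.5 * A_prev) / DD_prev
    else:
        D_t = (DD2_prev * (R_t - 0.5 * A_prev) - 0.5 * A_prev * R_t ** 2) / DD_prev ** 3
    A_t = A_prev + eta * (R_t - A_prev)                       # EMA of returns
    DD2_t = DD2_prev + eta * (min(R_t, 0.0) ** 2 - DD2_prev)  # EMA of downside variance
    return D_t, A_t, DD2_t
</code>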
===== III. Learning to Trade =====
* Reinforcement learning adjusts the parameters of a system to maximize the expected payoff generated by the actions of the system.
* This is accomplished through trial-and-error exploration of the environment and of the space of strategies.
* Supervised learning is effective for the **structural credit assignment** problem, but not for **temporal credit assignment**.
* Structural credit assignment: assigning credit to the parameters of the system.
* Temporal credit assignment: assigning credit to the individual actions taken over time.
* => Reinforcement learning tries to solve both problems at the same time.

==== A. Recurrent Reinforcement Learning ====
* Given a trading system model \(F_t(\theta)\), the goal is to adjust the parameters \(\theta\) in order to maximize \(U_T\).
* For traders of form \(\eqref{eq1}\) and trading returns of form \(\eqref{eq_Rt_add}\) or \(\eqref{eq_Rt_mult}\), the gradient of \(U_T\) with respect to the parameters \(\theta\) of the system after a sequence of T periods is: \[\begin{equation}\frac{dU_T(\theta)}{d\theta} = \sum\limits_{t=1}^T \frac{dU_T}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_batch_rrl}\end{equation}\]
* The system can be optimized in batch mode by repeatedly computing the value of \(U_T\) on forward passes and adjusting the parameters by gradient ascent (with learning rate \(\rho\)): \[\begin{equation}\Delta \theta = \rho \frac{dU_T(\theta)}{d\theta}\label{eq2_batch_rrl}\end{equation}\]
* Note that the quantities \(dF_t/d\theta\) are total derivatives, so we need an approach similar to **Back-Propagation Through Time** (BPTT), thus: \[\begin{equation}\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}} \frac {dF_{t-1}}{d\theta}\label{eq3_batch_rrl}\end{equation}\]
* We assume here differentiability of \(F_t\). For long/short traders with thresholds, the reinforcement signal can be backpropagated through the pre-thresholded outputs (similar to the Adaline learning rule).
* Equations \(\eqref{eq1_batch_rrl}\), \(\eqref{eq2_batch_rrl}\) and \(\eqref{eq3_batch_rrl}\) constitute the **batch RRL algorithm**.
* There are two ways to extend this batch algorithm into a stochastic framework:
  - Exploration of the strategy space can be induced by incorporating a noise variable \(\epsilon_t\) (as in \(\eqref{eq_Ft}\)). In that case:
    * the trade-off between **exploration** of the strategy space and **exploitation** of a learned policy can be controlled by the amplitude of the noise variance \(\sigma_\epsilon\);
    * the noise magnitude can be annealed over time to arrive at a good strategy.
  - A simple online stochastic optimization can be obtained by considering only the term in \(\eqref{eq1_batch_rrl}\) that depends on the most recently realized return \(R_t\) (during the forward pass): \[\begin{equation}\frac{dU_t(\theta)}{d\theta} \approx \frac{dU_t}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_online_rrl}\end{equation}\]
* The parameters are then updated online: \[\begin{equation}\Delta \theta_t = \rho \frac{dU_t(\theta_t)}{d\theta_t}\label{eq2_online_rrl}\end{equation}\]
* This algorithm performs stochastic optimization, since the system parameters are varied during each forward pass through the training data.
* The stochastic online analog of \(\eqref{eq3_batch_rrl}\) is: \[\begin{equation}\frac{dF_t}{d\theta_t} \approx \frac{\partial F_t}{\partial \theta_t} + \frac{\partial F_t}{\partial F_{t-1}} \frac {dF_{t-1}}{d\theta_{t-1}}\label{eq3_online_rrl}\end{equation}\]
* Equations \(\eqref{eq1_online_rrl}\), \(\eqref{eq2_online_rrl}\) and \(\eqref{eq3_online_rrl}\) constitute the **stochastic (or adaptive) RRL algorithm** (a code sketch is given at the end of this section).
* => This is a reinforcement algorithm closely related to recurrent supervised algorithms such as **Real-Time Recurrent Learning (RTRL)** and **Dynamic Backpropagation**.
* When considering the differential performance criterion \(D_t\) as described in \(\eqref{eq_Dt}\), the stochastic update equations become: \[\begin{equation}\begin{split}\Delta\theta_t & = \rho \frac{dD_t(\theta_t)}{d\theta_t}\\ & \approx \rho \frac{dD_t}{dR_t}\{\frac{dR_t}{dF_t}\frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta_{t-1}}\}\end{split}\end{equation}\]
* Note that for financial data, adding a noise variable \(\epsilon_t\) does not provide any significant advantage, since the input data already contain significant noise.
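A minimal sketch combining \(\eqref{eq1_online_rrl}\), \(\eqref{eq2_online_rrl}\) and \(\eqref{eq3_online_rrl}\) for a single tanh trader with additive returns \(\eqref{eq_Rt_add}\) and the differential Sharpe ratio as \(D_t\). The class name, the default hyper-parameter values, and the numerical guard are assumptions, not from the paper.

<code python>
import numpy as np

class OnlineRRLTrader:
    """Sketch of the stochastic (online) RRL update for a single tanh trader
    with additive returns (r_f = 0) and the differential Sharpe ratio as D_t."""

    def __init__(self, m, mu=1.0, delta=0.005, rho=0.01, eta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.theta = rng.normal(scale=0.1, size=m + 3)  # [u, v_0..v_m, w]
        self.dF_dtheta = np.zeros(m + 3)                # dF_{t-1}/dtheta_{t-1}
        self.F_prev = 0.0                               # F_{t-1}
        self.A, self.B = 0.0, 0.0                       # EMA moments of R_t
        self.mu, self.delta = mu, delta                 # position size, cost rate
        self.rho, self.eta = rho, eta                   # learning / adaptation rates

    def step(self, r_window):
        """r_window: array [r_t, r_{t-1}, ..., r_{t-m}]; returns (F_t, R_t)."""
        x = np.concatenate(([self.F_prev], r_window, [1.0]))  # inputs incl. bias
        F_t = np.tanh(self.theta @ x)

        # Trading return, eq. (eq_Rt_add) with r_f = 0
        R_t = self.mu * (self.F_prev * r_window[0] - self.delta * abs(F_t - self.F_prev))

        # Recurrent total derivative dF_t/dtheta_t, eq. (eq3_online_rrl)
        dF = (1.0 - F_t ** 2) * (x + self.theta[0] * self.dF_dtheta)

        # dR_t/dF_t and dR_t/dF_{t-1}
        s = np.sign(F_t - self.F_prev)
        dR_dF = -self.mu * self.delta * s
        dR_dFprev = self.mu * (r_window[0] + self.delta * s)

        # dD_t/dR_t for the differential Sharpe ratio (A, B hold A_{t-1}, B_{t-1})
        var = max(self.B - self.A ** 2, 1e-12)
        dD_dR = (self.B - self.A * R_t) / var ** 1.5

        # Online gradient-ascent update, eqs. (eq1_online_rrl)-(eq2_online_rrl)
        self.theta += self.rho * dD_dR * (dR_dF * dF + dR_dFprev * self.dF_dtheta)

        # Update the EMA moments and carry the recurrent state forward
        self.A += self.eta * (R_t - self.A)
        self.B += self.eta * (R_t ** 2 - self.B)
        self.F_prev, self.dF_dtheta = F_t, dF
        return F_t, R_t
</code>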
==== B. Value functions and Q-Learning ====
Provide proper references for this chapter?

===== IV. Empirical Results =====
* Using 3 test cases:
  - artificial price series (using the Sharpe ratio);
  - the half-hourly US Dollar/British Pound (USD/GBP) exchange rate (using the Downside Deviation Ratio);
  - a comparison of RRL and Q-Learning on the monthly S&P 500 stock index.

==== A. Trader Simulation ====
* Using an RRL trader taking {long, short} positions, with a structure similar to \(\eqref{eq_2states_trader}\).
* The experiments demonstrate that:
  - RRL is an effective means of learning trading strategies;
  - trading frequency is reduced, as expected, as transaction costs increase.

=== A.1 Data ===
* Log price series are generated as random walks with autoregressive trend processes (a generator sketch follows at the end of Section IV.A): \[\begin{equation}\begin{split}p(t) & = p(t-1) + \beta(t-1) + k \epsilon(t)\\\beta(t) & = \alpha \beta(t-1) + \nu(t)\end{split}\end{equation}\]
* Where \(\alpha\) and k are constants and \(\epsilon(t)\) and \(\nu(t)\) are normal random deviates with zero mean and unit variance: \(\epsilon(t) \sim \mathcal{N}(0,1)\) and \(\nu(t) \sim \mathcal{N}(0,1)\).
* The artificial prices are then defined as: \(z(t) = exp\left(\frac{p(t)}{R}\right)\), where \(R = max(p(t)) - min(p(t))\).
* Experiments were done with 10000 samples, \(\alpha = 0.9\) and \(k = 3\).

=== A.2 Simulated Trading Results ===
* The input at time t is constructed from the previous 8 returns.
* The RRL trader is initialized randomly.
* The trader is adapted using real-time recurrent learning to optimize the differential Sharpe ratio.
* The transaction cost is fixed at 0.5% during learning and trading.
* Transient effects of the initial learning are visible during the first 2000 time steps.
* In these simulations the 10000 samples are partitioned into:
  - a 1000-sample training set;
  - a 9000-sample test set.
* Traders are first optimized on the training data set for 100 epochs.
* They are then adapted online on the test data set.
* In 100 experiments, a positive Sharpe ratio is always obtained.
* And, as expected, trading frequency is reduced as transaction costs increase.
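A minimal sketch of the artificial price generator of Section IV.A.1, following the equations above. The function name and the ''seed'' argument are added conveniences, not part of the paper's setup.

<code python>
import numpy as np

def generate_artificial_prices(T=10000, alpha=0.9, k=3.0, seed=0):
    """Sketch of the artificial price process of Section IV.A.1: a log-price
    random walk p(t) with autoregressive trend beta(t), rescaled and
    exponentiated into prices z(t)."""
    rng = np.random.default_rng(seed)
    p = np.zeros(T)
    beta = np.zeros(T)
    for t in range(1, T):
        beta[t] = alpha * beta[t - 1] + rng.standard_normal()      # nu(t) ~ N(0,1)
        p[t] = p[t - 1] + beta[t - 1] + k * rng.standard_normal()  # eps(t) ~ N(0,1)
    R = p.max() - p.min()   # price range used for rescaling
    return np.exp(p / R)    # artificial price series z(t)
</code>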
==== B. US Dollar/British Pound Foreign exchange trading system ====
* Using the half-hourly USD/GBP exchange rate.
* Training a {long, short, neutral} trading system.
* The trading system incurs transaction costs through the bid-ask spread.
* Training maximizes the differential Downside Deviation Ratio.
* The system is initially trained on 2000 data points, then produces signals for 2 weeks (480 points); the training window is then shifted to include the just-tested 480 points and the system is re-trained.
* Using an EMA Sharpe ratio with a time constant of 0.01.

==== C. S&P 500 / T-Bill Asset Allocation ====
Provide proper references for this chapter?
* The RRL trader uses a single tanh unit and is regularized using quadratic weight decay during training (regularization parameter: 0.01).
* The sensitivity of input i is defined as: \(S_i = \frac{|\frac{dF}{dx_i}|}{max_j |\frac{dF}{dx_j}|}\)

===== V. Learn the Policy or Learn the Value? =====
Provide proper references for this chapter?

===== VI. Conclusions =====
Provide proper references for this chapter?