====== Learning to Trade via Direct Reinforcement ======
Authors: John Moody & Matthew Saffell\\
Date: 2001

Presenting an adaptive algorithm, Recurrent Reinforcement Learning (RRL), which differs from TD-Learning and Q-Learning.

===== I. Introduction =====
* Trader goal: optimize some measure of trading performance.
* Investment performance depends upon sequences of interdependent decisions => path dependent.
* RRL is an adaptive policy search that can learn an investment strategy online.
* In financial problems we can use a Direct Reinforcement approach to provide immediate feedback to optimize a strategy.
* Frequently used class of performance criteria: measures of risk-adjusted investment returns.
* RRL can balance the accumulation of returns with the avoidance of risk.
* We formulate here the differential forms of the Sharpe ratio and Downside Deviation Ratio for efficient online learning with RRL.

===== II. Trading systems and performance criteria =====

==== A. Structure of trading systems ====
* We consider agents that trade **fixed position sizes** in a single security.
* Traders are assumed to take only long, neutral, or short positions of constant magnitude: \(F_t \in \{1, 0, -1\}\).
* The price series is denoted \(z_t\).
* Position \(F_t\) is established (or maintained) **at the end** of each time interval t => a trade is possible at the end of each time period.
* Return \(R_t\) is realized at the end of the interval \((t-1, t]\) and includes:
  - the profit/loss resulting from the position \(F_{t-1}\) held over the interval;
  - the transaction cost incurred at t due to the difference between \(F_{t-1}\) and \(F_t\).
* The trader must have internal state information and must therefore be **recurrent**.
* We use the following decision function: \[\begin{equation}\begin{split}F_t & = F(\theta_t; F_{t-1}, I_t) \\ I_t & = (z_t,z_{t-1},z_{t-2},...; y_t, y_{t-1}, y_{t-2}, ...)\end{split}\label{eq1}\end{equation}\]
* With:
  * \(\theta_t\): learned system parameters at time t.
  * \(I_t\): information set at time t.
  * \(z_t\): price series.
  * \(y_t\): external variables.
* Simple {long, short} trader example with m+1 autoregressive inputs: \[\begin{equation}F_t = sign(u \cdot F_{t-1} + v_1 \cdot r_t + v_2 \cdot r_{t-1} + \cdots + v_m \cdot r_{t-m} + w)\label{eq_2states_trader}\end{equation}\]
* Where \(r_t = z_t - z_{t-1}\) are the price returns of \(z_t\) and the parameters are \(\theta = \{u, v_i, w\}\).
* This is a discrete-action, **deterministic** trader (a code sketch follows after the generalizations below).

=== Continuous function generalization ===
* We can use a continuously valued F() by replacing sign with tanh.
* \(F_t \in \{1,0,-1\}\) is not differentiable, but we may still apply gradient optimization by considering differentiable pre-threshold outputs, or by replacing sign with tanh during learning and discretizing when trading.

=== Stochastic framework generalization ===
* The model can be extended by introducing a noise variable in F(): \[\begin{equation}F_t = F(\theta_t; F_{t-1}, I_t; \epsilon_t) \text{ with } \epsilon_t \sim p_\epsilon(\epsilon)\label{eq_Ft}\end{equation}\]
* The noise level controls the "exploration vs exploitation" behavior.
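A minimal Python sketch of the {long, short} recurrent trader \(\eqref{eq_2states_trader}\) and its tanh relaxation. The class name ''TwoStateTrader'', the random-initialization scale, and the default seed are illustrative assumptions, not taken from the paper.

<code python>
import numpy as np

class TwoStateTrader:
    """Sketch of the {long, short} recurrent trader:
    F_t = sign(u*F_{t-1} + v_0*r_t + ... + v_m*r_{t-m} + w)."""

    def __init__(self, m, seed=0):
        rng = np.random.default_rng(seed)
        self.u = rng.normal(scale=0.1)               # recurrent weight on F_{t-1}
        self.v = rng.normal(scale=0.1, size=m + 1)   # weights on r_t, ..., r_{t-m}
        self.w = rng.normal(scale=0.1)               # bias
        self.F_prev = 0.0                            # internal state F_{t-1}

    def decide(self, recent_returns, discrete=True):
        """recent_returns: array-like [r_t, r_{t-1}, ..., r_{t-m}] of length m+1."""
        pre = self.u * self.F_prev + np.dot(self.v, recent_returns) + self.w
        # sign for the discrete-action trader; tanh for the continuous relaxation
        F_t = float(np.sign(pre)) if discrete else float(np.tanh(pre))
        self.F_prev = F_t
        return F_t

# Example: trader = TwoStateTrader(m=8); position = trader.decide(np.zeros(9))
</code>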
==== B. Profit and wealth for trading systems ====
* Trading systems are optimized by maximizing a performance function U().

=== Additive profits ===
* Applicable if each trade is of fixed size.
* \(r_t = z_t - z_{t-1}\): price returns of the risky asset.
* \(r_t^f = z_t^f - z_{t-1}^f\): price returns of the risk-free asset (like T-Bills).
* Transaction cost rate: \(\delta\).
* Trading position size: \(\mu > 0\).
* Additive profits accumulated over T periods: \[P_T = \sum\limits_{t=1}^T R_t\]
* Where: \[R_t = \mu \cdot \{r_t^f + F_{t-1} \cdot (r_t - r_t^f) - \delta \cdot |F_t - F_{t-1}|\}\]
* Usually we consider \(P_0 = 0\) and \(F_T = F_0 = 0\).
* When ignoring the risk-free rate of interest (e.g. \(r_t^f = 0\)), we have: \[\begin{equation}R_t = \mu \cdot \{F_{t-1} \cdot r_t - \delta \cdot |F_t - F_{t-1}|\}\label{eq_Rt_add}\end{equation}\]
* The wealth of the trader is defined as \(W_T = W_0 + P_T\).

=== Multiplicative profits ===
* Applicable if a fixed fraction of accumulated wealth \(\nu > 0\) is invested in each trade.
* We use: \(r_t = \frac{z_t}{z_{t-1}} - 1\) and \(r_t^f = \frac{z_t^f}{z_{t-1}^f} - 1\).
* If we assume no short sales and \(\nu = 1\), then the wealth at T is: \[W_T = W_0 \cdot \prod\limits_{t=1}^T(1+R_t)\]
* Where: \((1+R_t) = \{1 + (1 - F_{t-1}) \cdot r_t^f + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)\)
* When ignoring the risk-free rate of interest (e.g. \(r_t^f = 0\)), we have: \[\begin{equation}(1+R_t) = \{1 + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)\label{eq_Rt_mult}\end{equation}\]

==== C. Performance criteria ====
* We consider performance criteria that are functions of the wealth: \(U(W_T)\).
* Or more generally: \(U(W_T, \dots, W_1, W_0)\).
* In both cases U can be expressed as a function of the trading returns: \(U(R_T,\dots,R_1,W_0)\), which we denote \(U_T\).
* For trader optimization we are interested in the **marginal increase in performance due to return \(R_t\)**: \[\begin{equation}D_t \varpropto \Delta U_t = U_t - U_{t-1}\label{eq_Dt}\end{equation}\]
* We call \(D_t\) the differential performance criterion.
* Note that \(U_{t-1}\) does not depend on \(R_t\).

==== D. Differential Sharpe Ratio ====
* The Sharpe ratio is defined as: \(S_T = \frac{\text{Average}(R_t)}{\text{Std deviation}(R_t)} = \frac{\bar{R}}{(\frac 1T \sum_{t=1}^T R_t^2 - \bar{R}^2)^\frac 12}\)
* With \(\bar{R} = \frac 1T \sum\limits_{t=1}^T R_t\).
* The **Differential Sharpe Ratio** is obtained by considering exponential moving averages of the returns and squared returns and expanding to first order in the adaptation rate \(\eta\): \[A_t = A_{t-1} + \eta \Delta A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1})\] \[B_t = B_{t-1} + \eta \Delta B_t = B_{t-1} + \eta \cdot (R_t^2 - B_{t-1})\]
* Then we have \(S_t = \frac{A_t}{\sqrt{B_t - A_t^2}}\), and: \[\begin{align*}{S_t}_{|\eta>0} & \approx {S_t}_{|\eta=0} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2) \\ & \approx S_{t-1} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2)\end{align*}\]
* A zero adaptation rate corresponds to an infinite time average.
* Thus expanding about \(\eta=0\) corresponds to "**just turning on**" the adaptation.
* We define the **Differential Sharpe Ratio** as: \[D_t = {\frac{dS_t}{d\eta}}_{|\eta=0} = \frac{B_{t-1} \cdot \Delta A_t - \frac 12 A_{t-1} \Delta B_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}\]
* \(D_t\) describes the influence of the trading return \(R_t\) on the Sharpe ratio \(S_t\) (a code sketch of this online computation follows at the end of this subsection).
* Since \(S_{t-1}\) does not depend on \(R_t\), when we take \(S_t\) as the utility function we get: \[\frac{dU_t}{dR_t} = \frac{dS_t}{dR_t} \approx \eta \frac{dD_t}{dR_t} \\ \text{with: } \frac{dD_t}{dR_t} = \frac{B_{t-1} - A_{t-1} \cdot R_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}\]
* Problem: because the variance uses \(R_t^2\), there is no distinction between upside and downside risk.
* Moreover, since \(D_t\) is quadratic in \(R_t\), the largest improvement occurs at \(R_t^* = \frac{B_{t-1}}{A_{t-1}}\) (assuming \(A_{t-1} > 0\)).
* => The Sharpe ratio criterion penalizes larger gains.
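A minimal sketch of one online step of the differential Sharpe ratio computation described above. The function name, the default \(\eta\), and the ''eps'' guard against a zero denominator are assumptions, not from the paper.

<code python>
def differential_sharpe_step(R_t, A_prev, B_prev, eta=0.01, eps=1e-12):
    """One online step of the differential Sharpe ratio D_t.

    A_prev, B_prev: EMA estimates of the first and second moments of returns
    (A_{t-1}, B_{t-1}); eta: adaptation rate. Returns (D_t, A_t, B_t).
    """
    dA = R_t - A_prev                      # Delta A_t
    dB = R_t ** 2 - B_prev                 # Delta B_t
    var = max(B_prev - A_prev ** 2, eps)   # guard against a zero denominator
    D_t = (B_prev * dA - 0.5 * A_prev * dB) / var ** 1.5
    A_t = A_prev + eta * dA                # EMA updates of the moments
    B_t = B_prev + eta * dB
    return D_t, A_t, B_t
</code>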
==== E. Downside risk ====
* Variance is increasingly considered an inadequate risk measure because of the issue mentioned above.
* Other options are:
  - Downside Deviation (DD)
  - Second Lower Partial Moment (SLPM)
  - Nth Lower Partial Moment
  - The Sterling Ratio, defined as: \(\text{Sterling Ratio} = \frac{\text{Annualized Average Return}}{\text{Max. Draw-Down}}\), where the maximum drawdown is computed relative to a standard period (usually 1 to 3 years).
* The maximum drawdown is cumbersome to minimize, thus we focus on the Downside Deviation (which tracks the Sterling Ratio effectively).
* The Downside Deviation is defined as: \(DD_T = \left( \frac 1T \sum\limits_{t=1}^T min(R_t,0)^2\right)^\frac 12\)
* We then define our utility function, the **Downside Deviation Ratio** (DDR): \[DDR_T = \frac{\text{Average}(R_t)}{DD_T}\]
* Next we define the exponential moving average (EMA) of the returns and of \(DD_t^2\): \[A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1}) \\ DD_t^2 = DD_{t-1}^2 + \eta \cdot (min(R_t,0)^2 - DD_{t-1}^2)\]
* And we use the exponential moving version of the DDR: \(DDR_t = \frac{A_t}{DD_t}\)
* We consider the first-order expansion in the adaptation rate \(\eta\): \[DDR_t \approx DDR_{t-1} + \eta {\frac{dDDR_t}{d\eta}}_{|\eta=0} + O(\eta^2)\]
* And we can then define the **Differential Downside Deviation Ratio** (a code sketch follows this subsection): \[\begin{align*}D_t & \equiv {\frac{dDDR_t}{d\eta}}_{|\eta=0} \\ & = \frac{R_t - \frac 12 A_{t-1}}{DD_{t-1}} \text{, when } R_t > 0 \\ & = \frac{DD_{t-1}^2 \cdot (R_t - \frac 12 A_{t-1}) - \frac 12 A_{t-1} \cdot R_t^2}{DD_{t-1}^3} \text{, when } R_t \le 0\end{align*}\]
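A minimal sketch of one online step of the Differential Downside Deviation Ratio, following the piecewise formula above. The function name, the default \(\eta\), and the ''eps'' guard are illustrative assumptions.

<code python>
def differential_ddr_step(R_t, A_prev, DD2_prev, eta=0.01, eps=1e-12):
    """One online step of the Differential Downside Deviation Ratio D_t.

    A_prev: EMA of returns (A_{t-1}); DD2_prev: EMA of min(R_t, 0)^2 (DD_{t-1}^2);
    eta: adaptation rate. Returns (D_t, A_t, DD2_t).
    """
    DD_prev = max(DD2_prev, eps) ** 0.5    # DD_{t-1}, guarded against zero
    if R_t > 0:
        D_t = (R_t - 0.5 * A_prev) / DD_prev
    else:
        D_t = (DD2_prev * (R_t - 0.5 * A_prev) - 0.5 * A_prev * R_t ** 2) / DD_prev ** 3
    A_t = A_prev + eta * (R_t - A_prev)                       # EMA of returns
    DD2_t = DD2_prev + eta * (min(R_t, 0.0) ** 2 - DD2_prev)  # EMA of downside variance
    return D_t, A_t, DD2_t
</code>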
===== III. Learning to Trade =====
* Reinforcement learning adjusts the parameters of a system to maximize the expected payoff generated by the actions of the system.
* This is accomplished through trial-and-error exploration of the environment and of the space of strategies.
* Supervised learning is effective for the **structural credit assignment** problem, but not for **temporal credit assignment**.
* Structural credit assignment: assigning credit to the parameters of the system.
* Temporal credit assignment: assigning credit to the individual actions taken over time.
* => Reinforcement learning tries to solve both problems at the same time.

==== A. Recurrent Reinforcement Learning ====
* Given a trading system model \(F_t(\theta)\), the goal is to adjust the parameters \(\theta\) in order to maximize \(U_T\).
* For traders of form \(\eqref{eq1}\) and trading returns of form \(\eqref{eq_Rt_add}\) or \(\eqref{eq_Rt_mult}\), the gradient of \(U_T\) with respect to the parameters \(\theta\) of the system after a sequence of T periods is: \[\begin{equation}\frac{dU_T(\theta)}{d\theta} = \sum\limits_{t=1}^T \frac{dU_T}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_batch_rrl}\end{equation}\]
* The system can be optimized in batch mode by repeatedly computing the value of \(U_T\) on forward passes and adjusting the parameters by gradient ascent (with learning rate \(\rho\)): \[\begin{equation}\Delta \theta = \rho \frac{dU_T(\theta)}{d\theta}\label{eq2_batch_rrl}\end{equation}\]
* Note that the quantities \(dF_t/d\theta\) are total derivatives, so we need an approach similar to **Back-Propagation Through Time** (BPTT), thus: \[\begin{equation}\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}} \frac {dF_{t-1}}{d\theta}\label{eq3_batch_rrl}\end{equation}\]
* We assume here differentiability of \(F_t\). For long/short traders with thresholds, the reinforcement signal can be backpropagated through the pre-thresholded outputs (similar to the Adaline learning rule).
* Equations \(\eqref{eq1_batch_rrl}\), \(\eqref{eq2_batch_rrl}\) and \(\eqref{eq3_batch_rrl}\) constitute the **batch RRL algorithm**.
* There are two ways to extend this batch algorithm into a stochastic framework:
  - Exploration of the strategy space can be induced by incorporating a noise variable \(\epsilon_t\) (as in \(\eqref{eq_Ft}\)). In that case:
    * the trade-off between **exploration** of the strategy space and **exploitation** of a learned policy can be controlled by the amplitude of the noise variance \(\sigma_\epsilon\);
    * the noise magnitude can be annealed over time to arrive at a good strategy.
  - A simple online stochastic optimization can be obtained by considering only the term in \(\eqref{eq1_batch_rrl}\) that depends on the most recently realized return \(R_t\) (during the forward pass): \[\begin{equation}\frac{dU_t(\theta)}{d\theta} \approx \frac{dU_t}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_online_rrl}\end{equation}\]
* The parameters are then updated online: \[\begin{equation}\Delta \theta_t = \rho \frac{dU_t(\theta_t)}{d\theta_t}\label{eq2_online_rrl}\end{equation}\]
* This algorithm performs stochastic optimization, since the system parameters are varied during each forward pass through the training data.
* The stochastic online analog of \(\eqref{eq3_batch_rrl}\) is: \[\begin{equation}\frac{dF_t}{d\theta_t} \approx \frac{\partial F_t}{\partial \theta_t} + \frac{\partial F_t}{\partial F_{t-1}} \frac {dF_{t-1}}{d\theta_{t-1}}\label{eq3_online_rrl}\end{equation}\]
* Equations \(\eqref{eq1_online_rrl}\), \(\eqref{eq2_online_rrl}\) and \(\eqref{eq3_online_rrl}\) constitute the **stochastic (or adaptive) RRL algorithm** (a code sketch is given at the end of this section).
* => This is a reinforcement algorithm closely related to recurrent supervised algorithms such as **Real-Time Recurrent Learning (RTRL)** and **Dynamic Backpropagation**.
* When considering the differential performance criterion \(D_t\) as described in \(\eqref{eq_Dt}\), the stochastic update equations become: \[\begin{equation}\begin{split}\Delta\theta_t & = \rho \frac{dD_t(\theta_t)}{d\theta_t}\\ & \approx \rho \frac{dD_t}{dR_t}\{\frac{dR_t}{dF_t}\frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta_{t-1}}\}\end{split}\end{equation}\]
* Note that for financial data, adding a noise variable \(\epsilon_t\) does not provide any significant advantage, since the input data already contain significant noise.
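A minimal sketch combining \(\eqref{eq1_online_rrl}\), \(\eqref{eq2_online_rrl}\) and \(\eqref{eq3_online_rrl}\) for a single tanh trader with additive returns \(\eqref{eq_Rt_add}\) and the differential Sharpe ratio as \(D_t\). The class name, the default hyper-parameter values, and the numerical guard are assumptions, not from the paper.

<code python>
import numpy as np

class OnlineRRLTrader:
    """Sketch of the stochastic (online) RRL update for a single tanh trader
    with additive returns (r_f = 0) and the differential Sharpe ratio as D_t."""

    def __init__(self, m, mu=1.0, delta=0.005, rho=0.01, eta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.theta = rng.normal(scale=0.1, size=m + 3)  # [u, v_0..v_m, w]
        self.dF_dtheta = np.zeros(m + 3)                # dF_{t-1}/dtheta_{t-1}
        self.F_prev = 0.0                               # F_{t-1}
        self.A, self.B = 0.0, 0.0                       # EMA moments of R_t
        self.mu, self.delta = mu, delta                 # position size, cost rate
        self.rho, self.eta = rho, eta                   # learning / adaptation rates

    def step(self, r_window):
        """r_window: array [r_t, r_{t-1}, ..., r_{t-m}]; returns (F_t, R_t)."""
        x = np.concatenate(([self.F_prev], r_window, [1.0]))  # inputs incl. bias
        F_t = np.tanh(self.theta @ x)

        # Trading return, eq. (eq_Rt_add) with r_f = 0
        R_t = self.mu * (self.F_prev * r_window[0] - self.delta * abs(F_t - self.F_prev))

        # Recurrent total derivative dF_t/dtheta_t, eq. (eq3_online_rrl)
        dF = (1.0 - F_t ** 2) * (x + self.theta[0] * self.dF_dtheta)

        # dR_t/dF_t and dR_t/dF_{t-1}
        s = np.sign(F_t - self.F_prev)
        dR_dF = -self.mu * self.delta * s
        dR_dFprev = self.mu * (r_window[0] + self.delta * s)

        # dD_t/dR_t for the differential Sharpe ratio (A, B hold A_{t-1}, B_{t-1})
        var = max(self.B - self.A ** 2, 1e-12)
        dD_dR = (self.B - self.A * R_t) / var ** 1.5

        # Online gradient-ascent update, eqs. (eq1_online_rrl)-(eq2_online_rrl)
        self.theta += self.rho * dD_dR * (dR_dF * dF + dR_dFprev * self.dF_dtheta)

        # Update the EMA moments and carry the recurrent state forward
        self.A += self.eta * (R_t - self.A)
        self.B += self.eta * (R_t ** 2 - self.B)
        self.F_prev, self.dF_dtheta = F_t, dF
        return F_t, R_t
</code>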
==== B. Value functions and Q-Learning ====
Provide proper references for this chapter?

===== IV. Empirical Results =====
* Using 3 test cases:
  - artificial price series (using the Sharpe ratio);
  - the half-hourly US Dollar/British Pound (USD/GBP) exchange rate (using the Downside Deviation Ratio);
  - a comparison of RRL and Q-Learning on the monthly S&P 500 stock index.

==== A. Trader Simulation ====
* Using an RRL trader taking {long, short} positions, with a structure similar to \(\eqref{eq_2states_trader}\).
* The experiments demonstrate that:
  - RRL is an effective means of learning trading strategies;
  - trading frequency is reduced, as expected, as transaction costs increase.

=== A.1 Data ===
* Log price series are generated as random walks with autoregressive trend processes (a generator sketch follows at the end of Section IV.A): \[\begin{equation}\begin{split}p(t) & = p(t-1) + \beta(t-1) + k \epsilon(t)\\\beta(t) & = \alpha \beta(t-1) + \nu(t)\end{split}\end{equation}\]
* Where \(\alpha\) and k are constants and \(\epsilon(t)\) and \(\nu(t)\) are normal random deviates with zero mean and unit variance: \(\epsilon(t) \sim \mathcal{N}(0,1)\) and \(\nu(t) \sim \mathcal{N}(0,1)\).
* The artificial prices are then defined as: \(z(t) = exp\left(\frac{p(t)}{R}\right)\), where \(R = max(p(t)) - min(p(t))\).
* Experiments were done with 10000 samples, \(\alpha = 0.9\) and \(k = 3\).

=== A.2 Simulated Trading Results ===
* The input at time t is constructed from the previous 8 returns.
* The RRL trader is initialized randomly.
* The trader is adapted using real-time recurrent learning to optimize the differential Sharpe ratio.
* The transaction cost is fixed at 0.5% during learning and trading.
* Transient effects of the initial learning are visible during the first 2000 time steps.
* In these simulations the 10000 samples are partitioned into:
  - a 1000-sample training set;
  - a 9000-sample test set.
* Traders are first optimized on the training data set for 100 epochs.
* They are then adapted online on the test data set.
* In 100 experiments, a positive Sharpe ratio is always obtained.
* And, as expected, trading frequency is reduced as transaction costs increase.
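A minimal sketch of the artificial price generator of Section IV.A.1, following the equations above. The function name and the ''seed'' argument are added conveniences, not part of the paper's setup.

<code python>
import numpy as np

def generate_artificial_prices(T=10000, alpha=0.9, k=3.0, seed=0):
    """Sketch of the artificial price process of Section IV.A.1: a log-price
    random walk p(t) with autoregressive trend beta(t), rescaled and
    exponentiated into prices z(t)."""
    rng = np.random.default_rng(seed)
    p = np.zeros(T)
    beta = np.zeros(T)
    for t in range(1, T):
        beta[t] = alpha * beta[t - 1] + rng.standard_normal()      # nu(t) ~ N(0,1)
        p[t] = p[t - 1] + beta[t - 1] + k * rng.standard_normal()  # eps(t) ~ N(0,1)
    R = p.max() - p.min()   # price range used for rescaling
    return np.exp(p / R)    # artificial price series z(t)
</code>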
==== B. US Dollar/British Pound Foreign exchange trading system ====
* Using the half-hourly USD/GBP exchange rate.
* Training a {long, short, neutral} trading system.
* The trading system incurs transaction costs through the bid-ask spread.
* Training maximizes the differential Downside Deviation Ratio.
* The system is initially trained on 2000 data points, then produces signals for 2 weeks (480 points); the training window is then shifted to include the just-tested 480 points and the system is re-trained.
* Using an EMA Sharpe ratio with a time constant of 0.01.

==== C. S&P 500 / T-Bill Asset Allocation ====
Provide proper references for this chapter?
* The RRL trader uses a single tanh unit and is regularized using quadratic weight decay during training (regularization parameter: 0.01).
* The sensitivity of input i is defined as: \(S_i = \frac{|\frac{dF}{dx_i}|}{max_j |\frac{dF}{dx_j}|}\)

===== V. Learn the Policy or Learn the Value? =====
Provide proper references for this chapter?

===== VI. Conclusions =====
Provide proper references for this chapter?