Table of Contents

Learning to Trade via Direct Reinforcement

Authors: John Moody & Matthew Saffell
Date: 2001

The paper presents an adaptive algorithm, Recurrent Reinforcement Learning (RRL), which learns the trading policy directly rather than a value function, in contrast to TD-Learning and Q-Learning.

I. Introduction

II. Trading systems and performance criteria

A. Structure of trading systems

\[\begin{equation}\begin{split}F_t & = F(\theta_t; F_{t-1}, I_t) \\ I_t & = (z_t,z_{t-1},z_{t-2},...; y_t, y_{t-1}, y_{t-2}, ...)\end{split}\label{eq1}\end{equation}\]

\[\begin{equation}F_t = \operatorname{sign}(u \cdot F_{t-1} + v_0 \cdot r_t + v_1 \cdot r_{t-1} + \cdots + v_m \cdot r_{t-m} + w)\label{eq_2states_trader}\end{equation}\]
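A minimal Python sketch of this two-state trader, with parameter names `u`, `v`, `w` matching the equation above (the continuous generalization below would replace `sign` by `tanh`):

```python
import numpy as np

def trader_decision(theta, F_prev, r_window):
    """One decision of the two-state {long, short} trader above.

    theta: dict with keys 'u', 'v' (weights on r_t, ..., r_{t-m}) and 'w';
    r_window: most recent price changes (r_t, r_{t-1}, ..., r_{t-m}).
    """
    activation = theta["u"] * F_prev + np.dot(theta["v"], r_window) + theta["w"]
    return np.sign(activation)  # +1 = long, -1 = short

# toy usage with made-up numbers
theta = {"u": 0.1, "v": np.array([0.5, 0.3, 0.2]), "w": 0.0}
print(trader_decision(theta, F_prev=1.0, r_window=np.array([0.01, -0.02, 0.005])))
```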

Continuous function generalization

Stochastic framework generalization

B. Profit and wealth for trading systems

Additive profits

\[P_T = \sum\limits_{t=1}^T R_t\]

\[\begin{equation}R_t = \mu \cdot \{F_{t-1} \cdot r_t - \delta \cdot |F_t - F_{t-1}|\}\label{eq_Rt_add}\end{equation}\]

Multiplicative profits

\[W_T = W_0 \cdot \prod\limits_{t=1}^T(1+R_t)\]

\[\begin{equation}(1+R_t) = \{1 + F_{t-1} \cdot r_t\} \times (1 - \delta \cdot |F_t - F_{t-1}|)\label{eq_Rt_mult}\end{equation}\]
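Both profit formulations as a Python sketch, vectorized over a whole trade history (the `mu` and `delta` values are illustrative):

```python
import numpy as np

def additive_returns(F, r, mu=1.0, delta=0.001):
    """R_t = mu * (F_{t-1}*r_t - delta*|F_t - F_{t-1}|) for t = 1..T.

    F: positions F_0..F_T; r: price changes r_1..r_T.
    """
    F_prev, F_curr = F[:-1], F[1:]
    return mu * (F_prev * r - delta * np.abs(F_curr - F_prev))

def multiplicative_wealth(F, r, W0=1.0, delta=0.001):
    """W_T = W_0 * prod(1 + R_t) with the multiplicative (1 + R_t) above.

    F: positions F_0..F_T; r: period returns r_1..r_T of the risky asset.
    """
    F_prev, F_curr = F[:-1], F[1:]
    one_plus_R = (1.0 + F_prev * r) * (1.0 - delta * np.abs(F_curr - F_prev))
    return W0 * np.prod(one_plus_R)
```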

C. Performance criteria

\[\begin{equation}D_t \varpropto \Delta U_t = U_t - U_{t-1}\label{eq_Dt}\end{equation}\]

D. Differential Sharpe Ratio

\[A_t = A_{t-1} + \eta \Delta A_t = A_{t-1} + \eta \cdot (R_t - A_{t-1})\] \[B_t = B_{t-1} + \eta \Delta B_t = B_{t-1} + \eta \cdot (R_t^2 - B_{t-1})\]

\[\begin{align*}{S_t}_{|\eta>0} & \approx {S_t}_{|\eta=0} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2) \\ & \approx S_{t-1} + \eta {\frac{dS_t}{d\eta}}_{|\eta=0} + O(\eta^2)\end{align*}\]

\[D_t = {\frac{dS_t}{d\eta}}_{|\eta=0} = \frac{B_{t-1} \cdot \Delta A_t - \frac 12 A_{t-1} \Delta B_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}\]

\[\begin{align*}\frac{dU_t}{dR_t} & = \frac{dS_t}{dR_t} \approx \eta \frac{dD_t}{dR_t} \\ \text{with: } \frac{dD_t}{dR_t} & = \frac{B_{t-1} - A_{t-1} \cdot R_t}{(B_{t-1} - A_{t-1}^2)^\frac 32}\end{align*}\]
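One online update of the differential Sharpe ratio as a Python sketch (the small `eps` guard for the start-up phase, while \(B_{t-1} - A_{t-1}^2\) is still zero, is an added assumption):

```python
def differential_sharpe_step(R_t, A_prev, B_prev, eta=0.01, eps=1e-12):
    """Update the moment estimates A, B and return D_t and dD_t/dR_t."""
    delta_A = R_t - A_prev
    delta_B = R_t ** 2 - B_prev
    denom = (B_prev - A_prev ** 2) ** 1.5 + eps
    D_t = (B_prev * delta_A - 0.5 * A_prev * delta_B) / denom
    dDt_dRt = (B_prev - A_prev * R_t) / denom   # used by the weight update
    A_t = A_prev + eta * delta_A
    B_t = B_prev + eta * delta_B
    return D_t, dDt_dRt, A_t, B_t
```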

E. Downside risk

\[DDR_T = \frac{\text{Average}(R_t)}{DD_T}\]

\[\begin{align*}A_t & = A_{t-1} + \eta \cdot (R_t - A_{t-1}) \\ DD_t^2 & = DD_{t-1}^2 + \eta \cdot (\min(R_t,0)^2 - DD_{t-1}^2)\end{align*}\]

\[DDR_t \approx DDR_{t-1} + \eta {\frac{dDDR_t}{d\eta}}_{|\eta=0} + O(\eta^2)\]

\[\begin{align*}D_t & \equiv {\frac{dDDR_t}{d\eta}}_{|\eta=0} \\ & = \frac{R_t - \frac 12 A_{t-1}}{DD_{t-1}} \text{, when } R_t > 0 \\ & = \frac{DD_{t-1}^2 \cdot (R_t - \frac 12 A_{t-1}) - \frac 12 A_{t-1} \cdot R_t^2}{DD_{t-1}^3} \text{, when } R_t \le 0\end{align*}\]
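The analogous online step for the downside deviation ratio (same `eps` caveat as above):

```python
def differential_ddr_step(R_t, A_prev, DD_prev, eta=0.01, eps=1e-12):
    """Return D_t for the DDR utility and the updated A_t, DD_t."""
    if R_t > 0:
        D_t = (R_t - 0.5 * A_prev) / (DD_prev + eps)
    else:
        D_t = (DD_prev ** 2 * (R_t - 0.5 * A_prev)
               - 0.5 * A_prev * R_t ** 2) / (DD_prev ** 3 + eps)
    A_t = A_prev + eta * (R_t - A_prev)
    DD_t = (DD_prev ** 2 + eta * (min(R_t, 0.0) ** 2 - DD_prev ** 2)) ** 0.5
    return D_t, A_t, DD_t
```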

III. Learning to Trade

A. Recurrent Reinforcement Learning

\[\begin{equation}\frac{dU_T(\theta)}{d\theta} = \sum\limits_{t=1}^T \frac{dU_T}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_batch_rrl}\end{equation}\]

\[\begin{equation}\Delta \theta = \rho \frac{dU_T(\theta)}{d\theta}\label{eq2_batch_rrl}\end{equation}\]

\[\begin{equation}\frac{dU_t(\theta)}{d\theta} \approx \frac{dU_t}{dR_t} \left \{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right \}\label{eq1_online_rrl}\end{equation}\]

\[\begin{equation}\Delta \theta_t = \rho \frac{dU_t(\theta_t)}{d\theta_t}\label{eq2_online_rrl}\end{equation}\]

\[\begin{equation}\begin{split}\Delta\theta_t & = \rho \frac{dD_t(\theta_t)}{d\theta_t}\\ & \approx \rho \frac{dD_t}{dR_t}\{\frac{dR_t}{dF_t}\frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta_{t-1}}\}\end{split}\end{equation}\]
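Putting the pieces together, a sketch of one online RRL step for a `tanh` trader optimizing the differential Sharpe ratio. The input layout `x_t = [1, r_t, ..., r_{t-m}, F_{t-1}]`, the hyperparameter values, and the function name are assumptions for illustration, not the authors' exact implementation:

```python
import numpy as np

def rrl_online_step(theta, dF_prev_dtheta, F_prev, A_prev, B_prev,
                    r_window, r_t, mu=1.0, delta=0.001, rho=0.01, eta=0.01):
    """One online RRL update for F_t = tanh(theta . x_t),
    with x_t = [1, r_t, ..., r_{t-m}, F_{t-1}] (the last weight plays the role of u)."""
    x_t = np.concatenate(([1.0], r_window, [F_prev]))
    F_t = np.tanh(theta @ x_t)

    # trading return with transaction cost (additive-profit case)
    R_t = mu * (F_prev * r_t - delta * abs(F_t - F_prev))

    # derivative of the differential Sharpe ratio w.r.t. R_t
    dDt_dRt = (B_prev - A_prev * R_t) / ((B_prev - A_prev ** 2) ** 1.5 + 1e-12)

    # dR_t/dF_t and dR_t/dF_{t-1}
    s = np.sign(F_t - F_prev)
    dRt_dFt = -mu * delta * s
    dRt_dFprev = mu * r_t + mu * delta * s

    # recurrent total derivative dF_t/dtheta
    dFt_dtheta = (1.0 - F_t ** 2) * (x_t + theta[-1] * dF_prev_dtheta)

    # gradient ascent step on D_t
    grad = dDt_dRt * (dRt_dFt * dFt_dtheta + dRt_dFprev * dF_prev_dtheta)
    theta = theta + rho * grad

    # update moment estimates for the next step
    A_t = A_prev + eta * (R_t - A_prev)
    B_t = B_prev + eta * (R_t ** 2 - B_prev)
    return theta, dFt_dtheta, F_t, A_t, B_t
```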

B. Value functions and Q-Learning

Provide proper references for this chapter?

IV. Empirical Results

A. Trader Simulation

A.1 Data

\[\begin{equation}\begin{split}p(t) & = p(t-1) + \beta(t-1) + k \epsilon(t)\\\beta(t) & = \alpha \beta(t-1) + \nu(t)\end{split}\end{equation}\]
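A sketch of this artificial trending price process (parameter values are illustrative; \(\epsilon\) and \(\nu\) are taken here as i.i.d. standard normal draws):

```python
import numpy as np

def simulate_prices(T=10_000, alpha=0.9, k=3.0, seed=0):
    """p(t) = p(t-1) + beta(t-1) + k*eps(t), beta(t) = alpha*beta(t-1) + nu(t),
    with eps, nu drawn i.i.d. from a standard normal."""
    rng = np.random.default_rng(seed)
    p = np.zeros(T)
    beta = 0.0
    for t in range(1, T):
        p[t] = p[t - 1] + beta + k * rng.standard_normal()
        beta = alpha * beta + rng.standard_normal()
    return p
```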

A.2 Simulated Trading Results

B. US Dollar/British Pound Foreign exchange trading system

C. S&P 500 / T-Bill Asset Allocation

Provide proper references for this chapter?

V. Learn the Policy or Learn the Value ?

Provide proper references for this chapter?

VI. Conclusions

Provide proper references for this chapter?