Online machine learning algorithms for currency exchange prediction

Authors: Eleftherios Soulas and Dennis Shasha.
Date: April 2013

Ongoing research on new type of Data Stream Management System (DBMS). Example of streaming database:
- Aurora, MAIDS, Niagara, Telegraph
Zhu and Shasha ¹⁾ using the first few coeffs of DFT transform to compute the distance between time series (eg. find the correlation).
Zhao and Shasha ²⁾ ³⁾ introduced a new sketch based data correlation strategy. Based on the Johnson-Lindenstrauss (JL) Lemma.
Infos on statstream at: http://cs.nyu.edu/shasha/papers/statstream.html

Goal: find correlation over windows from the same or different streams on synchronous and asynchronous (a.k.a. lagged) variations.
Synchronous correlation: Given \(N_s\) streams, a start time \(t_{start}\) and a window size w, find, for each time window W of size w, all pairs of streams S1 and S2 such that S1 during time window W is highly correlated (over 0.95) with S2 during the same time window.
Asynchronous correlation: allow for difference windows (with size w) W1 and W2.

Cooperative : time series that exhibit a substantial degree of regularity. Reduction techniques: Fourier transforms ⁴⁾ ⁵⁾ ⁶⁾, Wavelet transform ⁷⁾ ⁸⁾) , Singular Value Decomposition ⁹⁾ , Piecewise Constant Approximations ¹⁰⁾ .
Uncooperative: Time series with no such periodical regularities.

For uncooperative time series (for isntance returns of a stock), sketch-based approaches ¹¹⁾ can be used. We use random projection of the time series on random to exploit the bounds provided by the Johnson-Lindenstrauss lemma ¹²⁾.

Notions of time-point, basic window and sliding window in Statstream framework.
Benefits of basic window (bw): if asking correlations for \([t_i, t_{i+sw}]\), we get results at \([t_{i+bw},t_{i+sw+bw}]\).

Intuition of Sketch approach ¹³⁾ ¹⁴⁾ :
- Given a point \(x \in \mathbb{R}^m\), we compute its dot product with d random vectors \(r_i \in 1,-1^m\).
- The first random projection of x is given by \(y_1 = (x \cdot r_1, x \cdot r_2,\dots, x \cdot r_d)\)
- We compute 2b more such random projections, so that we get \(y_1, y_2,\dots,y_{2b+1}\).
- If w is another point \(w \in \mathbb{R}^m\), and \(z_1, z_2,\dots,z_{2b+1}\) are its projection using the same random vectors. Then the median of \(\|y_1 - z_1\|, \|y_2 - z_2\|, \dots, \|y_{2b+1} - z_{2b+1}\| \) is a good extimate of \(\|x - w\|\).
- The estimate lies within a \(\theta(\frac 1d)\) factor of \(\|x - w\|\) with probability \(1 - {\frac 12}^b\).

Sketch implementation in statstream uses a structured random vector approach, providing a 30 to 40 factor improvement in runtime.
Example:
- Build random vector \(r_{bw} = (r_0,r_1,\dots,r_{bw})\) where is \(r_i\) is 1 or -1 with probability 0.5.
- Build control vector \(b = (b_0,b_1,\dots,b_{nb})\) where is \(b_i\) is 1 or -1 with probability 0.5. \(nb = \frac{sw}{bw}\).
- Build the random vector for sliding window: \(r = (r_{bw}*b_0,r_{bw}*b_1,\dots,r_{bw}*b_{nb})\).

Statstream implementation uses 60 random vectors.
Pearson correlation is related to euclidian distance with: \(D^2(\hat{x},\hat{y}) = 2(1-corr(x,y))\).
Sketches work much better tha SVD method (see results in section 2.2.9).

Standard error is related to standard deviation by: \(SE_{\hat{x}} = \frac{std}{\sqrt{n}}\).

Theoretical background for this paper given in ¹⁵⁾ ¹⁶⁾ ¹⁷⁾ ¹⁸⁾ ¹⁹⁾ ²⁰⁾

Regressor trained on a sliding window \([t_i,t_{i+sw}]\) and predicting \(t_{i+sw+1}\) with a linear model.
We expect different outcomes with different window sizes: the window size that minimize the error is selected “off-line”.

Here we randomly select which samples of the dataset would be candidate for training the model ⇒ using exponential random picking.

Algorithm using warm start initialization ²¹⁾

Problem of “catastrophic interference” ²²⁾

Idea of adaptative learning rate for variants of stochastic gradient descent ²³⁾

In the algorithm here a fixed learning rate is enough because of the sliding window usage.

Absolute and relative error: \(\text{relative error}= \frac{\text{absolute error}}{\text{value of thing measured}}\)
Mean squared error: \(MSE(y,\hat{y}) = \frac 1{n_{samples}} \sum\limits_{i=0}^{n_{samples}-1} (y_i - \hat{y}_i)^2\)
Explained Variance score: \(ExpVar(y,\hat{y}) = 1 - \frac{Var\{y - \hat{y}\}}{Var\{y\}} \)
\(R^2\) Score (eg. coefficient of determination): \(R^2(y,\hat{y}) = 1 - \frac {\sum_{i=0}^{n_{samples}-1} (y_i - \hat{y}_i)^2}{Var\{y\}}\) (Best possible score is 1.0, lower values are worse).

Signals used in this algorithm for buy and sell:
- if \(bid\_prediction_{t+1} > ask_t\), then buy at time t and sell later.
- if \(ask\_prediction_{t+1} < bid_t\), then sell at time t and buy later.

¹⁾

Shasha, D., Zhu, Y.: High Performance Discovery In Time Series: Techniques And Case Studies (Monographs in Computer Science). SpringerVerlag (2004)

²⁾

Cole, R., Shasha, D., Zhao, X.: Fast window correlations over uncooperative time series. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. KDD ’05, New York, NY, USA, ACM (2005) 743–749

³⁾

Zhao, X.: High performance algorithms for multiple streaming time series. PhD thesis, New York, NY, USA (2006) AAI3205697

⁴⁾

Agrawal, R., Faloutsos, C., Swami, A.N.: Eﬃcient similarity search in sequence databases. In: Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms. FODO ’93, London, UK, UK, Springer-Verlag (1993)

⁵⁾

Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. Technical report, College Park, MD, USA (1993)

⁶⁾

Raﬁei, D., Mendelzon, A.: Similarity-based queries for time series data. SIGMOD Rec. 26(2) (June 1997) 13–25

⁷⁾

Eﬃcient time series matching by wavelets. In: Proceedings of the 15th International Conference on Data Engineering. ICDE ’99, Washington, DC, USA, IEEE Computer Society (1999)

⁸⁾

Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.J.: Surﬁng wavelets on streams: One-pass summaries for approximate aggregate queries. In: In VLDB. (2001

⁹⁾

Korn, F., Jagadish, H.V., Faloutsos, C.: Eﬃciently supporting ad hoc queries in large datasets of time sequences. In: In SIGMOD. (1997) 289–300

¹⁰⁾

Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. JOURNAL OF KNOWLEDGE AND INFORMATION SYSTEMS 3 (2000) 263–286

¹¹⁾

Achlioptas, D.: Database-friendly random projections, ACM Press (2001) 274–281

¹²⁾

Johnson, W.B., Lindenstrauss, J.: Extensions of lipschitz mappings into a hilbert space. (1982)

¹³⁾

Kushilevitz, E., Ostrovsky, R., Rabani, Y.: Eﬃcient search for approximate nearest neighbor in high dimensional spaces. In: Proceedings of the thirtieth annual ACM symposium on Theory of computing. STOC ’98, New York, NY, USA, ACM (1998) 614–623

¹⁴⁾

Indyk, P.: Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM 53(3) (May 2006) 307–323

¹⁵⁾

Bottou, L.: Stochastic learning. In Bousquet, O., von Luxburg, U, eds.: Advanced Lectures on Machine Learning. Lecture Notes in Artiﬁcial Intelligence, LNAI 3176. Springer Verlag, Berlin (2004)

¹⁶⁾

Bottou, L., LeCun, Y.: Large scale online learning. In Thrun, S., Saul, L., Sch¨olkopf, B., eds.: Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA (2004)

¹⁷⁾

Bottou, L.: Large-scale machine learning with stochastic gradient descent. In Lechevallier, Y., Saporta, G., eds.: Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), Paris, France, Springer (August 2010) 177–187

¹⁸⁾

Gasso, G., Pappaioannou, A., Spivak, M., Bottou, L.: Batch and online learning algorithms for nonconvex Neyman-Pearson classiﬁcation. ACM Transaction on Intelligent System and Technologies 2(3) (2011)

¹⁹⁾

Raiko, T., Valpola, H., LeCun, Y.: Deep learning made easier by linear transformations in perceptrons. Journal of Machine Learning Research - Proceedings Track 22 (2012) 924–932

²⁰⁾

LeCun, Y., Haﬀner, P., Bottou, L., Bengio, Y.: Object recognition with gradient-based learning. In: Shape, Contour and Grouping in Computer Vision. (1999)

²¹⁾

Velazco, M., Oliveira, A., Lyra, C.: Neural networks give a warm start to linear optimization problems. In: Neural Networks, 2002. IJCNN ’02. Proceedings of the 2002 International Joint Conference on. Volume 2. (2002) 1871 –1876

²²⁾

Sutton, R.S., Whitehead, S.D.: Online learning with random representations. In: In Proceedings of the Tenth International Conference on Machine Learning, Morgan Kaufmann (1993) 314–321

²³⁾

Saad, D., ed.: On-line learning in neural networks. Cambridge University Press, New York, NY, USA (1998)