Online machine learning algorithms for currency exchange prediction

Authors: Eleftherios Soulas and Dennis Shasha.
Date: April 2013

2. Statstream renewed

2.1 - Background Work

  • Ongoing research on new type of Data Stream Management System (DBMS). Example of streaming database:
    • Aurora, MAIDS, Niagara, Telegraph
  • Zhu and Shasha 1) using the first few coeffs of DFT transform to compute the distance between time series (eg. find the correlation).
  • Zhao and Shasha 2) 3) introduced a new sketch based data correlation strategy. Based on the Johnson-Lindenstrauss (JL) Lemma.

2.2 - Statstream

  • Goal: find correlation over windows from the same or different streams on synchronous and asynchronous (a.k.a. lagged) variations.
  • Synchronous correlation: Given \(N_s\) streams, a start time \(t_{start}\) and a window size w, find, for each time window W of size w, all pairs of streams S1 and S2 such that S1 during time window W is highly correlated (over 0.95) with S2 during the same time window.
  • Asynchronous correlation: allow for difference windows (with size w) W1 and W2.
  • Cooperative : time series that exhibit a substantial degree of regularity. Reduction techniques: Fourier transforms 4) 5) 6), Wavelet transform 7) 8)) , Singular Value Decomposition 9) , Piecewise Constant Approximations 10) .
  • Uncooperative: Time series with no such periodical regularities.
  • For uncooperative time series (for isntance returns of a stock), sketch-based approaches 11) can be used. We use random projection of the time series on random to exploit the bounds provided by the Johnson-Lindenstrauss lemma 12).
  • Notions of time-point, basic window and sliding window in Statstream framework.
  • Benefits of basic window (bw): if asking correlations for \([t_i, t_{i+sw}]\), we get results at \([t_{i+bw},t_{i+sw+bw}]\).
  • Intuition of Sketch approach 13) 14) :
    • Given a point \(x \in \mathbb{R}^m\), we compute its dot product with d random vectors \(r_i \in 1,-1^m\).
    • The first random projection of x is given by \(y_1 = (x \cdot r_1, x \cdot r_2,\dots, x \cdot r_d)\)
    • We compute 2b more such random projections, so that we get \(y_1, y_2,\dots,y_{2b+1}\).
    • If w is another point \(w \in \mathbb{R}^m\), and \(z_1, z_2,\dots,z_{2b+1}\) are its projection using the same random vectors. Then the median of \(\|y_1 - z_1\|, \|y_2 - z_2\|, \dots, \|y_{2b+1} - z_{2b+1}\| \) is a good extimate of \(\|x - w\|\).
    • The estimate lies within a \(\theta(\frac 1d)\) factor of \(\|x - w\|\) with probability \(1 - {\frac 12}^b\).
  • Sketch implementation in statstream uses a structured random vector approach, providing a 30 to 40 factor improvement in runtime.
  • Example:
    • Build random vector \(r_{bw} = (r_0,r_1,\dots,r_{bw})\) where is \(r_i\) is 1 or -1 with probability 0.5.
    • Build control vector \(b = (b_0,b_1,\dots,b_{nb})\) where is \(b_i\) is 1 or -1 with probability 0.5. \(nb = \frac{sw}{bw}\).
    • Build the random vector for sliding window: \(r = (r_{bw}*b_0,r_{bw}*b_1,\dots,r_{bw}*b_{nb})\).
  • Statstream implementation uses 60 random vectors.
  • Pearson correlation is related to euclidian distance with: \(D^2(\hat{x},\hat{y}) = 2(1-corr(x,y))\).
  • Sketches work much better tha SVD method (see results in section 2.2.9).

2.3 - Statstream renewed

  • Standard error is related to standard deviation by: \(SE_{\hat{x}} = \frac{std}{\sqrt{n}}\).

3. Online Machine Learning

4. Background

4.1 - Investment options

4.2 - Machine learning

5. Predicting the future of financial data in real time

5.3 - LearnStream

  • Regressor trained on a sliding window \([t_i,t_{i+sw}]\) and predicting \(t_{i+sw+1}\) with a linear model.
  • We expect different outcomes with different window sizes: the window size that minimize the error is selected “off-line”.
  • Here we randomly select which samples of the dataset would be candidate for training the model ⇒ using exponential random picking.
  • Algorithm using warm start initialization 21)
  • Problem of “catastrophic interference” 22)
  • Idea of adaptative learning rate for variants of stochastic gradient descent 23)
  • In the algorithm here a fixed learning rate is enough because of the sliding window usage.

6. Experimental results

  • Absolute and relative error: \(\text{relative error}= \frac{\text{absolute error}}{\text{value of thing measured}}\)
  • Mean squared error: \(MSE(y,\hat{y}) = \frac 1{n_{samples}} \sum\limits_{i=0}^{n_{samples}-1} (y_i - \hat{y}_i)^2\)
  • Explained Variance score: \(ExpVar(y,\hat{y}) = 1 - \frac{Var\{y - \hat{y}\}}{Var\{y\}} \)
  • \(R^2\) Score (eg. coefficient of determination): \(R^2(y,\hat{y}) = 1 - \frac {\sum_{i=0}^{n_{samples}-1} (y_i - \hat{y}_i)^2}{Var\{y\}}\) (Best possible score is 1.0, lower values are worse).
  • Signals used in this algorithm for buy and sell:
    • if \(bid\_prediction_{t+1} > ask_t\), then buy at time t and sell later.
    • if \(ask\_prediction_{t+1} < bid_t\), then sell at time t and buy later.
1)
Shasha, D., Zhu, Y.: High Performance Discovery In Time Series: Techniques And Case Studies (Monographs in Computer Science). SpringerVerlag (2004)
2)
Cole, R., Shasha, D., Zhao, X.: Fast window correlations over uncooperative time series. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. KDD ’05, New York, NY, USA, ACM (2005) 743–749
3)
Zhao, X.: High performance algorithms for multiple streaming time series. PhD thesis, New York, NY, USA (2006) AAI3205697
4)
Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms. FODO ’93, London, UK, UK, Springer-Verlag (1993)
5)
Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. Technical report, College Park, MD, USA (1993)
6)
Rafiei, D., Mendelzon, A.: Similarity-based queries for time series data. SIGMOD Rec. 26(2) (June 1997) 13–25
7)
Efficient time series matching by wavelets. In: Proceedings of the 15th International Conference on Data Engineering. ICDE ’99, Washington, DC, USA, IEEE Computer Society (1999)
8)
Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.J.: Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In: In VLDB. (2001
9)
Korn, F., Jagadish, H.V., Faloutsos, C.: Efficiently supporting ad hoc queries in large datasets of time sequences. In: In SIGMOD. (1997) 289–300
10)
Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. JOURNAL OF KNOWLEDGE AND INFORMATION SYSTEMS 3 (2000) 263–286
11)
Achlioptas, D.: Database-friendly random projections, ACM Press (2001) 274–281
12)
Johnson, W.B., Lindenstrauss, J.: Extensions of lipschitz mappings into a hilbert space. (1982)
13)
Kushilevitz, E., Ostrovsky, R., Rabani, Y.: Efficient search for approximate nearest neighbor in high dimensional spaces. In: Proceedings of the thirtieth annual ACM symposium on Theory of computing. STOC ’98, New York, NY, USA, ACM (1998) 614–623
14)
Indyk, P.: Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM 53(3) (May 2006) 307–323
15)
Bottou, L.: Stochastic learning. In Bousquet, O., von Luxburg, U, eds.: Advanced Lectures on Machine Learning. Lecture Notes in Artificial Intelligence, LNAI 3176. Springer Verlag, Berlin (2004)
16)
Bottou, L., LeCun, Y.: Large scale online learning. In Thrun, S., Saul, L., Sch¨olkopf, B., eds.: Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA (2004)
17)
Bottou, L.: Large-scale machine learning with stochastic gradient descent. In Lechevallier, Y., Saporta, G., eds.: Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), Paris, France, Springer (August 2010) 177–187
18)
Gasso, G., Pappaioannou, A., Spivak, M., Bottou, L.: Batch and online learning algorithms for nonconvex Neyman-Pearson classification. ACM Transaction on Intelligent System and Technologies 2(3) (2011)
19)
Raiko, T., Valpola, H., LeCun, Y.: Deep learning made easier by linear transformations in perceptrons. Journal of Machine Learning Research - Proceedings Track 22 (2012) 924–932
20)
LeCun, Y., Haffner, P., Bottou, L., Bengio, Y.: Object recognition with gradient-based learning. In: Shape, Contour and Grouping in Computer Vision. (1999)
21)
Velazco, M., Oliveira, A., Lyra, C.: Neural networks give a warm start to linear optimization problems. In: Neural Networks, 2002. IJCNN ’02. Proceedings of the 2002 International Joint Conference on. Volume 2. (2002) 1871 –1876
22)
Sutton, R.S., Whitehead, S.D.: Online learning with random representations. In: In Proceedings of the Tenth International Conference on Machine Learning, Morgan Kaufmann (1993) 314–321
23)
Saad, D., ed.: On-line learning in neural networks. Cambridge University Press, New York, NY, USA (1998)