
# Online machine learning algorithms for currency exchange prediction

Authors: Eleftherios Soulas and Dennis Shasha.
Date: April 2013

• Ongoing research on a new type of system, the Data Stream Management System (DSMS). Examples of streaming databases:
• Aurora, MAIDS, Niagara, Telegraph.
• Zhu and Shasha 1) use the first few coefficients of the DFT to compute the distance between time series (e.g. to find correlations).
• Zhao and Shasha 2) 3) introduced a new sketch-based data correlation strategy, based on the Johnson-Lindenstrauss (JL) Lemma.
• Goal: find correlations over windows from the same or different streams, in both synchronous and asynchronous (a.k.a. lagged) variants.
• Synchronous correlation: given $N_s$ streams, a start time $t_{start}$ and a window size $w$, find, for each time window $W$ of size $w$, all pairs of streams $S_1$ and $S_2$ such that $S_1$ during time window $W$ is highly correlated (over 0.95) with $S_2$ during the same window.
• Asynchronous correlation: allow for different windows $W_1$ and $W_2$ (both of size $w$).
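Before any sketching, the synchronous search can be written as a brute-force scan over window/pair combinations; the stream count, window size, and random-walk data below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
ns, T, w = 4, 64, 16                                 # streams, series length, window size
streams = rng.normal(size=(ns, T)).cumsum(axis=1)    # random-walk streams

def corr(a, b):
    """Pearson correlation of two equal-length windows."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

# Synchronous: the same window W is used for both streams.
sync_pairs = []
for t in range(0, T - w + 1, w):
    for i in range(ns):
        for j in range(i + 1, ns):
            c = corr(streams[i, t:t+w], streams[j, t:t+w])
            if c > 0.95:
                sync_pairs.append((t, i, j, c))

# Asynchronous (lagged): windows of the same size w may start at different times.
lagged = corr(streams[0, 0:w], streams[1, 3:3+w])
```

The quadratic cost in the number of stream pairs is exactly what the sketch-based approach is designed to avoid.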
• Cooperative: time series that exhibit a substantial degree of regularity. Reduction techniques: Fourier transforms 4) 5) 6), Wavelet transforms 7) 8), Singular Value Decomposition 9), Piecewise Constant Approximations 10).
• Uncooperative: Time series with no such periodical regularities.
• For uncooperative time series (for instance, the returns of a stock), sketch-based approaches 11) can be used. We use random projections of the time series onto random vectors to exploit the bounds provided by the Johnson-Lindenstrauss lemma 12).
• Notions of time-point, basic window and sliding window in the Statstream framework.
• Benefit of the basic window (bw): if correlations are requested for $[t_i, t_{i+sw}]$, results are delivered at $[t_{i+bw}, t_{i+sw+bw}]$.
• Intuition of Sketch approach 13) 14) :
• Given a point $x \in \mathbb{R}^m$, we compute its dot product with d random vectors $r_i \in \{1,-1\}^m$.
• The first random projection of x is given by $y_1 = (x \cdot r_1, x \cdot r_2,\dots, x \cdot r_d)$
• We compute 2b more such random projections, so that we get $y_1, y_2,\dots,y_{2b+1}$.
• If $w \in \mathbb{R}^m$ is another point, and $z_1, z_2,\dots,z_{2b+1}$ are its projections using the same random vectors, then the median of $\|y_1 - z_1\|, \|y_2 - z_2\|, \dots, \|y_{2b+1} - z_{2b+1}\|$ is a good estimate of $\|x - w\|$.
• The estimate lies within a $\theta(\frac{1}{d})$ factor of $\|x - w\|$ with probability $1 - \left(\frac{1}{2}\right)^b$.
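The median-of-sketches estimate is easy to check numerically. The dimensions below and the $\sqrt{d}$ rescaling convention (which makes the sketch distance comparable to the original distance) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, b = 256, 32, 5                       # original dim, sketch dim, 2b+1 sketch groups

x = rng.normal(size=m)
w = x + 0.1 * rng.normal(size=m)           # a nearby point

# 2b+1 groups of d random +/-1 vectors; each group yields one projection y_k (or z_k).
R = rng.choice([-1.0, 1.0], size=(2*b + 1, d, m))
y = R @ x                                  # shape (2b+1, d)
z = R @ w

# Each ||y_k - z_k|| concentrates around sqrt(d) * ||x - w||, so rescale by sqrt(d);
# the median over the 2b+1 groups is robust to the occasional bad projection.
estimates = np.linalg.norm(y - z, axis=1) / np.sqrt(d)
est = float(np.median(estimates))
true = float(np.linalg.norm(x - w))
```

Taking a median of several independent sketch distances is what boosts the success probability to $1 - (1/2)^b$.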
• The sketch implementation in Statstream uses a structured random vector approach, providing a 30- to 40-fold improvement in runtime.
• Example:
• Build a random vector $r_{bw} = (r_1,r_2,\dots,r_{bw})$ where each $r_i$ is 1 or -1 with probability 0.5.
• Build a control vector $b = (b_1,b_2,\dots,b_{nb})$ where each $b_i$ is 1 or -1 with probability 0.5, and $nb = \frac{sw}{bw}$.
• Build the random vector for the sliding window: $r = (r_{bw} \cdot b_1, r_{bw} \cdot b_2,\dots, r_{bw} \cdot b_{nb})$.
• Statstream implementation uses 60 random vectors.
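The structured construction above can be sketched as follows; it needs only $bw + nb$ random bits instead of $sw$ independent ones, which is the source of the speedup. The sizes here are illustrative, not the values used by Statstream:

```python
import numpy as np

rng = np.random.default_rng(2)
bw, nb = 8, 4                              # basic-window size, number of basic windows
sw = bw * nb                               # sliding-window size

r_bw = rng.choice([-1, 1], size=bw)        # one +/-1 vector shared by all basic windows
b = rng.choice([-1, 1], size=nb)           # control vector: one sign per basic window

# Sliding-window random vector: the basic-window vector repeated nb times,
# with the k-th copy sign-flipped by b_k.
r = np.concatenate([b_k * r_bw for b_k in b])
```

Because `r` is built from `r_bw` blocks, the dot product of a sliding window with `r` can be assembled from per-basic-window partial sums, instead of being recomputed from scratch at every time-point.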
• Pearson correlation is related to Euclidean distance (on normalized series) by: $D^2(\hat{x},\hat{y}) = 2(1-corr(x,y))$.
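The identity is easy to verify numerically once each series is normalized to zero mean and unit norm (that normalization is what the hats denote here):

```python
import numpy as np

def normalize(x):
    """Shift to zero mean and scale to unit Euclidean norm."""
    x = x - x.mean()
    return x / np.linalg.norm(x)

rng = np.random.default_rng(3)
x, y = rng.normal(size=50), rng.normal(size=50)
xh, yh = normalize(x), normalize(y)

corr = float(xh @ yh)                      # Pearson correlation of x and y
d2 = float(np.sum((xh - yh) ** 2))         # squared Euclidean distance of normalized series
# d2 == 2 * (1 - corr), since ||xh||^2 = ||yh||^2 = 1 and xh . yh = corr
```

This identity is what lets a threshold on correlation (e.g. 0.95) be translated into a threshold on Euclidean distance, which the JL sketches can estimate.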
• Sketches work much better than the SVD method (see results in section 2.2.9).
• Standard error is related to standard deviation by: $SE_{\hat{x}} = \frac{std}{\sqrt{n}}$.
• A regressor is trained on a sliding window $[t_i,t_{i+sw}]$ and predicts $t_{i+sw+1}$ with a linear model.
• We expect different outcomes with different window sizes: the window size that minimizes the error is selected “off-line”.
• Here we randomly select which samples of the dataset are candidates for training the model ⇒ exponential random picking.
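One way to realize exponential random picking is to weight each sample by an exponentially decaying function of its age, so recent samples are far more likely to be selected. The decay rate `lam` and the exact weighting scheme below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, lam = 1000, 20, 0.01                 # pool size, picks, decay rate (assumed)

# Sample i = 0 is the oldest, i = n-1 the newest; weight by exp(-lam * age).
ages = np.arange(n)[::-1]
weights = np.exp(-lam * ages)
probs = weights / weights.sum()

# Draw k distinct training candidates, heavily biased toward recent samples.
picked = rng.choice(n, size=k, replace=False, p=probs)
```

With `lam = 0.01` the effective memory is on the order of the last few hundred samples; larger values concentrate training on the most recent data.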
• Algorithm using warm start initialization 21)
• Problem of “catastrophic interference” 22)
• Idea of an adaptive learning rate for variants of stochastic gradient descent 23)
• In the algorithm presented here, a fixed learning rate is enough because of the sliding-window usage.
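A sketch of this training loop, with a fixed learning rate and weights warm-started from the previous window (the lag order `p`, the learning rate, and the synthetic data are illustrative assumptions, not the paper's exact algorithm):

```python
import numpy as np

rng = np.random.default_rng(5)
T, sw, lr = 300, 50, 0.01                  # series length, window size, fixed learning rate
p = 3                                      # lag order of the linear model (assumed)

# Synthetic series: a sine wave plus small noise, so an AR(p) fit is adequate.
series = np.sin(np.arange(T) * 0.1) + 0.01 * rng.normal(size=T)

w = np.zeros(p + 1)                        # weights + bias, warm-started across windows
preds, truths = [], []
for t in range(p, T - 1):
    lo = max(p, t - sw)
    # A few SGD passes over the current window, starting from the previous weights;
    # warm starting is why a fixed learning rate suffices.
    for _ in range(5):
        for s in range(lo, t):
            x = np.append(series[s-p:s], 1.0)
            err = w @ x - series[s]
            w -= lr * err * x
    x_next = np.append(series[t-p:t], 1.0)
    preds.append(float(w @ x_next))        # predict the next time-point
    truths.append(float(series[t]))

mse = float(np.mean((np.array(preds) - np.array(truths)) ** 2))
```

Because consecutive windows overlap almost entirely, the previous window's weights are already near-optimal for the new window, which also sidesteps the catastrophic-interference problem of retraining from scratch.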
• Absolute and relative error: $\text{relative error}= \frac{\text{absolute error}}{\text{value of the thing measured}}$
• Mean squared error: $MSE(y,\hat{y}) = \frac 1{n_{samples}} \sum\limits_{i=0}^{n_{samples}-1} (y_i - \hat{y}_i)^2$
• Explained Variance score: $ExpVar(y,\hat{y}) = 1 - \frac{Var\{y - \hat{y}\}}{Var\{y\}}$
• $R^2$ score (i.e. coefficient of determination): $R^2(y,\hat{y}) = 1 - \frac {\sum_{i=0}^{n_{samples}-1} (y_i - \hat{y}_i)^2}{\sum_{i=0}^{n_{samples}-1} (y_i - \bar{y})^2}$ (best possible score is 1.0; lower values are worse).
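The three scores can be implemented directly from the formulas above. Note one difference worth knowing: explained variance ignores a constant bias in the predictions, while $R^2$ penalizes it:

```python
import numpy as np

def mse(y, yhat):
    """Mean squared error."""
    return float(np.mean((y - yhat) ** 2))

def explained_variance(y, yhat):
    """1 - Var(residuals) / Var(y); insensitive to a constant prediction offset."""
    return float(1 - np.var(y - yhat) / np.var(y))

def r2(y, yhat):
    """Coefficient of determination; penalizes any systematic offset."""
    return float(1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2))
```

For a perfect predictor all three are at their optimum (MSE 0, the other two 1.0); for a predictor that is correct up to a constant shift, explained variance stays at 1.0 while $R^2$ drops below it.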
• Signals used in this algorithm for buy and sell:
• if $bid\_prediction_{t+1} > ask_t$, then buy at time t and sell later.
• if $ask\_prediction_{t+1} < bid_t$, then sell at time t and buy later.
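The two rules combine into a simple three-way signal; the variable names below are hypothetical stand-ins for the current quotes and the model's one-step-ahead predictions:

```python
def signal(bid_pred_next, ask_pred_next, bid_now, ask_now):
    """Trading signal at time t from predicted t+1 quotes.

    Buy now if tomorrow's predicted bid exceeds today's ask (we expect to
    sell back higher than we pay); sell now if tomorrow's predicted ask is
    below today's bid (we expect to buy back cheaper); otherwise hold.
    """
    if bid_pred_next > ask_now:
        return "buy"
    if ask_pred_next < bid_now:
        return "sell"
    return "hold"
```

Comparing the predicted bid against the current ask (and vice versa) means the signal only fires when the expected move is large enough to cross the bid-ask spread, i.e. the round trip is profitable after the spread.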

1)
Shasha, D., Zhu, Y.: High Performance Discovery In Time Series: Techniques And Case Studies (Monographs in Computer Science). Springer-Verlag (2004)
2)
Cole, R., Shasha, D., Zhao, X.: Fast window correlations over uncooperative time series. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. KDD ’05, New York, NY, USA, ACM (2005) 743–749
3)
Zhao, X.: High performance algorithms for multiple streaming time series. PhD thesis, New York, NY, USA (2006) AAI3205697
4)
Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms. FODO ’93, London, UK, Springer-Verlag (1993)
5)
Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. Technical report, College Park, MD, USA (1993)
6)
Rafiei, D., Mendelzon, A.: Similarity-based queries for time series data. SIGMOD Rec. 26(2) (June 1997) 13–25
7)
Chan, K.P., Fu, A.W.C.: Efficient time series matching by wavelets. In: Proceedings of the 15th International Conference on Data Engineering. ICDE ’99, Washington, DC, USA, IEEE Computer Society (1999)
8)
Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.J.: Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In: VLDB (2001)
9)
Korn, F., Jagadish, H.V., Faloutsos, C.: Efficiently supporting ad hoc queries in large datasets of time sequences. In: SIGMOD (1997) 289–300
10)
Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. JOURNAL OF KNOWLEDGE AND INFORMATION SYSTEMS 3 (2000) 263–286
11)
Achlioptas, D.: Database-friendly random projections, ACM Press (2001) 274–281
12)
Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. (1982)
13)
Kushilevitz, E., Ostrovsky, R., Rabani, Y.: Efficient search for approximate nearest neighbor in high dimensional spaces. In: Proceedings of the thirtieth annual ACM symposium on Theory of computing. STOC ’98, New York, NY, USA, ACM (1998) 614–623
14)
Indyk, P.: Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM 53(3) (May 2006) 307–323
15)
Bottou, L.: Stochastic learning. In Bousquet, O., von Luxburg, U., eds.: Advanced Lectures on Machine Learning. Lecture Notes in Artificial Intelligence, LNAI 3176. Springer Verlag, Berlin (2004)
16)
Bottou, L., LeCun, Y.: Large scale online learning. In Thrun, S., Saul, L., Schölkopf, B., eds.: Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA (2004)
17)
Bottou, L.: Large-scale machine learning with stochastic gradient descent. In Lechevallier, Y., Saporta, G., eds.: Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), Paris, France, Springer (August 2010) 177–187
18)
Gasso, G., Pappaioannou, A., Spivak, M., Bottou, L.: Batch and online learning algorithms for nonconvex Neyman-Pearson classification. ACM Transactions on Intelligent Systems and Technology 2(3) (2011)
19)
Raiko, T., Valpola, H., LeCun, Y.: Deep learning made easier by linear transformations in perceptrons. Journal of Machine Learning Research - Proceedings Track 22 (2012) 924–932
20)
LeCun, Y., Haffner, P., Bottou, L., Bengio, Y.: Object recognition with gradient-based learning. In: Shape, Contour and Grouping in Computer Vision. (1999)
21)
Velazco, M., Oliveira, A., Lyra, C.: Neural networks give a warm start to linear optimization problems. In: Neural Networks, 2002. IJCNN ’02. Proceedings of the 2002 International Joint Conference on. Volume 2. (2002) 1871 –1876
22)
Sutton, R.S., Whitehead, S.D.: Online learning with random representations. In: Proceedings of the Tenth International Conference on Machine Learning, Morgan Kaufmann (1993) 314–321
23)
Saad, D., ed.: On-line learning in neural networks. Cambridge University Press, New York, NY, USA (1998)