====== Online machine learning algorithms for currency exchange prediction ======

Authors: Eleftherios Soulas and Dennis Shasha.\\
Date: April 2013

/* This is a comment */

===== 2. Statstream renewed =====

==== 2.1 - Background Work ====

  * Ongoing research on a new type of Data Stream Management System (DSMS). Examples of streaming databases:
    * Aurora, MAIDS, Niagara, Telegraph
  * Zhu and Shasha ((Shasha, D., Zhu, Y.: High Performance Discovery In Time Series: Techniques And Case Studies (Monographs in Computer Science). Springer-Verlag (2004))) use the first few coefficients of the DFT to compute the distance between time series (e.g. to find correlations); a small numerical illustration follows this list.
  * Zhao and Shasha ((Cole, R., Shasha, D., Zhao, X.: Fast window correlations over uncooperative time series. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. KDD ’05, New York, NY, USA, ACM (2005) 743–749)) ((Zhao, X.: High performance algorithms for multiple streaming time series. PhD thesis, New York, NY, USA (2006) AAI3205697)) introduced a new sketch-based data correlation strategy, based on the **Johnson-Lindenstrauss (JL) Lemma**.
  * Info on Statstream at: http://cs.nyu.edu/shasha/papers/statstream.html
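The Zhu and Shasha reduction mentioned above compares fairly regular ("cooperative") series through their first few DFT coefficients rather than through the raw values. Below is a minimal numpy illustration of why this is safe (my own toy example, not code from the paper; the synthetic series, the length n = 128 and the cutoff f = 5 are arbitrary choices):

<code python>
import numpy as np

# Two "cooperative" series: smooth oscillations plus a little noise (synthetic toy data).
rng = np.random.default_rng(0)
n = 128
t = np.linspace(0, 4 * np.pi, n)
x = np.sin(t) + 0.05 * rng.standard_normal(n)
y = np.sin(t + 0.3) + 0.05 * rng.standard_normal(n)

# The orthonormal DFT preserves Euclidean distance exactly (Parseval's theorem).
X = np.fft.fft(x, norm="ortho")
Y = np.fft.fft(y, norm="ortho")

true_dist = np.linalg.norm(x - y)

# Keep only the first f coefficients, plus their conjugate mirror images,
# which carry the same energy because the inputs are real-valued.
f = 5
idx = np.r_[0:f, n - f + 1:n]
approx_dist = np.linalg.norm(X[idx] - Y[idx])

print(f"true distance    : {true_dist:.4f}")
print(f"f-coeff. estimate: {approx_dist:.4f}  (never larger than the true distance)")
</code>

Dropping coefficients can only remove energy, so the estimate is a lower bound on the true distance; it is tight whenever the series concentrate their energy in the low frequencies, which is exactly the "cooperative" case discussed in section 2.2 below.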
==== 2.2 - Statstream ====

  * Goal: find correlations over windows of the same or different streams, for synchronous and asynchronous (a.k.a. lagged) variations.
  * **Synchronous correlation**: given \(N_s\) streams, a start time \(t_{start}\) and a window size w, find, for each time window W of size w, all pairs of streams S1 and S2 such that S1 during time window W is highly correlated (over 0.95) with S2 during the same time window.
  * **Asynchronous correlation**: the two windows W1 and W2 (both of size w) are allowed to differ, i.e. one stream may lag the other.
  * **Cooperative**: time series that exhibit a substantial degree of regularity. Reduction techniques: Fourier transforms ((Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms. FODO ’93, London, UK, Springer-Verlag (1993))) ((Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. Technical report, College Park, MD, USA (1993))) ((Rafiei, D., Mendelzon, A.: Similarity-based queries for time series data. SIGMOD Rec. 26(2) (June 1997) 13–25)), Wavelet transforms ((Efficient time series matching by wavelets. In: Proceedings of the 15th International Conference on Data Engineering. ICDE ’99, Washington, DC, USA, IEEE Computer Society (1999))) ((Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.J.: Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In: VLDB. (2001))), Singular Value Decomposition ((Korn, F., Jagadish, H.V., Faloutsos, C.: Efficiently supporting ad hoc queries in large datasets of time sequences. In: SIGMOD. (1997) 289–300)), Piecewise Constant Approximations ((Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. Journal of Knowledge and Information Systems 3 (2000) 263–286)).
  * **Uncooperative**: time series with no such periodic regularities.
  * For uncooperative time series (for instance the returns of a stock), sketch-based approaches ((Achlioptas, D.: Database-friendly random projections, ACM Press (2001) 274–281)) can be used: the time series are projected onto random vectors, which lets us exploit the bounds provided by the Johnson-Lindenstrauss lemma ((Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. (1982))).
  * Notions of **time-point**, **basic window** and **sliding window** in the Statstream framework.
  * Benefit of the basic window (bw): if correlations are requested for \([t_i, t_{i+sw}]\), the next results are reported for \([t_{i+bw}, t_{i+sw+bw}]\), i.e. the sliding window advances one basic window at a time and the sketches can be updated incrementally.
  * Intuition of the sketch approach ((Kushilevitz, E., Ostrovsky, R., Rabani, Y.: Efficient search for approximate nearest neighbor in high dimensional spaces. In: Proceedings of the thirtieth annual ACM symposium on Theory of computing. STOC ’98, New York, NY, USA, ACM (1998) 614–623)) ((Indyk, P.: Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM 53(3) (May 2006) 307–323)):
    * Given a point \(x \in \mathbb{R}^m\), compute its dot products with d random vectors \(r_i \in \{1,-1\}^m\).
    * The first random projection of x is \(y_1 = (x \cdot r_1, x \cdot r_2, \dots, x \cdot r_d)\).
    * Compute 2b more such random projections, giving \(y_1, y_2, \dots, y_{2b+1}\).
    * If w is another point \(w \in \mathbb{R}^m\) and \(z_1, z_2, \dots, z_{2b+1}\) are its projections using the same random vectors, then the median of \(\|y_1 - z_1\|, \|y_2 - z_2\|, \dots, \|y_{2b+1} - z_{2b+1}\|\) is a good estimate of \(\|x - w\|\).
    * The estimate lies within a \(\theta(\frac{1}{d})\) factor of \(\|x - w\|\) with probability \(1 - \left(\frac{1}{2}\right)^b\).
  * The sketch implementation in Statstream uses a structured random vector approach, providing a 30- to 40-fold improvement in runtime. Example (see the code sketch after this list):
    * Build a random vector \(r_{bw} = (r_0, r_1, \dots, r_{bw-1})\) where each \(r_i\) is 1 or -1 with probability 0.5.
    * Build a control vector \(b = (b_0, b_1, \dots, b_{nb-1})\) where each \(b_i\) is 1 or -1 with probability 0.5, with \(nb = \frac{sw}{bw}\).
    * Build the random vector for the sliding window: \(r = (r_{bw} \cdot b_0, r_{bw} \cdot b_1, \dots, r_{bw} \cdot b_{nb-1})\).
  * The Statstream implementation uses 60 random vectors.
  * Pearson correlation is related to the Euclidean distance between the normalized series by \(D^2(\hat{x},\hat{y}) = 2(1 - corr(x,y))\).
  * Sketches work much better than the SVD method (see results in section 2.2.9).
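The following numpy snippet is a rough, self-contained sketch (mine, not the authors' implementation) of the two ideas above: building structured random vectors from a per-basic-window random vector and a control vector, and recovering a correlation estimate from the sketched Euclidean distance via \(D^2(\hat{x},\hat{y}) = 2(1 - corr(x,y))\). The parameters (sw = 256, bw = 32, d = 60 vectors) are illustrative, I assume each of the d sketch vectors gets its own independent \(r_{bw}\) and control vector, and the incremental per-basic-window updates and the median trick are omitted:

<code python>
import numpy as np

rng = np.random.default_rng(1)

def structured_random_vector(sw, bw, rng):
    """One +/-1 vector of length sw, built block-wise: a length-bw random
    vector repeated nb = sw / bw times, each copy multiplied by a +/-1 control bit."""
    nb = sw // bw
    r_bw = rng.choice([-1.0, 1.0], size=bw)   # random vector for one basic window
    b = rng.choice([-1.0, 1.0], size=nb)      # control vector, one bit per basic window
    return np.concatenate([bi * r_bw for bi in b])

def normalize(x):
    """Zero-mean, unit-norm copy of the window, so that the squared distance
    between two normalized windows equals 2 * (1 - corr)."""
    x = x - x.mean()
    return x / np.linalg.norm(x)

sw, bw, d = 256, 32, 60   # sliding window, basic window, number of sketch vectors
vectors = np.stack([structured_random_vector(sw, bw, rng) for _ in range(d)])

# Two toy windows with a known positive correlation.
x = rng.standard_normal(sw)
y = 0.8 * x + 0.2 * rng.standard_normal(sw)
x_hat, y_hat = normalize(x), normalize(y)

# Sketch = dot products with the d random vectors.  Since E[(x.r - y.r)^2] = ||x - y||^2
# for these +/-1 random vectors, dividing the sketch distance by sqrt(d) estimates the
# true Euclidean distance between the normalized windows.
dist_est = np.linalg.norm(vectors @ x_hat - vectors @ y_hat) / np.sqrt(d)
corr_est = 1.0 - dist_est ** 2 / 2.0

print(f"estimated correlation: {corr_est:.3f}")
print(f"exact correlation    : {np.corrcoef(x, y)[0, 1]:.3f}")
</code>

The appeal of the structured construction is that each vector needs only bw + nb random bits instead of sw, and the dot product with a sliding window can be maintained one basic window at a time, which is what the 30- to 40-fold speedup mentioned above relies on.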
==== 2.3 - Statstream renewed ====

  * The standard error is related to the standard deviation by \(SE_{\bar{x}} = \frac{std}{\sqrt{n}}\).

===== 3. Online Machine Learning =====

===== 4. Background =====

==== 4.1 - Investment options ====

==== 4.2 - Machine learning ====

  * The theoretical background for this paper is given in ((Bottou, L.: Stochastic learning. In Bousquet, O., von Luxburg, U., eds.: Advanced Lectures on Machine Learning. Lecture Notes in Artificial Intelligence, LNAI 3176. Springer Verlag, Berlin (2004))) ((Bottou, L., LeCun, Y.: Large scale online learning. In Thrun, S., Saul, L., Schölkopf, B., eds.: Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA (2004))) ((Bottou, L.: Large-scale machine learning with stochastic gradient descent. In Lechevallier, Y., Saporta, G., eds.: Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), Paris, France, Springer (August 2010) 177–187)) ((Gasso, G., Pappaioannou, A., Spivak, M., Bottou, L.: Batch and online learning algorithms for nonconvex Neyman-Pearson classification. ACM Transactions on Intelligent Systems and Technology 2(3) (2011))) ((Raiko, T., Valpola, H., LeCun, Y.: Deep learning made easier by linear transformations in perceptrons. Journal of Machine Learning Research - Proceedings Track 22 (2012) 924–932)) ((LeCun, Y., Haffner, P., Bottou, L., Bengio, Y.: Object recognition with gradient-based learning. In: Shape, Contour and Grouping in Computer Vision. (1999))).

===== 5. Predicting the future of financial data in real time =====

==== 5.3 - LearnStream ====

  * A regressor is trained on a sliding window \([t_i, t_{i+sw}]\) and predicts \(t_{i+sw+1}\) with a linear model.
  * Different window sizes give different outcomes: the window size that minimizes the error is selected "off-line".
  * The samples of the dataset that are candidates for training the model are picked at random, using **exponential random picking** (a toy sketch of the training loop is given at the end of these notes).
  * The algorithm uses a **warm start** initialization ((Velazco, M., Oliveira, A., Lyra, C.: Neural networks give a warm start to linear optimization problems. In: Neural Networks, 2002. IJCNN ’02. Proceedings of the 2002 International Joint Conference on. Volume 2. (2002) 1871–1876)).
  * Problem of "catastrophic interference" ((Sutton, R.S., Whitehead, S.D.: Online learning with random representations. In: Proceedings of the Tenth International Conference on Machine Learning, Morgan Kaufmann (1993) 314–321)).
  * Idea of an adaptive learning rate for variants of stochastic gradient descent ((Saad, D., ed.: On-line learning in neural networks. Cambridge University Press, New York, NY, USA (1998))).
  * In the algorithm used here, a fixed learning rate is enough because of the sliding-window usage.

===== 6. Experimental results =====

  * Absolute and relative error: \(\text{relative error} = \frac{\text{absolute error}}{\text{value of the quantity measured}}\).
  * Mean squared error: \(MSE(y,\hat{y}) = \frac{1}{n_{samples}} \sum\limits_{i=0}^{n_{samples}-1} (y_i - \hat{y}_i)^2\).
  * Explained variance score: \(ExpVar(y,\hat{y}) = 1 - \frac{Var\{y - \hat{y}\}}{Var\{y\}}\).
  * \(R^2\) score (i.e. the coefficient of determination): \(R^2(y,\hat{y}) = 1 - \frac{\sum_{i=0}^{n_{samples}-1} (y_i - \hat{y}_i)^2}{\sum_{i=0}^{n_{samples}-1} (y_i - \bar{y})^2}\). The best possible score is 1.0; lower values are worse.
  * Signals used in this algorithm for buying and selling (see the snippet at the end of these notes):
    * if \(bid\_prediction_{t+1} > ask_t\), then buy at time t and sell later.
    * if \(ask\_prediction_{t+1} < bid_t\), then sell at time t and buy later.

/* Done on 13/01/2015.*/
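To make section 5.3 concrete, here is a toy, self-contained version of such a training loop (my own sketch, not the paper's code): a linear model with a bias term is updated by fixed-learning-rate stochastic gradient descent on samples drawn from the current sliding window, and the weights are carried over from one step to the next (warm start). The feature choice (the last p values of the stream), the window size, the learning rate and the exponential_pick helper, which simply favours recent samples, are all my own guesses at details the notes above do not pin down:

<code python>
import numpy as np

rng = np.random.default_rng(2)

def exponential_pick(window_len, rng, scale=10.0):
    """Pick one training index from the window, favouring recent samples
    (one plausible reading of "exponential random picking")."""
    offset = int(rng.exponential(scale))
    return max(0, window_len - 1 - offset)

def predict(w, x):
    return float(w @ x)

def sgd_step(w, x, y, lr):
    """One stochastic-gradient step on the squared error (w.x - y)^2 / 2."""
    return w - lr * (predict(w, x) - y) * x

# Toy price stream; the features for each prediction are the last p observed values.
stream = np.cumsum(0.01 * rng.standard_normal(5000)) + 1.0
p, sw, lr, n_updates = 4, 200, 0.05, 20

w = np.zeros(p + 1)               # warm start: weights persist across sliding windows
abs_errs = []
for t in range(sw + p, len(stream)):
    window = stream[t - sw:t]     # the most recent sw observed values
    for _ in range(n_updates):    # a few SGD updates per time-point, fixed learning rate
        i = exponential_pick(sw - p, rng)
        x = np.append(window[i:i + p], 1.0)          # lagged values + bias term
        w = sgd_step(w, x, window[i + p], lr)
    x_next = np.append(stream[t - p:t], 1.0)         # predict the not-yet-seen value
    abs_errs.append(abs(predict(w, x_next) - stream[t]))

print(f"mean absolute one-step error: {np.mean(abs_errs):.5f}")
</code>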
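A minimal Python version of the evaluation metrics and of the buy/sell rule from section 6 (the helper names and the numeric values in the example are mine, made up purely for illustration):

<code python>
import numpy as np

def mse(y, y_pred):
    """Mean squared error."""
    return np.mean((y - y_pred) ** 2)

def explained_variance(y, y_pred):
    """Explained variance score: 1 - Var(y - y_pred) / Var(y)."""
    return 1.0 - np.var(y - y_pred) / np.var(y)

def r2(y, y_pred):
    """Coefficient of determination (best possible score: 1.0)."""
    return 1.0 - np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2)

def signal(bid_pred_next, ask_now, ask_pred_next, bid_now):
    """Trading rule from the notes above: buy if the predicted next bid exceeds
    the current ask, sell if the predicted next ask is below the current bid."""
    if bid_pred_next > ask_now:
        return "buy"
    if ask_pred_next < bid_now:
        return "sell"
    return "hold"

# Made-up example values, not results from the paper.
y      = np.array([1.3105, 1.3110, 1.3098, 1.3102])
y_pred = np.array([1.3107, 1.3108, 1.3100, 1.3101])
print(f"MSE: {mse(y, y_pred):.2e}  ExpVar: {explained_variance(y, y_pred):.3f}  R^2: {r2(y, y_pred):.3f}")
print(signal(bid_pred_next=1.31055, ask_now=1.31050, ask_pred_next=1.31080, bid_now=1.31030))
</code>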