Restricted Boltzmann Machine
Hinton - A practical guide to Training RBMs
Conditional forms of RBMs can be used to model high-dimensional temporal sequences (e.g. videos or motion-capture data) \(\Rightarrow\) see Taylor et al., 2006
Components for deep belief nets (Hinton et al., 2006a)
Overview of Contrastive Divergence
Energy for config \((\mathbf{v},\mathbf{h})\): \(E(\mathbf{v},\mathbf{h}) = - \sum\limits_{i \in visible} a_iv_i - \sum\limits_{j \in hidden} b_jh_j - \sum\limits_{i,j} v_ih_jw_{ij}\)
Network assigned probability: \(p(\mathbf{v},\mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v},\mathbf{h})}\)
With \(Z = \sum\limits_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}\) : the partition function
Derivative of the log-likelihood: \(\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \langle v_ih_j \rangle_{data} - \langle v_ih_j \rangle_{model}\)
With \(\langle . \rangle\) : expectation under the distribution specified by the subscript.
Unbiased sample of \(\langle v_ih_j \rangle_{data}\)
\[\begin{equation}p(h_j=1 | \mathbf{v}) = \sigma(b_j + \sum\limits_i v_iw_{ij})\end{equation}\label{hid_prob}\]
\[\begin{equation}p(v_i=1 | \mathbf{h}) = \sigma(a_i + \sum\limits_j h_jw_{ij})\end{equation}\label{vis_prob}\]
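A minimal numpy sketch of these two conditionals (the array names `v`, `h`, `W`, `a`, `b` are my own, chosen to match the symbols above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# W: weight matrix of shape (num_visible, num_hidden); a, b: visible and hidden biases
def hidden_probs(v, W, b):
    # p(h_j = 1 | v) = sigma(b_j + sum_i v_i w_ij)
    return sigmoid(b + v @ W)

def visible_probs(h, W, a):
    # p(v_i = 1 | h) = sigma(a_i + sum_j h_j w_ij)
    return sigmoid(a + h @ W.T)
```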
Unbiased sample of \(\langle v_ih_j \rangle_{model}\)
Or, a much faster option (the CD-1 reconstruction):
Set \(\mathbf{v}\) to a training vector
Compute the hidden units once, using \(\eqref{hid_prob}\)
Build a reconstruction by setting each \(v_i\) to 1 with the probability given by \(\eqref{vis_prob}\).
Change the weights with: \(\Delta w_{ij} = \epsilon (\langle v_ih_j \rangle_{data} - \langle v_ih_j \rangle_{recon}) \)
This learning rule closely approximates the gradient of an objective function called “Contrastive Divergence”, i.e. the difference between two Kullback-Leibler divergences.
⇒ In fact, CD is not following the gradient of any function (see Sutskever and Tieleman, 2010), but it works well in practice.
Collect stats with Contrastive Divergence
Updating the hidden states
Given data vector v, we use equation \(\eqref{hid_prob}\): hidden unit turns on if this probability is greater than a random number uniformly distributed between 0 and 1.
Total input for hidden unit j is: \(b_j + \sum\limits_i v_iw_{ij}\)
Updating the visible states
Given a hidden vector h, we use equation \(\eqref{vis_prob}\). We should assign binary values to the visible units, but assigning the probability directly reduces sampling noise and allows faster learning.
Collecting the statistics
For the positive statistics, can use either \(\langle p_ih_j \rangle_{data}\) or \(\langle p_ip_j \rangle_{data}\)
using \(p_j\) usually has less sampling noise ⇒ slightly faster learning.
Recipe for getting learning signal for CD1
When the hidden units are being driven by data, always use stochastic binary states. When they are being driven by reconstructions, always use probabilities without sampling.
Assuming the visible units use the logistic function, use real-valued probabilities for both the data and the reconstructions.
When collecting the pairwise statistics for learning weights or the individual statistics for learning biases, use the probabilities, not the binary states, and make sure the weights have random initial values to break symmetry.
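Putting the recipe together, a hedged CD-1 sketch (reusing the `hidden_probs`/`visible_probs` helpers sketched earlier; batching conventions and the default learning rate are my own assumptions): stochastic binary hidden states when driven by data, probabilities when driven by the reconstruction, real-valued probabilities for the reconstructed visibles, and probabilities for the pairwise statistics.

```python
def cd1_update(v_data, W, a, b, epsilon=0.05):
    # hidden units driven by data: sample stochastic binary states
    ph_data = hidden_probs(v_data, W, b)
    h_data = (ph_data > np.random.rand(*ph_data.shape)).astype(float)

    # reconstruction: real-valued probabilities for the visible units
    v_recon = visible_probs(h_data, W, a)

    # hidden units driven by the reconstruction: probabilities, no sampling
    ph_recon = hidden_probs(v_recon, W, b)

    # pairwise statistics use probabilities rather than binary states
    pos = v_data.T @ ph_data
    neg = v_recon.T @ ph_recon
    batch_size = v_data.shape[0]

    dW = epsilon * (pos - neg) / batch_size
    da = epsilon * (v_data - v_recon).mean(axis=0)
    db = epsilon * (ph_data - ph_recon).mean(axis=0)
    return dW, da, db
```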
Size of a mini batch
More efficient to use mini-batches of 10 to 100 cases.
To avoid having to change the learning rate when the size of a mini-batch is changed, it is helpful to divide the total gradient computed on a mini-batch by the size of the mini-batch.
A mini-batch size of about 10 is a good starting point.
Monitor learning progress
Should compute the squared error between data and reconstruction.
Reconstruction error on entire training set should fall rapidly and consistently at the start of learning and then more slowly.
The reconstruction error is actually a very poor measure of the progress of learning.
Consider using Annealed Importance Sampling (Salakhutdinov and Murray, 2008) to estimate the density on held-out data.
Monitoring overfitting
After every few epochs, compute the average free energy of a representative subset of the training data and compare it with the average free energy of a validation set.
Always use the same subset of the training data.
If the gap starts growing, the model is overfitting (though the probability of the training data may be growing even faster than the gap, so the probability of the validation data may still be improving).
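For binary units the free energy can be computed in closed form; a minimal sketch, assuming the parameter names used above (`train_subset` and `valid_set` are placeholder arrays of my own):

```python
def free_energy(v, W, a, b):
    # F(v) = - sum_i v_i a_i - sum_j log(1 + exp(b_j + sum_i v_i w_ij))
    x = b + v @ W
    return -(v @ a) - np.logaddexp(0.0, x).sum(axis=1)

# every few epochs, compare a fixed training subset with a validation set;
# if this gap keeps growing, the model is starting to overfit
gap = free_energy(valid_set, W, a, b).mean() - free_energy(train_subset, W, a, b).mean()
```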
The learning rate
Look at a histogram of the weight updates and a histogram of the weights. The updates should be about \(10^{-3}\) times the weights (to within about an order of magnitude)
When a unit has a very large fan-in, the updates should be smaller since many small changes in the same direction can easily reverse the sign of the gradient.
Conversely, for biases, the updates can be bigger.
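A quick way to check this ratio numerically rather than from histograms (a rough sanity check of my own, reusing `dW` and `W` from the CD-1 sketch above):

```python
# the typical update magnitude should be roughly 1e-3 times the typical weight magnitude
ratio = np.abs(dW).mean() / (np.abs(W).mean() + 1e-12)
print(f"update/weight ratio: {ratio:.2e}")
```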
The initial values of the weights and biases
Use small random values for the weights chosen from a zero-mean Gaussian with a standard deviation of 0.01
Set the hidden biases to 0. Set the visible biases to \(\log[p_i/(1 - p_i)]\), where \(p_i\) is the proportion of training vectors in which unit i is on.
Look at the activities of the hidden units occasionally to check that they are not always on or off.
When using a sparsity target probability of t, it makes sense to initialize the hidden biases to \(\log[t/(1 - t)]\).
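A sketch of this initialization, assuming `train_data` is a (num_cases × num_visible) binary matrix (the name and the choice of 500 hidden units are mine):

```python
num_visible, num_hidden = train_data.shape[1], 500     # 500 is an arbitrary illustrative choice

W = 0.01 * np.random.randn(num_visible, num_hidden)    # small zero-mean Gaussian weights
b = np.zeros(num_hidden)                                # hidden biases start at 0

p = train_data.mean(axis=0).clip(1e-6, 1 - 1e-6)        # proportion of cases each visible unit is on
a = np.log(p / (1 - p))                                 # visible biases: log[p_i / (1 - p_i)]

# with a sparsity target t, initialize the hidden biases to log[t / (1 - t)] instead:
# t = 0.05; b = np.full(num_hidden, np.log(t / (1 - t)))
```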
Momentum
\[\Delta \theta_i(t) = \nu_i(t) = \alpha \nu_i(t-1) - \epsilon \frac{dE}{d\theta_i}(t)\]
Once the error has settled down to gentle progress, increase the momentum to 0.9.
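The same update written as code (a sketch with my own names; `theta` is any parameter array and `grad` its gradient on the current mini-batch):

```python
velocity = np.zeros_like(theta)        # once, before training
alpha, epsilon = 0.5, 0.05             # illustrative values; raise alpha to 0.9 once progress is gentle

# at each update:
velocity = alpha * velocity - epsilon * grad
theta = theta + velocity
```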
Weight-decay
For an RBM, sensible values for the weight-cost coefficient for L2 weight-decay typically range from 0.01 to 0.00001.
Weight-cost is typically not applied to the hidden and visible biases because there are far fewer of these so they are less likely to cause overfitting. (and the biases sometimes need to be quite large).
Try an initial weight-cost of 0.0001. If you are using Annealed Importance Sampling (Salakhutdinov and Murray, 2008) to estimate the density on a held-out validation set, try adjusting the weight-cost by factors of 2 to optimize the density.
Encouraging sparse hidden activities
Sparse activities of the binary hidden units can be achieved by specifying a sparsity target, which is the desired probability of being active, \(p \ll 1\).
An additional penalty term is then used to encourage the actual probability of being active, q, to be close to p.
q is estimated by using an exponentially decaying average of the mean probability that a unit is active in each mini-batch: \[q_{new} = \lambda q_{old} + (1-\lambda)q_{current}\]
where \(q_{current}\) is the mean activation probability of the hidden unit on the current mini-batch.
\[\text{penalty} \propto -p \log(q) - (1-p) \log(1-q)\]
For logistic units this has a simple derivative of q − p with respect to the total input to a unit. This derivative, scaled by a meta-parameter called sparsity-cost, is used to adjust both the bias and the incoming weights of each hidden unit.
Set the sparsity target to between 0.01 and 0.1.
Set the decay-rate, \(\lambda\) of the estimated value of q to be between 0.9 and 0.99.
Histogram the mean activities of the hidden units and set the sparsity-cost so that the hidden units have mean probabilities in the vicinity of the target.
If the probabilities are tightly clustered around the target value, reduce the sparsity-cost so that it interferes less with the main objective of the learning.
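A sketch of the sparsity adjustment under these settings (variable names and illustrative values are mine; `ph_data` and `v_data` are the data-driven hidden probabilities and visible data from the CD-1 sketch above, and the exact form of the weight adjustment, scaling by the mean visible activity, is my reading of the rule):

```python
p_target, lam, sparsity_cost = 0.05, 0.95, 0.01    # illustrative values within the ranges above
q = np.full(num_hidden, p_target)                  # initial estimate of the mean activities

# after each mini-batch: exponentially decaying estimate of each hidden unit's activity
q = lam * q + (1 - lam) * ph_data.mean(axis=0)

# the penalty's derivative w.r.t. a hidden unit's total input is (q - p)
b -= sparsity_cost * (q - p_target)                                   # adjust hidden biases
W -= sparsity_cost * v_data.mean(axis=0)[:, None] * (q - p_target)    # and incoming weights
```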
Number of hidden units
Estimate how many bits it would take to describe each data-vector if you were using a good model (i.e. estimate the typical negative \(\log_2\) probability of a data-vector under a good model)
Multiply this estimate by the number of training cases and use a number of parameters that is about an order of magnitude smaller.
If you are using a sparsity target that is very small, you may be able to use more hidden units.
If the training cases are highly redundant, as they typically will be for very big training sets, you need to use fewer parameters.
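For example (illustrative numbers only, not from the guide): if a good model needs roughly 100 bits per data-vector and there are 10,000 training cases, that is about \(10^6\) bits in total, so a model with on the order of \(10^5\) parameters would be a reasonable starting point.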
Different types of unit
A general treatment for units in the exponential family is given in (Welling et al., 2005)
The main use of other types of unit is for dealing with data that is not well-modeled by binary (or logistic) visible units.
Softmax units
Gaussian visible units
\[E(\mathbf{v},\mathbf{h}) = \sum\limits_{i \in Vis} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum\limits_{j \in hid} b_jh_j - \sum\limits_{i,j} \frac{v_i}{\sigma_i} h_jw_{ij}\]
Gaussian visible and hidden units
Binomial units
Rectified linear units
Varieties of contrastive divergence
Displaying what is happening during learning
Histograms of the weights, the visible biases and the hidden biases are very useful.
It is useful to examine histograms of the increments to these parameters when they are updated (but not after each update).
When there is (spatial or temporal) structure in the visible units it is very helpful to display, for each hidden unit, the weights connecting that hidden unit to the visible units. Gray-scale displays of receptive fields are usually better.
For a single minibatch, it is very useful to see a two-dimensional, gray-scale display with a range of [0,1] that shows the probability of each binary hidden unit on each training case in a mini-batch.
Using RBM’s for discrimination
Dealing with missing values
Taylor, Hinton and Roweis - 2006 - Modeling Human Motion Using Binary Latent Variables
An energy-based model for real-valued vectors
\[- log~p(\mathbf{v},\mathbf{h}) = \sum\limits_i \frac{(v_i - c_i)^2}{2\sigma_i^2} - \sum\limits_j b_jh_j - \sum\limits_{i,j} \frac{v_i}{\sigma_i} h_jw_{ij} + const\]
In practice, we renormalize the data to have zero mean and unit variance.
The conditional distributions (assuming \(\sigma_i = 1\), with f the logistic function) are:
\[p(h_j = 1 |\mathbf{v}) = f(b_j + \sum\limits_i v_iw_{ij})\]
\[p(v_i|\mathbf{h}) = \mathscr{N}(c_i + \sum\limits_j h_jw_{ij},1)\]
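A sketch of sampling from these two conditionals with \(\sigma_i = 1\) (helper and variable names are my own):

```python
def sample_hidden(v, W, b):
    # p(h_j = 1 | v) = f(b_j + sum_i v_i w_ij), with f the logistic function
    ph = 1.0 / (1.0 + np.exp(-(b + v @ W)))
    return (ph > np.random.rand(*ph.shape)).astype(float), ph

def sample_visible(h, W, c):
    # p(v_i | h) = N(c_i + sum_j h_j w_ij, 1): Gaussian with unit variance
    mean = c + h @ W.T
    return mean + np.random.randn(*mean.shape)
```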
Conditional RBM model
We can model temporal dependencies by treating the visible variables in the previous time slice as additional fixed inputs.
We add two types of directed connections:
autoregressive connections from the past n configurations (time steps) of the visible units to the current visible configuration.
connections from the past m visible configurations to the current hidden configuration
⇒ These connections turn the RBM into a conditional RBM (CRBM)
In experiments: n = m = 3
\[\begin{equation}\Delta d_{ij}^{(t-q)} \propto v_i^{t-q} (\langle h_j^t \rangle_{data} - \langle h_j^t \rangle_{recon})\end{equation}\label{dir_v_h}\]
\[\begin{equation}\Delta a_{ki}^{(t-q)} \propto v_k^{t-q} (v_i^t - \langle v_i^t \rangle_{recon})\end{equation}\label{dir_v_v}\]
The autoregressive weights can model short-term temporal structure very well, leaving the hidden units to model longer-term, higher level structure.
We do not attempt to model the first n frames of each sequence
The updates are only conditional on the past n time steps, not the entire sequence.
To save computer memory, time frames are not actually replicated in mini-batches; we simply use indexing to simulate the “chunking” of frames.
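A rough sketch of the directed-weight updates \(\eqref{dir_v_h}\) and \(\eqref{dir_v_v}\) for a single time offset q (array names, shapes and the learning rate are my assumptions; the full model also sums over q = 1..n and keeps the standard RBM weight updates):

```python
# v_past: visible frame at time t-q
# returns updates for d^(t-q) (visible(t-q) -> hidden(t)) and a^(t-q) (visible(t-q) -> visible(t))
def crbm_directed_updates(v_past, ph_data, ph_recon, v_data, v_recon, lr=0.001):
    # delta d_ij^(t-q) ~ v_i^(t-q) * (<h_j^t>_data - <h_j^t>_recon)
    delta_d = lr * np.outer(v_past, ph_data - ph_recon)
    # delta a_ki^(t-q) ~ v_k^(t-q) * (v_i^t - <v_i^t>_recon)
    delta_a = lr * np.outer(v_past, v_data - v_recon)
    return delta_d, delta_a
```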
Approximations
Data gathering and preprocessing
Experiments
Higher level model
Once we have trained the model, we can add layers as in a Deep Belief Network [14]. The previous-layer CRBM is kept, and its sequence of hidden state vectors, while driven by the data, is treated as a new kind of “fully observed” data.
Discussion
If we augment the data with some static skeletal and identity parameters (in essence mapping a person’s unique identity to a set of features), we should be able to use the same generative model for many different people, and generalize individual characteristics from one type of motion to another.
Recurrent Neural Networks
The Unreasonable Effectiveness of Recurrent Neural Networks
What are Recurrent Neural Networks
* RNNs allow us to operate over **sequences** of vectors.
* Check http://deepmind.com/
Written as a class, the RNN's API consists of a single step function:
```python
rnn = RNN()
y = rnn.step(x)  # x is an input vector, y is the RNN's output vector
```
Example implementation of a vanilla RNN with a single hidden vector h:
```python
import numpy as np

class RNN:
    # ... (initialization of self.h and the weight matrices W_hh, W_xh, W_hy omitted)
    def step(self, x):
        # update the hidden state
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # compute the output vector
        y = np.dot(self.W_hy, self.h)
        return y
```
We can stack multiple RNNs with:
```python
y1 = rnn1.step(x)
y = rnn2.step(y1)
```
Tutorial on LSTM Recurrent Networks
From Jurgen Schmidhuber
Sequential problems:
Control of attention
Sequence recognition
Motor control
Almost every real world task
Other sequence learners:
Hidden Markov Models: discrete, cannot store real values.
Symbolic approaches
Heuristic program search: no direction for search in algorithm space
Universal search
Fastest algorithm
Optimal ordered problem solver
LSTM is about “error carousels”, “cells” and “blocks”?
Hochreiter used LSTM for metalearning (2001) and observed astonishing results ⇒ “LSTM metalearner” ?
LSTM implementation explained
Because of the gating mechanism, the cell can keep a piece of information for long periods of time while processing a sequence, and protect the gradient inside the cell from harmful changes during training.
Vanilla LSTMs don’t have a forget gate and add the unchanged cell state during the update (this can be seen as a recurrent connection with a constant weight of 1), which is often referred to as a Constant Error Carousel (CEC). It gets this name because error can circulate through the cell unchanged; this addresses the serious RNN training problem of vanishing and exploding gradients, which in turn makes it possible to learn long-term relationships.
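A minimal sketch of one LSTM step with a forget gate, in the same style as the RNN class above (a standard formulation, not the exact one from the tutorial; weight names and shapes are my own assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTM:
    # ... (weights W_* of shape (hidden_size, hidden_size + input_size) and biases b_* omitted)
    def step(self, x):
        z = np.concatenate([self.h, x])               # previous hidden state and current input
        f = sigmoid(np.dot(self.W_f, z) + self.b_f)   # forget gate
        i = sigmoid(np.dot(self.W_i, z) + self.b_i)   # input gate
        o = sigmoid(np.dot(self.W_o, z) + self.b_o)   # output gate
        g = np.tanh(np.dot(self.W_g, z) + self.b_g)   # candidate cell content
        self.c = f * self.c + i * g                   # gated cell update (the CEC when f = 1)
        self.h = o * np.tanh(self.c)                  # exposed hidden state
        return self.h
```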
Understanding LSTM Networks