public:courses:machine_learning:rbms:rbms_intro

# Restricted Boltzmann Machine

• Conditional form of RBMs can be used to model high-dimensional temporal sequences (eg. videos or motion captures) $\Rightarrow$ see Taylor et al., 2006
• Components for deep belief nets (Hinton et al., 2006a)
• Contrastive divergence learning procedure $\Rightarrow$ see Hinton, 2002
• Meta-parameters:
1. learning rate
2. Momentum
3. Weight-cost
4. Sparsity target
5. Initial weights
6. Number of hidden units
7. Mini-batch size
• What type of units ?
• Stochastic vs. deterministic state update ?
• How many hidden state update for each training case ?
• Start sequence at data vector ? (or hidden vector) ?
• Energy for config $(\mathbf{v},\mathbf{h})$: $E(\mathbf{v},\mathbf{h}) = - \sum\limits_{i \in visible} a_iv_i - \sum\limits_{j \in hidden} b_jh_j - \sum\limits_{i,j} v_ih_jw_{ij}$
• $v_i, h_j$: binary states of visible and hidden units
• $a_i, b_j$: biases
• $w_{ij}$: weights
• Network assigned probability: $p(\mathbf{v},\mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v},\mathbf{h})}$
• With $Z = \sum\limits_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}$ : the partition function
• Probability of visible state $\mathbf{v}$ : $p(\mathbf{v}) = \frac{1}{Z} \sum\limits_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}$
• Derivative of log likelyhood: $\frac{\partial log(p(\mathbf{v}))}{\partial w_{ij}} = \langle v_ih_j \rangle_{data} - \langle v_ih_j \rangle_{model}$
• With $\langle . \rangle$ : expectation uner the distribution specified by the subscript.
• Learning rule for stochastic steepest ascent of log probability: $\Delta w_{ij} = \epsilon (\langle v_ih_j \rangle_{data} - \langle v_ih_j \rangle_{model})$
• With $\epsilon$ : learning rate

#### Unbiased sample of $\langle v_ih_j \rangle_{data}$

• Given training image v, $h_j$ is set to 1 with probability:

$$$p(h_j=1 | \mathbf{v}) = \sigma(b_j + \sum\limits_i v_iw_{ij})$$\label{hid_prob}$

• Where $\sigma(x) = \frac{1}{1+e^{-x}}$ : the sigmoid function
• $v_ih_j$ is an unbiased sample.
• Similarly, given hidden units value h:

$$$p(v_i=1 | \mathbf{h}) = \sigma(a_i + \sum\limits_j h_jw_{ij})$$\label{vis_prob}$

#### Unbiased sample of $\langle v_ih_j \rangle_{model}$

• More difficult to compute
• One option is:
1. set visible units to random
2. perform alternating Gibbs sampling (from visible to hidden, then from hidden to visible)
• Or better option:
1. Set v to training vector
2. Compute h units once
3. Build reconstruction by setting $v_i$ to 1 with prob as computed from $\eqref{vis_prob}$.
4. Change wieght with: $\Delta w_{ij} = \epsilon (\langle v_ih_j \rangle_{data} - \langle v_ih_j \rangle_{recon})$
• This learning rule closely approximating objective function called “Contrastive Divergence” : eg. difference between 2 Kullback-Liebler divergences.
• ⇒ In fact, not following any function gradient (see Sutskever and Tieleman, 2010), but it works well.
• RBMs learn better if more steps of Gibbs sampling are taking, we note $CD_n$ the learning with n full steps of alternating Gibbs sampling.

#### Updating the hidden states

• Given data vector v, we use equation $\eqref{hid_prob}$: hidden unit turns on if this probability is greater than a random number uniformly distributed between 0 and 1.
• Total input for hidden unit j is: $b_j + \sum\limits_i v_iw_{ij}$
• When using $CD_n$ only the ﬁnal update of the hidden units should use the probability, other updates should use a binary value for the hidden units.

#### Updating the visible states

• Given hidden vector h, we use equation $\eqref{vis_prob}$. Should assign binary value to visible units, but we assign the probability directly ⇒ will reduce sampling noise and allow faster learning.

#### Collecting the statistics

• For the positive statistics, can use either $\langle p_ih_j \rangle_{data}$ or $\langle p_ip_j \rangle_{data}$
• using $p_j$ usually has less sampling noise ⇒ slightly faster learning.

#### Recipe for getting learning signal for CD1

• When the hidden units are being driven by data, always use stochastic binary states. When they are being driven by reconstructions, always use probabilities without sampling.
• Assuming the visible units use the logistic function, use real-valued probabilities for both the data and the reconstructions.
• When collecting the pairwise statistics for learning weights or the individual statistics for learning biases, use the probabilities, not the binary states, and make sure the weights have random initial values to break symmetry.
• More efficient to use mini-batches of 10 to 100 cases.
• To avoid having to change the learning rate when the size of a mini-batch is changed, it is helpful to divide the total gradient computed on a mini-batch by the size of the mini-batch.
• mini-batch size about 10 to start is good.
• Should compute the squared error between data and reconstruction.
• Reconstruction error on entire training set should fall rapidly and consistently at the start of learning and then more slowly.
• The reconstruction error is actually a very poor measure of the progress of learning.
• consider using Annealed Importance Sampling (Salakhutdinov and Murray, 2008) to estimate the density on held out data.
• After every few epochs, compute the average free energy of a representative subset of the training data and compare it with the average free energy of a validation set.
• Always use the same subset of the training data.
• If the gap starts growing, the model is overfitting (though the probability of the

training data may be growing even faster than the gap, so the probability of the validation data may still be improving).

• Make sure that the same weights are used when computing the two averages that you wish to compare.
• Look at a histogram of the weight updates and a histogram of the weights. The updates should be about $10^{-3}$ times the weights (to within about an order of magnitude)
• When a unit has a very large fan-in, the updates should be smaller since many small changes in the same direction can easily reverse the sign of the gradient.
• Conversely, for biases, the updates can be bigger.
• Use small random values for the weights chosen from a zero-mean Gaussian with a standard deviation of 0.01
• Set the hidden biases to 0. Set the visible biases to log[pi/(1 − pi)] where pi is the proportion of training vectors in which unit i is on.
• Look at the activities of the hidden units occasionally to check that they are not always on or off.
• When using a sparsity target probability of t, it makes sense to initialize the hidden biases to be log[t/(1 − t)].
• Simulation of velocity for a ball following the optimization process. Velocity decay other time and momentum meta-parameter $\alpha$ is the fraction of the previous velocity that remains after computing the gradient on a new mini-batch:

$\Delta \theta_i(t) = \nu_i(t) = \alpha \nu_i(t-1) - \epsilon \frac{dE}{d\theta_i}(t)$

• Once the large initial progress in the reduction of the reconstruction

error has settled down to gentle progress, increase the momentum to 0.9.

• This shock may cause a transient increase in the reconstruction error. If this causes a more lasting instability, keep reducing the learning rate by factors of 2 until the instability disappears.
• For an RBM, sensible values for the weight-cost coeﬃcient for L2 weight-decay typically range from 0.01 to 0.00001.
• Weight-cost is typically not applied to the hidden and visible biases because there are far fewer of these so they are less likely to cause overfitting. (and the biases sometimes need to be quite large).
• Try an initial weight-cost of 0.0001. If you are using Annealed Importance Sampling (Salakhut-dinov and Murray, 2008) to estimate the density on a held-out validation set, try adjusting the weight-cost by factors of 2 to optimize density.
• Sparse activities of the binary hidden units can be achieved by specifying a sparsity target which is the desired probability of being active, $p << 1$.
• An additional penalty term is then used to encourage the actual probability of being active, q, to be close to p.
• q is estimated by using an exponentially decaying average of the mean probability that a unit is active in each mini-batch: $q_{new} = \lambda q_{old} + (1-\lambda)q_{current}$
• where $q_{current}$ is the mean activation probability of the hidden unit on the current mini-batch.
• The natural penalty measure to use is the cross entropy between the desired and actual distributions:

$penalty \propto -p log(q) - (1-p) log(1-q)$

• For logistic units this has a simple derivative of q − p with respect to the total input to a unit. This derivative, scaled by a meta-parameter called sparsity-cost, is used to adjust both the bias and the incoming weights of each hidden unit.
• Set the sparsity target to between 0.01 and 0.1.
• Set the decay-rate, $\lambda$ of the estimated value of q to be between 0.9 and 0.99.
• Histogram the mean activities of the hidden units and set the sparsity-cost so that the hidden units have mean probabilities in the vicinity of the target.
• If the probabilities are tightly clustered around the target value, reduce the sparsity-cost so that it interferes less with the main objective of the learning.
• Estimate how many bits it would take to describe each data-vector if you were using a good model (i.e. estimate the typical negative $log_2$ probability of a datavector under a good model)
• Multiply this estimate by the number of training cases and use a number of parameters that is about an order of magnitude smaller.
• If you are using a sparsity target that is very small, you may be able to use more hidden units.
• If the training cases are highly redundant, as they typically will be for very big training sets, you need to use fewer parameters.
• A general treatment for units in the exponential family is given in (Welling et al., 2005)
• The main use of other types of unit is for dealing with data that is not well-modeled by binary (or logistic) visible units.

#### Softmax units

• softmax units has the probability of turning on given by: $p_j = \frac{e^{x_j}}{\sum_{i=1}^Ke^{x_i}}$
• softmax is the appropriate way to deal with a quantity that has K alternative values which are not ordered in any way.

#### Gaussian visible units

• Replace the binary visible units by linear units with independent Gaussian noise. The energy function then becomes:

$E(\mathbf{v},\mathbf{h}) = \sum\limits_{i \in Vis} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum\limits_{j \in hid} b_jh_j - \sum\limits_{i,j} \frac{v_i}{\sigma_i} h_jw_{ij}$

• The learning rate needs to be about one or two orders of magnitude smaller than when using binary visible units.

#### Rectiﬁed linear units

• Histograms of the weights, the visible biases and the hidden biases are very useful.
• It is useful to examine histograms of the increments to these parameters when they are updated (but not after each update).
• When there is (spatial or temporal) structure in the visible units it is very helpful to display, for each hidden unit, the weights connecting that hidden unit to the visible units. Gray-scale displays of receptive fields are usually better.
• For a single minibatch, it is very useful to see a two-dimensional, gray-scale display with a range of [0,1] that shows the probability of each binary hidden unit on each training case in a mini-batch.
• The distribution of each visible unit is deﬁned by a parabolic log likelihood function:

$- log~p(\mathbf{v},\mathbf{h}) = \sum\limits_i \frac{(v_i - c_i)^2}{2\sigma_i^2} - \sum\limits_j b_jh_j - \sum\limits_{i,j} \frac{v_i}{\sigma_i} h_jw_{ij} + const$

• In practice, we renormalize the data to have zero mean and unit variance.
• The conditional distributions (assuming $\sigma_i = 1$) are:

$p(h_j = 1 |\mathbf{v}) = f(b_j + \sum\limits_i v_iw_{ij})$ $p(v_i|\mathbf{h}) = \mathscr{N}(c_i + \sum\limits_j h_jw_{ij},1)$

• Where $f(\cdot)$ is the logistic function and $\mathscr{N}(\mu,V)$ is a Gaussian, $b_j$ and $c_i$ are the biases for hidden and visible units and $w_{ij}$ the weights.
• Learning rule for hidden biases is simply: $\Delta b_j \propto \langle h_j \rangle_{data} - \langle h_j \rangle_{recon}$

#### Conditional RBM model

• We can model temporal dependencies by treating the visible variables in the previous time slice as additional fixed inputs.
• Adding 2 types of directed connections:
1. autoregressive connections from the past n conﬁgurations (time steps) of the visible units to the current visible conﬁguration.
2. connections from the past m visibles to the current hidden conﬁguration
• ⇒ These connections turn the RBM into a conditional RBM (CRBM)
• In experiments: n = m = 3
• Additional data in this CRBM training is simply treated as a dynamic bias
• Learning rule for the temporal connections that determine the dynamically changing hidden unit biases is:

$$$\Delta d_{ij}^{(t-q)} \propto v_i^{t-q} (\langle h_j^t \rangle_{data} - \langle h_j^t \rangle_{recon})$$\label{dir_v_h}$

• with $d_{ij}^{(t-q)}$ : log-linear parameter (weight) connecting visible unit i at time t − q to hidden unit j for q = 1..n
• Similarly, the learning rule for the autoregressive connections that determine the dynamically changing visible unit biases is:

$$$\Delta a_{ki}^{(t-q)} \propto v_k^{t-q} (v_i^t - \langle v_i^t \rangle_{recon})$$\label{dir_v_v}$

• with $a_{ki}^{(t-q)}$ : is the weight from visible unit k at time t − q to visible unit i.
• The autoregressive weights can model short-term temporal structure very well, leaving the hidden units to model longer-term, higher level structure.
• We do not attempt to model the ﬁrst n frames of each sequence
• The updates are only conditional on the past n time steps, not the entire sequence.
• To save computer memory, time frames are not actually replicated in mini-batches; we simply use indexing to simulate the “chunking” of frames.

#### Approximations

• ⇒ some approximations taken when applying the update rules.
• Python implementation of CRBM training: https://gist.github.com/gwtaylor/2505670
• The model can be used to generate walking/running sequences
• A 2 level model was also successfully trained to generate transitions between walking and running.
• Once we have trained the model, we can add layers like in a Deep Belief Network [14]. The previous layer CRBM is kept, and the sequence of hidden state vectors, while driven by the data, is treated as a new kind of “fully observed” data.
• If we augment the data with some static skeletal and identity parameters (in essence mapping a person’s unique identity to a set of features), we should be able to use the same generative model for many different people, and generalize individual characteristics from one type of motion to another.

# Recurrent Neural Networks

• RNN very powerfull because of 2 properties:
1. Distributed hidden state that allows them to store a lot of information about the past efficiently
2. Non-linear dynamics that allows them to update their hidden state in complicated ways.
3. Deep ones work even better.
* RNN allow us to operate over **sequences** of vectors.
* Check http://deepmind.com/
• Written as a class, the RNN's API consists of a single step function:
rnn = RNN()
y = rnn.step(x) # x is an input vector, y is the RNN's output vector
• Example implementation with a single hidden vector h with vanilla RNN:
class RNN:
# ...
def step(self, x):
# update the hidden state
self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
# compute the output vector
y = np.dot(self.W_hy, self.h)
return y
• We can stack multiple RNNs with:
y1 = rnn1.step(x)
y = rnn2.step(y1)
• From Jurgen Schmidhuber
• Sequential problems:
1. Control of attention
2. Sequence recognition
3. Motor control
4. Almost every real world task
• Other sequence learners:
1. Hidden Markov Models: discrete, cannot store real values.
2. Symbolic approaches
3. Heuristic program search: no direction for search in algorithm space
4. Universal search
5. Fastest algorithm
6. Optimal ordered problem solver
• LSTM is about “error carrousels”, “cells” and “blocks” ?