sysid blog

Machine Learning Journey

Cheat Sheet

General Explanations:


  1. Add more data
  2. Use data augmentation
  3. Use architectures that generalize well
  4. Add regularization
  5. Reduce architecture complexity.

Recommendation and Tricks

General Neural Networks
Data Augmentation
Pseudo Labeling, Semi-Supervised Learning
Transfer Learning
Data Leakage/Metadata
Batchnorm, Batch Normalization


Rule of thumb: For a three layer network with n input and m output neurons, the hidden layer would have sqrt(n*m) neurons.

number of hidden layers



Systematic analysis of CNN parameters:


Predict multiple steps
  1. “Function to Function Regression” which assumes that at the end of RNN, we are going to predict a curve. So use a multilayer perceptron at the end of RNN to predict multiple steps ahead. Suppose you have a time series and you want to use its samples from 1, …, t to predict the ones in t+1, …, T. You use an RNN to learn a D dimensional representation for the first part of time series and then use a (D x (T-t)) MLP to forecast the second half of the time series. In practice, you do these two steps in a supervised way; i.e., you learn representations that improve the quality of the forecast.
  2. tbd


The first dimension in Keras is the batch dimension. It can be any size, as long as it is the same for inputs and targets. When dealing with LSTMs, the batch dimension is the number of sequences, not the length of the sequence.

Basic timeseries data has an input shape (number of sequences, steps, features). Target is (number of sequences, steps, targets). Use an LSTM with return_sequences.


One-to-one: equivalent to MLP.

model.add(Dense(output_size, input_shape=input_shape))

One-to-many: this option is not supported well, but this is a workaround:

model.add(RepeatVector(number_of_times, input_shape=input_shape))
model.add(LSTM(output_size, return_sequences=True))


model = Sequential()
model.add(LSTM(n, input_shape=(timesteps, data_dim)))

Many-to-many: This is the easiest snippet when length of input and output matches the number of reccurent steps:

model = Sequential()
model.add(LSTM(n, input_shape=(timesteps, data_dim), return_sequences=True))

Many-to-many when number of steps differ from input/output length: this is hard in Keras. I did not find any code snippets to code that.



Momentum (0.9)

For NN’s,the hypersurface defined by our loss function often includes saddle points. These are areas where the gradient of the loss function often becomes very small in one or more axes, but there is no minima present. When the gradient is very small, this necessarily slows the gradient descent process down; this is of course what we desire when approaching a minima, but is detrimental otherwise. Momentum is intended to help speed the optimisation process through cases like this, to avoid getting stuck in these “shallow valleys”.

Momentum works by adding a new term to the update function, in addition to the gradient term. The added term can be thought of as the average of the previous gradients. Thus if the previous gradients were zig zagging through a saddle point, their average will be along the valley of the saddle point. Therefore, when we update our weights, we first move opposite the gradient. Then, we also move in the direction of the average of our last few gradients. This allows us to mitigate zig-zagging through valleys by forcing us along the average direction we’re zig-zagging towards.


Adagrad is a technique that adjusts the learning rate for each individual parameter, based on the previous gradients for that parameter. Essentially, the idea is that if previous gradients were large, the new learning rate will be small, and vice versa.

The implementation looks at the gradients that were previously calculated for a parameter, then squares all of these gradients (which ignores the sign and only considers the magnitude), adds all of the squares together, and then takes the square root (otherwise known as the l2-norm). For the next epoch, the learning rate for this parameter is the overall learning rate divided by the l2-norm of prior updates. Therefore, if the l2-norm is large, the learning rate will be small; if it is small, the learning rate will be large.

Conceptually, this is a good idea. We know that typically, we want to our step sizes to be small when approaching minima. When they’re too large, we run the risk of bouncing out of minima. However there is no way for us to easily tell when we’re in a possible minima or not, so it’s difficult to recognize this situation and adjust accordingly. Adagrad attempts to do this by operating under the assumption that the larger the distance a parameter has traveled through optimization, the more likely it is to be near a minima; therefore, as the parameter covers larger distances, let’s decrease that parameter’s learning rate to make it more sensitive. That is the purpose of scaling the learning rate by the inverse of the l2-norm of that parameter’s prior gradients.

The one downfall to this assumption is that we may not actually have reached a minima by the time the learning rate is scaled appropriately. The l2-norm is always increasing, thus the learning rate is always decreasing. Because of this the training will reach a point where a given parameter can only ever be updated by a tiny amount, effectively meaning that parameter can no longer learn any further. This may or may not occur at an optimal range of values for that parameter.

Additionally, when updating millions of parameters, it becomes expensive to keep track of every gradient calculated in training, and then calculating the norm.


very similar to Adagrad, with the aim of resolving Adagrad’s primary limitation. Adagrad will continually shrink the learning rate for a given parameter (effectively stopping training on that parameter eventually). RMSProp however is able to shrink or increase the learning rate.

RMSProp will divide the overall learning rate by the square root of the sum of squares of the previous update gradients for a given parameter (as is done in Adagrad). The difference is that RMSProp doesn’t weight all of the previous update gradients equally, it uses an exponentially weighted moving average of the previous update gradients. This means that older values contribute less than newer values. This allows it to jump around the optimum without getting further and further away.

Further, it allows us to account for changes in the hypersurface as we travel down the gradient, and adjust learning rate accordingly. If our parameter is stuck in a shallow plain, we’d expect it’s recent gradients to be small, and therefore RMSProp increases our learning rate to push through it. Likewise, when we quickly descend a steep valley, RMSProp lowers the learning rate to avoid popping out of the minima.


Adam (Adaptive Moment Estimation) combines the benefits of momentum with the benefits of RMSProp. Momentum is looking at the moving average of the gradient, and continues to adjust a parameter in that direction. RMSProp looks at the weighted moving average of the square of the gradients; this is essentially the recent variance in the parameter, and RMSProp shrinks the learning rate proportionally. Adam does both of these things - it multiplies the learning rate by the momentum, but also divides by a factor related to the variance.



Problem Frameing

Time Series
Sentiment Analysis

LTSM: input sequence -> classification

Anomaly Detection

nietsche: come with a sequence and let it predict an hour into the future and look when it falls outside


it is ordered data -> 1D convolution each word of our 5000 categories is converted in a vector of 32elements model learns the 32 floats to be semantically significant embeddings can be passed, not entire models (pretrained word embeddings) word2vec (Google) vs. glove

Model Examples

### Keras 2.0 Merge
# Custom Merge:
def euclid_dist(v):
    return (v[0] - v[1])**2

def out_shape(shapes):
    return shapes[0]

merged_vector = Lambda(euclid_dist, output_shape=out_shape)([l1, l2])

mix = Input(batch_shape=(sequences, timesteps, features))
lstm = LSTM(features, return_sequences=True)(LSTM(features, return_sequences=True)(mix))
tdd1 = TimeDistributed(Dense(features, activation='sigmoid'))(lstm)
tdd2 = TimeDistributed(Dense(features, activation='sigmoid'))(lstm)
voice = Lambda(function=lambda x: mask(x[0], x[1], x[2]))(merge([tdd1, tdd2, mix], mode='concat'))
background = Lambda(function=lambda x: mask(x[0], x[1], x[2]))(merge([tdd2, tdd1, mix], mode='concat'))
model = Model(input=[mix], output=[voice, background])
model.compile(loss='mse', optimizer='rmsprop')

### Bidirectional RNN
xin = Input(batch_shape=(batch_size, seq_size), dtype='int32')
xemb = Embedding(embedding_size, mask_zero=True)(xin)

rnn_fwd1 = LSTM(rnn_size, return_sequence=True)(xemb)
rnn_bwd1 = LSTM(rnn_size, return_sequence=True, go_backwards=True)(xemb)
rnn_bidir1 = merge([rnn_fwd1, rnn_bwd1], mode='concat')

predictions = TimeDistributed(Dense(output_class_size, activation='softmax'))(rnn_bidir1)

model = Model(input=xin, output=predictions)

### Multi Label Classification
# Build a classifier optimized for maximizing f1_score (uses class_weights)

clf = Sequential()

clf.add(Dense(xt.shape[1], 1600, activation='relu'))
clf.add(Dense(1600, 1200, activation='relu'))
clf.add(Dense(1200, 800, activation='relu'))
clf.add(Dense(800, yt.shape[1], activation='sigmoid'))

clf.compile(optimizer=Adam(), loss='binary_crossentropy'), yt, batch_size=64, nb_epoch=300, validation_data=(xs, ys), class_weight=W, verbose=0)

preds = clf.predict(xs)

preds[preds>=0.5] = 1
preds[preds<0.5] = 0

print f1_score(ys, preds, average='macro')

Principal Component Analysis (unsupervised)

    pca = decomposition.PCA()

    # As we can see, only the 2 first components are useful
    pca.n_components = 2
    X_reduced = pca.fit_transform(X)


many more, which I do not remember…

#python #learning #ai #machine learning