Blog


Results of the Caviar strategy with Early Stopping in single-neuron training - an analysis of the Coursera cat detector.

started: March 20, 2018
last modified: September 16, 2018


Here you'll find the results of my own experiments inspired by a programming assignment from the first course of the Coursera Deep Learning specialization - the cat detector.

Initial play with hyperparameters and - later - a more systematic search led me to much better accuracy and some interesting observations.

If you are not familiar with the Caviar strategy and Early Stopping, I recommend the following videos:
Hyperparameters tuning in practice: Pandas vs. Caviar
Other regularization methods

You'll find all the necessary files here: https://github.com/HALINA9000/ai.exp.0001.cat_detector



Contents

0. Loading libraries and datasets
1. Description of the original problem
2. Optimal batch size
3. Original programming assignment in Keras and its results
4. Original assignment with random initialization
5. Comparison of results
6. Learning rate tuning
7. Random sampling of hypersurface
8. Comparison of results with all related tasks from the course
9. Analysis of best weights




0. Loading libraries and datasets

Before we go any further, let's load the necessary libraries. Then load both the training and test datasets.
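For completeness, here is a minimal sketch of that step. It assumes the dataset file names and HDF5 keys used by the original Coursera assignment; adjust the paths if your copy of the repository differs.

    import numpy as np
    import h5py

    # Load the Coursera cat/non-cat datasets (file and key names as in
    # the original assignment).
    with h5py.File('datasets/train_catvnoncat.h5', 'r') as f:
        X_train = np.array(f['train_set_x'])  # (209, 64, 64, 3) RGB images
        y_train = np.array(f['train_set_y'])  # (209,) labels: 1 = cat

    with h5py.File('datasets/test_catvnoncat.h5', 'r') as f:
        X_test = np.array(f['test_set_x'])    # (50, 64, 64, 3)
        y_test = np.array(f['test_set_y'])    # (50,)

    # Flatten each image to a vector and scale pixels to [0, 1] - the
    # usual preprocessing before feeding a single neuron.
    X_train = X_train.reshape(X_train.shape[0], -1) / 255.
    X_test = X_test.reshape(X_test.shape[0], -1) / 255.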




1. Description of the original problem

The initial task was to teach a neuron to recognize a cat in an image. It was just a single neuron with a sigmoid activation function. Both the train and test image datasets were provided by Coursera. All images are RGB, 64 × 64 pixels.

A quick look at what we have inside both datasets:

Cat images in training set: 34% (72/209).
Cat images in testing set: 66% (33/50).
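These proportions can be read straight off the label vectors loaded above; a quick sketch:

    n_train_cats = int(y_train.sum())
    n_test_cats = int(y_test.sum())
    print('Cat images in training set: {:.0%} ({}/{})'.format(
        n_train_cats / float(y_train.size), n_train_cats, y_train.size))
    print('Cat images in testing set: {:.0%} ({}/{})'.format(
        n_test_cats / float(y_test.size), n_test_cats, y_test.size))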

So we have a small number of images. Additionally, the proportion of cat vs. non-cat images differs between the two datasets. That's a bit surprising and troublesome - what is the baseline accuracy in this case? A model that always predicts "non-cat" would score 66% on the training set but only 34% on the test set. But let's continue with no changes.




2. Optimal batch size

Now it is worth determining the optimal, i.e. the most time-efficient, batch size.

batch size      time
----------  --------
       256    6.3744
       128    7.0324
        64    7.9375
        32    8.4335

Most efficient batch size is: 256

With a batch size of 256 samples, training needs only 76% of the time required with the Keras default of 32.
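The measurement itself can be done with a simple loop like the sketch below. The build_model() helper is assumed here and defined in the next section's sketch; the number of epochs used for timing is my assumption, not the exact protocol from the repository.

    import time

    timings = {}
    for batch_size in (256, 128, 64, 32):
        model = build_model()  # fresh model for a fair comparison
        start = time.time()
        model.fit(X_train, y_train, epochs=100,
                  batch_size=batch_size, verbose=0)
        timings[batch_size] = time.time() - start

    best = min(timings, key=timings.get)
    print('Most efficient batch size is:', best)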




3. Original programming assignment in Keras and its results

Let's recreate the original task from the course but in Keras!
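The exact code lives in the linked repository; a minimal sketch of such a model could look like this, with zero initialization and learning rate 0.005 as in the assignment:

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import SGD

    def build_model(kernel_initializer='zeros', lr=0.005):
        # A single sigmoid neuron over the flattened 64 * 64 * 3 pixel vector.
        model = Sequential()
        model.add(Dense(1, activation='sigmoid',
                        kernel_initializer=kernel_initializer,
                        input_shape=(64 * 64 * 3,)))
        model.compile(optimizer=SGD(lr=lr),
                      loss='binary_crossentropy',
                      metrics=['accuracy'])
        return model

    # With 209 training samples and batch size 256, every epoch is a
    # single full-batch gradient step, so 2000 epochs correspond to the
    # course's 2000 iterations.
    model = build_model()
    history = model.fit(X_train, y_train, epochs=2000, batch_size=256,
                        validation_data=(X_test, y_test), verbose=0)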

Original course assignment results

OK. It's done. After 2000 iterations with learning rate = 0.005 we have almost 100% accuracy on the training set (blue line) and 70% on the test set (orange line). That means we have correctly recreated the original assignment from the course.

But the chart above provides us with a few additional observations:

  • Overfitting

    It is easy to notice that after an initial period, accuracy on the training set goes up while accuracy on the test set goes down. The difference is almost 30 percentage points.

  • Instability

    Up to roughly iteration 350, both curves behave very chaotically.

  • Similar values at an early stage

    Both accuracies reach a very similar value (about 80%) near iteration 250.




4. Original assignment with random initialization

And now with random initialization:

Course assignment results with random initialization

The chart looks similar to, but not identical with, the previous one. The common features are overfitting and chaotic behavior in the early epochs.




5. Comparison of results

It's time to explain the role of the custom callback BestAccs.

It was created to find and save, as an h5 file, the weight values whenever the model has good accuracy on both the training and the test set. An additional criterion is a small difference between the two accuracies. In both cases above, good accuracy means greater than or equal to 0.7, with a maximum difference of 0.02.

In other words: it performs Early Stopping. It works in a similar way to the standard Keras callback ModelCheckpoint, but ModelCheckpoint takes only one of the two accuracies into consideration and applies no minimum accuracy threshold before saving weights.
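The actual implementation is in the linked repository; a minimal sketch of the idea might look like this (note the file-name format, explained below):

    from keras.callbacks import Callback

    class BestAccs(Callback):
        # Save weights whenever both accuracies clear a floor and stay
        # close to each other.
        def __init__(self, min_acc=0.7, max_diff=0.02, tag='zeros'):
            super(BestAccs, self).__init__()
            self.min_acc = min_acc
            self.max_diff = max_diff
            self.tag = tag

        def on_epoch_end(self, epoch, logs=None):
            # Keras 2.x of that era reports 'acc'/'val_acc'; newer
            # versions use 'accuracy'/'val_accuracy' instead.
            acc, val_acc = logs['acc'], logs['val_acc']
            if (min(acc, val_acc) >= self.min_acc
                    and abs(acc - val_acc) <= self.max_diff):
                self.model.save_weights('{:.2f}-{:.2f}-{:.2f}-{}.h5'.format(
                    min(acc, val_acc), acc, val_acc, self.tag))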

Let's find the best files and corresponding weights for 'zeros' and 'random_uniform' initialization. Then let's check the distance between them.

Best weight with zero initialization in 0.78-0.78-0.78-zeros.h5 file.
Best weight with random initialization in 0.78-0.79-0.78-random_uniform.h5 file.

The second number in each file name is the accuracy on the training set, the third number the accuracy on the test set. The first number is their minimum. This form of file name will be very useful soon.

In both cases we have quite similar accuracy. Do both vectors (kernels) point to the same minimum?
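The measurements below follow from basic linear algebra; a sketch, assuming the kernels have already been extracted from the two h5 files:

    import numpy as np

    def compare_kernels(w_a, w_b):
        # Norms of two flattened kernel vectors, norm of their
        # difference, and the angle between them (in radians).
        w_a, w_b = w_a.ravel(), w_b.ravel()
        cos_angle = (np.dot(w_a, w_b)
                     / (np.linalg.norm(w_a) * np.linalg.norm(w_b)))
        return (np.linalg.norm(w_a), np.linalg.norm(w_b),
                np.linalg.norm(w_a - w_b),
                np.arccos(np.clip(cos_angle, -1., 1.)))

    # A kernel can be extracted by loading saved weights into the model:
    # model.load_weights('0.78-0.78-0.78-zeros.h5')
    # w_zeros = model.get_weights()[0]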

Norm of kernel with zeros initialization:  0.5559
Norm of kernel with random initialization: 3.2183
Norm of vector difference between them: 3.1710
Angle between kernels (rad): 1.3990

It seems we have two different minima. It shouldn't be a surprise. A cat is a complex enough creature to have more than one set of features that distinguish it from other objects.




6. Learning rate tuning

Let's find out which learning rate is best for our purpose, i.e. which one gives the most chaotic accuracy chart. A noisy trajectory visits many distinct points on the hypersurface, which gives the BestAccs callback more candidates to record (a sketch of the sweep follows the charts below).

Learning rate 0.1
Learning rate 0.01
Learning rate 0.001
Learning rate 0.0001
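The charts above come from a sweep like the following sketch, reusing the build_model() helper from section 3 (the initialization used in the actual runs is my assumption):

    histories = {}
    for lr in (0.1, 0.01, 0.001, 0.0001):
        model = build_model(lr=lr)
        histories[lr] = model.fit(X_train, y_train, epochs=2000,
                                  batch_size=256,
                                  validation_data=(X_test, y_test),
                                  verbose=0)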



7. Random sampling of hypersurface

A learning rate equal to 0.1 does produce the noisiest chart. OK, let's proceed with that value!
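The sampling loop itself is simple: reinitialize the model with fresh random weights, train with the noisy learning rate, and let BestAccs record every good point visited along the way. A hypothetical sketch, reusing build_model() and BestAccs from earlier (the iteration count and the 0.7 accuracy floor are my assumptions):

    for iteration in range(1000):
        model = build_model(kernel_initializer='random_uniform', lr=0.1)
        model.fit(X_train, y_train, epochs=2000, batch_size=256,
                  validation_data=(X_test, y_test), verbose=0,
                  callbacks=[BestAccs(min_acc=0.7, max_diff=0.02,
                                      tag='iteration-{}'.format(iteration))])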

I ran such sampling many times, which means the hypersurface has been sampled many thousands of times. And the best result is:

0.92-0.93-0.92-iteration-156.h5

In fact this file has slightly different accuracy values. This is due to the moment at which accuracy was measured: the training accuracy that Keras reports to the BestAccs callback is computed on the fly during the epoch, while the weights are still changing, so re-evaluating the saved weights afterwards gives slightly different numbers.

Accuracy on training set: 0.947
Accuracy on test set: 0.920

It is worth emphasizing that these results were obtained with no regularization method except Early Stopping, i.e. no L1 or L2 regularization and no data augmentation.




8. Comparison of results with all related tasks from the course

Later in the course, more complex networks were trained as cat detectors:

  • 2-layer network with 7 neurons in the 1st layer and 1 neuron in the 2nd layer
  • 4-layer network with 20 neurons in the 1st layer, then 7, 5, and finally 1

Below is a summary of the results.

dataset    | 1 neuron (1)    | 1 neuron (1)    | 2 layers (7, 1) | 4 layers (20, 7, 5, 1)
           | Caviar strategy | original course | original course | original course
           |                 | assignment      | assignment      | assignment
-----------|-----------------|-----------------|-----------------|-----------------------
training   | 0.947           | 0.990           | 1.000           | 0.986
test       | 0.920           | 0.700           | 0.720           | 0.800



9. Analysis of best weights

Now let's choose the weights with accuracy >= 90% on both the training and test datasets.

Angles between vectors (rad):
  minimum: 1.5215
  maximum: 1.5655
Norms of vectors:
  minimum: 64.0330
  maximum: 65.2384
Norms of differences between vectors:
  minimum: 89.1946
  maximum: 91.3843
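These statistics come from pairwise comparisons of the selected kernels; a sketch, assuming best_kernels is a list of kernel vectors extracted from the chosen h5 files and reusing compare_kernels() from section 5:

    from itertools import combinations
    import numpy as np

    angles, diffs = [], []
    for w_a, w_b in combinations(best_kernels, 2):
        _, _, diff, angle = compare_kernels(w_a, w_b)
        angles.append(angle)
        diffs.append(diff)
    norms = [np.linalg.norm(w.ravel()) for w in best_kernels]

    print('Angles (rad): min {:.4f}, max {:.4f}'.format(min(angles), max(angles)))
    print('Norms:        min {:.4f}, max {:.4f}'.format(min(norms), max(norms)))
    print('Differences:  min {:.4f}, max {:.4f}'.format(min(diffs), max(diffs)))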

That's a serious surprise for me. I expected simply different points on the analyzed hypersurface, but we have something more: all vectors have a very similar norm (length) and are almost orthogonal to each other (for reference, a right angle is π/2 ≈ 1.5708 rad)! And I have no idea what that means. Not yet.


To be continued...