For almost three months now I have been trying to build and train a convolutional network that will recognise chess puzzles, but I don’t feel I am any closer to succeeding than I was at the start of September, and so I wonder if I should just give up.
The network itself is built, and as far as I can see, works except for the fact that I just cannot get it to converge on the training set.
The (learning) code is here: https://github.com/mcmenaminadrian/ChessNet/tree/learning
The training set is here: https://github.com/mcmenaminadrian/ChessNet/tree/learning/images
There are 25 possible classes of outcome – from an empty white square to a black square hosting a black king, and the network outputs a value between -1 (no match) and 1 (perfect match).
There are 25 convolutional fibres, each with seven layers, going from a 100 x 100 input layer to a final filter (feature map) of 88 x 88. These final feature maps are then fully connected to 25 output neurons (there is no pooling layer). As you can see, that means there are 88 x 88 x 25 x 25 + 25 (4.84 million, plus 25 for bias) connections at the final, fully connected, layer (or alternatively each output neuron has 193601 input connections).
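As a sanity check on that arithmetic, here is a minimal Python sketch (an illustration only, not code from the repository). It assumes that the seven layers include the input layer, so there are six convolution steps, and that each step uses a 3 x 3 kernel with no padding, which is one way of getting from 100 x 100 to 88 x 88. The names are hypothetical.

```python
def unpadded_conv_size(size, kernel, steps):
    """Side length of a square feature map after `steps` unpadded convolutions."""
    for _ in range(steps):
        size = size - kernel + 1
    return size


INPUT_SIZE = 100   # 100 x 100 input image
KERNEL = 3         # assumption: 3 x 3 kernels
CONV_STEPS = 6     # assumption: seven layers counting the input layer
FIBRES = 25        # convolutional fibres
OUTPUTS = 25       # output neurons, one per class

final_size = unpadded_conv_size(INPUT_SIZE, KERNEL, CONV_STEPS)
weights_per_output = final_size * final_size * FIBRES   # excludes the bias
total_weights = weights_per_output * OUTPUTS            # excludes the 25 biases

print("final feature map:", final_size, "x", final_size)
print("inputs per output neuron (with bias):", weights_per_output + 1)
print("fully connected weights (without biases):", total_weights)
```

If those assumptions hold, the figures agree with the ones quoted above: an 88 x 88 final map, 193601 inputs per output neuron and 4.84 million weights.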
Perhaps the issue is that the scale of the fully connected layer dwarfs the output and influence of the feature maps? I don’t know, but what I do know is that, as training goes along, the output neurons generally begin in a low (i.e., close to -1) state and then edge towards a high state, but as they do so they are suddenly overwhelmed and everything returns to an even lower state than before.
Envisaging this as a three dimensional surface, we creep up a steep hillside and then fall down an even deeper hole just as we appear to be getting towards a summit. The problem seems to be that training isn’t really teaching the network to differentiate between any of the training images, it just pushes the network towards a high value. Then, suddenly, images which should be reported as low are reported as high and the error values flood the network on back propagation.
To explain further: in the training set our image X will always be seen relatively infrequently, so most results should be low and are low, with small error values (deltas as they are usually called) – so small that they are generally ignored. The deltas for X are then large and they feed into the network, dragging our response towards high. Eventually we cross a threshold and all the results – for good and bad images – are reported as high, and so there are lots of big deltas which overwhelm the small number of correct positives. At least that is what I think is happening.
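The tug of war described above can be put into rough numbers. This is only a sketch under some loud assumptions, none of which is taken from the actual code: the target is +1 for X and -1 for everything else, the error is simply target minus output, X makes up 1 image in 25, and the neuron gives the same output to every image (that is, it is not discriminating at all). The function name is hypothetical and activation derivatives are ignored.

```python
def summed_errors(output, positives=1, negatives=24):
    """Total (target - output) error from the positive and negative images.

    Assumes targets of +1 and -1 and a neuron that gives the same
    output to every image, i.e. one that has learned no discrimination.
    """
    positive_pull = positives * (1.0 - output)    # drags the output up
    negative_pull = negatives * (-1.0 - output)   # drags the output down
    return positive_pull, negative_pull


for y in (-0.99, -0.92, -0.5, 0.0, 0.5):
    up, down = summed_errors(y)
    print(f"output {y:+.2f}: up {up:+.2f}, down {down:+.2f}, net {up + down:+.2f}")
```

Under these assumptions the two pulls cancel at an output of -23/25 = -0.92. Below that the rare positives win and the output creeps upwards; above it the 24 negatives take over, and by the time the output reaches 0 they outweigh the single positive by 24 to 1. That is at least consistent with the climb and collapse I am seeing, though it is a toy model and not a diagnosis.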
Of course what really should happen is that the network learns to discriminate between the ‘good’ and ‘bad’ images, but that just seems as far away as ever.
Any tips, beyond giving up, gratefully received.