I am using word embeddings from FastText.
I converted the words to index numbers using the FastText vocabulary, then used a non-trainable TensorFlow Embedding layer to map those indices to word vectors via the pre-trained FastText embeddings.
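For context, this is roughly the setup I mean: a minimal sketch of a frozen Embedding layer (the sizes and the random matrix here are placeholders; in the real model the rows come from the pre-trained FastText vectors).

```python
import numpy as np
import tensorflow as tf

# Placeholder sizes and values; in practice the matrix rows are the
# pre-trained FastText vectors for each vocabulary index.
vocab_size, embedding_dim = 100, 8
embedding_matrix = np.random.rand(vocab_size, embedding_dim).astype("float32")

embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # frozen: the FastText vectors are not updated
)

indices = tf.constant([[1, 5, 42, 0]])  # one padded sequence of word indices
vectors = embedding_layer(indices)      # shape (1, 4, 8): one vector per index
print(vectors.shape)
```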
The labels are multi-hot encoded (there are 100 labels).
The output activation is sigmoid and the loss is binary crossentropy, as that is the combination many sources recommend for multi-label classification.
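For reference, here is a plain NumPy sketch of what that loss computes on a multi-hot target (the values are made up; Keras's own implementation differs only in reduction details):

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Mean per-label binary cross-entropy: each of the labels is treated
    # as an independent yes/no decision against the sigmoid output.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# One example with 4 hypothetical labels, two of them active.
y_true = np.array([[1, 0, 0, 1]])
y_pred = np.array([[0.9, 0.1, 0.2, 0.8]])
print(round(binary_crossentropy(y_true, y_pred), 3))  # → 0.164
```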
I have just split the train/validation/test sets randomly for now, so the splits do not take the labels into account (no stratification).
When I train the CNN, the "accuracy" gets to 0.99 very quickly and the loss is low.
At the end of each epoch the precision, recall and F1 scores gradually improve, then plateau at around 0.35.
The predictions are poor: the maximum probability from the sigmoid output is often as low as 6%, so the network does not seem to be properly trained. And with the accuracy already high and the loss already low, any further training progresses very slowly anyway.
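One thing worth checking: with 100 labels and only a few active per example, per-slot binary accuracy is dominated by the zeros, so a model that predicts nothing can still score very high. A quick illustration (the 3% label density is an assumption, not taken from my data):

```python
import numpy as np

# Assume each example has roughly 3 active labels out of 100
# (an illustrative sparsity, not the real dataset).
rng = np.random.default_rng(0)
y_true = (rng.random((1000, 100)) < 0.03).astype(int)

# A "model" that predicts no label at all, everywhere:
y_pred = np.zeros_like(y_true)

# Keras-style binary accuracy: fraction of label slots guessed correctly.
accuracy = (y_true == y_pred).mean()
print(accuracy)  # ~0.97 without learning anything
```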
As there is a fairly large skew in how often each label occurs, I have used the class_weight parameter when fitting, to try to assist the training.
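For what it's worth, this is the kind of inverse-frequency weighting I mean (the tiny label matrix is made up). Note that Keras's class_weight keys refer to classes, which is ambiguous for multi-hot targets, so a per-label weight applied inside a custom loss may be an alternative worth checking:

```python
import numpy as np

# Hypothetical multi-hot label matrix: 4 examples, 3 labels.
y_train = np.array([[1, 0, 0],
                    [1, 1, 0],
                    [1, 0, 1],
                    [1, 0, 0]])

n_samples, n_labels = y_train.shape
pos_counts = y_train.sum(axis=0)  # positives per label: [4, 1, 1]

# One common heuristic: weight each label inversely to its frequency,
# normalized so the weights average to 1 across labels.
label_weight = n_samples / (n_labels * pos_counts)
print(label_weight)
```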
Does anyone have experience that could point me to where I should start investigating? There are so many things to twiddle!
Perhaps a transfer-learning expert will have some ideas.