Is reinforcement learning (RL) a supervised learning technique or an unsupervised one? Or is it somewhere in between?
It seems to be supervised in the sense that (I think) you need to reward good outcomes and penalise bad ones, so there is some goal to aim for.
However, the supervision seems to be much less precise than in other supervised techniques, where you label each training example with a "correct answer".
At the same time, it seems to be more than unsupervised learning, since you are giving the algorithm some guidance.
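
To make the distinction I am asking about concrete, here is a minimal sketch of the two kinds of feedback as I understand them. Everything in it is my own illustration and may be wrong: the function names (`supervised_feedback`, `reward_feedback`, `learn_from_rewards`), the three-action toy problem, the +1/-1 rewards, and the epsilon-greedy rule are assumptions, not taken from any particular library or textbook.

```python
import random

def supervised_feedback(prediction, correct_label):
    """Supervised-style signal: the learner is handed the right answer itself.

    The prediction is ignored here on purpose: the label is available
    no matter what the learner guessed.
    """
    return correct_label

def reward_feedback(action, best_action):
    """Reward-style signal: only a score for the action that was tried.

    The learner is never told what best_action is (hypothetical +1/-1 rewards).
    """
    return 1.0 if action == best_action else -1.0

def learn_from_rewards(n_actions=3, best_action=2, steps=500, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error using only reward_feedback."""
    rng = random.Random(seed)
    values = [0.0] * n_actions   # running average reward per action
    counts = [0] * n_actions     # how often each action has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)                       # explore
        else:
            action = max(range(n_actions), key=lambda a: values[a]) # exploit
        reward = reward_feedback(action, best_action)
        counts[action] += 1
        # Incremental mean update of the estimated value of this action.
        values[action] += (reward - values[action]) / counts[action]
    return values

if __name__ == "__main__":
    values = learn_from_rewards()
    print("estimated values:", values)
    print("preferred action:", max(range(len(values)), key=lambda a: values[a]))
```

In this sketch the reward-driven learner has to try actions to find out which one scores well, whereas the supervised signal would simply state the answer. That gap is what makes me unsure which category RL belongs to.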