1) In your example you use System.out.println; every implementation I have seen synchronizes internally, so your code is effectively synchronized anyway (I appreciate that was not your point).
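A minimal sketch of that point (the class name is mine; the internal locking is an implementation detail of the common JDK PrintStream builds, not a language-level guarantee):

    public class PrintlnIsSynchronized {
        public static void main(String[] args) throws InterruptedException {
            Runnable task = () -> {
                for (int i = 0; i < 5; i++) {
                    // Each call is serialized through System.out's internal lock in the
                    // JDK implementations I have seen, even with no explicit synchronization here.
                    System.out.println(Thread.currentThread().getName() + " -> " + i);
                }
            };
            Thread t1 = new Thread(task, "t1");
            Thread t2 = new Thread(task, "t2");
            t1.start();
            t2.start();
            t1.join();
            t2.join();
        }
    }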
2)
"I don't want to perform synchronization if I don't have to. Accessing a volatile variable incurs half of the cost of a synchronized block."
There is no guaranteed cost for either volatile or the synchronized keyword; indeed, in extreme scenarios the JVM is allowed to optimize them away to nothing (not here).
When talking about "synchronization" and performance you are interested in three things: loss of potential compiler optimizations, cache flushes (memory fences), and lock acquisition (often only the first is considered). The JVM authors have put a lot of effort into excellent optimizations around all three; lock acquisition in particular has many tricks that make it less expensive than you'd think.
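To make that concrete, a sketch of the flavours under discussion (class and method names are mine; none of the costs noted in the comments are guaranteed by the spec, and the JIT may do better than you expect):

    import java.util.concurrent.atomic.AtomicLong;

    public class CounterFlavours {

        private long plain;                                  // no fences, freely optimizable, no cross-thread guarantees
        private volatile long volatileValue;                 // restricts reordering and adds memory fences; ++ is still not atomic
        private long guarded;                                // guarded by this object's monitor
        private final AtomicLong atomic = new AtomicLong();  // fences plus CAS, no lock

        void incrementPlain()    { plain++; }                   // lost updates possible under contention
        void incrementVolatile() { volatileValue++; }           // read-modify-write race: visibility without atomicity
        synchronized void incrementGuarded() { guarded++; }     // adds lock acquisition (the JVM's locking tricks apply here)
        void incrementAtomic()   { atomic.incrementAndGet(); }  // lock-free where the hardware supports CAS
    }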
3) The atomic classes may make use of CAS (lockless) instructions where the hardware provides them, and the JVM implementation can exploit the fact that the actual hardware is more strongly ordered than the weak model Java describes (the weak model is what keeps the code portable). Although a processor may/will have caches, those caches can and often do communicate with each other (between CPUs), so much stronger memory ordering is observed than you might expect (e.g. Intel x86 is quite strong). The Java memory model is more like a worst-case scenario and would formally be described as weak.
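As an illustration of the lockless path, this is roughly the CAS retry loop such classes build on (a sketch against the public AtomicInteger API, not the actual JDK source):

    import java.util.concurrent.atomic.AtomicInteger;

    public class CasIncrement {

        // What incrementAndGet does conceptually: no lock is taken, the CAS either
        // succeeds or we re-read and retry.
        static int incrementAndGet(AtomicInteger counter) {
            while (true) {
                int current = counter.get();                 // volatile read
                int next = current + 1;
                if (counter.compareAndSet(current, next)) {  // lockless hardware CAS
                    return next;
                }
                // Another thread updated the value first; loop and retry.
            }
        }

        public static void main(String[] args) {
            AtomicInteger counter = new AtomicInteger(0);
            System.out.println(incrementAndGet(counter)); // prints 1
        }
    }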
Short answer ... just use atomic and you'll be fine ;-) There are a whole host of other performance gotchas even with simple examples, e.g. "false memory sharing" and even plain old garbage collection: I've seen applications grind with too many (badly used) Atomics putting pressure on the GC. Synchronization, by contrast, can be just a flag on a method signature and is GC free (not in any way a normal reason to avoid atomics, obviously).
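A rough sketch of the false-sharing gotcha, using the classic manual-padding trick (field layout and cache-line size are JVM/hardware specific, so treat this as an illustration only):

    public class FalseSharingSketch {

        static final class PaddedCounter {
            long p1, p2, p3, p4, p5, p6, p7;   // padding before the hot field
            volatile long value;               // the field each thread hammers
            long q1, q2, q3, q4, q5, q6, q7;   // padding after the hot field
        }

        public static void main(String[] args) throws InterruptedException {
            PaddedCounter a = new PaddedCounter();
            PaddedCounter b = new PaddedCounter();

            Thread t1 = new Thread(() -> { for (long i = 0; i < 10_000_000L; i++) a.value++; });
            Thread t2 = new Thread(() -> { for (long i = 0; i < 10_000_000L; i++) b.value++; });

            long start = System.nanoTime();
            t1.start(); t2.start();
            t1.join();  t2.join();
            // Without the padding the two hot fields can end up on the same cache line
            // and the cores keep invalidating each other's caches; try removing it and compare.
            System.out.println("took " + (System.nanoTime() - start) / 1_000_000 + " ms");
        }
    }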