
How many ticks does it take to fill an array?

 
Jonathan Graef
Greenhorn
Posts: 11
This is my first post here, so here we go:

I did a benchmark for Java array handling, and it came out looking like I have a rootkit. Does anybody know what Java does behind the scenes when iterating over an array?

Here's the code:



You can run it yourself in the file here: https://drive.google.com/file/d/1dWBYkmzMurApCihs_Y7OiruYQzc1eoue/view?usp=sharing
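Roughly, the shape of what it does is the following (a simplified sketch, not the actual file; treat the clock speed, array size, loop body, and opsPerLoop value as placeholders):

// A simplified sketch, NOT the linked code; cpuGhz, the array size, the loop body,
// and opsPerLoop = 27 are stand-ins for what the real benchmark uses.
public class ArrayTickBench {
    public static void main(String[] args) {
        final double cpuGhz = 2.0;           // assumed advertised clock speed
        final long opsPerLoop = 27;          // hand-counted high-level operations per iteration
        long[] hash = new long[1_000_000];

        long ai = 0;
        long start = System.nanoTime();
        for (long i = 0; i < hash.length; i++) {
            hash[(int) ai] = hash[(int) ai] + i;   // the array read and write under discussion
            ai++;                                  // the real code steps ai differently
        }
        long elapsedNs = System.nanoTime() - start;

        double totalTicks = elapsedNs * cpuGhz;    // nanoseconds * cycles per nanosecond
        double ticksPerOp = totalTicks / (hash.length * (double) opsPerLoop);
        System.out.printf("Ticks per operation: %.1f%n", ticksPerOp);
        System.out.println(ticksPerOp <= 100 ? "Java has performant code!" : "Performance gap");
    }
}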

Here were my results:



I was expecting this, having run similar benchmarks before, but as you can see, my processor has an internal advertisement for 2 gigahertz. I get 5 MHz. Where did they all go?
 
Ron McLeod
Saloon Keeper
Posts: 2943
Here is the code:
 
Marshal
Posts: 67464
Welcome to the Ranch! Thank you, Ron, for finding the complete code.

Jonathan Graef wrote: ... my processor has an internal advertisement for 2 gigahertz. I get 5 MHz. Where did they all go?

Divide your 2GHz by 400 ticks per operation and you will get 5MHz.
I don't know any more about that.
 
Jonathan Graef
Greenhorn
Posts: 11
That's what the code does to compute the number of ticks. However, Intel says modern CPUs can do floating-point math in 1-4 ticks. My code uses integer math, which should be faster. Does it really take 1000+ ticks to access or update an element of an array? What is it doing, mining for bitcoin?
 
Dave Tolls
Rancher
Posts: 4450
Out of curiosity, why do you think there are 27 operations occurring in that loop?
I assume that's why the increment is 27.
 
Jonathan Graef
Greenhorn
Posts: 11
I counted. There are 27 statements, operators, and variable reads. Those roughly correspond to CPU instructions. So what I want to know is: where are the other 396 ticks going, if each instruction takes 4 ticks?

I know Java does some things for you. Array bounds checking. Double pointer. Um... the goto at the end of the loop? I counted that. I honestly can't think of anything else it could be doing. Where are my CPU cycles going in Java?

C has the same performance problem - see https://www.stefankrause.net/wp/?p=4. Has Intel misreported its specs? They do say real performance may vary, but every once in a while they advertise their fancy 1-4 tick instructions. Then there's the clock multiplier, which isn't counted in my benchmark. See your processor family at https://en.wikipedia.org/wiki/Instructions_per_second - an i7 does 20-35 instructions per clock cycle.
 
Carey Brown
Saloon Keeper
Posts: 6772

Jonathan Graef wrote:Here were my results:


...And my results:
What does it mean by "performance gap"?
 
Jonathan Graef
Greenhorn
Posts: 11
You're getting way better performance than I am: 140 ticks per high-level language operation (addition, assignment, variable read, etc.). But why not 1-4, plus a few (5-10) for cleaning up CPU registers?

For example, I counted ai++ as 4 HLL operations: read ai, read the constant 1, add, and assign to ai. That's what the 27 operations per loop means: I counted all of them. But you're getting 4000 ticks per loop, not 27. That's down to a much more reasonable level, but the cutoff I put in the code is <= 100 ticks per HLL operation for no performance gap (see line 91 of the code).

TL;DR: it says "performance gap" if there are more than 100 ticks per thing you can see in the source.
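To illustrate the kind of counting I mean, here is a hypothetical loop with assumed names (not the actual benchmark body), tallied the same way:

long[] hash = new long[1_000_000];
long n = hash.length;
int ai = 0;
for (long i = 0; i < n; i++) {   // read i, read n, compare; then i++: read i, read 1, add, assign
    hash[ai] = hash[ai] + i;     // read ai twice, two array accesses, read i, add, assign
    ai++;                        // read ai, read the constant 1, add, assign: the 4 ops above
}
// Tallying every statement, operator, and variable read like this is how a count near 27 is reached.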
 
Dave Tolls
Rancher
Posts: 4450
What else is your PC doing?
I doubt it's just running this code.

In addition, I don't see any warm-up time for the code in the loop.
So you're also timing the JIT (at a guess; I'm not sure how all that works with newer JVMs now).
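For what it's worth, "warm-up" here just means running the workload untimed a few times so the JIT has already compiled the hot loop before measurement starts. A rough sketch (runBenchmarkLoop is a hypothetical method, not something from the linked file):

// Run the loop untimed a few times so the JIT compiles it...
for (int warmup = 0; warmup < 10; warmup++) {
    runBenchmarkLoop(hash);        // hypothetical method containing the loop under test
}
// ...and only then measure.
long start = System.nanoTime();
runBenchmarkLoop(hash);
long elapsedNs = System.nanoTime() - start;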
 
Jonathan Graef
Greenhorn
Posts: 11
I added a new JIT warmup phase. I also upped the process priority to High under Task Manager.

Link: https://drive.google.com/file/d/1tzz5oesc9gkAG-66GfwTCtePe-q7XXEb/view?usp=sharing

Here were my new results:



As you can see, there has been a performance improvement of around 30%, which brings us up to a better 300 ticks per operation. But that's still a lot. Where's my gigahertz? Arduinos are advertised to be faster than this...
 
Dave Tolls
Rancher
Posts: 4450
Is it doing anything else?
Just making the process "High" doesn't turn off any other processes.
 
Jonathan Graef
Greenhorn
Posts: 11
You may be right. My "background processes" (what they may be, I'm not sure) are eating up 20-50% of my processor. So scale the 10 MHz in my benchmark accordingly and you get 20 MHz. I might be able to get Arduino speeds out of the HotSpot JVM if I ditch the unwanted backwares.
 
Stephan van Hulst
Saloon Keeper
Posts: 11189
I compiled and disassembled my own version of your code. Addresses may be different, and a few load instructions refer to method parameters rather than local variables, but in essence it should be the same if you disassembled your own test.

Addresses 5 to 49 refer to the instructions that are executed every loop iteration. As you can see, those are 28 bytecode instructions. You can reduce them by using int instead of long, because it will eliminate the lcmp at address 8, and the i2l at address 38.
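For anyone who wants to reproduce the disassembly, javap ships with the JDK. A sketch with assumed class, method, and variable names (not the original benchmark):

// Compile and disassemble with the JDK's own tools:
//   javac ArrayLoop.java
//   javap -c ArrayLoop
public class ArrayLoop {
    static void fill(long[] hash, long count) {
        long ai = 0;
        for (long i = 0; i < count; i++) {        // the long comparison compiles to lcmp
            hash[(int) ai] = hash[(int) ai] + 1;  // the array access compiles to laload and lastore
            ai++;
        }
    }
}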

I'm guessing that the most expensive instructions are the laload at address 32 and the lastore at address 40. The first operation loads the value of hash[ai] onto the operand stack, and the second operation assigns a value from the operand stack to hash[ai]. Now, while accessing the array from the L1 cache may very well take 4 processor cycles, accessing it from main memory can take up to hundreds of cycles. You simply cannot predict whether and how your array will be cached. The same goes for your variables i and ai, but it's more likely that the optimizer will keep them loaded in registers.

You made two mistakes: First, you assumed that the number of processor cycles it takes to load and store a variable is comparable to the number of cycles it takes to operate on two loaded values. Instead, there may be two orders of magnitude difference.

Second, and more importantly: while you didn't forget about JIT optimization, your benchmark isn't written correctly to account for it. Don't feel bad; the best programmer in the world can't do this without help. It took a team of experts to write a benchmark framework that gives accurate timings. Take a look at jmh and use it in the future when you write tests like this.
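A minimal jmh benchmark for a loop like this might look roughly like the sketch below. The class name, array size, and loop body are assumptions; it needs the jmh-core and jmh-generator-annprocess dependencies on the classpath and can be run through org.openjdk.jmh.Main or the Maven archetype.

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ArrayFillBenchmark {
    long[] hash;

    @Setup
    public void setup() {
        hash = new long[1_000_000];   // assumed array size
    }

    @Benchmark
    public long[] fillArray() {
        for (int i = 0; i < hash.length; i++) {
            hash[i] = i;              // jmh takes care of warm-up iterations and timing
        }
        return hash;                  // returning the result keeps the JIT from removing the loop
    }
}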
 
Jonathan Graef
Greenhorn
Posts: 11
The 28 bytecode instructions roughly correspond to the high-level language operations I counted in the loop. But a two-orders-of-magnitude difference (say 1000 ticks instead of 10 for each of the two array accesses) adds up to 2 × 1000 + 26 × 10 = 2260 ticks per loop, or about 80 ticks per operation average, in which case you would get "Java has performant code!" at the end of the benchmark.

According to Carey Brown above, a good PC gets ~150 ticks per operation average. DDR3 runs at the same clock speed as the CPU (modern RAM is actually faster), so that shouldn't be the bottleneck. Maybe the school of bus failed or something. But in my opinion, either some extra cruft has been written by the Java devs, or the JIT is compiling late, as you mentioned.

As for tools like jmh, I feel they fail to simulate a real app, as they first warm up the JIT after making any changes to variables, thus making the entire output of the app a fixed value. If I understand correctly.
 
Carey Brown
Saloon Keeper
Posts: 6772

Jonathan Graef wrote:...adds up to 2260 ticks per loop, or 80 ticks per operation average

The benchmark code specifies opsPerLoop as 27. Given this new analysis, what would this be now?
 
Jonathan Graef
Greenhorn
Posts: 11
Stephan said it was 28 Java bytecode instructions per loop. That's off by one from my 27 estimate.

The numbers in my previous post were his estimate in processor ticks, not real results. I added it up and gave it some padding: 10 "real" instructions per Java bytecode instruction (actually, there's a multiplier), instead of 4. My real results were 8318 ticks per loop, ~300 ticks per operation. As you can see, there's a disparity there.
 
Carey Brown
Saloon Keeper
Posts: 6772
I may be a little dense here, but I'm trying to understand how to modify the benchmark code to take this new info into account.

Seems to me that if the array ops are off by a factor or more, then the operations per loop must be affected similarly.
 
Jonathan Graef
Greenhorn
Posts: 11
You need to change the two occurrences of 27l to 28l (one is at the bottom), 27.0 to 28.0 in "Ticks per loop", and 27 to 28 in "Operations per loop". That is all.

Based on this number, the benchmark computes the actual number of CPU ticks per Java opcode that you are getting (as an average, so e.g. an array access probably costs more than an addition). Bear in mind that your PC's background processes are included in the count; if you care, check your task manager and correct for them with ticks * (1 - bitcoin / 100), where bitcoin is the percentage (0-100) of CPU going to background processes.

Ticks per loop tells you the total number of ticks it takes to run the loop once, by multiplying the average ticks per (Java bytecode) operation by 28.0.
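In code, that correction would look something like this (measuredTicks and backgroundPercent are assumed names for the values described above):

double measuredTicks = 8318.0;       // example: ticks per loop from my latest run
double backgroundPercent = 35.0;     // e.g. read off Task Manager
double adjustedTicks = measuredTicks * (1.0 - backgroundPercent / 100.0);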
 
Carey Brown
Saloon Keeper
Posts: 6772
Slightly better but still says 'gap'.
 
Dave Tolls
Rancher
Posts: 4450
Though there are 28 bytecode instructions, there is nothing to say that each of those instructions is the equivalent of a single tick.
As Stephan says:
"Now, while accessing the array from the L1 cache may very well take 4 processor cycles, accessing it from main memory can take up to hundreds of cycles."

So some of those instructions can take anywhere from 4 to hundreds of times longer than you are calculating.
 
Stephan van Hulst
Saloon Keeper
Posts: 11189
Jonathan knows that, but has decided on a threshold of an average of 100 ticks per bytecode operation. Personally I find this a bit arbitrary, especially since we don't know how the JVM instruments code before it runs it, and the ticks were measured as if the JVM has the processor core all to itself.

Jonathan, I seriously suggest that you try to replicate your findings using jmh, which will at least take care of the instrumentation aspect.
 
Jonathan Graef
Greenhorn
Posts: 11
Um, as I understand it, you load the two operands from memory, which is as fast as the CPU, then perform the operation. That's 3 ticks. What's Java doing, compiling itself?

I gave it 100 ticks for the possibly-BS notion that memory access is slow. Or maybe Java just compiles itself every time we change a variable and then runs for no reason ("no reason" because the variables would all be compiled in statically if it recompiled every time).

And what about C? It doesn't have a JIT.

Link: https://www.stefankrause.net/wp/?p=4

If you're getting 30 MHz, they're probably getting 40 MHz. In C?

There's a full spectrum of possibilities, but I don't think activating naive bench optimizations is going to cure our miserable Minecraft FPS. Or the experience. But we don't like to talk about that.
 
Stephan van Hulst
Saloon Keeper
Posts: 11189
You are forgetting about CAS latency. Even if the RAM can transfer data at the same rate the processor can handle it, the entire roundtrip time of a single bit of data is still in the order of hundreds of cycles.

This means that if the processor always knows ahead of time what part of the memory it needs, the CAS latency will eventually be divided over all memory accesses, and total speed will approach memory clock speed.

Sadly, the processor can't always predict what part of memory it will need (especially around conditionals, or when an array index doesn't increment linearly), so upon a false branch prediction or cache miss it has to add the entire CAS latency to its single instruction time.

I bet your program will run a lot faster if you access the array with i instead of ai, and faster still if you let it increment in smaller steps.
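A quick way to see the difference (array size and stride here are assumptions, not values from the benchmark):

// Same number of stores in both loops; only the access pattern differs.
long[] hash = new long[1 << 24];          // assumed size, a power of two
int mask = hash.length - 1;

// Sequential: the hardware prefetcher can stream the array through the cache.
for (int i = 0; i < hash.length; i++) {
    hash[i] = i;
}

// Strided: jumping around memory defeats prefetching, so many stores pay
// something close to the full memory round trip.
int stride = 4097;                        // assumed odd stride, so every index is still visited once
int index = 0;
for (int n = 0; n < hash.length; n++) {
    hash[index] = n;
    index = (index + stride) & mask;
}

Timing the two loops separately (after warm-up) should show the second one running noticeably slower on typical desktop hardware.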
 
Jonathan Graef
Greenhorn
Posts: 11
Um, wut? I come from a background of logic and theoretical computing; I didn't know about CAS. It makes sense that RAM would be striped over multiple internal buses, I think. Maybe.

I read online that DDR3 RAM has around 15 nanoseconds of round-trip latency. That's 66 MHz. Divide by a reasonable 15 memory accesses per loop and you get 4.4 MHz of memory accesses that can't be predicted.

I got 5 MHz on my benchmark, so I would go so far as to say that Java is predicting hardly anything.

Long story short: RAM has latency, you need branch prediction and array preloading to get that sweet 2 billion instruction count, and since Java didn't detect the loop counter in this benchmark, it wasn't predicting which element needed to be preloaded before array[ai] was accessed. It may not do this every time, but I'm hopping over to LuaJIT, which gets me my 2 GHz on the button. They predict it, I guess.
 