parse many csv from external provider in parallel

 
David Spades
Ranch Hand
Posts: 348
Hi all,

I'm trying to process multiple CSV files at the same time. My code looks like this:


External library used: Apache commons-io 2.4

If I set the static "count" variable to 1, it completes in 600-700 ms on my connection. With 2, I get around 1100-1400 ms; with 3, 1700-1900 ms; and so on.
From these timings, the processing seems to be sequential, not parallel.
This would take a long time when "count" is 40.
I can't see any parallelism here.
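
For comparison, here is a minimal sketch of what the timings should look like when tasks really do overlap. It is not the code from this post: `ParallelTimingDemo`, `simulatedTask` and `runTasks` are hypothetical names, and each "download" is just a `Thread.sleep`. With one worker thread the total grows with the number of tasks, which is the pattern reported above; with one thread per task the total should stay close to the duration of a single task.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ParallelTimingDemo {

    // Hypothetical stand-in for one download: it only sleeps.
    static void simulatedTask(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs `count` simulated tasks on `threads` workers; returns wall-clock time in ms.
    static long runTasks(int count, int threads, long taskMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            pool.execute(() -> simulatedTask(taskMillis));
        }
        pool.shutdown(); // accept no new tasks; submitted ones keep running
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // One worker: tasks run back to back, so time scales with the task count.
        System.out.println("1 thread : " + runTasks(4, 1, 200) + " ms");
        // Four workers: tasks overlap, so time stays near one task's duration.
        System.out.println("4 threads: " + runTasks(4, 4, 200) + " ms");
    }
}
```

If the real program shows the first pattern even though it starts one thread per CSV, the serialization is probably happening somewhere outside the thread setup, for example at the remote end.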

FYI: processing even 50 is very fast if this line is commented out:

Each CSV has around 6,700 lines or more.

Am I missing something here?

Thanks
 
Rodion Gork
Ranch Hand
Posts: 47
I may be missing something, but you are probably spending your time connecting, not processing.

Try wrapping the statements that open and read the connection in System.currentTimeMillis() calls and subtracting the two values.

If that really is the case, then the server is probably making you wait before it accepts a new connection (a common practice against DDoS).

Ah, sorry, I see you have no processing at all. So I believe it is just what I described. It is worth checking this with a standalone tool like "curl"...
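
The timing suggestion above could be sketched roughly as follows. This is only an illustration using the standard `java.net` classes, not the original poster's code: `FetchTimer`, `drain` and `timeFetch` are hypothetical names, and the endpoint address is deliberately left as a command-line argument because the thread does not give it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

final class FetchTimer {

    // Reads the stream to the end and returns the number of bytes seen.
    static long drain(InputStream in) {
        byte[] buffer = new byte[8192];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    // Returns {connectMillis, readMillis, bytesRead} for one request.
    static long[] timeFetch(String address) {
        try {
            long t0 = System.currentTimeMillis();
            URLConnection conn = URI.create(address).toURL().openConnection();
            conn.connect();
            long t1 = System.currentTimeMillis();
            long bytes;
            // For HTTP, this phase also includes waiting for the server's response.
            try (InputStream in = conn.getInputStream()) {
                bytes = drain(in);
            }
            long t2 = System.currentTimeMillis();
            return new long[] {t1 - t0, t2 - t1, bytes};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java FetchTimer <url>");
            return;
        }
        long[] t = timeFetch(args[0]);
        System.out.println("connect: " + t[0] + " ms, read: " + t[1]
                + " ms, bytes: " + t[2]);
    }
}
```

Calling `timeFetch` from each worker thread and printing the two durations should show whether the wait grows in the connect phase or in the read phase as more threads are added.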
 
David Spades
Ranch Hand
Posts: 348
Thank you for the reply.



This is the processing of the CSV sent by the endpoint. If you comment that line out, you'll see the "parallelism": each stat prints in around 100-200 ms.
 
Winston Gutkowski
Bartender
Posts: 10573
David Spades wrote: Am I missing something here?

Simple answer: Dunno.

But you could add a "Thread nnnn started at HH:MM:SS,sss" display inside your Runnable constructor, and maybe a few others, e.g. after the connect() and after the openStream(). That might give you a better idea of how "concurrently" they're running.

Just make sure you can keep track of which Thread is which. :-)
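
A rough sketch of that kind of logging, assuming Java 8 or later. `StageLog`, `stamp` and the `csv-worker-N` thread names are made up for illustration, and the connect()/openStream() calls are only marked by comments. Note that this sketch logs from inside run() and not from the constructor, because a Runnable's constructor executes on the thread that creates it, not on the worker thread.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

final class StageLog {

    // DateTimeFormatter is thread-safe, so one shared instance is fine.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    // Builds a line such as "Thread csv-worker-0 started at 12:34:56,789".
    static String stamp(String stage) {
        return "Thread " + Thread.currentThread().getName()
                + " " + stage + " at " + LocalTime.now().format(FMT);
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[3];
        for (int i = 0; i < workers.length; i++) {
            // Naming each thread makes it easy to tell the log lines apart.
            workers[i] = new Thread(() -> {
                System.out.println(stamp("started"));
                // the real connect() call would go here
                System.out.println(stamp("connected"));
                // the real openStream() call and read would go here
                System.out.println(stamp("stream opened"));
            }, "csv-worker-" + i);
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
    }
}
```

If the "started" lines for all threads carry nearly the same timestamp but the later stages spread out, the threads are being launched together and the delay comes from the network calls.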

Winston
 
Paul Clapham
Sheriff
Posts: 22376
David Spades wrote:

This is the processing of the CSV sent by the endpoint. If you comment that line out, you'll see the "parallelism": each stat prints in around 100-200 ms.


More precisely, what that code does is to read the data from the website and process it. So it's still possible that it's the reading from the website which is being serialized.
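
One way to separate the two costs is to time the parsing alone on data that is already in memory. The sketch below does that with a made-up payload of 6,700 lines; `ParseOnlyTimer`, `fakeCsv` and `countFields` are hypothetical names, the row layout is invented, and the naive comma split is only a stand-in for whatever the real processing does.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ParseOnlyTimer {

    // Builds an invented three-column CSV in memory; real data would come from the endpoint.
    static byte[] fakeCsv(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append(i).append(",name").append(i).append(",42.0\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Parses already-downloaded bytes and counts the fields.
    // Naive split: does not handle quoted fields that contain commas.
    static int countFields(byte[] csv) {
        int fields = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(csv), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                fields += line.split(",").length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fields;
    }

    public static void main(String[] args) {
        byte[] csv = fakeCsv(6700);
        long start = System.nanoTime();
        int fields = countFields(csv);
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println(fields + " fields parsed in " + micros + " us");
    }
}
```

If the in-memory parse is a small fraction of the 600-700 ms measured per file, the remaining time is being spent reading from the website, which is the part that may be serialized.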
 
David Spades
Ranch Hand
Posts: 348
Hmm, I think that's highly unlikely, since this endpoint is for public consumption; if they serialized any part of the process, it would be really, really slow. But who knows...
 