
Tips on how to handle formatting columns to a desired format with 100k+ rows

 
Paul Monique
Greenhorn
Posts: 24
Hello JavaRanch Community,
I've been using OpenCSV to load all this data into my DB (Postgres) via EclipseLink with batch inserting. It takes no more than 5 seconds to load 200k+ rows, because all the columns are of type string, so no formatting is required. The problem comes when I need to give a special format to the data in this table (date, Integer, etc.).
How it works right now:
- I go row by row (when required), verifying the format of the data and converting it with something like Date date = Fechas.strToDate(data, pattern), then fill the new object with this info.
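(Fechas.strToDate is just a thin wrapper of mine; a sketch of the idea, not the exact class:)

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public final class Fechas {
    // Rough sketch of the helper; the real class may differ.
    public static Date strToDate(String data, String pattern) throws ParseException {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern);
        fmt.setLenient(false); // reject values like "32-01-2015" instead of rolling them over
        return fmt.parse(data.trim());
    }
}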

What I'm planning to do:
- Using EclipseLink's OPERATOR function, I'm going to change all the rows of a column that requires a NUMBER format with OPERATOR('ToNumber', column1, '9999999999').
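As a sketch, that bulk update would be roughly this (the entity and field names here are placeholders, not my real mapping):

import javax.persistence.EntityManager;

public class BulkFormat {
    // Planned bulk update using EclipseLink's OPERATOR JPQL extension;
    // "Staging" and its fields stand in for the real entity.
    public static int toNumberColumn(EntityManager em) {
        return em.createQuery(
                "UPDATE Staging s "
              + "SET s.amount = OPERATOR('ToNumber', s.column1, '9999999999')")
            .executeUpdate();
    }
}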
I can't do the same for Date, because I'll get an error whenever a value doesn't match a Date-like pattern.
I was wondering if I could get some tips on how to handle this Date formatting (from a query, or directly in Java).

I've already been all over Google, Stack Overflow, the JavaRanch forum, and some blogs; perhaps I'm not asking the right questions of their search engines. Any ideas?
 
Jayesh A Lalwani
Rancher
Posts: 2762
I know my advice here will go against the WORA philosophy that Java people adhere to, but I think it's worth it.

Generally, if you are going to be loading a lot of data into a database, it's better to use the native bulk-loading routines provided by the database. This lets the database optimize the process. I've used Oracle and MySQL, and their bulk-load commands outperform trivial Java programs. You can make Java load data faster, but you have to spend a lot of effort doing it, and really you end up reinventing a wheel the DB developers have already built.

I suggest looking into the Postgres documentation on bulk loading. Generally, bulk-loading routines have a way to do simple formatting of the data as they load it; in Postgres the tool is the COPY command, which you can even drive from Java, as below.
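For example, COPY can be invoked straight from Java through the pgJDBC driver (connection details, file name, and table name here are placeholders):

import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "pass");
             Reader csv = new FileReader("data.csv")) {
            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            // Everything lands as text in a staging table; formatting comes later.
            long rows = copy.copyIn(
                "COPY staging FROM STDIN WITH (FORMAT csv, HEADER true)", csv);
            System.out.println("Loaded " + rows + " rows");
        }
    }
}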
 
Paul Monique
Greenhorn
Posts: 24
Thanks for the advice! I'm already doing the bulk load through the database, since for me it outperforms Java, and I'm doing some of the data type formatting through the DB as well. I can handle String and number formats, but when it comes to Date, it throws an error if even a single value has a bad pattern, a stray dot, or an extra letter, so I'm not sure how to go about it.

OPERATOR('toDate', a.column24, 'DD-mon-YYYY') throws the error whenever the data differs from the 'DD-mon-YYYY' pattern. (I'm trying to handle everything through JPA, even the bulk updates; I'm trying not to create a function for this on the DB.)

Thanks again for the replies!
 
Jayesh A Lalwani
Rancher
Posts: 2762
The way I have handled this before is with a three-step import:
a) Load the records into a staging table. The staging table has the same fields, but everything is a string.
b) Validate the records using SQL queries; discard invalid records and/or log them in an error table.
c) Insert the records from the staging table into the final table.

This works for things that require basic validation and transformation (a sketch of steps b and c is below). If your validation/transformation rules are complicated, you will be better off looking at ETL libraries rather than implementing them yourself. You can do it; it's just not a good idea to reinvent the wheel.
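For example, with plain JDBC against Postgres (table and column names are invented, and the regex is only a rough shape check for 'DD-mon-YYYY'-style values, not full validation):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class PromoteStaging {
    public static void promote(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            // (b) log rows whose date column isn't even shaped like DD-mon-YYYY
            st.executeUpdate(
                "INSERT INTO load_errors (bad_value, reason) "
              + "SELECT s.column24, 'bad date' FROM staging s "
              + "WHERE s.column24 !~ '^[0-9]{2}-[A-Za-z]{3}-[0-9]{4}$'");
            // (c) promote the rows that pass, converting as we go; tighten the
            // regex for real data, since it only checks the shape, not the values
            st.executeUpdate(
                "INSERT INTO final_table (event_date, amount) "
              + "SELECT to_date(s.column24, 'DD-mon-YYYY'), "
              + "       to_number(s.column1, '9999999999') "
              + "FROM staging s "
              + "WHERE s.column24 ~ '^[0-9]{2}-[A-Za-z]{3}-[0-9]{4}$'");
        }
    }
}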

If you really want to insist on reinventing the wheel, then you will have to start thinking about parallelizing your load operations, i.e., split the input file into 10 parts and start 10 threads, one to load each part. Also, you will have to make sure you load the file in chunks rather than trying to create 10,000 records in memory. You will also need to use bulk updates, except that 10,000 is probably too large a size for a bulk update, because you start putting too many things in memory; so you will have to bulk update in batches of 1,000. Make sure that when you parse each line you use the Pattern class, but don't create a Pattern object every time: create it once and keep it. Also make sure you disable indexes before you start loading, and enable them after you are done. If your table is partitioned and the DB allows it, you can probably do some magic with the indexes. The chunked, batched part looks something like the sketch below.
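Just to illustrate the chunking and batching (no threading shown; the file format, line regex, and SQL are invented):

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChunkedLoader {
    // compile the Pattern once and reuse it for every line, as advised above
    private static final Pattern LINE = Pattern.compile("^([^,]*),([^,]*)$");

    public static void load(Connection conn, String file) throws Exception {
        conn.setAutoCommit(false);
        try (BufferedReader in = new BufferedReader(new FileReader(file));
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO staging (column1, column24) VALUES (?, ?)")) {
            int pending = 0;
            String line;
            while ((line = in.readLine()) != null) {   // stream, don't slurp
                Matcher m = LINE.matcher(line);
                if (!m.matches()) continue;            // or log to an error table
                ps.setString(1, m.group(1));
                ps.setString(2, m.group(2));
                ps.addBatch();
                if (++pending == 1000) {               // batches of 1,000, per above
                    ps.executeBatch();
                    conn.commit();
                    pending = 0;
                }
            }
            if (pending > 0) { ps.executeBatch(); conn.commit(); }
        }
    }
}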

There are a hundred other things that I might be forgetting. Really, it's not worth it.
 
Paul Monique
Greenhorn
Posts: 24
Jayesh A Lalwani wrote:The way I have handled this before is with a three-step import:
a) Load the records into a staging table. The staging table has the same fields, but everything is a string.
b) Validate the records using SQL queries; discard invalid records and/or log them in an error table.
c) Insert the records from the staging table into the final table.
This works for things that require basic validation and transformation. If your validation/transformation rules are complicated, you will be better off looking at ETL libraries rather than implementing them yourself. You can do it; it's just not a good idea to reinvent the wheel.


Great, I'm already doing point a). I'm going to validate the records with a query like in point b) (for the Date pattern I'm going to look into the regex functions), and then load the data into the final table as in point c); the validation I need at that point doesn't take much time, because it is not that complex.
Thanks a lot, Jayesh!!

 