Granny's Programming Pearls
"inside of every large program is a small program struggling to get out"
JavaRanch.com/granny.jsp
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Bear Bibeault
  • Paul Clapham
  • Jeanne Boyarsky
  • Knute Snortum
Sheriffs:
  • Liutauras Vilda
  • Tim Cooke
  • Junilu Lacar
Saloon Keepers:
  • Ron McLeod
  • Stephan van Hulst
  • Tim Moores
  • Tim Holloway
  • Carey Brown
Bartenders:
  • Joe Ess
  • salvin francis
  • fred rosenberger

Trouble Inserting a csv with mySQL Workbench

 
Ranch Hand
Posts: 100
1
MyEclipse IDE MySQL Database Java
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I have a csv with over 20 million rows and the csv indicates that it has 6 columns. When I try to insert it, I got: "Error Code 1262. Row 2399 was truncated; it contained more data than there were input columns." So I added another column to the db table. I then got: "Error Code 1261. Row 1 doesn't contain data for all columns."

It is too big to open completely with Excel, I can only get a preview of the first 200 rows. I'm not sure what to do. Any recommendations are greatly appreciated.
 
Saloon Keeper
Posts: 21593
146
Android Eclipse IDE Tomcat Server Redhat Java Linux
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Your CSV file is defective.

Before you can fix the database, first you need to find out what's wrong with the CSV.

Obviously, you have at least one row with a differing number of columns than the other rows do, but what are those columns supposed to hold to begin with? If you just jam them in, you may end up with columns that incorrectly contain values intended for other columns.

So you need some sort of analysis here.

Since this is Big Data forum, I'll tell you what I do in cases like this. I use a dedicated Big Data tool, which in my case is generally the Pentaho DI program ("kettle"). Pentaho DI is one of a class of programs called "ETL" tools, for Extract, Transform, and Load. Perhaps the most famous of that group is Talend, but I just happened to land on Pentaho, and I liked it, so that's what I use.

PDI works as a pipelined processor, where the first step brings in data (as for example, from a CSV file), intermediate steps screen for bad data, repair things and do general cleanup jobs, and then one or more steps output data to the destination(s). There's a GUI design tool for such pipelines called Spoon (the entire suite is named after kitchen tools), and many of the steps have the ability to route bad data rows out of the main stream and into error files where they're easier to examine (since you don't have to wade through millions of good lines to get to them). Once you've identified the problem(s), you can add steps to the pipeline to deal with them, which may include putting default values in columns that are usually missing, cleaning up date/time values or whatever.
 
Ray Gilbert
Ranch Hand
Posts: 100
1
MyEclipse IDE MySQL Database Java
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thank you Tim. As a complete noob who has never dealt with this nor ever used Pentaho DI, what do you think my chances are of sucessfully doing this myself? *Gulp*
 
Tim Holloway
Saloon Keeper
Posts: 21593
146
Android Eclipse IDE Tomcat Server Redhat Java Linux
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
The Pentaho business suite (of which their Data Interchange product is just one component) is intended to be used by non-programmers. The Spoon flow design tool is very much DDD (drag, drop, drool).

Having said that, there is a somewhat peculiar way that things are done using PDI, so one has to get used to that.

There are, incidentally books published on this product. Pentaho itself is owned by Hitachi Data Systems, but you can obtain many of the Pentaho products in Community Edtion (free) releases.

The whole thing is written in Java, incidentally. In fact, some of the CSV code was improved by me, because I found the original form a bit cryptic to use.
 
Ray Gilbert
Ranch Hand
Posts: 100
1
MyEclipse IDE MySQL Database Java
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I must admit I am a bit intimidated. Do I need a workstation to run this? I am currently operating with an HP core i3, 8 Gigs of ram, 1 Terabyte HD with 478 Gigs used.
 
Tim Holloway
Saloon Keeper
Posts: 21593
146
Android Eclipse IDE Tomcat Server Redhat Java Linux
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I won't say you can run it on a Raspberry Pi - Java doesn't perform exceptionally well on that machine (at least before the Pi 4). But any box than can run an IDE has easily enough horsepower to run PDI.

The actual Kettle engine itself is extremely lightweight, in fact.
 
Ray Gilbert
Ranch Hand
Posts: 100
1
MyEclipse IDE MySQL Database Java
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Tim Holloway wrote:I won't say you can run it on a Raspberry Pi - Java doesn't perform exceptionally well on that machine (at least before the Pi 4). But any box than can run an IDE has easily enough horsepower to run PDI.

The actual Kettle engine itself is extremely lightweight, in fact.




Thank you ever so much Tim for your time and wisdom. It has been most helpful.
 
I guess I've been abducted by space aliens. So unprofessional. They tried to probe me with this tiny ad:
Java file APIs (DOC, XLS, PDF, and many more)
https://products.aspose.com/total/java
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!