• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Tim Cooke
  • Campbell Ritchie
  • paul wheaton
  • Ron McLeod
  • Devaka Cooray
Sheriffs:
  • Jeanne Boyarsky
  • Liutauras Vilda
  • Paul Clapham
Saloon Keepers:
  • Tim Holloway
  • Carey Brown
  • Piet Souris
Bartenders:

Join operation on Excel files?

 
Bartender
Posts: 2407
36
Scala Python Oracle Postgres Database Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
My project is expecting a requirement that will involve joining data from one Excel file into another. Both files will have around 70,000 rows and the idea is that we will want to append the data from each row in file A onto the corresponding record in file B. It's not clear what the record matching criteria will be, but hopefully it will be a single - possibly numeric - ID, which will make life a little easier. However, it's not clear if the data will be sorted in advance or not.

As an old Oracle developer, my instinct would be to load the data into the database (we're using a PostgreSQL DB anyway) and do a nice simple SQL join there. Unfortunately, I don't think people will go for this option, so we will probably have to work on the files using Java instead.

I know there are plenty of APIs for reading/writing Excel files, but does anybody have any ideas about how we might implement this kind of "join" operation on Excel files? I can see this operation blowing up if we have to do it the "dumb" way - 2 billion comparisons? - but I don't have much background in algorithms, and my Java programming skills are still on the upward learning curve, so any tips would be welcome.

Thanks.

Chris
 
Ranch Hand
Posts: 32
1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
The first thought I had was the ODBC driver that is - well used to be - available from Microsoft. It seems - googling around - that it might still be possible to find a solution that will work with the later versions of Excel (2007 onwards).

Please forgive me for saying this but Excel ought not to be used for this purpose; spreadsheets are for organising and analysing data in tabular form, they do not replace databases. My own opinion, for what it is worth, is that your solution should be in two parts. Use a database to gather the data, SQL to get at the data in the manner you require and then populate an Excel workbook with that data. The 'population' step should be quick and easy - unless complex data analysis or charting is part of the equation - and you have lots of options with regards to tools and techniques; from commercial products such as ASPose, through JasperReports and on to POI or even JExcel.
 
chris webster
Bartender
Posts: 2407
36
Scala Python Oracle Postgres Database Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thanks for the reply, Mark. FWIW I agree with you completely: my instinct would be to treat the file simply as a transfer format for input/output, and do all the serious data-munging in a database. Unfortunately, past experience of encountering similar requirements for similar user groups within my organisation strongly suggests that we will be constrained to go down the file-based route, however inappropriate/inefficient that may be. So right now I'm trying to get a handle on the work involved in doing it this way, and potential short-cuts if anybody knows of any.
 
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

chris webster wrote:Thanks for the reply, Mark. FWIW I agree with you completely: my instinct would be to treat the file simply as a transfer format for input/output, and do all the serious data-munging in a database. Unfortunately, past experience of encountering similar requirements for similar user groups within my organisation strongly suggests that we will be constrained to go down the file-based route, however inappropriate/inefficient that may be. So right now I'm trying to get a handle on the work involved in doing it this way, and potential short-cuts if anybody knows of any.


One possibility: could you do this with a pivot table or function? I'm not totally up on that stuff, but I suspect you could. Translating that to Java as a "batch" function (as with pretty much anything to do with MS Office software) is likely to be tricky though. That's the trouble with proprietary software.

Winston
 
Mark Beardsley
Ranch Hand
Posts: 32
1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
..and another thought.

It should be possible to remain 'within' Excel - sort of - by doing something like this;

1. Export the Excel worksheets out into .csv (that assumes the data is in a format that allows you to do this, i.e. no images).
2a. Import the data into a database and perform the join there OR
2b. Work with the data in it's csv format and perform the sorting and merging of data directly to produce a single file of 'results'
3. Use this file to populate a new Excel workbook.

The manipulation of the workbooks is entirely possible using POI. Just as am example, it is easy - fairly - to get the contents of the workbooks into csv format - http://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/ss/examples/ToCSV.java Building a new workbook is likewise straightforward easy and there are numerous examples to get you going here - http://poi.apache.org/spreadsheet/quick-guide.html . If you do decide to use POI, then I would suggest that you focus on the ss.usermodel classes as they make it possible to neatly sidestep the issue of file type; they will work out whether you are opening a binary - .xls - workbook or an OOXML - .xlsx - one. Further, you can create a single codebase and use it to create files of either type when it comes time to write a new workbook. I must declare an interest here I suppose as I was once involved with the POI project.

Winston's idea is also VERY interesting and something I wish I had thought of. It is quite possible to control Excel programmatically if you are working on a Windows machine. In the past, I have used JACOB and the OLE32 module of the Standard Widget Toolkit to perform file conversions and other processes this way. If you can create a VBA macro to accomplish what you require within Excel, them it is possible to use COM to do the same. The two problems are speed - it all works quite slowly - and the you have to be on a Windows platform. It might be worth speaking to an Excel guru within your organisation - or joining a forum - to ask about pivot tables and see whether they can be used. If so, try to get a handle on the VBA and then we can help with the COM part.
 
Mark Beardsley
Ranch Hand
Posts: 32
1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Just had a quick Google and found these pages that might be of interest, they all discuss consolidation of data;

http://office.microsoft.com/en-gb/excel-help/consolidate-data-from-multiple-worksheets-in-a-single-worksheet-HP010342300.aspx
http://www.extendoffice.com/product/kutools-for-excel/merge-excel-wordbooks.html
http://office.microsoft.com/en-gb/excel-help/consolidate-data-in-multiple-worksheets-HP010095249.aspx#BMuse_a_pivottable_report_to_consolidat

If one of these approaches yields the results you are after, you can record a macro of the key strokes and then convert this into COM code to allow the operation to be controlled by Java code - that's assuming running the macro would not be enough in the way of automation.
 
chris webster
Bartender
Posts: 2407
36
Scala Python Oracle Postgres Database Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thanks for the suggestions and links, chaps.

If it was just a case of appending rows to the bottom of a spreadsheet, it would be easy. The awkward part is that we have to join the rows between the two files i.e. we're appending columns to each record - and we'll need to consider how to respond to mismatches etc as well of course. I'm starting to feel that my initial instinct to just use the database is not only the right way to do this, but also so much better than the alternatives that I might even be able to face the inevitable battle to get the idea accepted. But we'll still be using a Java API to read/write the Excel files, so maybe all is not lost for the files-fraternity!

Cheers,
Chris
 
Bartender
Posts: 1051
5
Hibernate Eclipse IDE Chrome
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Chris

If converting your excel files to a fixed csv format is an option, I would take a look at the opencsv library.

http://opencsv.sourceforge.net/

Having a quick read on the site, you could read the rows straight into custom Java beans, key them by an ID and then write custom methods like:


 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

chris webster wrote:But we'll still be using a Java API to read/write the Excel files, so maybe all is not lost for the files-fraternity!


May I ask why? This sounds like a "batching" system to me, and if so, Excel is NOT the platform I would choose for it - not least because it is proprietary; and believe me, I've been down this road.

My advice: If this is indeed the case, get your spreadsheets converted to another format (either a database, or OpenOffice or LibreOffice format) as fast as you possibly can; because otherwise you will be constantly running up against a program (and a file format) that Microsoft does NOT want you to be using directly.

The problems with a different file format: You may not be able convert everything. OO and LibreOffice support 97/2000 format as far as I know.

The problems with a database: Similar, but for a different reason. You'll be reliant on whatever software you use to read the file (which is highly unlikely to be an MS product ).

HIH

Winston
 
chris webster
Bartender
Posts: 2407
36
Scala Python Oracle Postgres Database Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
@Winston: Why Excel? Good question, but that's what our business users expect to feed in at one end of the pipe and what they expect to get out at the other end. I'd agree that a neutral format like CSV would be preferable. But I am not in a position to challenge the business on this issue, so all I can do is try to find a way to implement it as painlessly and reliably as possible.
 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

chris webster wrote:...so all I can do is try to find a way to implement it as painlessly and reliably as possible.


Well, I hate to say but those two adverbs are going to be difficult to achieve - take it from someone who's been through this - and I suggest that you get your management (as opposed to your client's) on board as soon as you possibly can. At best, I reckon you may be able to come up with a 90% solution, simply because you can't guarantee that the software you use is fully up-to-date with all of MS's new "bells and whistles". And how you work out when it isn't, I have no idea.

Furthermore, don't expect any solution you come up with to be particularly scalable. If your document volume doubles in the next few months because you did your job so well, expect the app to slow down unless you throw more hardware at it.

MS Office software was NOT designed for batched "customization" - or batched anything for that matter - it's just a basic fact. If you (or your company) decides that this is what they want, then they'd better be prepared for the consequences - and they are many, and frustrating, and probably costly (at least in development time).

The alternative: Pay Microsoft a big (and I do mean big) whack of money to do this for you. You might be able to get a contract that doesn't include "per document" costs these days, but at least you'll be able to sue the b*ggers if it doesn't work.

Winston
 
chris webster
Bartender
Posts: 2407
36
Scala Python Oracle Postgres Database Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
@Winston:
I couldn't agree with you more - I've seen enough of this kind of nonsense myself too. Unfortunately, I work for a public sector organisation where the business users generally don't give a toss what we IT people say, and they don't want to change anything in their current practices. As a mere coding monkey at the bottom of the totem pole here, I don't have any clout to sway their opinions in a very hierarchical organisation that places very little value on technical considerations, while our management typically kowtows to the business managers because the projects are ultimately funded by the business. So basically if the business users say "We want crap based on ludicrous requirements that force the developers to produce inefficient and fault-prone implementations", then that's what they get. Caveat emptor applies, with the further caveat that whatever goes wrong is still our fault, even if we warned them a thousand times in advance. All I'm trying to do is figure out the least-worst option for navigating this technical and political labyrinth.


 
Rancher
Posts: 2759
32
Eclipse IDE Spring Tomcat Server
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
We have built an application that runs financial models on the cloud. The problem is business users love building models in Excel. It is very easy for them. They can just take what's in their head and bang out a model in couple of hours instead of talking with a developer. Just imagine if you had an option to build something using Java, or ask a team of people to type out your program for you in punch cards. You would prefer to build the program yourself right? That's not a good analogy, but that's how they see us.
 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Jayesh A Lalwani wrote:The problem is business users love building models in Excel. It is very easy for them. They can just take what's in their head and bang out a model in couple of hours instead of talking with a developer.


Just to clarify: my problem is not that people want to use spreadsheets; it's specifically that they use Excel, which is proprietary, and then want folks like us to do all sorts of batch manipulation on them. If they simply used OpenOffice or LibreOffice Calc, there is a whole Java API that is maintained by the same people that are responsible for the software, so there aren't likely to be any issues with version or compatibility.

Winston
 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

chris webster wrote:with the further caveat that whatever goes wrong is still our fault, even if we warned them a thousand times in advance.


Well, the only other piece of advice I can offer then is that you make it absolutely clear - in writing - that you cannot guarantee that your solution will work in all cases, for all the reasons we've discussed. You might also mention that you could guarantee it if they switched to a non-proprietary format spreadsheet.

It sounds like it won't make much difference - as my mum used to say: none are as deaf as them that don't want to hear - but at least you'll have your stipulation on record.

Winston
 
Mark Beardsley
Ranch Hand
Posts: 32
1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Not trying to derail this discussion but did you know that Excel will open - and I think save but I never tested that - Open Document Format spreadsheets? If it is the case that it will silently work with the ODF, then it might be worthwhile using this format yourself and manipulating documents either through UNO or with the ODFToolkit.
 
chris webster
Bartender
Posts: 2407
36
Scala Python Oracle Postgres Database Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thanks for the input, everybody.
 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Mark Beardsley wrote:Not trying to derail this discussion but did you know that Excel will open - and I think save but I never tested that - Open Document Format spreadsheets? If it is the case that it will silently work with the ODF, then it might be worthwhile using this format yourself and manipulating documents either through UNO or with the ODFToolkit.


Interesting. I presume there's a Java API for this stuff that's up-to-date.

My worry with anything Microsoft - and specifically with anything Microsoft Office - is that unless you use a Microsoft product, you simply can't guarantee that it will work for ALL documents (and don't forget we're talking about a batching system that may be processing thousands of files) that you throw at it. If you use something like Calc, at least you have a proper Java API (and I'm pretty sure that most recent versions were actually written in Java) that covers pretty much everything - at least, I've yet to discover anything major that you can't do with it - albeit, at the expense of a a BIG learning curve.

Winston

[Edit]: I should note that my latest version of LibreOffice now supports 2003 format; but how much I'd trust it with critical spreadsheet applications I don't know.
 
Mark Beardsley
Ranch Hand
Posts: 32
1
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Well, the ODFToolkit is the default for manipulating ODF files as far as I am aware. There are two versions of it; one that operates at a very low level and a second that is based upon the first but offers some level of abstraction. You do need to be aware of the file format and how the individual sections hang together in order to get the best from it in my opinion. UNO would protect you from this somewhat as it allows you to manipulate Open/LibreOffice directly from Java code. The problems lie in the fact that, as Winsotn says, the OOXML file format is proprietary. Not only this, but it has changed since it was first introduced, moving from ECMA-376 to ISO/IEC 29500 - http://msdn.microsoft.com/en-us/library/gg607163%28v=office.14%29.aspx adding the complication that all ECMA documents can be read by software that is targetted at the ISO standard but not the other way around as far as I am aware. The makes the task of writing software to read/write these file complex and challenging and it is not possible for me to say whether the ODF will offer all of the features required for a specific application nor whether Open/LibreOffice will work with the latest version of the file format.

Some of the discussions I follow are those found on the POI Users list. One gives an interesting view of the sorts of problems you describe and it relates to the default sizes of columns. According to the file spec' this is an optional setting but now, the latest versions of Office actually insist that the value is set and will refuse to open files if this is not done. It appeared to be easy to patch POI, but the concern is that Microsoft do not obey the specifications that they themselves lay down. When working on batches of documents from various sources, who knows how they will have been created and what versions of the applications were used? Little bugs similar to the one above can undermine all of the work you have done. Are we almost coming round to the point where the conclusion is 'do not do this except in the most general of cases and even then be wary that there may well be unexpected problems that you cannot solve'?
 
With a little knowledge, a cast iron skillet is non-stick and lasts a lifetime.
reply
    Bookmark Topic Watch Topic
  • New Topic