I have two CSV text files, one with about 17,000 rows and the other with about 3,000,000 rows. The large file has a field that I need to attach to each of the 17,000 rows when they match on three values. My current setup is:

1. Read the 17,000 rows into an ArrayList using the Ostermiller CSVParser.
2. Read the 3,000,000 rows in using the Ostermiller CSVParser.
3. For each of the 3,000,000 rows, iterate through the ArrayList and look for a match. If a match is found, write it out to a text file and exit the loop.

This solution works, and I can process roughly 200 to 210 rows per second using the full 17,000-row file and a sample of about 75,000 rows from the large file. My question is: is this a good solution, or is there a better, more efficient way of doing this? If my calculations are correct, it will take close to 3 hours to process the full file, and I may have to process the file several times.
I've never used HashMap and am not familiar with it. I will have to do some more reading. Based on your explanation it does sound a lot more efficient than iterating over the entire ArrayList. Thank you for the suggestion.
I just read up on HashMap. While this will prove useful for other things I'm working on, I'm not sure it will work in this case. I said I was matching on three fields but that's not 100% true. I'm matching on two fields and the third field is a partial match. The third value from the large file must be contained within the third field on the smaller file.
You could use a HashMap where the key uses only the first two fields, and the value is an ArrayList containing all the rows that match those two fields. That way you can at least use the HashMap to quickly zero in on a much smaller list of rows to compare with, then loop through those to see which give you the desired partial match on the third field.
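A minimal sketch of that two-level lookup, assuming the rows are String arrays; the class name, field positions, and key delimiter here are illustrative, not from the thread:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoFieldIndex {

    // Key on the first two fields; the value collects every small-file row
    // that shares those two fields.
    public static Map<String, List<String[]>> buildIndex(List<String[]> smallRows) {
        Map<String, List<String[]>> index = new HashMap<String, List<String[]>>();
        for (String[] row : smallRows) {
            // A delimiter avoids collisions like "ab"+"c" vs "a"+"bc".
            String key = row[0] + "\u0001" + row[1];
            List<String[]> bucket = index.get(key);
            if (bucket == null) {
                bucket = new ArrayList<String[]>();
                index.put(key, bucket);
            }
            bucket.add(row);
        }
        return index;
    }

    // For one large-file row, return the small-file rows whose third field
    // contains the large row's third value.
    public static List<String[]> findMatches(Map<String, List<String[]>> index,
                                             String[] largeRow) {
        List<String[]> matches = new ArrayList<String[]>();
        List<String[]> bucket = index.get(largeRow[0] + "\u0001" + largeRow[1]);
        if (bucket != null) {
            for (String[] candidate : bucket) {
                if (candidate[2].contains(largeRow[2])) {
                    matches.add(candidate);
                }
            }
        }
        return matches;
    }
}
```

This keeps the exact comparisons O(1) per large-file row; only the (usually small) bucket for a matching two-field key is scanned for the partial match.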
If your problem is selecting a record from the 3M records based on 3 fields in the 17K records, it seems like you are going to have to match each of the 17K records against all of the 3M records to see if there is a match.
By using a HashMap you can reduce the number of compares needed. For each of the 17K records, get the fields to match on, concatenate them, and enter the result as a key in the HashMap with a pointer to the source record as the value. If there are duplicate keys, use an ArrayList to hold the pointers. Then go through the 3M records, get the corresponding fields from each record, concatenate them, and see if there is a key with that value. If not, you can skip that record. If there is, get the record via the value pointer and check whether it matches.
I assume that the fields being matched are at fixed locations. You don't say what type the data is: bytes, String or ???. Try running your app with the sample data using -Xprof to see where the time is spent. I got a 20% improvement in an app by replacing calls to toUpperCase with my own code that took some shortcuts because I knew what kind of data I was working with. -Xprof showed me where my code was spending its time.
Also look at using NIO channels to read the data. [ August 13, 2008: Message edited by: Norm Radder ]
It looks like HashMap allows the insertion of duplicate keys. And it seems that upon retrieval, if there are duplicate entries, it grabs the first occurrence. It is quite likely that I have duplicate keys. The correct entry will be evaluated upon retrieval of the value. Is there a way to handle multiple matches? I have no other way to make the key unique.
Originally posted by Bill Hayes: It looks like HashMap allows the insertion of duplicate keys.
No it doesn't. If you try to insert a key/value pair when the key already exists, the new value will replace the old value. If you want to have multiple values associated with a key, then you need to add the values to a List and make a reference to that List the value associated with the key. So your add method would be something like [ August 14, 2008: Message edited by: Joanne Neal ]
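A minimal sketch of such an add method; the class name and the use of String arrays for rows are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RowIndex {
    private final Map<String, List<String[]>> map =
            new HashMap<String, List<String[]>>();

    // If the key is new, start a fresh List; either way, append the row.
    // Duplicate keys therefore grow the List instead of overwriting a value.
    public void add(String key, String[] row) {
        List<String[]> values = map.get(key);
        if (values == null) {
            values = new ArrayList<String[]>();
            map.put(key, values);
        }
        values.add(row);
    }

    // Returns all rows stored under the key, or null if the key is absent.
    public List<String[]> get(String key) {
        return map.get(key);
    }
}
```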
Thanks for the clarification. I just want to make sure I'm understanding correctly. Before I add my current key, I need to search the map to see if the key already exists. If it does exist, I need to grab the existing value, add it to a List along with my current value, and then add the List to the map.
So when searching the map for an existing key, and the key is found, do I first need to determine whether the value is a single value or a List and handle it appropriately? Or do I add every value as a List even if there ends up being only one value? The latter sounds easier to me, but is there a performance hit in doing so? My List will contain 5 fields per row.
The latter. The value associated with the key is always a List which contains your values. When you search the map for a key's value, it will either return a List or null (if the key doesn't exist). If it returns a List, you simply add the new value to it. If it returns null, you create a new List, add it to the HashMap under the key, and then add your new value to this new List.
[ August 14, 2008: Message edited by: Joanne Neal ]
It looks like you are facing a sort of join problem. The thing you first implemented is a so-called nested loop join, the most simple and basic join algorithm, but also the least efficient: it's O(n²).
The algorithm discussed in this thread is a so-called hash join, which is a little more complex and works for equi-joins only, but is also way more efficient: it's O(n).
If your values are strings, which I assume for CSV data, and if your 3,000,000 lines are sorted, you might also give a thought to a so-called sort-merge join, which is O(n log n).
Literature on database systems will provide you with a lot of details and implementation hints for all three of these algorithms; just give it a google. If you face problems like the one discussed here more often, it'll be really worth the reading.
Or insert your data in a database using JDBC and let the database engine do the join for you. It would be a left outer join, assuming that your 17,000 lines are the left hand side.
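As a rough sketch of that query, with hypothetical table names (small_t, large_t), column names (f1, f2, f3), and extra field (x) standing in for the real CSV layout; note that string concatenation syntax for the LIKE pattern varies by database:

```sql
-- Attach large_t's extra field to each small_t row; NULL where no match.
SELECT s.f1, s.f2, s.f3, l.x
FROM small_t s
LEFT OUTER JOIN large_t l
  ON  l.f1 = s.f1
  AND l.f2 = s.f2
  -- the large file's third value must be contained in the small file's third field
  AND s.f3 LIKE '%' || l.f3 || '%';
```

One caveat: the LIKE predicate is not an equi-join condition, so the engine can only use indexes for the first two columns and must scan candidates for the partial match, much like the HashMap-plus-bucket-scan approach discussed above.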
I would like to thank everyone for helping me work through this solution. In the end I used the HashMap and it works beautifully. My small file ended up being 1.3 million records and my large file remained around 3 million. I was able to load the 1.3M records into the HashMap with the values stored as Lists, then look up all 3M records in the HashMap and find my partial match on the third field. I had to sort out a Heap Space (out of memory) issue, but in the end I was able to do all of this in less than 1 minute, with the last run taking 49 seconds. Quite an improvement over my initial solution, which took 3 hours with only 17K records in the small file. Thanks again, Bill