Sorting huge files by using an index

 
Ranch Hand
Posts: 56
Hi,

How does one index a flat (CSV) file for the purposes of sorting? If I know the exact columns on which I have to sort (or generate the index for), what techniques are available for doing it? I searched the web, and all the indexing-related tutorials were aimed at DBMSes. Does anyone know of a tutorial for indexing flat CSV files? Or has anyone reading this post tried it before?

My records are not fixed width, so using a RandomAccessFile (RAF) is not recommended.

I know of internal and external sorting algorithms, but have never tried indexing a flat file before. How does it work?

Thanks,
Prashant.
 
author
Posts: 14112
Can you please explain what the purpose of indexing the file is? How will the index be used?
 
Author and all-around good cowpoke
Posts: 13078
Since your record length is not fixed you are going to end up using a RandomAccessFile to make use of the index - it is the only way to jump to a record start.
Off the top of my head, I would say you will have to scan the file looking for the starts of lines. For each line start, record the file position and grab the content you are going to sort on. That sounds like a job for a custom object containing two variables:

long fposition ;
String key ;

Store those in a collection, maybe a TreeMap. When your scan is done, the TreeMap will hold the sorted order, and fposition will point to the line start so you can do a RAF seek to it.
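
A rough sketch of that idea in Java, not a definitive implementation. Assumptions that are mine and not from the post: the class and method names (CsvIndexSort, IndexEntry, buildIndex, readSorted) are made up for illustration, fields contain no quoted commas or embedded newlines, and the data is ASCII/Latin-1 (RandomAccessFile.readLine does not decode UTF-8). It also swaps the TreeMap for a sorted list so that lines with duplicate keys are not lost.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CsvIndexSort {

    // One index entry: byte offset of the line start plus the key to sort on.
    static final class IndexEntry {
        final long fposition;
        final String key;

        IndexEntry(long fposition, String key) {
            this.fposition = fposition;
            this.key = key;
        }
    }

    // Scan the file once, remembering where each line starts and its key column.
    static List<IndexEntry> buildIndex(Path csv, int keyColumn) throws IOException {
        List<IndexEntry> index = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(csv.toFile(), "r")) {
            long lineStart = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                if (!line.isEmpty()) {
                    // Naive split: assumes no quoted commas in the data.
                    String key = line.split(",", -1)[keyColumn];
                    index.add(new IndexEntry(lineStart, key));
                }
                lineStart = raf.getFilePointer();
            }
        }
        // Sort the small index, not the big file. List.sort is stable.
        index.sort(Comparator.comparing((IndexEntry e) -> e.key));
        return index;
    }

    // Walk the sorted index and seek to each line in turn.
    static List<String> readSorted(Path csv, List<IndexEntry> index) throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(csv.toFile(), "r")) {
            for (IndexEntry entry : index) {
                raf.seek(entry.fposition);
                lines.add(raf.readLine());
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: CsvIndexSort <file.csv> <keyColumn>");
            return;
        }
        Path csv = Paths.get(args[0]);
        for (String line : readSorted(csv, buildIndex(csv, Integer.parseInt(args[1])))) {
            System.out.println(line);
        }
    }
}
```

Only the offsets and keys are held in memory during the sort. For a really huge file you would probably write each line straight to the output instead of collecting them in a list, and wrap the scan in some buffering, since readLine on a RandomAccessFile reads a byte at a time.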

Bill
 
Prashant Sehgal
Ranch Hand
Posts: 56
Has anyone heard of Lucene?

It's an indexing API from jakarta.org
 
Author
Posts: 31
How big is the file? The simplest solution is just to read it all in, sort it in memory, and dump it out again.
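
For what it's worth, the read-everything approach could look roughly like this. It is only a sketch under a few assumptions of mine: the file fits in the heap, it is UTF-8, there are no blank lines or quoted commas, and the names InMemoryCsvSort and sortFile are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class InMemoryCsvSort {

    // Read every line, sort on one column, and write the result to a new file.
    static void sortFile(Path in, Path out, int keyColumn) throws IOException {
        // Copy into an ArrayList: readAllLines does not promise a modifiable list.
        List<String> lines = new ArrayList<>(Files.readAllLines(in, StandardCharsets.UTF_8));
        // Naive split: assumes no quoted commas in the data.
        lines.sort(Comparator.comparing((String line) -> line.split(",", -1)[keyColumn]));
        Files.write(out, lines, StandardCharsets.UTF_8);
    }
}
```

This compares the key column as plain strings, so a numeric column would need to be parsed inside the comparator first.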

Otherwise, go with a simple in-process database like HSQLDB (http://hsqldb.sourceforge.net) and load the CSV file into a table.
 