
Comparing two files - One file has simple text (2GB) and the other file has regular expressions (14K records)

 
Shubham Chhabra
Greenhorn
Posts: 9
Need help:

There are two files:

1. The first file has text strings - 8198946|0|0|82|0|0|2011-06-20 00:00:00|40|0.16|DAILY|0.08|braingle.com|0|2011-06-21 15:12:33|--GVFBIIsTHTghxTWkdcaCWItzg|0|USD
2. The second file has regular expressions like - .*2011-06-20 00:00:00.*DAILY.*i-dressup.com.*--GVFBIIsTHTghxTWkdcaCWItzg.*USD.*

I want to find the records in the first file that match the regular expressions in the second file.

So in short, there are 14K regular expressions in the second file, and I want to match them against the records in the first file and find the records that match.

I am using the code below, but it is very slow.

Anybody can suggest any other alternative?
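(The code block itself didn't survive in this copy of the thread, but going by the replies below, the slow version is a nested loop that recompiles every pattern string for every data line. A minimal sketch of that shape, with illustrative identifiers:)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class NaiveMatcher {
    // Collect every data line that matches at least one regex.
    // Slow: each pattern string is recompiled for every single data line.
    static List<String> findMatches(List<String> dataLines, List<String> regexStrings) {
        List<String> matches = new ArrayList<>();
        for (String line : dataLines) {
            for (String regex : regexStrings) {
                // Pattern.compile inside the inner loop is the expensive part
                if (Pattern.compile(regex).matcher(line).matches()) {
                    matches.add(line);
                    break; // one hit is enough for this line
                }
            }
        }
        return matches;
    }
}
```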





thanks
Shubham
 
fred rosenberger
lowercase baba
Bartender
Posts: 12542
48
Chrome Java Linux
First...PLEASE learn to use code tags (I added them above). They make your code MUCH easier for everyone else to read. You can read the FAQ on them here.

Next...

You have millions of lines of data, and you have to compare each against 14,000 regular expressions. If your sample line is representative, you have about 16,000,000 lines in your file. That effectively means you have to do 224,000,000,000 comparisons.

Your options as I see them (and I am not an expert) are:

1) buy a more powerful computer
2) reduce the number of comparisons
3) wait it out.

Which you do depends on what your actual requirements are.
 
fred rosenberger
lowercase baba
Bartender
Posts: 12542
48
Chrome Java Linux
Another thought... this task sounds like it might be something you could parallelize. Split the regex file into pieces, pass those plus the data file to multiple cores/servers, then combine the results (as appropriate) when done.
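(A sketch of that idea within a single JVM. This splits the work over data lines rather than over the regex file, which distributes the same total work across cores; it assumes the patterns are already compiled:)

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ParallelMatcher {
    // Check lines against all precompiled patterns in parallel. Pattern is
    // thread-safe, and each call to matcher() creates a fresh Matcher, so
    // lines can safely be processed on different threads and merged at the end.
    static Set<String> findMatches(List<String> dataLines, List<Pattern> patterns) {
        return dataLines.parallelStream()
                .filter(line -> patterns.stream().anyMatch(p -> p.matcher(line).matches()))
                .collect(Collectors.toSet());
    }
}
```

For multiple servers you would instead split one of the input files physically, run this on each machine, and merge the output files.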
 
Jayesh A Lalwani
Rancher
Posts: 2762
32
Eclipse IDE Spring Tomcat Server
1) Don't compile the pattern every time. Pattern compilation takes a lot of time. Instead of ArrayList<String> al = new ArrayList<String>(), keep a list of Patterns. As you read your pattern file, compile each pattern and put it in the list.
2) This is unnecessary. It probably adds a little bit of overhead, but not as much.

You can do this
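(The snippet here was lost; a sketch of what is being described, assuming a List<Pattern> has been built up front:)

```java
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

public class IteratorPerLine {
    // For each data line, grab a fresh iterator over the precompiled patterns
    // instead of rewinding one iterator back through all 14K entries.
    static boolean matchesAny(String line, List<Pattern> patterns) {
        Iterator<Pattern> it = patterns.iterator(); // creating an iterator is cheap
        while (it.hasNext()) {
            if (it.next().matcher(line).matches()) return true;
        }
        return false;
    }
}
```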



Creating an iterator is light. It will be faster than rolling the iterator all the way back through 14K records.

3) BTW, why are you using regex matching? Each record in your source file is pipe-delimited. It might be simpler to break each record into its data elements by splitting on the pipe character and then matching the fields directly. That seems simpler to me. It's probably not going to be much faster, but regex is complicated.
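(A sketch of that field-by-field approach. It assumes each "pattern" can be reduced to expected values for specific fields, with null meaning "don't care" for that field:)

```java
public class FieldMatcher {
    // Split the pipe-delimited record and compare fields directly instead of
    // running a regex over the whole line. expected[i] == null means any value
    // is acceptable in field i.
    static boolean matchesFields(String record, String[] expected) {
        String[] fields = record.split("\\|");
        if (fields.length < expected.length) return false;
        for (int i = 0; i < expected.length; i++) {
            if (expected[i] != null && !expected[i].equals(fields[i])) return false;
        }
        return true;
    }
}
```

This only works if the regex patterns really are anchored to whole fields, which the samples in the question suggest.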
 
Rico Felix
Ranch Hand
Posts: 411
5
IntelliJ IDE Java Linux
You can probably try the following optimized sequence laid out in the code snippet:
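(The snippet itself is missing from this copy of the thread; a plausible version of that sequence, based on the suggestions above, is: compile every pattern exactly once, then stream the big file lazily so the 2GB never has to sit in memory:)

```java
import java.io.BufferedReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class OptimizedMatcher {
    // 1. Compile each regex exactly once, up front.
    static List<Pattern> compileAll(List<String> regexStrings) {
        List<Pattern> patterns = new ArrayList<>(regexStrings.size());
        for (String s : regexStrings) patterns.add(Pattern.compile(s));
        return patterns;
    }

    // 2. Stream the data file one line at a time (BufferedReader.lines() is
    // lazy), keeping every line that matches at least one precompiled pattern.
    static List<String> findMatches(BufferedReader in, List<Pattern> patterns) {
        return in.lines()
                .filter(line -> patterns.stream().anyMatch(p -> p.matcher(line).matches()))
                .collect(Collectors.toList());
    }
}
```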


Besides what has been illustrated here, for even more performance improvement you'll have to employ multi-threading...
 
chris webster
Bartender
Posts: 2407
36
Linux Oracle Postgres Database Python Scala
Jayesh A Lalwani wrote: ...why are you using regex matching? Each record in your source file is pipe-delimited. It might be simpler to break each record into its data elements by splitting on the pipe character and then matching the fields directly. That seems simpler to me. It's probably not going to be much faster, but regex is complicated.

Could you load the pipe-delimited files into a database, e.g. an RDBMS table with separate columns for each field? Then you could use SQL with/instead of regex to execute the checks against the specific fields. 16 million data records isn't that big for a database, but the number of searches is obviously quite large, so can you also organise your search patterns better? For example, suppose you have 100 regex patterns ending with "USD". If a given record doesn't contain "USD" at all, you can rule out all 100 "USD" patterns with a single cheap check instead of running each one.

Your regex patterns look like they should really be broken out into properly defined queries based on individual fields e.g. currency, which might allow you to re-factor them to a more manageable number or at least apply some kind of branching logic so you don't have to repeat pattern checks that you know are going to fail. If your regex patterns can actually be converted into database records with similar fields, then the problem could even be turned into a big SQL join query. For example, if you have a regex pattern that really means "...WHERE data.currency = regex.currency AND data.foo = regex.foo ...", then you could simply re-cast the entire problem as one or more SQL joins between your data table and your regex table. You'd need to analyse your data and regex to see if they are really field-matching queries, and be sure to index your data table appropriately.
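(One in-memory version of that branching idea, assuming each pattern can be tagged with a literal it requires, e.g. its currency code; the grouping keys here are illustrative:)

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class LiteralPrefilter {
    // Group the compiled patterns by a literal each one requires. A record is
    // only tested against a group if it contains that literal, so a record
    // without "USD" skips every "USD" pattern after one contains() call.
    static Map<String, List<Pattern>> groupByLiteral(Map<String, List<String>> regexesByLiteral) {
        Map<String, List<Pattern>> groups = new HashMap<>();
        regexesByLiteral.forEach((literal, regexes) -> {
            List<Pattern> compiled = new ArrayList<>();
            for (String r : regexes) compiled.add(Pattern.compile(r));
            groups.put(literal, compiled);
        });
        return groups;
    }

    static boolean matchesAny(String record, Map<String, List<Pattern>> groups) {
        for (Map.Entry<String, List<Pattern>> e : groups.entrySet()) {
            if (!record.contains(e.getKey())) continue; // cheap reject for the whole group
            for (Pattern p : e.getValue())
                if (p.matcher(record).matches()) return true;
        }
        return false;
    }
}
```

A database index on the relevant column does essentially the same pruning, just declaratively.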

If this is something you will have to do regularly, you could try loading the file into Hadoop (look at Hortonworks Sandbox?), converting it into a Hive table, then using SQL over the Hive table, which would give you the convenience of SQL (which is designed for queries) and the potential scalability of distributed processing on Hadoop.

There are other powerful search tools like Apache Solr and Elasticsearch, but I don't know anything about these.

I'm an old database developer, so I'm undoubtedly biased, but this definitely smells like a search/database task to me.

Incidentally, what do you think your output will look like? For example, if a given data record matches 1000 of your regex patterns, how many times do you want to see it in the output? How will you know which pattern it matched, and is this important?
 