
How to get RegEx to Search a Text File with All Kinds of Characters?

 
Mike London
Ranch Hand
Posts: 1505
11
Hello,

I have a really long text file that has email history in it, aptly named: "History".

Using the RegEx below, I can open the file in, say, TextWrangler (free text editor with Regex) on the Mac and I get all the matching phone numbers. No problem.

Here's the code:



----

But, in Java, not so much. At least not what I have so far. (This code works fine on simpler files)

Yet, when I try this code in Java 8, I get:



I've tried several of the Oracle "Charset" options, but none work.

So, how do I deal with really nasty text files containing all kinds of weird symbols or characters, so I can extract just the phone numbers matched by the regular expression?

Thanks,

-mike
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
6
Are you sure that the numeric characters are all coded the same? ASCII or whatever?

If so, I suppose you could read the file as bytes, patch all the non-numbers to something Java likes, and rewrite it.

Bill
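
A related way to "read the file as bytes" without rewriting it is to decode it as ISO-8859-1. That charset maps every byte value to exactly one character, so decoding cannot fail on odd bytes, and ASCII digits keep their usual values. The sketch below assumes the phone numbers are stored as plain ASCII digits (true for UTF-8 and most single-byte encodings, not for UTF-16). The class name and the pattern are hypothetical; the regex from the original post is not shown in the thread, so a simple 555-123-4567 style pattern stands in for it.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Latin1PhoneScan {
    // Hypothetical stand-in pattern, NOT the regex from the original post:
    // three digits, separator, three digits, separator, four digits.
    static final Pattern PHONE =
            Pattern.compile("(?<!\\d)\\d{3}[-.]\\d{3}[-.]\\d{4}(?!\\d)");

    // Scans the stream line by line and returns every match in order.
    static List<String> findPhones(InputStream in) throws IOException {
        List<String> hits = new ArrayList<>();
        // ISO-8859-1 maps each byte 0-255 to one char, so no byte is "malformed".
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = PHONE.matcher(line);
                while (m.find()) {
                    hits.add(m.group());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample containing bytes 0xFF and 0xC3, which are not valid UTF-8 here.
        byte[] sample = "call \u00ff\u00c3 555-123-4567 now\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findPhones(new ByteArrayInputStream(sample)));
    }
}
```

For the real file, the same method could be given a `FileInputStream` opened on "History". Non-ASCII text will look wrong when decoded this way, but that does not matter if only the digits are being extracted. One caveat: if the file has long stretches of binary data with no line breaks, `readLine()` may build very large strings.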
 
Mike London
Ranch Hand
Posts: 1505
11
William Brogden wrote:Are you sure that the numeric characters are all coded the same? ASCII or whatever?

If so, I suppose you could read the file as bytes, patch all the non-numbers to something Java likes, and rewrite it.

Bill


The file has just about every type of character in it imaginable.

Since TextWrangler can do the search with no problem, I was hoping it wouldn't be a huge deal in Java.

Hard to know what Java might not "like" in a crazy file like this one.

So, I'm guessing you mean I should read the file byte by byte first. If each character is alphabetic or numeric, then write that character to another file. Then, when that pre-processing is done, do the search for phone numbers on the revised file.

Correct?

Thanks Bill,

- mike
 
Paul Clapham
Sheriff
Posts: 22819
43
Mike London wrote:The file has just about every type of character in it imaginable.


But that's not the relevant question. The question to ask is "What encoding was the file written in?"
 
Mike London
Ranch Hand
Posts: 1505
11
Paul Clapham wrote:
Mike London wrote:The file has just about every type of character in it imaginable.


But that's not the relevant question. The question to ask is "What encoding was the file written in?"


Hi Paul,

Thunderbird, the client's email program, reports UTF-8 as the encoding for sending and receiving email.

I tried UTF-8 in the original Java 8 Charset before posting, but it didn't make any difference.

If I do a "file -i" on the file at the OS level, it just gives me an unhelpful "Regular File".

I've successfully re-written the file using FileInputStream in a separate class and was getting ready to post back, but I'm still confused why "Filter" doesn't ... Filter just what I want, or could specify.

Thanks,

- mike
 
Paul Clapham
Sheriff
Posts: 22819
43
UTF-8 sounds like a good idea, but your error messages look like the ones you get when you try to use UTF-8 on files which weren't actually encoded in UTF-8. But you have a file, which I suppose was extracted from e-mail by some process. Looks like that process didn't use UTF-8 to write that file.

On the other hand you have an application (presumably not written in Java) which can read it successfully, so I don't know what's up with that. Perhaps it just skips over the odd byte here and there which it can't make sense of, rather than crashing.

As for reading the bytes and trying to deal with them, that could have some problems. Since your data contains all kinds of characters, it must be encoded in such a way that some of the characters require more than one byte. You might have to start looking at the part of the file which Java and UTF-8 can't deal with and use a hex editor to find out what bytes are actually in the file at that point. That could start a process to figure out how the e-mails are actually encoded in your file.
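
To illustrate the "skips over the odd byte" idea: a `CharsetDecoder` can be told either to report malformed input (which throws) or to replace it. The sketch below is a guess at what is going on, since the actual error message is not shown in the thread; the class and method names are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class LenientUtf8 {
    // Strict decoding: any byte sequence that is not valid UTF-8 raises an exception.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Lenient decoding: invalid sequences become the replacement character U+FFFD.
    static String decodeLenient(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // The byte 0xFF can never appear in valid UTF-8.
        byte[] bad = "ok \u00ff 555".getBytes(StandardCharsets.ISO_8859_1);
        try {
            decodeStrict(bad);
        } catch (CharacterCodingException e) {
            System.out.println("strict decoder failed: " + e);
        }
        System.out.println("lenient decoder gave: " + decodeLenient(bad));
    }
}
```

As far as I know, `Files.newBufferedReader` and `Files.lines` use the strict behaviour, while `new InputStreamReader(stream, charset)` replaces bad input. A decoder configured like the lenient one above can also be passed directly to the `InputStreamReader(InputStream, CharsetDecoder)` constructor when reading a file.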
 
Mike London
Ranch Hand
Posts: 1505
11
Paul Clapham wrote:UTF-8 sounds like a good idea, but your error messages look like the ones you get when you try to use UTF-8 on files which weren't actually encoded in UTF-8. But you have a file, which I suppose was extracted from e-mail by some process. Looks like that process didn't use UTF-8 to write that file.

On the other hand you have an application (presumably not written in Java) which can read it successfully, so I don't know what's up with that. Perhaps it just skips over the odd byte here and there which it can't make sense of, rather than crashing.

As for reading the bytes and trying to deal with them, that could have some problems. Since your data contains all kinds of characters, it must be encoded in such a way that some of the characters require more than one byte. You might have to start looking at the part of the file which Java and UTF-8 can't deal with and use a hex editor to find out what bytes are actually in the file at that point. That could start a process to figure out how the e-mails are actually encoded in your file.


Thanks again. Yeah, this file has image byte code in it and just about any other file data you could send in an email including the phone numbers in the email text.

The byte code reading is working, but quite slow, about 1 MB/sec. The whole history file is just over 1 GB.

I guess the slowness is because it's doing a "read()" of each byte at the FileInputStream level.

As a "todo" item, I should probably try to read a line from the file into memory and then read() that line byte by byte (for performance).

This was an interesting learning experience.

Thanks very much for your kind help. :)

-mike
 
Paul Clapham
Sheriff
Posts: 22819
43
Mike London wrote:I guess the slowness is because it's doing a "read()" of each byte at the FileInputStream level.


That could be so. You could wrap the FileInputStream in a BufferedInputStream and that might help.
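
A minimal sketch of that wrapping, applied to the byte-by-byte filtering described earlier in the thread. The rule for which bytes to keep (printable ASCII plus tab, CR and LF) is an assumption made for the example, as are the class name and buffer size; the original filtering code is not shown.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

final class AsciiFilter {
    // Copies only printable ASCII, tab, CR and LF from in to out.
    // Returns the number of bytes kept. Closes both streams when done.
    static long copyPrintableAscii(InputStream in, OutputStream out) throws IOException {
        long kept = 0;
        // The buffered wrappers turn many one-byte read()/write() calls
        // into a few large reads and writes on the underlying streams.
        try (InputStream bin = new BufferedInputStream(in, 1 << 16);
             OutputStream bout = new BufferedOutputStream(out, 1 << 16)) {
            int b;
            while ((b = bin.read()) != -1) {
                boolean keep = (b >= 0x20 && b <= 0x7E)
                        || b == '\t' || b == '\r' || b == '\n';
                if (keep) {
                    bout.write(b);
                    kept++;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // In-memory demo; the sample includes a 0xFF byte and a NUL byte.
        byte[] sample = "a\u00ffb\u0000 555-123-4567\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        long kept = copyPrintableAscii(new ByteArrayInputStream(sample), result);
        System.out.println(kept + " bytes kept: "
                + new String(result.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

For the real job, the arguments would be a `FileInputStream` on the history file and a `FileOutputStream` on the cleaned copy. The per-byte loop stays the same; only the underlying I/O gets batched.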
 
Mike London
Ranch Hand
Posts: 1505
11
Paul Clapham wrote:
Mike London wrote:I guess the slowness is because it's doing a "read()" of each byte at the FileInputStream level.


That could be so. You could wrap the FileInputStream in a BufferedInputStream and that might help.


Exactly what I was thinking.

Thanks!

- mike
 