
RegularExpression.java

 
Nicholas Jordan
Ranch Hand
Posts: 1282
I'm trying to write the second phase of my program. It reads in a standard one-word-per-line dictionary (built in the first phase of the program), then searches a second file of unknown length and line length for matches.

Obviously, this problem domain is well researched in Regular Expressions, but A:\RegularExpression.java is 129,152 characters, 11,152 words, 3,189 lines, which I don't mind tearing into if it will do me some good.

I found this file by going to the java.sun domain and looking for Regular Expressions, then opening Src.zip in my newly downloaded JDK-5.

I just want to make sure I am reading the right file as this is quite a deep-well of information just to do first-draft, get-it-sputtering coding.

The Java site gives the package name as [java.util.regex], but A:\RegularExpression.java gives [com.sun.org.apache.xerces.internal.impl.xpath.regex]. Do I have the right file?

http://www.docdubya.com/belvedere/statement/Denial.html
[docdubya.com expired on 12/02/2006 and is pending renewal or deletion. ]
"....we can't guarantee a valid email hasn't been tossed, but the alternative is nothing would get done." - Greg Comeau

anybody laughing ?

[ December 09, 2006: Message edited by: Nicholas Jordan ]
[ December 10, 2006: Message edited by: Nicholas Jordan ]
 
marc weber
Sheriff
Posts: 11343
Originally posted by Nicholas Jordan:
...this is quite a deep-well of information just to do first-draft, get-it-sputtering coding...

Indeed, it is!

Rather than dissecting API source code (which exists, after all, so that we don't need to concern ourselves with those inner workings "just to do first-draft, get-it sputtering coding"), I would start with this Sun Tutorial - Regular Expressions.

And, of course, refer to the API documentation (especially for the Pattern and Matcher classes).
 
Alan Moore
Ranch Hand
Posts: 262
There are at least three complete regex implementations in the JDK, but the one that's meant for us to use is the java.util.regex package. Why are you looking at the source code, anyway? If you want to learn how to use regexes, take a look at this site:

http://www.regular-expressions.info/

And if that leaves you hungry for more, there's The Book:

http://regex.info/
 
Nicholas Jordan
Ranch Hand
Posts: 1282
Originally posted by Alan Moore:
Why are you looking at the source code, anyway?


I would LOVE to discuss this; Duntemann gives a really human to human answer in the preface and introduction to Assembly Language, Step by Step.

When I watched Silence of the Lambs, I thought the guy was portraying a Clown, albeit a good one.

You live in a different world from the one I do, or you wouldn't ask the question.
 
Alan Moore
Ranch Hand
Posts: 262
Originally posted by Nicholas Jordan:
You live in a different world from the one I do,


That's a relief.
 
Nicholas Jordan
Ranch Hand
Posts: 1282
Originally posted by Alan Moore:
And if that leaves you hungry for more, there's The Book:


I was working on my second reading of Mastering Regular Expressions last night.

I am in the "Lesson: Regular Expressions" (The Java™ Tutorials > Essential Classes) tutorial right now.

Program (this phase) is simple, in student's concept.


I am sure this program has been written and studied thousands of times. I am open to suggestions, as the ultimate intended user is not a bench technician and I have to foresee all reasonable failure modes, trapping them to an error log for the sys-admin.
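
For readers following along, here is a minimal sketch of the task as described: load a one-word-per-line dictionary, then scan a second text for words that appear in it. This is not the code from the post; the class name DictionaryScan, the method names, and the choice to treat a "word" as a run of ASCII letters compared case-insensitively are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
class DictionaryScan {

    // Reads one word per line, lower-cases it, and skips blank lines.
    static Set<String> loadDictionary(Reader source) throws IOException {
        Set<String> words = new HashSet<String>();
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            String w = line.trim().toLowerCase(Locale.ENGLISH);
            if (w.length() > 0) {
                words.add(w);
            }
        }
        return words;
    }

    // Counts how many tokens in the text are dictionary words.
    // Assumption: a token is a run of ASCII letters.
    static int countMatches(Reader text, Set<String> dictionary) throws IOException {
        Pattern token = Pattern.compile("[A-Za-z]+");
        BufferedReader in = new BufferedReader(text);
        int hits = 0;
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = token.matcher(line);
            while (m.find()) {
                if (dictionary.contains(m.group().toLowerCase(Locale.ENGLISH))) {
                    hits++;
                }
            }
        }
        return hits;
    }
}
```

Both methods take a Reader, so they can be fed a FileReader for real files or a StringReader for quick experiments. Failure handling (logging IOException for a sys-admin, closing streams) is left to the caller in this sketch.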


[ December 09, 2006: Message edited by: Nicholas Jordan ]
 
marc weber
Sheriff
Posts: 11343
There are a number of compelling reasons for taking apart source code, but usually not among those is "just to do first-draft, get-it-sputtering coding."

In the long term, if you really want to understand how these classes work (and sometimes don't work), that can be a valuable and worthwhile approach. But in the short term (e.g., under a deadline to write working code today), the meandering scenic route might not be the best plan.

Understand that I'm not advocating undue shortcuts. I'm just pointing out that there are different levels of understanding, and different approaches to meet different goals.

 
Nicholas Jordan
Ranch Hand
Posts: 1282
Originally posted by marc weber:

Indeed, it is!
... of course, refer to the API documentation (especially for the Pattern and Matcher classes).

I did; here is what I got (since my last post).
I know this is a large post, but there's an awful lot of masters here - this follow up adheres strictly to the posting guidelines - I have coded my question, compiler says clean build on this code.

[ December 09, 2006: Message edited by: Nicholas Jordan ]
 
Alan Moore
Ranch Hand
Posts: 262
The java.util.regex package doesn't recognize \< and \> as word-start and word-end boundaries. Use \b to match either the start or end of a word.
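
A small sketch of the \b approach, assuming the goal is to find a given dictionary word only where it stands as a whole word. The class and method names are hypothetical. Pattern.quote is used so that a dictionary entry containing regex metacharacters is matched literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class WordBoundaryDemo {

    // Returns the start offset of every whole-word occurrence of word in text.
    static List<Integer> findWholeWord(String word, String text) {
        // \b matches at either edge of a word, so it serves as both
        // the word-start and the word-end anchor.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        Matcher m = p.matcher(text);
        List<Integer> starts = new ArrayList<Integer>();
        while (m.find()) {
            starts.add(Integer.valueOf(m.start()));
        }
        return starts;
    }

    public static void main(String[] args) {
        // "concat" and "category" are skipped; "cat" and "cat's" match.
        System.out.println(findWholeWord("cat", "cat concat cat's category"));
    }
}
```

Note that the apostrophe counts as a non-word character, so "cat" inside "cat's" is still reported as a whole-word match.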
 
Ernest Friedman-Hill
author and iconoclast
Marshal
Posts: 24211
That's a lot of characters. You're gonna get carpal tunnel, man.
 
Nicholas Jordan
Ranch Hand
Posts: 1282
Originally posted by Alan Moore:
The java.util.regex package doesn't recognize \< and \> as word-start and word-end boundaries. Use \b to match either the start or end of a word.

Stupid question time: Efficiency ? - not for me to examine at this point, just get it working, correct ?

How do you differentiate the beginning of a word from the end of a word to make sure you really have a word in the buffer?
[ December 09, 2006: Message edited by: Nicholas Jordan ]
 
Nicholas Jordan
Ranch Hand
Posts: 1282
Originally posted by Ernest Friedman-Hill:
That's a lot of characters. You're gonna get carpal tunnel, man.


Use ice, copiously. Take scheduled walk-around breaks.

There is no mercy where professionals are concerned.
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
Is RegEx a part of the problem definition or only one possible solution among many? I always like to drag out Ternary Search Trees as a very fast way to look stuff up. You could load the word tree first, then read the text file a character at a time (buffered) and work your way through the tree for every word. I'd be interested to see how speed compares.
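
For anyone who has not met the structure, a minimal ternary search tree sketch follows. It is an illustration only, not the code from the linked page: the class name, the node field names (lo, eq, hi), and the insert/contains API are assumptions. Each node holds one character; smaller characters go left, larger go right, and a matching character moves down the middle link to the next character of the word.

```java
// Hypothetical minimal ternary search tree; names are illustrative only.
class TernarySearchTree {

    private static final class Node {
        final char ch;
        boolean endOfWord;   // true if a stored word ends at this node
        Node lo, eq, hi;     // less-than, equal (next char), greater-than links
        Node(char ch) { this.ch = ch; }
    }

    private Node root;

    void insert(String word) {
        if (word == null || word.length() == 0) {
            return;
        }
        root = insert(root, word, 0);
    }

    private Node insert(Node n, String w, int i) {
        char c = w.charAt(i);
        if (n == null) {
            n = new Node(c);
        }
        if (c < n.ch) {
            n.lo = insert(n.lo, w, i);
        } else if (c > n.ch) {
            n.hi = insert(n.hi, w, i);
        } else if (i < w.length() - 1) {
            n.eq = insert(n.eq, w, i + 1);
        } else {
            n.endOfWord = true;
        }
        return n;
    }

    boolean contains(String word) {
        if (word == null || word.length() == 0) {
            return false;
        }
        Node n = root;
        int i = 0;
        while (n != null) {
            char c = word.charAt(i);
            if (c < n.ch) {
                n = n.lo;
            } else if (c > n.ch) {
                n = n.hi;
            } else if (i == word.length() - 1) {
                return n.endOfWord;
            } else {
                n = n.eq;
                i++;
            }
        }
        return false;
    }
}
```

This version looks up one complete word at a time. Walking the tree while reading the text a character at a time, as suggested above, would follow the same lo/eq/hi steps but keep the current node between reads.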
 
Alan Moore
Ranch Hand
Posts: 262
Originally posted by Nicholas Jordan:
How do you differentiate the beginning of a word from the end of a word to make sure you really have a word in the buffer?


or
You have to put something in there to match the actual word or words anyway.
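
The patterns from this reply did not survive in the thread, so the following is only a sketch of the general idea, not a reconstruction. It assumes a word is a run of ASCII letters: once the pattern names the word characters between the two \b anchors, Matcher.start() gives the beginning of each word and Matcher.end() gives the position just past its end. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class WordSpans {

    // Returns {start, end} offsets for each word; end is exclusive.
    static List<int[]> wordSpans(String text) {
        Matcher m = Pattern.compile("\\b[A-Za-z]+\\b").matcher(text);
        List<int[]> spans = new ArrayList<int[]>();
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "the cat sat";
        for (int[] span : wordSpans(text)) {
            System.out.println(span[0] + ".." + span[1] + " "
                    + text.substring(span[0], span[1]));
        }
    }
}
```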
 
Nicholas Jordan
Ranch Hand
Posts: 1282
Originally posted by Stan James:
Is RegEx a part of the problem definition or only one possible solution among many? ... I'd be interested to see how speed compares.


One possible solution, for one strictly defined problem domain that is recurrent throughout the application.

Do you want millisecond inner-loop times or overall responsiveness from the shell ?

RegEx is/was/will be first thought of answer - it is widely encountered throughout computer science, therefore will have been tweaked by more computer science workers than the Sargasso Sea has waves.

I recoded the loop this morning using the String class's methods, as an expedient to get farther along in prototyping. JDK 1.2 does not compile regex, so I commented it out. I will not be able to do critical loop timing until I figure out some threading issues.
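
The recoded loop itself is not shown in the thread, so this is only a guess at what a regex-free version might look like. It uses String.indexOf and Character.isLetterOrDigit, both of which predate java.util.regex (added in Java 1.4). The class and method names, and the rule that a word edge is any position not adjacent to a letter or digit, are assumptions.

```java
// Hypothetical regex-free whole-word search; names are illustrative only.
class PlainStringSearch {

    // True if word occurs in line with no letter or digit directly
    // before or after it.
    static boolean containsWholeWord(String line, String word) {
        if (word.length() == 0) {
            return false;
        }
        int from = 0;
        while (true) {
            int at = line.indexOf(word, from);
            if (at < 0) {
                return false;
            }
            int end = at + word.length();
            boolean startOk = at == 0
                    || !Character.isLetterOrDigit(line.charAt(at - 1));
            boolean endOk = end == line.length()
                    || !Character.isLetterOrDigit(line.charAt(end));
            if (startOk && endOk) {
                return true;
            }
            // Not a whole word here; keep looking one position further on.
            from = at + 1;
        }
    }
}
```

One difference from \b worth knowing: \b also treats the underscore as a word character, while this sketch does not.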

Tried it this morning, let me know by pm if you want the runlog.

[ December 09, 2006: Message edited by: Nicholas Jordan ]
 
Nicholas Jordan
Ranch Hand
Posts: 1282
Originally posted by Alan Moore:
You have to put something in there to match the actual word or words anyway.

I figured it out, no matter how coded, one has to find a boundary, then lookahead to see if the next char is [a-zA-Z]; and so on.

I assume several approaches to trailing apostrophes, the s's in plurals, and so on will come within range of my coding skills after some use.
 
Alan Moore
Ranch Hand
Posts: 262
I think you're putting too much emphasis on word boundaries. Their main purpose is to make sure that any match you find is a whole word (or words), not a substring of some longer sequence. If you have a regex like this:
...and it matches something, there's no ambiguity about which \b matched where, and you don't need to do any extra lookaheads.

By the way, the regex above is a first cut at a regex to match names that might include some non-word characters.
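
The regex referred to here is missing from the thread. As a stand-in, and explicitly not the original, the sketch below shows one hypothetical pattern of that general shape: letters, optionally joined by an internal apostrophe or hyphen, with \b at both ends. The class name and the pattern are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is a stand-in, not the one from the post.
class NameMatcher {

    // Letters, then zero or more groups of (' or -) followed by letters.
    private static final Pattern NAME =
            Pattern.compile("\\b[A-Za-z]+(?:['-][A-Za-z]+)*\\b");

    static List<String> findNames(String text) {
        Matcher m = NAME.matcher(text);
        List<String> found = new ArrayList<String>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findNames("O'Brien met Smith-Jones and the dogs' owner"));
    }
}
```

With this pattern a trailing apostrophe, as in "dogs'", is left out of the match because the apostrophe must be followed by a letter. No extra lookahead is needed to tell which \b matched where.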

Why do you always use fully-qualified class names? That's a waste of time and space, and it makes the code harder to read. Just import the appropriate classes or packages and use the classes' simple names. But you won't need to import the java.lang package; it's imported by default.
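
To make the comparison concrete, here is the same method written both ways. The class name, method names, and the digit pattern are invented for the example; the two methods behave identically.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class ImportStyleDemo {

    // Fully-qualified names on every use: legal, but noisy.
    static boolean hasDigitsFullyQualified(java.lang.String s) {
        java.util.regex.Pattern p = java.util.regex.Pattern.compile("\\d+");
        java.util.regex.Matcher m = p.matcher(s);
        return m.find();
    }

    // Same logic using the imports at the top of the file.
    // String needs no import because java.lang is imported by default.
    static boolean hasDigits(String s) {
        Pattern p = Pattern.compile("\\d+");
        Matcher m = p.matcher(s);
        return m.find();
    }
}
```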
 
Nicholas Jordan
Ranch Hand
Posts: 1282
Exactly what I match, or try to match, is something I need to give deep thought to. I will take your example and make it my first example. With any luck, I will come in tomorrow and tell you what I think it does. This is good for beginners: even if they fail, they give the matter some thought.

As for the fully qualified names, I work in crisis-intervention and blowout-control in multi-million dollar projects. When it is so bad that no one with any sense will take it, I take over.

When things are back to S.N.A.F.U., I hand it back to the people who are trained to do the job.

I have to know where everything is to the bazillionth, within about 30 ms.

It is hard to be effective in a house of mirrors with Clowns all around. I have developed a coding style that uses variable names from outside of computer science, because my C++ compiler issues a diagnostic that I do not understand when I use namespaces. It may be irritating, but even then I use variable names chosen for their memorability, names that will not under any reasonable test be in any build file supplied by compilers.

If I have to, I will reduce these for posting. I need the assistance.

My build directory clocks in at well over a quarter of a million bytes, I really have to know every line of code, every statement being exclusionary + concise.

There is no mercy between professionals.
 
Nicholas Jordan
Ranch Hand
Posts: 1282
Originally posted by Stan James:
Ternary Search Trees as a very fast way to look stuff up.


As soon as I saw the concept, I unzipped it. Because it so closely models what I intend to do in the next phase of the program, it is a real home run if you like accolades, I dreaded trying to re-invent this wheel.

If and when something is found, there is a split decision on the basis of some information gleaned elsewhere, such as whether this is an operator with authority or just a casual user who does not want to know. It shouldn't take deep contemplation to come up with some other 'split-decisions', but once branched, we stay on that side of the main trunk, so the tool effectively models my thinking.

I took a really short peek at your page. I am sure you can understand how this will adjuvate coding later in the project.
 