I am trying to decide whether I should go with a regex or a tokenizer object, and I would like your feedback. Here is my scenario: I send basic queries to Google with keywords like "red is the color" or "red is associated", put the result URLs in a linked list, and then crawl those pages.
I look for these keywords in the HTML pages. For example, if one sentence is "Red is the color bla bla bla bla." I want to grab that sentence and put it in an array to use later.
I have successfully stripped the HTML tags without problems. The problem is that the keywords sometimes come at the beginning of a sentence, sometimes in the middle, and sometimes at the end, so when I try to match them with a regex I can't figure out how to make the text around them optional. I haven't tried a tokenizer, but someone suggested it to me and I am interested. I have heard it is deprecated, though. Is that true?
I hope I am making sense. What do you guys think? Which path should I follow?
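For comparison, the tokenizer route could look roughly like the sketch below. It assumes the "tokenizer object" means `java.util.StringTokenizer`, that a sentence is simply whatever sits between two periods, and that a case-insensitive substring check is good enough. The class and method names are made up for illustration. As far as the Javadoc goes, `StringTokenizer` is not formally deprecated, but it is described as a legacy class whose use is discouraged in new code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical helper: splits text on periods and keeps the sentences
// that contain the keyword anywhere (start, middle, or end).
class TokenizerSentenceFinder {
    static List<String> find(String text, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        // Assumption: "." is the only sentence delimiter we care about.
        StringTokenizer sentences = new StringTokenizer(text, ".");
        while (sentences.hasMoreTokens()) {
            String sentence = sentences.nextToken().trim();
            if (sentence.toLowerCase(Locale.ROOT).contains(needle)) {
                // The tokenizer drops the delimiter, so put the period back.
                matches.add(sentence + ".");
            }
        }
        return matches;
    }
}
```

Note that a plain substring check also hits longer words that contain the keyword (for example "red" inside "bored"), and that abbreviations such as "e.g." would be split as if they ended a sentence.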
Thanks for the response. I couldn't figure out how to match the kind of sentence I want. For example, let's say I supply the word pink as my variable and I am looking for sentences that have pink in them, as in the source text below (basically, I want every sentence that has pink somewhere in it):
Pink is a combination of red and white. The quality of energy in pink is determined by how much red is present. White is the potential for fullness, while red helps you to achieve that potential. Pink combines these energies. Shades of deep pink, such as magenta, are effective in neutralizing disorder and violence. Some prisons use limited deep pink tones to diffuse aggressive behaviour.
I want a regex that matches this. This is how far I have come:
I am trying to get the words before pink, if there are any, then pink itself, and then any words after pink, up to the period.
1) turn on flags for case-insensitive and multi-line matching
2) start with (but don't capture) the beginning of the input or a period followed by a space, via a look-behind
3) start of capture (group 1)
4) match anything but a period, 0 or more times
5) match the word
6) same as 4
7) match the end of the input or a period
8) end of capture (group 1)
[ April 12, 2006: Message edited by: Garrett Rowe ]
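The pattern those eight steps describe is not shown in the post, so the following is only one way to reconstruct it in Java, not necessarily the exact regex that was posted. The class and method names are hypothetical, and `Pattern.quote` is an addition so that a search word containing regex metacharacters is matched literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that follows the numbered steps above.
class SentenceFinder {
    static List<String> find(String text, String word) {
        Pattern pattern = Pattern.compile(
                "(?<=^|\\. )"              // 2) look-behind: line start, or a period and a space
                + "("                      // 3) start of capture group 1
                + "[^.]*"                  // 4) anything but a period, 0 or more times
                + Pattern.quote(word)      // 5) the word itself, taken literally
                + "[^.]*"                  // 6) same as 4
                + "(?:$|\\.)"              // 7) end of line/input, or a period
                + ")",                     // 8) end of capture group 1
                Pattern.CASE_INSENSITIVE | Pattern.MULTILINE); // 1) flags
        Matcher matcher = pattern.matcher(text);
        List<String> sentences = new ArrayList<>();
        while (matcher.find()) {
            sentences.add(matcher.group(1));
        }
        return sentences;
    }
}
```

Calling `SentenceFinder.find(sourceText, "pink")` on the sample paragraph should return every sentence except the one beginning "White is the potential". With the multi-line flag, `^` and `$` match at line boundaries as well as at the ends of the input. If the word should not match inside longer words, wrapping it in `\\b` word boundaries is one option.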
Some problems are so complex that you have to be highly intelligent and well informed just to be undecided about them. - Laurence J. Peter