
Getting the text out of a HTML?

 
Csaba Kassitzky
Ranch Hand
Posts: 38
Hi everyone!
I wrote a program that gets the text out of an HTML file. It removes the tags, extracts the text from between them, and also picks up any remaining part if a tag is unfinished. It also checks for \r\n and writes only one space where more than one would otherwise be produced. My problem is that the linebreaks don't end up in the right place. When I find a text, all the linebreaks go before it. Why? By the way, I need the texts so I can process them and then write them back; I left that part out of the code. Technically this code just copies the file to another one, but the variables are there for the comparison, etc.
textToWrite: this should be the same as what was read in, but it isn't. In certain cases I would write some things before and after it, but that should not change the linebreaks.
textToCompare: the same as textToWrite, but with the linebreaks removed.
tagToWrite: the tags, which the other functions skip, but which will be written back unchanged as well.

This is my code:


Could anyone please help me? Thanks!
[ December 13, 2005: Message edited by: Csaba Kassitzky ]
 
Paul Clapham
Sheriff
Posts: 22832
When i find a text, all the linebreaks go before it.
Okay, so is that how the program is supposed to work, or how it actually works? If you want the whitespace to stay where it is in the text, then just don't do anything special with whitespace when you're reading it in.

If it were me, I wouldn't write any code like that. I would run the HTML through HTMLTidy, making it well-formed XHTML, then use an XML parser on it and extract the text nodes from that.
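
A minimal sketch of the second half of that pipeline, assuming the input has already been made well-formed XHTML (by HTML Tidy or otherwise) and has no DOCTYPE, so the parser does not try to fetch a DTD. The class and method names are made up for illustration; only the JDK's own DOM parser is used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: lists the text nodes of a well-formed XHTML string.
final class XhtmlTextNodes {

    static List<String> textNodes(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    // Depth-first walk: text nodes come out in document order.
    private static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.add(node.getNodeValue());
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello <b>world</b>!</p></body></html>";
        for (String text : textNodes(xhtml)) {
            System.out.println("[" + text + "]");
        }
    }
}
```

Each text node keeps its whitespace exactly as it appeared in the document, so any cleanup can be done afterwards on the complete string.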

But if you really want to use a hand-saw to cut through a log that's 60 cm in diameter, let me suggest not cleaning up the whitespace as you read it in. Just read the entire text node up to the beginning of the next tag, and when you have the whole thing then clean it up.
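
Roughly what that could look like, as a sketch rather than a fix for the posted code: the scanner below cuts the input into tags and text runs, keeps each run verbatim, and only then derives the collapsed form. The field names textToWrite and textToCompare are borrowed from the question; the class, the Piece type and the scan method are invented for illustration, and the scanner knows nothing about comments, scripts or a > inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner: splits HTML into tags and text runs without
// touching whitespace while reading.
final class TextNodeScanner {

    /** One piece of the input: either a tag or a run of text. */
    static final class Piece {
        final boolean isTag;
        final String textToWrite;   // exactly what was read, linebreaks intact
        final String textToCompare; // whitespace collapsed; empty for tags

        Piece(boolean isTag, String raw) {
            this.isTag = isTag;
            this.textToWrite = raw;
            // Cleanup happens only once the whole run is known.
            this.textToCompare = isTag ? "" : raw.replaceAll("\\s+", " ").trim();
        }
    }

    static List<Piece> scan(String html) {
        List<Piece> pieces = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            boolean isTag = html.charAt(pos) == '<';
            int end;
            if (isTag) {
                int close = html.indexOf('>', pos);
                // An unfinished tag takes the rest of the input.
                end = (close < 0) ? html.length() : close + 1;
            } else {
                int open = html.indexOf('<', pos);
                end = (open < 0) ? html.length() : open;
            }
            pieces.add(new Piece(isTag, html.substring(pos, end)));
            pos = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        for (Piece p : scan("<p>Hello\r\n  world</p>\r\n<br")) {
            System.out.println((p.isTag ? "tag:  " : "text: ") + p.textToCompare);
        }
    }
}
```

Writing every textToWrite back in order reproduces the input character for character, so the linebreaks cannot move; textToCompare is a derived value that never feeds back into the output.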
 
Csaba Kassitzky
Ranch Hand
Posts: 38
In the last part (where the comment says "if it's a linebreak"), I check whether the character that was read in is a linebreak. I add it to textToWrite, but for the comparison (textToCompare) I need a single-line string with the same content. That's why I use noMoreSpace to avoid double spaces in the case of \r\n. It saves it as well, but textToCompare will be empty (leading spaces are removed). I tried writing this in the writing method:

but it's still not working.
By the way, how can I extract the text from an HTML file? Will I get each sentence between the tags (I mean, will I get the text as one mass, or will I be able to tell where the tags were)?
[ December 14, 2005: Message edited by: Csaba Kassitzky ]
 
Joe Ess
Bartender
Posts: 9441
Java has an HTML parser built in: Java Almanac: Getting the text in an HTML Document
Now if you want to do something special with certain tags, you should look at modifying the anonymous HTMLEditorKit.ParserCallback implementation, which declares methods for the various kinds of tags.
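
A sketch of what such a callback might look like, using the Swing parser that ships with the JDK. The class name and the "text ..." / "tag ..." event strings are invented for illustration; the parser may also report tags it implies itself (html, head, body), so the exact event list depends on the input.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: records text runs and start tags in document order.
final class SwingHtmlEvents {

    static List<String> events(String html) {
        final List<String> events = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags.
                events.add("text " + new String(data));
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Hook for tag-specific behaviour; here the tag is only recorded.
                events.add("tag " + tag);
            }
        };

        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        for (String event : events(html)) {
            System.out.println(event);
        }
    }
}
```

Because handleText is called separately for each run of text, the text does not arrive as one mass: the tag callbacks in between show where the tags were, and the pos argument gives the position in the source.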
 
Tom Blough
Ranch Hand
Posts: 263
Or, use regular expressions:



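The regular expression removes everything between and including each < and >, and the replacement is repeated for every tag it finds on the line. Keep in mind that quantifiers are greedy by default: the pattern has to stop at the first > (for example with [^>]* or a reluctant .*?), otherwise a single match runs from the first < to the last > on the line and the text in between is lost as well.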
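
A small illustration of the idea. This is not the snippet from the post but a sketch under the assumption that String.replaceAll is used, with made-up method names. It shows a tag-stripping pattern next to a greedy variant that removes too much.

```java
// Hypothetical helper comparing two tag-stripping patterns.
final class TagStripper {

    // "<[^>]*>" matches one tag at a time: '<', anything that is not '>', then '>'.
    static String stripTags(String line) {
        return line.replaceAll("<[^>]*>", "");
    }

    // "<.*>" is greedy: one match spans from the first '<' to the last '>'.
    static String stripTagsGreedy(String line) {
        return line.replaceAll("<.*>", "");
    }

    public static void main(String[] args) {
        String line = "a <b>bold</b> c";
        System.out.println("[" + stripTags(line) + "]");
        System.out.println("[" + stripTagsGreedy(line) + "]");
    }
}
```

Neither pattern copes with a tag that is split across lines when the input is processed line by line, or with a > inside an attribute value.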

Cheers,
[ December 14, 2005: Message edited by: Tom Blough ]
 