
HTML to plain text parser

 
Chetan Parekh
Ranch Hand
Posts: 3640
I have HTML content stored in a database.

I want to retrieve it and display it as unformatted text.

Is there any utility that parses HTML content into plain text?

For example:
What I have in the database:
<p><b>Chetan Parekh</b></p>

What I need:
Chetan Parekh
 
Anoop Chandran
Greenhorn
Posts: 4
You may need to write a parser that looks for HTML tags in the given String and strips them out. I hope you are getting the data from the DB as a Blob.
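
A hand-written stripper of this kind can be quite small. The following is only a minimal sketch: the class and method names are made up for illustration, and it assumes simple markup with no comments, scripts, character entities, or ">" inside attribute values.

```java
class TagScanner {
    /** Copies every character that lies outside a <...> pair. */
    static String stripTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;            // start of a tag: stop copying
            } else if (c == '>' && inTag) {
                inTag = false;           // end of a tag: resume copying
            } else if (!inTag) {
                out.append(c);           // ordinary text outside any tag
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p><b>Chetan Parekh</b></p>"));
    }
}
```

For real-world HTML a proper parser is the safer choice, since this loop knows nothing about entities such as &amp; or about script and style blocks.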
 
Chetan Parekh
Ranch Hand
Posts: 3640
Originally posted by Anoop Chandran:
You may need to write a parser which looks for html tags and will take off if that is contained in the specified String.


I am looking for a ready-made parser that does the same. Are there any?

Hope you are getting the data from db as Blob.

You are right.
[ December 16, 2005: Message edited by: Chetan Parekh ]
 
Ulf Dittmer
Rancher
Posts: 42970
NekoHTML is an HTML parser which produces a DOM tree. I'm not sure if it can export the plain text, but it should be a good and easy starting point.

I don't think you need to store HTML as a Blob; a Clob should be sufficient, and would be easier to work with.
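
If adding a library is not an option, the HTML parser that ships with the JDK's Swing classes can also hand back just the text. This is a different parser from NekoHTML, shown only as a sketch: the class name HtmlText and the choice to join text runs with a single space are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HtmlText {
    /** Collects the text runs reported by the JDK's built-in HTML parser. */
    static String toPlainText(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (out.length() > 0) {
                    out.append(' ');     // keep separate text runs apart
                }
                out.append(data);
            }
        };
        try {
            // true = ignore any charset declaration inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p><b>Chetan Parekh</b></p>"));
    }
}
```

The same callback class has handleStartTag and handleEndTag hooks, which could be used to insert line breaks for block elements if the plain text needs some structure.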
 
Chetan Parekh
Ranch Hand
Posts: 3640
Only this will do:

String thisStringHasNoHtml = stringWithHtml.replaceAll("\\<.*?\\>","");
 
Michael Duffy
Ranch Hand
Posts: 163
I'd wonder why HTML is stored in a database at all. Sounds like a design where the view layer has penetrated all the way back to persistence - not a sound idea in my opinion.
 
Chetan Parekh
Ranch Hand
Posts: 3640
Originally posted by Michael Duffy:
I'd wonder why HTML is stored in a database at all. Sounds like a design where the view layer has penetrated all the way back to persistence - not a sound idea in my opinion.


We are developing a content management system where users can submit formatted content, which we need to store in the database.
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
You might find the open source JTidy utility to be helpful. You might even want to run the submitted formatted content through JTidy before accepting it to keep bad HTML out of your database.
Bill
 
Ulf Dittmer
Rancher
Posts: 42970
String thisStringHasNoHtml = stringWithHtml.replaceAll("\\<.*?\\>","");


Be careful with the quantifier here. With a plain ".*" (that is, "\\<.*\\>"), "<abc>text</abc>" would be reduced to nothing, because regexp packages perform greedy matching by default. That means that they match as far to the right as possible, and don't stop at the first possible match if a longer one is available. The "?" after ".*" is what selects non-greedy (reluctant) matching in java.util.regex, so the pattern as written stops at the first ">" and leaves "text" in place.
If the non-greedy option is not available, use a string like "\\<[^<]*?\\>", which prevents another opening angle bracket from being part of the match. It's probably better to replace with a space rather than the empty string, so that words don't get joined inadvertently.
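
The behaviour of the different quantifiers is easy to check. Below is a minimal sketch using String.replaceAll from the standard library; the class and method names are invented for illustration, and the whitespace clean-up in stripToSpace is an extra assumption about what the output should look like.

```java
class RegexStrip {
    /** Greedy: ".*" runs to the LAST '>' in the input. */
    static String stripGreedy(String s) {
        return s.replaceAll("<.*>", "");
    }

    /** Reluctant: ".*?" stops at the FIRST '>' after each '<'. */
    static String stripReluctant(String s) {
        return s.replaceAll("<.*?>", "");
    }

    /** Replaces each tag with a space, then collapses runs of whitespace. */
    static String stripToSpace(String s) {
        return s.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<abc>text</abc>";
        System.out.println("[" + stripGreedy(html) + "]");      // prints []
        System.out.println("[" + stripReluctant(html) + "]");   // prints [text]
        System.out.println(stripReluctant("<p>one</p><p>two</p>")); // prints onetwo
        System.out.println(stripToSpace("<p>one</p><p>two</p>"));   // prints one two
    }
}
```

Note that "." does not match line terminators by default, so the ".*" variants miss a tag that spans several lines, whereas the "[^>]*" form in stripToSpace does not have that limitation.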
 