
Extracting a string from HTML Source Code

 
Marcus Rauchfuss
Ranch Hand
Hi all,

I want to extract a specific String from HTML. Specifically, I want to extract the String in between <...>

So far, I've got this:


The problem I have is when I change the last parameter in this line:



to



i.e. the generic alternative, I get this error message:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -366
at java.lang.String.substring(Unknown Source)
at main.HTMLGrabber.main(HTMLGrabber.java:45)


Is there a better and simple way to extract a substring?

Thank you.

 
Stefan Evans
Bartender
Do you understand the error message? String index out of range: -366
Do you understand WHY you are getting it?
What does the number -366 indicate?

You do realise that there will be more than one closing tag in the text you are searching, and the indexOf method starts searching from the start of the string...
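
To make the failure concrete, here is a minimal sketch that reproduces the same kind of exception. The sample HTML, the markers and the names NaiveIndexOfDemo and naiveBetween are assumptions made up for illustration, not the code from the original post. On an older JDK like the one that produced the trace above, the negative number in the message is most likely endIndex minus startIndex; newer JDKs word the message differently.

```java
class NaiveIndexOfDemo {

    // Hypothetical helper: looks for the end marker from index 0,
    // which is the mistake described above.
    static String naiveBetween(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker) + startMarker.length();
        // Finds the FIRST occurrence in the whole document,
        // which may lie well before 'start'.
        int end = html.indexOf(endMarker);
        // Throws StringIndexOutOfBoundsException when end < start.
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"keywords\" content=\"java, html\"></head></html>";
        try {
            System.out.println(naiveBetween(html, "content=\"", ">"));
        } catch (StringIndexOutOfBoundsException e) {
            // The first '>' closes <html>, long before the content attribute starts.
            System.out.println("Failed: " + e.getMessage());
        }
    }
}
```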
 
Marcus Rauchfuss
Ranch Hand
Yes, Stefan, I am aware of what this message means. Yes, Stefan, I know there are other > characters in the code (there are lots of > in HTML, and I actually do a lot of HTML pushing in my day job). My question was:

Is there a better and simple way to extract a substring?


because I am aware the one I have chosen does not work.
 
Stefan Evans
Bartender
Sorry if that came across as a little patronizing (a little?!?).
I got focused on fixing the issue rather than answering your actual question.

I'll continue in that vein for a moment:
The substring method takes two arguments: startIndex and endIndex.
Your change evidently makes the calculated endIndex fall before the startIndex.
You could possibly fix this by using the version of the indexOf method that specifies a starting point to search from.

ie:
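
A minimal sketch of that fix, not the snippet from the original post. The sample HTML, the markers and the names TagContentExtractor and between are assumptions made up for illustration:

```java
class TagContentExtractor {

    /**
     * Returns the text between startMarker and the first endMarker that
     * appears AFTER it, or null if either marker is missing.
     */
    static String between(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) {
            return null; // start marker not present
        }
        start += startMarker.length();

        // Two-argument indexOf: begin searching at 'start', not at index 0,
        // so an earlier occurrence of the end marker cannot be picked up.
        int end = html.indexOf(endMarker, start);
        if (end < 0) {
            return null; // no end marker after the start marker
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"keywords\" "
                + "content=\"java, html, parsing\"></head></html>";
        System.out.println(between(html, "content=\"", "\">"));
    }
}
```

Returning null for a missing marker is just one choice; throwing an exception or returning an Optional would work equally well.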




To answer your question: Is there a better way?
Well, there is another way: using a regular expression to capture the part you are interested in.

Using regular expressions to parse full HTML is not generally recommended, but if all you are after is the content of the meta keywords tag, then it should be something like:
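
A minimal sketch of such a regular expression, not the one from the original post. It assumes the tag is written with the name attribute before content and with double-quoted values; the names MetaKeywordsRegex and keywords are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaKeywordsRegex {

    // Assumption: name comes before content and both values are double-quoted.
    // Group 1 captures everything inside the content attribute's quotes.
    private static final Pattern META_KEYWORDS = Pattern.compile(
            "<meta\\s+name=\"keywords\"\\s+content=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns the keywords string, or null if no matching tag is found. */
    static String keywords(String html) {
        Matcher matcher = META_KEYWORDS.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"keywords\" "
                + "content=\"java, html, parsing\"></head></html>";
        System.out.println(keywords(html));
    }
}
```

A tag with the attributes in the other order, single quotes, or extra attributes in between would not match, which is part of why regular expressions are fragile for HTML.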

 
Junilu Lacar
Sheriff
I would just use one of the many libraries out there to parse the html then find the meta element with a keywords attribute.
 
Liutauras Vilda
Sheriff
http://www.oracle.com/technetwork/articles/java/json-1973242.html
Marcus, have a look at this. You could use JsonParser.
 
Junilu Lacar
Sheriff
Liutauras Vilda wrote:http://www.oracle.com/technetwork/articles/java/json-1973242.html
Marcus, have a look at this. You could use JsonParser.


He doesn't have JSON though, he's trying to parse HTML.
 
Liutauras Vilda
Sheriff
Junilu Lacar wrote:
Liutauras Vilda wrote:http://www.oracle.com/technetwork/articles/java/json-1973242.html
Marcus, have a look at this. You could use JsonParser.


He doesn't have JSON though, he's trying to parse HTML.

Sorry, I was too hasty. I meant Jsoup. Thanks for correcting me.
Marcus, please ignore my previous post; Junilu is absolutely right.
I do apologise for the misleading post.

Here is what I wanted to post for Marcus.
http://jsoup.org
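
To illustrate the parser-based approach without adding a dependency, here is a rough sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html) as a stand-in. It is not the Jsoup API, and a dedicated library is likely to cope better with real-world markup. The names MetaKeywordsParser and keywords and the sample HTML are assumptions made up for illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class MetaKeywordsParser {

    /** Returns the content of the first meta "keywords" tag, or null if none is found. */
    static String keywords(String html) {
        final String[] result = {null};

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <meta> has no closing tag, so it is reported as a "simple" tag.
                if (tag != HTML.Tag.META || result[0] != null) {
                    return;
                }
                Object name = attrs.getAttribute(HTML.Attribute.NAME);
                if (name != null && "keywords".equalsIgnoreCase(name.toString())) {
                    Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                    result[0] = (content == null) ? null : content.toString();
                }
            }
        };

        try {
            // 'true' tells the parser to ignore charset declarations in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        String html = "<html><head><title>t</title>"
                + "<meta name=\"keywords\" content=\"java, html, parsing\">"
                + "</head><body><p>Hello</p></body></html>";
        System.out.println(keywords(html));
    }
}
```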
 