Retrieving word lists using htmlunit and xpath

 
Tomasz Wontek
Greenhorn
Posts: 5
Hi,

Since I often compile various wordlists and it's horridly time-consuming, I have decided to automate at least part of the process. Alas, my programming knowledge was (and pretty much still is) virtually non-existent, so I had to learn from scratch. Luckily, I have managed to write some code, but there are obviously a few mistakes and I'm really clueless about what to do, so I would like to ask for help correcting it (if it's not too much to ask).

The first program is meant to retrieve a list of words from a Wiktionary category page and append the definitions from the corresponding pages.


The second, and much simpler, one is meant to retrieve the definition of a given word from a dictionary page.


I would be sincerely grateful for any kind of answer (excluding "get lost").
 
Stuart A. Burkett
Ranch Hand
Posts: 679
TellTheDetails
What problem are you having?
Does the code compile? If not, what errors are you getting?
If the code compiles and runs, what is it doing that it shouldn't? Or what is it not doing that it should?
 
Tomasz Wontek
Greenhorn
Posts: 5
In both cases, I am getting a NoClassDefFoundError. The exact output is as follows:

Exception in thread "main" java.lang.NoClassDefFoundError: org/w3c/css/sac/ErrorHandler
at javaapplication7.JavaApplication7.main(JavaApplication7.java:16)
Caused by: java.lang.ClassNotFoundException: org.w3c.css.sac.ErrorHandler
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 1 more
BUILD SUCCESSFUL (total time: 2 seconds)


I've just added commons-httpclient-3.1.jar but it still doesn't work.
 
Ulf Dittmer
Rancher
Posts: 42972
HtmlUnit needs a lot of other libraries besides HttpClient. This particular class is in a file called sac-1.3.jar.
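
One way to find the remaining gaps without hitting them one stack trace at a time is to probe the classpath up front. The following is only a minimal sketch: ClasspathCheck and missing are made-up names, not part of HtmlUnit, and the class names are simply the ones that appear in the stack traces in this thread, not a complete list of what HtmlUnit needs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of HtmlUnit): reports which of the given
// classes cannot be loaded from the current classpath.
class ClasspathCheck {

    static List<String> missing(String... classNames) {
        List<String> notFound = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run its static initializers
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // Class names copied from the stack traces posted in this thread.
        List<String> notFound = missing(
                "org.w3c.css.sac.ErrorHandler",
                "org.apache.xpath.compiler.FunctionTable",
                "org.mozilla.javascript.Context",
                "org.cyberneko.html.HTMLConfiguration");
        System.out.println(notFound.isEmpty() ? "All classes found" : "Missing: " + notFound);
    }
}
```

Run it with the same classpath as the real program. Anything it lists as missing points to a jar that still has to be added.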
 
Tomasz Wontek
Greenhorn
Posts: 5
After adding about 10 new libraries, I have started getting a NoSuchMethodError:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.xpath.compiler.FunctionTable.installFunction(Ljava/lang/String;Ljava/lang/Class;)I
at com.gargoylesoftware.htmlunit.html.xpath.XPathAdapter.initFunctionTable(XPathAdapter.java:82)
at com.gargoylesoftware.htmlunit.html.xpath.XPathAdapter.<init>(XPathAdapter.java:96)
at com.gargoylesoftware.htmlunit.html.xpath.XPathUtils.evaluateXPath(XPathUtils.java:130)
at com.gargoylesoftware.htmlunit.html.xpath.XPathUtils.getByXPath(XPathUtils.java:86)
at com.gargoylesoftware.htmlunit.javascript.host.HTMLCollection.getElements(HTMLCollection.java:263)
at com.gargoylesoftware.htmlunit.javascript.host.HTMLCollection.jsxGet_length(HTMLCollection.java:410)
at com.gargoylesoftware.htmlunit.javascript.host.Window.getWithFallback(Window.java:881)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$2.get(JavaScriptEngine.java:191)
at org.mozilla.javascript.ScriptableObject.getProperty(ScriptableObject.java:1544)
at org.mozilla.javascript.ScriptRuntime.getObjectProp(ScriptRuntime.java:1375)
at org.mozilla.javascript.ScriptRuntime.getObjectProp(ScriptRuntime.java:1364)
at org.mozilla.javascript.Interpreter.interpretLoop(Interpreter.java:2965)
at org.mozilla.javascript.Interpreter.interpret(Interpreter.java:2394)
at org.mozilla.javascript.InterpretedFunction.call(InterpretedFunction.java:162)
at org.mozilla.javascript.ContextFactory.doTopCall(ContextFactory.java:393)
at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:192)
at org.mozilla.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:2834)
at org.mozilla.javascript.InterpretedFunction.exec(InterpretedFunction.java:173)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$5.doRun(JavaScriptEngine.java:428)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:550)
at org.mozilla.javascript.Context.call(Context.java:577)
at org.mozilla.javascript.ContextFactory.call(ContextFactory.java:503)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:437)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:412)
at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:918)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeInlineScriptIfNeeded(HtmlScript.java:285)
at com.gargoylesoftware.htmlunit.html.HtmlScript.appendChild(HtmlScript.java:193)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.handleCharacters(HTMLParser.java:518)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:480)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:210)
at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:329)
at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:971)
at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:874)
at org.cyberneko.html.HTMLScanner$SpecialScanner.scan(HTMLScanner.java:2906)
at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:877)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:495)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:448)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:263)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:116)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:89)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:456)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:365)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:413)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:401)
at javaapplication8.JavaApplication8.main(JavaApplication8.java:22)
Java Result: 1

I don't know whether it's relevant, but org.apache.xpath.compiler.FunctionTable should be included in the xalan.jar that I have already added.
 
Ulf Dittmer
Rancher
Posts: 42972
It's not missing a class, it's missing a method called installFunction. Sounds like it's expecting a different Xalan version than is being used. Do you maybe have some other Xalan version in your classpath already?

10 libraries sounds low - my HtmlUnit lib folder has 21 jar files (which may not all be needed, though). Safest is probably to use all the ones that come with the HtmlUnit distribution.
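
To check which Xalan copy is actually being picked up, it can help to ask the JVM where it loaded the class from. A rough sketch follows. JarLocator, locationOf and hasMethodNamed are hypothetical names introduced here, and the method check looks at the name only, not at the exact (String, Class) signature from the error message.

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.security.CodeSource;

// Hypothetical diagnostic helper: shows where a class was loaded from and
// whether it declares a method with a given name.
class JarLocator {

    // Returns the jar or directory the class came from, or null if the
    // class loader does not report one (as can happen for core JDK classes).
    static URL locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? null : source.getLocation();
    }

    // Name-only check; overloads and return types are not compared.
    static boolean hasMethodNamed(Class<?> type, String methodName) {
        for (Method m : type.getDeclaredMethods()) {
            if (m.getName().equals(methodName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Class and method names taken from the NoSuchMethodError above.
        Class<?> type = Class.forName("org.apache.xpath.compiler.FunctionTable");
        System.out.println("Loaded from: " + locationOf(type));
        System.out.println("Has installFunction: " + hasMethodNamed(type, "installFunction"));
    }
}
```

If the printed location is not the xalan jar from the HtmlUnit distribution, some other copy is shadowing it on the classpath.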
 
Tomasz Wontek
Greenhorn
Posts: 5
Well, I have updated Xalan to the latest version and the result is still the same ...
 
Ulf Dittmer
Rancher
Posts: 42972
Don't use the latest - use the one that comes with HtmlUnit.
 
Tomasz Wontek
Greenhorn
Posts: 5
Okay, so I have replaced all the jars in the library with those included in the HtmlUnit package. It said that BrowserVersion.FIREFOX_2 cannot be found, but I changed it to FIREFOX_3_6 and that's no longer a problem. Unfortunately, it turned out to be only a minor thing, since later on I got a huge load of exceptions, errors and so on.
I don't want to litter the entire thread, so you can see the full list under the link. A more digestible, shortened version is here.

In the case of the second program, the list of exceptions is much more reader-friendly:
sie 16, 2013 9:24:03 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'text/javascript'.
sie 16, 2013 9:24:05 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'text/javascript'.
sie 16, 2013 9:24:05 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'text/javascript'.
sie 16, 2013 9:24:06 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'text/javascript'.
sie 16, 2013 9:24:06 PM com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument close
WARNING: close() called when document is not open.
sie 16, 2013 9:24:07 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Recursive src attribute of iframe: url=[about:blank]. Ignored.
sie 16, 2013 9:24:07 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'text/javascript'.
sie 16, 2013 9:24:07 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter warning
WARNING: warning: message=[Calling eval() with anything other than a primitive string value will simply return the value. Is this what you intended?] sourceName=[http://www.gstatic.com/bg/tjsDBsmLuJ7wXAjVgfghuWaq-F1tQ6NsQnyRzBFBjRY.js#1(eval)] line=[1] lineSource=[null] lineOffset=[0]
Exception in thread "main" java.lang.ClassCastException: java.util.ArrayList cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlTextInput
at javaapplication7.JavaApplication7.main(JavaApplication7.java:27)

Maybe I should add that both setJavaScriptEnabled and FIREFOX_3_6 are struck through by NetBeans, and that I get these exceptions even with a breakpoint set on
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
 
Ulf Dittmer
Rancher
Posts: 42972
I see lots of warnings that are caused by HTML and/or CSS not conforming to specifications; there isn't much you can do about that, except possibly turning off such warnings.
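
Turning the warnings off could look roughly like this. It is a sketch resting on one assumption: that HtmlUnit's log output ends up in java.util.logging, which the format of the messages above suggests. If another logging backend is on the classpath, this will have no effect. QuietHtmlUnit and silenceWarnings are made-up names.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: silences everything logged under the HtmlUnit
// package, assuming the messages are routed through java.util.logging.
class QuietHtmlUnit {

    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // configured logger could otherwise be garbage-collected and lose its level.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    static void silenceWarnings() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silenceWarnings();
        // Child loggers inherit the level from the package-level logger.
        Logger child = Logger.getLogger(
                "com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl");
        System.out.println("WARNING still logged: " + child.isLoggable(Level.WARNING));
    }
}
```

Call silenceWarnings() before creating the WebClient. Note that Level.OFF also hides genuine errors logged by HtmlUnit, so Level.SEVERE may be the safer choice while debugging.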

The only actual exception is in your code, where something is null. I can't correlate that message with the code you posted, so I'm not sure what's going on. A bit of debugging should tell you pretty quickly what object is null.

As to the cast exception, HtmlPage.getByXPath returns a List, not a single object, so that's what you need to cast it to.
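
Handling the List result might be sketched as follows. FirstResult and firstOf are hypothetical names, and the sample uses plain strings as a stand-in for the query result so that it runs without HtmlUnit on the classpath.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: picks the first element out of a List result and
// checks its type before casting, instead of casting the List itself.
class FirstResult {

    static <T> T firstOf(List<?> results, Class<T> expectedType) {
        if (results == null || results.isEmpty()) {
            return null; // the query matched nothing
        }
        Object first = results.get(0);
        if (!expectedType.isInstance(first)) {
            String actual = (first == null) ? "null" : first.getClass().getName();
            throw new IllegalStateException(
                    "Expected " + expectedType.getName() + " but got " + actual);
        }
        return expectedType.cast(first);
    }

    public static void main(String[] args) {
        // Stand-in for the List that an XPath query would return.
        List<?> results = Arrays.asList("first match", "second match");
        String first = firstOf(results, String.class);
        System.out.println(first);
    }
}
```

In the actual program the call would presumably take the result of page.getByXPath(...) and HtmlTextInput.class as arguments, and a null return would mean the XPath expression matched nothing on the page. As far as I know, HtmlUnit also has a getFirstByXPath method that returns a single node, which may be simpler if it exists in the version being used.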
 