
[Fatal Error] :1:1: Content is not allowed in prolog.

 
maria kumar
Greenhorn
Posts: 11
URL url = new URL("http://en.wikipedia.org/wiki/India");
HttpURLConnection urlCon = (HttpURLConnection) url.openConnection();
InputSource source = new InputSource(urlCon.getInputStream());
//source.setEncoding("UTF-8");
source.setEncoding("ISO-8859-1");

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);
dbf.setIgnoringComments(false);
dbf.setIgnoringElementContentWhitespace(true);
dbf.setNamespaceAware(true);
dbf.setCoalescing(true);
dbf.setExpandEntityReferences(true);
DocumentBuilder builder = dbf.newDocumentBuilder();
builder.setEntityResolver(new NullResolverForTest());
Document doc = builder.parse(source);

class NullResolverForTest implements EntityResolver {
    public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException {
        return new InputSource(new StringReader(""));
    }
}

The above code works for some websites. How can I set the correct encoding so that it works for all websites?

Thanks,
Maria.
 
Rob Spoor
Sheriff
Posts: 20817
You shouldn't use an XML parser for HTML pages. HTML pages are notorious for not being well-formed XML documents, and nothing requires them to be. Use an HTML parser instead.
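To illustrate the point, here is a minimal sketch using only the JDK's built-in parser. It feeds three strings to a DocumentBuilder and reports whether each one is accepted. The class name PrologDemo, the helper parseError, and the sample strings are made up for this illustration; they are not from the original post. The second sample assumes one plausible scenario (UTF-8 byte-order-mark bytes decoded as ISO-8859-1), which may or may not be what happened with the Wikipedia page.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class PrologDemo {

    // Hypothetical helper: returns null if the text parses as XML,
    // otherwise the message of the exception thrown by the parser.
    static String parseError(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(text)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed XML: accepted, so this prints "null".
        System.out.println(parseError("<root><child/></root>"));

        // Stray characters before the root element (here: the three characters
        // that UTF-8 BOM bytes turn into when decoded as ISO-8859-1): rejected.
        System.out.println(parseError("\u00EF\u00BB\u00BF<root/>"));

        // Ordinary HTML with an unclosed <br>: also rejected by an XML parser.
        System.out.println(parseError("<html><body>line one<br>line two</body></html>"));
    }
}
```

The second and third calls return a non-null error message, and the exact wording depends on the JDK and locale. An HTML parser is built to tolerate input like the third sample, which is why it is the better tool here.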
 
maria kumar
Greenhorn
Posts: 11
Yes, Rob, you are right. But my intention is to get a DOM Document object. Can you tell me which HTML parser produces a W3C DOM object (org.w3c.dom.Document)?
 
Rob Spoor
Sheriff
Posts: 20817
Check out http://java-source.net/open-source/html-parsers
 