
Parsing RSS2.0 feeds using XML Pull Parser

 
Monu Tripathi
Rancher
Posts: 1369
1
Android Eclipse IDE Java
I am trying to parse an RSS 2.0 feed, fetched from a remote server, on my Android device using the XML Pull Parser.

I am getting invalid token exceptions after a few items have been parsed:
Error parsing document. (position:line -1, column -1) caused by: org.apache.harmony.xml.ExpatParser$ParseException: At line 158, column 25: not well-formed (invalid token)

Strangely, when I download the feed XML onto the device, bundle it as an application asset, and then run the same code, everything works fine.
If I request XML validation with parser.setFeature(XmlPullParser.FEATURE_VALIDATION, true), parsing fails immediately. Eventually, I am going to ask the providers of the service to validate the XML at their end.

Could character encoding be the problem here, given that I can parse the feed when I read it locally?

Thanks.
P.S: have also asked this question here
 
Paul Clapham
Sheriff
Posts: 21314
32
Eclipse IDE Firefox Browser MySQL Database
You are getting this XML in response to an HTTP request? Then it's possible that the encoding declared in the XML is different than the charset of the response. The rule in this case is that the charset of the response should be used by the parser, rather than the encoding declared by the XML.

However, you're passing an InputStream to the parser, so the parser has no way to find out the charset of the response. Try passing the URL of your HTTP request instead and let the parser deal with the response directly. Alternatively, get the charset from the response and construct a Reader which uses that charset.
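The second option above (read the charset from the response and wrap the stream in a Reader) could look roughly like the sketch below. It uses only JDK classes; the class name FeedReaderFactory and the methods charsetFromContentType and openFeed are hypothetical names chosen for illustration, and falling back to UTF-8 when the header names no charset is an assumption, not something the thread specifies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class FeedReaderFactory {

    /** Returns the charset named in a Content-Type value, or the fallback if none is usable. */
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the "charset=" parameter name.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Opens the feed and decodes it with the charset the server declared (assumed fallback: UTF-8). */
    static Reader openFeed(String feedUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(feedUrl).openConnection();
        Charset charset = charsetFromContentType(conn.getContentType(), StandardCharsets.UTF_8);
        InputStream in = conn.getInputStream();
        // The returned Reader can then be handed to the parser, e.g. parser.setInput(reader).
        return new InputStreamReader(in, charset);
    }
}
```

Because the Reader has already decoded the bytes, the parser no longer has to guess the encoding from the XML declaration.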
 
Monu Tripathi
Rancher
Posts: 1369
1
Android Eclipse IDE Java
Thanks Paul; will try this out
 
Monu Tripathi
Rancher
Posts: 1369
1
Android Eclipse IDE Java
Okay I tried this out and here's the update ...
Paul Clapham wrote:
You are getting this XML in response to an HTTP request? Then it's possible that the encoding declared in the XML is different than the charset of the response. The rule in this case is that the charset of the response should be used by the parser, rather than the encoding declared by the XML.

Yes. This XML is an HTTP response. The charset of the response is utf-8 (as set in the header Content-Type: text/xml;charset=utf-8). When I open the link in the browser and save it as XML, the XML declaration also shows the encoding as "utf-8":
<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">


Paul Clapham wrote:
However you're passing an InputStream to the parser, so the parser has no way to find out what was the charset of the response. Try passing the URL of your HTTP request instead and let the parser deal with the response directly. Or alternatively, get the charset from the response and construct a Reader which uses that charset.

As per the Javadoc of the pull parser bundled with the Android SDK, when I call setInput on the parser instance with a null encoding, the parser tries to determine the encoding itself:
public abstract void setInput (InputStream inputStream, String inputEncoding)

Since: API Level 1
Sets the input stream the parser is going to process. This call resets the parser state and sets the event type to the initial value START_DOCUMENT.
NOTE: If an input encoding string is passed, it MUST be used. Otherwise, if inputEncoding is null, the parser SHOULD try to determine input encoding following XML 1.0 specification (see below). If encoding detection is supported then following feature http://xmlpull.org/v1/doc/features.html#detect-encoding MUST be true and otherwise it must be false.
Parameters
inputStream: contains a raw byte input stream of possibly unknown encoding (when inputEncoding is null).
inputEncoding: if not null it MUST be used as encoding for inputStream

I tried setting the encoding explicitly to "utf-8", but this still doesn't work; I get exceptions.

When I inspected the HTTP traffic with a proxy (Charles Proxy) and tried to view the response XML, the tool reported an invalid Unicode character in a CDATA section, so it could not parse the XML to populate its view:
[Failed to parse data: org.xml.sax.SAXParseException: An invalid XML character(Unicode 0x12) was found in CDATA section.]

Maybe I should try creating a Reader with the appropriate charset (utf-8) and passing that to the parser?
 
Monu Tripathi
Rancher
Posts: 1369
1
Android Eclipse IDE Java
I should also mention that when I open the feed XML in Opera, I get a parse exception: illegal Unicode character (0x12). Safari has no issues; its character encoding is set to Default, which I believe is Western (ISO Latin 1).

EDIT: It seems Safari removes those characters, because I don't see accents in the text.
 
Monu Tripathi
Rancher
Posts: 1369
1
Android Eclipse IDE Java
Okay. U+0012 is a control character that XML 1.0 does not allow, whatever the document's encoding (see the Char production in the XML recommendation). Maybe I should just escape or drop the offending characters in the document? Or should I ask the providers of the service to fix this at their end?
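If dropping the offending characters is the route taken, one possible sketch is below. It keeps only code points matched by the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF) and discards everything else, including U+0012. The class name XmlCharFilter and its method names are hypothetical, and the sketch assumes the feed is small enough to hold in memory as a String.

```java
// Hypothetical helper, not part of any library.
final class XmlCharFilter {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every code point that XML 1.0 forbids removed. */
    static String stripInvalidXmlChars(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        for (int i = 0; i < xml.length(); ) {
            // Iterate by code point so surrogate pairs are kept together;
            // an unpaired surrogate falls outside the allowed ranges and is dropped.
            int cp = xml.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

The cleaned string can be wrapped in a java.io.StringReader and passed to the parser. Note that this silently changes the feed content (whatever the bad byte was meant to be is lost), so asking the provider to fix the feed remains the proper remedy.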
 
sudhin philip
Greenhorn
Posts: 1
Hi,

I think you are facing this issue because the XML is not well-formed. Here is a workaround: clean up the XML before parsing it by replacing each bare '&' with "&amp;". This kind of error can arise when a raw '&' appears in the XML data.

For example, if the string xmlString holds your XML data, then

// Escape only ampersands that do not already start an entity or character reference.
xmlString = xmlString.replaceAll("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)", "&amp;");

will give you a string in which every bare '&' is escaped. (A plain replaceAll("&", "&amp;") would also rewrite entities that are already escaped, turning "&amp;" into "&amp;amp;".)

Hope this saves you at least a few minutes.

Regards,
Sudhin Philip.
 