
XML file stream reading giving issues

 
Mansukhdeep Thind
Ranch Hand
Posts: 1163
Eclipse IDE Firefox Browser Java
Hi there

I've been stuck on this frustrating problem for a couple of days now. I'm attempting to read from a buffered stream with code along the lines of:



The stream is essentially an XML file with attributes and their values. The problem is that when it reaches a particular tag that has two attribute=value pairs, for example <Tag attr1=val1 attr2=val2> </Tag>, it concatenates the value of the first attribute and the name of the second attribute when reading that line, so it becomes <Tag attr1=val1attr2=val2></Tag>. Then, when I parse that XML string using JAXB, the unmarshalling API throws an exception saying that the element <Tag> must be followed by /> or an attribute.

Any ideas?
 
Paul Clapham
Sheriff
Posts: 22697
43
Eclipse IDE Firefox Browser MySQL Database
One thing I notice is that your posted code reads the input, line by line, and appends the line to some object, dropping the line-feed characters between the lines. So if a fragment of your XML looked like this:



then that object would contain this substring:

<Tag attr1="val1"attr2="val2">


Which is indeed not valid XML. There's no reason to drop the line-feed characters from your XML, the parser will deal with them appropriately, and as you can see it's possible that dropping them makes the XML malformed. So I'd start by not doing that any more.
 
Mansukhdeep Thind
Ranch Hand
Posts: 1163
Eclipse IDE Firefox Browser Java
This is the part of the XML that's causing the issue: the symptom-category tag. It has two attributes, and there is whitespace between the value of the first and the name of the second.



Why should it then concatenate it while reading and give me :

(no space)
 
Mansukhdeep Thind
Ranch Hand
Posts: 1163
Eclipse IDE Firefox Browser Java
Paul Clapham wrote:One thing I notice is that your posted code reads the input, line by line, and appends the line to some object, dropping the line-feed characters between the lines. So if a fragment of your XML looked like this:



then that object would contain this substring:

<Tag attr1="val1"attr2="val2">


Which is indeed not valid XML. There's no reason to drop the line-feed characters from your XML, the parser will deal with them appropriately, and as you can see it's possible that dropping them makes the XML malformed. So I'd start by not doing that any more.


That is indeed what's happening: it takes that single line of the XML as two separate lines and then appends them to the string buffer. But I'm still trying to figure out why, because I already spaced the line properly in the XML.
 
Paul Clapham
Sheriff
Posts: 22697
43
Eclipse IDE Firefox Browser MySQL Database
  • Likes 1
Does the problem go away if you ask JAXB to parse directly from the InputStream? (In other words is it really necessary for you to make a modified copy of the data in memory before parsing it?)
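A minimal sketch of what "parse directly from the stream" can look like. To keep it runnable with nothing but the JDK, it uses the built-in DOM parser rather than JAXB; with JAXB the analogous call would be Unmarshaller.unmarshal(InputStream). The class name, helper name, and sample XML are hypothetical, not taken from the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DirectParseDemo {
    // Hands the stream straight to the parser (no intermediate String copy)
    // and returns one attribute of the root element.
    static String rootAttribute(InputStream in, String attributeName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            return doc.getDocumentElement().getAttribute(attributeName);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML from stream", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: the two attributes are separated by a line break.
        String xml = "<Tag attr1=\"val1\"\n     attr2=\"val2\"></Tag>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(rootAttribute(in, "attr2")); // prints val2
    }
}
```

Because the parser sees the original bytes, a line break between two attributes is treated as ordinary whitespace and nothing gets glued together.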
 
g tsuji
Ranch Hand
Posts: 697
3
Doesn't it matter that you close symptom-category twice, as you can see in what you posted yourself?
 
Tony Docherty
Bartender
Posts: 3270
82
I suggest you add a System.out.println() statement to print out the value of 'line' before you add it to docContent. This will show if your symptom-category tag contains a newline char as Paul suspects is the case.
However, if it prints that line out as one line without the space between the tags, then you need to check your XML file to see what that whitespace character actually is; it clearly isn't a standard ASCII space character. Write some code to read in that line byte by byte and dump the byte values to the console. An ASCII space is 0x20 (20H).
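A rough sketch of such a byte dump, assuming the text is UTF-8. The class name, helper name, and sample string are hypothetical; the sample deliberately mixes a real space, a tab, and a non-breaking space so the difference shows up in the output.

```java
import java.nio.charset.StandardCharsets;

class HexDumpDemo {
    // Formats each byte as a two-digit upper-case hex value, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: ASCII space, tab, then a non-breaking space (U+00A0).
        String line = "a b\tc\u00A0d";
        System.out.println(toHex(line.getBytes(StandardCharsets.UTF_8)));
        // prints 61 20 62 09 63 C2 A0 64
    }
}
```

To inspect the real file rather than a string, the same helper could be fed the result of Files.readAllBytes(...). A 0A or 0D in the middle of the tag would explain readLine() splitting it in two.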
 
Mansukhdeep Thind
Ranch Hand
Posts: 1163
Eclipse IDE Firefox Browser Java
Tony Docherty wrote:I suggest you add a System.out.println() statement to print out the value of 'line' before you add it to docContent. This will show if your symptom-category tag contains a newline char as Paul suspects is the case.
However, if it prints that line out as one line without the space between the tags, then you need to check your XML file to see what that whitespace character actually is; it clearly isn't a standard ASCII space character. Write some code to read in that line byte by byte and dump the byte values to the console. An ASCII space is 0x20 (20H).


I replaced readLine() with read(). Reading character by character resolved the issue. It was reading a single line in the XML as two separate lines. Thank you for your suggestions, everyone.
 
Campbell Ritchie
Marshal
Posts: 56202
171
I am not happy with that solution. The read() method is really awkward to use, and may give slower performance than readLine() (not certain about performance). I do my level best to avoid read(), so there should be a better way to do this with readLine(). Why are you not using an XML parser? Can you check whether each line is closed with a tag corresponding to its opening tag, and if not, concatenate two lines?
 
Paul Clapham
Sheriff
Posts: 22697
43
Eclipse IDE Firefox Browser MySQL Database
Campbell Ritchie wrote:I am not happy with that solution. The read() method is really awkward to use, and may give slower performance than readLine() (not certain about performance). I do my level best to avoid read(), so there should be a better way to do this with readLine(). Why are you not using an XML parser? Can you check whether each line is closed with a tag corresponding to its opening tag, and if not, concatenate two lines?


The better way to do this with readLine() would be to append a line-feed character to the StringBuilder (or whatever it was) after each line. That's because the readLine() method drops the line-feed character after each line, thus damaging the XML document.
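Roughly, the difference looks like this. This is only a sketch: the class and method names are made up, docContent is borrowed from the name mentioned earlier in the thread, and the sample XML is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadLineDemo {
    // Reads everything line by line. readLine() strips the line terminator,
    // so it is only present in the result if we put one back ourselves.
    static String readAll(Reader source, boolean keepLineBreaks) {
        StringBuilder docContent = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                docContent.append(line);
                if (keepLineBreaks) {
                    docContent.append('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docContent.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: a line break sits between the two attributes.
        String xml = "<Tag attr1=\"val1\"\nattr2=\"val2\"></Tag>";

        // Without the line feed the attributes run together (malformed XML).
        System.out.println(readAll(new StringReader(xml), false));

        // With the line feed restored the whitespace between attributes survives.
        System.out.println(readAll(new StringReader(xml), true));
    }
}
```

The first call yields <Tag attr1="val1"attr2="val2"></Tag>, which is the malformed string described above; the second keeps the attributes separated.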
 
Mansukhdeep Thind
Ranch Hand
Posts: 1163
Eclipse IDE Firefox Browser Java
Campbell Ritchie wrote:I am not happy with that solution. The read() method is really awkward to use, and may give slower performance than readLine() (not certain about performance). I do my level best to avoid read(), so there should be a better way to do this with readLine(). Why are you not using an XML parser? Can you check whether each line is closed with a tag corresponding to its opening tag, and if not, concatenate two lines?


I'm using JAXB to unmarshal the XML. It makes life easier than using a SAX/DOM parser. I made sure that the XML is formatted properly and that all the tags are properly closed. That didn't resolve the issue either. I understand your concern about using read(). But I couldn't, for the life of me, figure out why it was treating a space as a line feed when appending to the StringBuilder.
 
g tsuji
Ranch Hand
Posts: 697
3
If you look at the posted XML, you will see that the id of the symptom-category appears exactly like this:
id=="MechanicalIssues-?\3589BF"
whereas if you copy and paste it into a text editor, it appears like this:
id=="MechanicalIssues-8A3589BF"

Now, the particle 8A (in the place of ?\) appears everywhere in the @id, more specifically at the beginning of 2nd component of it after the hyphen. For instance, the first symptom's id is read like this:
id="BlownFuse-8A3589C0"
See the appearance of 8A?

You need to look at the symptom-category's id in a hex editor to see how it is actually encoded at the byte level.
 