I'm doing some work on an XML parser. I've got a class that wraps a BufferedReader and provides access to the characters in the text file during parsing via the BufferedReader's read() method.
I'm having some trouble counting the number of lines in the text file while I parse it. (I need to recognize the newline characters as I read characters from the file one at a time. I can't use the readLine() method.)
Here is the code I hoped would recognize the newline characters in a platform-independent way:
lineIndex is a local variable that keeps track of the newline characters. However, I ran this code in a JUnit test with a text file on my Windows XP machine, and lineIndex never gets incremented. Is there a problem with the above code?
That system property is 2 characters long on Windows, so if you compare 1 character at a time to it, you will never have a match.
However: if you are writing an XML parser, then that logic should not be in your program anyway. XML is designed to be platform-independent and its rules on how to handle line-feeds and carriage-returns do not vary based on the operating system.
Edit: You should be comparing the characters to '\n' for line-feed and '\r' for carriage-return. [ December 01, 2006: Message edited by: Paul Clapham ]
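To make the point above concrete, here is a minimal sketch. It is not the original poster's code (which isn't shown here); the class and method names are made up for illustration. It contrasts comparing a single character against the whole separator string with comparing it against '\n' and '\r' directly.

```java
// Hypothetical helper class; names are illustrative only.
final class SeparatorCheck {

    /**
     * The comparison that cannot succeed on Windows: one character
     * turned into a String and compared with the full separator.
     * With a two-character separator such as "\r\n" this is always false.
     */
    static boolean matchesSeparator(int c, String separator) {
        return String.valueOf((char) c).equals(separator);
    }

    /**
     * Compare against the individual characters instead.
     * The parameter is an int because Reader.read() returns an int (-1 at end of stream).
     */
    static boolean isLineBreakChar(int c) {
        return c == '\n' || c == '\r';
    }
}
```

Calling `SeparatorCheck.matchesSeparator('\r', System.getProperty("line.separator"))` on a Windows machine would return false for every character read, which would explain a counter that never moves.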
What do you want with those newline characters anyway? They're completely irrelevant to the XML structure of the document. In fact many network services will strip them (and other whitespace) to reduce bandwidth requirements and download times (mostly relevant for slow connections or extreme volumes of course).
I agree with the previous comments, and have some more:
[Landon]: Here is the code I hoped would recognize the newline characters in a platform-independent way:
Unfortunately, using System.getProperty("line.separator") is really a platform-independent way to access platform-specific behavior. You can use it to find out what the line separator is on the current machine. But what if you're parsing a file that was written on another machine? Or written using an application that uses a different concept of line separator? Many people write Java programs that just use \n for a line separator, even on Windows. While this may be regarded as "wrong", it's common.
You should probably take a look at what the XML specs say about line breaks. You've got a choice between XML 1.0 and 1.1:
The thing to watch for here is that \n\n is two line breaks, \r\r is two line breaks, but \r\n is just one line break. So you need to write some logic to handle this. If you use XML 1.1 you need to check a couple other characters as well.
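The logic described above could be sketched roughly like this. It is only one possible approach, covering just the XML 1.0 characters ('\r' and '\n') and not the extra XML 1.1 ones. LineCountingReader and its methods are hypothetical names, not anything from the original poster's code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical wrapper that counts line breaks while handing out characters one at a time.
final class LineCountingReader {
    private final Reader in;
    private int lineBreaks = 0;
    private boolean lastWasCR = false; // true if the previous character was '\r'

    LineCountingReader(Reader in) {
        this.in = in;
    }

    /** Reads one character, updating the line-break count; returns -1 at end of stream. */
    int read() throws IOException {
        int c = in.read();
        if (c == '\r') {
            lineBreaks++;          // a '\r' always starts a new line break
            lastWasCR = true;
        } else if (c == '\n') {
            if (!lastWasCR) {
                lineBreaks++;      // a lone '\n' is a line break of its own
            }                      // otherwise it is the tail of "\r\n", already counted
            lastWasCR = false;
        } else {
            lastWasCR = false;
        }
        return c;
    }

    int getLineBreaks() {
        return lineBreaks;
    }

    /** Convenience helper for trying the counter out on a String. */
    static int countLineBreaks(String text) {
        LineCountingReader reader = new LineCountingReader(new StringReader(text));
        try {
            while (reader.read() != -1) {
                // consume every character
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reader.getLineBreaks();
    }
}
```

With this sketch, `countLineBreaks("a\n\nb")` and `countLineBreaks("a\r\rb")` both give 2, while `countLineBreaks("a\r\nb")` gives 1. Remembering only whether the previous character was '\r' keeps the state small and avoids needing mark/reset or a pushback reader.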
[Jeroen]: What do you want with those newline characters anyway?
One possible use: if you're writing a parser, it's nice to keep track of the current line number so that if you need to report an error, you can accurately describe where it is.
Jeroen T Wenting
posted 12 years ago
ah yes, hadn't considered that.
It would probably be better to include the relevant fragment of the document in the error message at the very least, though, since (as I said) a lot of XML doesn't contain line breaks but is a batch stream of data.
Yeah, I'd be inclined to include line #, column #, and an excerpt of the surrounding text. Some parts of that may be less useful for some applications than others, but all those have a good chance of being useful to some users at least.
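As a rough illustration of tracking all three pieces of information, here is a small sketch. PositionTracker, its methods, and the 20-character excerpt length are all assumptions made up for this example, and only '\r' and '\n' are treated as line breaks.

```java
// Hypothetical tracker fed with every character the parser reads.
final class PositionTracker {
    private static final int EXCERPT_LENGTH = 20; // arbitrary size chosen for illustration

    private int line = 1;
    private int column = 0;
    private boolean lastWasCR = false;
    private final StringBuilder recent = new StringBuilder(); // last few characters on the current line

    /** Call once for each value returned by Reader.read(). */
    void accept(int c) {
        if (c < 0) {
            return; // end of stream, nothing to record
        }
        if (c == '\n' && lastWasCR) {
            lastWasCR = false; // second half of "\r\n", already counted
            return;
        }
        lastWasCR = (c == '\r');
        if (c == '\r' || c == '\n') {
            line++;
            column = 0;
            recent.setLength(0);
        } else {
            column++;
            recent.append((char) c);
            if (recent.length() > EXCERPT_LENGTH) {
                recent.deleteCharAt(0); // keep the excerpt bounded for documents without line breaks
            }
        }
    }

    /** Builds a location string suitable for an error message. */
    String describe() {
        return "line " + line + ", column " + column + ", near \"" + recent + "\"";
    }

    /** Convenience helper for trying the tracker out on a String. */
    static String describeAfter(String text) {
        PositionTracker tracker = new PositionTracker();
        for (int i = 0; i < text.length(); i++) {
            tracker.accept(text.charAt(i));
        }
        return tracker.describe();
    }
}
```

Because the excerpt is capped at a fixed length, the message stays readable even when the whole document arrives as one long line.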
I'd like to thank all of the posters for their comments. They've helped me realize that this is a little trickier than I thought it was.
I'll post to an XML developers list that I subscribe to. Perhaps they'll have some suggestions on the best way to handle this. I may have to get back to you guys on some help with implementing the solution.
I've been chewing on this problem some more. After reading the responses I received, I think I need to try to detect any character or combination of characters that the XML specification treats as a line break.
Following the XML specification sounds right to me, if you are planning to write an XML parser. But if you're asking how to count lines as you do that, I don't believe the XML spec has anything to say about counting lines. So you can really do what you like there. I suppose it's up to you how you deal with, for example, normalization of attributes. If an attribute contains a line-feed character, which you normalize by removing it, did that line-feed still count as a line ending for the purpose of your line count?
I have to ask, why are you writing an XML parser anyway? Don't any of the existing parsers satisfy your requirements?