This thread was just referenced elsewhere, so I'll add some more now, months later.
I don't believe that character references to illegal characters (e.g. control characters) are any more legal than the characters themselves. Some parsers may allow them, but they're illegal according to the XML spec. The list of allowed characters refers to parsed entities - meaning it tells you what's allowed after the character references have already been interpreted as their equivalent Unicode characters. (Again, see the spec for a list of what's legal.)
Note also that character values over Byte.MAX_VALUE are by no means illegal. In James Swan's code, I suspect the problem he solved was that there was some confusion over what encoding was used in a file, and so converting to Unicode references solved the problem. But Thomas Goorden's problem seems to be characters like a vertical tab (#x0B) or a start-of-text character (#x02). These are well under the Byte.MAX_VALUE limit, but quite illegal nonetheless.
So it seems the best solution is probably to replace them with spaces. From Thomas's comments here, it sounds like he's on the right track, but the problem is he can't successfully read the characters in the first place to be able to replace them. The exception
UTFDataFormatException: invalid byte 3 of 3-byte UTF-8 sequence (0x3f)
implies that Thomas has successfully created a reader that assumes UTF-8 encoding, but the data is not actually in UTF-8. That's contrary to the Microsoft spec - what a surprise. The real problem seems to be finding out what encoding is really used. I recommend just concentrating on creating a reader that can read the whole file without throwing an exception - forget about parsing as XML until you can do that.
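Here is a minimal sketch of such a reader. The class name EncodingProbe and the readAll helper are names I made up for illustration, and the sample text in main is invented too. One caveat: a plain InputStreamReader does not throw on bytes that are malformed for the chosen encoding. It substitutes U+FFFD instead, so replacement characters (or stray NULs) in the output are your hint that the guess was wrong.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

class EncodingProbe {

    // Reads the whole stream as text, decoding the bytes with the named encoding.
    static String readAll(InputStream in, String encoding) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, encoding))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // Usage: java EncodingProbe <file> <encoding>
            try (InputStream in = new FileInputStream(args[0])) {
                System.out.println(readAll(in, args[1]));
            }
            return;
        }
        // No arguments: decode the same (made-up) UTF-16 bytes three ways
        // to show how a wrong guess looks compared to the right one.
        byte[] sample = "<doc>caf\u00e9</doc>".getBytes("UTF-16");
        for (String enc : new String[] {"UTF-8", "ISO-8859-1", "UTF-16"}) {
            System.out.println(enc + ": "
                + readAll(new ByteArrayInputStream(sample), enc));
        }
    }
}
```

Run it against the real file with each candidate encoding in turn; the encoding names are whatever your JVM supports.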
Experiment with different encodings (UTF-16 is just one possibility to try) and see how your output looks.
Another option is to make sure the file has a .xml extension, and then open it with Internet Explorer 5.50 or later. Go to View -> Encoding to specify a different encoding to use. (You may need to make sure you've got the appropriate fonts installed, if the file contains foreign characters.) You may not get as many encoding choices as are ultimately available in Java, but it's easy to use for a quick answer in many cases. Once you know what encoding is really used, try inserting a proper XML encoding declaration into the file, and try again to parse it. Your problems may suddenly go away.

If not, go back to reading each char and replacing the illegal ones.
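
A rough sketch of that replacement step, assuming the text has already been decoded into a String. The ranges in isLegalXmlChar follow the Char production of XML 1.0; the class and method names are hypothetical, chosen just for this example.

```java
class XmlCharCleaner {

    // True if the code point is allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the text with every illegal code point replaced by a space.
    static String replaceIllegal(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt keeps surrogate pairs together; an unpaired
            // surrogate comes back as itself and is treated as illegal.
            int cp = text.codePointAt(i);
            sb.appendCodePoint(isLegalXmlChar(cp) ? cp : ' ');
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up input containing a vertical tab (#x0B) and a start-of-text char (#x02).
        String dirty = "<a>one" + (char) 0x0B + "two" + (char) 0x02 + "three</a>";
        System.out.println(replaceIllegal(dirty));
    }
}
```

Note that characters above Byte.MAX_VALUE pass through untouched; only the code points outside the ranges above are replaced.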