Problem reading binary file generated with IBM1047 encoding

 
Y Zhao
Greenhorn
Hi all,

We have a binary file generated on an IBM mainframe using IBM1047 encoding. When we read it using FileInputStream, we always get hex 20 instead of hex 00 for low values. The mainframe support people checked on their system and did see hex 00 for those bytes.

Has anyone out there experienced anything like this, or does anyone have suggestions on how to read it?

Below is the code:

FileInputStream file = new FileInputStream("C:/temp/testData.out");
byte[] data =new byte[100];
int readBytes = file.read(data, 0, 100);

while (readBytes > 0) {
....
process(data);
....
}
 
Paul Clapham
Sheriff
When you read it like that, you're not using the IBM1047 encoding. In fact, you're not using any encoding at all: you're just reading the bytes in the file without translating them in any way.

And you said you had a "binary" file -- if that means it doesn't contain text but some other kind of data, it doesn't make sense to talk about its encoding. Encodings are for converting between text (Unicode characters) and bytes.

If you do have text, then you should specify the encoding by using an InputStreamReader, like this:
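
The snippet that belonged here did not survive in this copy of the thread. What follows is only a minimal sketch of the idea described, assuming the whole stream is EBCDIC text and that the runtime ships the Cp1047 charset (it is one of the extended charsets, so a trimmed-down runtime may lack it). The class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class Cp1047TextReader {
    // Decodes everything in the stream as Cp1047 (EBCDIC) text.
    static String readText(InputStream in) {
        Charset ebcdic = Charset.forName("Cp1047"); // alias of IBM1047
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, ebcdic)) {
            char[] chunk = new char[256];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n); // only the chars actually read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // 0xC8 0xC5 0xD3 0xD3 0xD6 spells "HELLO" in EBCDIC.
        byte[] sample = {(byte) 0xC8, (byte) 0xC5, (byte) 0xD3, (byte) 0xD3, (byte) 0xD6};
        System.out.println(readText(new ByteArrayInputStream(sample)));
    }
}
```

For a real file, the same method could be handed a FileInputStream instead of the in-memory sample.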



By the way, note that I used "Cp1047" and not "IBM1047"; have a look at Supported Encodings to see why.
 
Y Zhao
Greenhorn
Hi Paul,

Thanks for your reply. Yes, the file is a binary file; the text fields are also written as bytes. We read the bytes in and decode the text fields into strings using Charset IBM1047, so that we don't have to get the bytes back for the binary fields. If you have done it your way before without problems, we'll try it and get the bytes for those binary fields.

Thanks again for your help.
 
Tim Holloway
Bartender
IBM1047 is an EBCDIC code page. Among other things, the space character is 0x40 instead of 0x20 like in ASCII and Unicode (which is ASCII-derived). Other salient features are that digits have higher code values than letters do (0xF0-0xF9) and that there are "gaps" between blocks of letter codes, some of which contain non-alphanumeric character codes.
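
Those code values can be inspected from Java itself. This is a small sketch, assuming the IBM1047 charset is available in the runtime; the class and method names are illustrative only.

```java
import java.nio.charset.Charset;

final class EbcdicCodePoints {
    private static final Charset IBM1047 = Charset.forName("IBM1047");

    // Returns the single EBCDIC byte for a character, as an unsigned value.
    static int codeOf(char c) {
        byte[] encoded = String.valueOf(c).getBytes(IBM1047);
        return encoded[0] & 0xFF;
    }

    public static void main(String[] args) {
        // 'I' -> 'J' and 'R' -> 'S' show the gaps between the letter blocks.
        for (char c : new char[] {' ', 'A', 'I', 'J', 'R', 'S', 'Z', '0', '9'}) {
            System.out.printf("'%c' -> 0x%02X%n", c, codeOf(c));
        }
    }
}
```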

If you open this as a text input stream with IBM1047 as the charset encoding, the resulting text that you read should be delivered to you in its proper native Java form (Unicode).

The reason that EBCDIC exists is that back in the early Cretaceous period, every computer manufacturer devised its own set of character codes, sometimes on a per-device/CPU basis. Back then, just getting bits read was a challenge, so the codes tended to conform to whatever was cheapest to design into the hardware. EBCDIC was preceded by a 6-bit code (BCDIC) and influenced by punched-card equipment, which is how it ended up with so many warts. EBCDIC itself rose to prominence when IBM started producing standardized mainframe architectures (S/360, S/370, AS/400 and so forth) as it was their native character set.

Fun fact: EBCDIC was one of the reasons - although not the only one - that the early Internet protocols such as HTTP, SMTP, POP and IMAP are all text-based. The original DARPA Internet consisted of many different brands and models talking to each other and you never could be sure what the native character set of either the end nodes or any intermediate nodes would be.

Character-code translation can be done with a simple table lookup (in fact, the IBM 360 series and descendants could do it in a single instruction). Binary protocols are much trickier, though. There are 3 different ways to order bytes in a 4-byte binary value and the floating-point options were totally insane. Modern-day zSeries mainframes actually have 2 separate sets of floating-point computations, one for the legacy S/360 format and one for the IEEE format that's the spec form for Java.
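
The table-lookup idea can be sketched in Java as a 256-entry array indexed by the EBCDIC byte value. This is an illustration only: the table is built from the runtime's own IBM1047 charset (assumed to be installed), and the class and method names are invented for the example.

```java
import java.nio.charset.Charset;

final class EbcdicTable {
    // TABLE[b] is the Unicode character for EBCDIC byte value b.
    private static final char[] TABLE = buildTable();

    private static char[] buildTable() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        // IBM1047 is a single-byte charset, so this yields one char per byte.
        return new String(all, Charset.forName("IBM1047")).toCharArray();
    }

    // Translates EBCDIC bytes to a String with one lookup per byte.
    static String translate(byte[] ebcdic) {
        char[] out = new char[ebcdic.length];
        for (int i = 0; i < ebcdic.length; i++) {
            out[i] = TABLE[ebcdic[i] & 0xFF]; // mask to get an unsigned index
        }
        return new String(out);
    }

    public static void main(String[] args) {
        byte[] hello = {(byte) 0xC8, (byte) 0xC5, (byte) 0xD3, (byte) 0xD3, (byte) 0xD6};
        System.out.println(translate(hello));
    }
}
```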
 
Paul Clapham
Sheriff
Y Zhao wrote: Yes, the file is a binary file, the text fields are also written as bytes.


That's still confusing. It doesn't say whether the fields are all text fields, or whether there are some other fields which aren't text, like maybe packed decimal fields for example.
 
Tim Holloway
Bartender
Paul's right. Mainframe files aren't usually straight text unless they're program source code or something like that. Because storage was really, really tight back when mainframes ruled, data files often were mixtures of fixed-size text, packed decimal, binary integer of varying sizes and/or floating point (regular or long). And don't even get me started on variable-length records!

To properly process a record of that type, you'd need a map of its column schema, not just sample data. And for complex records you might just want to convert it externally before using it. ETL programs such as the Pentaho DI or Talend utilities can do that for you.
 
Y Zhao
Greenhorn
The file was written on the mainframe as binary with IBM 1047 encoding. The records in the file have definitions like PIC A(...) and PIC 9(...) packed. Either way, reading the data in should give me the same results, shouldn't it? Binary values should remain the same everywhere; they only represent different characters in different encodings. I think using charset IBM1047 to get the chars from bytes, or reading directly as strings with the encoding specified as Cp1047, should not make any difference. Correct me if my understanding is wrong.
 
Tim Holloway
Bartender
Binary results are DEFINITELY not the same everywhere. The IBM S/3xx stores integers in bytewise-continuous (big-endian) order, while Intel CPUs use a bytewise-discontinuous (little-endian) ordering. An operation known as "swabbing" (byte swapping) is required to convert. That is something completely outside of character sets, since characters are not involved.
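
The difference is easy to see with java.nio.ByteBuffer, which lets the caller state the byte order explicitly. This is a minimal sketch; the class and method names are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderDemo {
    // Interprets four raw bytes as a 32-bit integer in the given byte order.
    static int readInt(byte[] fourBytes, ByteOrder order) {
        return ByteBuffer.wrap(fourBytes).order(order).getInt();
    }

    public static void main(String[] args) {
        byte[] raw = {0x00, 0x00, 0x01, 0x02};
        // Mainframe (big-endian) reading of the four bytes.
        System.out.println(readInt(raw, ByteOrder.BIG_ENDIAN));    // 258
        // The same four bytes read in Intel (little-endian) order.
        System.out.println(readInt(raw, ByteOrder.LITTLE_ENDIAN)); // 33619968
    }
}
```

ByteBuffer defaults to big-endian, and so does DataInputStream.readInt, so mainframe binary integers can usually be read in Java without swapping.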

Also, PIC S9(...) isn't definitive. PIC S9 flagged "COMP(UTATIONAL)" is binary integer. COMP-3 is packed decimal (a variant of BCD where the last nybble is a sign indicator). COMP-2 is floating-point.
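
A packed-decimal field has to be unpacked nybble by nybble rather than decoded as text. The sketch below assumes the usual sign convention (0xD or 0xB in the last nybble means negative, anything else is treated as positive or unsigned), ignores any implied decimal point, and only handles values that fit in a long. The class and method names are hypothetical.

```java
final class PackedDecimal {
    // Unpacks a COMP-3 field: two decimal digits per byte, except the last
    // byte, which holds one digit followed by the sign nybble.
    static long unpack(byte[] packed) {
        long value = 0;
        for (int i = 0; i < packed.length; i++) {
            int hi = (packed[i] >> 4) & 0x0F;
            int lo = packed[i] & 0x0F;
            boolean last = (i == packed.length - 1);
            if (hi > 9 || (!last && lo > 9)) {
                throw new IllegalArgumentException("not a packed-decimal digit at byte " + i);
            }
            value = value * 10 + hi;
            if (!last) {
                value = value * 10 + lo;
            } else if (lo == 0x0D || lo == 0x0B) {
                value = -value; // negative sign nybble
            }
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(unpack(new byte[] {0x12, 0x34, 0x5C})); // 12345
        System.out.println(unpack(new byte[] {0x12, 0x3D}));       // -123
    }
}
```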
 
Y Zhao
Greenhorn
When I say everywhere is the same, I mean the byte stream saved with the same encoding by the same machine. If the data provider has already given the encoding and the data is saved in pure bytes, then the byte values should be the same no matter where it's read.

My question is whether it's possible for the byte value to change from hex 00 to hex 20.

Has anyone out there ever had a problem similar to what we ran into?
 
Paul Clapham
Sheriff
You keep talking about bytes written using an encoding. Which implies that there is TEXT being encoded. However you're talking about "low-value" which shouldn't be appearing in a text string. So I'm still confused. You're going to have to examine what is being encoded for a start, which is something you haven't described.

As for "pure bytes" there's no such concept. A byte is a byte, a sequence of 8 bits, and any byte is just as "pure" as any other byte. Which, as you say, should be the same no matter who is reading it. So if you're alleging that somebody wrote a byte containing hex-00 and your Java code reads that same byte as hex-20, then no, that isn't happening. You could examine the bytes in question with a hex editor if you like, although you should be careful that you aren't viewing the file through some software connection which is converting EBCDIC to ASCII under the covers.
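
If no hex editor is at hand, the first bytes of the file can be dumped from Java on the machine that will read it. This is only a sketch: the class name is made up, the file path is taken from the command line, and it shows at most the first 256 bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

final class HexDump {
    // Formats the first `length` bytes as two-digit hex values, 16 per line.
    static String toHex(byte[] data, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            if (i > 0) {
                sb.append(i % 16 == 0 ? '\n' : ' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java HexDump <file>");
            return;
        }
        byte[] buf = new byte[256];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int n = Math.max(in.read(buf), 0); // read() returns -1 on an empty file
            System.out.println(toHex(buf, n));
        }
    }
}
```

A low-value byte should show up as 00 here; if it already shows as 20, the change happened before Java ever touched the file.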
 
Tim Holloway
Bartender
Yes, the FileInputStream will read raw bytes without translation. Although your loop isn't correct. It should be more like this:
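
The corrected loop did not survive in this copy of the thread. A sketch of the general shape is given here, with the read moved into the loop condition and the byte count passed along. The process(...) signature is a hypothetical stand-in for the original poster's method, and the surrounding class exists only to make the example compile.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ReadLoop {
    // Reads the stream in chunks of up to 100 bytes; returns the total read.
    static long readAll(InputStream in) {
        byte[] data = new byte[100];
        long total = 0;
        int readBytes;
        try {
            // read() is called again on every pass, so the loop ends at EOF.
            while ((readBytes = in.read(data, 0, data.length)) > 0) {
                process(data, readBytes); // only the first readBytes bytes are valid
                total += readBytes;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    // Hypothetical placeholder for the real record processing.
    static void process(byte[] data, int length) {
    }

    public static void main(String[] args) {
        System.out.println(readAll(new ByteArrayInputStream(new byte[250]))); // 250
    }
}
```

For fixed-length records, DataInputStream.readFully may be a better fit, since a plain read() is allowed to return fewer bytes than requested.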


Otherwise you'd A) loop forever, since you only read once, and B) possibly attempt to process bytes in a "garbage" area of the data buffer, since the buffer size is fixed, but the number of bytes actually read can vary.

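The character code \x00 is NUL in both ASCII and EBCDIC. A Java String can hold a NUL (U+0000) like any other character, but some Java APIs treat it specially, largely because of its special meaning (EOS) in C programs, partly because it's too easy (again in C) to cast to a NULL value/pointer. The "modified UTF-8" used by JNI and DataOutputStream.writeUTF, for instance, encodes NUL as two bytes so that no zero byte is embedded in the output. Java isn't C, but it tends to be wary of C quirks.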

So it's possible that if you passed a NUL to a character/string function in Java it would translate. A binary buffer wouldn't do that on its own, though.
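
Whether the charset decoding step itself is responsible can be checked directly by decoding a few known bytes and printing the resulting code points. This is a sketch, assuming the IBM1047 charset is installed; the class and method names are illustrative. The IBM1047 mapping sends 0x00 to U+0000 and 0x40 to U+0020, so a space appearing where a NUL was expected would point at some other step.

```java
import java.nio.charset.Charset;

final class NulDecodeCheck {
    // Decodes EBCDIC bytes and returns the Unicode code points produced.
    static int[] decodeToCodePoints(byte[] raw) {
        String decoded = new String(raw, Charset.forName("IBM1047"));
        return decoded.codePoints().toArray();
    }

    public static void main(String[] args) {
        // EBCDIC 'A', NUL (low value), and space.
        byte[] raw = {(byte) 0xC1, 0x00, 0x40};
        for (int cp : decodeToCodePoints(raw)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```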

It would be a good idea to run a hexdump utility on the system that's going to be reading the file, just to ensure that the translation didn't occur before the file reached you, regardless of what the mainframe people said. A lot of times, they're unaware of behind-the-scenes code translations. For example, FTP servers for Unix default to untranslated (binary) mode, but FTP servers for Windows default to translated (text) mode.
 
Y Zhao
Greenhorn
Thanks for all the replies.

Basically, we are just trying to see if anyone else has had an experience like this before. We already checked with the parties who prepared the data and the team who did the FTP transfer to make sure they set binary mode when the file was sent from the mainframe to our system.

When I say pure bytes, I mean the data fields defined as PIC X or PIC A are also saved as their corresponding byte values instead of characters. Other than the data byte issue, we have the code logic to read and process records; we only omitted some details to make the message shorter. Sorry for the confusion.

Maybe some conversion happened that we are unaware of. We have already discussed this internally among the related parties, and it did not look like that was the case.

Thanks again for all the effort people put in trying to help.
 
Tim Holloway
Bartender
I cannot say that I have encountered the specific problem you mentioned, but I spent several years working with mainframe-to-PC data transfers on all sorts of data, using IND$FILE, FTP, physical reel and cartridge tape drives attached direct to PCs, you name it. So there aren't many things I cannot handle in that regard.
 