
EBCDIC sort order in Java

 
Jake Patuli
Greenhorn
Posts: 2
I have a simple ArrayList<String>.
I would like to have it sorted according to the EBCDIC sort order rules.

An example result would look like this:
AAAA
BBBBBB
BBB2017
BBB2016
CCC
65D
 
fred rosenberger
lowercase baba
Bartender
Posts: 12464
43
Chrome Java Linux
So an example is great, but it doesn't define things.

Can you explain the exact criteria for an EBCDIC sort? Why does BBB2017 come before BBB2016 (to pick just one example)?
 
Campbell Ritchie
Sheriff
Posts: 54033
130
Welcome to the Ranch

I suggest you remind yourself about ordering in the Java™ Tutorials and then work out how to write a Comparator. I am not aware of any built‑in EBCDIC sorting.
Does anybody still use EBCDIC?
 
Piet Souris
Rancher
Posts: 1828
61
Here you find much information: click

Here is a short code that I tried, seems to work: (never worked with EBCDIC, I hope the charset used is correct)
 
Stephan van Hulst
Saloon Keeper
Posts: 7190
118
  • Likes 1
I'm surprised that code works, if it does. Your test case is not nearly extensive enough to convince me.

When you do s.getBytes(), you encode the string to whatever your platform's default charset is. Then you reinterpret those bytes as EBCDIC, which would potentially give you complete garbage.

If the sorting rule is according to each character's encoded value, then you need to encode the string to EBCDIC, and then compare the byte array, NOT the reinterpreted string.
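The approach described above (encode to EBCDIC, then compare the encoded bytes) can be sketched as follows. This is only a sketch under assumptions: the thread does not say which EBCDIC code page is wanted, so IBM037 is assumed here, and it has to be available in the runtime. `Arrays.compareUnsigned` and `List.of` need Java 9 or later. The class and member names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class EbcdicSortSketch {
    // Assumption: IBM037 (US EBCDIC) is the code page wanted; other EBCDIC pages exist.
    static final Charset EBCDIC = Charset.forName("IBM037");

    // Encode both strings to EBCDIC and compare the encoded bytes as unsigned values.
    static final Comparator<String> EBCDIC_ORDER =
            (a, b) -> Arrays.compareUnsigned(a.getBytes(EBCDIC), b.getBytes(EBCDIC));

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sorted(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(EBCDIC_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<String> data = List.of("65D", "CCC", "BBB2017", "AAAA", "BBB2016", "BBBBBB");
        // prints [AAAA, BBBBBB, BBB2016, BBB2017, CCC, 65D]
        System.out.println(sorted(data));
    }
}
```

The comparison is unsigned because every EBCDIC letter and digit has its top bit set, so a plain signed byte comparison would give a different order once those bytes are compared with lower codes such as the space character.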
 
Piet Souris
Rancher
Posts: 1828
61
Hmm... after some tries I think this is not working. It is more complicated than I thought. I agree with Fred: what is EBCDIC sorting?

If you look at the table here: wiki
then you see many characters that are not usually part of an ordinary string. So, a way might be to put the "normal" EBCDIC characters in a Map<Character, Integer> and then compare two strings character by character by means of this table. The only problem is when a character is not present in the map.
 
Piet Souris
Rancher
Posts: 1828
61
Hi Stephan,

Yes, mine is incorrect, although I'm not sure where it goes wrong. I thought of what you have, and it looks mighty, but I no longer trust what these 'bytes' are. I'll work out what I just described, then at least I can see what is happening.
 
Stephan van Hulst
Saloon Keeper
Posts: 7190
118
Piet Souris wrote:So, a way might be to put the ebcdic "normal" characters in a Map<Character, Integer> and then compare two strings character for character by means of this table.

That's basically what my algorithm does. The map you talk about is called Charset. The getBytes() method just maps each character in the string to the integer from your map, except they're bytes, not integers. Then it compares each encoded character.
 
Piet Souris
Rancher
Posts: 1828
61
Thanks for explaining, I'll cease my efforts. But if I look at that Wiki table, I notice that the letters are not contiguous. Does that mean that sorting can give some strange outcomes? For instance: "I-" seems to come before "IJ". Or does "EBCDIC" ordering have some special rules?
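One way to look into this question is to print the codes on either side of a gap and check that the letters still ascend. The sketch below assumes the IBM037 code page (the thread does not name one) and uses made-up class and method names. In that code page the upper-case letters sit in three blocks (A to I, J to R, S to Z), and each block starts higher than the previous one ends, so the gaps should not disturb alphabetical order. The hyphen has a lower code than any letter, which is also the case in ASCII, so "I-" before "IJ" is not specific to EBCDIC.

```java
import java.nio.charset.Charset;

final class EbcdicGapsSketch {
    // Assumption: IBM037 is the EBCDIC code page of interest.
    static final Charset EBCDIC = Charset.forName("IBM037");

    // Unsigned EBCDIC code of a character that encodes to a single byte.
    static int code(char c) {
        return String.valueOf(c).getBytes(EBCDIC)[0] & 0xff;
    }

    // True if the codes of 'A'..'Z' are strictly increasing, gaps or not.
    static boolean upperCaseAscending() {
        for (char c = 'A'; c < 'Z'; c++) {
            if (code(c) >= code((char) (c + 1))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 'I' is 0xC9 and 'J' is 0xD1: there is a gap, but the order is kept.
        for (char c : new char[] {'I', 'J', 'R', 'S', '-'}) {
            System.out.printf("%c -> 0x%02X%n", c, code(c));
        }
        System.out.println(upperCaseAscending());
    }
}
```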
 
Stephan van Hulst
Saloon Keeper
Posts: 7190
118
String.getBytes(Charset) wrote:This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.
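A practical consequence of the quoted behaviour is that two different unmappable characters end up as the same replacement bytes, so a byte-based comparison can no longer tell them apart. A hedged sketch of how one might detect such input before sorting, assuming the IBM037 code page and using hypothetical class and method names:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class UnmappableCheckSketch {
    // Assumption: IBM037 is the EBCDIC code page of interest.
    static final Charset EBCDIC = Charset.forName("IBM037");

    // True if every character of s has an encoding in the chosen code page.
    static boolean isEncodable(String s) {
        // A fresh encoder per call, because CharsetEncoder instances are not thread-safe.
        return EBCDIC.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(isEncodable("BBB2017"));      // letters and digits are mappable
        System.out.println(isEncodable("\u4E2D\u6587")); // CJK text is not part of IBM037

        // Both characters are replaced on encoding, so the byte arrays are equal.
        byte[] first = "\u4E2D".getBytes(EBCDIC);
        byte[] second = "\u6587".getBytes(EBCDIC);
        System.out.println(Arrays.equals(first, second));
    }
}
```

Whether to reject such strings or to give them a fixed position in the order is a design decision that the thread leaves open.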

 
Jake Patuli
Greenhorn
Posts: 2
Thanks everyone for contributing, I have learned a lot.
Well the issue here is the EBCDIC sort order.
I have normal ASCII string data in an ArrayList<String>.
I would like to sort that list according to the EBCDIC sort rules.
The EBCDIC sort is where letters are higher than numbers.
Note that the data is ASCII in an ArrayList<String>.

Let’s assume that there are only numbers and letters in each string.

An example result would look like this:
AAAA
BBBBBB
BBB2016
BBB2017
CCC
65D
 
Liutauras Vilda
Marshal
Posts: 3961
214
BSD
Initial OP's example wrote:An example result would look like this:
AAAA
BBBBBB
BBB2017
BBB2016
CCC
65D

Latter OP's example wrote:An example result would look like this:
AAAA
BBBBBB
BBB2016
BBB2017
CCC
65D

Those two are different. To avoid confusing everybody, please specify what you are after.
 
Henry Wong
author
Sheriff
Posts: 23026
120
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
Jake Patuli wrote:
I would like to sort that list according to the EBCDIC sort rules.
The EBCDIC sort is where letters are higher than numbers.

Let’s assume that there are only numbers and letters in each string.


If all you are concerned with are "numbers and letters" then it is really simple. Just write a simple Comparator, and let the Collections class sort algorithm sort the List.
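For the letters-and-digits case, such a Comparator might look like the sketch below. It does not use a charset at all. The `rank` function is a hypothetical helper, and its ordering (lower-case letters, then upper-case letters, then digits) is an assumption chosen to mirror the relative EBCDIC order of those three groups. Anything outside ASCII letters and digits is rejected rather than guessed at.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class LettersBeforeDigitsSketch {
    // Hypothetical rank: 0-25 for a-z, 26-51 for A-Z, 52-61 for 0-9.
    static int rank(char c) {
        if (c >= 'a' && c <= 'z') return c - 'a';
        if (c >= 'A' && c <= 'Z') return 26 + (c - 'A');
        if (c >= '0' && c <= '9') return 52 + (c - '0');
        throw new IllegalArgumentException("Not an ASCII letter or digit: " + c);
    }

    // Compares character by character using rank(); a prefix sorts before the longer string.
    static final Comparator<String> LETTERS_BEFORE_DIGITS = (a, b) -> {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            int diff = rank(a.charAt(i)) - rank(b.charAt(i));
            if (diff != 0) {
                return diff;
            }
        }
        return a.length() - b.length();
    };

    public static void main(String[] args) {
        List<String> data = new ArrayList<>(
                Arrays.asList("65D", "CCC", "BBB2017", "AAAA", "BBB2016", "BBBBBB"));
        Collections.sort(data, LETTERS_BEFORE_DIGITS);
        // prints [AAAA, BBBBBB, BBB2016, BBB2017, CCC, 65D]
        System.out.println(data);
    }
}
```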

If you want to support all of EBCDIC, it is still simple; it is just more tedious. Your Comparator first needs to convert to the EBCDIC value which is used to compare.... and the tedious part would be generating the table for the conversion.

Henry
 
Stephan van Hulst
Saloon Keeper
Posts: 7190
118
Jake Patuli wrote:I have normal ASCII string data in an ArrayList<String>.

No you don't. Strings are ALWAYS UTF-16. You mean that your strings only contain characters that can be encoded as ASCII. This is an important distinction, as has been demonstrated by the bug in Piet's code.

Regardless, with the example I have given in this thread, what is the problem you encounter?
 
Stephan van Hulst
Saloon Keeper
Posts: 7190
118
Henry Wong wrote:If you want to support all of EBCDIC, it is still simple; it is just more tedious. Your Comparator first needs to convert to the EBCDIC value which is used to compare.... and the tedious part would be generating the table for the conversion.

Not really. The table already exists in the form of a Charset.
 
Tim Holloway
Bartender
Posts: 18548
61
Android Eclipse IDE Linux
  • Likes 2
EBCDIC has several significant differences from ASCII and Unicode when it comes to collating sequence. In ASCII, digit codes sort below alphabetic codes, but the reverse is true in EBCDIC. I'd have to double-check, but I'm thinking that lower-case and upper-case may be reversed as well. Lower case was not all that common in traditional EBCDIC, so I didn't pay much attention. A number of special characters are in different collating positions and not all EBCDIC and ASCII characters have a corresponding value on the opposite code page.
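The relative position of digits, upper case and lower case can be checked directly by printing the code of one representative of each group in both encodings. The sketch below assumes the IBM037 code page for the EBCDIC side, and the class and method names are made up. Under that assumption the three groups come out fully reversed: ASCII has digits, then upper case, then lower case, while EBCDIC has lower case, then upper case, then digits.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GroupOrderSketch {
    // Assumption: IBM037 is the EBCDIC code page of interest.
    static final Charset EBCDIC = Charset.forName("IBM037");

    // Unsigned code of a character that encodes to a single byte in the given charset.
    static int code(char c, Charset charset) {
        return String.valueOf(c).getBytes(charset)[0] & 0xff;
    }

    public static void main(String[] args) {
        // Prints:
        // 0: ASCII 0x30, EBCDIC 0xF0
        // A: ASCII 0x41, EBCDIC 0xC1
        // a: ASCII 0x61, EBCDIC 0x81
        for (char c : new char[] {'0', 'A', 'a'}) {
            System.out.printf("%c: ASCII 0x%02X, EBCDIC 0x%02X%n",
                    c, code(c, StandardCharsets.US_ASCII), code(c, EBCDIC));
        }
    }
}
```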

Also, as has been noted, there are gaps between blocks of character codes in EBCDIC. Why? Because EBCDIC is the EXTENDED binary coded decimal interchange code - it worked its way up from 4 bits (BCD) to 6 bits (BCDIC), to finally 8 bits and got passed through 2 generations of keypunch machines on the way. If you've ever looked at how IBM cards are punched you can see where it comes from. Back then, you adapted the codes to the hardware, not the other way around.

The java.lang.String is Unicode. So any character strings in EBCDIC need to be arrays of java.lang.Character with one of the EBCDIC code pages used rather than ASCII, UTF-8 or Unicode. If you do that right, probably your basic compare functions will work without the need to write any special collating code.

 
Piet Souris
Rancher
Posts: 1828
61
Interesting topic! Never dealt with charsets before, so this is a fine opportunity to learn a thing or two. Thanks all for contributing.

I noticed this:
If you create these byte arrays with "getBytes" then, since bytes are signed in Java, you get negative values whenever the "original" byte value >= 128. For instance,

gives [-41, -119, -123, -93, 64, -30, -106, -92, -103, -119, -94].

That means that you can't directly sort on the byte values, but that you must use an int[] array, where array[i] = byte[i] if byte[i] >= 0, else array[i] = byte[i] + 256, or adjust the Comparator accordingly.
 
Campbell Ritchie
Sheriff
Posts: 54033
130
Why can't you sort on those byte values?

Let's start by looking up some EBCDIC on Wikipedia. You find out it is an 8‑bit encoding which, like ASCII, seems to have tunnel vision and only recognise letters used in English.
So you are working in 8 bits, which means every character should be able to fit into a byte. Your printout of the array appears to confirm that, because there are no 0s in it. But why are you getting negative numbers? Answer: because there are lots of values whose leftmost bit is 1 (i.e. in hexadecimal their value is ≥ 0x80). As you suggest, PS, those will all fit nicely into an int. And remember that an int is the default type for integer arithmetic. But don't try casting to ints because you will get sign extension and your comparisons won't work. I suggest you try the following, which will produce an int as its result because the operator does promotion first. It's all in the Java® Language Specification (=JLS).
b & 0xff
I am not giving a more complete solution just yet.
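The difference between a cast and the mask can be seen in a few lines. This sketch is not the complete solution, only an illustration of the `b & 0xff` idiom. The class and method names are hypothetical. The string "Piet Souris" and the IBM037 code page are assumptions: together they happen to reproduce the byte values printed earlier in the thread, but the snippet that produced that printout is not shown.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class UnsignedByteSketch {
    // Converts signed bytes to their unsigned values in the range 0..255.
    static int[] toUnsigned(byte[] bytes) {
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xff; // same result as Byte.toUnsignedInt(bytes[i])
        }
        return out;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xC1;         // EBCDIC code of 'A'
        System.out.println((int) b);  // prints -63: the cast sign-extends
        System.out.println(b & 0xff); // prints 193: the mask keeps only the low 8 bits

        // Assumed input and code page, see the note above.
        byte[] encoded = "Piet Souris".getBytes(Charset.forName("IBM037"));
        // prints [-41, -119, -123, -93, 64, -30, -106, -92, -103, -119, -94]
        System.out.println(Arrays.toString(encoded));
        // prints [215, 137, 133, 163, 64, 226, 150, 164, 153, 137, 162]
        System.out.println(Arrays.toString(toUnsigned(encoded)));
    }
}
```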
 
Tim Holloway
Bartender
Posts: 18548
61
Android Eclipse IDE Linux
Campbell Ritchie wrote:Why can't you sort on those byte values?


Well, you could, but EBCDIC is supported by several standard Java code pages, and since the point here isn't to sort by absolute code value but by character collating sequence (which isn't the same thing at all), it's a lot easier to use the characterset/codepage facilities that are part of the core Java implementation and let them do what they're designed to do.

Aside from saving a lot of work, this also makes the code more re-usable, since collating is an abstract process and that moves the character-set dependencies to the periphery.
 
Piet Souris
Rancher
Posts: 1828
61
Campbell Ritchie wrote:Why can't you sort on those byte values?

If you have the unsigned values x = 127 and y = 128, then x comes before y. If signed, then x = 127 and y = -128. In that case y would come before x.

Campbell Ritchie wrote:But don't try casting to ints because you will get sign extension and your comparisons won't work.

Well, that's why I suggested something else   
 
Campbell Ritchie
Sheriff
Posts: 54033
130
Tim Holloway wrote:. . .  the point here isn't to sort by absolute code value but by character collating sequence . . .
I missed that bit. Sorry.
 
Stephan van Hulst
Saloon Keeper
Posts: 7190
118
Tim Holloway wrote:So any character strings in EBCDIC need to be arrays of java.lang.Character with one of the EBCDIC code pages used rather than ASCII, UTF-8 or Unicode. If you do that right, probably your basic compare functions will work without the need to write any special collating code.

This makes no sense to me. java.lang.Character is Unicode. You can't set a code page on a Character or a Character array.

EBCDIC collation sequence is done by ordinal value of the encoded character. All you have to do is encode the string to EBCDIC and then compare the byte arrays.
 
Tim Holloway
Bartender
Posts: 18548
61
Android Eclipse IDE Linux
Sorry. I couldn't remember if it was Char or Byte. It's been a year or 2 since I last dealt with it and memory-rot sets in fast these days.

Native EBCDIC collating sequence is indeed based on unsigned byte comparison. Collation is an abstraction in Java, though, so if you prefer you can use a collator that sorts digits low the way the native collators for ASCII and Unicode do.

 