next (possibly) you'd start assembling these based on frequency into a binary tree.

You need a data type that can store a binary tree, then you would convert the initial Map into a TreeMap<Integer, List<BinaryTree>>

You need a List here in case more than one tree has the same frequency.

while the size is greater than one, pop the top two items from the TreeMap, combine and put them back into the TreeMap

Node class

Leaf class

Nleaf class

part of the main class

AyanBiswas

SCJP 1.4 - SCJP 6 - SCWCD 5 - OCEEJBD 6 - OCEJPAD 6

How To Ask Questions How To Answer Questions

How do you know whether a Node is a leaf or not when you first create it?

Suppose there are three distinct characters in a fille a,b&c.These 3 characters along with their frequencies make the leaves.Now, suppose I pop a&b (say ,their frequencies are least) and the combine them (i.e add their frequencies and ascii values) and create the first non-leaf node and put it in the TreeSet<Node> again.By ,this way i proceed .

In this for loop in the main method I create all the leaf nodes(with its frequency,ascii code and sequence no.).Sequence no. was created to design the compareTo method efficiently.

AyanBiswas

AyanBiswas

Ayan Biswas wrote::I firstly converted the ascii codes in to their corresponding integer values and wrote them in a file

Please elaborate what you mean here -- ASCII codes are already a number. There is no conversion to "corresponding integer".

Ayan Biswas wrote:Now ,while decoding I read each byte from that file and converted it into its binary equivalent and searched for their ascii values in the map.This approach is creating the following problem.Since,01,001,1 all have integer equivalent as 1,so when I convert the codes into integer it no longer remains unique.

Again, please elaborate.... because with numbers a one is a one is a one -- regardless of how many zeros are in front of it. There is no difference between a ASCII code one and a ASCII code zero one.

Henry

Ayan Biswas wrote:

:I firstly converted the ascii codes in to their corresponding integer values and wrote them in a file

I meant I converted the huffman codes(which are of type string in my case) of the characters(actually I stored the ascii of the characters ..not the character itself in the tree map) into corresponding integers.so,say huffman code for 'a' is 101 .then the integer will be 5.

Ayan Biswas wrote:

Now ,while decoding I read each byte from that file and converted it into its binary equivalent and searched for their ascii values in the map.This approach is creating the following problem.Since,01,001,1 all have integer equivalent as 1,so when I convert the codes into integer it no longer remains unique.

Again,here I meant huffman codes.Say huffman code for 'b' is 010 and for 'c' it's 10.So,when I convert 010 and 10 to integer both gives 2.

AyanBiswas

ASCII codes are numbers. There is no concept of how many leading zeros in a number. So the conversion from Huffman code to the number is a lossy conversion. And there is no way to guarantee to convert it back correctly.

Henry

AyanBiswas

Ayan Biswas wrote:@Henry thanks for the idea.But using the codes in string format wont be much helpful.because firstly the encoded file will be much larger in size and while decoding (say10011) how will I know to search for '1' or 10 or 100 ..But I will try to use the ascii codes and their leading 0's in another Map and give it a try.

Oh... I thought you were referring to the mapping -- not the encoded output.

Basically, you can try to do this for the encoded file. Say you have five hoffman codes:

001 101 01 100 110

Instead of storing in as five ascii characters with the binary 00000001 00000101 00000001 00000100 00000110. You need to mash the bits together, and store it as 00110101 10011000.

This not only gives you much better compression than what you have now. But you don't have to worry about the leading zero problem. And hoffman codes don't have a trailing zero problem based on how it is designed.

Henry

But the problem is ,suppose i have d=011,whose integer value is 3 ,same as b's.So,codes are no longer unique.this is the problem that i am facing.

here's my code to decode the file

AyanBiswas

Ayan Biswas wrote:

But the problem is ,suppose i have d=011,whose integer value is 3 ,same as b's.So,codes are no longer unique.this is the problem that i am facing.

Read my last post again -- if you mash the bits together, you can tell the difference. Mainly...

a b --> 10111000

a d --> 10101100

There is a difference.

And BTW, you can't have b=11 while c=111. Huffman codes don't work that way.

Henry

But the problem I am facing now is that my huffman codes are String type.So if the code for a is 101 then its actually taking 3 bytes to store the code.So,the encoded file size is larger than the actual file!What am I going to do?What is the idea to read and write bits from a file in java?

AyanBiswas

By the way: if I remember correctly, each character in a String is stored as a

`char`, so a String like "101" would require 6 bytes, not 3 (plus pointers to the Class<String> object, etc etc).

You can try parsing the "101" String with 2 as the radix to get 5, but you are probably better off getting the codes into numbers initially.

By the way: if I remember correctly, each character in a String is stored as a char, so a String like "101" would require 6 bytes, not 3 (plus pointers to the Class<String> object, etc etc).

I had tried the approach of converting the codes into numbers initially,but it did not work for me because I was not able to save the uniqueness of the codes so i tried the tree traversal approach.it worked but I failed with the compression of data.Any other way to do it?

And the encoded file size is coming as the no.of 0's and 1's in the file.So,if the file consists 10101 the size=5 bytes not 10.

AyanBiswas

Campbell Ritchie wrote:Bear has earlier showed you how to compress your file by putting different combinations of bits together. If you are using Strings, then 10101 is at least 10 bytes, if you get it into a number, then 10101 is 5

bits.

Agreed. Not only is it a difference of 10 bytes versus 5 bits per character -- but if you further mash the bits together across many characters, you won't have the "can't tell difference between 10 and 010" that you have been encountering either.

Henry

Mashing of the bits seems a bit difficult to implement.Say,a="101" b="10" now while reading the file i first read a and find the code 101 now since this is less than 8 chars i read the next byte ,and that is 10,so i concatenate it with 101,and continue this until string length is less or equals eight.Now,suppose while reading the next byte the string length becomes more than 8 ,so what I do is that i store the extraneous bits in another string and read the next character.Is this the way forward?This kind of seems messy to me.

AyanBiswas

Here is the code where I am converting strings of 0 ,1 into numbers.

Here is my code where I am trying to re-convert numbers into string of 0 and 1

AyanBiswas