
How to parse a combination file that contains text and image?

 
John McDonald
Ranch Hand
Posts: 112
Hi there,
I have a project in which I need to parse a file and extract the text and images (TIFF, GIF, or JPEG). Any ideas are much appreciated. Thanks.

John
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
How to extract will depend on how the file is written.
My crystal ball is in the shop so you will have to be more complete in your description of the problem.
 
John McDonald
Ranch Hand
Posts: 112
Thanks William.
I tried to think about how to organize my question. O.K., the file that I need to crack is described below.

I read the files in as bytes, and so far I only extract the ASCII text. I don't know how to deal with the binary bytes that hold hex values. Any input is very valuable to me. Thank you very much.



***********************



The format of the PDI files we can read look like this:

++----------------------------------+

| <52 byte header containing claim number, etc> |

++----------------------------------+

| <129 byte COMBINED HEADER> +

++----------------------------------+

| <32 byte FILE HEADER (for Status file)> |

++----------------------------------+
| < "Status" FILE (496 Bytes)> |

++----------------------------------+

| <32 byte FILE HEADER (for each additional file)>|

++----------------------------------+
| <FILE (additional files (estimate data, jpegs)> |

++----------------------------------+




In PDI files the "Status" file always appears,
and our software always looks for it.
Its format is documented in the original PDI specification documents.



The format of the incoming 'print image' files currently look like this:

++----------------------------------+

| <52 byte header containing claim number, etc> |

++----------------------------------+

| <'print image' FILE > |

++----------------------------------+



What we need to do to make it work is to format the 'print image' file the same way.

That is, we need to make it look like this:

++----------------------------------+

| <52 byte header containing claim number, etc> |

++----------------------------------+

| <129 byte COMBINED HEADER> +

++----------------------------------+

| <32 byte FILE HEADER (for Status file)> |

++----------------------------------+
| <"Status" FILE (496 Bytes)> |

++----------------------------------+

| <32 byte FILE HEADER (for 'print image' file)> |

++----------------------------------+
| <'print image' FILE> |

++----------------------------------+
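The target layout above can be sketched as a plain concatenation in Java. This is only a sketch under assumptions: the class and method names are hypothetical, the contents of the combined header, the two file headers, and the "Status" file are supplied by the caller because their contents are not fully described here, and the 52-byte header is copied through unchanged.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper; class and method names are illustrative only.
final class PrintImageWrapper {
    static final int HEADER_LEN = 52;
    static final int COMBINED_LEN = 129;
    static final int FILE_HEADER_LEN = 32;
    static final int STATUS_LEN = 496;

    // incoming = 52-byte header followed by the raw 'print image' data.
    static byte[] wrap(byte[] incoming, byte[] combinedHeader,
                       byte[] statusFileHeader, byte[] statusFile,
                       byte[] printImageFileHeader) {
        if (incoming.length < HEADER_LEN) {
            throw new IllegalArgumentException("incoming file is shorter than its header");
        }
        requireLength(combinedHeader, COMBINED_LEN, "combined header");
        requireLength(statusFileHeader, FILE_HEADER_LEN, "status file header");
        requireLength(statusFile, STATUS_LEN, "status file");
        requireLength(printImageFileHeader, FILE_HEADER_LEN, "print image file header");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // ASSUMPTION: the 52-byte header needs no changes when wrapping.
        out.write(incoming, 0, HEADER_LEN);
        append(out, combinedHeader);
        append(out, statusFileHeader);
        append(out, statusFile);
        append(out, printImageFileHeader);
        // Everything after the 52-byte header is the 'print image' file itself.
        out.write(incoming, HEADER_LEN, incoming.length - HEADER_LEN);
        return out.toByteArray();
    }

    private static void append(ByteArrayOutputStream out, byte[] part) {
        out.write(part, 0, part.length);
    }

    private static void requireLength(byte[] part, int expected, String what) {
        if (part.length != expected) {
            throw new IllegalArgumentException(
                what + " must be " + expected + " bytes, got " + part.length);
        }
    }
}
```

The length checks are there so that a wrongly sized section fails loudly instead of shifting every later offset.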



Here are the field definitions:



The <52 byte header containing claim number, etc> looks like this:

Bytes: 0-2 (STR) "HDR"

Bytes: 3-4 (SHORT) number of image segments (for 'print image', this would be 00)

Bytes: 17-18 (STR) "IM"

Bytes: 19-51 (STR) Claim, Interested Party, and Supp information.
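A minimal Java sketch of reading the 52-byte header fields listed above. The class name `PdiHeader`, the field names, the ASCII charset, and treating the SHORT as a two-byte binary integer are assumptions; the byte order is passed in because the field list does not state it, and bytes 5-16 are left uninterpreted because they are not defined here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical holder for the 52-byte leading header; names are illustrative.
final class PdiHeader {
    static final int LENGTH = 52;

    final int imageSegments;   // bytes 3-4 (SHORT), read as unsigned
    final String claimInfo;    // bytes 19-51 (STR), padding trimmed

    private PdiHeader(int imageSegments, String claimInfo) {
        this.imageSegments = imageSegments;
        this.claimInfo = claimInfo;
    }

    // Parses the header from the start of the array.
    // ASSUMPTION: the caller knows the byte order of the SHORT field.
    static PdiHeader parse(byte[] data, ByteOrder order) {
        if (data.length < LENGTH) {
            throw new IllegalArgumentException("need at least " + LENGTH + " bytes");
        }
        String tag = new String(data, 0, 3, StandardCharsets.US_ASCII);
        String im = new String(data, 17, 2, StandardCharsets.US_ASCII);
        if (!tag.equals("HDR") || !im.equals("IM")) {
            throw new IllegalArgumentException("missing \"HDR\" or \"IM\" marker");
        }
        // getShort(index) is an absolute read; the mask removes sign extension.
        int segments = ByteBuffer.wrap(data).order(order).getShort(3) & 0xFFFF;
        // Bytes 19-51 inclusive are 33 bytes.
        String claim = new String(data, 19, 33, StandardCharsets.US_ASCII).trim();
        return new PdiHeader(segments, claim);
    }
}
```

Typical use would be `PdiHeader.parse(fileBytes, ByteOrder.LITTLE_ENDIAN)`, switching to `ByteOrder.BIG_ENDIAN` if the segment counts come out implausibly large.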


The <COMBINED HEADER> is defined as the next 129 bytes after the initial header, formatted as follows (absolute byte positions in brackets; byte counting starts at zero):

Bytes 0-6(52-58): (STR) "COMB:"

Byte 7(59): (CHAR) 0x05 (hex 05)

Bytes 8-127(60-179):

Byte 128 (180): (CHAR) 0x1A (hex 1A)
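The fixed markers in the combined header make a cheap sanity check possible before parsing anything else. A sketch in Java, with a hypothetical class name and assuming ASCII text; only the three stated markers are tested, and bytes 8-127 are ignored.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical check of the fixed marker bytes in the 129-byte combined header.
final class CombinedHeaderCheck {
    static final int OFFSET = 52;   // absolute start, right after the 52-byte header
    static final int LENGTH = 129;

    static boolean looksValid(byte[] file) {
        if (file.length < OFFSET + LENGTH) {
            return false;
        }
        String tag = new String(file, OFFSET, 5, StandardCharsets.US_ASCII);
        // 0x05 and 0x1A are both below 0x80, so comparing the promoted
        // byte against the int literal is safe without masking.
        return tag.equals("COMB:")
            && file[OFFSET + 7] == 0x05
            && file[OFFSET + 128] == 0x1A;
    }
}
```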




The <FILE HEADER> is defined as 32 bytes formatted as follows
(absolute byte positions in brackets; byte counting starts at zero):

Bytes 0-8 (181-187): (STR) stored file name

Bytes 9-12 (188-191): (STR) stored extension

Bytes 13-14 (192-193): (SHORT) file type id

Bytes 15-16 (194-195): (SHORT) not used

Bytes 17-20 (196-199): (LONG) uncompressed file size

Bytes 21-24 (200-203): (LONG) compressed file size

Bytes 25-26 (204-205): (SHORT) date of file

Bytes 27-28 (205-206): (SHORT) time of file

Bytes 29-31 (207-209): (CHAR) reserved
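A Java sketch of decoding one 32-byte file header. It works only from the relative offsets (0-31) in the list above and takes the header's start position as a parameter. The class and field names are hypothetical, LONG is assumed to mean a 4-byte integer, and the byte order is a parameter because it is not stated; the date and time are kept as raw values since their encoding is not described here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical holder for one 32-byte file header; names are illustrative.
final class PdiFileHeader {
    static final int LENGTH = 32;

    final String name;            // relative bytes 0-8
    final String extension;       // relative bytes 9-12
    final int fileTypeId;         // relative bytes 13-14
    final long uncompressedSize;  // relative bytes 17-20
    final long compressedSize;    // relative bytes 21-24
    final int rawDate;            // relative bytes 25-26, encoding unknown
    final int rawTime;            // relative bytes 27-28, encoding unknown

    private PdiFileHeader(String name, String extension, int fileTypeId,
                          long uncompressedSize, long compressedSize,
                          int rawDate, int rawTime) {
        this.name = name;
        this.extension = extension;
        this.fileTypeId = fileTypeId;
        this.uncompressedSize = uncompressedSize;
        this.compressedSize = compressedSize;
        this.rawDate = rawDate;
        this.rawTime = rawTime;
    }

    // offset = position in data where this file header starts.
    static PdiFileHeader parse(byte[] data, int offset, ByteOrder order) {
        // slice() makes index 0 of buf correspond to data[offset];
        // it also resets the order to big-endian, so set the order afterwards.
        ByteBuffer buf = ByteBuffer.wrap(data, offset, LENGTH).slice().order(order);
        String name = new String(data, offset, 9, StandardCharsets.US_ASCII).trim();
        String ext = new String(data, offset + 9, 4, StandardCharsets.US_ASCII).trim();
        return new PdiFileHeader(
            name, ext,
            buf.getShort(13) & 0xFFFF,
            buf.getInt(17) & 0xFFFFFFFFL,   // mask to treat the size as unsigned
            buf.getInt(21) & 0xFFFFFFFFL,
            buf.getShort(25) & 0xFFFF,
            buf.getShort(27) & 0xFFFF);
    }
}
```

The compressed file size read here is what tells a reader how many bytes to take before the next file header begins, assuming the stored data follows its header directly as in the layout diagrams.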
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
See if Byte.parseByte() with a radix of 16 would help. You'd read two characters, e.g. "0A", and parseByte to get a byte with a value of 10, then put a bunch of those bytes into an array or write them right to the output file.

Scanner looks like it can do this on the fly. Set the radix to 16 and get nextByte. I wonder if you couldn't use Scanner for the whole file ... do you always know what's coming next?
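If the data really were hex digits stored as text, the idea above could look like the following Java sketch (the class name is hypothetical). One caveat: `Byte.parseByte("0A", 16)` returns 10, but pairs from "80" to "FF" are outside the signed byte range and throw a `NumberFormatException`, so this version goes through `Integer.parseInt` and casts.

```java
// Hypothetical helper: turns a string of hex digit pairs into raw bytes.
final class HexPairs {
    static byte[] decode(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            String pair = hex.substring(2 * i, 2 * i + 2);
            // Integer.parseInt accepts 00-FF; the cast keeps the low 8 bits.
            out[i] = (byte) Integer.parseInt(pair, 16);
        }
        return out;
    }
}
```

None of this applies if the file holds raw binary bytes rather than hex text; in that case the bytes can be used as read.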
[ January 05, 2006: Message edited by: Stan James ]
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13078
It looks like you have all the information you need; you just need to approach it methodically. About the only thing that could trip you up is "byte order": you need to figure out whether (for example) a short int is written with the low-order byte first or the high-order byte first.

In Java math with bytes you must remember that each byte gets promoted to an int - with sign extension! - before anything else happens so you will see stuff full of masking with 0xff and shifting by 8 bits when assembling a Java int value from a byte array. Example:

short n = (short) ((0xff & b[0]) | ((0xff & b[1]) << 8));

where the byte order is low byte first. (off the top of my head, hope I got it right)
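The masking and byte-order points can be checked with a small Java example. The class and method names are hypothetical; the `ByteBuffer` variant is an alternative to hand-written shifting, not something the file format requires.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderDemo {
    // Low byte first (little-endian), assembled by hand.
    static int unsignedShortLE(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    // High byte first (big-endian), assembled by hand.
    static int unsignedShortBE(byte[] b, int off) {
        return ((b[off] & 0xFF) << 8) | (b[off + 1] & 0xFF);
    }

    // The same little-endian read, letting ByteBuffer handle the order.
    static int unsignedShortViaBuffer(byte[] b, int off) {
        return ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).getShort(off) & 0xFFFF;
    }

    // What happens when the 0xFF masks are forgotten.
    static int withoutMasking(byte[] b, int off) {
        return b[off] | (b[off + 1] << 8);
    }

    public static void main(String[] args) {
        byte[] b = { (byte) 0xF0, 0x01 };
        System.out.println(unsignedShortLE(b, 0));        // 496
        System.out.println(unsignedShortBE(b, 0));        // 61441
        System.out.println(unsignedShortViaBuffer(b, 0)); // 496
        System.out.println(withoutMasking(b, 0));         // -16, sign extension of 0xF0
    }
}
```

Reading a field with a known value both ways, such as a size that should be 496, is a quick way to find out which byte order the files use.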

If this were my problem, I would write a class for each of your identifiable data chunks, each with methods to extract a value as needed and to read or write the required byte[].

Bill
 