
Searching a FileInputStream

 
James Kimble
Greenhorn
Posts: 7
I'm reading in a binary file and searching for key markers to determine what action to take with the data that follows the marker. I'm using a FileInputStream with a buffer to read the file.

My search algorithm was to read a buffer-sized chunk (from 64 KB to 1 MB), search that buffer for my pattern, save how many times the pattern occurred in that buffer, and repeat the read, accumulating the total pattern match count for the file. Changing the buffer size gave me different numbers of pattern matches, so I knew I was doing something wrong. My guess was that my pattern (which is 32 bits) was being split across buffer reads, so changing the buffer size would mean different numbers of matches. I wrote a quick fix that would prepend the last (pattern_length - 1) bytes from the previous buffer read to the current buffer prior to searching for my pattern. This just made things much worse...

I finally just started looking at where the different buffer sizes were finding my pattern, and it turns out that all the differences were at the very end of the file where, it appears, the last buffer reads past the end of the file and gets garbage data. Discarding the bad matches at the end gives the correct number of pattern matches. I don't understand that...
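A likely explanation for the "garbage" at the end: a read into a byte array reports how many bytes it actually stored, and the final read of a file is usually shorter than the buffer. The rest of the array still holds bytes from an earlier read, so a search over the whole array length can count stale data. The sketch below is only an illustration under assumptions: the class name ShortReadDemo, the method countByte, and the sample data are made up, and a ByteArrayInputStream stands in for the file.

```java
import java.io.ByteArrayInputStream;

final class ShortReadDemo {

    // Counts occurrences of target while reading data in bufferSize chunks.
    // honourReadCount = true scans only the n bytes the read reported;
    // false scans the whole array, as a loop up to buf.length would.
    static int countByte(byte[] data, int bufferSize, byte target, boolean honourReadCount) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        byte[] buf = new byte[bufferSize];
        int hits = 0;
        int n;
        // Three-argument read: same contract as read(buf), returns -1 at end of stream.
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            int limit = honourReadCount ? n : buf.length;
            for (int i = 0; i < limit; i++) {
                if (buf[i] == target) {
                    hits++;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample: the marker byte 0x7f appears exactly once.
        byte[] data = {0, 1, 2, 3, 4, 5, 0x7f, 7, 8, 9};
        System.out.println("scan n bytes:      " + countByte(data, 4, (byte) 0x7f, true));
        System.out.println("scan whole buffer: " + countByte(data, 4, (byte) 0x7f, false));
    }
}
```

With a 4-byte buffer the last read stores only 2 bytes, so the whole-buffer scan sees the marker a second time in the leftover bytes. Which positions hold leftovers depends on the buffer size, which would fit the symptom of counts changing with BUFFER_SIZE.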

Don't I have to worry about searching for a pattern when reading a file in buffered chunks like this? Isn't having the pattern split across reads a possibility?
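Yes, a split pattern is a real possibility whenever the match position is reset at the start of each buffer. One way to handle it without copying the tail of the previous buffer is to keep the "bytes matched so far" counter alive between reads. The following is a minimal sketch of that idea, not a drop-in replacement: the class SyncWordCounter and its methods feed, count, and countInStream are hypothetical names, and the simple restart rule is only adequate for a pattern like fd 53 53 fd, whose first byte does not recur before its last position. A general pattern would need a KMP-style fallback.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class SyncWordCounter {
    private final int[] pattern; // unsigned byte values, 0..255; assumed non-empty
    private int matched;         // pattern bytes matched so far; survives across feed() calls
    private long count;          // complete, non-overlapping matches seen

    SyncWordCounter(int... pattern) {
        this.pattern = pattern.clone();
    }

    // Scans only the first n bytes of buf, the part the last read actually filled.
    void feed(byte[] buf, int n) {
        for (int i = 0; i < n; i++) {
            int b = buf[i] & 0xff;
            if (b == pattern[matched]) {
                matched++;
            } else {
                // Restart, but let this byte begin a new match if it can.
                matched = (b == pattern[0]) ? 1 : 0;
            }
            if (matched == pattern.length) {
                count++;
                matched = 0;
            }
        }
    }

    long count() {
        return count;
    }

    static long countInStream(InputStream in, int bufferSize, int... pattern) throws IOException {
        SyncWordCounter counter = new SyncWordCounter(pattern);
        byte[] buf = new byte[bufferSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            counter.feed(buf, n);
        }
        return counter.count();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data containing the sync word twice.
        byte[] data = {0x00, (byte) 0xfd, 0x53, 0x53, (byte) 0xfd, 0x11,
                       (byte) 0xfd, (byte) 0xfd, 0x53, 0x53, (byte) 0xfd, 0x22};
        for (int size : new int[] {1, 3, 5, 64}) {
            long n = countInStream(new ByteArrayInputStream(data), size, 0xfd, 0x53, 0x53, 0xfd);
            System.out.println("buffer size " + size + ": " + n + " match(es)");
        }
    }
}
```

Because the match state lives in the object rather than in a local variable, the count should not depend on where the buffer boundaries fall. For a real file, a FileInputStream could be passed to countInStream in place of the ByteArrayInputStream.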

I've attached the code. It's just something I threw together for testing so please don't be too critical....

///////////////////////////////////////

import java.io.*;
import java.util.Arrays;
import java.lang.System;


public class ParseBytes
{

/////////////// NOTE ////////////////////////////////////////////
// Buffer size of 64K and 128K gave similar times. Larger buffers
// were actually slower...

//public static int BUFFER_SIZE = 1048576; // 1MB for buffering
//public static int BUFFER_SIZE = 524288; // 512K for buffering
//public static int BUFFER_SIZE = 262144; // 256K for buffering
public static int BUFFER_SIZE = 131072; // 128K for buffering
//public static int BUFFER_SIZE = 65536; // 64K for buffering
//public static int BUFFER_SIZE = 15;
public static short SYNC_WORD[] = { 0xfd, 0x53, 0x53, 0xfd };

static int byte_count = 0;

// Used to save end of last buffer to check for patterns that are
// split across buffer reads
public static byte end_buf[];


public void print_byte ( byte buf[] )
{
for (int i=0; i<buf.length && i<10; i++)
{
System.out.printf ("0x%02x ", buf[i]);
}
System.out.println("");
}


/**
* Method: searchBuf
*
* Purpose: Searches through array buf for the number of occurrences
* of pat. Designed to be called multiple times with adjacent
* buffers. The end of the latest buffer is appended to the
* beginning of the current one to be sure that the pattern
* pat wasn't split across buffers.
*/
public int searchBuf (byte buf[], short pat[])
{

int end_length = pat.length-1;

byte search_buf[];

////////////////////// TEMP ///////////////////////////////
search_buf = buf;
/*
// First time through just allocate array for storage,
if (end_buf == null)
{
System.out.println ("Creating end_buf");
end_buf = new byte[end_length];
search_buf = buf;
}
else
{
// Prepend the last (pat.length - 1) bytes of the previous buffer to our
// search buffer to catch patterns that split at end of buffer
search_buf = new byte[(end_buf.length + buf.length)];

System.arraycopy(end_buf, 0, search_buf, 0, end_buf.length);
System.arraycopy(buf, 0, search_buf, end_buf.length, buf.length);
}
*/

int j = 0;
int bi = 0;
int frames = 0;

for (int i=0; i < search_buf.length; i++)

{
int search_buf_element = (0x000000ff & ((int)search_buf[i]));
if ( search_buf_element == pat[j] && j < pat.length )
{
j++;

if (j == pat.length)
{
//System.out.println("#########################################");
//print_byte(buf);
//print_byte(end_buf);
//print_byte(search_buf);
//System.out.println("#########################################");
System.out.printf("byte_count: 0x%8x\n", byte_count);
frames++;
j = 0;
}
}
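else
{
// Restart the match, but re-test this byte against the first
// pattern byte so that e.g. fd fd 53 53 fd is still found.
// (Enough for SYNC_WORD; a general pattern needs a KMP-style fallback.)
j = (search_buf_element == pat[0]) ? 1 : 0;
}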
/*
if ( i >= (search_buf.length - end_length) )
{
end_buf[bi++] = search_buf[i];
}
*/
byte_count++;
}

return frames;
}


public void readIn (String fn)
{
try
{
FileInputStream fis = new FileInputStream(fn);

int n;
int cnt = 0;
int frames = 0;
byte buf[] = new byte[BUFFER_SIZE];
byte tmp_buf[] = new byte[BUFFER_SIZE];

while ((n = fis.read(buf)) != -1)
{
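// We have to be sure this buffer read didn't split
// on a search pattern.
// read() filled only the first n bytes of buf; the rest is stale
// data from an earlier read, so search just those n bytes.
frames += searchBuf (Arrays.copyOf(buf, n), SYNC_WORD);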
tmp_buf = buf;

for (int i = 0; i < n; i++)
{
if (buf[i] == '\n')
{
cnt++;
}
}
}

System.out.println ("LAST BUFFER: ");
for (int i = 0; i < tmp_buf.length; i++)
{
if ( (i % 16) == 0 )
System.out.println (" ");
System.out.printf ("%02x ", tmp_buf[i]);
}

fis.close();
System.out.println("There were " + cnt + " lines and...");
System.out.println(frames + " frames");
}
catch (IOException e)

{
System.err.println(e);
}
}


public static void main (String args[])
{
if (args.length != 1)
{
System.err.println("missing filename");
System.exit(1);
}

ParseBytes pb = new ParseBytes();
pb.readIn(args[0]);
}
}
 
Bear Bibeault
Author and ninkuma
Marshal
Posts: 65833
Please do not post the same question more than once.