
Checking if a file is binary/ascii in Java

 
Ranch Hand
Posts: 185
Is there an easy way to check if a file is binary/ascii in Java?
 
(instanceof Sidekick)
Ranch Hand
Posts: 8791
A quick look at the ASCII character set shows it only uses the first 128 byte values (0 through 127). You could read the file with a stream and see if you find any bytes with a value over 127. Because Java's byte type is signed, those bytes show up as negative values in a byte array.
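A minimal sketch of that check, assuming the whole file fits comfortably in memory. The class name `AsciiCheck` and its method names are made up for illustration; `Files.readAllBytes` is the standard library call.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class AsciiCheck {
    /** Returns true if every byte lies in the 7-bit ASCII range 0..127. */
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            // Java bytes are signed, so the values 128..255 appear as -128..-1.
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    /** Convenience overload: reads the whole file, so only suitable for small files. */
    static boolean isPureAscii(Path file) throws IOException {
        return isPureAscii(Files.readAllBytes(file));
    }
}
```

Note that a file can pass this test and still be full of NUL bytes or other control characters, so it only answers "is this 7-bit?", not "is this readable text?".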

Uh oh, the "extended ASCII" sets use all 256 values. With those, you're out of luck.

Can you define "binary" and "ASCII" any better ... what kinds of files are you likely to run into?

http://www.lookuptables.com/
[ May 04, 2005: Message edited by: Stan James ]
 
Ranch Hand
Posts: 76
It's hard to give a useful answer without more information than you've given here, but the problem you raise is a general one: how does one tell by looking at the contents of a file what kind of file it is?

The short answer is: it can't be done in general. You can use various heuristic tricks and make educated guesses. (For example, compressed files typically come with specific bytes at the very beginning; I'm told that viruses and worms tend to have signatures that betray them; and so on.) But there isn't a general-purpose algorithm for looking at a stream of bytes and saying, "Yes, that's an executable" or "No, that's just an email message." So you won't find a class in the Java API that enables you to do this.
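To illustrate the "specific bytes at the very beginning" heuristic, here is a rough sketch that compares the head of a file against a few well-known signatures (gzip, ZIP, Java class files). The class `MagicBytes` and its methods are hypothetical names, the list is nowhere near complete, and an "unknown" result says nothing about whether the file is text.

```java
final class MagicBytes {
    // A few well-known leading signatures; a real tool would need many more.
    private static final byte[] GZIP = {0x1F, (byte) 0x8B};
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    private static final byte[] JAVA_CLASS = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};

    /** Returns true if head begins with all the bytes of sig. */
    static boolean startsWith(byte[] head, byte[] sig) {
        if (head.length < sig.length) {
            return false;
        }
        for (int i = 0; i < sig.length; i++) {
            if (head[i] != sig[i]) {
                return false;
            }
        }
        return true;
    }

    /** Educated guess based only on the first few bytes of a file. */
    static String guessType(byte[] head) {
        if (startsWith(head, GZIP)) return "gzip";
        if (startsWith(head, ZIP)) return "zip";
        if (startsWith(head, JAVA_CLASS)) return "java-class";
        return "unknown";
    }
}
```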
 
Ranch Hand
Posts: 1923
I would focus on the characters lower than 32.
Any such character other than '\t', '\n' and '\r' would indicate a binary file in my eyes, but as mentioned before, the term 'binary' isn't so clearly defined - at least not to me.

And in the end every file is binary - isn't it?
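The control-character rule above could be sketched roughly like this. `ControlCharCheck` is a made-up name, and treating a single offending byte as proof of "binary" is an assumption; a tolerance threshold might suit some files better.

```java
final class ControlCharCheck {
    /**
     * Returns true if the data contains a byte below 32 that is not
     * tab, line feed or carriage return.
     */
    static boolean hasBinaryControlBytes(byte[] data) {
        for (byte raw : data) {
            int b = raw & 0xFF; // convert the signed byte to its unsigned value 0..255
            if (b < 32 && b != '\t' && b != '\n' && b != '\r') {
                return true;
            }
        }
        return false;
    }
}
```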
 
Alok Pota
Ranch Hand
Posts: 185
By binary I mean non-human-readable. Basically, I have a search/replace program that runs on a directory with a deeply nested tree structure (several files and directories). I want to skip binary (non-textual) files so as to speed up the program.
 
Ranch Hand
Posts: 580
Why don't you give it a file filter so that it knows what types of files (probably by file extension) to process?
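One way this might look, using the standard `java.io.FileFilter` interface. The class name `TextExtensionFilter` and the choice of extensions are assumptions for illustration; directories are accepted so that a recursive walk can still descend into them.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class TextExtensionFilter implements FileFilter {
    private final Set<String> extensions;

    /** extensions are given in lower case without the dot, e.g. "txt", "java". */
    TextExtensionFilter(Set<String> extensions) {
        this.extensions = new HashSet<>(extensions);
    }

    @Override
    public boolean accept(File f) {
        if (f.isDirectory()) {
            return true; // keep directories so the caller can recurse into them
        }
        String name = f.getName();
        int dot = name.lastIndexOf('.');
        if (dot < 0) {
            return false; // no extension: skipped in this sketch
        }
        String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return extensions.contains(ext);
    }
}
```

The filter can be passed to `File.listFiles(FileFilter)` at each level of the directory walk. Files without an extension are skipped here, which may or may not be what you want.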
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
For a "guess" you might read the first thousand bytes and see if they are all "printable" as defined by the regular expression character classes. See Pattern in the JavaDoc for a start on that. That could give you some level of confidence, short of absolute certainty, that a human might be interested in the file's contents.
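A rough sketch of that guess, assuming Java 11 or later for `InputStream.readNBytes(int)`. The class name `PrintableSample`, the 1000-byte sample size and the decision to decode as ISO-8859-1 (so that each byte maps to exactly one char) are all assumptions. `\p{Print}` and `\s` are the ASCII-only printable and whitespace classes documented for `java.util.regex.Pattern`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

final class PrintableSample {
    private static final int SAMPLE_SIZE = 1000;

    // Printable ASCII plus whitespace (tab, newline, carriage return, etc.).
    private static final Pattern PRINTABLE = Pattern.compile("[\\p{Print}\\s]*");

    /** Returns true if every byte of the sample is printable ASCII or whitespace. */
    static boolean looksLikeText(byte[] sample) {
        // ISO-8859-1 maps each byte to one char, so no byte is lost or merged.
        String s = new String(sample, StandardCharsets.ISO_8859_1);
        return PRINTABLE.matcher(s).matches();
    }

    /** Reads at most SAMPLE_SIZE bytes from the start of the file and tests them. */
    static boolean looksLikeText(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return looksLikeText(in.readNBytes(SAMPLE_SIZE));
        }
    }
}
```

Because the pattern is ASCII-only, text in UTF-8 or another encoding with non-ASCII characters will be reported as not printable; widening the pattern is a judgment call that depends on the files you expect.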
 