
A program to read "all" types of files in the Java language

 
nash saraj
Greenhorn
Posts: 12
Hi All,

A file consists of two major components: the file header and the file contents. I want to create a program that first reads the file header and then the file contents. Depending on the type of file detected, the code will then process the file contents. While browsing JavaRanch - A Friendly Place for Java Greenhorns, I came across JavaRanch: How to read headers of a audio file, but I want to do this for all types of files. Please suggest a means of achieving this in Java.

Thanking you,

Regards
 
Marco Behler
Author
Ranch Hand
Posts: 93
When you say "all type of files", do you mean text files? Binary files? Both? What type of files? Is there a specific list of file types? What exactly do you want to do?
 
Tim Moores
Saloon Keeper
Posts: 4034
There is no way to read different structured file types with the same code. You will have to decide which file types to support, and then check out libraries that read the ones you want, possibly something like Apache Tika. If the intent is to search file contents, that is not hard to do, but as Marco said, it depends what the objective is.
 
nash saraj
Greenhorn
Posts: 12
Marco Behler wrote:When you say "all type of files", do you mean text files? Binary files? Both? What type of files? Is there a specific list of file types? What exactly do you want to do?



I am interested in detecting the following types of files:

1. Txt
2. Doc
3. Audio
4. Video
5. PDF files
6. Executable files on Linux
7. System-related files (.sys)

Tim Moores wrote:There is no way to read different structured file types with the same code. You will have to decide which file types to support, and then check out libraries that read the ones you want, possibly something like Apache Tika. If the intent is to search file contents, that is not hard to do, but as Marco said, it depends what the objective is.



Is it possible to scan all the files using the same code? I want to write code that first tells the "type" of the file (1-7) and then processes the file headers and contents. @Tim, as I understand it, Apache Tika is used to extract information and "meta-data" from a file. I have not used Tika; could you please clarify whether the meta-data includes the file headers? Can it scan all the information in the file, or just its meta-data?




Thanking you,

Regards
 
Stephan van Hulst
Saloon Keeper
Posts: 7991
What does "process" mean? What do you need to do with each of these file types?
 
Tim Moores
Saloon Keeper
Posts: 4034
nash saraj wrote:Apache Tika is used to scan information necessary from across the file and "meta-data". I have not used Tika, please clarify if meta-data include file headers or not ? Can it scan all information of the file or just its meta-data ?

Tika accesses both the contents and the metadata. I'm not sure what you mean by "file headers"; please clarify. I've used it successfully to implement a Lucene-based search across multiple document types.
 
nash saraj
Greenhorn
Posts: 12
Tim Moores wrote:
nash saraj wrote:Apache Tika is used to scan information necessary from across the file and "meta-data". I have not used Tika, please clarify if meta-data include file headers or not ? Can it scan all information of the file or just its meta-data ?

Tika accesses both the contents and the metadata. I'm not sure what you mean by "file headers"; please clarify. I've used it successfully to implement a Lucene-based search across multiple document types.


@Tim, I think I might have mixed up the concepts. Please forgive me and clarify this. According to my understanding, file headers include certain information such as the type of the file.
The general structure of any file is as follows:

File Headers/Metadata
File contents

Scanning further across the web, I came across this: NETWORK FORENSICS CHEAT SHEET 1.0.1 (https://digital-forensics.sans.org/media/hex_file_and_regex_cheat_sheet.pdf).

There, the hex file header again specifies the file format.


In addition, the page List of file signatures makes me think that the two (file headers and file signatures) are essentially the same thing.
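
Editor's note: the "file signature" idea mentioned here can be sketched in plain Java. The following is a minimal illustration, not a complete detector. The class name SignatureSniffer and its labels are hypothetical, and the table holds only four well-known signatures (PDF, ELF, ZIP, and the OLE2 container used by legacy .doc files). Formats such as plain text have no signature at all, so they come back as "unknown".

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: guesses a format from the leading "magic" bytes of a file. */
class SignatureSniffer {

    // A deliberately tiny table; published signature lists are far longer.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("pdf", new byte[] {0x25, 0x50, 0x44, 0x46});   // "%PDF"
        SIGNATURES.put("elf", new byte[] {0x7F, 0x45, 0x4C, 0x46});   // 0x7F "ELF"
        SIGNATURES.put("zip", new byte[] {0x50, 0x4B, 0x03, 0x04});   // "PK" 03 04, also used by .docx
        SIGNATURES.put("ole2", new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                           (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1});
    }

    /** Returns the label of the first signature that the header starts with, or "unknown". */
    static String sniff(byte[] header) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            byte[] magic = entry.getValue();
            if (header.length >= magic.length
                    && Arrays.equals(Arrays.copyOf(header, magic.length), magic)) {
                return entry.getKey();
            }
        }
        return "unknown";
    }

    /** Reads at most the first 8 bytes of the file and sniffs them. */
    static String sniffFile(Path file) throws IOException {
        byte[] buffer = new byte[8];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (filled < buffer.length) {
                int read = in.read(buffer, filled, buffer.length - filled);
                if (read < 0) {
                    break; // file is shorter than 8 bytes
                }
                filled += read;
            }
        }
        return sniff(Arrays.copyOf(buffer, filled));
    }
}
```

A signature match only suggests a container format: a "zip" result could be a .docx, a .jar, or an ordinary archive, so further inspection is still needed before handing the file to a format-specific reader.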


Stephan van Hulst wrote:What does "process" mean? What do you need to do with each of these file types?


@Stephan, I am going to analyse the contents using various other available APIs and software, such as Stanford Parser. But the calls will be made only through the centralized code.

Thanking you,

Regards
nash
 
Knute Snortum
Sheriff
Posts: 4279
nash saraj wrote:
Marco Behler wrote:When you say "all type of files", do you mean text files? Binary files? Both? What type of files? Is there a specific list of file types? What exactly do you want to do?

I am  interested in detecting following types of files:-

1. Txt
2. Doc

Meaning MS Word files?  There are several versions of Word files.
3. Audio

There are many types of audio files
4. Video

There are many types of video files
5. pdf files
6. Executables files in linux

I believe that what makes a Unix-like file executable is its permissions, not any header.  You might be able to parse the "shebang" line, but there are hundreds of types of files it could be.
7. System related files (.sys)

I hope you're getting a sense of the enormity of this project.  What are you trying to do, on the highest level?
 
Knute Snortum
Sheriff
Posts: 4279
nash saraj wrote:I am going to analyse the contents by using various other available API's/softwares such as Stanford Parser.  But calling will be made through the centralized code only.

Googling Stanford Parser, I see it is a natural language parser.  Why would you want to look in file types such as audio, video, and executables?
 
nash saraj
Greenhorn
Posts: 12
Knute Snortum wrote:
I hope you're getting a sense of the enormity of this project.  What are you trying to do, on the highest level?


Well, it's just file type identification followed by certain operations in the Natural Language Processing/Software Engineering domains. It is part of a research program, and not all details are available. As a reference, the following research paper is being used: Possibility of Interdisciplinary Research-Software Engineering and Natural Language Processing

Knute Snortum wrote:
I am going to analyse the contents by using various other available API's/softwares such as Stanford Parser.  But calling will be made through the centralized code only. 

Googling Stanford Parser, I see it is a natural language parser.  Why would you want to look in file types such as audio, video, and executables?


However, what I can confirm is that video and audio will be restricted to identification only, without going beyond that. The other types will be scanned further with NLP tools and techniques if the file is an SE artifact, and SE tools will be applied to NLP tools and artifacts. I hope this is now clear.

Can anyone please suggest how to proceed further?

Thanking you,

Regards
nash
 
Knute Snortum
Sheriff
Posts: 4279
All right, let's look into this.

MS Word files: You can guess their file type from the extensions .doc and .docx, but there's nothing stopping someone from creating a file with those extensions that isn't an MS Word file.  Assuming it is, you can use Apache POI to parse the files, but it's different for doc and docx files, I believe.

Audio files: You can guess their type from their extensions, and there are a lot of them.  Here are some that Windows Media Player uses and here are some others.

Video Files: These overlap a bit with audio files, but here are some of the extensions for them.

PDF files: I've never used it, but look at Apache PDFBox for reading PDF files.

Unix-like Executable files: There are probably hundreds of file types that can be executed on a Unix-like system.  Maybe check the file permissions for executable or look at the first line starting with #!.
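
Editor's note: a minimal sketch of the two checks suggested above, using only the standard library. The class and method names (ExecutableCheck, interpreterOfLine, interpreterOfFile) are hypothetical. Files.isExecutable reports whether the current JVM process may execute the path, which is not quite the same as "the execute bit is set", and it is also true for most directories. Native Linux binaries additionally carry an ELF signature, which this sketch does not check.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical helper for the "permissions or #! line" idea. */
class ExecutableCheck {

    /** Extracts the interpreter path from a "#!" line, e.g. "#!/bin/sh -e" gives "/bin/sh". */
    static Optional<String> interpreterOfLine(String firstLine) {
        if (firstLine == null || !firstLine.startsWith("#!")) {
            return Optional.empty();
        }
        String rest = firstLine.substring(2).trim();
        if (rest.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(rest.split("\\s+")[0]);
    }

    /** Looks only at the first 256 bytes so that large binaries are never read in full. */
    static Optional<String> interpreterOfFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[256];
            // A single read is assumed to be enough for this sketch.
            int count = Math.max(in.read(buffer), 0);
            // ISO-8859-1 maps every byte to one char, so binary data cannot fail to decode.
            String head = new String(buffer, 0, count, StandardCharsets.ISO_8859_1);
            int endOfLine = head.indexOf('\n');
            return interpreterOfLine(endOfLine >= 0 ? head.substring(0, endOfLine) : head);
        }
    }

    /** True if the JVM believes it could execute this path. */
    static boolean mayExecute(Path file) {
        return Files.isExecutable(file);
    }
}
```

Note that a script started with "#!/usr/bin/env python3" yields "/usr/bin/env" here; recovering the real interpreter would require looking at the next token as well.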

Plain Text file: You could look for a file extension of .txt or do something more complex like examining the file contents and determining if it's text by counting printable characters versus non-printable.  You would have to look for Windows, Unix, and Mac type line separators.
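
Editor's note: the printable-versus-non-printable count can be sketched as below. This is only one possible reading of the idea. The class name TextHeuristic and the 0.95 threshold are assumptions, and the check is ASCII-only, so UTF-8 text with many non-ASCII characters would be misclassified as binary. Carriage return and line feed are both counted as printable, which covers Windows (CR LF), Unix (LF), and classic Mac (CR) line separators.

```java
/** Hypothetical helper: decides whether a byte sample looks like plain ASCII text. */
class TextHeuristic {

    // Assumed cut-off; tune it against real files before relying on it.
    private static final double TEXT_THRESHOLD = 0.95;

    /** Fraction of bytes that are printable ASCII, tab, CR, or LF. */
    static double printableRatio(byte[] sample) {
        if (sample.length == 0) {
            return 1.0; // an empty file is treated as text here
        }
        int printable = 0;
        for (byte b : sample) {
            int c = b & 0xFF; // treat the byte as unsigned
            boolean whitespace = c == '\n' || c == '\r' || c == '\t';
            boolean visible = c >= 0x20 && c < 0x7F;
            if (whitespace || visible) {
                printable++;
            }
        }
        return (double) printable / sample.length;
    }

    static boolean looksLikeText(byte[] sample) {
        return printableRatio(sample) >= TEXT_THRESHOLD;
    }
}
```

In practice the sample would be the first few kilobytes of the file rather than the whole file.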
 
Tim Holloway
Saloon Keeper
Posts: 18797
In the Linux/Unix OS world, there's a concept called "magic" that's used to recognize file types. It lives at /usr/share/misc/magic and it's used by the "file" program to try to recognize what type of file you are looking at. A related, and more global, concept is the MIME (Multipurpose Internet Mail Extensions) type, which is often inferred from the filename extension.

Neither of these mechanisms is perfect - files may be misnamed, and not all types of files have a recognizable pattern that identifies them, much less a header. However, they suffice for most cases.
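
Editor's note: Java's standard library offers two ways to obtain a MIME type, sketched below. The class name MimeGuess and the fallback to "application/octet-stream" are assumptions made for illustration. URLConnection.guessContentTypeFromName looks only at the filename extension, while Files.probeContentType delegates to platform-specific detectors, so its answers can differ between operating systems and JDK versions.

```java
import java.io.IOException;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: MIME type lookups with a generic fallback. */
class MimeGuess {

    private static final String FALLBACK = "application/octet-stream";

    /** Extension-based guess using the JDK's built-in filename map. */
    static String mimeFromName(String fileName) {
        String type = URLConnection.guessContentTypeFromName(fileName);
        return type != null ? type : FALLBACK;
    }

    /** Platform-dependent probe; returns the fallback when no detector recognizes the file. */
    static String mimeFromPath(Path file) throws IOException {
        String type = Files.probeContentType(file);
        return type != null ? type : FALLBACK;
    }
}
```

Because both calls can be fooled by a misnamed file, they are best treated as a first guess that a content-based check may override.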

Recognizing what a file is is only half the battle, however. Doing things with the file is a whole 'nother ball game. Consider Microsoft Word documents. There's an official file format, and then there's the formats actually created by different versions of MS-Word. Not every file is faithfully displayed by every version of MS-Word.

Plus the sheer number of different file types means that no single program could handle them all. Even the list of similar file types used by converters - the stuff that allows Open/Libre Office to read Microsoft and WordPerfect files and such - can be very long.
 
Carey Brown
Saloon Keeper
Posts: 3323
I would suggest a two-part approach: first write a program that attempts to determine a file's type using one or more of: file name suffix, magic number, or MIME type. Then, start with one file type (e.g. ".doc") and write a program to extract the data you want. In the first program add code to execute the reader for ".doc". And as you go, write (and test) more and more reader programs. Perhaps even use some existing utilities (e.g. "exiftool") to handle some file types without having to re-invent the wheel.

It would be insane to try and write a single program to handle all file types.
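
Editor's note: the two-part approach could be wired together roughly as follows. This is a sketch under stated assumptions: the FileDispatcher class, the FileType enum, and the reader functions are hypothetical, detection here uses only the file name suffix, and each "reader" is a stand-in that returns a string instead of parsing anything.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical central program: detects a type, then calls the reader registered for it. */
class FileDispatcher {

    enum FileType { TEXT, WORD, PDF, UNKNOWN }

    // Part one: type detection (suffix only in this sketch).
    static FileType detectBySuffix(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".txt")) {
            return FileType.TEXT;
        }
        if (name.endsWith(".doc") || name.endsWith(".docx")) {
            return FileType.WORD;
        }
        if (name.endsWith(".pdf")) {
            return FileType.PDF;
        }
        return FileType.UNKNOWN;
    }

    // Part two: one reader per type, added one at a time as each is written and tested.
    private final Map<FileType, Function<String, String>> readers = new EnumMap<>(FileType.class);

    void register(FileType type, Function<String, String> reader) {
        readers.put(type, reader);
    }

    /** Runs the matching reader, or reports that none has been registered yet. */
    String process(String fileName) {
        FileType type = detectBySuffix(fileName);
        Function<String, String> reader = readers.get(type);
        if (reader == null) {
            return "no reader registered for " + type;
        }
        return reader.apply(fileName);
    }
}
```

A caller would register readers incrementally, for example `dispatcher.register(FileDispatcher.FileType.WORD, name -> "parsed " + name)`, and files of any other type would simply be reported as unsupported until their reader exists.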
 