I am new to UNIX and UNIX scripting. I have a log file sitting on a UNIX server. The file is fairly big, around 50,000 lines, and I want to write a UNIX shell script to remove duplicate data from it and produce a much smaller file (maybe 100-200 lines or so). I really don't know how to write that script, so it would be great if someone could help me with a sample. Please let me know if you need any other information. Thank you.
Could you show us some example log lines (both duplicate lines and unique lines)?
Do the lines have timestamps in them? If so, then filtering will be much more difficult.
I'm sure someone with Lisp expertise could write a one-liner (with lots of parentheses) to do this, but if I were to do it I would have to use Python, PHP, or some other higher-level scripting language; I wouldn't even want to think about how to do it in bash.
I have not written the application, so I do not know whether the people who wrote it are using Log4j. I only have the log files, and I have to reduce them. As I am very new to UNIX and UNIX scripting, I am wondering whether we can write a shell script to remove the duplicate records from the log and make it smaller. If that is possible, could you please send me some sample code showing how to do it? I would really appreciate that.
I would first try to find the log4j configuration file and edit it to remove the duplicates.
I think that a general algorithm for printing only unique log entries would be:

1. Read next line from log (i.e., the next complete entry).
2. If the entry just read differs from the previous one, print it.
3. Go back to step 1 until end of file.
The "read next line from log" step is a little complicated because you have to read the entire log entry, which appears to span multiple physical lines based on what you displayed. The algorithm also assumes that duplicate entries will be adjacent. I would be comfortable tackling this in Java, or perhaps in Python or PHP. Someone might be able to do this in a line or two of Lisp. I wouldn't even try this in bash (though I'm sure it could be done).
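To make that concrete, here is a minimal Python sketch of the algorithm above. It assumes each log entry begins with a line starting with a date (the `ENTRY_START` regex is a guess; adjust it to your real log format), treats continuation lines as part of the current entry, and drops an entry only when it is byte-for-byte identical to the one immediately before it. If your entries carry differing timestamps, identical events will not compare equal, and you would need to strip or normalize the timestamp first.

```python
import re
import sys

# Assumed entry delimiter: a line beginning with a YYYY-MM-DD date.
# Change this pattern to whatever actually starts an entry in your logs.
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2}")

def unique_entries(lines):
    """Yield multi-line log entries, skipping adjacent exact duplicates."""
    entry = []        # physical lines of the entry being accumulated
    previous = None   # text of the last entry we finished reading
    for line in lines:
        if ENTRY_START.match(line) and entry:
            # A new entry starts: finish the current one and emit it
            # only if it differs from the previous entry.
            current = "".join(entry)
            if current != previous:
                yield current
            previous = current
            entry = []
        entry.append(line)
    if entry:  # flush the final entry
        current = "".join(entry)
        if current != previous:
            yield current

if __name__ == "__main__":
    sys.stdout.write("".join(unique_entries(sys.stdin)))
```

Usage would be something like `python dedupe_log.py < big.log > small.log`. Note that if duplicates are *not* adjacent, the simpler (order-destroying) `sort big.log | uniq` is closer to what you want, at the cost of interleaving multi-line entries.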