I have been asked to design a system that monitors a log file in real time and reports any issues (application failure, a threshold breach for a specific exception, etc.). I know there are standard tools for this, like Filewatcher from AWS, but my firm does not want to invest in any tool and has asked me to develop one in house with some basic features. My languages of choice are Java and shell scripting. Can you please advise on a design approach? The options and challenges I can think of are the following:
1. Actively monitoring the log file - this means running a process in parallel with the application being monitored and constantly reading the log file. I am not sure what the best way is to read a log file that is constantly being written to.
2. Passively monitoring the log file - possibly run a program every 30 seconds. The program does the following:
2.1. take a snapshot of the log file
2.2. compare the line count against the previously stored snapshot (taken 30 seconds ago)
2.3. read the newly added lines and determine whether anything happened, possibly keeping state for the exception count in secondary storage
I am open to suggestions for a better design, and I can also choose Python if that makes any of this easier to implement.
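For what it's worth, the passive approach in point 2 could be sketched roughly like this in Python. This is only an illustration under some assumptions: it stores a byte offset instead of a line count (so the whole file does not have to be copied or re-read), and the pattern, the function name check_new_lines and the JSON state file are all made up for the example.

```python
import json
import os
import re

# Hypothetical pattern and state-file layout, chosen only for illustration.
ERROR_PATTERN = re.compile(r"ERROR|Exception")


def check_new_lines(log_path, state_path):
    """Read only what was appended since the last run and count matching lines."""
    state = {"offset": 0, "errors": 0}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)

    # If the file shrank, assume it was rotated or truncated and start over.
    if os.path.getsize(log_path) < state["offset"]:
        state["offset"] = 0

    new_errors = 0
    with open(log_path, "rb") as log:
        log.seek(state["offset"])
        for raw in log:
            if not raw.endswith(b"\n"):
                break  # partial line still being written; pick it up next run
            state["offset"] += len(raw)
            if ERROR_PATTERN.search(raw.decode("utf-8", errors="replace")):
                new_errors += 1

    # Persist the offset and running total for the next invocation.
    state["errors"] += new_errors
    with open(state_path, "w") as f:
        json.dump(state, f)
    return new_errors, state["errors"]
```

Something like cron (or a loop with a 30 second sleep) would call check_new_lines each time and raise an alert when the returned counts cross whatever threshold you pick.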
I would suggest looking at Nagios. Nagios is available free/open-source, although it's certainly worth paying for when you want enterprise-grade support.
The good thing about Nagios is that even free and out of the box it can monitor many critical system resources, group resources by category and send many different types of notifications, including email and SMS. It's very flexible, although perhaps a bit confusing if you just read the docs.
I use it extensively, not only to monitor the physical condition of my hosts and VMs, but also to ensure that critical apps are running properly and that I'm not running out of disk space. In fact, today I'm adding monitors to my primary servers so that I get notified if I'm dumping too much junk in my root directory, since that's where I tend to experiment. If I'm not careful, the backup server (also monitored by Nagios) will end up producing a multi-gigabyte weekly backup when it should normally be only a few hundred K. All in all, I continuously monitor about 60 key resources just on my small R&D server farm.
There are some Nagios plugins for checking log files and whether they're being produced, and in fact, if someone does decide it's worth paying for, there's a commercial Nagios Log Server product.
But for inexpensive do-it-yourself stuff where an available plugin won't do what you want, you can very easily create your own plugins in Python, Perl, shell scripts or whatever. I wouldn't use Java, though, since a JVM has to spin up for every run of each Java plugin, and I poll most resources about every 5 minutes myself.
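To give an idea of how little a home-made plugin needs, here is a minimal sketch in Python. Nagios reads the first line of output and the exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN); everything else here, including the script name, the argument order and the plain "Exception" substring match, is just an assumption for the example.

```python
import sys

# Exit codes defined by the Nagios plugin conventions.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(count, warn, crit):
    """Map an exception count to a Nagios status code and a one-line message."""
    if count >= crit:
        return CRITICAL, f"CRITICAL - {count} exceptions (limit {crit})"
    if count >= warn:
        return WARNING, f"WARNING - {count} exceptions (limit {warn})"
    return OK, f"OK - {count} exceptions"


def main(argv):
    # Hypothetical command line: check_log_exceptions.py <logfile> <warn> <crit>
    try:
        path, warn, crit = argv[1], int(argv[2]), int(argv[3])
        with open(path, errors="replace") as log:
            count = sum(1 for line in log if "Exception" in line)
    except (IndexError, ValueError, OSError) as exc:
        # Bad arguments or an unreadable file: report UNKNOWN, not a failure.
        print(f"UNKNOWN - {exc}")
        return UNKNOWN
    code, message = evaluate(count, warn, crit)
    print(message)
    return code
```

A real script would end with sys.exit(main(sys.argv)) so that Nagios sees the status as the process exit code. Note that this version counts matches in the whole file on every poll; you would probably combine it with some stored offset so that only new lines are counted.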
In the case of a logfile, you're most interested in streaming data, rather than in polling, so you would have a little extra work to do, but Nagios is routinely used to monitor SNMP traps and that's got the same considerations, so there's documentation on the web that can help.
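If you do want the streaming style, the usual trick is to keep the file open and keep reading from wherever you left off, much like tail -f does. A rough Python sketch follows; the function name follow and the parameters max_idle_polls and from_start are invented here so that the loop can be stopped and tested, they are not part of any library.

```python
import os
import time


def follow(path, poll_interval=1.0, max_idle_polls=None, from_start=False):
    """Yield complete lines appended to path, similar in spirit to `tail -f`.

    max_idle_polls and from_start are hypothetical knobs for this sketch;
    max_idle_polls=None means follow forever.
    """
    idle = 0
    buffer = ""
    with open(path, errors="replace") as log:
        if not from_start:
            log.seek(0, os.SEEK_END)  # skip what is already in the file
        while max_idle_polls is None or idle < max_idle_polls:
            chunk = log.readline()
            if not chunk:
                # Nothing new yet: wait a little and try again.
                idle += 1
                time.sleep(poll_interval)
                continue
            idle = 0
            buffer += chunk
            if buffer.endswith("\n"):
                yield buffer.rstrip("\n")
                buffer = ""
            # Otherwise the writer is mid-line; keep the fragment and wait.
```

You would then loop with something like "for line in follow(path):" and do your exception matching inside the loop. This sketch does not handle log rotation: if the application renames the file and starts a new one, the reader keeps watching the old file, so a real monitor would have to notice that and reopen.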
Loudly announcing something is true and finding out you're wrong makes you feel foolish.
Finding out you're wrong and refusing to admit it makes you LOOK foolish.
Except that "cheap" may not be as cheap as expected. When you buy a product you usually get support. If you build it, not only will you have to spend time on development (which isn't free, you still have to get paid), but eventually you also have to provide support (again, not free).
So the simple solution: use a well-maintained, free, open-source product (like Nagios) to get the best of both worlds, that is, support (probably through the community) without much cost beyond the initial investigation and setup.