
Process large file as String

I have a quick question and need some direction and thoughts.

I am trying to figure out the best way to tackle a problem with processing log files, so a bit of background might help.

Basically, I am building a monitoring solution and have run into a problem with log file sizes. You have a log file containing various outputs from log4j, and unfortunately, due to the design of the system being monitored, the values and strings I am searching for are spread across multiple lines, so reading, processing, and matching line by line is not possible.

The way I tackled it: for each log file on a server, I keep track of how far the file has been read, then send either a tail or a sed command to grab the lines that have not yet been read. The nature of the system is that only complete transactions are written to the log file, as a combination of XML and other line entries. This is an important thing to note.
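For illustration, roughly what that "grab only the unread lines" step looks like from Java. I haven't named the SSH library here; this sketch assumes JSch, and lastReadLine and logPath are placeholder names:

import com.jcraft.jsch.*;
import java.io.*;

public class RemoteTail {
    // Fetch everything after the last line we have already processed.
    static String grabNewLines(Session session, String logPath, long lastReadLine)
            throws JSchException, IOException {
        // tail -n +K prints from line K onwards, so start one past the last read line
        String cmd = "tail -n +" + (lastReadLine + 1) + " " + logPath;
        ChannelExec channel = (ChannelExec) session.openChannel("exec");
        channel.setCommand(cmd);
        StringBuilder out = new StringBuilder();
        BufferedReader r = new BufferedReader(new InputStreamReader(channel.getInputStream()));
        channel.connect();
        try {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        } finally {
            r.close();
            channel.disconnect();
        }
        return out.toString();
    }
}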

Once the lines have been read into a String, they are passed to a class to be matched against regular expressions and stored in a database. That way, for administration, all I have to do is tell the system which log files to monitor, what I want out of them, and where to store it.

Now the problem comes in with log files that haven't been read before and are over a certain size (around 20 MB). Smaller files don't seem to have any issues, and I am a little confused as to what is going on. When I look at the length of the String or StringBuffer, it is around the 13 million character mark, but when I do a StringBuffer.toString(), it is as if nothing is passed. From what information I can find, I am well below the maximum number of characters for a String (but maybe I'm not).

The workaround I had thought about, and have started to play with, is using sed to grab certain sections of the log files. What I would like to avoid is reading the log files line by line and looking for specific markers for the ends of transactions, because I want the solution to be as robust as possible, so that really anything could be passed to it.

The issue with doing it in sections is that I might miss a transaction that sits between each of the grabs, as some transactions can be over 100 lines.

So ideally, I would like to read the data into a String and then pass it to be processed. JVM size is OK; I am not even hitting that limit.

Anyhow, I would be interested in any recommendations or ideas. Thank you in advance for your help.

Jace Sim wrote: Anyhow, I would be interested in any recommendations or ideas. Thank you in advance for your help.


Well, the first thing that springs to mind is: why do you have 13 million characters in a StringBuffer?

The second is that you say on the one hand:
"The nature of the system is that only complete transactions are written to the log file"
and on another:
"I might possibly miss a transaction that sits between each of the grabs, as some of the transactions can be over a 100+ lines"
I understand that tail isn't the most sophisticated tool in the world (old Unix sysadmin), but either your transactions are getting written in one go or they aren't. Personally, I'd be looking at some form of log switching before I start to pull lines; but there may be other things that prevent you from doing that.

I'm also not sure of the nature of this program. Is it a daemon that simply sits at the far end of a pipe from a tail? Or is it something that you run periodically to process your logs?

However, all of these questions aside, my advice would be to process the file as lines, not as one huge String, and store them in a List<String>. If you need to bang them together for some other processing (e.g. XML), do that as needed; but I can't imagine that a few hundred lines of XML is likely to tax a StringBuffer too much.
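Something along these lines (the file name and slice bounds are just illustrative):

import java.io.*;
import java.util.*;

public class LineProcessor {
    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<String>();
        BufferedReader r = new BufferedReader(new FileReader("app.log"));
        try {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            r.close();
        }
        // When a multi-line unit (an XML fragment, say) needs matching,
        // bang together just that slice rather than the whole file:
        StringBuilder unit = new StringBuilder();
        for (String l : lines.subList(0, Math.min(200, lines.size()))) {
            unit.append(l).append('\n');
        }
        System.out.println(unit.length() + " chars in the joined slice");
    }
}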

If this is indeed a daemon then, obviously, you will need to periodically remove "processed" lines from your List before you continue; but that, again, should be a fairly simple operation.

HIH

Winston

Jace Sim wrote: Now the problem comes in with log files that haven't been read before and are over a certain size (around 20 MB). Smaller files don't seem to have any issues, and I am a little confused as to what is going on. When I look at the length of the String or StringBuffer, it is around the 13 million character mark, but when I do a StringBuffer.toString(), it is as if nothing is passed. From what information I can find, I am well below the maximum number of characters for a String (but maybe I'm not).



What do you mean it is as if nothing is passed? Perhaps you should try to send the result of the toString() to a file to see what happens. If it comes up empty then you'll know where to look for the problem.

Overall, I must admit I am a bit confused about what you're trying to do, as you haven't provided any example code, so the whole process is very blurry to me; but let's work through this to figure things out.

Oh, and do you need thread synchronization? If not, I would suggest replacing StringBuffer with StringBuilder.
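For example (dump.txt is just a scratch file; note that an unflushed or unclosed Writer is a classic way a big dump ends up empty):

import java.io.*;

public class DumpTest {
    static void dump(StringBuilder sb) throws IOException {
        String s = sb.toString();
        System.out.println("builder length=" + sb.length() + ", string length=" + s.length());
        Writer w = new BufferedWriter(new FileWriter("dump.txt"));
        try {
            w.write(s);
        } finally {
            w.close(); // closing flushes; skipping this can leave an empty file
        }
        System.out.println("file length=" + new File("dump.txt").length() + " bytes");
    }
}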

Winston Gutkowski wrote:
Well, the first thing that springs to mind is: why do you have 13 million characters in a StringBuffer?



Mainly because it is the result of the tail grab from the file. If the system hasn't seen the file before, it processes the whole thing; and yes, there is a risk I need to work around if the file is absolutely massive, but 20 MB isn't that massive a file. What happens is that I have a servlet running that spawns threads and makes SSH calls to the various boxes in my monitoring list.

The reason I am doing it as a String is that, although it is large, I am not overly concerned with the time it takes (it is pretty quick as it is), and I am matching across lines, not line by line. I am trying to make it as flexible as possible, so that I am not coding for particular cases every time I want to match something. Basically, I want to be able to say: here is the log file (be the items on one line or many), this is the database table I would like you to store matches in, and these are the regular expressions I would like you to apply to the string (sometimes several, depending on the log file). The whole idea is that nothing is coded for specific cases; it is database and regular expression driven, given a String. I am trying to avoid having to read the data, write it to a file, and then go through and process it (although in certain circumstances I do keep a copy of the file).

Winston Gutkowski wrote:

The second is that you say on the one hand:
"The nature of the system is that only complete transactions are written to the log file"
and on another:
"I might possibly miss a transaction that sits between each of the grabs, as some of the transactions can be over a 100+ lines"



There are two stages here. The system being monitored only writes to the log file when it has the full transaction. The bit I am talking about is the small grabs of line data: say, for example, I run sed -n '260014,280014p' on a file. A transaction could lie between lines 280001 and 280100, and if I were to simply grab that chunk and pass it on for processing, it would not match, because the regular expression doesn't have the whole transaction to match against.

Winston Gutkowski wrote:

I understand that tail isn't the most sophisticated tool in the world (old Unix sysadmin), but either your transactions are getting written in one go or they aren't. Personally, I'd be looking at some form of log switching before I start to pull lines; but there may be other things that prevent you from doing that.



Ideally yes, me too; however, I don't have that luxury, due to policies and a business averse to any changes needed to implement it. Don't worry, I have thought about that too.

Winston Gutkowski wrote:

I'm also not sure of the nature of this program. Is it a daemon that simply sits at the far end of a pipe from a tail? Or is it something that you run periodically to process your logs?



Yeah, basically I run a servlet that has a bunch of hosts to be monitored. It constantly polls the boxes and processes log files, remembering where it got up to, issuing commands to grab only information it hasn't already processed, passing that through regular expressions, and storing the data in various tables in a database. The idea is to be as unintrusive as possible, only monitoring and gleaning the information I need out of those log files.

Thanks for the help Winston... it is a bit of a difficult one..

Paul Mrozik wrote:
What do you mean it is as if nothing is passed? Perhaps you should try to send the result of the toString() to a file to see what happens. If it comes up empty then you'll know where to look for the problem.



What I have done to try and troubleshoot is this: I do a resultfromreallybigfile.length() and it comes up as 13084332 characters. I then do a toString() on the StringBuffer, print the result out to a file, and it comes up empty.

So my thinking is that the data is in the resultfromreallybigfile StringBuffer, going by the length call, but when it gets dumped out to the String it fails. Although this is far from elegant, I don't believe I am hitting the limit of how big a String can be, from what research I have done.

I guess the reasons why I am doing it the way I am come down to a few considerations:

Mainly, I have some data in log files that I need to run multiple regular expressions over, and what I have and how those expressions match determines what I do with the data.

I have some data that sits on single lines, and some that spans multiple lines and is scattered throughout the log file, which I try to match up and reconcile. It works great on smaller files and does the job, except when you get into the region of 20 MB files; while they are large, they aren't excessive in my mind, as the JVM size is 2 GB and I am not going anywhere near that.

The issue I am trying to figure out is why the conversion from StringBuffer to String fails and gives me nothing. That is the weird bit. If I do it in smaller chunks, i.e. 5 MB, it works perfectly; but at 20 MB, even though there is data in the StringBuffer, it doesn't dump it to the String.

Jace Sim wrote: Mainly, I have some data in log files that I need to run multiple regular expressions over, and what I have and how those expressions match determines what I do with the data.


Right. Well, I think you have to rein yourself in here. It sounds to me like your requirements are simply too broad:

  • I have input of unlimited size.
  • I have regexes that need to be able to scan any number of lines and find matches across those lines.
  • I need to do it quickly.
  • I need to do it in a limited space.

This is not the recipe for a rational solution.

My first suspicion is that your "regex" is not in fact a regex at all, but some sort of structured condition or search. Regexes were designed to find pattern matches in lines; it's only more recently that they've been expanded (wrongly, IMO) to become catch-all matchers for any darn String we care to throw at them, and now people are thinking they can do anything. They can't, and you shouldn't expect them to. Regexes should be short and sweet and understandable.

And one thing they are definitely NOT good for is parsing structured text such as XML (I wonder if this is what you're trying to do?). You need a proper parser such as SAX or DOM for that. And you need logic.

Without more info on what you're trying to do, it's difficult to advise; however, I'd definitely rein in (or break up) your "matching" requirements, because it sounds to me like that's where your problems are stemming from.

Winston

Yeah, I do think I am trying to achieve the impossible. Although, I must say, it is reasonably quick for how much it is doing.

Basically, what I have is a database that stores the directory, filename, and server to which it applies, along with the regular expression(s) that need to be applied to the incoming data. What I have decided to do (and something I wanted to avoid) is add to that a list of special character sequences that define the completion of a transaction, then process a smaller String, find the line number of the last match, and request the next chunk of data from that line onwards. The data is a big mix of stuff, some of which resembles XML but not really, so I have to treat it as non-XML, along with other chunks of data. That is the main problem: it is chunks of data, and I also have to match thread IDs in order to track things throughout the system as they flow through business processes.

The main idea is to be able to define things to look for in log files without doing any physical coding, entirely within the application. So, for example, if there happens to be a log file you are interested in, all you have to do is put it in the database, along with the regular expression, the column names, and the table where you want the data put, and that is it. The system takes care of the rest, putting the data into a database from which you can generate graphs and reports based purely on SQL.

Basically, this gives you the flexibility to say: I am now interested in getting something different from the log files, either from this point on, or, if you want to review historical data, by running the new matching over the archive of log files you have previously kept. That makes it completely flexible and quick to adjust to new things to look out for.

What makes it more difficult in this circumstance is that the things you are looking for span multiple lines, and I am dealing with data I can't change or modify due to restrictions.

The main idea is to have a log file, a completely adjustable regular expression matching system, and a way to massage the data however you want without having to modify source code. Part of it, based on SQL, also generates graphs on the fly. I had to build my own graphing library to deal with over 2 million transactions an hour and plot every single response time, which works extremely well and does it all within 15 seconds. The system being monitored contacts numerous other systems, so you need to be able to pinpoint transactions with various response times and identify when you have an issue.

In the previous implementation of this, I had all responses on single lines, which made things extremely easy; it was multithreaded, multi-machine, extremely scalable, and worked really well. Trouble is: new job, new environment, and I needed to modify it to cater for what currently exists, without modification to the system being monitored, mainly due to restrictions on making changes, while at the same time putting in monitoring that is more than just CPU usage and memory, which is all that was happening previously.

Thanks for the help too... greatly appreciated.

Jace Sim wrote: The data is a big mix of stuff, some of which resembles XML but not really, so I have to treat it as non-XML, along with other chunks of data. That is the main problem: it is chunks of data, and I also have to match thread IDs in order to track things throughout the system as they flow through business processes.


Well, it sounds to me like you have major problems:
1. You're trying to apply an existing (and, it would seem, brittle) program to data it was never designed for.
2. You have "some stuff that resembles XML but not really" that you're still trying to parse.
3. I'm not even sure where thread IDs come into it, or "track[ing] things throughout the system".

It sounds to me like you need to StopCoding (←click) and get a handle on ALL these things:
  • What is that "XML but not really"?
  • Are regexes really applicable to your new data?
  • What is all this "thread ID" and "tracking things through the system" stuff?

You need to sit down with a pencil and paper (lots of it) and:
(a) Find out why you're being supplied with data that doesn't - apparently - conform to any known standard.
(b) Write a new spec.

Winston

Maybe this might explain things:


    Log file

1: Transaction1 : kaljksjdfljaslkdjflksajdflkjsadljflkdsajf
2: TransactionOrder: jjkjjkjdjkjjd
3: kasdkljdjf
........
56: [END OF TRANSACTION ORDER]
97: TransactionOrder : asdfasdf
98: kjlkjljljlkj
99: kjalkjsdf
100: kjasldjflasjdf
101: jdjdjf
102: Transaction19: jljaslkdjflsajd
103: jasdjjdk


If, for example, I read in lines 1-99, grabbing 100 lines at a time, I would miss the order transaction that started at line 97, as the next chunk of data wouldn't contain the initial match. Part of the issue is that transaction lengths vary, and I don't want to have to manage partial transactions between log file grabs.

As far as the system is concerned, all it should worry about is: here is a bit of data, this is the regular expression I want you to apply, and this is where I want you to put the results. No hard-coding of how to deal with a particular log file; it is all database driven: this is the system, this is the log file I want you to monitor, these are the regular expressions I want you to look for, and when a regular expression matches, this is where I want you to put the data. That is probably the best explanation of what is occurring.

The way I have designed it: in the database, you have the regular expression with a unique regularExpressionID. Then, in a separate table, I have a mapping from the regular expression to the column name where each matched group should be put, and the format of that data, i.e. whether it is an int, a date, a combo field (a combination of any of the fields you list), plus any extra SQL you want to run alongside. As a regular expression is matched, the system dynamically builds the SQL via prepared statements and inserts the data, working on chunks of already-prepared data at a time and doing the inserts after a particular number of entries, or when there is no more data to process. The longer-term plan (which I haven't got around to yet) is a webpage for building your regular expressions that creates the tables for you automatically.
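Roughly, the insert side looks like this (the table and column names here are invented for the example; the real ones come from the mapping tables):

import java.sql.*;
import java.util.regex.*;

public class MatchInserter {
    // One capture group per configured column; cols and the regex come
    // from the database in the real thing.
    static void insertMatches(Connection conn, String chunk,
                              String configuredRegex, String[] cols) throws SQLException {
        StringBuilder colList = new StringBuilder();
        StringBuilder marks = new StringBuilder();
        for (int i = 0; i < cols.length; i++) {
            colList.append(i == 0 ? "" : ",").append(cols[i]);
            marks.append(i == 0 ? "?" : ",?");
        }
        String sql = "INSERT INTO log_matches (" + colList + ") VALUES (" + marks + ")";
        Pattern rx = Pattern.compile(configuredRegex, Pattern.DOTALL); // '.' spans lines
        PreparedStatement ps = conn.prepareStatement(sql);
        try {
            Matcher m = rx.matcher(chunk);
            while (m.find()) {
                for (int g = 1; g <= cols.length; g++) {
                    ps.setString(g, m.group(g));
                }
                ps.addBatch();
            }
            ps.executeBatch(); // batched, as described above
        } finally {
            ps.close();
        }
    }
}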

Anyhow, moving forward, what I have decided to do is read in chunks, set a regular expression for what would usually determine the end of a transaction (or the start of a new one), and use that as the next point from which to read. I would have needed to do this in any case, to cover processing a massive log. I just have to make sure that the size of each grab will hopefully encompass multiple transactions from start to finish.

So say I have a log file with 10,000 lines. I read in 1,000 lines and look for the last occurrence of a particular regular expression that indicates the end of a transaction; say it occurs at line 980. The next grab then starts from line 980 and reads the next 1,000 lines, bringing in lines 980 to 1980; I do the search for the last occurrence again, and so on, basically reading chunks of data between occurrences of transaction ends.
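In sketch form (fetchLines() and process() stand in for the sed/SSH grab and the matching stage, and the marker pattern is just an example):

import java.util.*;
import java.util.regex.*;

public class ChunkedReader {
    static final Pattern END_OF_TX = Pattern.compile("\\[END OF TRANSACTION"); // example marker

    static void run(int chunkSize) {
        int from = 1; // 1-based, as sed counts lines
        while (true) {
            List<String> chunk = fetchLines(from, from + chunkSize - 1);
            if (chunk.isEmpty()) break;
            int lastEnd = -1;
            for (int i = 0; i < chunk.size(); i++) {
                if (END_OF_TX.matcher(chunk.get(i)).find()) lastEnd = i;
            }
            if (lastEnd < 0) break; // no complete transaction yet: wait or widen the window
            StringBuilder block = new StringBuilder();
            for (String l : chunk.subList(0, lastEnd + 1)) {
                block.append(l).append('\n');
            }
            process(block.toString()); // regex matching and database inserts
            from += lastEnd + 1; // next grab starts just past the last complete transaction
        }
    }

    static List<String> fetchLines(int from, int to) { return new ArrayList<String>(); } // sed -n 'from,to p' over SSH
    static void process(String block) { } // apply the configured regexes
}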

Then, for each grab of data, I pass it off to a thread to do the matching, the idea being that multiple log files can be processed simultaneously. All a thread needs is the chunk (String) of data and the regular expression(s) to apply; based on a few other identifiers passed to it, it figures out what to look for in the data and stores the results in the database.
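The hand-off itself is just a thread pool; something like this sketch (the pool size of 8 is picked out of the air, and store() is a placeholder for the database stage):

import java.util.concurrent.*;
import java.util.regex.*;

public class MatchDispatcher {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    // Each grab is matched on its own worker thread, so several log
    // files can be in flight at once.
    void dispatch(final String chunk, final Pattern[] patterns) {
        pool.submit(new Runnable() {
            public void run() {
                for (Pattern p : patterns) {
                    Matcher m = p.matcher(chunk);
                    while (m.find()) {
                        store(m); // placeholder: write the captured groups to the database
                    }
                }
            }
        });
    }

    void store(Matcher m) { } // DB insert, as sketched earlier

    void shutdown() { pool.shutdown(); }
}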

The ultimate aim, at the end of the day, is a completely flexible monitoring system that looks for things within the data, is completely portable between systems processing different types of data, and stores whatever you want to store in a database.

The reason I have chosen to do things as Strings is that, no matter what you have, if you need data matched against a regular expression, it happens on the fly, quickly, without first writing it to a local file (though when it gets time, I do dump the data out to a file). Most importantly, the data is processed and displayed as quickly as possible, without much impact on the system being monitored. I have always approached monitoring as something that should be as non-intrusive as possible, doing very little processing on the host machine. I could run greps etc. on the host first, but I see that as a negative, as it could take resources away from the system, compared with a simple "give me the data" and processing it off-host. I have always been averse to having monitoring agents on machines.

Hope this explains things a little more, and thanks again for your help; it is greatly appreciated.

Jace Sim wrote: The ultimate aim, at the end of the day, is a completely flexible monitoring system that looks for things within the data, is completely portable between systems processing different types of data, and stores whatever you want to store in a database.


OK, but you're not thinking about this empirically. You're applying all sorts of constraints (of which, I suspect, regexes are your main problem) to your solution before you even know what kind of data you're going to be dealing with. I.e., you're deciding HOW you're going to do this before you know WHAT you need to do - always a bad move.

Regexes are good, but they're NOT a panacea; and my general rule of thumb is that if they don't work on one line, then I need to find some other way of doing things. grep, awk and perl work well precisely because they were designed with a limited scope; but perl as an "object-oriented" language or grep as a multi-line matcher? Perlease.

Furthermore, you're dealing with an input stream that appears to adhere to no recognised standard that I know of, and so could presumably be changed by your supplier any time they feel like it. This is NOT a good basis for a "generic" solution.

One possibility, just for consideration, is to replace your regexes with a more sophisticated form of matching. How you do that? Not sure, because I don't know enough about your app; but I'm pretty sure you're going to have to let them go, or use them in a more complex framework, if you want to get your job done.

HIH. 'Fraid it's bedtime for me.

Winston

Winston Gutkowski wrote:
Well, it sounds to me like you have major problems:
1. You're trying to apply an existing (and, it would seem, brittle) program to data it was never designed for.
2. You have "some stuff that resembles XML but not really" that you're still trying to parse.
3. I'm not even sure where thread IDs come into it, or "track[ing] things throughout the system".



The system being monitored is a poorly designed implementation of a process server, with multiple downstream systems and multiple dynamic paths, and it just doesn't work. At the moment there is no source code, nor one shred of documentation whatsoever, and no ability to modify or even view what is happening, other than what is currently being pushed out to several log files, all in different formats, in different locations, each containing little bits of data that I am trying to meld together to get at least some view of how things are flowing, when things are broken, and basically how it has all been put together. In the mix, it also has a pretty consistent memory leak, which I am trying to identify, so that I can at least alert when a particular process that occurs on an ad hoc basis is about to kick off, and hone in on where the leak is coming from.

The problem then is that I have to be able to apply this to other systems in the same state. I have started at a job that has had absolutely no IT governance and is in a right mess.

The biggest problem is that I have to work with the logs I have. They aren't pretty and contain a mix of all sorts of things, but there are bits I can at least work with to figure out what is occurring, and to have some monitoring that shows when things get stuck in the process, so I can hopefully determine the cause.

Winston Gutkowski wrote:
It sounds to me like you need to StopCoding (←click) and get a handle on ALL these things:
  • What is that "XML but not really"?
  • Are regexes really applicable to your new data?
  • What is all this "thread ID" and "tracking things through the system" stuff?

You need to sit down with a pencil and paper (lots of it) and:
(a) Find out why you're being supplied with data that doesn't - apparently - conform to any known standard.
(b) Write a new spec.

Winston



The output from the logs is hard-coded in the system and does not contain a lot of information. So I am going through, based on thread IDs (which are thankfully in some of the logs), trying to piece together at least some picture of where things have failed in their travels through the system.

I use the regular expressions (there are a few of them) to extract the various bits of data from the logs. It is all working OK at the moment; I have managed to get a much clearer picture of what was going on and have highlighted a few things. The monitoring works extremely well, except for when I cleared all the data and re-read everything from scratch, processing older log files, one of which in particular was 20 MB. All of the hard thinking had been dealt with... except for how to process a chunk of data as a String from a 20 MB file.

I also chose regular expressions so that later on I can be flexible in adding additional things to extract from the log files as I go along. I have a lot of data, some useful and most of it not, but as I go along trying to correlate things, there have been occasions when I needed to grab some extra information once I found out what the data being logged was actually telling me: some identifiers are hidden within strings, sometimes, but not always. For example, order number 76788jdhhd might be displayed in one log as 76788jdhhd, in another as 776768-ddf76788jdhhdads-998dj4, or even in a shortened version such as 76788jd, but then I can sometimes match up the thread ID on which the process kicked off, all in multiple logs, all over the shop.

Believe me, I would love to stop coding, but I have to move forward with getting some useful monitoring in, beyond just the CPU and memory being used on the box. The whole thing works perfectly; I just got stumped by the String size.

The trouble is, people's lives are actually at stake with this system; that is the scary bit. I have to do whatever I can to get something in place to make sure it is working until the new solution comes in, which is over a year away.

Winston Gutkowski wrote:

Jace Sim wrote: The ultimate aim, at the end of the day, is a completely flexible monitoring system that looks for things within the data, is completely portable between systems processing different types of data, and stores whatever you want to store in a database.

OK, but you're not thinking about this empirically. You're applying all sorts of constraints (of which, I suspect, regexes are your main problem) to your solution before you even know what kind of data you're going to be dealing with. I.e., you're deciding HOW you're going to do this before you know WHAT you need to do - always a bad move.

Winston



Yeah, that is why I had gone down this route: the data being processed isn't always consistent, and the monitoring system needs to be completely adaptable. The regexes work extremely well; I have got that nailed. It was just the problem of reading too much data into a String and it not working. The logs are completely non-standard, but they are all I have to work with. The whole app works fantastically well if your data String is under 5 MB; over that, it dies. I am going to code it to read in chunks, hopefully getting complete transactions within those chunks, and fire them off to be processed; that side of things works really well. This implementation has pushed me to cater for data over multiple lines, which I have never had to deal with before, as in past systems I always ensured that data for analysis was on one line and complete. Trouble is, I have walked into this job and it is all a complete shambles, and I have to try to give some clarity to what is actually occurring in the system with the limited logs I have, be they in a complete all-over-the-shop mess.

Thanks for your help mate, greatly appreciated.

Winston Gutkowski wrote: Regexes are good, but they're NOT a panacea; and my general rule of thumb is that if they don't work on one line, then I need to find some other way of doing things. grep, awk and perl work well precisely because they were designed with a limited scope; but perl as an "object-oriented" language or grep as a multi-line matcher? Perlease.



Yeah, the trouble is trying to keep things clean and in one package, without relying on external languages.

When I initially wrote my application I did it all in perl, which was great, but then I needed to build an interface and other things around it, so I decided to focus purely on making it a Java app, and also to limit running resource-hungry things on the boxes I am monitoring.

The only stumbling block was the size of the String I needed to process, but I think the workaround is going to be the best solution.

Jace Sim wrote: I also chose the regular expressions so that later on I can be flexible in adding additional things to be extracted from the log files as I go along.
[...] The whole thing works perfectly, just got stumped by the String size. [...] but think the workaround is going to be the best solution.

I'm not quite sure what "workaround" you're talking about, but it sounds to me like you've got a lot of time and love invested in your original model, and you're finding it hard to let go of it, even in the face of evidence that it's no longer valid.

Jace Sim wrote: Believe me, would love to stop coding, but have to move forward [...] The trouble is, people's lives are actually at stake with this system, that is the scary bit... and I have to do whatever I can to get something in to make sure it is working until the new solution comes in, which is over a year away.


Whoa. "Lives at stake"? Sounds like the sort of emotive thing a manager might say to get me to burn extra candle hours to complete a job he knows is unreasonable.

If lives are indeed at stake, then this is precisely when you need to StopCoding and look properly at what needs to be done, because you need to get it right. You will never code your way out of a mess.

Just a thought, and I have no idea whether it'll work or not: rather than a single regex of almost unlimited scope, what about two: a 'start' expression and an 'end' one? I'm presuming that your current one has ".*?" (or something like it) in it somewhere, so why not turn it into two expressions that only need to search at most a couple of lines at a time?
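As a sketch (both patterns are invented here; the point is that neither expression ever has to look at more than one line):

import java.util.*;
import java.util.regex.*;

public class TwoRegexScanner {
    static final Pattern START = Pattern.compile("^Transaction\\w*\\s*:"); // invented
    static final Pattern END = Pattern.compile("\\[END OF TRANSACTION"); // invented

    static void scan(List<String> lines) {
        StringBuilder tx = new StringBuilder();
        boolean inTx = false;
        for (String line : lines) {
            if (!inTx && START.matcher(line).find()) inTx = true;
            if (inTx) tx.append(line).append('\n');
            if (inTx && END.matcher(line).find()) {
                handle(tx.toString()); // only now run the detailed matching
                tx.setLength(0);
                inTx = false;
            }
        }
    }

    static void handle(String transaction) { } // apply the detail regexes here
}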

However, I fear there may also be some embedding or hierarchy involved in these "transactions"; and if that's the case, regex is NOT the solution (except possibly in a very limited way). You need a parser.

HIH
Winston

Yeah, I would take a step back here too. I have speed-read through this thread, and from what I see, this is going downhill.

Have you thought about using an OTS solution like Splunk to do your log analysis? What you are doing here seems a lot like reinventing the wheel.

If I were to reinvent this wheel, I wouldn't put any kind of business logic in the log parser, and I wouldn't use a relational database. The bit that reads the log files and parses them should:
a) have some rudimentary filtering;
b) tokenize the logs;
c) store the logs in a database that provides an inverted index (for example Lucene; even a NoSQL database like MongoDB will work well) - see the sketch after this list;
d) build a UI that can do built-in searches based on your business logic, as well as custom ad hoc searches.
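For (c), indexing with Lucene is only a few lines. This sketch is roughly the Lucene 5+ API (the exact classes vary a bit across versions) and the field names and values are invented:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;

public class LogIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("log-index"));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        // One document per line (or per transaction, once tokenized)
        Document doc = new Document();
        doc.add(new StringField("host", "app01", Field.Store.YES)); // exact-match field
        doc.add(new TextField("line", "TransactionOrder: ...", Field.Store.YES)); // analyzed text
        writer.addDocument(doc);
        writer.close();
        // Searches then hit the inverted index instead of regexes over raw strings.
    }
}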

I can see a couple of pitfalls with your current design:
a) Your logic seems very tightly coupled with how the application logs information. That coupling is bad: developers generally do not think twice before changing log messages. The log file cannot be an "interface".
b) How the heck do you scale this? Really? One machine pulling logs from multiple hosts and grinding away pushing data into a database? What happens when you add hosts? Or some developer adds a log statement that doubles the size of the logs? Or an Ops person turns on FINE logging?

Jayesh A Lalwani wrote: Have you thought about using an OTS solution like Splunk to do your log analysis? What you are doing here seems a lot like reinventing the wheel.



Splunk is OK, but it is very limited in getting meaningful graphing data out.

Jayesh A Lalwani wrote: If I were to reinvent this wheel, I wouldn't put any kind of business logic in the log parser, and I wouldn't use a relational database. The bit that reads the log files and parses them should...



That is the whole idea: there is no business logic in the log parser. The idea is to have a basically dumb system that processes logs, pulling out what you need.

Jayesh A Lalwani wrote:
a) Your logic seems very tightly coupled with how the application logs information. That coupling is bad: developers generally do not think twice before changing log messages. The log file cannot be an "interface".



That is the whole idea of having the regular expressions in a database: you don't have to continually change code to pick up or modify matches on items in the log file; you just modify the regular expression in the database and that is it. No hard-coding of anything.

Jayesh A Lalwani wrote:
b) How the heck do you scale this? Really? One machine pulling logs from multiple hosts and grinding away pushing data into a database? What happens when you add hosts? Or some developer adds a log statement that doubles the size of the logs? Or an Ops person turns on FINE logging?



It scales pretty easily, really, because all it needs is the host to connect to, the log location, and the regular expression you want to apply; with that, it can just go out and gather. You could also make it batch up the SQL if you wanted. It is all multi-threaded as well, so it handles multiple hosts, connecting away as it goes.

I have a workaround now: if the log file is larger than a certain number of lines, it is processed in chunks, making sure that whatever you set as the end-of-transaction marker falls within each chunk. It doesn't bring back the whole log file, only the bits it hasn't already processed. It also keeps track of inodes and manages when logs roll over.
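The rotation check is nothing fancy; something like this, where runRemote() is a placeholder for the existing SSH call ('stat -c %i' prints a file's inode on Linux):

public class RotationCheck {
    long lastKnownInode;
    long lastReadLine;

    void checkRotation(String logPath) {
        long inode = Long.parseLong(runRemote("stat -c %i " + logPath).trim());
        if (inode != lastKnownInode) {
            lastReadLine = 0; // the log rolled over: treat it as a brand-new file
            lastKnownInode = inode;
        }
    }

    String runRemote(String cmd) { return "0"; } // placeholder: exec over the SSH session
}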

Winston Gutkowski wrote:
Whoa. "Lives at stake"? Sounds like the sort of emotive thing a manager might say to get me to burn extra candle hours to complete a job he knows is unreasonable.



Yeah, it is surprising, but we have to wait until the new system comes along, as this one is a complete mess. It has only been in since October and is already gearing up for redevelopment: extremely poor design and absolutely the wrong product selection.

Winston Gutkowski wrote:
Just a thought, and I have no idea whether it'll work or not: rather than a single regex of almost unlimited scope, what about two: a 'start' expression and an 'end' one? I'm presuming that your current one has ".*?" (or something like it) in it somewhere, so why not turn it into two expressions that only need to search at most a couple of lines at a time?



I thought about that too... The main issue was that there are bits throughout that make the transactions different, and then you get into the mess of a nested chain of regular expressions. Using .*? does seem to work really well, though. Overall it works really well, and is surprisingly fast.

Winston Gutkowski wrote:
However, I fear there may also be some embedding or hierarchy involved in these "transactions"; and if that's the case, regex is NOT the solution (except possibly in a very limited way). You need a parser.



Yeah... It does work well, though, and works well enough over multiple lines now as well.