
Advice on file parsing

 
tom davies
Ranch Hand
Posts: 168
I am creating a log file parser which will eventually store the various log types in the correct database tables. To start with, I am just trying to parse the individual log types. Most of the log types are easy: they span only one line, so I can check the type and search for the parts I need.
Other log types span multiple lines, and my problem is that I can't figure out how to tell whether a given log entry has ended. Below I have added an example of the multi-line types. The added problem is that they don't always span the same number of lines and don't always include the same attributes. Currently I am reading the file line by line, splitting each line into an array of strings on whitespace, and then checking the array for keywords such as ciaddr. I then take the value of ciaddr from the element two positions after the keyword, so that I get the IP address after the equals sign. This is fine for checking just one entry, but I have no way of telling when one log entry has ended and another has begun. Could someone help me out with a simple solution? I think I have overthought the problem and am now drawing a blank.
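For reference, the keyword lookup described above might look roughly like the sketch below. The class and method names (AttributeFinder, valueOf) are made up for illustration, and it assumes attributes always appear as three whitespace-separated tokens: the keyword, an equals sign, and the value.

```java
import java.util.Optional;

// Hypothetical helper: finds "key = value" on a whitespace-delimited log line.
class AttributeFinder {

    // Returns the token two positions after the key (skipping the "="),
    // or Optional.empty() if the key is absent or has no value after it.
    static Optional<String> valueOf(String line, String key) {
        String[] tokens = line.trim().split("\\s+");
        for (int i = 0; i + 2 < tokens.length; i++) {
            if (tokens[i].equals(key) && tokens[i + 1].equals("=")) {
                return Optional.of(tokens[i + 2]);
            }
        }
        return Optional.empty();
    }
}
```

The bounds check in the loop condition avoids an ArrayIndexOutOfBoundsException when a keyword happens to be the last or second-to-last token on a line.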

Cheers

 
Tony Docherty
Bartender
Posts: 3268
Can you show an example of single-line logs as well?
 
tom davies
Ranch Hand
Posts: 168
Tony Docherty wrote: Can you show an example of single-line logs as well?

I have added some single-line logs to the original post.
 
Tony Docherty
Bartender
Posts: 3268
Assuming the lines shown are an accurate representation of all log lines, the only way I can see of finding the continuation lines is the fact that the details part of each continuation line is indented.
If that is the case, you could trim each line to remove leading and trailing whitespace and then check each line for 'n' whitespace characters (or possibly a tab character). I'm not sure how reliable this will be, though, as the details part may contain similar whitespace.
Are there any other markers that can be used, i.e. is it only certain types of log that can be multi-line?
 
Joanne Neal
Rancher
Posts: 3742
Based on the examples you've shown, it would appear that multi-line logs include dhcp,debug,packet after the time and date and single-line ones don't.
Then, within a multi-line log, the dhcp,debug,packet is followed by a single space on the first line and 5 spaces on the other lines (or are they tabs?).
Can you use this information to identify multi-line log entries?

Basically you need to identify some pattern that distinguishes single line entries from multiple line entries and then another pattern to distinguish the first line of a multiple line entry.
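As a rough sketch of that idea, the class below sorts a line into one of three kinds by looking at the whitespace that follows the dhcp,debug,packet marker. The class name, the enum, and the rule "more than one space, or a tab, means continuation" are assumptions based only on the description in this thread, not on the real log files.

```java
// Hypothetical classifier based on the marker and the whitespace after it.
class LineClassifier {

    enum Kind { SINGLE_LINE, MULTI_FIRST, MULTI_CONTINUATION }

    static final String MARKER = "dhcp,debug,packet";

    static Kind classify(String line) {
        int at = line.indexOf(MARKER);
        if (at < 0) {
            return Kind.SINGLE_LINE;          // no marker: treat as a one-line entry
        }
        String rest = line.substring(at + MARKER.length());
        int ws = 0;                           // count leading whitespace after the marker
        while (ws < rest.length() && Character.isWhitespace(rest.charAt(ws))) {
            ws++;
        }
        boolean indented = ws > 1 || rest.startsWith("\t");
        return indented ? Kind.MULTI_CONTINUATION : Kind.MULTI_FIRST;
    }
}
```

If the details part of a single-line log can itself contain the marker text, this rule would misclassify it, so it is worth testing against real files.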
 
Winston Gutkowski
Bartender
Posts: 10573
tom davies wrote: Could someone help me out with a simple solution as I think I have overthought the problem and now am drawing a blank

Well, as Joanne says, you need to find some pattern that distinguishes a first line from any other; and just looking at your sample, it would appear that all first lines start with:
"MMM/dd/yyyy HH:mm:ss " (note the trailing space)
(in SimpleDateFormat terms)

However, depending on content, you might be unlucky enough to run into a line that just happens to start with a date in the same format.
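A minimal sketch of that check, assuming English month abbreviations and a hypothetical class name FirstLineDetector, could use SimpleDateFormat with a ParsePosition so that only the start of the line has to match:

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Locale;

// Hypothetical helper: does the line begin with "MMM/dd/yyyy HH:mm:ss "?
class FirstLineDetector {

    // Note the trailing space, which is matched literally.
    private static final String PATTERN = "MMM/dd/yyyy HH:mm:ss ";

    static boolean startsWithTimestamp(String line) {
        // SimpleDateFormat is not thread-safe, so create one per call.
        SimpleDateFormat format = new SimpleDateFormat(PATTERN, Locale.ENGLISH);
        format.setLenient(false);             // reject dates such as Jan/32/2013
        // parse(String, ParsePosition) returns null instead of throwing on failure
        // and ignores any text after the matched prefix.
        return format.parse(line, new ParsePosition(0)) != null;
    }
}
```

As noted above, a details line that happens to begin with text in the same format would still be reported as a first line.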

One thing that a lot of text parsers (e.g., shell script interpreters) do is to have a line continuation marker. In bash, for example, it's the backslash; any line ending with a "\" is assumed to be continued on the next line.

Winston
 
tom davies
Ranch Hand
Posts: 168
In all the log files I have, the dhcp lines which contain the information are the only ones with multiple spaces, so I could check the whitespace on the line.
I have just added a check that says if (line.contains(" "))
Is that a good way to do it, or should I find an alternative? I just put a println in to check and it seems to work. I will now try to get it to recognise multiple log entries.
 
Ulf Dittmer
Rancher
Posts: 42972
If this was my problem I'd use a lexer to deal with this (something like JFlex). Such formats have a tendency to become more complicated over time, and maintaining code that parses it "manually" becomes a headache quickly. If I may strut my own stuff, here's a little writeup I did on how to use JFlex: http://www.javaranch.com/journal/2008/04/Journal200804.jsp#a4
 
tom davies
Ranch Hand
Posts: 168
I will take a look at lexers. I had the parser working using sample log files containing different dhcp types.
As the next step, I am trying to add this dhcp parser to a complete log parser that will include methods to parse the other log types.
I have encountered two problems. The first is that my date/time values are wrong; I think I am recording the date from the next log entry and not the current one.
The main problem is that I can't get my head around how to find the end of a particular log entry. Currently I work that out in the dhcp method, which is fine if the only logs are dhcp, but if the dhcp method is not called again (at the end of a file, for example) then that entry does not get stored. I have a feeling the date/time problem will be sorted out once I figure out how to find the end of an entry.
If I had a method to go one step forward, check the next entry for log type and/or dhcp type, then go back again and store the previous entry if the next was a different log, it might work. There is no way to do this with Scanner, though. Any advice is much appreciated.

My current parser (sorry if it's a bit long, but it shows how my parser currently works):

 
Joanne Neal
Rancher
Posts: 3742
tom davies wrote: If I had a method to go one step forward, check the next entry for log type and/or dhcp type, then go back again and store the previous entry if the next was a different log, it might work.

Have two variables to hold the current and next entries. After you process each entry you discard the current one, make the current entry variable point to what was the next entry and then read the next entry and point the next entry variable at it.
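A rough sketch of the two-variable idea follows. The class name LookaheadGrouper and the typeOf rule (log type is the third whitespace-separated token) are assumptions made for illustration. An entry is closed as soon as the next line is missing or has a different type.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical grouping of lines into entries using a one-line lookahead.
class LookaheadGrouper {

    // Assumption: the log type is the third whitespace-separated token.
    static String typeOf(String line) {
        String[] tokens = line.trim().split("\\s+");
        return tokens.length > 2 ? tokens[2] : "";
    }

    static List<List<String>> group(Iterator<String> lines) {
        List<List<String>> entries = new ArrayList<>();
        List<String> entry = new ArrayList<>();
        String current = lines.hasNext() ? lines.next() : null;
        while (current != null) {
            // Read ahead once; null means there is no next line.
            String next = lines.hasNext() ? lines.next() : null;
            entry.add(current);
            if (next == null || !typeOf(next).equals(typeOf(current))) {
                entries.add(entry);           // current line ends this entry
                entry = new ArrayList<>();
            }
            current = next;                   // shift the window forward
        }
        return entries;
    }
}
```

Because the end-of-input case is handled by the same test as a change of type, the last entry is stored without special handling. Comparing types alone would merge two back-to-back entries of the same type, so a real parser would need a stricter test for where an entry begins.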
 
tom davies
Ranch Hand
Posts: 168
Thanks, I seem to have something that is working (mostly). In the recieveElements() and sendingElements() methods I have added an if statement that checks whether there is a next line and whether it is of the same log type and dhcp type. The log is then stored if the next line is not of the same log/dhcp type or if there is no next line. This seems to work fine, but I think I have a logic error in my main parse loop. I keep missing the last log entry because I am not checking the last line. I can't see my error, though; maybe you can.
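One common cause of a missing final entry is that entries are only stored when the following entry starts, so nothing stores the last one. A minimal sketch of the usual fix, flushing whatever is still pending after the loop ends, is shown here. The names EntryCollector and startsNewEntry are hypothetical, and the continuation rule (the marker followed by two or more spaces) is an assumption taken from the earlier replies rather than from the real parser.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical collector that buffers lines and flushes once more at end of input.
class EntryCollector {

    // Assumption: continuation lines have the marker followed by 2+ spaces.
    static boolean startsNewEntry(String line) {
        return !line.contains("dhcp,debug,packet  ");
    }

    static List<List<String>> collect(Iterator<String> lines) {
        List<List<String>> entries = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        while (lines.hasNext()) {
            String line = lines.next();
            if (startsNewEntry(line) && !pending.isEmpty()) {
                entries.add(pending);         // previous entry is complete
                pending = new ArrayList<>();
            }
            pending.add(line);
        }
        if (!pending.isEmpty()) {
            entries.add(pending);             // final flush: the last entry
        }
        return entries;
    }
}
```

With this shape the loop never needs to look ahead, and the date/time of an entry can be taken from the first line in each stored list.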

 