Enterprise Java

Search for time interval in logs

Intro

This post is indirectly related to my mini-series about log analysis. It would be great to read the two main parts to better understand what I’m talking about. Part 1Part 2.

This post describes one important problem I ran into while implementing the IDE approach.

Task Description

When someone is working with logs usually he or she needs to investigate only one time interval. The available logs usually span days but the time interval that must be investigated is 1-2 hours. The task is to select all log records within that time interval.

Basic Log Record Regex

In order to select a log record we need a regular expression that matches any log record. For the simple log4j format like

2018-08-10 11:00:56,234 DEBUG [Thread-1] package1.pkg2.Class1 Text Message

I found the following regex:

TIME_REGEX((?!(TIME_REGEX)).*\r?\n)*

This regular expression matches both single-line and miltiline log records. Time regex could be

\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d,\d\d\d

So if somebody wanted to load all logs into a text window he could open the log files one by one and use Matcher.find() with this regex to get all log records.

This regex is based on the fact that the time regex pattern is never repeated in the body of the log message which is true in 99% of all cases.

Datetime of the Log Record

In order to search for a specific time interval and use other features it makes sense to extract the dtaetime information from the log record. Thankfully this task has been solved by JDK with DateTimeFormatter. It is enough to specify the format for the log type and the date can be extracted. For example for the log record above the format is

yyyy-MM-dd HH:mm:ss,SSS

As soon as we can extract the datetime information we can specify the interval as datetime values not Strings in some specific format.

Search Time

Now that we have found a way to select any log record and extract the date information from it the path forward seems clear:

  • specify the interval,
  • select the records one by one
  • extract the date information from the log record
  • compare the datetime with the interval
  • if the datetime is within the interval add this record to the List of found records
  • after searching through all files show the found records

There is one big issue with this approach:
time. With 50 log files 50 MB each it will take hours to scan all them to find 10 MB of records in the interval.

Solution

We can use one trick to filter out the files that do not contain a single record in the interval. We use the fact that the log records in the log files are written one after the other. This means the time of the next record is equal or after the time of this record. For example only 2 situations are possible:

2018-08-10 11:00:56,234 DEBUG [Thread-1] package1.pkg2.Class1 Text Message
2018-08-10 11:00:56,234 DEBUG [Thread-1] package1.pkg2.Class1 Msg 2

Or

2018-08-10 11:00:56,234 DEBUG [Thread-1] package1.pkg2.Class1 Text Message
2018-08-10 11:00:56,278 DEBUG [Thread-1] package1.pkg2.Class1 Msg 2

I rarely saw some examples where under high load the log records can go in reverse but the difference is in milliseconds. We can consider this difference insignificant for our purpose.

This means if neither the first nor the last record in the file are not in the interval all the records in the file are not in the interval and this file can be filtered out. Java regular expressions have special constructs to find the first and the last records.

The first record:

\ATIME_REGEX((?!(TIME_REGEX)).*\r?\n)*

The last record:

TIME_REGEX((?!(TIME_REGEX)).*\r?\n)*\Z

\A means the beginning of the text, \Z means the end of the text. You can find more details in the javadocs for java.util.regex.Pattern.

The solution is to use a special prescanning technique. Before scanning the whole text of a log file find the first and last records and if non of them is in the interval skip the file. Of the 50 files maybe 1-2 needs to be scanned.

Conclusion

REAL uses this technique to speed up searches for a datetime interval. I found that it takes approximately 5-10 seconds to decide if the file must be skipped. Most of the time if spent executing Matcher.find() for the last record. The first record is found much faster. I think it is possible to speed it up even further by selecting the last 5 MB of a 50 MB file to search for the last record. But even in the current state it is fast enough.

Published on Java Code Geeks with permission by Vadim Korkin, partner at our JCG program. See the original article here: Search for time interval in logs

Opinions expressed by Java Code Geeks contributors are their own.

Vadim Korkin

Vadim is a senior software engineer with lots of experience with different software technologies including Java, Javascript, databases (Oracle and Postgres), HTML/CSS and even machine learning. He likes learning new technologies and using the latest libraries in his work. Vadim also has some personal projects on Github
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Back to top button