Sid Mansukhani

A Developer-Friendly Way to Analyse Log Files

Windows, Linux, WSL, Ubuntu, AWK, IIS, Log Processing, Scripting, Tutorial · 14 min read

It’s 4 pm on a Friday afternoon. You’re sitting at your workstation, attempting to finish off development of that feature you’ve been working on all week. You’ve got your headphones on with your favourite music playing to block out all the Friday afternoon shenanigans. The feature is almost done and you’re looking forward to beginning your weekend. Suddenly, you feel a light tap on your shoulder. It’s your boss! She needs your help to get a breakdown of all successful/failed incoming requests to the company website for the last week. Reluctantly, you decide to jump onto the server and start downloading the log files for the last week. There are at least a dozen files and, while they’re downloading, you begin to search for the best way to extract the data you need.

How would you go about doing this? There are many ways of course. You could import the log files into a database and run queries against it. You could also download software like Microsoft's Log Parser and run SQL-like queries to make sense of your logs. What about using Microsoft Excel? Sure, that works too.

Today, we are not going to use any of the above methods. Instead, we are going to attempt to extract meaningful data from an IIS log file using a developer-friendly approach. (Psst...this approach should work for extracting data out of any type of text file, and not just IIS log files.)

Now, the approach in question uses a Linux command to help with extracting data, so developers who work in the Linux world are probably very familiar with what I am about to demonstrate. However, if you've been living in the world of Windows or you are new to Linux, then keep reading.

Before we go any further, if you would like to follow along with the commands in this post, download this sample IIS log file. We will be using this file as our test log file. Here is a snippet of its contents:

#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Date: 2017-11-18 00:01:07
#Fields: date time cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
2017-11-18 08:48:20 GET /de adpar=12345&gclid=1234567890 443 - 149.172.138.41 HTTP/2.0 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/62.0.3202.89+Safari/537.36+OPR/49.0.2725.39 - https://www.google.de/ www.site-logfile-explorer.com 301 0 0 624 543 46
2017-11-18 08:48:20 GET /de/ adpar=12345&gclid=1234567890 443 - 149.172.138.41 HTTP/2.0 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/62.0.3202.89+Safari/537.36+OPR/49.0.2725.39 - https://www.google.de/ www.site-logfile-explorer.com 200 0 0 12973 544 62
2017-11-18 11:45:11 GET /global/lwb.min.js - 443 - 87.185.206.252 HTTP/2.0 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:57.0)+Gecko/20100101+Firefox/57.0 _ga=GA1.2.573603466.1510956966;+_gid=GA1.2.622072548.1510956966 https://translate.google.com/ www.site-logfile-explorer.com 200 0 0 2429 473 15
2017-11-18 11:45:11 GET /global/img/body_bg.png - 443 - 87.185.206.252 HTTP/2.0 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:57.0)+Gecko/20100101+Firefox/57.0 _ga=GA1.2.573603466.1510956966;+_gid=GA1.2.622072548.1510956966 https://translate.google.com/ www.site-logfile-explorer.com 304 0 0 232 477 33
2017-11-18 11:45:11 GET /global/img/content_bg_top.png - 443 - 87.185.206.252 HTTP/2.0 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:57.0)+Gecko/20100101+Firefox/57.0 _ga=GA1.2.573603466.1510956966;+_gid=GA1.2.622072548.1510956966 https://translate.google.com/ www.site-logfile-explorer.com 304 0 0 232 484 33
2017-11-18 11:45:11 GET /global/img/menu_spacer.png - 443 - 87.185.206.252 HTTP/2.0 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:57.0)+Gecko/20100101+Firefox/57.0 _ga=GA1.2.573603466.1510956966;+_gid=GA1.2.622072548.1510956966 https://translate.google.com/ www.site-logfile-explorer.com 304 0 0 232 481 35

To WSL or not to WSL...

That is definitely a question, and the answer depends on whether you are reading this post on Linux or Windows. Linux users can skip this section and move on to the next. If you are on Windows (hopefully a 64-bit installation of Windows 10 version 1607 or higher 🤞🏻), you will need to install the Windows Subsystem for Linux (WSL). Follow these instructions to enable WSL and install your preferred Linux distribution on Windows. If you are unsure which one to choose, just install Ubuntu 18.04 LTS from the Microsoft Store. That's the one I am using. So go ahead, install that and come back when you're ready. Don't worry, I'll wait for you. I promise!

Meet 'awk'

Now that you have WSL and Ubuntu (or any other Linux distribution) installed and set up, it is time to meet the star of the show, awk.

The Unix/Linux command awk lets you run scripts written in the AWK programming language to process lines in one or more files. An AWK script consists of a series of condition-action pairs. Each line in the file is tested against the condition (which can be either a regex pattern or a regular conditional statement) and, if the condition evaluates to true, the corresponding action block is executed. Lines are also automatically split up into fields, and you can have conditions evaluated against individual fields on a line rather than the whole line. In other words, awk lets you process text files line-by-line, allowing you to extract only the data you need or to produce formatted reports. Now, before we get into the detail of how this command works, let me give you a brief history lesson.

A wawk down memory lane

AWK the programming language was first designed and created in 1977 at Bell Labs by Alfred Aho, Peter Weinberger and Brian Kernighan. The name of the language is derived from their surnames (AWK = Aho Weinberger Kernighan). In the mid to late 1980s, AWK was expanded significantly, and that resulted in the GNU AWK implementation (or gawk for short), which was released in 1988. This implementation of AWK is the most widely deployed version and is readily available on Ubuntu. (Strictly speaking, Ubuntu's default awk is the leaner mawk implementation, but everything in this post works in both, and you can install gawk with sudo apt install gawk.)

Bawk to the future

Now that we've safely parked the DeLorean in the garage, let's open up an Ubuntu terminal window so you can execute the commands shown below and follow along. You should be able to find the 'Ubuntu' application in your start menu on Windows or via the Windows search bar. Once it's open, make sure you change your working directory to the directory where you saved the sample log file from earlier.

Alright, so here is the basic syntax for invoking awk:

awk [options] 'program-text' file

You can type man awk into your Ubuntu terminal to see the official documentation for the awk/gawk command including all the option flags you can pass in to the command.
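
By the way, if you ever want to check which implementation and version of awk you have installed, gawk supports a version flag (other implementations, such as mawk, may need -W version instead):

awk --version
# On gawk, this prints something like 'GNU Awk 4.x.x ...'.
# The exact output depends on the implementation and version you have installed.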

Okay, so let's jump straight into writing some simple commands. The simplest command we can write retrieves and displays all the lines in our sample file, and it looks like this:

awk '{print $0}' u_ex171118-sample.txt
#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Date: 2017-11-18 00:01:07
#Fields: date time cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
2017-11-18 08:48:20 GET /de adpar=12345&gclid=1234567890 443 - 149.172.138.41 HTTP/2.0 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/62.0.3202.89+Safari/537.36+OPR/49.0.2725.39 - https://www.google.de/ www.site-logfile-explorer.com 301 0 0 624 543 46
2017-11-18 08:48:20 GET /de/ adpar=12345&gclid=1234567890 443 - 149.172.138.41 HTTP/2.0 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/62.0.3202.89+Safari/537.36+OPR/49.0.2725.39 - https://www.google.de/ www.site-logfile-explorer.com 200 0 0 12973 544 62
2017-11-18 11:45:11 GET /global/lwb.min.js - 443 - 87.185.206.252 HTTP/2.0 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:57.0)+Gecko/20100101+Firefox/57.0 _ga=GA1.2.573603466.1510956966;+_gid=GA1.2.622072548.1510956966 https://translate.google.com/ www.site-logfile-explorer.com 200 0 0 2429 473 15
2017-11-18 11:45:11 GET /global/img/body_bg.png - 443 - 87.185.206.252 HTTP/2.0 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:57.0)+Gecko/20100101+Firefox/57.0 _ga=GA1.2.573603466.1510956966;+_gid=GA1.2.622072548.1510956966 https://translate.google.com/ www.site-logfile-explorer.com 304 0 0 232 477 33
2017-11-18 11:45:11 GET /global/img/content_bg_top.png - 443 - 87.185.206.252 HTTP/2.0 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:57.0)+Gecko/20100101+Firefox/57.0 _ga=GA1.2.573603466.1510956966;+_gid=GA1.2.622072548.1510956966 https://translate.google.com/ www.site-logfile-explorer.com 304 0 0 232 484 33
...

The above command fetches each line from the sample file and feeds it into the program text {print $0}, which prints the contents of the $0 variable to standard out. In this case, $0 is a predefined variable that awk sets up automatically for you, and it always contains the line that is currently being processed. That's handy! But it's not the only automatic variable that awk sets up. Apart from giving you the whole line in $0, awk also splits up the line into fields and makes those available in numbered variables too. It puts the first field into $1, the second into $2, the third into $3 and so on. So, if you only wanted to print out the URL field (the fourth field) of each line, your command would look like this:

awk '{print $4}' u_ex171118-sample.txt
Information


cs-method
/de
/de/
/global/lwb.min.js
/global/img/body_bg.png
/global/img/content_bg_top.png
/global/img/menu_spacer.png
/global/img/logo.jpg
/global/img/css_valid.png
...

Notice how there are some blank lines printed. This is happening because our sample file has some metadata lines at the top. Some of these metadata lines don't have a fourth field (i.e. the $4 variable is empty), so they show up as blank lines. Later, I will show you a cleaner way to skip the top 4 lines of the file and stop them from being processed altogether.
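
In the meantime, if those blank lines bother you, here is a quick sketch that hides them using the built-in NF variable, which holds the number of fields on the current line (we will meet NF properly later):

awk 'NF >= 4 {print $4}' u_ex171118-sample.txt
# Only lines with at least four fields are printed, so the #Version and #Date
# metadata lines (which have fewer fields) no longer show up as blanks.

Note that the 'Information' and 'cs-method' values still slip through, because those metadata lines do have a fourth field. Let's move on for now.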

We can also print more than one field, so if we wanted to also print the HTTP status code along with the URL, we could do so like this:

awk '{print $4,$14}' u_ex171118-sample.txt
Information


cs-method cs-host
/de 301
/de/ 200
/global/lwb.min.js 200
/global/img/body_bg.png 304
/global/img/content_bg_top.png 304
/global/img/menu_spacer.png 304
/global/img/logo.jpg 304
/global/img/css_valid.png 304
...

This is great! We've been able to retrieve only the fields in the file that are useful to us. However, this script is not very useful as it outputs all the URLs in the file. Our test file is only a sample, but imagine if it had thousands of lines (as log files usually do). It wouldn't be very useful to just print out every URL. It would make sense to filter out unwanted lines and only retrieve lines that interest us. So, how would we do that?

Remember how I explained that an AWK script is basically a series of condition-action pairs? Well, all the commands we've written so far have only had actions without conditions, which means the actions were executed for every line. In order to only print the URLs of certain lines while skipping the ones we don't want, we have to add a condition to our script text. Conditions are written at the beginning of the script text, just before the opening curly brace ({) character. So let's modify the above script to only return URLs and status codes for URLs that begin with '/global/img'.

awk '/\/global\/img/ {print $4,$14}' u_ex171118-sample.txt
/global/img/body_bg.png 304
/global/img/content_bg_top.png 304
/global/img/menu_spacer.png 304
/global/img/logo.jpg 304
/global/img/css_valid.png 304
...

We just added the regular expression /\/global\/img/ to our script. Regex patterns must start and end with a forward slash (/) character. Since the forward slash has a special meaning here (it denotes the start and end of a regex pattern), we have to escape it when we want it to match a literal forward slash in the URL, hence the \/ parts of the regex. This pattern tells awk to execute the print $4,$14 action for all lines that contain /global/img. Notice how we are not telling awk which "field" to test the regex pattern against. In this case, awk defaults to testing the pattern against the whole line, which is stored in the $0 variable. This condition can be rewritten as $0 ~ /\/global\/img/ and it would still mean the same thing. To prove that omitting the field variable causes awk to default to the $0 variable, let's modify the script above slightly.

awk '/^\/global\/img/ {print $4,$14}' u_ex171118-sample.txt
# Notice that we have added the caret character (^) to the beginning of the regex pattern.
# This tells awk to look for lines that start with '/global/img'.
# This command returns no output as there are no lines that start with '/global/img'.
# All the lines start with '2017'.
# This means that awk is evaluating the regex against the whole line.
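
You can also prove the equivalence in the other direction by writing the $0 form out explicitly. This sketch should print exactly the same lines as our earlier /\/global\/img/ command:

awk '$0 ~ /\/global\/img/ {print $4,$14}' u_ex171118-sample.txt
# Identical output to the shorthand version: a bare /regex/ condition
# is just shorthand for $0 ~ /regex/.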

Okay, but what if we wanted to have our regex tested against a certain field and not the whole line? We may want to do this because another part of the line might contain the same pattern we are trying to match, and we wouldn't want awk to accidentally return lines we didn't want. Well, you can tell awk which field to use when testing the regex pattern. I actually hinted at this earlier when I mentioned that a bare regex condition can be rewritten as $0 ~ /\/global\/img/. We can use that same format for a regex condition and change the $0 variable to the numbered field variable we want to test against.

awk '$4 ~ /^\/global\/img/ {print $4,$14}' u_ex171118-sample.txt
# We have added '$4 ~' just before the regex.
/global/img/body_bg.png 304
/global/img/content_bg_top.png 304
/global/img/menu_spacer.png 304
/global/img/logo.jpg 304
/global/img/css_valid.png 304
...

The script now reads like this: if the contents of the fourth field ($4) of a line matches (~) the regex pattern (^\/global\/img) then execute the action (print $4,$14).

Earlier, I mentioned that the condition part of the script could be a regex pattern or a regular conditional statement. Let's see how we can use a standard conditional instead. Let's say we didn't want to match on a pattern and instead we wanted to do an exact match on the status field. Here is the command to retrieve all the URLs that have a status code of 404:

awk '$14 == 404 {print $4,$14}' u_ex171118-sample.txt
/contact/contact.aspxx 404
/wp-includes/wlwmanifest.xml 404
/xmlrpc.php 404
/blog/wp-includes/wlwmanifest.xml 404
/wordpress/wp-includes/wlwmanifest.xml 404
/wp/wp-includes/wlwmanifest.xml 404
/site/wp-includes/wlwmanifest.xml 404
/cms/wp-includes/wlwmanifest.xml 404

This command is similar to the previous one, except that we have changed the condition to a more standard boolean expression.
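
These conditions can also use the usual boolean operators (&& for AND, || for OR, ! for NOT). As a quick sketch, here is how you could catch every client-side error (any 4xx status code) in a single condition:

awk '$14 >= 400 && $14 < 500 {print $4,$14}' u_ex171118-sample.txt
# Prints the URL and status code for every 4xx response.
# In our sample file, this returns the same 404 entries as above.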

So far, we have only really retrieved lines and fields based on a condition. What if we wanted to do some calculations and reporting on the data? One of the reports you may want to produce from a log file is a breakdown of the HTTP status codes. How many successful requests do we have? Are people experiencing any 404 or 500 errors? Let's write a script that will give us this breakdown.

Before we get into it, I have to introduce you to a couple of AWK concepts: BEGIN/END blocks and arrays.

BEGIN and END blocks

An AWK script can contain special BEGIN and END blocks. The BEGIN action is executed once before any lines are processed, and the END action is executed once after all the lines have been processed. This is especially handy if you need to do some set-up, like assigning variables, before you start processing the file, or some clean-up or final output at the end. You can use a BEGIN block, an END block, or both. Here is what the basic syntax for BEGIN and END blocks looks like:

awk 'BEGIN { [begin-actions] } { [actions-for-each-line] } END { [end-actions] }' file
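
As a minimal illustration of the execution order, this sketch prints a header once, counts every line, and reports the total at the end:

awk 'BEGIN { print "Processing log..." } { total++ } END { print "Lines processed:", total }' u_ex171118-sample.txt
# The BEGIN action runs once before the first line is read, the middle action
# runs for every line, and the END action runs once after the last line.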

Arrays in AWK

We will be using an array to store the total number of requests per status code. Arrays in AWK don't need to be declared or initialised, and they are not like arrays in many other languages: they are not fixed-length, and they are associative, meaning they are stored as key-value pairs. You can use any number or string as an array index/key, and you can mix string and number values in the same array. A new element can be added to the array at any time. Think of AWK arrays more like dictionaries than the fixed-length, single-typed arrays in languages like C# or Java.
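
To see this in action, here is a small sketch that needs no input file at all (a BEGIN-only program never reads any input) and mixes a string key and a number key in the same array:

awk 'BEGIN { counts["apple"] = 3; counts[42] = "hello"; for (key in counts) print key, counts[key] }'
# Elements spring into existence the moment you assign them; no declaration,
# sizing or type annotations required. Note that a for-in loop does not
# guarantee any particular key order.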

Now that we have discussed the above concepts, here is what our breakdown script looks like.

awk '{ breakdown[$14]++ } END { for(code in breakdown) { print code, breakdown[code] } }' u_ex171118-sample.txt

When you run the above command, you get the following output:

3
301 3
cs-host 1
304 93
200 38
404 8

Okay, so that output kinda looks like what we want, except that there are a couple of extra lines we were not expecting. This happened because the first 4 lines of the file are metadata and should really be ignored. I told you earlier that I would show you how to ignore them. We could do it manually: initialise a line counter variable in a BEGIN block, increment it for each line we process, and only add to the breakdown array when that counter is greater than 4. But AWK already does this counting for you. There is a built-in variable in AWK called NR, which stands for Number of Records. In other words, it stores the number of the line currently being processed. NR has a value of 1 while the first line is processed, 2 for the second line, and so on. With this in mind, all we have to do is add a condition before the action body to tell AWK to only process lines greater than 4 (i.e. line 5 onwards). You can see that in the script below.

awk 'NR>4 { breakdown[$14]++ } END { for(code in breakdown) { print code, breakdown[code] } }' u_ex171118-sample.txt
200 38
301 3
304 93
404 8

You can see now that our output only has HTTP status codes along with the number of occurrences of each. Sweet!

There are a number of other handy built-in variables in AWK that let you get or set the input and output field separator characters (FS/OFS), the input and output record separators (RS/ORS), the number of fields in the current line (NF), the current filename (FILENAME) and the record number relative to the current file being processed (FNR). You can change the values of these variables at any time in your AWK program, and they will retain their new values for the rest of the program's execution. There are many other predefined and built-in variables in AWK; you can read more about them in the GNU Awk User's Guide linked in the References section at the end of this post.
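
Here is a quick sketch that uses a few of these built-in variables against our sample file:

awk 'NR <= 5 { print FILENAME, "line", NR, "has", NF, "fields" }' u_ex171118-sample.txt
# Prints the file name, line number and field count for each of the first
# five lines; FILENAME, NR and NF are all filled in automatically by awk.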

AWK script files

Up until now, we have been writing all our AWK scripts inline. This is fine for simple little scripts, but as soon as you write a slightly more complex script with BEGIN/END blocks (like the breakdown script above), it becomes hard to understand and debug everything crammed into a single line. Thankfully, you can move the script out to an external file and then point to that file when running your command. An external file also lets you format your script properly and makes it easy to share scripts around. Let's change our breakdown script to use an external file instead.

Create a new file called 'breakdown.awk' in the same directory where you saved the sample log file we've been using. Copy the following code into the file and save it. This is the same program we wrote above, but just placed in a separate file and formatted nicely to make it more readable and maintainable.

breakdown.awk
NR>4 {
    breakdown[$14]++
}
END {
    for(code in breakdown) {
        print code, breakdown[code]
    }
}

Now we can run the command below and we should get the same output as before.

awk -f breakdown.awk u_ex171118-sample.txt
200 38
301 3
304 93
404 8

Multiple condition action pairs

AWK scripts can also have multiple condition action pairs in the same file. This is handy if you want to do a different action for some lines compared to others. For example, we could add another condition action pair if we wanted to print out the URLs of all the entries with a 404 status code in addition to showing the breakdown numbers like we do at the moment. Here is how we can achieve that:

breakdown.awk
$14 == 404 {
    print $4
}
NR>4 {
    breakdown[$14]++
}
END {
    for(code in breakdown) {
        print code, breakdown[code]
    }
}

Re-run the awk command and you should see output like this:

awk -f breakdown.awk u_ex171118-sample.txt
/contact/contact.aspxx
/wp-includes/wlwmanifest.xml
/xmlrpc.php
/blog/wp-includes/wlwmanifest.xml
/wordpress/wp-includes/wlwmanifest.xml
/wp/wp-includes/wlwmanifest.xml
/site/wp-includes/wlwmanifest.xml
/cms/wp-includes/wlwmanifest.xml
200 38
301 3
304 93
404 8

What's happening here is that AWK tests every line against both conditions in the file. For most lines, only the NR>4 condition matches, so only that condition's action is executed. However, for lines where the status code is 404, both the $14 == 404 AND NR>4 conditions are met, so both action bodies are executed. That is why you get the URLs for the 404 entries AND those entries still get counted as part of the breakdown, while the non-404 lines only get added to the breakdown. Brilliant!

User-defined functions in AWK

Function definitions can appear anywhere in an AWK program: before, between, or after the condition-action pairs. They don't need to appear at the top of the file or before their first use, because awk reads in the entire script before executing any of it. Here is the basic syntax of a function:

function funcName([parameter-list])
{
    body-of-function
}

Let's update our breakdown script above and create a function to print the breakdown in the END block. Also, just to prove that awk does load in the whole program and that we can in fact define our function after its usage, let's add the function to the end of the file. Here is what that would look like:

breakdown.awk
$14 == 404 {
    print $4
}
NR>4 {
    breakdown[$14]++
}
END {
    printArray(breakdown)
}

function printArray(anArray)
{
    for(arrIndex in anArray)
    {
        print arrIndex, anArray[arrIndex]
    }
}

If you re-run the awk command, you should still see the same output as before. We were able to extract the for loop and print statement into a user-defined function, which we then called from the END block. As you can see, our function is defined after its usage. This is great because we can now write re-usable functions to process text and make our AWK programs more readable.

Running AWK scripts on multiple files

The awk command allows you to run your scripts on multiple files. So far, we've been extracting data out of a single sample log file. In reality, we may have multiple log files we want to process with the same script and get a unified result across all the files. We can do this by simply passing in additional file paths after the first file path like this:

awk -f breakdown.awk u_ex171118-sample.txt log2.txt log3.txt log4.txt

Each file will be processed line-by-line in the order that they are passed in and you will end up with an overall breakdown of status codes across all the files (assuming all the other files are in the same format as the sample log file).
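
As an aside, if you want to see where each file begins during a multi-file run, you can lean on the FNR variable mentioned earlier, since it resets to 1 at the start of every input file. Here is a sketch (using the same hypothetical file names as above):

awk 'FNR == 1 { print "Now processing:", FILENAME }' u_ex171118-sample.txt log2.txt
# FNR restarts at 1 for each input file, while NR keeps counting across
# all files, so this condition fires exactly once per file.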

Processing other types of text files

Everything I have shown you thus far helps you extract data out of an IIS log file. This type of log file has one entry per line (a record) and uses the space character to separate individual bits of information within a line (a field). awk can process these log files because it understands how the fields and records are stored within them. By default, awk assumes that each line is a record; it knows when one line ends and the next begins by looking for a newline character (\n). This is the default value of the RS (record separator) built-in variable. Additionally, awk knows how to break up a single line into individual fields; by default, it uses the space character to determine where one field ends and the next begins. The built-in variable FS (field separator) controls how awk splits up a line/record into fields.

By changing these two variables, you can easily configure AWK to read a text file with a different format to the sample log file we've been using. For example, if you needed to process a CSV file, you would simply leave the RS variable set to the newline character, but change the FS variable to the comma (,) character. AWK would then treat each line as a new record (just like our sample log file), but each line would be split into fields using the comma character. You can do this in one of two ways as shown below.

Option 1:

Assuming our sample log file had all the same fields in the same order but was CSV-formatted, here is how we could invoke our breakdown.awk script to process it:

awk -F , -f breakdown.awk u_ex171118-sample.txt

All we had to do was add the -F , option to the awk command. This option tells the awk command to change the value of the FS variable within our AWK script to the comma character.

Option 2:

You can also change the value of the FS variable at any point throughout the execution of the program. Below is an example where we are changing the FS variable to a comma character in a BEGIN block before any CSV lines are processed.

BEGIN {
    FS = ","
}

Tip: unlike the field separator, there is no dedicated command-line flag (like -F) for the record separator. You can either update the RS variable within your script (like we did with the FS variable in option 2 above), or set it with the generic -v variable assignment option, as sketched below.
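
For example, here is a quick sketch of how you might process a hypothetical file (records.txt, not part of this tutorial's sample data) where records are separated by semicolons rather than newlines:

awk -v RS=';' '{ print "Record", NR, "->", $0 }' records.txt
# The -v option assigns a value to an AWK variable before the program starts,
# so each semicolon-delimited chunk is treated as its own record.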

Conclusion

This blog post was inspired by a YouTube video I watched recently (linked in the References section below). It occurred to me that with the launch of WSL (and soon WSL 2), Windows developers now have access to a much wider range of new commands and tools which they might not know much about. I wrote this blog post to share what I had learnt about one such handy Unix/Linux command. As you've seen above, awk is an incredibly powerful and convenient tool to have in your tool belt. It reads in text files line by line, breaks up each line into fields and lets you write simple yet powerful scripts to process and extract meaningful data out of them. What I have shown you here is just a brief introduction to awk and I hope it has piqued your interest. AWK is a really powerful language with many more advanced features and built-in functions. If you are keen to learn more, I would highly recommend diving into The GNU Awk User's Guide that I have linked to in the References section below. I hope you found this post useful and that you'll start using awk to process your logs, CSVs or any other text files you can find.

Thanks for reading. That's all for now and see you in the next one!

References

  1. The GNU Awk User's Guide - https://www.gnu.org/software/gawk/manual/gawk.html
  2. This blog post was inspired by the "EVERYONE Needs to Learn a Little Bit of AWK!" video on the Gary Explains YouTube channel.
  3. Sample IIS file copied from https://www.site-logfile-explorer.com/logfile-samples/iis.aspx
  4. AWK on Wikipedia - https://en.wikipedia.org/wiki/AWK