
Analyze nginx access log using SQL

Author: wpdevelopment11
Published: 2024-08-16
Reading time: 13 minutes

Introduction

The nginx access log stores information about all requests that users of your website generate. It grows quickly, and it's hard to make sense of it without additional tools. We will feed the data from the nginx access.log into an SQLite database for further analysis. I will show you how to use SQL to group and filter the available data. For example, you can easily find out which pages of your website are the most popular and from which sources you receive the most traffic. This approach can be considered a lightweight and privacy-friendly alternative to client-side analytics such as Google Analytics.

Prerequisites

  • nginx is installed on your server.

    How to install nginx is explained in this tutorial.

  • An understanding of the nginx configuration.

    Use the official documentation as needed.

  • An Ubuntu OS on your local PC and on your server.

    If you're using another distribution or operating system, adjust commands accordingly.

  • Basic SQL and JavaScript knowledge.

    JavaScript will be used to parse the nginx access log and feed the data into the SQLite database. You don't need to understand it in depth, as I provide a ready-to-use script. SQL will be used to query the SQLite database which contains the access log records.

Example terminology

  • Username: holu
  • Server IPv4: 10.0.0.1

Please replace 10.0.0.1 with the IP address of your server, and holu with the username of your sudo user.

Step 1 - Installing required software

On your local PC:

You need to install Node.js and its package manager npm to proceed. You can use these instructions to install Node.js on Ubuntu, because the Node.js package in the official Ubuntu repositories can be out of date.

Now create a directory for this tutorial:

mkdir ~/query-nginx-logs

Use the following command to install the required packages:

Package          Purpose
nginxparser      Parses the nginx access log file.
better-sqlite3   Feeds the parsed log records into the SQLite database.

cd ~/query-nginx-logs && npm install nginxparser better-sqlite3

The command above will create the node_modules directory and place the installed packages there.

SQLite will be used to query the nginx logs. SQLite was chosen for simplicity: it doesn't need any configuration, because it doesn't have a server process. Install it with the following command:

sudo apt update && sudo apt install sqlite3

Step 2 - Determine the nginx log format

On the remote server:

The nginx log_format directive specifies which data will be logged and how it will be formatted. If you haven't configured the log format in the nginx configuration file (usually /etc/nginx/nginx.conf), nginx uses the default one, which is called combined and looks like this:

log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';

Note: Variables that may contain spaces, such as $request and $http_user_agent, are wrapped in double quotes so that the resulting log lines can be parsed unambiguously. Keep this in mind if you define your own log format.

On the local PC, we will use a JavaScript library to parse the log files. Step 4 will explain how to create the necessary JavaScript file. In this file, you need to provide the log format used by nginx, slightly adjusted to be syntactically valid JavaScript.

In the JavaScript file in step 4, the nginx combined log format will look like this (converted to JavaScript):

const logFormat = '$remote_addr - $remote_user [$time_local] ' +
                  '"$request" $status $body_bytes_sent ' +
                  '"$http_referer" "$http_user_agent"';

Basically, all you need to do is copy-paste the format string and insert the + operator between the single-quoted strings. If you're using another log format, convert it in the same way.

Step 3 - Copy the access log to your local PC

You probably want to inspect the nginx access log file on your local PC, instead of directly on the remote server.

The tricky part is that the nginx access log file is owned by www-data with strict permissions, and you can't copy it directly using the scp command if root login is disabled. To solve this, perform the steps below.

On the remote server:

Add your user to the special group adm, which allows you to read the log files, but not to write to them:

sudo adduser $USER adm

Log out and back in (or reconnect your SSH session) for the new group membership to take effect.

On your local PC:

You can now copy the access log from your server to the query-nginx-logs directory created in step 1.

  • The path /var/log/nginx/access.log is usually used by default. You can check the access_log directive in your nginx configuration on the remote server to determine where the logs are saved. You could also check by running this command on your remote server:
    sudo nginx -T | grep -E 'log_format|access_log'
  • Replace holu with your username.
  • Replace 10.0.0.1 with the IP address or hostname of your server.

scp holu@10.0.0.1:/var/log/nginx/access.log ~/query-nginx-logs

Step 3.1 - Combine logs into one big file (Optional)

The nginx access logs on Ubuntu are rotated by a special utility called logrotate. By default, this is done daily. If you want to analyze the logs, you may find it inconvenient that they are scattered across multiple files.

On the remote server:

Use the command below to combine all the access logs into one big file, big-access.log.gz, and save it to your home directory on the server. This works because concatenated gzip streams form a valid gzip file, so the already-compressed rotated logs can simply be concatenated with the freshly compressed current ones.

cd /var/log/nginx &&
(cat access.log.*.gz ; cat access.log access.log.1 | gzip) \
> ~/big-access.log.gz 2>/dev/null

On your local PC:

Next, you can copy this file to your local PC:

scp holu@10.0.0.1:big-access.log.gz ~/query-nginx-logs

and decompress it:

cd ~/query-nginx-logs && gunzip big-access.log.gz

You can use this file for further processing in the next step.

Note: This file will probably be huge, for example, my nginx server generated 500 MB worth of logs and almost 2 000 000 log records.
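
Since each log record occupies one line, you can check how many records you are about to import, for example:

wc -l ~/query-nginx-logs/big-access.log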

Step 4 - Create a JavaScript file to parse the nginx access log

On your local PC:

Create the parse file:

cd ~/query-nginx-logs && nano parse

and paste the following content:

#!/usr/bin/env node

const NginxParser = require('nginxparser');
const db = require('better-sqlite3')('logs.db');

// The nginx log format, converted to a JavaScript string (see step 2).
const logFormat = '$remote_addr - $remote_user [$time_local] ' +
                  '"$request" $status $body_bytes_sent ' +
                  '"$http_referer" "$http_user_agent"';

const parser = new NginxParser(logFormat);

if (process.argv.length < 3) {
    console.log("Usage: ./parse access.log");
    process.exit(1);
}

let firstRow = true;
let statement;
parser.read(process.argv[2], function (row) {
    if (firstRow) {
        // Create the logs table with one column per parsed field and start
        // a transaction, so that all inserts are committed in one batch.
        let cols = Object.keys(row);
        db.exec(`DROP TABLE IF EXISTS logs;
                 CREATE TABLE logs (${cols.join(",")});
                 BEGIN TRANSACTION;`);

        // Prepare an INSERT statement with one "?" placeholder per column.
        statement = db.prepare(`INSERT INTO logs VALUES (${'?' + ', ?'.repeat(cols.length - 1)})`);
        firstRow = false;
    }
    statement.run(...Object.values(row));
}, function (err) {
    if (err) throw err;
    db.exec(`COMMIT;`);
});

Add execute permissions to the file, so that you can run it without the node prefix, i.e. as ./parse:

chmod +x parse

Step 5 - Parse the nginx access log and create an SQLite database

On your local PC:

Run the following command, passing the path to the access log file that you copied from your server:

cd ~/query-nginx-logs && ./parse access.log

The command above should create an SQLite database logs.db in the query-nginx-logs directory, which will be used to perform SQL queries.

The SQLite CLI utility sqlite3, which you installed previously, will be used to run SQL queries interactively in the shell.
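
To verify that the import succeeded, you can count the imported rows; the number should match the number of lines in your log file:

cd ~/query-nginx-logs && sqlite3 logs.db 'SELECT COUNT(*) FROM logs;'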

Step 6 - Using SQL to query the database

On your local PC:

Open the database by running the SQLite shell, passing the path to the database as the first argument:

cd ~/query-nginx-logs && sqlite3 logs.db

The interactive shell starts with the sqlite> prompt, where you can type SQL statements and special SQLite-specific commands that start with a dot. First of all, you can check which columns are available for your queries (not all of them are filled with values, though).

You can use the special SQLite command:

.schema

The output will be similar to this:

CREATE TABLE logs (
    msec,
    time_iso8601,
    remote_addr,
    query_string,
    http_x_forwarded_for,
    http_user_agent,
    http_referer,
    time_local,
    request,
    status,
    request_time, 
    request_length,
    pipe,
    connection,
    bytes_sent,
    body_bytes_sent,
    date,
    timestamp,
    ip,
    ip_str,
    remote_user
);

Every variable that is present in the log_format directive in the nginx configuration will also be present in the database table as a column with the same name.
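
Note that the parse script creates these columns without declared types, and most fields parsed from the log line, such as status, end up stored as text; this is why the queries below compare status against the string '404'. You can inspect the storage class of a value with SQLite's typeof() function, for example:

SELECT typeof(status), typeof(body_bytes_sent) FROM logs LIMIT 1;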

Let's perform some queries against the database.

First, change the output mode so that query results are formatted as a table:

.mode table

This output mode is more readable than the default one.

Step 6.1 - Find the most popular pages

Which pages on your website are the most popular ones? Run the following SQL query to find out:

SELECT request, COUNT(*) as count 
FROM logs 
GROUP BY request
ORDER BY count DESC
LIMIT 100;

This will output up to 100 of the most popular requests. The output will be similar to this:

+----------------------------------------------+-------+
|                    request                   | count |
+----------------------------------------------+-------+
| GET /the-most-popular-page HTTP/1.1          | 1011  |
| GET /second-most-popular-page HTTP/1.1       | 105   |
| GET /robots.txt HTTP/1.1                     | 50    |
| GET / HTTP/1.1                               | 14    |
+----------------------------------------------+-------+

You can use additional filtering to remove static files (such as .jpg, .png, .js, .css) from the results:

SELECT request, COUNT(*) as count 
FROM logs
WHERE request NOT LIKE '%.jpg HTTP/1.1' AND
      request NOT LIKE '%.png HTTP/1.1' AND
      request NOT LIKE '%.js HTTP/1.1' AND
      request NOT LIKE '%.css HTTP/1.1'
GROUP BY request
ORDER BY count DESC
LIMIT 100;
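
The request column contains the whole request line, including the HTTP method and protocol version. If you'd rather group by the path alone, here is a rough sketch using SQLite's string functions (it assumes request lines of the form "GET /path HTTP/1.1"):

SELECT replace(substr(request, instr(request, ' ') + 1), ' HTTP/1.1', '') AS page,
       COUNT(*) AS count
FROM logs
GROUP BY page
ORDER BY count DESC
LIMIT 100;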

Step 6.2 - Get the requests that resulted in a 404 error

It may be useful to check whether unexpected 404 errors are returned to your users:

SELECT DISTINCT request
FROM logs
WHERE status = '404';
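
If you also want to know how often each missing URL was requested, you can count and sort the results, for example:

SELECT request, COUNT(*) AS count
FROM logs
WHERE status = '404'
GROUP BY request
ORDER BY count DESC;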

Step 6.3 - Find out which bots are accessing your website

Use the query below to find out which bots are crawling your website and how many requests they generate. This may help you decide which ones to block.

SELECT http_user_agent as bot, COUNT(*) as requests 
FROM logs
WHERE http_user_agent LIKE '%Bot%'
GROUP BY http_user_agent
ORDER BY requests DESC;

Example output:

+-----------------------------------------------------------------------------+----------+
|                                     bot                                     | requests |
+-----------------------------------------------------------------------------+----------+
| Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)               | 155      |
| Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)          | 88       |
| Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)            | 30       |
| Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html) | 29       |
| DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html)                  | 22       |
+-----------------------------------------------------------------------------+----------+

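Note that LIKE is case-insensitive for ASCII characters in SQLite, so '%Bot%' matches 'bot' as well. You can reuse this condition to exclude self-identified bots from other queries, for example when counting the most popular requests:

SELECT request, COUNT(*) AS count
FROM logs
WHERE http_user_agent NOT LIKE '%bot%'
GROUP BY request
ORDER BY count DESC
LIMIT 100;
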
Step 6.4 - Check out the sources of your traffic

The HTTP Referer request header may contain the source of a request, so it can be used to find out where the traffic to your website is coming from. You can group requests by the http_referer value and return up to the 10 most popular sources like this:

SELECT http_referer as referer, COUNT(*) as requests 
FROM logs
WHERE http_referer != ''
GROUP BY http_referer
ORDER BY requests DESC
LIMIT 10;

Example output:

+-----------------------------------------+----------+
|                 referer                 | requests |
+-----------------------------------------+----------+
| https://www.bing.com/                   | 440      |
| https://www.google.com/                 | 70       |
| https://images.search.yahoo.com/        | 20       |
+-----------------------------------------+----------+
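
Links between pages of your own website usually show up in this list as well. To see only external sources, you can filter out your own domain (example.com below is a placeholder for your domain):

SELECT http_referer AS referer, COUNT(*) AS requests
FROM logs
WHERE http_referer != ''
  AND http_referer NOT LIKE 'https://example.com/%'
GROUP BY http_referer
ORDER BY requests DESC
LIMIT 10;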

Step 7 - Logging the user's country

On the remote server:

Prerequisites

  • Install and enable the geoip2 module for nginx to detect the user's country.

    How to do this is explained in this tutorial. You need to install the geoip2 module, enable it in your nginx configuration, and download the IP database. Do not add the geoip2 and map directives from that tutorial to your nginx configuration; they are not needed here, as we define our own geoip2 block below.

Now edit the nginx configuration file (e.g. /etc/nginx/nginx.conf). To start logging the user's country, you first create a variable $user_country in a geoip2 block. Next, you define a custom log format and add the $user_country variable to it. Finally, you set your custom log format in the access_log directive. As a result, your configuration will look like this (the rest of the configuration is omitted):

http {
        log_format customformat '$remote_addr - $remote_user [$time_local] '
                                '"$request" $status $body_bytes_sent '
                                '"$http_referer" "$http_user_agent" '
                                '"$user_country"';

        geoip2 /etc/nginx/GeoLite2-Country.mmdb {
            $user_country country names en;
        }
        access_log /var/log/nginx/access.log customformat;
...
}
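
You can test the configuration for syntax errors before reloading:

sudo nginx -t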

After you have applied your changes, don't forget to reload the nginx configuration:

sudo systemctl reload nginx

Now access your website to generate some log entries in your access.log. After that, run the following command:

sudo tail /var/log/nginx/access.log

The log entries should now contain the user's country at the end of each line, for example:

203.0.113.1 - - [09/Aug/2024:13:33:33 +0200] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36" "Germany"

If that's not the case, go back to your nginx configuration; you probably made a mistake there.

On your local PC, the JavaScript file parse needs to be adjusted to support the updated log format. Change the logFormat constant in it as follows:

const logFormat = '$remote_addr - $remote_user [$time_local] ' +
                  '"$request" $status $body_bytes_sent ' +
                  '"$http_referer" "$http_user_agent" ' +
                  '"$user_country"';

Again, it's just the nginx log format with a + operator between the single-quoted strings.

Now go back to step 3 to copy the log file and repeat step 5 for the new log file.

Run the SQLite shell with table output mode enabled:

sqlite3 logs.db -cmd '.mode table'

Then use the following SQL query to check from which countries users are accessing your website:

SELECT user_country AS country, COUNT(*) AS requests
FROM logs
GROUP BY user_country
ORDER BY requests DESC;

Example output:

+----------------+----------+
|     country    | requests |
+----------------+----------+
| United States  | 10200    |
| Germany        | 1517     |
+----------------+----------+
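
Bot traffic can noticeably skew these numbers. As a sketch, you can reuse the user agent filter from step 6.3 to count only requests that don't come from self-identified bots:

SELECT user_country AS country, COUNT(*) AS requests
FROM logs
WHERE http_user_agent NOT LIKE '%bot%'
GROUP BY user_country
ORDER BY requests DESC;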

Conclusion

As you can see, SQL is a powerful language that lets you extract useful information from your nginx access logs, even huge ones. Hopefully you've gathered some useful insights about your website visitors using this method.

You can speed up queries by creating an index (e.g. CREATE INDEX request_index ON logs (request); to create an index on the request column). Queries that can use the index execute faster, at the price of higher disk usage by the logs.db database.
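
You can confirm that a query actually uses an index by prefixing it with EXPLAIN QUERY PLAN; the reported plan should mention the index by name:

CREATE INDEX request_index ON logs (request);
EXPLAIN QUERY PLAN SELECT COUNT(*) FROM logs WHERE request = 'GET / HTTP/1.1';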

License: MIT