Introduction
The nginx access log stores information about all requests that users of your website generate. It grows big very quickly, and it's hard to make any sense out of it without additional tools. We will feed all the data from the nginx access.log
to an SQLite database for further analysis. I will show you how to use SQL to group and filter available data. For example, you can easily find out which pages of your website are most popular and from which sources you receive the most traffic. This approach can be considered as a lightweight and privacy-friendly alternative to client-side analytics such as Google Analytics.
Prerequisites
-
nginx is installed on your server.
How to install nginx is explained in this tutorial.
-
An understanding of the nginx configuration.
Use the official documentation as needed.
-
An Ubuntu OS on your local PC and on your server.
If you're using another distribution or operating system, adjust commands accordingly.
-
Basic SQL and JavaScript knowledge.
JavaScript will be used to parse nginx access log, and feed the data into the SQLite database. You don't need to understand it much, as I provide a ready-to-use script. SQL will be used to query the SQLite database which contains the access log records.
Example terminology
- Username:
holu
- Server IPv4:
10.0.0.1
Please replace 10.0.0.1
with an IP address of your server, and holu
with a username of your sudo user.
Step 1 - Installing required software
On your local PC:
You need to install Node.js and its package manager npm to proceed. You can use this instruction to install Node.js on Ubuntu. The Node.js package in the official Ubuntu repositories can be out of date.
Now create a directory for this tutorial:
mkdir ~/query-nginx-logs
Use the following command to install required packages:
Package | Purpose |
---|---|
nginxparser |
Will be used to parse the nginx access log file. |
better-sqlite3 |
It's needed to feed the parsed log records to the SQLite database. |
cd ~/query-nginx-logs && npm install nginxparser better-sqlite3
The command above will create the node_modules
directory and place installed packages in there.
SQLite will be used to query the nginx logs. SQLite is chosen for simplicity reasons, it doesn't need any configuration, because it doesn't have a server. Install it with the following command:
sudo apt update && sudo apt install sqlite3
Step 2 - Determine the nginx log format
On the remote server:
The nginx log_format
directive specifies which data will be logged and how it will be formatted. If you haven't configured the log format in the nginx configuration file (usually /etc/nginx/nginx.conf
), nginx uses the default one, which is called combined
and looks like this:
Note: You must use double quotes around variables that may contain spaces.
log_format combined '$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent"';
On the local PC, we will use the JavaScript library to parse the log files. For this, step 4 will explain how to create the necessary JavaScript file. In this JavaScript file, you need to provide the log format used by nginx in a slightly adjusted format to be syntactically valid JavaScript.
In the JavaScript file in step 4, the nginx combined
log format will look like this (converted to JavaScript):
const logFormat = '$remote_addr - $remote_user [$time_local] ' +
'"$request" $status $bytes_sent ' +
'"$http_referer" "$http_user_agent"';
Basically, all you need to do is to copy-paste the format string and insert the +
operator between single quoted strings. If you're using another log format, convert it in the same way.
Step 3 - Copy the access log to your local PC
You probably want to inspect the nginx access log file on your local PC, instead of directly on the remote server.
The tricky part is that nginx access log file is owned by www-data
with strict permissions, and you can't copy it directly using the scp
command if your root user is disabled. To solve this, perform the steps below.
On the remote server:
This adds your user to the special group adm
, which allows you to read the log files, but not to write them.
sudo adduser $USER adm
On your local PC:
You can now copy the access log from your server to the query-nginx-logs
directory created in step 1.
- The path
/var/log/nginx/access.log
is usually used by default. You can check theaccess_log
directive in your nginx configuration on the remote server to determine where the logs are saved. You could also check by running this command on your remote server:sudo nginx -T | grep -E 'log_format|access_log'
- Replace
holu
with your username. - Replace
10.0.0.1
with an IP or hostname of your server.
scp holu@10.0.0.1:/var/log/nginx/access.log ~/query-nginx-logs
Step 3.1 - Combine logs into one big file (Optional)
The nginx access logs on Ubuntu are rotated using the special utility called logrotate
. By default, this is done daily. You may find it inconvenient that logs are scattered among multiple files, if you want to analyze them.
On the remote server:
Use the command below to combine all the access logs into one big file big-access.log.gz
and save it to your home directory on a server.
cd /var/log/nginx &&
(cat access.log.*.gz ; cat access.log access.log.1 | gzip) \
> ~/big-access.log.gz 2>/dev/null
On your local PC:
Next, you can copy this file to your local PC:
scp holu@10.0.0.1:big-access.log.gz ~/query-nginx-logs
and decompress it:
cd ~/query-nginx-logs && gunzip big-access.log.gz
You can use this file for further processing in the next step.
Note: This file will probably be huge, for example, my nginx server generated 500 MB worth of logs and almost 2 000 000 log records.
Step 4 - Create a JavaScript file to parse the nginx access log
On your local PC:
Create the parse
file:
cd ~/query-nginx-logs && nano parse
and paste the following content:
#!/usr/bin/env node
const NginxParser = require('nginxparser');
const db = require('better-sqlite3')('logs.db');
const logFormat = '$remote_addr - $remote_user [$time_local] ' +
'"$request" $status $bytes_sent ' +
'"$http_referer" "$http_user_agent"';
const parser = new NginxParser(logFormat);
if (process.argv.length < 3) {
console.log("Usage: ./parse access.log");
process.exit(1);
}
let firstRow = true;
let statement;
parser.read(process.argv[2], function (row) {
if (firstRow) {
let cols = Object.keys(row);
db.exec(`DROP TABLE IF EXISTS logs;
CREATE TABLE logs (${cols.join(",")});
BEGIN TRANSACTION;`);
statement = db.prepare(`INSERT INTO logs VALUES (${'?' + ', ?'.repeat(cols.length - 1)})`)
firstRow = false;
}
statement.run(...Object.values(row));
}, function (err) {
if (err) throw err;
db.exec(`COMMIT;`)
});
Add execute permissions to the file, to be able to run it without node
prefix, i.e. ./parse
:
chmod +x parse
Step 5 - Parse the nginx access log and create an SQLite database
On your local PC:
Run the following command and pass the path to the access log file, which you got from your server.
cd ~/query-nginx-logs && ./parse access.log
The command above should create an SQLite database logs.db
in the query-nginx-logs
directory, which will be used to perform SQL queries.
The SQLite CLI utility sqlite3
installed previously, will be used to perform SQL queries interactively in the shell.
Step 6 - Using SQL to query the database
On your local PC:
Open the database by running SQLite shell and passing the path to the database as a first argument:
cd ~/query-nginx-logs && sqlite3 logs.db
The interactive shell starts with a prompt: sqlite>
, where you can type in your SQL statements and special SQLite-specific commands that start with a dot.
First of all, you can check which columns are available for your queries, not all of them are filled with values though.
You can use the special SQLite command:
.schema
The output will be similar to this:
CREATE TABLE logs (
msec,
time_iso8601,
remote_addr,
query_string,
http_x_forwarded_for,
http_user_agent,
http_referer,
time_local,
request,
status,
request_time,
request_length,
pipe,
connection,
bytes_sent,
body_bytes_sent,
date,
timestamp,
ip,
ip_str,
remote_user
);
All variables which are present in the log_format
directive in the nginx configuration
will be present in the database table as a column with the same name.
Let's perform some queries against the database.
First, change the output mode, to make a result of the query formatted as a table:
.mode table
This output mode is more readable than the default one.
Step 6.1 - Get the most popular requests (pages)
Which pages on your website are the most popular ones? Run the following SQL query to find out:
SELECT request, COUNT(*) as count
FROM logs
GROUP BY request
ORDER BY count DESC
LIMIT 100;
This will output up to 100 most popular requests. The output will be similar to this:
+----------------------------------------------+-------+
| request | count |
+----------------------------------------------+-------+
| GET /the-most-popular-page HTTP/1.1 | 1011 |
+----------------------------------------------+-------+
| GET /second-most-popular-page HTTP/1.1 | 105 |
+----------------------------------------------+-------+
| GET /robots.txt HTTP/1.1 | 50 |
+----------------------------------------------+-------+
| GET / HTTP/1.1 | 14 |
+----------------------------------------------+-------+
You can use additional filtering to remove the static files (like: .jpg
, .js
, .css
, etc) from the results.
SELECT request, COUNT(*) as count
FROM logs
WHERE request NOT LIKE '%.jpg HTTP/1.1' AND
request NOT LIKE '%.png HTTP/1.1' AND
request NOT LIKE '%.js HTTP/1.1' AND
request NOT LIKE '%.css HTTP/1.1'
GROUP BY request
ORDER BY count DESC
LIMIT 100;
Step 6.2 - Get the requests that resulted in the 404 error
It may be useful to check if there are unexpected 404 errors returned to your users.
SELECT DISTINCT request
FROM logs
WHERE status = '404';
Step 6.3 - Find out which bots are accessing your website
Use the query below to find out which bots are crawling your website and how many requests they generate. It may help to decide which ones to block.
SELECT http_user_agent as bot, COUNT(*) as requests
FROM logs
WHERE http_user_agent LIKE '%Bot%'
GROUP BY http_user_agent
ORDER BY requests DESC;
Example output:
+--------------------------------------------------------------+----------+
| bot | requests |
+--------------------------------------------------------------+----------+
| Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/ | 155 |
| ) | |
+--------------------------------------------------------------+----------+
| Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/r | 88 |
| obot/) | |
+--------------------------------------------------------------+----------+
| Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/b | 30 |
| ots) | |
+--------------------------------------------------------------+----------+
| Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrus | 29 |
| h.com/bot.html) | |
+--------------------------------------------------------------+----------+
| DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html) | 22 |
+--------------------------------------------------------------+----------+
Step 6.4 - Check out the sources of your traffic
The HTTP Referer request header may contain the source of the request, it can be used to find out from where the traffic is going to your website.
You can group requests based on the http_referer
value and return up to 10 most popular sources like this:
SELECT http_referer as referer, COUNT(*) as requests
FROM logs
WHERE http_referer != ''
GROUP BY http_referer
ORDER BY requests DESC
LIMIT 10;
Example output:
+-----------------------------------------+----------+
| referer | requests |
+-----------------------------------------+----------+
| https://www.bing.com/ | 440 |
+-----------------------------------------+----------+
| https://www.google.com/ | 70 |
+-----------------------------------------+----------+
| https://images.search.yahoo.com/ | 20 |
+-----------------------------------------+----------+
Step 7 - Logging user's country
On the remote server:
Prerequisites
-
Install and enable the
geoip2
module for nginx to detect the user's country.How to do it is explained in this tutorial. You need to install the
geoip2
module, enable it in your nginx configuration, and download the IP database. Do not add thegeoip2
andmap
directive from that tutorial to your nginx configuration, they are not needed here.
Now edit the nginx configuration file (e.g. /etc/nginx/nginx.conf
). To start logging the user's country, you need to create a variable $user_country
which will store it. Next, you define a custom log format in your nginx configuration and add the $user_country
variable there. Now your custom log format should be added to your access_log
directive. In the result, your configuration will look like this (the rest of the configuration is omitted):
http {
log_format customformat '$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'"$user_country"';
geoip2 /etc/nginx/GeoLite2-Country.mmdb {
$user_country country names en;
}
access_log /var/log/nginx/access.log customformat;
...
}
After you applied your changes, don't forget to reload the nginx configuration:
sudo systemctl reload nginx
Try to access your website now, to generate some log entries in your access.log
. After that, run the following command:
sudo tail /var/log/nginx/access.log
The log entries must contain the user's country at the end of the line, for example:
203.0.113.1 - - [09/Aug/2024:13:33:33 +0200] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36" "Germany"
If it's not the case, get back to your nginx configuration, you probably made some mistake.
On your local PC, the JavaScript file parse
needs to be adjusted to support the updated log format. Change the logFormat
constant in it like this:
const logFormat = '$remote_addr - $remote_user [$time_local] ' +
'"$request" $status $bytes_sent ' +
'"$http_referer" "$http_user_agent" ' +
'"$user_country"';
Again, it's just the nginx log format with an +
operator between the single quoted strings.
Now go back to step 3 to copy the log file and repeat step 5 for the new log file.
Use the following SQL query to check from which countries users are accessing your website:
Use
sqlite3 logs.db -cmd '.mode table'
to run the SQLite shell.
SELECT user_country AS country, COUNT(*) AS requests
FROM logs
GROUP BY user_country
ORDER BY requests DESC;
Example output:
+----------------+----------+
| country | requests |
+----------------+----------+
| United States | 10200 |
| Germany | 1517 |
+----------------+----------+
Conclusion
As you can see, SQL is a powerful language and allows you to extract useful information from your nginx access logs, even huge ones. Hopefully you gathered some useful information about your website visitors by using this method.
You can speed up queries by creating an index (e.g. CREATE INDEX request_index ON logs (request);
to create an index for the request column). The queries which use an index will execute faster at the price of higher disk usage by the logs.db
database.