2016/01/30

Monitoring a file to get details about write operations

Today I've been answering questions on unix.stackexchange.com when I came across one that at first sight seemed easy to answer, but after rethinking it a few times I realized that I actually didn't know the proper answer (that's the question in question). This made me investigate, and it led me down a rabbit hole which consumed most of this afternoon, and the next one, and some more time. The question asks how to figure out which process is writing to a certain log file. The person asking can see that something is writing to it by reading the file, but she doesn't know which process is doing the writing. Obviously there are mechanisms in Syslog which could solve that problem if the file is written by Syslog, but out of curiosity I wanted to know whether there's a generic mechanism that works for any file, so I'm going to go through a few possible solutions without concluding on a best one.

Inotify

One of the first things that came to my mind when I read the question was Inotify. Inotify is a mechanism provided by the Linux kernel via a set of system calls; it can be used to set watchers on files and directories. If one of the watched files gets modified in a way that the watcher is waiting for, an event gets passed to the user space process. This is great: I can use it to put a watcher on a file and have it wait for anything writing to that file. There are many different wrappers around the Inotify API; one of my favorites is a CLI utility called inotifywait which comes with the package inotify-tools.
mst@mst-nb1:~$ inotifywait 123
Setting up watches.
Watches established.    # now I write to "123" in another terminal
123 MODIFY
Ok, so we got notified that there was a write event on the file, but we still don't know what has written to it. Unfortunately the Inotify API does not give any hints regarding which process has done a given modification to a file. Considering that the reading of the events in user space happens asynchronously this makes sense, because at the time the event is processed by the user space application the writing process might already have exited. The alternative would be to copy a lot of information about the process into the Inotify event structure, but this would increase its size drastically and hence hurt performance.
For reference, this is the structure that describes an Inotify event:
# include/uapi/linux/inotify.h
...
 14 /*
 15  * struct inotify_event - structure read from the inotify device for each event
 16  *
 17  * When you are watching a directory, you will receive the filename for events
 18  * such as IN_CREATE, IN_DELETE, IN_OPEN, IN_CLOSE, ..., relative to the wd.
 19  */
 20 struct inotify_event {
 21         __s32           wd;             /* watch descriptor */
 22         __u32           mask;           /* watch mask */
 23         __u32           cookie;         /* cookie to synchronize two events */
 24         __u32           len;            /* length (including nulls) of name */
 25         char            name[0];        /* stub for possible name */
 26 };
...
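To make the mechanism a bit more concrete, here is a minimal watcher sketch of my own (not from the question and not part of inotify-tools), written in Python via ctypes; it assumes glibc's inotify_init1/inotify_add_watch wrappers and the example path /tmp/abc. The unpacked event contains exactly the fields of the struct above and, notably, nothing about the writing process:
import ctypes
import ctypes.util
import os
import struct

# Minimal sketch, assuming glibc exposes inotify_init1() and inotify_add_watch()
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

IN_MODIFY = 0x00000002                      # "file was modified" event mask

fd = libc.inotify_init1(0)                  # inotify instance -> file descriptor
if fd < 0:
    raise OSError(ctypes.get_errno(), os.strerror(ctypes.get_errno()))
libc.inotify_add_watch(fd, b"/tmp/abc", IN_MODIFY)

# Fixed-size part of struct inotify_event: __s32 wd, __u32 mask, __u32 cookie,
# __u32 len; "len" name bytes follow (only filled when watching a directory).
EVENT_FMT = "iIII"
EVENT_SIZE = struct.calcsize(EVENT_FMT)

buf = os.read(fd, 4096)                     # blocks until at least one event
offset = 0
while offset + EVENT_SIZE <= len(buf):
    wd, mask, cookie, name_len = struct.unpack_from(EVENT_FMT, buf, offset)
    print("event on wd=%d mask=0x%x" % (wd, mask))   # no pid anywhere in here
    offset += EVENT_SIZE + name_len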

Lsof

My next choice when trying to figure out what's writing to a file is the lsof utility. lsof is great for inspecting all kinds of status information about running processes and the file descriptors they currently hold open.
# In terminal 1 I create a cat process which holds an open file descriptor 
# to /tmp/abc and sits there waiting for input
mst@mst-nb1:~$ cat > /tmp/abc

# In terminal 2 I run lsof on /tmp/abc
mst@mst-nb1:~$ lsof /tmp/abc
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
cat     18420  mst    1w   REG  252,1        0 11010067 /tmp/abc
The example shows that lsof correctly displays that there's a cat process holding file descriptor 1 open in write mode and that this file descriptor refers to the file /tmp/abc.
This is great, it shows us what we want to know. But what if the process that modifies the file only runs for 0.01 seconds and then exits or closes the file again? Even when applying the maximum of my typing skills while my finger joints are at room temperature I'd still have a good chance of missing it. Due to the way lsof collects its data there is no way to watch a file for modifications over a time span, it can only capture snapshots. Its repeat mode (the -r option) doesn't help us much either, because it only captures multiple snapshots at an interval. Looks like we still haven't found a really good solution to our problem.
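Just to illustrate the snapshot problem, here is a rough polling sketch of my own (assuming lsof is installed and using /tmp/abc as an example path); even a loop like this, polling ten times per second, can miss a writer that opens, writes and closes the file between two iterations:
import subprocess
import time

PATH = "/tmp/abc"    # example path to watch

while True:
    # One lsof snapshot per iteration; lsof exits with 0 if it found open fds.
    result = subprocess.run(["lsof", PATH], capture_output=True, text=True)
    if result.returncode == 0:
        print(result.stdout, end="")
    time.sleep(0.1)  # anything shorter-lived than this interval goes unnoticed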

LoggedFS

Let's get a little more exotic and venture into the land of non-standard utilities. LoggedFS is a FUSE based file system, which means it runs in user space, and it greatly increases the verbosity of all operations running on the file system while still letting the "real" file system do its job underneath. Luckily the Ubuntu repositories provide it as a package under the name loggedfs, so it's easy to install. Here's how that looks:
# In terminal 1 I'm mounting Loggedfs on an existing directory while
# running it in foreground to directly see the output
mst@ubuntu:~$ loggedfs -f /tmp/testdir
09:30:27 (src/loggedfs.cpp:678) LoggedFS not running as a daemon
09:30:27 (src/loggedfs.cpp:818) LoggedFS starting at /tmp/testdir.
09:30:27 (src/loggedfs.cpp:843) chdir to /tmp/testdir

# In terminal 2 I'm writing to one of the already existing files
mst@ubuntu:~$ echo "bla" > /tmp/testdir/abc

# Back in terminal 1 Loggedfs shows me all the details about what just happened
 09:30:29 (src/loggedfs.cpp:138) getattr /tmp/testdir/abc \
    {SUCCESS} [ pid = 20428 -bash uid = 1000 ]  
 09:30:29 (src/loggedfs.cpp:138) open writeonly /tmp/testdir/abc \
    {SUCCESS} [ pid = 20428 -bash uid = 1000 ]  
 09:30:29 (src/loggedfs.cpp:138) truncate /tmp/testdir/abc to 0 bytes \
    {SUCCESS} [ pid = 20428 -bash uid = 1000 ]  
 09:30:29 (src/loggedfs.cpp:138) getattr /tmp/testdir/abc \
    {SUCCESS} [ pid = 20428 -bash uid = 1000 ]  
 09:30:29 (src/loggedfs.cpp:138) write 4 bytes to /tmp/testdir/abc at offset 0 \
    {SUCCESS} [ pid = 20428 -bash uid = 1000 ]  
 09:30:29 (src/loggedfs.cpp:138) 4 bytes written to /tmp/testdir/abc at offset 0 \
    {SUCCESS} [ pid = 20428 -bash uid = 1000 ]  
This is great, the output shows us exactly which process has written to our file and we could hardly wish for more details. But there are still some things to consider:
  • Mounting only works on an entire directory, not on single files, so exclude and include patterns have to be used to specify exactly which file needs to be monitored.
  • Depending on company policy it might be problematic to install non-default packages on production systems just for troubleshooting.
  • Some people might dislike the idea of having to run fuse on a production box.
With LoggedFS we've found a really nice and workable solution which tells us what we wanted to know. But actually the Kernel already has all the information that we want, so why would we have to run something else on top of it to collect the same data a second time? Let's see if by digging down into the Kernel we can't find some mechanism that doesn't require external packages or user space tools and still provides us with all the information we are looking for.

SystemTap

Things get more interesting once we start looking at the more advanced kernel features; I'll start with SystemTap. SystemTap is a utility that originally comes from RedHat, but it's Open Source and usable on all major distros. It makes it easy to generate simple kernel modules which can be used to inspect what's happening in the kernel by creating event handlers for various types of events.
Besides the ability to attach event handlers to most kernel functions which are exported as symbols, SystemTap comes with a large collection of "tapsets". Tapsets are predefined scripts and events that can be used in custom tap scripts. One of those available events is called vfs.write. vfs stands for Virtual File System and it's basically an abstraction layer on top of the actual file system implementation to handle generic operations like opening, closing, writing and reading files so they don't need to be reimplemented by each file system implementation again and again.
Once SystemTap is installed, the common way to use it is by writing script files with the extension .stp which are then read by the stap utility. This utility translates them into C code, compiles them into a kernel module and then loads that module into the kernel. The structure of the scripts is usually rather simple: they can set probes on events and they can define functions to make code reusable.

Identifying what we're looking for

In our case we want to monitor the write operations on a given file, so first we'll need to figure out how to identify a specific file in a vfs.write event. When querying the stap utility for the available variables provided by the event vfs.write we can see that there is a file variable.
root@ubuntu:~# stap -L vfs.write
vfs.write file:long pos:long buf:long bytes_to_write:long dev:long devname:string \
    ino:long name:string argstr:string $file:struct file* $buf:char const* \
    $count:size_t $pos:loff_t*
Alternatively we can also just look at the definition of vfs.write to get even more details.
# /usr/local/systemtap/share/systemtap/tapset/linux/vfs.stp
...
 950 probe vfs.write = kernel.function("vfs_write")
 951 {
 952         file = $file
 953         pos = $pos
 954         buf = $buf
 955         bytes_to_write = $count
 956         dev = __file_dev($file)
 957         devname = __find_bdevname(dev, __file_bdev($file))
 958         ino = __file_ino($file)
 959 
 960         name = "vfs.write"
 961         argstr = sprintf("%d, %d, %p", $count, $pos, $buf)
 962 }
...
But how do we uniquely identify a file based on the available variables? Each file on the file system is associated with an inode number and each inode number is unique within its file system, so within a file system we can use the inode number to identify a file. To identify the file system itself we can use the associated device's major and minor numbers, which means that the combination of major/minor numbers plus the inode number should be unique in the whole system and give us a reliable identifier for any file.
In the kernel the major and minor numbers are merged using a bit shift and a bitwise OR; the result is a device identifier which the vfs.write event provides as the variable dev. The inode number is available as the variable ino. One fact that will be important in a later section is that all of this data gets extracted from the file variable by the two functions __file_dev and __file_ino.
In order to look for a certain value in the dev variable we need to reproduce the merge of major and minor numbers that the kernel does internally; luckily SystemTap comes with a function called MKDEV which does exactly that.
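As a side note, a stat() on the file already returns both identifiers, so the ls -i and /proc/self/mountinfo steps shown below can also be replaced by a few lines of Python (my own sketch, assuming the example path /tmp/testfile):
import os

st = os.stat("/tmp/testfile")            # example path
print("inode:", st.st_ino)               # inode number within its file system
print("major:", os.major(st.st_dev))     # major number of the containing device
print("minor:", os.minor(st.st_dev))     # minor number of the containing device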

Let's try that!

Now we have all the information we need to write our stap script:
# write.stp
#
# monitor writes on file
probe vfs.write
{
    if (dev == MKDEV($1,$2) && ino == $3) {
       printf("got a write from %s (%d)\n", execname(), pid());
    }
}
The $1, $2 and $3 are arguments which must be provided on the command line when running the script.
root@ubuntu:~# touch /tmp/testfile

# get inode number of testfile
root@ubuntu:~# ls -i /tmp/testfile
14 /tmp/testfile

# getting the major and minor numbers of the file system that's
# currently mounted on /, because that's where /tmp/testfile resides.
root@ubuntu:~# sed -rne 's|^[0-9]+ [0-9]+ ([0-9:]+) / / .*$|\1|p' /proc/self/mountinfo
252:0

# pass major, minor and inode numbers to stap script
root@ubuntu:~# stap ./write.stp 252 0 14

# on terminal 2 we write something into the testfile
root@ubuntu:~# echo "test" > /tmp/testfile

# back on terminal 1 we're getting output:
root@ubuntu:~# stap ./write.stp 252 0 14
got a write from bash (868)
Great, we're monitoring the file for write operations and for each occurring operation we can get the name and pid of the process that's writing! SystemTap, with all the tapsets that it ships with, really makes investigating kernel internals relatively easy, and this by itself is quite a feature. It also doesn't require you to install its entire toolchain on every system that you want to use it on, because after SystemTap has built your kernel module on one machine you can copy the module to every other machine with a compatible kernel and load it there. Unfortunately it can currently still be slightly troublesome to set the toolchain up on Debian based distributions like Ubuntu and it may require a small amount of hacking to make it work; on RedHat it should work out of the box though.
Seems we've found another pretty good solution for our problem, but I wish there was an easy way to achieve our goal of monitoring the writes to a file without having to load a custom kernel module on a production server... Let's try Auditd instead.

Auditd

Just like SystemTap, Auditd originated at RedHat but is Open Source and usable on other distributions too. It consists of several parts: the data collection happens inside the kernel via a bunch of hooks placed at critical locations; the collected data is first run through a set of filters that live in kernel space; the resulting data is then sent to a user space daemon (auditd) via a netlink socket (man 7 netlink). auditd in turn communicates with other user space processes which want to access the collected data or modify filter rules, and it maintains log files containing the collected data for other tools to read. If user space processes submit commands to modify the filter rules to auditd, it uses the same netlink socket to configure the in-kernel data filters. The RedHat documentation illustrates that architecture nicely.
The easiest way to setup filter rules is by using the CLI utility auditctl, which comes together with auditd. When reading man 8 auditctl it becomes obvious that this is a quite mature and powerful tool to configure a wide range of settings and filters, but let's first focus on our rather simple task of watching writes on a file. Here are the relevant parts of auditctl's man page:
...
 -w path
    Insert a watch for the file system object at path. You cannot insert a watch to  the
    top  level  directory. This is prohibited by the kernel. Wildcards are not supported
    either and will generate a warning. The way that watches work  is  by  tracking  the
    inode  internally. If you place a watch on a file, its the same as using the -F path
    option on a syscall rule. If you place a watch on a directory, its the same as using
    the -F dir option on a syscall rule. The -w form of writing watches is for backwards
    compatibility and the syscall based form is more  expressive.  Unlike  most  syscall
    auditing  rules, watches do not impact performance based on the number of rules sent
    to the kernel. The only valid options when using a watch are the -p and -k.  If  you
    need  to  anything  fancy  like audit a specific user accessing a file, then use the
    syscall auditing form with the path or dir fields. See the EXAMPLES section  for  an
    example of converting one form to another.

...
 -k key Set a filter key on an audit rule. The filter key is an  arbitrary  string  of  text
    that can be up to 31 bytes long. It can uniquely identify the audit records produced
    by a rule. Typical use is for when you have several rules that  together  satisfy  a
    security requirement. The key value can be searched on with ausearch so that no mat‐
    ter which rule triggered the event, you can find its results. The key  can  also  be
    used on delete all (-D) and list rules (-l) to select rules with a specific key. You
    may have more than one key on a rule if you want to be able to search logged  events
    in  multiple ways or if you have an audispd plugin that uses a key to aid its analy‐
    sis.
...
 -p [r|w|x|a]
    Describe the permission access type that  a  file  system  watch  will  trigger  on.
    r=read,  w=write, x=execute, a=attribute change. These permissions are not the stan‐
    dard file permissions, but rather the kind of syscall that would  do  this  kind  of
    thing.  The  read  & write syscalls are omitted from this set since they would over‐
    whelm the logs. But rather for reads or writes, the open flags are looked at to  see
    what permission was requested.
...
Let's try that with the file /tmp/testfile that we've already used in the SystemTap test. First I'll set up a rule that places a watcher on writes to that file, then I'll write into it, and finally I'll use the ausearch utility to search through the audit log files for the key I've defined in the watcher rule. The ausearch utility gets delivered as part of the auditd package.
# Create a rule to watch writes on /tmp/testfile and mark them
# with the key "testfile"
mst@mst-nb1:~$ sudo auditctl -w /tmp/testfile -p w -k testfile

# Verify that the rule has been added
mst@mst-nb1:~$ sudo auditctl -l
-w /tmp/testfile -p w -k testfile

# Write into the test file
mst@mst-nb1:~$ echo "test" >> /tmp/testfile

# Search log files for the key "testfile"
mst@mst-nb1:~$ sudo ausearch -k testfile
----
time->Sat Feb  6 15:52:28 2016
type=CONFIG_CHANGE msg=audit(1454745148.692:36790): auid=4294967295 ses=4294967295 \
    op="add_rule" key="testfile" list=4 res=1
----
time->Sat Feb  6 15:52:36 2016
type=PROCTITLE msg=audit(1454745156.808:36794): proctitle="bash"
type=PATH msg=audit(1454745156.808:36794): item=1 name="/tmp/testfile" inode=11012788 \
    dev=fc:01 mode=0100777 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL
type=PATH msg=audit(1454745156.808:36794): item=0 name="/tmp/" inode=11010049 \
    dev=fc:01 mode=041777 ouid=0 ogid=0 rdev=00:00 nametype=PARENT
type=CWD msg=audit(1454745156.808:36794):  cwd="/home/mst"
type=SYSCALL msg=audit(1454745156.808:36794): arch=c000003e syscall=2 success=yes \
    exit=3 a0=2298f88 a1=441 a2=1b6 a3=7ffe7bb3eb48 items=2 ppid=30721 pid=32347 \
    auid=4294967295 uid=1000 gid=1000 euid=1000 suid=1000 fsuid=1000 egid=1000 \
    sgid=1000 fsgid=1000 tty=pts2 ses=4294967295 comm="bash" exe="/bin/bash" \
    key="testfile"
Nice, we got exactly the output that we were looking for again! The auditd way of creating those watcher rules is amazingly powerful and yet easy to use. Personally I think that the auditd utility should be at the front of every operations guy's utility belt, somewhere between tcpdump and strace.
I could be satisfied with all the solutions discovered so far, but I'm still curious whether there isn't a way to get that data without using any user space utilities at all and instead surgically extract it straight from the kernel. That's where Ftrace and Kprobe come in.

Ftrace & Kprobe

Ftrace is a tracer built into the Linux kernel that captures information about the current context when certain events occur and conditions are met. It exports its control interfaces via a virtual file system called tracefs, which is usually mounted in the directory tracing of the debugfs file system. By default most distributions mount debugfs on /sys/kernel/debug, so Ftrace's files are in /sys/kernel/debug/tracing. From now on, when I refer to paths of tracefs files, their paths should always be interpreted as relative to the mount point of tracefs. When browsing through that directory it might at first seem overwhelming due to the large number of files and directories, but luckily the kernel comes with really great documentation.

Basic event tracking with Ftrace

The list of available tracers is in the file /available_tracers, but since we only want some basic information and do not need any advanced tracing mechanisms the simple nop tracer is sufficient for us. It can be selected by writing its name into the file /current_tracer.
root@ubuntu:~# cat /sys/kernel/debug/tracing/available_tracers 
blk mmiotrace function_graph wakeup_dl wakeup_rt wakeup function nop
root@ubuntu:~# echo -n "nop" > /sys/kernel/debug/tracing/current_tracer
Before we start capturing anything let's first make sure the tracing is disabled and then clear the current trace output, that way we won't have any other output cluttering our trace output. The tracing can be enabled/disabled by writing 1/0 into the file /tracing_on. The trace output can be cleared by truncating the file /trace, which will later contain our trace output.
root@ubuntu:~# echo -n "0" > /sys/kernel/debug/tracing/tracing_on 
root@ubuntu:~# echo -n "" > /sys/kernel/debug/tracing/trace
Ftrace knows a large number of events and it allows us to choose which events we're interested in. The events are shown to the user in two ways: as a directory hierarchy in the directory /events and as a long list in the file /available_events. Because I like the hierarchical structure that the /events directory uses I'll stick to that one for now.
On the first level of directories in /events there is one subdirectory per type of event; since we're looking for write operations I'll first focus on the system calls in /events/syscalls. Doing an ls on /events/syscalls shows that for each system call there is one event when entering the call and one when exiting it, so let's look at the content of /events/syscalls/sys_enter_write.
root@ubuntu:/sys/kernel/debug/tracing# ls -l ./events/syscalls/ 
total 0
-rw-r--r-- 1 root root 0 Jan 26 13:47 enable
-rw-r--r-- 1 root root 0 Jan 26 13:47 filter
drwxr-xr-x 2 root root 0 Jan 26 13:47 sys_enter_accept
drwxr-xr-x 2 root root 0 Jan 26 13:47 sys_enter_accept4
drwxr-xr-x 2 root root 0 Jan 26 13:47 sys_enter_access
...
root@ubuntu:/sys/kernel/debug/tracing# ls -l ./events/syscalls/sys_enter_write
total 0
-rw-r--r-- 1 root root 0 Jan 26 13:47 enable
-rw-r--r-- 1 root root 0 Jan 26 13:47 filter
-r--r--r-- 1 root root 0 Jan 26 13:47 format
-r--r--r-- 1 root root 0 Jan 26 13:47 id
-rw-r--r-- 1 root root 0 Jan 26 13:47 trigger
Each of the event directories contains the same set of files and, unsurprisingly, enable is used to enable/disable the tracing of this event. Let's first disable the tracing of all events and then only enable the one that we're interested in. For convenience the upper level directories /events/syscalls and /events also contain a file enable which can be used to propagate a change through all the subdirectories. The file /set_event always contains a list of the currently enabled events (and it could also be used to control the enabling/disabling of events), so we read that file to verify that the only selected event is now sys_enter_write.
root@ubuntu:/sys/kernel/debug/tracing# echo -n "0" > events/enable 
root@ubuntu:/sys/kernel/debug/tracing# echo -n "1" > events/syscalls/sys_enter_write/enable 
root@ubuntu:/sys/kernel/debug/tracing# cat set_event 
syscalls:sys_enter_write
Seems we're ready to start tracing, but we don't want to get flooded with data, so we'll only enable the tracing for about 5 seconds. Then we should be able to see the output of the trace in the file /trace and depending on how busy your system is there might be quite a lot of data.
root@ubuntu:/sys/kernel/debug/tracing# echo -n "1" > tracing_on; sleep 5; echo -n "0" > tracing_on
root@ubuntu:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 108/108   #P:1
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            sshd-1261  [000] ....  3360.093713: sys_write(fd: 3, buf: 7f7e25a22de0, count: 24)
            sshd-1198  [000] ....  3360.699635: sys_write(fd: a, buf: 7f568d193d90, count: 1)
            bash-1284  [000] ....  3360.699682: sys_write(fd: 2, buf: 76c008, count: 1)
            sshd-1198  [000] ....  3360.699699: sys_write(fd: 3, buf: 7f568d178680, count: 24)
            sshd-1198  [000] ....  3360.778897: sys_write(fd: a, buf: 7f568d193d90, count: 1)
            bash-1284  [000] ....  3360.778941: sys_write(fd: 2, buf: 76c008, count: 1)
Cool, so we get some data there! Unfortunately we don't know what that data refers to, because the only file identifier given is the fd (file descriptor) number. Since file descriptors are only unique within the context of a process we can't use them to filter for the specific file we're interested in; we don't know which process is writing to that file, because that's exactly what we're trying to find out! To solve that problem we can use Kprobe to define custom events and variables to filter by.
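As an aside, the trace lines do contain the writer's pid, so once an interesting line shows up one could try to resolve its fd by hand through procfs; but just like lsof this only works while the process still exists and still holds the file open, so it doesn't give us a reliable filter. A hypothetical little helper of my own:
import os

def fd_target(pid, fd):
    # Best-effort resolution of a pid/fd pair to a path via procfs; only works
    # while the process is still alive and the file descriptor is still open.
    try:
        return os.readlink("/proc/%d/fd/%d" % (pid, fd))
    except OSError:
        return None

print(fd_target(1284, 2))   # pid and fd taken from the example trace above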

Custom events with Kprobe

Kprobes work by placing breakpoint instructions at specific places in the kernel and then collecting data associated with the current context; if you want to know all the tiny details about how this works then the kernel documentation shall satisfy your curiosity. Defined Kprobes can be used by Ftrace just like any other event and they're controlled using the same files as described above, but the important feature that we need right now is that Kprobe allows us to dereference function arguments, so that we can later filter by members of the structures behind those arguments. This will help us to filter out events on a specific file.
Due to the flexibility of Kprobe we can get information about basically anything that happens in the kernel; we just need to know what we're looking for and how to filter it out of all the other stuff that's happening. Let's go back to the function vfs_write that we've already used with SystemTap to monitor writes. If SystemTap is able to retrieve the criteria we require from the function arguments of vfs_write, then there should be some way to do the same with Kprobe, we just need to figure out where that data resides.
Looking at the kernel code, we can see in the function definition of vfs_write that the file variable is of type struct file.
# fs/read_write.c:547
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
You might remember from the SystemTap example above that the vfs.write tap extracted all the information about the inode number and the associated device's major and minor numbers from the file argument, so we need to inspect the kernel code that defines the involved structures to figure out where that data is.
# include/linux/fs.h
...
 584 struct inode {
 585         umode_t                 i_mode;
 586         unsigned short          i_opflags;
 587         kuid_t                  i_uid;
 588         kgid_t                  i_gid;
 589         unsigned int            i_flags;
 590 
 591 #ifdef CONFIG_FS_POSIX_ACL
 592         struct posix_acl        *i_acl;
 593         struct posix_acl        *i_default_acl;
 594 #endif
 595 
 596         const struct inode_operations   *i_op;
 597         struct super_block      *i_sb;
 598         struct address_space    *i_mapping;
 599 
 600 #ifdef CONFIG_SECURITY
 601         void                    *i_security;
 602 #endif
 603 
 604         /* Stat data, not accessed from path walking */
 605         unsigned long           i_ino;
...
 839 struct file {
 840         union {
 841                 struct llist_node       fu_llist;
 842                 struct rcu_head         fu_rcuhead;
 843         } f_u;
 844         struct path             f_path;
 845         struct inode            *f_inode;       /* cached value */
...
1289 struct super_block {
1290         struct list_head        s_list;         /* Keep this first */
1291         dev_t                   s_dev;          /* search index; _not_ kdev_t */
...
What we see in those struct definitions is that the inode number is available from file->f_inode->i_ino as a plain unsigned long, and the major and minor numbers are accessible via file->f_inode->i_sb->s_dev as type dev_t. The kernel uses macros to convert dev_t typed variables into major and minor numbers and back; we'll come back to that in a minute.

Defining the filter criteria by using custom variables

The kprobetrace documentation comes with a section describing the syntax used to define kprobe events and their custom variables via tracefs.
 23 Synopsis of kprobe_events
 24 -------------------------
 25   p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS]  : Set a probe
 26   r[:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS]             : Set a return probe
 27   -:[GRP/]EVENT                                         : Clear a probe
 28 
 29  GRP            : Group name. If omitted, use "kprobes" for it.
 30  EVENT          : Event name. If omitted, the event name is generated
 31                   based on SYM+offs or MEMADDR.
 32  MOD            : Module name which has given SYM.
 33  SYM[+offs]     : Symbol+offset where the probe is inserted.
 34  MEMADDR        : Address where the probe is inserted.
 35 
 36  FETCHARGS      : Arguments. Each probe can have up to 128 args.
 37   %REG          : Fetch register REG
 38   @ADDR         : Fetch memory at ADDR (ADDR should be in kernel)
 39   @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
 40   $stackN       : Fetch Nth entry of stack (N >= 0)
 41   $stack        : Fetch stack address.
 42   $retval       : Fetch return value.(*)
 43   +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**)
 44   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
 45   FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
 46                   (u8/u16/u32/u64/s8/s16/s32/s64), "string" and bitfield
 47                   are supported.
 48 
 49   (*) only for return probe.
 50   (**) this is useful for fetching a field of data structures.
Since we are going to have to access nested members of struct file we'll need to know their offsets relative to their containing structs. The C standard library has a really useful macro called offsetof which can tell us exactly that, but since code running in the kernel can't access the standard library we'll need to use the kernel's own implementation of offsetof which is in include/linux/stddef.h. Obviously, if you're less lazy than I am, then it's also possible to just do the offset calculations in your head. But the lazy-people's way to get them is by writing a super minimalist kernel module that prints the offsets into dmesg as debug messages:
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/printk.h>
#include <linux/stddef.h>

static int __init get_offset_init(void)
{
        /* print the offsets of the struct members we need to dereference */
        pr_debug("offset of f_inode in struct file is %zu\n", offsetof(struct file, f_inode));
        pr_debug("offset of i_ino in struct inode is %zu\n", offsetof(struct inode, i_ino));
        pr_debug("offset of i_sb in struct inode is %zu\n", offsetof(struct inode, i_sb));
        pr_debug("offset of s_dev in struct super_block is %zu\n", offsetof(struct super_block, s_dev));

        return 0;
}

static void __exit get_offset_cleanup(void)
{
}

module_init(get_offset_init);
module_exit(get_offset_cleanup);
MODULE_LICENSE("GPL");
Once built and loaded it shows us what we need to know in the kernel's debug output.
root@ubuntu:~/get_offsetof# insmod ./offsetof.ko
root@ubuntu:~/get_offsetof# dmesg
[19280.887103] offset of f_inode in struct file is 32
[19280.887106] offset of i_ino in struct inode is 64
[19280.887108] offset of i_sb in struct inode is 40
[19280.887109] offset of s_dev in struct super_block is 16
Since we are going to calculate the offsets relative to the pointer value of the first function argument (struct file *file) the address to add the offsets to should be in the register %di, just as this StackOverflow post nicely describes.
Using the above quoted syntax documentation we can now define custom variables for our Kprobe event that extract i_ino and s_dev from struct file, by writing its definition into the file kprobe_events.
# inode is at file->f_inode->i_ino with offsets:
#   f_inode: 32
#   i_ino: 64
# inode=+64(+32(%di)):u64
#
# device is at file->f_inode->i_sb->s_dev with offsets:
#   f_inode: 32
#   i_sb: 40
#   s_dev: 16
# dev=+16(+40(+32(%di))):u32
#
# Defining the kprobe event and the custom variables by
# adding the above calculated offsets to the address
# in %di
root@mst-nb1:/sys/kernel/debug/tracing# \
    echo "p:kprobes/vfs_write vfs_write inode=+64(+32(%di)):u64 dev=+16(+40(+32(%di))):u32" \
    > ./kprobe_events

# The defined event appears in the /events folder and it's
# format file shows the custom variables
root@mst-nb1:/sys/kernel/debug/tracing# cat ./events/kprobes/vfs_write/format 
name: vfs_write
ID: 1914
format:
 field:unsigned short common_type; offset:0; size:2; signed:0;
 field:unsigned char common_flags; offset:2; size:1; signed:0;
 field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
 field:int common_pid; offset:4; size:4; signed:1;

 field:unsigned long __probe_ip; offset:8; size:8; signed:0;
 field:u64 inode; offset:16; size:8; signed:0;
 field:u32 dev; offset:24; size:4; signed:0;

print fmt: "(%lx) inode=0x%Lx dev=0x%x", REC->__probe_ip, REC->inode, REC->dev

Testing!

First we create a test file and get its inode number.
root@mst-nb1:~# touch /tmp/testfile

# get the inode number of the test file
root@mst-nb1:~# ls -i /tmp/testfile
11014706 /tmp/testfile
Next we need to calculate the dev_t value that we have to filter for. The kernel uses the following macro to merge the major and minor numbers into a variable of type dev_t.
# include/linux/kdev_t.h
...
  6 #define MINORBITS       20
  7 #define MINORMASK       ((1U << MINORBITS) - 1)
  8 
  9 #define MAJOR(dev)      ((unsigned int) ((dev) >> MINORBITS))
 10 #define MINOR(dev)      ((unsigned int) ((dev) & MINORMASK))
 11 #define MKDEV(ma,mi)    (((ma) << MINORBITS) | (mi))
...
Well that bit shift should be simple enough to reimplement, so let's shift some bits!
I'll use Python to make those examples shorter:
# Getting the major and minor numbers of the file system that's
# currently mounted on /, because that's where the file
# to monitor resides
mst@mst-nb1:~$ sed -rne 's|^[0-9]+ [0-9]+ ([0-9:]+) / / .*$|\1|p' /proc/self/mountinfo
252:1

# Reimplementing the kernel MKDEV macro with the input of 252 and 1,
# the result should be what we need to look for as dev_t
mst@mst-nb1:~$ python -c 'print(252 << 20 | 1)'
264241153
Next we create the filter for our inode number and dev_t value.
# Filter inode number 11014706 and a dev_t value 264241153
root@mst-nb1:/sys/kernel/debug/tracing# \
    echo "inode==11014706 && dev==264241153" > events/kprobes/vfs_write/filter

# Enable tracing
root@mst-nb1:/sys/kernel/debug/tracing# echo 1 > events/kprobes/vfs_write/enable 
root@mst-nb1:/sys/kernel/debug/tracing# echo 1 > tracing_on

# On terminal 2 I write into /tmp/testfile
root@mst-nb1:~# python -c 'fd = open("/tmp/testfile", "w"); fd.write("test"); fd.close()'

# Back on terminal 1 we check the content of trace
root@mst-nb1:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
          python-6559  [004] d... 1406152.334025: vfs_write: \
            (vfs_write+0x0/0x1a0) inode=0xa81232 dev=0xfc00001
There's exactly one line showing our Python process calling vfs_write, and we didn't have to taint the kernel with custom modules or install any non-standard tools to get that result!

Conclusion

There are many answers to the above question, some might be better and some might be less good, but the only way to see the trade-offs is to be familiar with all the concepts. Obviously the described tools can also be used for much more than just monitoring write operations; this was only a basic example.
Writing this article was a great exercise for me and I'd be happy about feedback with suggestions for improvements or other solutions.

2013/02/18

Splitting one MongoDB ReplicaSet into two

The Problem

We are running a cluster of 4 MongoDB servers. They are all joined into one replicaset, called "rs1", which holds a lot of user data. All of the data is organized in a UserID oriented structure.
The load on these servers is getting higher and higher, most of it caused by the write I/O generated by thousands of updates per second. Additionally the disks are filling up. The project manager is complaining almost daily that some page loads take too long because MongoDB responds slowly to the queries from PHP, and the monitoring confirms his complaints.
                    ----------------------------------------
                    |                  RS1                 |
                    |                                      |
                    |   ---------------  ---------------   |
                    |   | Data node 1 |  | Data node 3 |   |
                    |   ---------------  ---------------   |
                    |                                      |
                    |   ---------------  ---------------   |
                    |   | Data node 2 |  | Data node 4 |   |
                    |   ---------------  ---------------   |
                    |                                      |
                    ----------------------------------------

The Solution

Since most of the load on these servers is caused by writes, and not by read queries, adding more servers to the cluster will not really solve the problem, because every node in a replicaset replicates every update. So we decided that the only sensible thing to do is to introduce sharding.
Sharding our data is relatively easy, because all queries and updates are by UserID anyway. So basically we can distribute all the data by UserID % 2 into two replicasets, and then decide inside the client code (PHP), based on the UserID, to which replicaset each query has to go.
Since, in our case, it's so simple to implement sharding on the client side (PHP), we decided not to use MongoDB's own sharding feature. It would only be an additional point of failure and add complexity, without giving us an advantage.
                    -----------------------   -----------------------
                    |         RS1         |   |         RS2         |
                    |                     |   |                     |
                    |   ---------------   |   |   ---------------   |
                    |   | Data node 1 |   |   |   | Data node 3 |   |
                    |   ---------------   |   |   ---------------   |
                    |                     |   |                     |
                    |   ---------------   |   |   ---------------   |
                    |   | Data node 2 |   |   |   | Data node 4 |   |
                    |   ---------------   |   |   ---------------   |
                    |                     |   |                     |
                    |   ---------------   |   |   ---------------   |
                    |   |  Arbiter 1  |   |   |   |  Arbiter 2  |   |
                    |   ---------------   |   |   ---------------   |
                    |                     |   |                     |
                    -----------------------   -----------------------
With this configuration the amount of updates that goes to each data node is only 50% of what it was before, while the amount of reads per node does not change. The amount of disk space used is also reduced by almost half (except for the oplog).
The arbiter nodes are necessary because every MongoDB replicaset needs at least 3 voting nodes so that it can still elect a primary when one of them fails; they don't hold any data and can basically run on any tiny machine without consuming many resources.

Coming up with a Migration procedure

So we thought we had the perfect solution for our problem, but it seems that MongoDB replicasets are not really made for being split into multiple replicasets. We also weren't successful in googling for other people who had done something similar.
Of course it would be possible to mongodump all the data, break the replicaset, create two new ones, and then mongorestore half of the data into each replicaset. But these tools take many hours to work through big amounts of data (>200G in our case), and during the migration the whole cluster would need to go offline. We can't afford many hours of downtime on this cluster, so we were looking for the migration procedure that involves the least downtime possible, no matter how nice or ugly the procedure itself is.
Coming up with this solution took quite some research time, lots of trial and error in the lab, and the resulting procedure is not exactly simple.
A short explanation of the migration would be as follows: we remove two nodes from the replicaset, change their configs to become a new one, start them as a new replicaset, add an arbiter to each of the replicasets to fulfill the requirement of having 3 nodes, change the client code to implement sharding, and then remove the unused data from each replicaset.
I'm sharing the exact process of how we did it here in the hope of saving somebody else the research. In total it consists of around 12 steps, each of which I'm going to describe in detail.
This whole procedure should be done during a time when the cluster is less busy than at maximum peak time. Although there is -no downtime- (yay!), the resources which are available to handle queries are decreased by 50% during part of the migration.

Preparing a Test Environment

To demonstrate the procedure I'm first creating a test environment of 4 data nodes and joining them into a replicaset called "rs1".
mst@mst-gentoo-nb ~/ $ for i in 1 2 3 4; do mongod --dbpath ~/mongodb/instance${i} --nojournal --port 270${i}7 --logpath ~/mongodb/instance${i}/mongo.log --logappend --rest --replSet rs1 --oplogSize 128 --maxConns 20 --fork; done 
forked process: 14736
all output going to: /home/mst/mongodb/instance1/mongo.log
child process started successfully, parent exiting
forked process: 14780
all output going to: /home/mst/mongodb/instance2/mongo.log
child process started successfully, parent exiting
forked process: 14824
all output going to: /home/mst/mongodb/instance3/mongo.log
child process started successfully, parent exiting
forked process: 14868
all output going to: /home/mst/mongodb/instance4/mongo.log
child process started successfully, parent exiting
Now I join all of them into the replicaset "rs1". To do that I connect to one of them, create the replicaset config, and do an rs.initiate(conf).
mst@mst-gentoo-nb ~/ $ mongo --port 27017
MongoDB shell version: 2.2.3
connecting to: 127.0.0.1:27017/test
> conf={
...         "_id" : "rs1",
...         "version" : 1,
...         "members" : [
...                 {
...                         "_id" : 0,
...                         "host" : "localhost:27017"
...                 },
...                 {
...                         "_id" : 1,
...                         "host" : "localhost:27027"
...                 },
...                 {
...                         "_id" : 2,
...                         "host" : "localhost:27037"
...                 },
...                 {
...                         "_id" : 3,
...                         "host" : "localhost:27047"
...                 }
...         ]
... }
> rs.initiate(conf)
{
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1
}
To be sure that no data gets lost during the migration procedure, I will add some test data to the replicaset.
 rs1:PRIMARY> use mydb1
 switched to db mydb1
 rs1:PRIMARY> db.createCollection("mycol1")
 { "ok" : 1 }
 rs1:PRIMARY> db.mycol1.insert({"_id":0, "key1":"value1"})
 rs1:PRIMARY> db.mycol1.insert({"_id":1, "key2":"value2"})
 rs1:PRIMARY> db.mycol1.insert({"_id":2, "key3":"value3"})
 rs1:PRIMARY> db.mycol1.insert({"_id":3, "key4":"value4"})

Each migration step in detail

Prepare 2 new arbiter nodes

We will need two arbiter nodes because each replicaset needs at least 3 voting nodes to be able to hold elections when one of them is down. Currently we have only 4 active nodes to split into 2 replicasets, so we will add 1 arbiter to each of the 2 replicasets to reach a total of 3 nodes each.
In the test environment I'll simply create a new empty directory for each of them to use as their dbpath.
mst@mst-gentoo-nb ~/mongodb $ mkdir arb1 arb2

Stop data node 3 and 4

It's best to not do this during the busiest peak times, because the amount of read queries that can still be handled is now drastically reduced.
mst@mst-gentoo-nb ~/mongodb $ cat instance3/mongod.lock instance4/mongod.lock 
14824
14868
mst@mst-gentoo-nb ~/mongodb $ kill 14824 14868

Add 1 new arbiter node to the currently running cluster

One of the two new arbiter nodes can now be started and added to the currently running cluster.
mst@mst-gentoo-nb ~/ $ mongod --dbpath ~/mongodb/arb1 --nojournal --port 27057 --logpath ~/mongodb/arb1/mongo.log --logappend --rest --replSet rs1 --oplogSize 128 --maxConns 20 --fork                                              
forked process: 15988
all output going to: /home/mst/mongodb/arb1/mongo.log
child process started successfully, parent exiting
Then I connect to the current primary and add the arbiter.
rs1:PRIMARY> rs.addArb("localhost:27057")
{ "ok" : 1 }

Deploy a version of the client code

To make sure that during the next few steps nodes 3 and 4 are not being used by the client, in our case PHP, we need to deploy a new version of the code which only uses nodes 1 and 2. To do that we simply change its configuration to use only the IPs of these 2 nodes, instead of all 4, as replicaset "rs1".

Remove node 3 and 4 from the replicaset configuration

Data nodes 3 and 4 are currently still in the config of replicaset "rs1". They will simply not be used because the replicaset detects that they are down. In the future these 2 nodes will become the data nodes of the new replicaset "rs2", so we can remove them from the config of "rs1" on the primary node.
rs1:PRIMARY> rs.conf()
{
        "_id" : "rs1",
        "version" : 2,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "localhost:27017"
                },
                {
                        "_id" : 1,
                        "host" : "localhost:27027"
                },
                {
                        "_id" : 2,
                        "host" : "localhost:27037"
                },
                {
                        "_id" : 3,
                        "host" : "localhost:27047"
                },
                {
                        "_id" : 4,
                        "host" : "localhost:27057",
                        "arbiterOnly" : true
                }
        ]
}
rs1:PRIMARY> rs.remove("localhost:27037")
rs1:PRIMARY> rs.remove("localhost:27047")

Start node 3 and 4 without the --replSet parameter

Next we are going to manually edit the definition of the replicaset inside nodes 3 and 4. To do that we want to start them without any replicaset being active, so we just remove the --replSet parameter.
for i in 3 4; do mongod --dbpath ~/mongodb/instance${i} --nojournal --port 270${i}7 --logpath ~/mongodb/instance${i}/mongo.log --logappend --rest --oplogSize 128 --maxConns 20 --fork; done
forked process: 24482
all output going to: /home/mst/mongodb/instance3/mongo.log
child process started successfully, parent exiting
forked process: 24492
all output going to: /home/mst/mongodb/instance4/mongo.log
child process started successfully, parent exiting
Ok, nodes 3 and 4 are started. Now we need to store a definition of the new replicaset "rs2" in their local.system.replset. Let's first look at what's there already.
> use local
switched to db local
> db.system.replset.find()
{ "_id" : "rs1", "version" : 2, "members" : [   {       "_id" : 0,      "host" : "localhost:27017" },   {       "_id" : 1,       "host" : "localhost:27027" },   {       "_id" : 2,      "host" : "localhost:27037" },   {       "_id" : 3,       "host" : "localhost:27047" },   {       "_id" : 4,      "host" : "localhost:27057",     "arbiterOnly" : true } ] }
Obviously that's still the old config, including nodes 3 and 4. That's because at the moment we removed them they were down, so they couldn't replicate this change yet.
Build the config for rs2 and store it into the variable "conf". Note that port 27067 is the port which the second new arbiter will use once it is started.
> conf = { "_id" : "rs2", "version" : 2, "members" : [
... { "_id" : 0, "host" : "localhost:27037" },
... { "_id" : 1, "host" : "localhost:27047" },
... { "_id" : 2, "host" : "localhost:27067", "arbiterOnly" : true }
... ]}
{
        "_id" : "rs2",
        "version" : 2,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "localhost:27037"
                },
                {
                        "_id" : 1,
                        "host" : "localhost:27047"
                },
                {
                        "_id" : 2,
                        "host" : "localhost:27067",
                        "arbiterOnly" : true
                }
        ]
}
Insert the new config into db.system.replset
> db.system.replset.insert(conf)
Finally remove the old config for the replicaset "rs1"
> db.system.replset.remove({"_id":"rs1"})
These steps have to be repeated exactly the same way on both node 3 and node 4.

Activate the replicaset on node 3 and 4

Stop nodes 3 and 4 again.
mst@mst-gentoo-nb ~/mongodb $ cat instance3/mongod.lock instance4/mongod.lock 
24482
24492
mst@mst-gentoo-nb ~/mongodb $ kill 24482 24492
Restart them with the --replSet parameter. Make sure that --replSet specifies the name of the new replicaset, which in our case is "rs2". They will now try to resync the replicaset, but realize that the new arbiter isn't reachable yet.
for i in 3 4; do mongod --dbpath ~/mongodb/instance${i} --nojournal --port 270${i}7 --logpath ~/mongodb/instance${i}/mongo.log --logappend --rest --replSet rs2 --oplogSize 128 --maxConns 20 --fork; done
forked process: 24947
all output going to: /home/mst/mongodb/instance3/mongo.log
child process started successfully, parent exiting
forked process: 25002
all output going to: /home/mst/mongodb/instance4/mongo.log
child process started successfully, parent exiting

Start the second arbiter node

Now nodes 3 and 4 are using their new replicaset "rs2", so they are already trying to connect to the second arbiter node. As soon as we start it on an empty dbpath it should automatically join "rs2" and synchronize the config with nodes 3 and 4.
mst@mst-gentoo-nb ~/ $ mongod --dbpath ~/mongodb/arb2 --nojournal --port 27067 --logpath ~/mongodb/arb2/mongo.log --logappend --rest --replSet rs2 --oplogSize 128 --maxConns 20 --fork
forked process: 25840
all output going to: /home/mst/mongodb/arb2/mongo.log
child process started successfully, parent exiting
By doing rs.status() on either node 3 or 4 we can see that they have automatically added the new arbiter after a few seconds.

Tadaa! We have created a functioning second replicaset, which still contains all of the data that "rs1" contained.

Deploy the final client code

Since we now have 2 running replicasets which both contain all the data, we can deploy the client code that implements sharding. This code simply does a UserID % 2 before every mongo query, and based on the result it decides if it should use "rs1" or "rs2". As soon as the new code is active, only half of the data in each of the replicasets will be used.
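In Python terms the routing logic boils down to something like the following sketch (our real client code is PHP, and the hostnames below are just the ones from the test environment, so treat this purely as an illustration):
from pymongo import MongoClient

# One client per replicaset; connection strings as in the test setup above.
rs1 = MongoClient("mongodb://localhost:27017,localhost:27027/?replicaSet=rs1")
rs2 = MongoClient("mongodb://localhost:27037,localhost:27047/?replicaSet=rs2")

def db_for_user(user_id):
    # UserID % 2 decides which replicaset holds this user's data.
    client = rs1 if user_id % 2 == 0 else rs2
    return client["mydb1"]

db_for_user(42).mycol1.find_one({"_id": 42})   # even UserID -> rs1
db_for_user(43).mycol1.find_one({"_id": 43})   # odd UserID  -> rs2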
To verify that all the data is still there, I connect to "rs1" and query it.
rs1:PRIMARY> use mydb1
switched to db mydb1
rs1:PRIMARY> db.mycol1.find()
{ "_id" : 0, "key1" : "value1" }
{ "_id" : 1, "key2" : "value2" }
{ "_id" : 2, "key3" : "value3" }
{ "_id" : 3, "key4" : "value4" }
Same on "rs2"
rs2:PRIMARY> use mydb1
switched to db mydb1
rs2:PRIMARY> db.mycol1.find()
{ "_id" : 0, "key1" : "value1" }
{ "_id" : 1, "key2" : "value2" }
{ "_id" : 2, "key3" : "value3" }
{ "_id" : 3, "key4" : "value4" }

Remove the unused half of data from both replicasets

Now we can safely remove the unused half of the data from each replicaset. The implementation of this should be done on the client side. In our case it was really easy to implement, because the PHP was aware of which data belongs on which replicaset, so we could simply go through all UserIDs and remove each one from the replicaset where it is not used.
In order to not cause any additional load we did the removal very slowly, over a period of around 3 months. Each night, when most of the users are sleeping anyway and the DB wasn't so busy, around 50k datasets were removed.
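As a sketch of that cleanup idea (again in Python with made-up connection details and batch sizes, the real job ran in PHP): walk the UserIDs on one replicaset and delete every document that no longer belongs there, throttled and capped per night.
import time
from pymongo import MongoClient

# Example: clean replicaset rs1, which after the split only keeps even UserIDs.
client = MongoClient("mongodb://localhost:27017,localhost:27027/?replicaSet=rs1")
col = client["mydb1"]["mycol1"]
KEEP_REMAINDER = 0       # rs1 keeps UserID % 2 == 0, rs2 would use 1
NIGHTLY_LIMIT = 50000    # roughly what we removed per night

removed = 0
for doc in col.find({}, {"_id": 1}):
    if doc["_id"] % 2 != KEEP_REMAINDER:
        col.delete_one({"_id": doc["_id"]})
        removed += 1
        time.sleep(0.01)             # throttle so the cleanup never adds load
    if removed >= NIGHTLY_LIMIT:
        break
print("removed %d documents this night" % removed)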
If at some time in the future we run into similar trouble again, with too much load caused by updates, we can even repeat the same procedure. We would have to first double the amount of hardware, grow the two replicasets into the new hardware, and then repeat this procedure on each of the replicasets.

Conclusion

The result of this whole migration was as expected. The load on each of the data nodes has decreased a lot, because the amount of updates has decreased by half. The speed of the updates has also increased, so the new performance numbers make the project manager happy.

2011/08/18

PECL-Memcache standard loadbalancer for Nginx

In a previous post I already introduced my Nginx loadbalancer, which is compatible with the PECL-Memcache consistent hashing. Unfortunately we discovered that the implementation of the consistent hashing on the PHP side is quite inefficient since it rebuilds the hashring on every request, and in the case of our PHP framework even once per Memcache server on each request. So we had to decide between a Memcache proxy like Moxi to implement the consistent hashring, or just forgetting about the advantages of a consistent hash and saving the additional hop through a moxi. The conclusion was to just use the standard PECL-Memcache loadbalancer, since this is probably the most resource efficient way to implement a cache, and then live with having to regenerate some caches in case we add/remove Memcache instances.
So in order to make Nginx find cache values which have been stored into Memcache by PHP, I wrote another Nginx loadbalancer which simply copies the behaviour of the PECL-Memcache standard loadbalancer.
Example:
 70 upstream profilememcache {
 71     hash_key $memcached_key;
 72     server 10.20.0.27:11211;
 73     server 10.20.0.26:11211;
 74 }

-----

 570         location /profile {
 571             set $memcached_key $uri;
 572             error_page      500 404 405 = /fallback_php;
 573             memcached_pass  profilememcache;
 574         }
In the above example you can see that memcached_pass is used just as always, and in the upstream definition you simply need to add the parameter "hash_key" together with the variable that the loadbalancer should base the hashing on.
Unfortunately it's really important that the order of the Memcache servers in the upstream definition is the same as the order in which the Memcache servers are added to the Memcache object in the PHP code. If the positions of the servers are not the same the whole thing won't work.

2011/06/22

PHP session parser in production

Last year I posted about the Nginx ngx_http_php_session module that I wrote, which is able to parse PHP sessions and extract certain values from them. Unfortunately it took quite a long time, but now I can finally announce that I am actually using this on a production system and so far it seems stable.
In the case where I'm using the module in production now, it is used to optimize the way Nginx uses the Memcaches. For quite a long time I already had a Memcache in production which is fed plain HTML by PHP and then read out and delivered by Nginx. This works well as long as the content is more or less static and independent of the user who accesses it, and exactly this second requirement is now not necessary anymore. Using the ngx_http_php_session module Nginx can check if a requesting user has a valid session, and if he does, it can check for things like payment status, gender, his preferences or whatever your site content depends on. Once this is known to Nginx, the correct cache can be retrieved accordingly. As a result there are many more pages that are cachable in a very efficient way.
Here I show an example of how this module is used to decide whether Nginx should use the profile cache key for paying users or the one for non-paying users.
location / {

    if ($cookie_session_id = "") {
        rewrite . /login redirect;
    }

    eval $session {
        set $memc_key $cookie_session_id;
        memc_pass session_memcache;
    }

    php_session_parse $authenticated $session "symfony/user/sfUser/authenticated";

    if ($authenticated != "b:1") {
        rewrite . /login redirect; #the user is logged out
    }

    php_session_parse $is_paying $session "symfony/user/sfUser/attributes|s:10:\"subscriber\";s:9:\"is_paying\"";
    php_session_parse $user $session "symfony/user/sfUser/attributes|s:10:\"subscriber\";s:11:\"getNickname\"";

    php_session_strip_formatting $stripped_user $user;

    if ($is_paying = "i:1") {
        rewrite . profile_cache$uri:paying last; # the user is paying
    }

    rewrite ^/(.+)$ $1;
    set $tmp_compare "$stripped_user:$uri";

    if ($tmp_compare ~* "^(.*):\1$") {
        rewrite . profile_cache$uri:paying last; # the user is viewing his own profile
    }

    rewrite . profile_cache$uri:non-paying last; # the user is non-paying
}

location profile_cache {
    default_type    text/html;
    add_header      "Content" "text/html; charset=utf8";
    lower           $uri_low $uri;
    set $memc_key $uri_low;
    memcached_next_upstream error timeout not_found;
    error_page      500 404 405 = /fallback_to_php;
    memc_pass  profilememcache;
}
In the example above you can also see the directive php_session_strip_formatting. This is used because a string which is extracted from a PHP session comes with serialization formatting, for example s:7:astring. The php_session_strip_formatting directive extracts the actual string astring and strips the formatting away.
Another directive in this example is lower. This directive comes from another module that I wrote, named lower_upper_case. It simply uppercases or lowercases a string in the Nginx config and stores the result in another variable. In the above example I'm using it to make the Memcache keys case insensitive. To achieve this I just lowercase all keys in Nginx, and also on the PHP side when it fills the Memcache.
Both modules can be retrieved from github where I'm hosting the repositories:
https://github.com/replay/ngx_http_lower_upper_case
https://github.com/replay/ngx_http_php_session

2010/07/27

Generate secured download links on the fly in Nginx

I created a module which can be used to generate secure download links directly in the Nginx. Those links can then be verified by the NginxHttpSecureDownload module.
I call the new module NginxHttpGenerateSecureDownloadLinks and it is supposed to be used _only_ via the Nginx SSI module. This works as in the following example:
 37         location / {
 38             ssi on;
 39             root   html;
 40         }
 41 
 42         location /gen_sec_link {
 43             internal;
 44             rewrite /gen_sec_link(.*)$ $1 break;
 45             generate_secure_download_link_expiration_time 3600;
 46             generate_secure_download_link_secret $remote_addr;
 47             generate_secure_download_link_url $uri;
 48             generate_secure_download_link on;
 49         }
If you now access for example /test.html and its output should contain secured download links, then test.html has to contain one SSI include tag per link, each pointing at the internal /gen_sec_link location shown above.
If you paid attention you might already expect that those SSI tags get replaced by secured links; the HTML output will look like this:
thisisatest
<a href="http://www.blogger.com/this_is_my_link/509325bc5fac6e4e42687fe096d67a9d/4C4EC7C3">this_is_my_link</a>
some text abc
<a href="http://somewhateverhost.com/this_is_another_link/badbcb4d20500cca464c609da41001b2/4C4EC7C3">this_is_another_link</a>
more text
I will give you the link to the module here. But please keep in mind that I uploaded this today and it's NOT well tested; I'm glad about feedback though.

2010/06/03

Nginx does not know about the DNS names of specified backend servers?

I already mentioned this problem in a previous post, but I would like to talk about it one more time because it kind of bothers me.
While writing an Nginx loadbalancer module I discovered that when you define a backend server using the server directive in an upstream block of the config file, Nginx doesn't seem to keep the actual DNS name available anywhere, even if you specified the upstream using a DNS name. It only keeps the resolved IPs after doing the name resolution once on startup.
The reason why this is a problem in my case is that I'm writing a loadbalancer which should always find exactly the same backend server as the PHP-Memcache loadbalancer does. Since the PHP-Memcache loadbalancer decides based on a hash of the actual strings that were passed to the addServer method, it is crucial for my Nginx loadbalancer to be able to do the same. This means that if somebody specifies an IP in the Nginx config I have to hash the string of the IP, and if somebody enters the hostname of the backend server I have to hash the hostname.
To make the whole thing work as I wanted I had to do something I really hated to do: I created a DNS supporting and a non-DNS branch of my loadbalancer module on github. While my neck hairs stood on end at the mere thought of what I was about to do, I added a patch for the upstream module to the DNS aware branch which does nothing more than add a string to the struct ngx_http_upstream_server_t:
typedef struct {
 +    ngx_str_t                        name;
     ngx_addr_t                      *addrs;
     ngx_uint_t                       naddrs;
     ngx_uint_t                       weight;
     ngx_uint_t                       max_fails;
     time_t                           fail_timeout;
 
     unsigned                         down:1;
     unsigned                         backup:1;
 } ngx_http_upstream_server_t;
Plus I added one line to the ngx_http_upstream_server function which stores the DNS name of this backend into the above added ngx_str_t:
+    us->name = u.url;
    us->addrs = u.addrs;
    us->naddrs = u.naddrs;
    us->weight = weight;
    us->max_fails = max_fails;
    us->fail_timeout = fail_timeout;
After applying this patch to the nginx source, it required only minor changes to the loadbalancer module to make it use the DNS names instead of the resolved IPs. What I don't understand is why the author of the upstream module decided to spare those few bytes and not store the actual DNS names that are specified in the config file.
However, I would hereby like to ask the author of the upstream module, or whoever has the power to do this, to add this simple ngx_str_t to ngx_http_upstream_server_t. Thanks!