mauro-stettler: Monitoring a file to get details about write operations

Today I've been answering questions on unix.stackexchange.com, when I've met one which at first sight seemed easy to answer, but after rethinking it a few times I realized that actually I didn't know the proper answer (That's the question in question). This made me investigate and led me down a rabbit hole which consumed most of this afternoon, and the next one, and some more time. The question asks about how to figure out which process is writing to a certain log file. The person asking can see that something is writing to it by reading the file, but she doesn't know which process is writing it. Obviously there are some mechanisms in Syslog which could solve that problem if the file is written by Syslog, but out of curiosity I wanted to know if there's a generic mechanism that works for any file, so I'm going to go through a few possible solutions without concluding on a best one.

Inotify

One of the first things that came to my mind when I read the question was Inotify. Inotify is a mechanism that's provided by the Linux kernel via a set of system calls, it can be used to set watchers on files and directories. If one of the watched files gets modified in a way that the watcher is waiting for then an event gets passed to the user space process. This is great, I can use it to put a watcher on a file and make it wait for anything writing to that file. There are many different wrappers around the Inotify API, one of my favorite ones is a CLI utility called inotifywatch which comes with the package inotify-tools.

mst@mst-nb1:~$ inotifywait 123
Setting up watches.
Watches established.    # now i write to "123" in another terminal
123 MODIFY

Ok, so we got notified that there was a write event on the file, but we still don't know what has written to it. Unfortunately the Inotify API does not give any hints regarding which process has done a given modification on a file. Considering that the reading of the events in user space happens asynchronously this makes sense, because at the time the event is processed by the user space application the writing process might already have exited. The alternative would be to copy a lot of information about the process into the Inotify event structure, but this would increase it's size drastically and would hence decrease the performance.

For reference, this is the structure that describes an Inotify event:

# include/uapi/linux/inotify.h
...
 14 /*
 15  * struct inotify_event - structure read from the inotify device for each event
 16  *
 17  * When you are watching a directory, you will receive the filename for events
 18  * such as IN_CREATE, IN_DELETE, IN_OPEN, IN_CLOSE, ..., relative to the wd.
 19  */
 20 struct inotify_event {
 21         __s32           wd;             /* watch descriptor */
 22         __u32           mask;           /* watch mask */
 23         __u32           cookie;         /* cookie to synchronize two events */
 24         __u32           len;            /* length (including nulls) of name */
 25         char            name[0];        /* stub for possible name */
 26 };
...

Lsof

My next choice when trying to figure out what's writing to a file is the lsof utility. lsof is great to see any kind of status information about running processes and the file descriptors they're currently holding open.

# In terminal 1 I create a cat process which holds an open file descriptor 
# to /tmp/abc and sits there waiting for input
mst@mst-nb1:~$ cat > /tmp/abc

# In terminal 2 I run lsof on /tmp/abc
mst@mst-nb1:~$ lsof /tmp/abc
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
cat     18420  mst    1w   REG  252,1        0 11010067 /tmp/abc

The example shows that lsof correctly displays that there's a cat process holding file descriptor with the number 1 in write mode and that this file descriptor refers to the file /tmp/abc.

This is great, it shows us what we want to know. But what if the process that modifies the file only runs for 0.01 seconds and then exits or closes the file again? Even when applying the maximum of my typing skills while my finger joints are at room temperature I'd still have a good chance to miss it. Due to the way lsof collects it's data there is no way to watch a file for modifications over a time span, it can only capture snapshots. It's repeat mode, when passing the -r option, doesn't help us much because it only captures multiple snapshots at an interval. Looks like we still haven't found a really good solution to our problem.

LoggedFS

Let's get a little more exotic and venture into the land of non-standard utilities. Loggedfs is a fuse based file system, that means it runs in user space, and it greatly increases the verbosity of all operations running on the file system while still letting the "real" file system do it's job underneath. Luckily the Ubuntu repositories provide it as a package under the name loggedfs, so it's easy to install. That's how that looks like:

# In terminal 1 I'm mounting Loggedfs on an existing directory while
# running it in foreground to directly see the output
mst@ubuntu:~$ loggedfs -f /tmp/testdir
09:30:27 (src/loggedfs.cpp:678) LoggedFS not running as a daemon
09:30:27 (src/loggedfs.cpp:818) LoggedFS starting at /tmp/testdir.
09:30:27 (src/loggedfs.cpp:843) chdir to /tmp/testdir

# In terminal 2 I'm writing to one of the already existing files
mst@ubuntu:~$ echo "bla" > /tmp/testdir/abc

# Back in terminal 1 Loggedfs shows me all the details about what just happened
 09:30:29 (src/loggedfs.cpp:138) getattr /tmp/testdir/abc \
    {SUCCESS} [ pid = 20428 -bash uid = 1000 ]  
 09:30:29 (src/loggedfs.cpp:138) open writeonly /tmp/testdir/abc \
    {SUCCESS} [ pid = 20428 -bash uid = 1000 ]  
 09:30:29 (src/loggedfs.cpp:138) truncate /tmp/testdir/abc to 0 bytes \
    {SUCCESS} [ pid = 20428 -bash uid = 1000 ]  
 09:30:29 (src/loggedfs.cpp:138) getattr /tmp/testdir/abc \
    {SUCCESS} [ pid = 20428 -bash uid = 1000 ]  
 09:30:29 (src/loggedfs.cpp:138) write 4 bytes to /tmp/testdir/abc at offset 0 \
    {SUCCESS} [ pid = 20428 -bash uid = 1000 ]  
 09:30:29 (src/loggedfs.cpp:138) 4 bytes written to /tmp/testdir/abc at offset 0 \
    {SUCCESS} [ pid = 20428 -bash uid = 1000 ]

This is great, the output shows us exactly which process has written to our file and we could hardly wish for more details. But there are still some things to consider:

Mounting only works with an entire directory, it does not work on single files, so exclude and include patterns have to be used to specify which file exactly needs to be monitored.
Depending on company policy it might be problematic to install non-default packages on production systems just for troubleshooting.
Some people might dislike the idea of having to run fuse on a production box.

With LoggedFS we've found a really nice and workable solution which tells us what we wanted to know. But actually the Kernel already has all the information that we want, so why would we have to run something else on top of it to collect the same data a second time? Let's see if by digging down into the Kernel we can't find some mechanism that doesn't require external packages or user space tools and still provides us with all the information we are looking for.

SystemTap

When starting to look at the more advanced kernel features it gets interesting, I'll start with SystemTap. SystemTap is a utility that originally comes from RedHat, but it's Open Source and usable on all major distros. It makes it easy to generate simple kernel modules which can be used to inspect what's happening in the kernel by creating event handlers for various types of events.

Besides the ability to attach event handlers to most kernel functions which are exported as symbols, SystemTap comes with a large collection of "tapsets". Tapsets are predefined scripts and events that can be used in custom tap scripts. One of those available events is called vfs.write. vfs stands for Virtual File System and it's basically an abstraction layer on top of the actual file system implementation to handle generic operations like opening, closing, writing and reading files so they don't need to be reimplemented by each file system implementation again and again.

Once SystemTap is installed, the common way to use it is by writing script files with the mime type stp which are then read by the stap utility. This utility translates them into C code, compiles them as a kernel module and then loads them into the kernel. The structure of the scripts is usually rather simple, they can set probes on events and they define functions to make code reusable.

Identifying what we're looking for

In our case we want to monitor the write operations on a given file, so first we'll need to figure out how to identify a specific file in a vfs.write event. When querying the stap utility for the available variables provided by the event vfs.write we can see that there is a file variable.

root@ubuntu:~# stap -L vfs.write
vfs.write file:long pos:long buf:long bytes_to_write:long dev:long devname:string \
    ino:long name:string argstr:string $file:struct file* $buf:char const* \
    $count:size_t $pos:loff_t*

Alternatively we can also just look at the definition of vfs.write to get even more details.

# /usr/local/systemtap/share/systemtap/tapset/linux/vfs.stp
...
 950 probe vfs.write = kernel.function("vfs_write")
 951 {
 952         file = $file
 953         pos = $pos
 954         buf = $buf
 955         bytes_to_write = $count
 956         dev = __file_dev($file)
 957         devname = __find_bdevname(dev, __file_bdev($file))
 958         ino = __file_ino($file)
 959 
 960         name = "vfs.write"
 961         argstr = sprintf("%d, %d, %p", $count, $pos, $buf)
 962 }
...

But how to uniquely identify a file based on the available variables? Each file on the file system is associated with an inode number and each inode number is unique within it's file system, so within a file system we can use the inode number to identify a file. To identify the file system we can use the associated device's major and minor numbers, this means that the combination of ma/mi numbers plus the inode number should be unique in the whole system and give us a reliable identifier that we can use to identify any file.

In the kernel the major and minor numbers are merged using a bit shift operation and a logical or, the result is a device identifier which the vfs.write event provides as variable dev. The inode number is available as variable ino. One fact that will be important in a later chapter is that all of that data gets extracted from the file variable by the two functions __file_dev and __file_ino.

In order to be able to look for a certain value in the dev variable we need to reproduce the merge of major and minor numbers that the kernel does internally, luckily SystemTap comes with a function called MKDEV which does exactly that.

Let's try that!

Now we have all the information we need to write our stap script:

# write.stp
#
# monitor writes on file
probe vfs.write
{
    if (dev == MKDEV($1,$2) && ino == $3) {
       printf("got a write from %s (%d)\n", execname(), pid());
    }
}

The $1, $2 and $3 are arguments which must be provided on the command line when running the script.

root@ubuntu:~# touch /tmp/testfile

# get inode number of testfile
root@ubuntu:~# ls -i /tmp/testfile
14 /tmp/testfile

# getting the major and minor numbers of the file system that's
# currently mounted on /, because that's where /tmp/testfile resides.
root@ubuntu:~# sed -rne 's|^[0-9]+ [0-9]+ ([0-9:]+) / / .*$|\1|p' /proc/self/mountinfo
252:0

# pass major, minor and inode numbers to stap script
root@ubuntu:~# stap ./write.stp 252 0 14

# on terminal 2 we write something into the testfile
root@ubuntu:~# echo "test" > /tmp/testfile

# back on terminal 1 we're getting output:
root@ubuntu:~# stap ./write.stp 252 0 14
got a write from bash (868)

Great, we're monitoring the file for write operations and for each occurring operation we can get the process name and pid of the process that's writing! SystemTap, with all the tapsets that it ships with, really makes investigating kernel internals relatively easy and this by itself is quite a feature. It also doesn't require you to install it's entire tool chain on every system that you want to use it on because after SystemTap built your kernel module on one machine you can copy the module to every other machine with a compatible kernel and load it there. Unfortunately it can currently still be slightly troublesome to set the tool chain up on Debian based distributions like Ubuntu and it may require a small amount of hacking to make it work, on RedHat it should work out of the box though.

Seems we've found another pretty good solution for our problem, but I wish there was an easy way to achieve our goal of monitoring the writes to a file without having to load a custom kernel module on a production server... Let's try Auditd instead.

Auditd

Just like SystemTap, Auditd originated from RedHat but is Open Source and usable on other distributions too. It consists of several parts: The data collection happens inside the Kernel via a bunch of hooks placed at critical locations, then the collected data is first filtered through a set of filters that live inside the Kernel space, the resulting data is sent to a user space daemon (auditd) via a netlink socket (man 7 netlink). auditd then communicates with other user space processes which want to access the collected data or modify filter rules, it also maintains log files containing the collected data for other tools to read. If user space processes submit commands to modify the filter rules to auditd it uses the same netlink socket to configure the in-kernel data filters. The RedHat documentation illustrates that architecture nicely.

The easiest way to setup filter rules is by using the CLI utility auditctl, which comes together with auditd. When reading man 8 auditctl it becomes obvious that this is a quite mature and powerful tool to configure a wide range of settings and filters, but let's first focus on our rather simple task of watching writes on a file. Here are the relevant parts of auditctl's man page:

...
 -w path
    Insert a watch for the file system object at path. You cannot insert a watch to  the
    top  level  directory. This is prohibited by the kernel. Wildcards are not supported
    either and will generate a warning. The way that watches work  is  by  tracking  the
    inode  internally. If you place a watch on a file, its the same as using the -F path
    option on a syscall rule. If you place a watch on a directory, its the same as using
    the -F dir option on a syscall rule. The -w form of writing watches is for backwards
    compatibility and the syscall based form is more  expressive.  Unlike  most  syscall
    auditing  rules, watches do not impact performance based on the number of rules sent
    to the kernel. The only valid options when using a watch are the -p and -k.  If  you
    need  to  anything  fancy  like audit a specific user accessing a file, then use the
    syscall auditing form with the path or dir fields. See the EXAMPLES section  for  an
    example of converting one form to another.

...
 -k key Set a filter key on an audit rule. The filter key is an  arbitrary  string  of  text
    that can be up to 31 bytes long. It can uniquely identify the audit records produced
    by a rule. Typical use is for when you have several rules that  together  satisfy  a
    security requirement. The key value can be searched on with ausearch so that no mat‐
    ter which rule triggered the event, you can find its results. The key  can  also  be
    used on delete all (-D) and list rules (-l) to select rules with a specific key. You
    may have more than one key on a rule if you want to be able to search logged  events
    in  multiple ways or if you have an audispd plugin that uses a key to aid its analy‐
    sis.
...
 -p [r|w|x|a]
    Describe the permission access type that  a  file  system  watch  will  trigger  on.
    r=read,  w=write, x=execute, a=attribute change. These permissions are not the stan‐
    dard file permissions, but rather the kind of syscall that would  do  this  kind  of
    thing.  The  read  & write syscalls are omitted from this set since they would over‐
    whelm the logs. But rather for reads or writes, the open flags are looked at to  see
    what permission was requested.
...

Let's try that with the file /tmp/testfile that we've already used in the SystemTap test. First I'll set up a rule that places a watcher on writes to that file, then I'll write into it, and finally I'll use the ausearch utility to search through the audit log files for the key I've defined in the watcher rule. The ausearch utility gets delivered as part of the auditd package.

# Create a rule to watch writes on /tmp/testfile and mark them
# with the key "testfile"
mst@mst-nb1:~$ sudo auditctl -w /tmp/testfile -p w -k testfile

# Verify that the rule has been added
mst@mst-nb1:~$ sudo auditctl -l
-w /tmp/testfile -p w -k testfile

# Write into the test file
mst@mst-nb1:~$ echo "test" >> /tmp/testfile

# Search log files for the key "testfile"
mst@mst-nb1:~$ sudo ausearch -k testfile
----
time->Sat Feb  6 15:52:28 2016
type=CONFIG_CHANGE msg=audit(1454745148.692:36790): auid=4294967295 ses=4294967295 \
    op="add_rule" key="testfile" list=4 res=1
----
time->Sat Feb  6 15:52:36 2016
type=PROCTITLE msg=audit(1454745156.808:36794): proctitle="bash"
type=PATH msg=audit(1454745156.808:36794): item=1 name="/tmp/testfile" inode=11012788 \
    dev=fc:01 mode=0100777 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL
type=PATH msg=audit(1454745156.808:36794): item=0 name="/tmp/" inode=11010049 \
    dev=fc:01 mode=041777 ouid=0 ogid=0 rdev=00:00 nametype=PARENT
type=CWD msg=audit(1454745156.808:36794):  cwd="/home/mst"
type=SYSCALL msg=audit(1454745156.808:36794): arch=c000003e syscall=2 success=yes \
    exit=3 a0=2298f88 a1=441 a2=1b6 a3=7ffe7bb3eb48 items=2 ppid=30721 pid=32347 \
    auid=4294967295 uid=1000 gid=1000 euid=1000 suid=1000 fsuid=1000 egid=1000 \
    sgid=1000 fsgid=1000 tty=pts2 ses=4294967295 comm="bash" exe="/bin/bash" \
    key="testfile"

Nice, we got exactly the output that we were looking for again! The auditd way of creating those watcher rules is really amazingly powerful, user friendly and easy to use. Personally I think that the auditd utility should be at the front of every operations guy's utility belt, somewhere between tcpdump and strace.

I could be satisfied with all the already discovered solutions here, but I'm still curious if there isn't any way to get that data without having to use any user space utilities and just directly surgically extract them from the Kernel. That's where Ftrace and Kprobe come in.

Ftrace & Kprobe

Ftrace is a Linux Kernel internal tracer that captures information about the current context when certain events occur and conditions are met, it exports it's control interfaces via a virtual file system called tracefs which is usually mounted in the directory tracing of the debugfs file system. By default most distributions mount debugfs on /sys/kernel/debug so Ftrace's files are in /sys/kernel/debug/tracing. From now on, when I refer to paths of tracefs files, their paths should always be interpreted as relative to the mount point of tracefs. When browsing through that directory it might at first seem overwhelming due to the large number of files and directories, but luckily the kernel comes with a really great documentation.

Basic event tracking with Ftrace

The list of available tracers is in the file /available_tracers, but since we only want some basic information and do not need any advanced tracing mechanisms the simple nop tracer is sufficient for us. It can be selected by writing it's name into the file /current_tracer.

root@ubuntu:~# cat /sys/kernel/debug/tracing/available_tracers 
blk mmiotrace function_graph wakeup_dl wakeup_rt wakeup function nop
root@ubuntu:~# echo -n "nop" > /sys/kernel/debug/tracing/current_tracer

Before we start capturing anything let's first make sure the tracing is disabled and then clear the current trace output, that way we won't have any other output cluttering our trace output. The tracing can be enabled/disabled by writing 1/0 into the file /tracing_on. The trace output can be cleared by truncating the file /trace, which will later contain our trace output.

root@ubuntu:~# echo -n "0" > /sys/kernel/debug/tracing/tracing_on 
root@ubuntu:~# echo -n "" > /sys/kernel/debug/tracing/trace

Ftrace knows a large number of events and it allows us to choose which events we're interested in. The events are shown to the user in two ways, as a directory hierarchy in the directory /events and as a long list in the file /available_events. Because I like the hierarchic structure that the /events directory uses I'll stick to that one for now.

On the first level of directories in /events there is one sub directory per type of event, since we're looking for write operations I'll first focus on the system calls in /events/syscalls. Doing an ls on /events/syscalls shows that for each system call there is one event when entering the call and one when exiting it, so let's look at the content of /events/syscalls/sys_enter_write.

root@ubuntu:/sys/kernel/debug/tracing# ls -l ./events/syscalls/ 
total 0
-rw-r--r-- 1 root root 0 Jan 26 13:47 enable
-rw-r--r-- 1 root root 0 Jan 26 13:47 filter
drwxr-xr-x 2 root root 0 Jan 26 13:47 sys_enter_accept
drwxr-xr-x 2 root root 0 Jan 26 13:47 sys_enter_accept4
drwxr-xr-x 2 root root 0 Jan 26 13:47 sys_enter_access
...
root@ubuntu:/sys/kernel/debug/tracing# ls -l ./events/syscalls/sys_enter_write
total 0
-rw-r--r-- 1 root root 0 Jan 26 13:47 enable
-rw-r--r-- 1 root root 0 Jan 26 13:47 filter
-r--r--r-- 1 root root 0 Jan 26 13:47 format
-r--r--r-- 1 root root 0 Jan 26 13:47 id
-rw-r--r-- 1 root root 0 Jan 26 13:47 trigger

Each of the event directories contains the same set of files and unsurprisingly enable is used to disable/enable the tracing of this event. Let's first disable the tracing of all events and then only enable that one that we're interested in. For convenience the upper level directories /events/syscalls and /events also contain a file enable which can be used to propagate a change through all the sub directories. The file /set_event always contains a list of the currently enabled events (and it could also be used to control the enabling/disabling of events) so we read that file to verify that now the only selected event is sys_enter_write.

root@ubuntu:/sys/kernel/debug/tracing# echo -n "0" > events/enable 
root@ubuntu:/sys/kernel/debug/tracing# echo -n "1" > events/syscalls/sys_enter_write/enable 
root@ubuntu:/sys/kernel/debug/tracing# cat set_event 
syscalls:sys_enter_write

Seems we're ready to start tracing, but we don't want to get flooded with data, so we'll only enable the tracing for about 5 seconds. Then we should be able to see the output of the trace in the file /trace and depending on how busy your system is there might be quite a lot of data.

root@ubuntu:/sys/kernel/debug/tracing# echo -n "1" > tracing_on; sleep 5; echo -n "0" > tracing_on
root@ubuntu:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 108/108   #P:1
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            sshd-1261  [000] ....  3360.093713: sys_write(fd: 3, buf: 7f7e25a22de0, count: 24)
            sshd-1198  [000] ....  3360.699635: sys_write(fd: a, buf: 7f568d193d90, count: 1)
            bash-1284  [000] ....  3360.699682: sys_write(fd: 2, buf: 76c008, count: 1)
            sshd-1198  [000] ....  3360.699699: sys_write(fd: 3, buf: 7f568d178680, count: 24)
            sshd-1198  [000] ....  3360.778897: sys_write(fd: a, buf: 7f568d193d90, count: 1)
            bash-1284  [000] ....  3360.778941: sys_write(fd: 2, buf: 76c008, count: 1)

Cool, so we get some data there! Unfortunately we don't know what that is, because the only file identifier given is the fd for file descriptor. Since file descriptors are only unique in process context we can't use them to filter for the specific file we're interested in because we don't know which process is writing to that file, that's what we're trying to find out! To solve that problem we can use Kprobe to define custom events and variables to filter by.

Custom events with Kprobe

Kprobes work by placing breakpoint instructions at specific places in the kernel and then collect data associated to the current context, if you want to know all the tiny details about how it works then the kernel documentation shall satisfy your curiosity. Defined Kprobes can be used by Ftrace just like any other event and they're controlled using the same files as described above, but the important feature that we need right now is that Kprobe allows us to dereference function arguments in order to later filter by members of structures in those arguments. This will help us to filter out events on a specific file.

Due to the flexibility of Kprobe we can get information about basically anything that happens in the kernel, we just need to know what we're looking for and how to filter it out of all the other stuff happening. Let's revert back to the function vfs_write that we've already used with SystemTap to monitor writes. If SystemTap is able to retrieve the criteria we require from the function arguments of vfs_write, then there should be some way how we can do the same with Kprobe if we just figure out where that data resides.

Looking at the kernel code, we can see in the function definition of vfs_write that the file variable is of type struct file.

# fs/read_write.c:547
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)

You might remember from the SystemTap example above that the vfs.write tap extracted all the information about the inode number and the associated device's major and minor numbers from the file argument, so we need to inspect the kernel code that defines the involved structures to figure out where that data is.

# include/linux/fs.h
...
 584 struct inode {
 585         umode_t                 i_mode;
 586         unsigned short          i_opflags;
 587         kuid_t                  i_uid;
 588         kgid_t                  i_gid;
 589         unsigned int            i_flags;
 590 
 591 #ifdef CONFIG_FS_POSIX_ACL
 592         struct posix_acl        *i_acl;
 593         struct posix_acl        *i_default_acl;
 594 #endif
 595 
 596         const struct inode_operations   *i_op;
 597         struct super_block      *i_sb;
 598         struct address_space    *i_mapping;
 599 
 600 #ifdef CONFIG_SECURITY
 601         void                    *i_security;
 602 #endif
 603 
 604         /* Stat data, not accessed from path walking */
 605         unsigned long           i_ino;
...
 839 struct file {
 840         union {
 841                 struct llist_node       fu_llist;
 842                 struct rcu_head         fu_rcuhead;
 843         } f_u;
 844         struct path             f_path;
 845         struct inode            *f_inode;       /* cached value */
...
1289 struct super_block {
1290         struct list_head        s_list;         /* Keep this first */
1291         dev_t                   s_dev;          /* search index; _not_ kdev_t */
...

What we see in those struct definitions is that the inode number is available from file->f_inode->i_ino as a plain unsigned long and the major and minor numbers are accessible as file->f_inode->i_sb->s_dev as type dev_t. The kernel uses macros to convert dev_t typed variables into major and minor numbers and back, we'll come back to that in a minute.

Defining the filter criteria by using custom variables

The kprobetrace documentation comes with a section describing the syntax used to define kprobe events and their custom variables via tracefs.

 23 Synopsis of kprobe_events
 24 -------------------------
 25   p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS]  : Set a probe
 26   r[:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS]             : Set a return probe
 27   -:[GRP/]EVENT                                         : Clear a probe
 28 
 29  GRP            : Group name. If omitted, use "kprobes" for it.
 30  EVENT          : Event name. If omitted, the event name is generated
 31                   based on SYM+offs or MEMADDR.
 32  MOD            : Module name which has given SYM.
 33  SYM[+offs]     : Symbol+offset where the probe is inserted.
 34  MEMADDR        : Address where the probe is inserted.
 35 
 36  FETCHARGS      : Arguments. Each probe can have up to 128 args.
 37   %REG          : Fetch register REG
 38   @ADDR         : Fetch memory at ADDR (ADDR should be in kernel)
 39   @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
 40   $stackN       : Fetch Nth entry of stack (N >= 0)
 41   $stack        : Fetch stack address.
 42   $retval       : Fetch return value.(*)
 43   +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**)
 44   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
 45   FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
 46                   (u8/u16/u32/u64/s8/s16/s32/s64), "string" and bitfield
 47                   are supported.
 48 
 49   (*) only for return probe.
 50   (**) this is useful for fetching a field of data structures.

Since we are going to have to access nested members of struct file we'll need to know their offsets relative to their containing structs. The C standard library has a really useful macro called offsetof which can tell us exactly that, but since code running in the kernel can't access the standard library we'll need to use the kernel's own implementation of offsetof which is in include/linux/stddef.h. Obviously, if you're less lazy than I am, then it's also possible to just do the offset calculations in your head. But the lazy-people's way to get them is by writing a super minimalist kernel module that prints the offsets into dmesg as debug messages:

  1 #include <linux/module.h>
  2 #include <linux/fs.h>
  3 #include <linux/kernel.h>
  4 #include <linux/init.h>
  5 #include <linux/printk.h>
  6 #include <linux/stddef.h>
  7
  8 static int __init get_offset_init(void)
  9 {
 10         pr_debug("offset of f_inode in struct file is %ld", offsetof(struct file, f_inode));
 11         pr_debug("offset of i_ino in struct inode is %ld", offsetof(struct inode, i_ino));
 12         pr_debug("offset of i_sb in struct inode is %ld", offsetof(struct inode, i_sb));
 13         pr_debug("offset of s_dev in struct super_block is %ld", offsetof(struct super_block, s_dev));
 14 
 15         return 0;
 16 }
 17 
 18 static void __exit get_offset_cleanup(void)
 19 {
 20 }
 21 
 22 module_init(get_offset_init);
 23 module_exit(get_offset_cleanup);

Once built and loaded it shows us what we need to know in the kernel's debug output.

root@ubuntu:~/get_offsetof# insmod ./offsetof.ko
root@ubuntu:~/get_offsetof# dmesg
[19280.887103] offset of f_inode in struct file is 32
[19280.887106] offset of i_ino in struct inode is 64
[19280.887108] offset of i_sb in struct inode is 40
[19280.887109] offset of s_dev in struct super_block is 16

Since we are going to calculate the offsets relative to the pointer value of the first function argument (struct file *file) the address to add the offsets to should be in the register %di, just as this StackOverflow post nicely describes.

Using the above quoted syntax documentation we can now define custom variables for our Kprobe event that extract i_ino and s_dev from struct file by writing it's definition into the file kprobe_events.

# inode is at file->f_inode->i_ino with offsets:
#   f_inode: 32
#   i_ino: 64
# inode=+64(+32(%di)):u64
#
# device is at file->f_inode->i_sb->s_dev with offsets:
#   f_inode: 32
#   i_sb: 40
#   s_dev: 16
# dev=+16(+40(+32(%di))):u32
#
# Defining the kprobe event and the custom variables by
# adding the above calculated offsets to the address
# in %di
root@mst-nb1:/sys/kernel/debug/tracing# \
    echo "p:kprobes/vfs_write vfs_write inode=+64(+32(%di)):u64 dev=+16(+40(+32(%di))):u32" \
    > ./kprobe_events

# The defined event appears in the /events folder and it's
# format file shows the custom variables
root@mst-nb1:/sys/kernel/debug/tracing# cat ./events/kprobes/vfs_write/format 
name: vfs_write
ID: 1914
format:
 field:unsigned short common_type; offset:0; size:2; signed:0;
 field:unsigned char common_flags; offset:2; size:1; signed:0;
 field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
 field:int common_pid; offset:4; size:4; signed:1;

 field:unsigned long __probe_ip; offset:8; size:8; signed:0;
 field:u64 inode; offset:16; size:8; signed:0;
 field:u32 dev; offset:24; size:4; signed:0;

print fmt: "(%lx) inode=0x%Lx dev=0x%x", REC->__probe_ip, REC->inode, REC->dev

Testing!

First we create a test file and get it's inode number.

root@mst-nb1:~# touch /tmp/testfile

# get the inode number of the test file
root@mst-nb1:~# ls -i /tmp/testfile
11014706 /tmp/testfile

Next we need to calculate the value that dev_t should have. The kernel uses the following macro to merge the major and minor numbers into a variable of type dev_t.

# include/linux/kdev_t.h
...
  6 #define MINORBITS       20
  7 #define MINORMASK       ((1U << MINORBITS) - 1)
  8 
  9 #define MAJOR(dev)      ((unsigned int) ((dev) >> MINORBITS))
 10 #define MINOR(dev)      ((unsigned int) ((dev) & MINORMASK))
 11 #define MKDEV(ma,mi)    (((ma) << MINORBITS) | (mi))
...

Well that bit shift should be simple enough to reimplement, so let's shift some bits!

I'll use Python to make those examples shorter:

# Getting the major and minor numbers of the file system that's
# currently mounted on /, because that's where the file
# to monitor resides
mst@mst-nb1:~$ sed -rne 's|^[0-9]+ [0-9]+ ([0-9:]+) / / .*$|\1|p' /proc/self/mountinfo
252:1

# Reimplementing the kernel MKDEV macro with the input of 252 and 1,
# the result should be what we need to look for as dev_t
mst@mst-nb1:~$ python -c 'print(252 << 20 | 1)'
264241153

Next we create the filter for our inode number and dev_t value.

# Filter inode number 11014706 and a dev_t value 264241153
root@mst-nb1:/sys/kernel/debug/tracing# \
    echo "inode==11014706 && dev==264241153" > events/kprobes/vfs_write/filter

# Enable tracing
root@mst-nb1:/sys/kernel/debug/tracing# echo 1 > events/kprobes/vfs_write/enable 
root@mst-nb1:/sys/kernel/debug/tracing# echo 1 > tracing_on

# On terminal 2 I write into /tmp/testfile
root@mst-nb1:~# python -c 'fd = open("/tmp/testfile", "w"); fd.write("test"); fd.close()'

# Back on terminal 1 we check the content of trace
root@mst-nb1:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
          python-6559  [004] d... 1406152.334025: vfs_write: \
            (vfs_write+0x0/0x1a0) inode=0xa81232 dev=0xfc00001

There's exactly one line showing our Python processes calling vfs_write and we didn't have to taint the kernel with custom modules or install any non-standard tools to get to that result!

Conclusion

There are many answers to the above question, some might be better and some might be less good, but the only way to see the trade offs is to be familiar with all the concepts. Obviously the described tools can also be used for more than only monitoring write operations, that was only a basic example.

Writing this article was a great exercise for me and I'd be happy about feedback with suggestions for improvements or other solutions.

2016/01/30

Monitoring a file to get details about write operations