Today I've been answering questions on unix.stackexchange.com, when I've met one which at first sight seemed easy to answer, but after rethinking it a few times I realized that actually I didn't know the proper answer (That's the question in question). This made me investigate and led me down a rabbit hole which consumed most of this afternoon, and the next one, and some more time. The question asks about how to figure out which process is writing to a certain log file. The person asking can see that something is writing to it by reading the file, but she doesn't know which process is writing it. Obviously there are some mechanisms in Syslog which could solve that problem if the file is written by Syslog, but out of curiosity I wanted to know if there's a generic mechanism that works for any file, so I'm going to go through a few possible solutions without concluding on a best one.
Inotify
One of the first things that came to my mind when I read the question was Inotify. Inotify is a mechanism that's provided by the Linux kernel via a set of system calls, it can be used to set watchers on files and directories. If one of the watched files gets modified in a way that the watcher is waiting for then an event gets passed to the user space process. This is great, I can use it to put a watcher on a file and make it wait for anything writing to that file. There are many different wrappers around the Inotify API, one of my favorite ones is a CLI utility called
inotifywatch
which comes with the package inotify-tools
.
mst@mst-nb1:~$ inotifywait 123
Setting up watches.
Watches established. # now i write to "123" in another terminal
123 MODIFY
Ok, so we got notified that there was a write event on the file, but we still don't know what has written to it. Unfortunately the Inotify API does not give any hints regarding which process has done a given modification on a file. Considering that the reading of the events in user space happens asynchronously this makes sense, because at the time the event is processed by the user space application the writing process might already have exited. The alternative would be to copy a lot of information about the process into the Inotify event structure, but this would increase it's size drastically and would hence decrease the performance.
For reference, this is the structure that describes an Inotify event:
# include/uapi/linux/inotify.h
...
14 /*
15 * struct inotify_event - structure read from the inotify device for each event
16 *
17 * When you are watching a directory, you will receive the filename for events
18 * such as IN_CREATE, IN_DELETE, IN_OPEN, IN_CLOSE, ..., relative to the wd.
19 */
20 struct inotify_event {
21 __s32 wd; /* watch descriptor */
22 __u32 mask; /* watch mask */
23 __u32 cookie; /* cookie to synchronize two events */
24 __u32 len; /* length (including nulls) of name */
25 char name[0]; /* stub for possible name */
26 };
...
Lsof
My next choice when trying to figure out what's writing to a file is the
lsof
utility. lsof
is great to see any kind of status information about running processes and the file descriptors they're currently holding open.
# In terminal 1 I create a cat process which holds an open file descriptor
# to /tmp/abc and sits there waiting for input
mst@mst-nb1:~$ cat > /tmp/abc
# In terminal 2 I run lsof on /tmp/abc
mst@mst-nb1:~$ lsof /tmp/abc
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cat 18420 mst 1w REG 252,1 0 11010067 /tmp/abc
The example shows that
lsof
correctly displays that there's a cat
process holding file descriptor with the number 1
in write mode and that this file descriptor refers to the file /tmp/abc
.
This is great, it shows us what we want to know. But what if the process that modifies the file only runs for 0.01 seconds and then exits or closes the file again? Even when applying the maximum of my typing skills while my finger joints are at room temperature I'd still have a good chance to miss it. Due to the way
lsof
collects it's data there is no way to watch a file for modifications over a time span, it can only capture snapshots. It's repeat mode, when passing the -r
option, doesn't help us much because it only captures multiple snapshots at an interval. Looks like we still haven't found a really good solution to our problem.
LoggedFS
Let's get a little more exotic and venture into the land of non-standard utilities. Loggedfs is a fuse based file system, that means it runs in user space, and it greatly increases the verbosity of all operations running on the file system while still letting the "real" file system do it's job underneath. Luckily the Ubuntu repositories provide it as a package under the name
loggedfs
, so it's easy to install. That's how that looks like:
# In terminal 1 I'm mounting Loggedfs on an existing directory while
# running it in foreground to directly see the output
mst@ubuntu:~$ loggedfs -f /tmp/testdir
09:30:27 (src/loggedfs.cpp:678) LoggedFS not running as a daemon
09:30:27 (src/loggedfs.cpp:818) LoggedFS starting at /tmp/testdir.
09:30:27 (src/loggedfs.cpp:843) chdir to /tmp/testdir
# In terminal 2 I'm writing to one of the already existing files
mst@ubuntu:~$ echo "bla" > /tmp/testdir/abc
# Back in terminal 1 Loggedfs shows me all the details about what just happened
09:30:29 (src/loggedfs.cpp:138) getattr /tmp/testdir/abc \
{SUCCESS} [ pid = 20428 -bash uid = 1000 ]
09:30:29 (src/loggedfs.cpp:138) open writeonly /tmp/testdir/abc \
{SUCCESS} [ pid = 20428 -bash uid = 1000 ]
09:30:29 (src/loggedfs.cpp:138) truncate /tmp/testdir/abc to 0 bytes \
{SUCCESS} [ pid = 20428 -bash uid = 1000 ]
09:30:29 (src/loggedfs.cpp:138) getattr /tmp/testdir/abc \
{SUCCESS} [ pid = 20428 -bash uid = 1000 ]
09:30:29 (src/loggedfs.cpp:138) write 4 bytes to /tmp/testdir/abc at offset 0 \
{SUCCESS} [ pid = 20428 -bash uid = 1000 ]
09:30:29 (src/loggedfs.cpp:138) 4 bytes written to /tmp/testdir/abc at offset 0 \
{SUCCESS} [ pid = 20428 -bash uid = 1000 ]
This is great, the output shows us exactly which process has written to our file and we could hardly wish for more details. But there are still some things to consider:
- Mounting only works with an entire directory, it does not work on single files, so exclude and include patterns have to be used to specify which file exactly needs to be monitored.
- Depending on company policy it might be problematic to install non-default packages on production systems just for troubleshooting.
- Some people might dislike the idea of having to run fuse on a production box.
With LoggedFS we've found a really nice and workable solution which tells us what we wanted to know. But actually the Kernel already has all the information that we want, so why would we have to run something else on top of it to collect the same data a second time? Let's see if by digging down into the Kernel we can't find some mechanism that doesn't require external packages or user space tools and still provides us with all the information we are looking for.
SystemTap
When starting to look at the more advanced kernel features it gets interesting, I'll start with SystemTap. SystemTap is a utility that originally comes from RedHat, but it's Open Source and usable on all major distros. It makes it easy to generate simple kernel modules which can be used to inspect what's happening in the kernel by creating event handlers for various types of events.
Besides the ability to attach event handlers to most kernel functions which are exported as symbols, SystemTap comes with a large collection of "tapsets". Tapsets are predefined scripts and events that can be used in custom tap scripts. One of those available events is called
vfs.write
. vfs
stands for Virtual File System and it's basically an abstraction layer on top of the actual file system implementation to handle generic operations like opening, closing, writing and reading files so they don't need to be reimplemented by each file system implementation again and again.
Once SystemTap is installed, the common way to use it is by writing script files with the mime type
stp
which are then read by the stap
utility. This utility translates them into C code, compiles them as a kernel module and then loads them into the kernel. The structure of the scripts is usually rather simple, they can set probes on events and they define functions to make code reusable.
Identifying what we're looking for
In our case we want to monitor the
write
operations on a given file, so first we'll need to figure out how to identify a specific file in a vfs.write
event. When querying the stap
utility for the available variables provided by the event vfs.write
we can see that there is a file
variable.
root@ubuntu:~# stap -L vfs.write
vfs.write file:long pos:long buf:long bytes_to_write:long dev:long devname:string \
ino:long name:string argstr:string $file:struct file* $buf:char const* \
$count:size_t $pos:loff_t*
Alternatively we can also just look at the definition of
vfs.write
to get even more details.
# /usr/local/systemtap/share/systemtap/tapset/linux/vfs.stp
...
950 probe vfs.write = kernel.function("vfs_write")
951 {
952 file = $file
953 pos = $pos
954 buf = $buf
955 bytes_to_write = $count
956 dev = __file_dev($file)
957 devname = __find_bdevname(dev, __file_bdev($file))
958 ino = __file_ino($file)
959
960 name = "vfs.write"
961 argstr = sprintf("%d, %d, %p", $count, $pos, $buf)
962 }
...
But how to uniquely identify a file based on the available variables? Each file on the file system is associated with an inode number and each inode number is unique within it's file system, so within a file system we can use the inode number to identify a file. To identify the file system we can use the associated device's major and minor numbers, this means that the combination of ma/mi numbers plus the inode number should be unique in the whole system and give us a reliable identifier that we can use to identify any file.
In the kernel the major and minor numbers are merged using a bit shift operation and a logical
or
, the result is a device identifier which the vfs.write
event provides as variable dev
. The inode number is available as variable ino
. One fact that will be important in a later chapter is that all of that data gets extracted from the file
variable by the two functions __file_dev
and __file_ino
.
In order to be able to look for a certain value in the
dev
variable we need to reproduce the merge of major and minor numbers that the kernel does internally, luckily SystemTap comes with a function called MKDEV
which does exactly that.
Let's try that!
Now we have all the information we need to write our
stap
script:
# write.stp
#
# monitor writes on file
probe vfs.write
{
if (dev == MKDEV($1,$2) && ino == $3) {
printf("got a write from %s (%d)\n", execname(), pid());
}
}
The
$1
, $2
and $3
are arguments which must be provided on the command line when running the script.
root@ubuntu:~# touch /tmp/testfile
# get inode number of testfile
root@ubuntu:~# ls -i /tmp/testfile
14 /tmp/testfile
# getting the major and minor numbers of the file system that's
# currently mounted on /, because that's where /tmp/testfile resides.
root@ubuntu:~# sed -rne 's|^[0-9]+ [0-9]+ ([0-9:]+) / / .*$|\1|p' /proc/self/mountinfo
252:0
# pass major, minor and inode numbers to stap script
root@ubuntu:~# stap ./write.stp 252 0 14
# on terminal 2 we write something into the testfile
root@ubuntu:~# echo "test" > /tmp/testfile
# back on terminal 1 we're getting output:
root@ubuntu:~# stap ./write.stp 252 0 14
got a write from bash (868)
Great, we're monitoring the file for write operations and for each occurring operation we can get the process name and pid of the process that's writing! SystemTap, with all the tapsets that it ships with, really makes investigating kernel internals relatively easy and this by itself is quite a feature. It also doesn't require you to install it's entire tool chain on every system that you want to use it on because after SystemTap built your kernel module on one machine you can copy the module to every other machine with a compatible kernel and load it there. Unfortunately it can currently still be slightly troublesome to set the tool chain up on Debian based distributions like Ubuntu and it may require a small amount of hacking to make it work, on RedHat it should work out of the box though.
Seems we've found another pretty good solution for our problem, but I wish there was an easy way to achieve our goal of monitoring the writes to a file without having to load a custom kernel module on a production server... Let's try Auditd instead.
Auditd
Just like SystemTap, Auditd originated from RedHat but is Open Source and usable on other distributions too. It consists of several parts: The data collection happens inside the Kernel via a bunch of hooks placed at critical locations, then the collected data is first filtered through a set of filters that live inside the Kernel space, the resulting data is sent to a user space daemon (
auditd
) via a netlink
socket (man 7 netlink
). auditd
then communicates with other user space processes which want to access the collected data or modify filter rules, it also maintains log files containing the collected data for other tools to read. If user space processes submit commands to modify the filter rules to auditd
it uses the same netlink
socket to configure the in-kernel data filters. The RedHat documentation illustrates that architecture nicely.
The easiest way to setup filter rules is by using the CLI utility
auditctl
, which comes together with auditd
. When reading man 8 auditctl
it becomes obvious that this is a quite mature and powerful tool to configure a wide range of settings and filters, but let's first focus on our rather simple task of watching writes on a file. Here are the relevant parts of auditctl
's man page:
...
-w path
Insert a watch for the file system object at path. You cannot insert a watch to the
top level directory. This is prohibited by the kernel. Wildcards are not supported
either and will generate a warning. The way that watches work is by tracking the
inode internally. If you place a watch on a file, its the same as using the -F path
option on a syscall rule. If you place a watch on a directory, its the same as using
the -F dir option on a syscall rule. The -w form of writing watches is for backwards
compatibility and the syscall based form is more expressive. Unlike most syscall
auditing rules, watches do not impact performance based on the number of rules sent
to the kernel. The only valid options when using a watch are the -p and -k. If you
need to anything fancy like audit a specific user accessing a file, then use the
syscall auditing form with the path or dir fields. See the EXAMPLES section for an
example of converting one form to another.
...
-k key Set a filter key on an audit rule. The filter key is an arbitrary string of text
that can be up to 31 bytes long. It can uniquely identify the audit records produced
by a rule. Typical use is for when you have several rules that together satisfy a
security requirement. The key value can be searched on with ausearch so that no mat‐
ter which rule triggered the event, you can find its results. The key can also be
used on delete all (-D) and list rules (-l) to select rules with a specific key. You
may have more than one key on a rule if you want to be able to search logged events
in multiple ways or if you have an audispd plugin that uses a key to aid its analy‐
sis.
...
-p [r|w|x|a]
Describe the permission access type that a file system watch will trigger on.
r=read, w=write, x=execute, a=attribute change. These permissions are not the stan‐
dard file permissions, but rather the kind of syscall that would do this kind of
thing. The read & write syscalls are omitted from this set since they would over‐
whelm the logs. But rather for reads or writes, the open flags are looked at to see
what permission was requested.
...
Let's try that with the file
/tmp/testfile
that we've already used in the SystemTap test. First I'll set up a rule that places a watcher on writes to that file, then I'll write into it, and finally I'll use the ausearch
utility to search through the audit log files for the key I've defined in the watcher rule. The ausearch
utility gets delivered as part of the auditd
package.
# Create a rule to watch writes on /tmp/testfile and mark them
# with the key "testfile"
mst@mst-nb1:~$ sudo auditctl -w /tmp/testfile -p w -k testfile
# Verify that the rule has been added
mst@mst-nb1:~$ sudo auditctl -l
-w /tmp/testfile -p w -k testfile
# Write into the test file
mst@mst-nb1:~$ echo "test" >> /tmp/testfile
# Search log files for the key "testfile"
mst@mst-nb1:~$ sudo ausearch -k testfile
----
time->Sat Feb 6 15:52:28 2016
type=CONFIG_CHANGE msg=audit(1454745148.692:36790): auid=4294967295 ses=4294967295 \
op="add_rule" key="testfile" list=4 res=1
----
time->Sat Feb 6 15:52:36 2016
type=PROCTITLE msg=audit(1454745156.808:36794): proctitle="bash"
type=PATH msg=audit(1454745156.808:36794): item=1 name="/tmp/testfile" inode=11012788 \
dev=fc:01 mode=0100777 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL
type=PATH msg=audit(1454745156.808:36794): item=0 name="/tmp/" inode=11010049 \
dev=fc:01 mode=041777 ouid=0 ogid=0 rdev=00:00 nametype=PARENT
type=CWD msg=audit(1454745156.808:36794): cwd="/home/mst"
type=SYSCALL msg=audit(1454745156.808:36794): arch=c000003e syscall=2 success=yes \
exit=3 a0=2298f88 a1=441 a2=1b6 a3=7ffe7bb3eb48 items=2 ppid=30721 pid=32347 \
auid=4294967295 uid=1000 gid=1000 euid=1000 suid=1000 fsuid=1000 egid=1000 \
sgid=1000 fsgid=1000 tty=pts2 ses=4294967295 comm="bash" exe="/bin/bash" \
key="testfile"
Nice, we got exactly the output that we were looking for again! The
auditd
way of creating those watcher rules is really amazingly powerful, user friendly and easy to use. Personally I think that the auditd
utility should be at the front of every operations guy's utility belt, somewhere between tcpdump
and strace
.
I could be satisfied with all the already discovered solutions here, but I'm still curious if there isn't any way to get that data without having to use any user space utilities and just directly surgically extract them from the Kernel. That's where Ftrace and Kprobe come in.
Ftrace & Kprobe
Ftrace is a Linux Kernel internal tracer that captures information about the current context when certain events occur and conditions are met, it exports it's control interfaces via a virtual file system called
tracefs
which is usually mounted in the directory tracing
of the debugfs
file system. By default most distributions mount debugfs
on /sys/kernel/debug
so Ftrace's files are in /sys/kernel/debug/tracing
. From now on, when I refer to paths of tracefs
files, their paths should always be interpreted as relative to the mount point of tracefs
. When browsing through that directory it might at first seem overwhelming due to the large number of files and directories, but luckily the kernel comes with a really great documentation.
Basic event tracking with Ftrace
The list of available tracers is in the file
/available_tracers
, but since we only want some basic information and do not need any advanced tracing mechanisms the simple nop
tracer is sufficient for us. It can be selected by writing it's name into the file /current_tracer
.
root@ubuntu:~# cat /sys/kernel/debug/tracing/available_tracers
blk mmiotrace function_graph wakeup_dl wakeup_rt wakeup function nop
root@ubuntu:~# echo -n "nop" > /sys/kernel/debug/tracing/current_tracer
Before we start capturing anything let's first make sure the tracing is disabled and then clear the current trace output, that way we won't have any other output cluttering our trace output. The tracing can be enabled/disabled by writing
1
/0
into the file /tracing_on
. The trace output can be cleared by truncating the file /trace
, which will later contain our trace output.
root@ubuntu:~# echo -n "0" > /sys/kernel/debug/tracing/tracing_on
root@ubuntu:~# echo -n "" > /sys/kernel/debug/tracing/trace
Ftrace knows a large number of events and it allows us to choose which events we're interested in. The events are shown to the user in two ways, as a directory hierarchy in the directory
/events
and as a long list in the file /available_events
. Because I like the hierarchic structure that the /events
directory uses I'll stick to that one for now.
On the first level of directories in
/events
there is one sub directory per type of event, since we're looking for write operations I'll first focus on the system calls in /events/syscalls
. Doing an ls
on /events/syscalls
shows that for each system call there is one event when entering the call and one when exiting it, so let's look at the content of /events/syscalls/sys_enter_write
.
root@ubuntu:/sys/kernel/debug/tracing# ls -l ./events/syscalls/
total 0
-rw-r--r-- 1 root root 0 Jan 26 13:47 enable
-rw-r--r-- 1 root root 0 Jan 26 13:47 filter
drwxr-xr-x 2 root root 0 Jan 26 13:47 sys_enter_accept
drwxr-xr-x 2 root root 0 Jan 26 13:47 sys_enter_accept4
drwxr-xr-x 2 root root 0 Jan 26 13:47 sys_enter_access
...
root@ubuntu:/sys/kernel/debug/tracing# ls -l ./events/syscalls/sys_enter_write
total 0
-rw-r--r-- 1 root root 0 Jan 26 13:47 enable
-rw-r--r-- 1 root root 0 Jan 26 13:47 filter
-r--r--r-- 1 root root 0 Jan 26 13:47 format
-r--r--r-- 1 root root 0 Jan 26 13:47 id
-rw-r--r-- 1 root root 0 Jan 26 13:47 trigger
Each of the event directories contains the same set of files and unsurprisingly
enable
is used to disable/enable the tracing of this event. Let's first disable the tracing of all events and then only enable that one that we're interested in. For convenience the upper level directories /events/syscalls
and /events
also contain a file enable
which can be used to propagate a change through all the sub directories. The file /set_event
always contains a list of the currently enabled events (and it could also be used to control the enabling/disabling of events) so we read that file to verify that now the only selected event is sys_enter_write
.
root@ubuntu:/sys/kernel/debug/tracing# echo -n "0" > events/enable
root@ubuntu:/sys/kernel/debug/tracing# echo -n "1" > events/syscalls/sys_enter_write/enable
root@ubuntu:/sys/kernel/debug/tracing# cat set_event
syscalls:sys_enter_write
Seems we're ready to start tracing, but we don't want to get flooded with data, so we'll only enable the tracing for about 5 seconds. Then we should be able to see the output of the trace in the file
/trace
and depending on how busy your system is there might be quite a lot of data.
root@ubuntu:/sys/kernel/debug/tracing# echo -n "1" > tracing_on; sleep 5; echo -n "0" > tracing_on
root@ubuntu:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 108/108 #P:1
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
sshd-1261 [000] .... 3360.093713: sys_write(fd: 3, buf: 7f7e25a22de0, count: 24)
sshd-1198 [000] .... 3360.699635: sys_write(fd: a, buf: 7f568d193d90, count: 1)
bash-1284 [000] .... 3360.699682: sys_write(fd: 2, buf: 76c008, count: 1)
sshd-1198 [000] .... 3360.699699: sys_write(fd: 3, buf: 7f568d178680, count: 24)
sshd-1198 [000] .... 3360.778897: sys_write(fd: a, buf: 7f568d193d90, count: 1)
bash-1284 [000] .... 3360.778941: sys_write(fd: 2, buf: 76c008, count: 1)
Cool, so we get some data there! Unfortunately we don't know what that is, because the only file identifier given is the
fd
for file descriptor. Since file descriptors are only unique in process context we can't use them to filter for the specific file we're interested in because we don't know which process is writing to that file, that's what we're trying to find out! To solve that problem we can use Kprobe to define custom events and variables to filter by.
Custom events with Kprobe
Kprobes work by placing breakpoint instructions at specific places in the kernel and then collect data associated to the current context, if you want to know all the tiny details about how it works then the kernel documentation shall satisfy your curiosity. Defined Kprobes can be used by Ftrace just like any other event and they're controlled using the same files as described above, but the important feature that we need right now is that Kprobe allows us to dereference function arguments in order to later filter by members of structures in those arguments. This will help us to filter out events on a specific file.
Due to the flexibility of Kprobe we can get information about basically anything that happens in the kernel, we just need to know what we're looking for and how to filter it out of all the other stuff happening. Let's revert back to the function
vfs_write
that we've already used with SystemTap to monitor writes. If SystemTap is able to retrieve the criteria we require from the function arguments of vfs_write
, then there should be some way how we can do the same with Kprobe if we just figure out where that data resides.
Looking at the kernel code, we can see in the function definition of
vfs_write
that the file
variable is of type struct file
.
# fs/read_write.c:547
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
You might remember from the SystemTap example above that the
vfs.write
tap extracted all the information about the inode number and the associated device's major and minor numbers from the file argument, so we need to inspect the kernel code that defines the involved structures to figure out where that data is.
# include/linux/fs.h
...
584 struct inode {
585 umode_t i_mode;
586 unsigned short i_opflags;
587 kuid_t i_uid;
588 kgid_t i_gid;
589 unsigned int i_flags;
590
591 #ifdef CONFIG_FS_POSIX_ACL
592 struct posix_acl *i_acl;
593 struct posix_acl *i_default_acl;
594 #endif
595
596 const struct inode_operations *i_op;
597 struct super_block *i_sb;
598 struct address_space *i_mapping;
599
600 #ifdef CONFIG_SECURITY
601 void *i_security;
602 #endif
603
604 /* Stat data, not accessed from path walking */
605 unsigned long i_ino;
...
839 struct file {
840 union {
841 struct llist_node fu_llist;
842 struct rcu_head fu_rcuhead;
843 } f_u;
844 struct path f_path;
845 struct inode *f_inode; /* cached value */
...
1289 struct super_block {
1290 struct list_head s_list; /* Keep this first */
1291 dev_t s_dev; /* search index; _not_ kdev_t */
...
What we see in those
struct
definitions is that the inode number is available from file->f_inode->i_ino
as a plain unsigned long
and the major and minor numbers are accessible as file->f_inode->i_sb->s_dev
as type dev_t
. The kernel uses macros to convert dev_t
typed variables into major and minor numbers and back, we'll come back to that in a minute.
Defining the filter criteria by using custom variables
The kprobetrace documentation comes with a section describing the syntax used to define kprobe events and their custom variables via
tracefs
.
23 Synopsis of kprobe_events
24 -------------------------
25 p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS] : Set a probe
26 r[:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS] : Set a return probe
27 -:[GRP/]EVENT : Clear a probe
28
29 GRP : Group name. If omitted, use "kprobes" for it.
30 EVENT : Event name. If omitted, the event name is generated
31 based on SYM+offs or MEMADDR.
32 MOD : Module name which has given SYM.
33 SYM[+offs] : Symbol+offset where the probe is inserted.
34 MEMADDR : Address where the probe is inserted.
35
36 FETCHARGS : Arguments. Each probe can have up to 128 args.
37 %REG : Fetch register REG
38 @ADDR : Fetch memory at ADDR (ADDR should be in kernel)
39 @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
40 $stackN : Fetch Nth entry of stack (N >= 0)
41 $stack : Fetch stack address.
42 $retval : Fetch return value.(*)
43 +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**)
44 NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
45 FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
46 (u8/u16/u32/u64/s8/s16/s32/s64), "string" and bitfield
47 are supported.
48
49 (*) only for return probe.
50 (**) this is useful for fetching a field of data structures.
Since we are going to have to access nested members of
struct file
we'll need to know their offsets relative to their containing structs. The C standard library has a really useful macro called offsetof
which can tell us exactly that, but since code running in the kernel can't access the standard library we'll need to use the kernel's own implementation of offsetof
which is in include/linux/stddef.h. Obviously, if you're less lazy than I am, then it's also possible to just do the offset calculations in your head. But the lazy-people's way to get them is by writing a super minimalist kernel module that prints the offsets into dmesg
as debug messages:
1 #include <linux/module.h>
2 #include <linux/fs.h>
3 #include <linux/kernel.h>
4 #include <linux/init.h>
5 #include <linux/printk.h>
6 #include <linux/stddef.h>
7
8 static int __init get_offset_init(void)
9 {
10 pr_debug("offset of f_inode in struct file is %ld", offsetof(struct file, f_inode));
11 pr_debug("offset of i_ino in struct inode is %ld", offsetof(struct inode, i_ino));
12 pr_debug("offset of i_sb in struct inode is %ld", offsetof(struct inode, i_sb));
13 pr_debug("offset of s_dev in struct super_block is %ld", offsetof(struct super_block, s_dev));
14
15 return 0;
16 }
17
18 static void __exit get_offset_cleanup(void)
19 {
20 }
21
22 module_init(get_offset_init);
23 module_exit(get_offset_cleanup);
Once built and loaded it shows us what we need to know in the kernel's debug output.
root@ubuntu:~/get_offsetof# insmod ./offsetof.ko
root@ubuntu:~/get_offsetof# dmesg
[19280.887103] offset of f_inode in struct file is 32
[19280.887106] offset of i_ino in struct inode is 64
[19280.887108] offset of i_sb in struct inode is 40
[19280.887109] offset of s_dev in struct super_block is 16
Since we are going to calculate the offsets relative to the pointer value of the first function argument (
struct file *file
) the address to add the offsets to should be in the register %di
, just as this StackOverflow post nicely describes.
Using the above quoted syntax documentation we can now define custom variables for our Kprobe event that extract
i_ino
and s_dev
from struct file
by writing it's definition into the file kprobe_events
.
# inode is at file->f_inode->i_ino with offsets:
# f_inode: 32
# i_ino: 64
# inode=+64(+32(%di)):u64
#
# device is at file->f_inode->i_sb->s_dev with offsets:
# f_inode: 32
# i_sb: 40
# s_dev: 16
# dev=+16(+40(+32(%di))):u32
#
# Defining the kprobe event and the custom variables by
# adding the above calculated offsets to the address
# in %di
root@mst-nb1:/sys/kernel/debug/tracing# \
echo "p:kprobes/vfs_write vfs_write inode=+64(+32(%di)):u64 dev=+16(+40(+32(%di))):u32" \
> ./kprobe_events
# The defined event appears in the /events folder and it's
# format file shows the custom variables
root@mst-nb1:/sys/kernel/debug/tracing# cat ./events/kprobes/vfs_write/format
name: vfs_write
ID: 1914
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:unsigned long __probe_ip; offset:8; size:8; signed:0;
field:u64 inode; offset:16; size:8; signed:0;
field:u32 dev; offset:24; size:4; signed:0;
print fmt: "(%lx) inode=0x%Lx dev=0x%x", REC->__probe_ip, REC->inode, REC->dev
Testing!
First we create a test file and get it's inode number.
root@mst-nb1:~# touch /tmp/testfile
# get the inode number of the test file
root@mst-nb1:~# ls -i /tmp/testfile
11014706 /tmp/testfile
Next we need to calculate the value that
dev_t
should have. The kernel uses the following macro to merge the major and minor numbers into a variable of type dev_t
.
# include/linux/kdev_t.h
...
6 #define MINORBITS 20
7 #define MINORMASK ((1U << MINORBITS) - 1)
8
9 #define MAJOR(dev) ((unsigned int) ((dev) >> MINORBITS))
10 #define MINOR(dev) ((unsigned int) ((dev) & MINORMASK))
11 #define MKDEV(ma,mi) (((ma) << MINORBITS) | (mi))
...
Well that bit shift should be simple enough to reimplement, so let's shift some bits!
I'll use Python to make those examples shorter:
# Getting the major and minor numbers of the file system that's
# currently mounted on /, because that's where the file
# to monitor resides
mst@mst-nb1:~$ sed -rne 's|^[0-9]+ [0-9]+ ([0-9:]+) / / .*$|\1|p' /proc/self/mountinfo
252:1
# Reimplementing the kernel MKDEV macro with the input of 252 and 1,
# the result should be what we need to look for as dev_t
mst@mst-nb1:~$ python -c 'print(252 << 20 | 1)'
264241153
Next we create the filter for our inode number and
dev_t
value.
# Filter inode number 11014706 and a dev_t value 264241153
root@mst-nb1:/sys/kernel/debug/tracing# \
echo "inode==11014706 && dev==264241153" > events/kprobes/vfs_write/filter
# Enable tracing
root@mst-nb1:/sys/kernel/debug/tracing# echo 1 > events/kprobes/vfs_write/enable
root@mst-nb1:/sys/kernel/debug/tracing# echo 1 > tracing_on
# On terminal 2 I write into /tmp/testfile
root@mst-nb1:~# python -c 'fd = open("/tmp/testfile", "w"); fd.write("test"); fd.close()'
# Back on terminal 1 we check the content of trace
root@mst-nb1:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
python-6559 [004] d... 1406152.334025: vfs_write: \
(vfs_write+0x0/0x1a0) inode=0xa81232 dev=0xfc00001
There's exactly one line showing our Python processes calling
vfs_write
and we didn't have to taint the kernel with custom modules or install any non-standard tools to get to that result!
Conclusion
There are many answers to the above question, some might be better and some might be less good, but the only way to see the trade offs is to be familiar with all the concepts. Obviously the described tools can also be used for more than only monitoring write operations, that was only a basic example.
Writing this article was a great exercise for me and I'd be happy about feedback with suggestions for improvements or other solutions.