Sandboxing ChromeOS system services
In ChromeOS, OS-level functionality (such as configuring network interfaces) is implemented by a collection of system services and provided to Chrome over D-Bus. These system services have greater system and hardware access than the Chrome browser.
Separating functionality like this aims to prevent malicious websites from gaining access to OS-level functionality. If Chrome were able to directly control network interfaces, then a compromise in Chrome would give an attacker almost full control over the system. For example, by having a separate network manager, we can reduce the functionality exposed to an attacker to just querying interfaces and performing pre-determined actions on them.
ChromeOS uses a few different mechanisms to isolate system services from Chrome
and from each other. We use a helper program called Minijail (executable
minijail0
). In most cases, Minijail is used in the service's init script. In
other cases, the Minijail library is used if a service wants to apply
restrictions to the programs that it launches, or to itself.
These different sandboxing mechanisms are described in the ChromeOS sandboxing talk (internal only).
The “forbidden intersection”
The forbidden intersection is:
- Running as root, or with
CAP_SYS_ADMIN
, and - Running in the init PID and mount namespaces, and
- Running without a Seccomp policy and a Landlock policy, or without an enforcing SELinux domain.
You must avoid the forbidden intersection by having at least one of, preferably more than one of, and ideally all of:
- Running as a non-root user with no
CAP_SYS_ADMIN
— User IDs - Running in a non-init mount and PID namespace, or in a new PID namespace with a Landlock policy — Namespaces or Landlock
- A Seccomp policy and a Landlock policy, OR running with an SELinux policy in an enforcing domain — Seccomp filters and Landlock, or SELinux. Landlock can provide filesystem access restrictions similar to SELinux, but it must be combined with Seccomp to escape the forbidden intersection.
You don't normally need both Seccomp and SELinux but for very security-sensitive workloads this can be required.
Note on SELinux support
SELinux should be used exclusively when other sandboxing approaches are not feasible. Examples of this include:
- Services that need to run as root or with
CAP_SYS_ADMIN
, or - Services that require privileged access to the filesystem:
- Writing to root-owned files or directories, or
- Mounting paths where existing sandboxing tools cannot restrict mount paths.
Best practices for writing secure system services
Just remember that code has bugs, and these bugs can be used to take control of the code. An attacker can then do anything the original code was allowed to do. Therefore, code should only be given the absolute minimum level of privilege needed to perform its function.
Aim to keep your code lean, and your privileges low. Don't run your service as root. If you need to use third-party code that you didn't write, you should definitely not run it as root.
Use the libraries provided by the system/SDK. In ChromeOS, libchrome and libbrillo (née libchromeos) offer a lot of functionality to avoid reinventing the wheel, poorly. Don't reinvent IPC; use D-Bus or Mojo. Don't open listening sockets; connect to the required service.
Don't (ab)use shell scripts. Shell script logic is harder to reason about and shell command-injection bugs are easy to miss. If you need functionality separated from your main service, use programs written in a primary programming language like C++ or Rust, not shell scripts. Moreover, when you execute them, consider further restricting their privileges.
It is strongly recommended to enable CFI for code for code not written in a memory safe language whenever it handles untrusted inputs or runs without all three techniques mentioned in the Forbidden intersection section.
Just tell me what I need to do
- Create a new user (and optionally a new group) for your service. See CrOS user & group management.
- Use Minijail to run your service as the user (and group) created in the previous steps. See Minijail configuration.
- If your service fails, you might need to grant it capabilities. See Capabilities.
- Use as many namespaces as possible. See Namespaces.
- If possible, use
-n
to setno_new_privs
. - Consider reducing the kernel attack surface exposed to your service by using Seccomp filters. See Seccomp filters.
- Add your sandboxed service to the security.SandboxedServices test.
- Enable the
cfi
andthinlto
USE flags.
Minijail configuration
Minijail can be configured via command-line arguments or a configuration file. Prefer using a configuration file, as it's easier to read and can include comments.
Minijail configuration files should be installed to
/usr/share/minijail/
. An upstart service can invoke minijail with a
configuration file like this:
exec minijail0 --config /usr/share/minijail/my_service.conf -- /usr/bin/my_service
Example configuration:
% minijail-config-file v0
# Copyright 2024 The ChromiumOS Authors
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
# Enable network, ipc, cgroup, pid, and uts namespaces.
e
l
N
p
uts
# Minimal mount namespace.
profile=minimalistic-mountns
# If you don't need /dev, use minimalistic-mountns-nodev instead.
# Set no_new_privs.
n
# Enable seccomp policy.
S = /usr/share/policy/my_service-seccomp.policy
# Set user and group.
u = my_service
g = my_service
# Set up mounts. These will be specific to your service, these are just common examples:
# Mount a tmpfs at /dev and /run.
mount = /dev,/dev,tmpfs,MS_NODEV|MS_NOEXEC|MS_NOSUID,mode=755,size=10M
mount = /run,/run,tmpfs,MS_NODEV|MS_NOEXEC|MS_NOSUID,mode=755,size=10M
# Allow syslog.
bind-mount = /dev/log
# Allow D-Bus.
bind-mount = /run/dbus
# Allow mojo.
bind-mount = /run/mojo,,1
# Allow DNS resolution.
bind-mount = /run/dns-proxy
bind-mount = /run/shill
# Allow UMA metrics.
bind-mount = /var/lib/metrics,,1
User IDs
The first sandboxing mechanism is user IDs (UIDs). We try to run each service as its own UID, different from the root user, which allows us to restrict what files and directories the service can access, and also removes a big chunk of system functionality that's only available to root.
See CrOS user & group management for details of adding a new user.
To set the user, add u = <username>
to your service's minijail configuration.
chronos-access membership requires SELinux
The forbidden intersection notwithstanding, if a service accesses user data by
having its UID in the chronos-access
group, it must run with an enforcing
SELinux domain. These services are accessing data owned by the chronos user,
which could be malicious. A compromised Chrome browser could modify or corrupt
chronos-owned data and attempt to escalate privileges by exploiting or confusing
a service accessing this data.
SELinux is useful because it provides finer-grained control over what the service is allowed to do. Exploitation in these cases doesn't happen via memory corruption. Instead, the attacker will set up user data to confuse the service and trick it into performing valid operations on the wrong filesystem objects. This is usually referred to as a confused deputy attack. For example, a service might be tricked into mounting what it believes is a USB drive. In reality, however, it ends up mounting a virtual image on top of an existing file or directory, bypassing our write-XOR-execute restrictions.
Seccomp is not a great option to prevent this type of attack because it's not
granular enough. In the previous example, since the service is allowed to
perform mounts, the mount(2)
system call has to be allowed. Seccomp does not
have the ability to filter path arguments to system calls, so it would be
impossible to restrict path arguments to the mount call. SELinux, on the other
hand, allows us to restrict a service to only perform operations (like mount
calls) on specific paths in the filesystem.
Capabilities
Some programs, however, require some of the system access usually granted only
to the root user. We use Linux capabilities for this. Capabilities allow us
to grant a specific subset of root's privileges to an otherwise unprivileged
process. The link above has the full list of capabilities that can be granted
to a process. Some of them are equivalent to root, so we avoid granting those.
In general, most processes need capabilities to configure network interfaces,
access raw sockets, or performing specific file operations. Capabilities are
passed to Minijail using the -c
switch. permission_broker
, for example,
needs capabilities to be able to chown(2)
device nodes.
From
permission_broker.conf
:
start on starting system-services
stop on stopping system-services
respawn
# Run as <devbroker> user.
# Grant CAP_CHOWN and CAP_FOWNER.
exec minijail0 -u devbroker -c 'cap_chown,cap_fowner+eip' -- \
/usr/bin/permission_broker
Capabilities are expressed using the format that cap_from_text(3)
accepts.
Namespaces
Many resources in the Linux world can be isolated now such that a process has its own view of things. For example, it has its own list of mount points, and any changes it makes (unmounting, mounting more devices, etc...) are only visible to it. This helps keep a broken process from messing up the settings of other processes.
For more in-depth details, see the namespaces overview.
In ChromiumOS, we like to see every process/daemon run under as many unique namespaces as possible. Many are easy to enable/rationalize about: if you don't use a particular resource, then isolating it is straightforward. If you do rely on it though, it can take more effort.
Here's a quick overview. Use the command line option if the description below matches your service (or if you don't know what functionality it's talking about -- most likely you aren't using it!).
--profile=minimalistic-mountns
: This is a good first default that enables mount and process namespaces. This only mounts/proc
and creates a few basic device nodes in/dev
. If you need more things mounted, you can use the-b
(bind-mount) and-k
(regular mount) flags.--uts
: Just always turn this on. It makes changes to the host / domain name not affect the rest of the system.-e
: If your process doesn't need network access. This also isolates netlink and UNIX abstract sockets. Note: D-Bus and syslog use named UNIX sockets, so they will still be usable (as long as you bind-mounted them).-l
: If your process doesn't use SysV shared memory or IPC.-p
: If your process doesn't interact with other process PIDs (other than child processes).-N
: If your process doesn't need to modify common control groups settings.
When does a process need to run in the init mount or PID namespace?
Almost all processes do not need to run in the init namespace. Please contact chromeos-security@ for a consultation if you believe that your process needs to run in the init mount or PID namespace.
Passing common resources
When using many namespaces to isolate a service, there are some resources that the service still reasonably should be able to access.
If bind-mounting on top of /run, you need to mount a tmpfs /run:
-k 'none,/run,tmpfs,MS_NODEV|MS_NOEXEC|MS_NOSUID,mode=755,size=10M'
If bind-mounting on top of /sys, you need to mount a tmpfs /sys:
-k 'none,/sys,sysfs,MS_NODEV|MS_NOEXEC|MS_NOSUID,mode=755,size=10M'
- syslog: If using
-d
to mount a minimal /dev, you can pass access to the syslog daemon by using-b /dev/log
. If your process mounts all of /dev, you need to use-b /run/systemd/journal
since /dev/log is a symlink to /run/systemd/journal/dev-log. In either case, you do not need to specify the writable flag to-b
for this to work. This will work across all namespaces (including-e
network). - D-Bus: You can access the system D-Bus by using
-b /run/dbus
. You do not need to specify the writable flag to-b
for this to work. This will work across all namespaces (including-e
network and-p
PID). - Nameservers: If you need to resolve hostnames (DNS), you'll need to pass
-b /run/dns-proxy
as that daemon maintains the/etc/resolv.conf
file.
When does a process need to run in the Chrome mount namespace?
Note: Before utilizing these namespaces, please consult with the chromeos-security@ team to make sure it's used correctly.
ChromeOS Guest sessions run in an isolated mount namespace bound to the
path /run/namespaces/mnt_chrome
. This means Chrome runs in the non-init
mount namespace at the path and Cryptohome mounts user profile
directories and Daemon stores in this namespace. Regular sessions, on the
other hand, don't have this session isolation yet.
Any process that needs to access user data during a Guest session must run in the Chrome mount namespace. For regular sessions however the namespace isolation is not active. The namespace exists for all sessions, however regular user sessions setup the user home directories in the root mount namespace to handle data propagation between ARCVM, Linux VM and other system parts. Therefore processes must not enter the mount namespace during a regular user session.
Most processes that access user data utilize daemon stores, which are already mounted in the Chrome mount namespace. However, if a new process needs to access user data from user cryptohome by explicitly entering the Chrome mount namespace by calling setns(2) or nsenter(1), it can do so by querying the state of the session isolation from the browser process. Here is an example CL for this approach.
Mount propagation type guidance
When creating a new mount or entering a new mount namespace, an important consideration is the mount propagation mode of the mount. Some background on mount types can be found in Linux kernel mount documentation, but a brief summary is:
- Shared (
MS_SHARED
) — allows mount/unmount events to flow in both directions between namespaces - Mounts-flow-in (
MS_SLAVE
) — only allows mount events to flow in from the parent namespace - Private (
MS_PRIVATE
) — allows no flow of mount events - Unbindable (
MS_UNBINDABLE
) — same as private + you also cannot bind-mount to this mount point
For how to change the mount propagation mode when entering a new mount namespace
see the section for -K[mode]
in minijail0(1)
.
When does a mount need to be shared?
Mounts need to be shared if and only if mount/unmount events need to flow between namespaces in both directions. If mounts only need to flow from one namespace to the other, then they must be shared on the parent namespace but can be mounts-flow-in on the child namespace. Making mounts shared increases the possibilities for interaction between processes that could undermine security of the system and user namespace separation.
Seccomp filters
Removing access to the filesystem and to root-only functionality is not enough to completely isolate a system service. A service running as its own UID and with no capabilities has access to a big chunk of the kernel API. The kernel therefore exposes a huge attack surface to non-root processes, and we would like to restrict what kernel functionality is available for sandboxed processes.
The mechanism we use is called Seccomp-BPF. Minijail can take a policy file
that describes what syscalls will be allowed, what syscalls will be denied, and
what syscalls will only be allowed with specific arguments. The full description
of the policy file language can be found in [the syscall_filter.c
source].
Abridged policy for
mtpd
on amd64 platforms:
# Copyright 2012 The ChromiumOS Authors
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
read: 1
ioctl: 1
write: 1
timerfd_settime: 1
open: 1
poll: 1
close: 1
# Don't allow mmap with both PROT_WRITE and PROT_EXEC.
mmap: arg2 in ~PROT_EXEC || arg2 in ~PROT_WRITE
mremap: 1
munmap: 1
# Don't allow mprotect with PROT_EXEC.
mprotect: arg2 in ~PROT_EXEC
lseek: 1
# Allow socket(domain==PF_LOCAL) or socket(domain==PF_NETLINK)
socket: arg0 == 0x1 || arg0 == 0x10
# Allow PR_SET_NAME from libchrome's base::PlatformThread::SetName()
prctl: arg0 == 0xf
Any syscall not explicitly mentioned, when called, results in the process
being killed. The policy file can also tell the kernel to fail the system call
(returning -1 and setting errno
) without killing the process:
# execve: return EPERM
execve: return 1
NOTE:
mmap
andmprotect
both have argument filters to prevent writeable executable memory since that makes certain classes of attacks much easier. In most casesmprotect
does not needPROT_EXEC
, but you might have to usearg2 in ~PROT_EXEC || arg2 in ~PROT_WRITE
just likemmap
in cases where child processes are executed and need to dynamically link shared libraries or the code implements a JIT compiler.
Generating Seccomp policies using audit (on 4.14+ kernels)
On kernels 4.14 and above we can use the new SECCOMP_RET_LOG
return value to
make policy generation easier. On these kernels, the -L
Minijail option will
use SECCOMP_RET_LOG
as the return value for blocked syscalls: those not listed
in the policy or whose arguments don't match the policy. Instead of killing the
process on a blocked syscall, the kernel will log the otherwise blocked syscall
but will effectively allow it.
The advantage of this mechanism versus what we have available in pre-4.14
kernels is that instead of having to add syscalls to the policy one by one, you
can run the process with -L
, get a list of all the syscalls not included in
the policy, review them, and automatically generate or augment a policy, all
in one step.
NOTE: Minijail's
-L
flag requires minijail to be built withUSE=cros-debug
. Generally, this means that it will not work out of the box in prebuilt (e.g Goldeneye, CPFE, etc.) OS images.
Our recommended way of using this functionality is to start with an empty
policy which will cause all syscalls to be logged-but-allowed. The resulting
audit logs (at /var/log/audit/audit.log*
) can then be parsed with the
generate_seccomp_policy.py script to automatically generate a policy. There's
a bit of extra setup required and some associated caveats. Please see the
detailed instructions at generate_seccomp_policy.py
's README section on
using Linux audit logs to generate policy.
This mechanism can also be combined with the strace
-based mechanism below:
run the process to be sandboxed under strace
, generate a base policy using
the policy generation script, and then refine it using -L
.
Generating a seccomp policy using strace
This is the old and familiar way of generating policies by inspecting syscalls
using strace
. It does not have any kernel version dependencies and also does
not require a minijail build with USE=cros-debug
. Similar to audit logs above,
the generate_seccomp_policy.py script can accept strace logs from an
unsandboxed process to generate a policy.
NOTE: When creating
strace
logs for arm64, make sure you're running it in arm64 userland as most devices that support arm64 kernels run 32-bit arm userland by default. The image running on the device should be built for 64-bit arm userland e.g. kevin device can run the image built for 32-bit arm userland with the--board=kevin
flag and run the image for 64-bit arm userland built with the--board=kevin64
flag. You can runfile -L /bin/sh
command to check which environment you're running on.
Generate and pre-process the strace log
strace -f -o strace.log <program>
When sandboxing a dynamically-linked executable, Minijail will default to using
LD_PRELOAD
to install the seccomp filter. This will install the filter
after glibc initialization, so remove the syscalls related to glibc
initialization to obtain a smaller filter (and a tighter sandbox). Those are
normally everything up to and including the following:
rt_sigaction(SIGRTMIN, {<sa_handler>, [], SA_RESTORER|SA_SIGINFO, <sa_restorer>}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {<sa_handler>, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, <sa_restorer>}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
brk(NULL) = <addr>
brk(<addr>) = <addr>
If you want to collect strace
logs for an existing service and you already
have a test device set up, the following steps can help especially if you are
extending an existing policy:
- Mount the root-fs read/write.
mount -o remount,rw /
- Edit the init config file in
/etc/init
- Temporarily disable seccomp if it is present.
- Add
strace -f -o /tmp/strace.log
before Minijail is invoked. Note that this will include all theminijail0
syscalls as well, but you can exclude them later.
- Reboot or reload the config
initctl reload-configuration
and restart the service. - Run tests or perform actions that exercise the features of the service.
- Collect the resulting file
/tmp/strace.log
.
Generate policy using strace.log
~/chromiumos/src/platform/minijail/tools/generate_seccomp_policy.py strace.log > $PROGRAM_NAME.policy
Testing and troubleshooting
-
Test the policy:
minijail0 -S seccomp.policy -L <cmd>
- If the policy is incomplete:
- (Kernel >=4.14) Minijail/audit will log all offending syscalls to
/var/log/audit/audit.log
(as discussed above). - (Kernel <4.14) Minijail will crash and log the first offending
syscall in
/var/log/messages
.
- (Kernel >=4.14) Minijail/audit will log all offending syscalls to
-
You should ensure that your service is executing as many of its code paths as possible when executing
strace
or auditing, in particular error paths. In some cases it may be easier to add syscalls manually (for instance,abort
) rather than forcing execution of those paths. -
When you're collecting strace logs to create a Seccomp policy, make sure you're in the targeted userland. You can check the environment by running
file -L /bin/sh
command. -
When not using the
-n
Minijail flag, privilege-dropping syscalls happen after the filter is installed, and they won't show up in normal program execution because they're called by Minijail itself. Because the audit log method uses Minijail, this is only applicable to thestrace
method. The following syscalls need to be added to the policy:setgroups(2)
,setresgid(2)
, andsetresuid(2)
for dropping root.capget(2)
,capset(2)
, andprctl(2)
for dropping capabilities.- Normally, it's just easier to use the
-n
flag (no_new_privs
), which prevents the sandboxed process from obtaining new privileges and is therefore a good addition for sandboxing.
-
Sometimes Minijail will fail to compile the seccomp filter with an error similar to:
WARNING minijail0[32315]: libminijail[32315]: trailing garbage after constant: 'LOOP_GET_STATUS64'
WARNING minijail0[32315]: libminijail[32315]: compile_atom: /usr/share/policy/e2fsck-seccomp.policy(13): invalid constant 'LOOP_GET_STATUS64'
WARNING minijail0[32315]: libminijail[32315]: could not allocate filter block
WARNING minijail0[32315]: libminijail[32315]: compile_filter: compile_file() failed
ERR minijail0[32315]: libminijail[32315]: failed to compile seccomp filter BPF program in '/usr/share/policy/e2fsck-seccomp.policy'
-
This means that one of the constant parameters provided to a syscall could not be resolved: e.g.
ioctl: arg1 == LOOP_GET_STATUS64
. -
To fix this, look up the hex value of the constant and substitute the constant e.g.
ioctl: arg1 == 0x4C05
. -
Minijail resolves these constants based on headers that are in gen_constants-inl.h.
-
Some syscalls may be triggered more deterministically on VMs than in tests on real hardware (e.g.
clock_gettime
andgettimeofday
). Developers should be on the lookout for such failures and the Seccomp policies should be adjusted accordingly.
Finding failing syscalls
If a process violates its seccomp policy, it'll be terminated with SIGSYS
(bad
system call) and you'll see a message like this in /var/log/messages
:
WARNING minijail0[1415]: libminijail[1415]: child process 1417 had a policy violation (/usr/share/policy/foo.policy)
There are a couple of ways to find out what syscall caused the violation:
-
If you can deploy a debug build of Minijail (the
chromeos-base/minijail
package built withUSE="cros-debug"
) to the device, and add-L
to the Minijail command. Information on failing syscalls will be logged to/var/log/audit/audit.log
. (Kernel version 4.14 or above is required for this to work.) Note that-L
also stops policy violations from crashing the process, so execution will succeed.Many other things are logged to audit.log, so to locate the messages related to your binary, look for messages starting
type=SECCOMP
. For example, here's a violation message caused by thetrue
command making syscall 157:type=SECCOMP msg=audit(1641860367.246:224): auid=0 uid=0 gid=0 ses=2 subj=u:r:minijail:s0 pid=13686 comm="true" exe="/usr/bin/coreutils" sig=0 arch=c000003e syscall=157 compat=0 ip=0x7a28525bde35 code=0x7ffc0000
(Note that longer command names may be truncated in the
comm=
field.) -
If you have a core dump or minidump (such as the
.core
file found in/var/spool/crash/
, or the minidump attached to a crash report), you can open it in a debugger and dump the register values (e.g. withinfo registers
in gdb orregister read
in lldb). Check the syscall calling conventions for your architecture to determine which registers contain the syscall number and arguments. (Note: 64-bit ARM systems (arm64, aarch64) run userspace programs in 32-bit mode, so you'll want to follow the arm calling conventions in that case.)
Once you have the syscall number, find out the name of the syscall by looking it
up our in online syscalls table or in minijail0 -H
(run on the same
architecture as the program you're debugging). You can then add the syscall to
your policy.
If you do not want to allow an entire syscall, you can only allow certain
parameters, e.g. ioctl: arg1 == FDGETPRM
. You can find the values of these
parameters from a core or minidump using a debugger, as described above.
Installing and applying the generated policy
The policy file needs to be installed in the system, so we need to add it to the ebuild file. For example:
# Install seccomp policy file.
insinto /usr/share/policy
use seccomp && newins "mtpd-seccomp-${ARCH}.policy" mtpd-seccomp.policy
And finally, the policy file has to be passed to Minijail, using the -S
option. Again, using mtpd as an example:
# use minijail (drop root, set no_new_privs, set seccomp filter).
# Mount /proc, /sys, /dev, /run/udev so that USB devices can be
# discovered. Also mount /run/dbus to communicate with D-Bus.
exec minijail0 -i -I -p -l -r -v -t -u mtp -g mtp -G \
-P /mnt/empty -b / -b /proc -b /sys -b /dev \
-k tmpfs,/run,tmpfs,0xe -b /run/dbus -b /run/udev \
-n -S /usr/share/policy/mtpd-seccomp.policy -- \
/usr/sbin/mtpd -minloglevel="${MTPD_MINLOGLEVEL}"
Securely mounting cryptohome daemon store folders
Some daemons store user data on the user's cryptohome under
/home/.shadow/<user_hash>/mount/root/<daemon_name>
(or equivalently
/home/root/<user_hash>/<daemon_name>
) and
/home/.shadow/<user_hash>/mount/root/.cache/<daemon_name>
(or equivalent
/home/root/<user_hash>/.cache/<daemon_name>
). For instance, Session Manager
stores user policy under /home/root/<user_hash>/session_manager/policy
. This
is useful if the data should be protected from other users since the user's
cryptohome is only mounted (and therefore decrypted) when the user logs in. If
the user is not logged in, it is encrypted with the user's password.
However, if a daemon is already running inside a mount namespace (minijail0 -v ...
) when the user's cryptohome is mounted, it does not see the mount since
mount events do not propagate into mount namespaces by default. This propagation
can be achieved, though, by making the parent mount a shared mount and the
corresponding mount inside the namespace a shared or MS_SLAVE
mount. See
shared subtrees.
To set up a cryptohome daemon store folder that propagates into your daemon's
mount namespace, add this code to the src_install
section of your daemon's
ebuild:
local daemon_store="/etc/daemon-store/<daemon_name>"
dodir "${daemon_store}"
fperms 0700 "${daemon_store}"
fowners <daemon_user>:<daemon_group> "${daemon_store}"
This directory is never used directly. It merely serves as a secure template for
the chromeos_startup
script, which picks it up and creates
/run/daemon-store/<daemon_name>
and /run/daemon-store-cache/<daemon_name>
as
shared mounts.
Next, move the user/group setup to pkg_setup()
since pkg_preinst()
, where
this is usually done, runs after src_install()
:
pkg_setup() {
# Has to be done in pkg_setup() instead of pkg_preinst() since
# src_install() needs <daemon_user> and <daemon_group>.
enewuser <daemon_user>
enewgroup <daemon_group>
cros-workon_pkg_setup
}
In your daemon's init script, the default is to mount with MS_SLAVE
for minijail0 (MS_PRIVATE
for libminijail) unless -K<mode>
is passed
when creating your mount namespace. Be sure not to mount all of /run
. Make
sure to mount with the MS_REC
flag to propagate any already-mounted cryptohome
bind mounts into the mount namespace.
minijail0 -v -Kslave \
-k 'tmpfs,/run,tmpfs,MS_NOSUID|MS_NODEV|MS_NOEXEC' \
-k '/run/daemon-store/<daemon_name>,/run/daemon-store/<daemon_name>,none,MS_BIND|MS_REC' \
-k '/run/daemon-store-cache/<daemon_name>,/run/daemon-store-cache/<daemon_name>,none,MS_BIND|MS_REC' \
...
During sign-in, when the user's cryptohome is mounted, Cryptohome creates
/home/.shadow/<user_hash>/mount/root/<daemon_name>
, bind-mounts it to
/run/daemon-store/<daemon_name>/<user_hash>
and copies ownership and mode from
/etc/daemon-store/<daemon_name>
to the bind target. Since
/run/daemon-store/<daemon_name>
is a shared mount outside of the mount
namespace and a MS_SLAVE
mount inside, the mount event propagates into the
daemon. Ditto for /home/.shadow/<user_hash>/mount/root/.cache/<daemon_name>
and /run/daemon-store-cache/<daemon_name>
Your daemon can now use /run/daemon-store/<daemon_name>/<user_hash>
to store
user data once the user's cryptohome is mounted. Note that even though
/run/daemon-store
is on a tmpfs, your data is actually stored on disk and not
lost on reboot.
/run/daemon-store-cache/<daemon_name>/<user_hash>
works similarly as above,
however contents inside this directory are automatically cleaned up when device
disk space is low, for both logged-in and logged-out users. This is the ideal
directory to store cache contents.
Be sure not to write to the folder before the cryptohome is mounted.
Consider listening to Session Manager's SessionStateChanged
signal or similar
to detect mount events. Note that /run/daemon-store/<daemon_name>/<user_hash>
might exist even though cryptohome is not mounted, so testing existence is not
enough (it only works the first time).
The <user_hash>
can be retrieved with cryptohome's GetSanitizedUsername
D-Bus method.
The following diagram illustrates the mount event propagation:
Landlock unprivileged filesystem access control
Landlock is a Linux Security Module that helps manage filesystem access, notably even for unprivileged processes. Minijail supports options to help manage a Landlock policy.
Policies consist of an allowlist of paths, and the specific permissions for a given path. Minijail includes the following flags to help set up a policy:
--fs-default-paths
: This is recommended for most services, and provides access to basic system resources, such as the ability to execute shared object libraries in/lib64
.--fs-path-ro
: Provides read-only access for a path.--fs-path-rx
: Provides read and execute access for a path.--fs-path-rw
: Provides read and basic write access for a path.--fs-path-advanced-rw
: Provides read and advanced write access for a path. In most cases, basic write access is sufficient, but if you need special capabilities such as creating symlinks, use this option. The full set of additional capabilities is:LANDLOCK_ACCESS_FS_MAKE_CHAR
LANDLOCK_ACCESS_FS_MAKE_DIR
LANDLOCK_ACCESS_FS_MAKE_REG
LANDLOCK_ACCESS_FS_MAKE_SOCK
LANDLOCK_ACCESS_FS_MAKE_FIFO
LANDLOCK_ACCESS_FS_MAKE_BLOCK
LANDLOCK_ACCESS_FS_MAKE_SYM
Creating useful policies
The objective of Landlock is reducing process interactions via the filesystem.
As such, creating an overly broad policy that includes
RW access to all of /run
and /var
would substantially diminish the security
benefits of Landlock.
Instead, allow a minimal set of paths that you need. For example, if you
need to access D-Bus, consider allowing /run/dbus
.
Inode-based operation
Policies are based on the state of the filesystem when a policy is applied, rather than a string comparison against path names. Internally, Landlock looks at the inodes that exist when the sandbox is entered, so if you need to create new files or directories you’ll want to specify a Landlock policy that includes RW access one directory level above.
Limitations of Landlock
Landlock cannot be used if the sandboxed process needs to modify its
filesystem topology, specifically via mount(2)
or pivot_root(2)
. For
additional background, see the official Landlock documentation.
Example Landlock config
Below is an example Landlock config file, for a process that needs access to
D-Bus and needs to write to a file in /var/lib/example_daemon
. If the config
file is named example_daemon.conf
, you can pass it to Minijail using
--config=example_daemon.conf
.
% minijail-config-file v0
# Filesystem access rules.
fs-default-paths
fs-path-rw = /run/dbus
fs-path-rw = /var/lib/example_daemon
# Other Minijail options....
Minijail wrappers (deprecated)
The Minijail wrappers are currently deprecated. They were designed to allow mocking of individual Minijail configuration settings but we concluded that this was the wrong level to mock Minijail. The mocks were fragile and wordy. A better way to mock Minijail is to just abstract away the entire sandboxed process execution. An example of this can be found in the SandboxedProcess class in debugd.
Enforcing Control Flow Integrity
Control Flow Integrity is a compiler feature for making certain memory corruption bugs much more difficult to exploit for code execution. Examples include type confusion, heap use-after-frees, heap buffer overflows, heap double frees, etc. Violations trigger crashes through SIGILL.
CFI is already enabled in debugd, and you can use that CL as an example for your service. Please note that CFI isn't free so there is a ~1% performance cost plus a potential binary size increase. However, enabling LTO is a prerequisite of CFI which typically leads to a ~3% performance gain and in some cases leads to a binary size decrease.
Control flow integrity support for ChromeOS is available through
cros-sanitizers.eclass which is already inherited for all ebuilds that
inherit platform.eclass. This means that for most first party ebuilds you
just need to set the appropriate USE flags and make sure everything still works.
For things that break with CFI enabled you can add them to a cfi-ignore.txt
file. This will be automatically applied if it is placed in the ebuild's files
directory.
Troubleshooting
Since CFI might break certain atypical usage of function pointers and type casting, it may be necessary to troubleshoot CFI related crashes. A good starting point is to enable the diagnostic and recovery modes of CFI and run the package's unit tests for example:
FEATURES=test USE='cfi cfi_diag cfi_recover thinlto' emerge-${BOARD} debugd
CFI violations will look something like:
ASAN error detected:
/build/amd64-generic/usr/include/gtest/gtest-matchers.h:234:12: runtime error: control flow integrity check for type 'bool (const testing::internal::MatcherBase<const std::string &> &, const std::string &, testing::MatchResultListener *)' failed during indirect function call
(/usr/lib64/libgtest.so.1.13.0+0x603f4): note: (unknown) defined here
/build/amd64-generic/usr/include/gtest/gtest-matchers.h:234:12: note: check failed in /var/cache/portage/chromeos-base/debugd/out/Default/debugd_testrunner, destination function located in /usr/lib64/libgtest.so.1.13.0
#0 0x55fd85b5d9e8 in testing::internal::MatcherBase<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>::MatchAndExplain(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, testing::MatchResultListener*) const /build/amd64-generic/usr/include/gtest/gtest-matchers.h:234:12
#1 0x55fd85b6106a in testing::internal::MatcherBase<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>::Matches(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) const /build/amd64-generic/usr/include/gtest/gtest-matchers.h:240:12
...
At this point it is up to you to determine if a source change is justified or if it is ok to add the particular function or source file to a cfi-ignore.txt. Generally, it is preferable not to exempt code from CFI unless there is a good reason such as when the code is only for unit test support.