Or: Why can’t this kernel just do what I want, and stop containers from doing what we don’t want?

Here is my rambling adventure for the holy grail: Can we use seccomp to restrict some of what NET_ADMIN can do?

Initially this felt like a shoo-in. Seccomp allows fine-grained filtering of syscalls, and everything otherwise allowed by NET_ADMIN is done through syscalls. How hard can it be?

What syscalls do we need to limit?

There’s no simple 1:1 mapping between what NET_ADMIN allows and the syscalls used to do those things. The problems range from the deceptively simple ioctl- and setsockopt-based interactions to the horribly complex IFF_PROMISC, which can be reached in at least 4 ways.

Promiscuous Mode

Promiscuous mode can be used, especially with macvlan and SR-IOV interfaces, to intercept DHCP traffic and potentially disrupt or redirect cluster traffic. It would be nice to limit this in pods that don’t need this facility but do need other parts of what is allowed by CAP_NET_ADMIN. So how do we block this?

This is complicated! There are at least 4 ways to set promiscuous mode on an interface:

  • An ioctl syscall with SIOCSIFFLAGS setting IFF_PROMISC
  • A setsockopt syscall with PACKET_ADD_MEMBERSHIP to add the PACKET_MR_PROMISC group (actually gated by CAP_NET_RAW, not CAP_NET_ADMIN)
  • rtnetlink (NETLINK_ROUTE) messages that set the IFF_PROMISC interface flag
  • Side-effects of bridges and vlan devices

It quickly became clear that blocking all of these, especially the netlink socket messages, was going to be difficult if not impossible.

Setsockopt subset

So let’s try something easier! The following socket options are potentially dangerous for their ability to deny service to other network traffic:

  • SO_PRIORITY < 0 or SO_PRIORITY > 6 (0 <= priority <= 6 is already allowed to non-NET_ADMIN users)
  • SO_DEBUG (on/off)
  • SO_SNDBUFFORCE/SO_RCVBUFFORCE (all values)
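
To make the target concrete, the list above can be written as a small predicate. This is only a sketch: the numeric option values are the generic Linux ones from asm-generic/socket.h (a few architectures use different numbers), and needs_net_admin is a hypothetical helper name, not part of any library.

```python
# Generic Linux values from asm-generic/socket.h (assumption: not a
# special-cased architecture such as MIPS or SPARC).
SO_DEBUG = 1
SO_PRIORITY = 12
SO_SNDBUFFORCE = 32
SO_RCVBUFFORCE = 33


def needs_net_admin(optname: int, value: int) -> bool:
    """Return True for the SOL_SOCKET options we would like to block."""
    if optname == SO_PRIORITY:
        # Priorities 0..6 are open to everyone; anything else is privileged.
        return value < 0 or value > 6
    # The remaining options are of interest regardless of the value passed.
    return optname in (SO_DEBUG, SO_SNDBUFFORCE, SO_RCVBUFFORCE)


for optname, value in [(SO_PRIORITY, 2), (SO_PRIORITY, 7), (SO_DEBUG, 1)]:
    print(optname, value, needs_net_admin(optname, value))
```

Note that the SO_PRIORITY branch depends on the value being set, not just on the option name.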

Seccomp passes the syscall number and up to 6 arguments for each syscall to the BPF filter that does the work (this is classic BPF, not eBPF). Great! 6 arguments is more than most syscalls use anyway. So I should be able to just get whatever the user sent to the syscall and filter based on that!

Right? Nope.

We can’t dereference pointers

Some syscall args are simple ints, but most of the interesting data, such as the value that distinguishes setting SO_PRIORITY to 2 from setting it to 99, is hidden behind a user pointer. In more complex cases, this is a pointer to a buffer of variable-length netlink messages. Even in the simple case, however, there’s no way today for a seccomp filter (and therefore for libseccomp) to dereference the pointer. So we’re left with looking at ints and pointer values, but not the data the pointers point at.

eBPF has helpers for reading data from userspace, but seccomp filters are classic BPF programs and cannot use them, so libseccomp has nothing to build on here.
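
A rough illustration of what the filter gets to look at, using ctypes to stand in for the caller's memory. The function seccomp_view_of_setsockopt is a hypothetical name, and the fd number 3 is made up; the point is only that the fourth argument is an address, never the integer stored there.

```python
import ctypes

SOL_SOCKET = 1    # generic Linux value
SO_PRIORITY = 12  # generic Linux value


def seccomp_view_of_setsockopt(fd, level, optname, optval):
    """Return the six raw argument words a filter would see (sketch only)."""
    return (
        fd,
        level,
        optname,
        ctypes.addressof(optval),  # a pointer value, not the priority itself
        ctypes.sizeof(optval),
        0,                         # unused sixth argument
    )


low = ctypes.c_int(2)    # harmless priority
high = ctypes.c_int(99)  # priority that needs CAP_NET_ADMIN

view_low = seccomp_view_of_setsockopt(3, SOL_SOCKET, SO_PRIORITY, low)
view_high = seccomp_view_of_setsockopt(3, SOL_SOCKET, SO_PRIORITY, high)

# The two calls differ only in an opaque address.
print(view_low)
print(view_high)
```

Neither 2 nor 99 appears as a filterable argument, so the two calls look equivalent to the filter apart from the pointer value.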

We have no state

Even if we could get the actual flags being set on an interface, for example, we don’t have visibility of the current state of the interface flags. We can’t tell the difference between “Add IFF_PROMISC” and “There already is IFF_PROMISC, so keep that and add IFF_UP”.
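
The ambiguity can be shown with plain bit arithmetic. This sketch assumes the usual Linux flag values (IFF_UP = 0x1, IFF_PROMISC = 0x100); newly_set is a hypothetical helper that needs the current flags, which is exactly the information a stateless filter lacks.

```python
IFF_UP = 0x1
IFF_PROMISC = 0x100


def newly_set(current: int, requested: int) -> int:
    """Bits that the request turns on which were not on before."""
    return requested & ~current


# The same requested flag word, arriving in two different situations.
requested = IFF_UP | IFF_PROMISC

already_promisc = IFF_PROMISC  # interface is promiscuous, request only adds UP
already_up = IFF_UP            # interface is up, request adds PROMISC

print(hex(newly_set(already_promisc, requested)))
print(hex(newly_set(already_up, requested)))
```

Both requests carry an identical flag word, so a filter that sees only the request cannot separate the harmless case from the one we want to block.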

Libseccomp rule order limitations

For those few calls we can theoretically limit, such as “Prevent setsockopt SO_DEBUG from being set at all”, the format of the json file (which cri-o translates into libseccomp API calls, which libseccomp in turn compiles into a BPF program that is sent to the kernel) makes multiple-argument filtering and rule specification non-trivial.

The default seccomp.json policy is:

  • By default return EPERM for any syscall
  • If you see a syscall matching “setsockopt” (or about 300 others), allow it through

It appears this allow-list has been carefully groomed to avoid “bad” syscalls and only allow “safe” ones. Great!

You can specify multiple rules, which are ORed together. This implies a first-match scheme, but this needs to be confirmed. Each rule can have 0 or more specific argument matchers, which are all ANDed together for that rule.
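
As a mental model only, the semantics described above can be written as a toy evaluator. This is not libseccomp's actual algorithm; it simply encodes the assumptions in the text (rules ORed, matchers within a rule ANDed, default action otherwise), and the three-entry allow-list is a stand-in for the real one.

```python
def arg_matches(matcher: dict, args: list) -> bool:
    """Evaluate one argument matcher (only EQ and NE are modelled here)."""
    value = args[matcher["index"]]
    if matcher["op"] == "SCMP_CMP_EQ":
        return value == matcher["value"]
    if matcher["op"] == "SCMP_CMP_NE":
        return value != matcher["value"]
    raise ValueError(f"operator not modelled: {matcher['op']}")


def evaluate(profile: dict, name: str, args: list) -> str:
    """Return the action for a syscall under the toy semantics."""
    for rule in profile["syscalls"]:
        # All matchers of a rule must hold (AND); any rule may match (OR).
        if name in rule["names"] and all(
            arg_matches(m, args) for m in rule.get("args", [])
        ):
            return rule["action"]
    return profile["defaultAction"]


default_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        # Placeholder for the roughly 300 names in the real allow-list.
        {"names": ["read", "write", "setsockopt"], "action": "SCMP_ACT_ALLOW"},
    ],
}

print(evaluate(default_profile, "setsockopt", [3, 1, 1, 0, 4, 0]))
print(evaluate(default_profile, "reboot", [0, 0, 0, 0, 0, 0]))
```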

So, assuming a first-match implementation, I should just be able to prepend a rule that says “If you see setsockopt with arg[2] == 1, that should ERRNO”, and then any other call should fall-through to the next set of rules, which consist of the pre-existing allow-list.

Right? No!

Rules cannot match the default action

libseccomp complains and won’t proceed if a rule has the same action as the default action. Why? I don’t know. (Future investigation: Is this a limit of libseccomp or of the kernel seccomp BPF machinery?)

Okay, so then can I remove “setsockopt” specifically from the allow-list rule, and add one rule that says “Allow setsockopt if arg[2] != 1”? Yes! Good! I’ve filtered out setsockopt for SO_DEBUG, but not touched any other setsockopt invocations! A problem has been (partially) solved!
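
A sketch of what that change might look like, generated from Python so the structure is easy to see. The field names follow the seccomp.json layout used by the container runtimes ("names", "action", "args" with "index", "value", "op"); the short allow-list here is a placeholder, not the real one.

```python
import json

SO_DEBUG = 1  # generic Linux value

# Placeholder for the original allow-list with "setsockopt" taken out.
allow_list_without_setsockopt = ["read", "write", "socket"]

profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": allow_list_without_setsockopt,
            "action": "SCMP_ACT_ALLOW",
            "args": [],
        },
        {
            # Allow setsockopt only when optname (arg index 2) is not SO_DEBUG.
            "names": ["setsockopt"],
            "action": "SCMP_ACT_ALLOW",
            "args": [{"index": 2, "value": SO_DEBUG, "op": "SCMP_CMP_NE"}],
        },
    ],
}

print(json.dumps(profile, indent=2))
```

One caveat worth keeping in mind: option numbers are only unique within a level, so a rule keyed on arg[2] alone also affects option 1 at levels other than SOL_SOCKET.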

Now, how do I write a policy that filters out both SO_DEBUG and SO_PRIORITY? The arg matchers inside a single seccomp rule clause are logically ANDed together, so I should just be able to add a second arg matcher that says “AND if arg[2] != 12”.

Right? No!

seccomp.json parser converts ANDs to ORs sometimes

cri-o’s json-to-libseccomp engine sees 2 arg matchers for the same arg index and decides, oddly, “I assume you mean to OR these, not AND these, so I’ll do that for you”. Libseccomp itself doesn’t seem to like the idea of multiple matchers for the same index, either. Why? I don’t know. It would make sense if the multiple matchers for the same arg all used ‘SCMP_CMP_EQ’, since AND-ing these together would be a programming error, and maybe we could assume the seccomp profile author got it wrong and meant OR instead. But AND of the same arg makes sense for all the other match operators.

In other words there are matching primitives to check equality, non-equality, gt, lt, ge, le, and mask-equals, but this prohibition on having the same argument index in the argument list more than once means there’s no way to combine these to represent “not equal to 13 and not equal to 42” or “greater than 0 and less than 6”, even though by the seccomp.json syntax it seems like it should be possible.
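
Why the silent AND-to-OR conversion matters is easiest to see by evaluating both readings of the same two matchers. The following is plain boolean logic, not cri-o or libseccomp code, and the option values are the generic Linux ones.

```python
SO_DEBUG = 1
SO_PRIORITY = 12
SO_REUSEADDR = 2  # an option we do want to keep working


def allowed_with_and(optname: int) -> bool:
    """What the profile author meant: neither SO_DEBUG nor SO_PRIORITY."""
    return optname != SO_DEBUG and optname != SO_PRIORITY


def allowed_with_or(optname: int) -> bool:
    """What the two matchers mean once they are ORed together."""
    return optname != SO_DEBUG or optname != SO_PRIORITY


for optname in (SO_DEBUG, SO_PRIORITY, SO_REUSEADDR):
    print(optname, allowed_with_and(optname), allowed_with_or(optname))
```

No value can equal 1 and 12 at once, so the ORed form is true for every option and the rule ends up allowing everything it was meant to block.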

Inverting the default seccomp.json to allow-by-default

So the only way to actually deny multiple setsockopt calls is to invert the default seccomp.json: give it a default “ALLOW” action, explicitly deny everything the original seccomp file did not allow, and then tack on our own additional restrictions, each problematic option value in its own rule with its own arg matcher.

So let’s invert that seccomp.json first...

Denied Syscalls

The following syscalls are denied according to the docker documentation from which the original seccomp.json originates: https://docs.docker.com/engine/security/seccomp/#significant-syscalls-blocked-by-the-default-profile

To support the true inverse of the original seccomp.json, though, there is more work to do. Some of these are conditionally allowed based on the presence of specific CAP_* capabilities, some based on specific argument flags, and the list on the webpage is not exhaustive. Care would need to be taken to ensure that we’re truly denying the same set of syscalls as the default seccomp.json, and the best tactic would be to have auto-generation based on the actual list of syscalls defined in the appropriate linux kernel headers.

Clone flags

Clone is allowed as long as none of the bits in 2080505856 = 0x7C020000 are set. What are these?

#define CLONE_NEWNS     0x00020000   /* New mount namespace group */   /* =     131072 */
#define CLONE_NEWUTS    0x04000000   /* New utsname namespace */       /* =   67108864 */
#define CLONE_NEWIPC    0x08000000   /* New ipc namespace */           /* =  134217728 */
#define CLONE_NEWUSER   0x10000000   /* New user namespace */          /* =  268435456 */
#define CLONE_NEWPID    0x20000000   /* New pid namespace */           /* =  536870912 */
#define CLONE_NEWNET    0x40000000   /* New network namespace */       /* = 1073741824 */

Important! If you blanket-deny clone, instead of restricting only these troublesome bits, cri-o cannot start up the container!
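
The arithmetic behind that mask can be checked in a few lines. The flag values are the ones quoted above; clone_allowed is a hypothetical helper mirroring the "none of these bits set" test, and CLONE_VM and SIGCHLD are included only to show an ordinary flag word passing.

```python
CLONE_NEWNS = 0x00020000
CLONE_NEWUTS = 0x04000000
CLONE_NEWIPC = 0x08000000
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000
CLONE_NEWNET = 0x40000000

NAMESPACE_MASK = (
    CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC
    | CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET
)

CLONE_VM = 0x00000100  # ordinary, non-namespace clone flag
SIGCHLD = 17           # exit signal carried in the low byte of the flags


def clone_allowed(flags: int) -> bool:
    """True when none of the namespace-creating bits are present."""
    return (flags & NAMESPACE_MASK) == 0


print(hex(NAMESPACE_MASK), NAMESPACE_MASK)
print(clone_allowed(CLONE_VM | SIGCHLD))
print(clone_allowed(CLONE_NEWNET | SIGCHLD))
```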

Personality flags

The personality syscall is partially restricted to allow exactly the following flags for arg0:

a == 0 || a == 8 || a == 0x20000 || a == 0x20008 || a == 0xffffffff
  • 0x0 and 0x8 are Linux and Linux32. The intent here seems to be to exclude all non-Linux compatibility modes. Could mask 0xf7 to replicate.
  • 0x20000 is UNAME26 compatibility; easy to mask in addition to 0xf7
  • 0xffffffff is abused as “get current personality”

This cannot be fully emulated with the current libseccomp restrictions if the default profile is ALLOW, as there is no simple mask to capture the inverse flag set due to the need to allow 0xffffffff.
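
A small sketch of why a single mask falls short. exact_allow reproduces the five-value allow set quoted above; mask_only is a hypothetical "only bits 0x8 and 0x20000 may be set" test of the kind a mask-based rule could express.

```python
ALLOWED = {0x0, 0x8, 0x20000, 0x20008, 0xFFFFFFFF}
PERMITTED_BITS = 0x20008  # Linux32 bit plus UNAME26 bit


def exact_allow(persona: int) -> bool:
    """The original rule: one of five exact values."""
    return persona in ALLOWED


def mask_only(persona: int) -> bool:
    """Mask-style approximation: no bits outside PERMITTED_BITS."""
    return (persona & ~PERMITTED_BITS & 0xFFFFFFFF) == 0


for persona in sorted(ALLOWED) + [0x0800000]:
    print(hex(persona), exact_allow(persona), mask_only(persona))
```

The two tests agree on every value except 0xffffffff, the "get current personality" query, which the mask rejects because it has every bit set.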

I haven’t addressed this yet, so the sample seccomp profile below won’t allow executables in containers working in Linux32 or Uname26 compatibility mode.

Redundant seccomp and capability restrictions

There’s more that the default seccomp.json does that I didn’t have time to fully flesh out. For example, rather than explicitly denying ‘reboot’ all the time, it is allowed if the CAP_SYS_BOOT capability is set, and there are many other sets of syscalls that are also limited in this way. I’m not sure why this is: if the capabilities are restricted properly, these syscalls will also be restricted, so it seems oddly redundant to restrict them with seccomp at all. It should therefore be safe to remove all of these from the deny list in our inverted seccomp.

The result of the inversion

Inverted-seccomp.json

Testing inverted-seccomp.json

Using a 4.6.0-daily snapshot, let’s sub in the custom inverted seccomp.json (that is, the part before the specific NET_ADMIN prohibitions) and see how it fares against the ose cluster conformance tests.

Test Setup

Run using the latest ose-tests framework on a cluster with 3 virtual masters, 1 virtual worker, 1 bare-metal worker.

podman run -ti -v /root/ocp/auth:/etc/kubernetes:Z -e KUBECONFIG=/etc/kubernetes/kubeconfig registry.redhat.io/openshift4/ose-tests:latest openshift-tests run $SUITE_NAME

Test Results

 

Configuration   openshift/conformance/parallel   openshift/conformance/serial
Baseline        19 fail, 895 pass, 1439 skip     5 fail, 58 pass, 213 skip
Inverted        19 fail, 895 pass, 1439 skip     5 fail, 58 pass, 213 skip

Baseline: A cluster with the original seccomp.json

Inverted: Results with our inverted seccomp.json (minus the NET_ADMIN-specific prohibitions)

Failed tests

Parallel suite
[k8s.io] [sig-node] Events should be sent by kubelets and the scheduler about pods scheduling and running [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]
[sig-arch] Managed cluster should ensure control plane pods do not run in best-effort QoS [Suite:openshift/conformance/parallel]
[sig-arch] Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel]
[sig-auth][Feature:OAuthServer] [Token Expiration] Using a OAuth client with a non-default token max age to generate tokens that do not expire works as expected when using a code authorization flow [Suite:openshift/conformance/parallel]
[sig-auth][Feature:OAuthServer] [Token Expiration] Using a OAuth client with a non-default token max age to generate tokens that do not expire works as expected when using a token authorization flow [Suite:openshift/conformance/parallel]
[sig-auth][Feature:OAuthServer] [Token Expiration] Using a OAuth client with a non-default token max age to generate tokens that expire shortly works as expected when using a code authorization flow [Suite:openshift/conformance/parallel]
[sig-auth][Feature:OAuthServer] [Token Expiration] Using a OAuth client with a non-default token max age to generate tokens that expire shortly works as expected when using a token authorization flow [Suite:openshift/conformance/parallel]
[sig-auth][Feature:OpenShiftAuthorization] The default cluster RBAC policy should have correct RBAC rules [Suite:openshift/conformance/parallel]
[sig-builds][Feature:Builds] prune builds based on settings in the buildconfig should prune builds after a buildConfig change [Suite:openshift/conformance/parallel]
[sig-builds][Feature:Builds] prune builds based on settings in the buildconfig should prune canceled builds based on the failedBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]
[sig-builds][Feature:Builds] prune builds based on settings in the buildconfig should prune errored builds based on the failedBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]
[sig-builds][Feature:Builds] prune builds based on settings in the buildconfig should prune failed builds based on the failedBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]
[sig-cli] oc adm must-gather runs successfully [Suite:openshift/conformance/parallel]
[sig-devex][Feature:ImageEcosystem][mongodb] openshift mongodb image creating from a template should instantiate the template [Suite:openshift/conformance/parallel]
[sig-instrumentation] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics [Suite:openshift/conformance/parallel]
[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]
[sig-network] Internal connectivity for TCP and UDP on ports 9000-9999 is allowed [Suite:openshift/conformance/parallel]
[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-network] Services should create endpoints for unready pods [Suite:openshift/conformance/parallel] [Suite:k8s]
Serial suite
[sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] [Suite:openshift/conformance/serial]
[sig-auth][Feature:OpenShiftAuthorization][Serial] authorization TestAuthorizationResourceAccessReview should succeed [Skipped:ibmcloud] [Suite:openshift/conformance/serial]
[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]
[sig-operator][Feature:Marketplace] Marketplace diff name test [ocp-25672] create the samename opsrc&csc [Serial] [Suite:openshift/conformance/serial]
[sig-operator][Feature:Marketplace] Marketplace resources with labels provider displayName [ocp-21728] create opsrc with labels [Serial] [Suite:openshift/conformance/serial]

Remaining Concerns

When creating the custom inverted seccomp profile, I did not do an exhaustive search of all possible syscalls; instead I based the “list of syscalls that should be blocked” on the docker seccomp page. This means there are some syscalls blocked by default by the original seccomp.json which are not blocked by the inverted seccomp profile. A deeper security review is required to ensure that these are not dangerous. If they are, it is easy to add them to the deny list in the inverted profile.

Adding our own custom rules

Finally, with our (mostly) inverted seccomp.json in hand, we can tack on a set of setsockopt rules to deny the specific calls we intend: SO_PRIORITY (all values, due to the inability to dereference the pointer where the actual value lives), SO_DEBUG, SO_RCVBUFFORCE, and SO_SNDBUFFORCE. Here’s the end result, tested via a custom golang snippet that tries to set these options on a UDP socket:

example-seccomp.json
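
The file itself is not reproduced here. As a rough sketch of the shape such rules might take (not the contents of the actual file), the four deny rules can be generated like this, one rule per option so that no rule needs two matchers on the same arg index. The option numbers are the generic Linux values, and the syscall deny list from the inversion step is left out.

```python
import json

# Generic Linux option values (assumption: asm-generic numbering).
DENIED_OPTIONS = {
    "SO_DEBUG": 1,
    "SO_PRIORITY": 12,
    "SO_SNDBUFFORCE": 32,
    "SO_RCVBUFFORCE": 33,
}

setsockopt_rules = [
    {
        "names": ["setsockopt"],
        "action": "SCMP_ACT_ERRNO",
        # optname is the third setsockopt argument, so arg index 2.
        "args": [{"index": 2, "value": value, "op": "SCMP_CMP_EQ"}],
    }
    for value in DENIED_OPTIONS.values()
]

profile = {
    "defaultAction": "SCMP_ACT_ALLOW",
    # The denied-syscall rules from the inverted profile would go here too.
    "syscalls": setsockopt_rules,
}

print(json.dumps(profile, indent=2))
```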

Testing our custom rules

Test Setup

I spun up a custom pod based on a recent fedora release, with CAP_NET_ADMIN granted. I then wrote a simple golang snippet that attempts to exercise the problematic syscalls.

https://github.com/lack/redhat-notes/tree/main/seccomp_netadmin/testing
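
For readers who want a feel for the test without following the link, here is a loose Python approximation of the same idea. It is not the golang code from the repository; the option numbers are the generic Linux values, the buffer size 65536 is arbitrary, and try_option is a hypothetical helper.

```python
import socket

SOL_SOCKET = 1
SO_DEBUG, SO_PRIORITY, SO_SNDBUFFORCE, SO_RCVBUFFORCE = 1, 12, 32, 33

CHECKS = [
    ("set priority to 2", SO_PRIORITY, 2),
    ("set priority to 7", SO_PRIORITY, 7),
    ("set priority to -1", SO_PRIORITY, -1),
    ("set debug mode", SO_DEBUG, 1),
    ("force RCV buffer", SO_RCVBUFFORCE, 65536),
    ("force SND buffer", SO_SNDBUFFORCE, 65536),
]


def try_option(optname: int, value: int) -> str:
    """Attempt one setsockopt on a fresh UDP socket and report the outcome."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.setsockopt(SOL_SOCKET, optname, value)
    except OSError as err:
        return f"error: {err.strerror}"
    return "ok"


for label, optname, value in CHECKS:
    print(f"Trying to {label}: {try_option(optname, value)}")
```

What each line reports depends on the capabilities and seccomp profile of the process that runs it.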

Test Results

inverted-seccomp.json
# oc exec nettestpriv -- /bin/setsockopt
Starting the test
Trying to set priority to 2
Successfully set priority to 2
Trying to set priority to 7
Successfully set priority to 7
Trying to set priority to -1
Successfully set priority to -1
Trying to set debug mode
Successfully set debug mode
Trying to force RCV buffer
Successfully forced RCV buffer
Trying to force SND buffer
Successfully forced SND buffer

example-seccomp.json

# oc exec nettestlimit -- /bin/setsockopt
Starting the test
Trying to set priority to 2
Setting priority to 2 returned error: operation not permitted
Trying to set priority to 7
Setting priority to 7 returned error: operation not permitted
Trying to set priority to -1
Setting priority to -1 returned error: operation not permitted
Trying to set debug mode
Setting debug mode returned error: permission denied
Trying to force RCV buffer
Forcing RCV buffer returned error: operation not permitted
Trying to force SND buffer
Forcing SND buffer returned error: operation not permitted

*whew*