June 09, 2016

Managing configuration drift with Salt and Snapper

Introduction

Many configuration management tools originate in the DevOps space and become immensely popular and while they do manage configuration, they are tailored towards deployment of new servers using this configuration and not towards auditing of existing servers.

For example, lets imagine a server with the following state:

/etc/motd:
  file.managed:
    - source: salt://common/motd

If we apply this state (in test mode) on a non-compliant server:

$ salt minion1 state.apply test=True
minion1:
----------
          ID: /etc/motd
    Function: file.managed
      Result: None
     Comment: The file /etc/motd is set to be changed
     Started: 10:06:05.021643
    Duration: 30.339 ms
     Changes:
              ----------
              diff:
                  ---
                  +++
                  @@ -1 +1 @@
                  -Have a lot of fun...
                  +This is my managed motd

Summary for minion1
------------
Succeeded: 1 (unchanged=1, changed=1)
Failed:    0
------------
Total states run:     1

Salt is able to tell us that there is a file that deviates from the configuration. And we can easily fix it by just removing test=True.

Now, lets say an intruder adds a malicious entry to /etc/hosts:

192.168.1.34    www.google.com

If we re run our state in test mode:

$ salt minion1 state.apply test=True
minion1:
----------
          ID: /etc/motd
    Function: file.managed
      Result: None
     Comment: The file /etc/motd is set to be changed
     Started: 10:12:11.518105
    Duration: 29.479 ms
     Changes:
              ----------
              diff:
                  ---
                  +++
                  @@ -1 +1 @@
                  -Have a lot of fun...
                  +This is my managed motd

Summary for minion1
------------
Succeeded: 1 (unchanged=1, changed=1)
Failed:    0
------------
Total states run:     1

As expected, it did not find anything, because this rule is not in the configuration.

Creating new systems vs auditing existing systems

This model works fine in the DevOps world where the culture is to take a random Linux image from the internet and use it as a base to deploy systems from scratch. As long as all tests pass, replacing the underlying image is not a problem. Only what is explicitly defined is evaluated against the configuration and defined as a drift.

When meeting enterprise customers who are starting to use configuration management to improve the control on their infrastructure, it turns out their expectations where different. “If I use Salt, will it tell me when somebody makes a change to the system?”. “Ugh.. no… well depends…”.

Baselines

That was the point that I started to think about - how could we use the state system to do more generic auditing and how do you do it without ruining the experience of working with states? Then it all clicked – implicit state can be done explicitly by using another state. When the customer said “any change”, they were in reality saying “any change against my defined configuration” plus “any change since my last working configuration”.

So, we needed a way to manage “last working configuration” and turns out SUSE is where Snapper originated and Snapper is nowadays available with most Linux distributions.

Snapper is a set of tools over snapshots (mostly btrfs, but also works on others like ext4 if you have the required kernel/tool patches). Think of it of what docker did to containers, snapper does to snapshots. It adds the required workflows, terminology and tools to make them usable.

It also turns out that my system already has some snapshots, because just like I can manually take one, tools like YaST and zypper take snapshots before and after doing operations. You can even select previous snapshots from the bootloader and boot into the previous working system.

What if I could describe a state in Salt that said: “Nothing deviates from this snapshots, except….”.

Let’s do it

So during this year Department workshop I paired with Pablo and our project had the following steps:

  • Complete the Salt execution module to expose the basic snapper operations you can do from the command line. Example:
salt minion1 snapper.create_snapshot
  • Create a generic way for sysadmins to do Salt operations which can be reverted. We implemented this as a meta-call (a call taking another call as a parameter) snapper.run. So you can do something like:
$ salt minion2 snapper.run function=file.append args='["/etc/motd", "some text"]'
minion2:
    Wrote 1 lines to "/etc/motd"

This will generate a snapshot before running the command, run the command and then take a snapshot afterwards, also adding metadata about the Salt job that did the change:

...
pre    | 21 |       | Thu Jun  9 10:34:36 2016 | root | number  | salt job 20160609103437556668 | salt_jid=20160609103437556668
post   | 22 | 21    | Thu Jun  9 10:34:37 2016 | root | number  | salt job 20160609103437556668 | salt_jid=20160609103437556668

Because in Salt, state is implemented as a method state.apply or state.highstate, calling snapper.run function=state.apply means you can rollback a failed state.apply.

And of course we not only exposed snapper.diff which takes the snapshot number but also a snapper.diff_jid which tells you what a Salt job changed:

$ salt minion2 snapper.diff_jid 20160609103437556668
minion2:
    ----------
    /etc/motd:
        --- /.snapshots/21/snapshot/etc/motd
        +++ /.snapshots/22/snapshot/etc/motd
        @@ -1 +1,2 @@
         Have a lot of fun...
        +some text

Additionally, you get snapper.undo_jid which you can guess what it does: it undoes the changes done by a specific salt job (which of course could be a state.apply run).

  • And finally, allowing a system administrator to use snapshots as a baseline to apply state. Lets take the original example with the malicious user modifying `/etc/hosts’, we will add a snapper state rule:
my_baseline:
  snapper.baseline_snapshot:
    - number: 20
    - ignore:
      - /var/log
      - /var/cache

/etc/motd:
  file.managed:
      - source: salt://common/motd

Now we apply the state in test mode again:

$ salt minion1 state.apply test=True
minion1:
----------
          ID: my_baseline
    Function: snapper.baseline_snapshot
      Result: None
     Comment: 1 files changes are set to be undone
     Started: 12:20:24.899848
    Duration: 1051.996 ms
     Changes:
              ----------
              files:
                  ----------
                  /etc/hosts:
                      ----------
                      actions:
                          - modified
                      comment:
                          text file
                      diff:
                          --- /etc/hosts
                          +++ /.snapshots/21/snapshot/etc/hosts
                          @@ -22,5 +22,3 @@
                           ff02::3         ipv6-allhosts


                          -192.168.1.34    www.google.com
                          -
----------
          ID: /etc/motd
    Function: file.managed
      Result: None
     Comment: The file /etc/motd is set to be changed
     Started: 12:20:25.953348
    Duration: 20.425 ms
     Changes:
              ----------
              diff:
                  ---
                  +++
                  @@ -1 +1 @@
                  -Have a lot of fun...
                  +This is my managed motd

Summary for minion1
------------
Succeeded: 2 (unchanged=2, changed=2)
Failed:    0
------------
Total states run:     2

Exactly what we expect!.

Conclusions

So with this you can use your configuration management to manage your state against a defined state and on top of that we give you the tooling to inspect and rollback configuration changes.

We will continue adding the missing pieces to give the administrators full overview and control over their running systems.

You can find our current work in this github repository. We plan of course to send it upstream once the design and implementation settles down.