On Normativity in Configuration Management

This document explores the design of host orchestration approaches. In particular, the following issues receive central attention:

Methods for Operating System Deployment:
- PXE to conduct OS installs (stateful OS);
- PXE to run OS in RAM, with disks for data only (stateless OS);
- Use of IPMI and similar technologies.

Methods for Operating System Management:
- Push-based configuration management tools (ssh, Ansible, cdist);
- Pull-based, agent-based configuration management tools (cfengine, Puppet, Chef, etc.).

It hopefully goes without saying, in this day and age, that anyone deploying operating systems in a non-automated fashion is “living in a state of sin”, so to speak. Automated, uniform operating system deployment aids the homogeneity of system configuration, which itself aids manageability. If operating system deployment is not automated, little can be assumed about the base state of any component system.

However, configuration requirements change. Automated operating system deployment solves configuration management and deployment problems, but only at T=0, the time of installation. As requirements change after initial deployment, means must be found to introduce these differential changes into hosts on a mass basis, while maximizing the reliability of the change process and the homogeneity and predictability of the resulting system state.

It is useful to consider the state of a distributed computing system in terms of normativity. The configuration of a distributed computing system can be defined in terms of some minimal data set, from which all other necessary data is derived. This is the normative data. A parallel can be drawn with the concept of Kolmogorov complexity, the idea that the complexity of data is most meaningfully expressed not in terms of its size, but in terms of the length of the shortest program that produces it.

For example, if a file on host A (some sort of master) is to be copied to all other machines, then while all hosts end up with a copy of the file, only the copy on host A is normative.

Any non-normative data is by comparison disposable. It can be destroyed without serious consequence because all non-normative data is derived from the normative data and can thus be rederived. The costs of re-deriving non-normative data may vary, however.
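The disposability of derived data can be illustrated with a minimal sketch. It assumes a hypothetical layout in which one directory holds the normative files and another holds a derived copy; the function name and both directory arguments are illustrative only and do not correspond to any particular tool.

```sh
#!/bin/sh
# rederive NORMATIVE_DIR DERIVED_DIR
# Hypothetical helper: both directory names are assumptions for illustration.
rederive() {
    normative_dir=$1
    derived_dir=$2

    # Refuse to run with an empty target rather than risk removing the wrong path.
    [ -n "$derived_dir" ] || return 1

    # Destroy the derived copy, including any local edits made to it...
    rm -rf "$derived_dir"
    mkdir -p "$derived_dir"

    # ...and rebuild it purely from the normative data.
    cp -R "$normative_dir"/. "$derived_dir"/
}
```

Any edit made directly to the derived directory is lost on the next call, which is exactly the property that makes the derived copy safe to destroy. The cost of rederivation here is one recursive copy; in a real system it may be far higher.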

Whether normative data always exists in a machine-readable form depends on how strictly the concept of normativity is interpreted. If a human specifies in English a policy to which a distributed computing system must conform, and the machine-readable data implementing this policy is derived from the policy in English, then the policy document constitutes the true normative root of the applicable system state, but the process for deriving non-normative data involves unavoidable human labour. This is a cost of rederivation. In practice, however, normative data shall usually be interpreted as the machine-readable data closest to the original expression of intent.

The concept of normativity is useful because it allows determinations to be made about what system state can be disposed of (and then presumably automatically reconstructed), and because it allows humans to make determinations as to what configuration data needs to be kept under version control.

If no normative data is kept on a machine, the OS state on that machine may be safely obliterated, because it can be reconstructed from the normative data.

If a change is made to non-normative data which is not first reflected in the normative data, the normative and non-normative data cease to be in correspondence. Generally, the conversion of changes in non-normative data to changes in normative data which generate exactly those changes is not an automatable task (or if it is, it is a research question). Therefore, system manageability requires that changes to non-normative data are eliminated as a factor. All state must flow from the normative data; changes to non-normative data other than by derivation from normative data must be wholly eliminated. (Or, at the very least, the routine obliteration of such changes must be acceptable; but it is preferable to eliminate such influences at source.)

This leads to the interesting concept of a distributed computer system defined by the normative data set existing on a single host. Since, as noted above, requirements change, so too will the normative data set; thus, there must be some means of propagating each differential change in the normative data set as a differential change to the corresponding non-normative data derived from it.

One way to accomplish this is to concentrate all normative data on a single host which is the root of all normative data (the Root), making the system state on any other machine disposable. If PXE is used to boot machines to RAM, the process of initiating re-derivation becomes trivial; all non-Root hosts need simply be rebooted. Otherwise, automated OS installation methods can be re-triggered.
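Where hosts have IPMI-capable management controllers, triggering rederivation by reboot might be sketched as follows. This is an illustration under stated assumptions, not a recommended procedure: the `admin` user, the BMC hostnames and the `IPMI_TOOL` override are hypothetical, and the sketch assumes `ipmitool` is installed and that the password is supplied in the `IPMI_PASSWORD` environment variable (which is what its `-E` option reads).

```sh
#!/bin/sh
# Assumptions for illustration: ipmitool is available, BMCs speak IPMI over LAN,
# and the "admin" user exists on each BMC.
IPMI_TOOL=${IPMI_TOOL:-ipmitool}
IPMI_USER=${IPMI_USER:-admin}

# rederive_host BMC_ADDRESS
# Ask the host to boot from the network next time, then power-cycle it.
rederive_host() {
    bmc=$1
    "$IPMI_TOOL" -I lanplus -H "$bmc" -U "$IPMI_USER" -E chassis bootdev pxe &&
    "$IPMI_TOOL" -I lanplus -H "$bmc" -U "$IPMI_USER" -E chassis power cycle
}
```

A call such as `rederive_host bmc-node1.example.internal` would then discard whatever state that host held and rebuild it from the Root on the next boot.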

This does of course interrupt any service being provided by that host. For this reason, an architecture such as this works best when combined with a modern service design philosophy that emphasizes resilience and the disposability of individual hosts. But more generally these shall be known as the costs of rederivation. In extreme cases, total rederivation across an entire distributed computing system might have prohibitively or undesirably high computational cost.

If it is not desirable to have to effect all changes by rederivation, or the total elimination of normative data from non-Root hosts is not feasible or desired, then messier solutions are necessary. Perhaps there is no single Root, and so the system state of any given machine cannot be assumed to be disposable; or perhaps there is a general desire for a more differential means of deploying configuration, as opposed to the total non-differential rederivation of non-normative data (by rebooting or reinstalling), perhaps to minimize service disruption.

This then leads one into the territory of popular configuration management solutions. These solutions can be categorised by their approach to the dissemination of differential non-normative state:

- Push-based solutions involve a process triggered on the Root (or some analogue to it), to which non-Root hosts respond. The Root selects the hosts it wishes to disseminate to. The trigger could be human, chronological or otherwise.¹
- Pull-based solutions involve a daemon running on all non-Root hosts which, periodically or upon some means of notification, retrieves configuration information from the Root and processes it. Generally, this process does not involve the blind rederivation of non-normative data, but the application of a policy which introduces zero or more differential changes to local system state (non-normative data); though rederivation may be used in small, focused areas (such as configuration files produced from a template).

An argument against push-based solutions is that they are error-prone, since there is potentially an aspect of their operation (which hosts to push to) which is left to human deliberation; therefore, the extent to which a change to the normative data has actually been deployed is left in question. Machines could be left out of a change, either by accident, or simply because they were not available when the change was pushed, for example because of a network outage or because they were turned off.
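The failure mode described above can be made concrete with a minimal push loop over ssh. This is a sketch under assumptions: the host names, the `PUSH_CMD` override and the function name are hypothetical, and key-based ssh access to each host is assumed.

```sh
#!/bin/sh
# PUSH_CMD is an illustrative override; by default the real ssh client is used.
PUSH_CMD=${PUSH_CMD:-ssh}

# push_change SCRIPT HOST...
# Run SCRIPT on each listed host and remember which hosts were missed.
push_change() {
    script=$1
    shift
    missed=""
    for host in "$@"; do
        # BatchMode avoids hanging on a password prompt; ConnectTimeout
        # bounds the wait for hosts that are down or unreachable.
        if ! "$PUSH_CMD" -o BatchMode=yes -o ConnectTimeout=5 "$host" sh < "$script"; then
            missed="$missed $host"
        fi
    done
    if [ -n "$missed" ]; then
        echo "change not applied on:$missed" >&2
        return 1
    fi
}
```

Note that the host list is supplied by the operator on every invocation, and that a host recorded in `missed` stays out of date until somebody pushes again. Both are instances of the human deliberation the argument objects to.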

On the other hand, pull-based solutions require more rigorous means to identify the intended targets for a change, since otherwise all participating hosts will effect the change. But it is just as arguable that this would need to be documented anyway, and putting this documentation in a machine-readable form is desirable. (This is not to say in any way that this eliminates the need for separate documentation for humans.) Pull-based solutions may also require more hardware to serve hosts, as the load placed upon the Root cannot be controlled directly as it can when using push-based solutions.

A trivial pull-based solution can be constructed on a UNIX system by writing a shell script which retrieves a shell script from an HTTP (or FTP, TFTP, rsync, etc.) server and executes it, and then scheduling that shell script as a cron job. (It would be quite easy to add security to this using a private CA and curl, or by using PGP.) In fact, this solution is so simple it brings the virtues of highly complicated pull-based configuration management systems such as cfengine, Puppet and Chef into question.
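Such a script might look like the following minimal sketch. The URL, the CA certificate path and the `FETCH` override are assumptions made for illustration; the sketch assumes curl is installed and that the Root serves the policy script over HTTPS with a certificate issued by a private CA.

```sh
#!/bin/sh
# All three defaults below are hypothetical and would be site-specific.
POLICY_URL=${POLICY_URL:-https://root.example.internal/policy.sh}
CA_CERT=${CA_CERT:-/etc/pull/ca.pem}
FETCH=${FETCH:-curl}

# Fetch the policy script from the Root and execute it.
pull_and_apply() {
    tmp=$(mktemp) || return 1
    # --fail makes curl exit non-zero on HTTP errors, so an error page is
    # never executed; --cacert trusts only the private CA.
    if "$FETCH" --fail --silent --show-error --cacert "$CA_CERT" \
            --output "$tmp" "$POLICY_URL"; then
        sh "$tmp"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Only act when explicitly asked to, e.g. from cron.
if [ "${1:-}" = "run" ]; then
    pull_and_apply
fi
```

Assuming the script is installed at the hypothetical path `/usr/local/sbin/pull-policy`, a crontab entry such as `*/15 * * * * /usr/local/sbin/pull-policy run` would apply the policy every fifteen minutes. The fetched script is executed in full on every run, so it must itself be idempotent.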

An underlying common theme of such configuration management systems is their declarative nature. They have the user express objectives, and then realise those objectives automatically. The user does not usually express procedures directly.

But for the tasks which seem to be envisioned by cfengine and so on as common, the gain obtained by this indirection is minimal. Basic tutorial examples commonly given for such tools are for things like “make sure file /etc/foo is always chmodded to 644 and owned by root”.

These examples are somewhat disturbing; why was /etc/foo configured incorrectly in the first place? This rule does not seem likely to be for the purpose of implementing a change to normative policy (most configuration files in /etc will have only one correct mode and owner); rather, it suggests a lack of control over system state from the very beginning. But using such a T=n method to correct for what is ultimately a failure to correctly control T=0 state dissemination is nothing but a kludge, however effective it may be.

At any rate, such a change is trivially effected in a procedural manner via the following idempotent shell script:

#!/bin/sh
chmod 644 /etc/foo
chown root:root /etc/foo

The advantages of declarative configuration styles over periodically executing imperative scripts are likely to lie in three areas:

Declarative configuration's increased ability to achieve idempotence.

Since pull-based configuration systems constantly re-execute their policies, it is necessary that the execution of those policies be idempotent. The general idea is to have a gradual convergence to the point of full alignment with the policy expressed at the Root.

It is likely that declarative languages have an enhanced ability to express idempotent policies when compared with expression of policy in an imperative language. While the chmod example above is idempotent, it is not hard to imagine scenarios where imperative expressions of desired ends are difficult to express idempotently.
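One simple scenario of this kind is adding a line to a configuration file. The sketch below contrasts a naive imperative version with an idempotent one; both function names are hypothetical and chosen only for illustration.

```sh
#!/bin/sh
# Not idempotent: every run appends another copy of the line.
append_line() {
    printf '%s\n' "$1" >> "$2"
}

# Idempotent: append only if an identical whole line is not already present.
# grep -x matches whole lines, -F treats the text literally, -q stays silent.
ensure_line() {
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}
```

Running `append_line` from a periodically executed policy would grow the file without bound, whereas `ensure_line` converges after the first run. The difference is a single guard, which is easy to write but also easy to omit.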

Even if idempotency is reasonably achievable in an imperative language, it is easy to forget to account for the need for it. Non-idempotent commands might be accidentally introduced into a policy. In this regard, imperative languages are likely to be more error-prone. And in general, declarative languages may provide higher-level tools with which to express system state, yielding higher productivity than that obtainable with comparable shell scripts.

[Real-world examples of declarative language advantages would be helpful.]

Increased performance as a byproduct of its idempotent modelling.

Since in many cases idempotency may be modelled by avoiding carrying out a change which has already been carried out, a side effect of idempotency may be the enhanced performance with which convergence occurs. This in turn may allow more complex and resource-intensive policies to be expressed than would otherwise be possible.
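The idea of skipping a change that has already been carried out can be sketched in shell as a compare-before-copy step. The function name and the `changed` counter are hypothetical; the sketch assumes the desired file content is already available locally.

```sh
#!/bin/sh
# Number of changes actually made during this run.
changed=0

# install_if_changed SRC DST
# Copy SRC over DST only when DST is missing or its content differs.
install_if_changed() {
    src=$1
    dst=$2
    # cmp -s prints nothing and exits 0 only when both files are byte-identical.
    if cmp -s "$src" "$dst" 2>/dev/null; then
        return 0
    fi
    cp "$src" "$dst" && changed=$((changed + 1))
}
```

On a converged host every call reduces to a read-only comparison, and a counter such as `changed` lets expensive follow-up actions (restarting a service, for instance) be skipped when nothing was modified.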

Reporting capabilities.

Since a declarative language expresses a model of the desired state, it is possible for a system to form a rudimentary understanding of non-compliance with it. This allows for potentially more detailed and useful reporting to administrators on the degree to which a differential policy has been successfully applied.
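A rudimentary form of such reporting can be sketched by separating the description of desired state from the code that checks it. In the sketch below the policy file format (one `MODE PATH` pair per line) and both function names are assumptions made for illustration; it checks file modes only and changes nothing.

```sh
#!/bin/sh
# check_mode PATH MODE
# Succeeds if PATH exists and its permission bits are exactly MODE (octal).
check_mode() {
    [ -n "$(find "$1" -prune -perm "$2" 2>/dev/null)" ]
}

# report POLICY_FILE
# Print one line per policy entry; succeed only if every entry is compliant.
report() {
    failures=0
    while read -r mode path; do
        if check_mode "$path" "$mode"; then
            echo "ok      $path"
        else
            echo "DRIFT   $path (want mode $mode)"
            failures=$((failures + 1))
        fi
    done < "$1"
    [ "$failures" -eq 0 ]
}
```

Because the policy is data rather than a sequence of commands, the same file could drive both a reporting pass like this one and an enforcing pass; an imperative script offers no comparable way to ask how far a host is from compliance without running it.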

It could be argued that failure of automated policy application is likely to indicate a prior failure to adequately control the flow of non-normative data from normative; or perhaps a failure by the policymaker to foresee the emergent consequences of any heterogeneity possessed by the distributed system and the resulting impact on the differential policies so expressed.

T=0 vs T=n

Because T=0 configuration solutions cannot accommodate T=n configuration needs, a separate T=n configuration management solution is adopted. But because the T=0 deployment system and the T=n system are separate, the normative configuration must reside in one or the other, not both. Alternatively, perhaps the configurations of both the T=0 deployment system and the T=n management system are non-normative data derived from some higher normative source. But this derivation process is not likely to be automated, and involves additional complexity.

An appealing approach here is to configure the T=0 system to the minimal degree possible to enable the T=n system to commence operation. Almost all normative data is fed to the T=n system. The T=0 system generates systems with only a vanilla configuration, which is then tailored to requirements by the T=n system.

This approach appears to be the most common one, being implied not only by the common usages of popular pull-based configuration management tools, but also being that implemented by Microsoft's Group Policy. No matter how exclusively a T=n system is relied upon, the need for a T=0 system cannot be avoided. The least labour-intensive approach is likely to be to support a single vanilla profile for T=0 deployments and place all other configuration data in the T=n system.

This solution to the T=0 vs. T=n dichotomy feels rather crude, so one does wonder if there isn't a better way. On the other hand, the pragmatism of the solution does possess a certain elegance.

¹ Google apparently has systems which automatically deploy code committed to the HEAD of their version control repository so long as the unit tests pass. The system is sophisticated enough to roll back these changes automatically if they don't work out.