How does an operating system decide which volume is the "system" volume (a.k.a. "system partition") when it is bootstrapped?
This is the Frequently Given Answer to such questions.
When an operating system is bootstrapped it must decide which volume is to be the system volume. There are two basic ways in which operating systems do this:
The operating system requires that the system administrator explicitly specify the location of the system volume using a configuration setting of some kind. This configuration setting is either a standard one provided by the machine firmware, or a non-standard one that is specific to one particular operating system.
The operating system defaults to using (or at least attempting to use) the boot volume as the system volume.
Explicit configuration through the machine firmware is the approach to locating the system volume that is employed by modern operating system designs. The operating system boot loader, which is invoked by the machine firmware, consults configuration information provided by the machine firmware that specifies the locations of the operating system kernel program image file and of the system volume. The operation of locating the system volume is thus a relatively simple one.
What precise form this configuration information takes depends on the machine firmware.
The configuration information is created and written by the operating system installation utility when the operating system is installed. Modern machine firmwares provide facilities to allow system administrators to edit the configuration information without needing to actually bootstrap an operating system in order to run a configuration editing utility.
On machines with EFI firmwares, the location of the system volume is
determined by the value of a machine firmware variable that is stored in
non-volatile RAM. Each entry on the EFI Boot Manager menu is defined by
the value of a single NVRAM variable, named BootXXXX
where XXXX is a four-digit hexadecimal number. Each variable's value
is a rich binary data structure (the EFI_LOAD_OPTION
structure) that comprises the whole definition.
This binary data structure contains, amongst other things, an array of EFI Device Paths. (An EFI Device Path is EFI firmware's standard general-purpose mechanism for specifying hardware devices, disc volumes, and files.) The first device path specifies the location of the operating system boot loader program image file itself. This is the program image file that the EFI Boot Manager loads and invokes when the entry is selected from the menu. The second and subsequent paths are for use by the operating system boot loader. They specify the device path to the kernel program image file, the device path of the system volume, and so forth.
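The layout just described can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes the commonly documented field order of EFI_LOAD_OPTION (a 32-bit attributes word, a 16-bit length for the device path list, a NUL-terminated UTF-16 description, the packed device paths, then optional data) and the usual four-byte device path node header. The function names are invented for the example.

```python
import struct

END_TYPE = 0x7F            # node type that marks the end of a device path
END_ENTIRE_SUBTYPE = 0xFF  # sub-type that ends a whole path, not just one instance


def split_device_paths(path_list: bytes):
    """Split a packed list into paths, each a list of (type, subtype, data) nodes."""
    paths, nodes, pos = [], [], 0
    while pos + 4 <= len(path_list):
        node_type, subtype, length = struct.unpack_from("<BBH", path_list, pos)
        if length < 4:
            raise ValueError("corrupt device path node")
        if node_type == END_TYPE and subtype == END_ENTIRE_SUBTYPE:
            paths.append(nodes)   # one complete device path
            nodes = []
        else:
            nodes.append((node_type, subtype, path_list[pos + 4:pos + length]))
        pos += length
    return paths


def parse_load_option(blob: bytes):
    """Return (attributes, description, device_paths, optional_data)."""
    attributes, path_list_length = struct.unpack_from("<IH", blob, 0)
    # The description is a NUL-terminated UTF-16LE string (assumed layout).
    end = 6
    while end + 2 <= len(blob) and blob[end:end + 2] != b"\x00\x00":
        end += 2
    description = blob[6:end].decode("utf-16-le")
    start = end + 2
    path_list = blob[start:start + path_list_length]
    optional_data = blob[start + path_list_length:]
    return attributes, description, split_device_paths(path_list), optional_data
```

In this sketch the first element of the returned device path list corresponds to the boot loader program image file, and the remaining elements are the ones left for the boot loader's own use.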
One thus specifies which volume is the system volume by editing the appropriate device path in the boot menu entry. The operating system boot loader takes the device path and passes it to the operating system kernel, translating it if necessary into whatever internal naming scheme the operating system kernel itself uses for specifying disc volumes. The operating system kernel in turn mounts the designated volume as the system volume.
ARC firmwares are very similar to EFI firmwares. Like EFI firmwares, they have a native general-purpose naming mechanism, ARC Paths, for specifying devices, disc volumes, and files. Like EFI firmwares, they have variables stored in non-volatile RAM that name the program image files of operating system boot loaders and that contain configuration information to pass to those boot loaders, such as the ARC Path of the system volume.
The only significant difference is in the implementation detail. Whilst
EFI firmwares store everything in the value of a single variable, ARC
firmwares store the ARC Paths of the operating system boot loader, the
kernel program image file that it loads, and the system volume, in the
values of three separate variables: OSLoader,
OSLoadFilename, and OSLoadPartition,
respectively. Thus the location of the system volume is the value of the
ARC firmware's OSLoadPartition variable, held in NVRAM.
(Confusingly, the ARC firmware itself makes use of a disc volume, the
ARC System Partition, and designates its location by the value
of the SystemPartition variable. This is
not the system volume.)
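The distinction between the variables can be illustrated with a small Python sketch. The NVRAM contents below are made up for the example, and the parser assumes the familiar mnemonic(number) form of ARC Path components; the function name is likewise an invention.

```python
import re

_COMPONENT = re.compile(r"([A-Za-z]+)\((\d+)\)")


def parse_arc_path(path: str):
    """Return ([(mnemonic, key), ...], remainder) for an ARC Path."""
    components, pos = [], 0
    while (m := _COMPONENT.match(path, pos)) is not None:
        components.append((m.group(1).lower(), int(m.group(2))))
        pos = m.end()
    return components, path[pos:]


# Made-up NVRAM contents, for illustration only.
nvram = {
    "SystemPartition": "scsi(0)disk(0)rdisk(0)partition(1)",
    "OSLoader": "scsi(0)disk(0)rdisk(0)partition(1)\\os\\osloader.exe",
    "OSLoadPartition": "scsi(0)disk(0)rdisk(0)partition(2)",
}

# The system volume comes from OSLoadPartition, not from SystemPartition.
system_volume, _ = parse_arc_path(nvram["OSLoadPartition"])
firmware_volume, _ = parse_arc_path(nvram["SystemPartition"])
```

With these invented values the two variables name different partitions of the same disc, which is exactly the situation in which confusing them would matter.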
As with EFI firmwares, the operating system boot loader has to translate the ARC Path for the system volume into whatever naming scheme the operating system kernel itself uses for specifying disc volumes.
Windows NT needs no translation at all. The Windows NT kernel directly
understands ARC Paths, and Windows NT uses ARC Paths as its mechanism for
communicating between boot loader and kernel. NTLDR, the Windows NT
operating system boot loader, simply passes the ARC Path taken from
OSLoadPartition directly to the kernel, which in turn decodes
the ARC Path to determine the actual volume that is the system volume.
On machines with IBM PC compatible firmwares, there is no service available from the firmware for specifying the location of the system volume to the operating system boot loader. Operating systems thus simply pretend that they are using more capable firmwares, employing shim layers that sit between the operating system boot loader proper and the IBM PC compatible firmware.
This is the approach taken by Windows NT.
In Windows NT up to version 5.2, the boot loader and the kernel behave as
if they are running on ARC firmware, and a shim layer is added to NTLDR to
emulate the services provided by and the behaviour of such firmware. In
particular, NTLDR contains a shim that presents a menu of boot options to
the user (which the firmware does itself on real ARC systems), reading boot
configuration data from the /boot.ini file on the boot volume
(rather than from NVRAM as on real ARC systems). NTLDR also contains a
shim that switches the machine into protected mode before running the boot
loader proper, and that provides disc volume and console I/O services in
protected mode — which are provided by the firmwares themselves on
ARC firmware and EFI firmware systems.
The /boot.ini file on the boot volume provides just
persistent storage for the boot configuration data that are held in
non-volatile RAM on ARC firmwares, rather than a full general-purpose
NVRAM variable data storage service. Each entry line in the [operating systems] section of /boot.ini
comprises the ARC Path of the system volume, the ARC Path of the kernel
image directory, and the kernel command line options, which is exactly what
would be stored in the OSLoadPartition,
OSLoadFilename, and OSLoadOptions variables on
an ARC firmware system.
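The correspondence can be made concrete with a short Python sketch that splits one /boot.ini entry into the values that the three ARC variables would hold. The sample line follows the usual form of such entries but is only an example, and the function name is invented; this is not how NTLDR itself is implemented.

```python
def parse_boot_ini_entry(line: str):
    """Split one [operating systems] entry into its constituent parts."""
    location, _, rest = line.partition("=")
    # The ARC Path of the volume ends where the directory path begins.
    volume, separator, directory = location.partition("\\")
    # The quoted menu text comes first; the load options follow it.
    _, _, after_quote = rest.strip().partition('"')
    title, _, options = after_quote.partition('"')
    return {
        "OSLoadPartition": volume,
        "OSLoadFilename": separator + directory,
        "title": title,
        "OSLoadOptions": options.strip(),
    }


entry = parse_boot_ini_entry(
    r'multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP" /fastdetect'
)
```

Here entry["OSLoadPartition"] holds the ARC Path of the system volume, which is the only part that needs to change when the volume moves.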
The problems with this approach are akin to the problems with the roll-one's-own explicit configuration mechanism described next.
Because the firmware that the system pretends to be running on isn't actually there, but is merely emulated by a shim layer beneath the boot loader proper, some of the functionality of a system with the real firmware is lacking.
For example: The shim layer in NTLDR on IBM PC compatible firmware
systems provides no maintenance utility that allows entries on the
boot manager menu to be added, modified, and erased before bootstrapping
an operating system. On an IBM PC compatible firmware system one is as a
consequence faced with the chicken-and-egg problem of not being able to
edit /boot.ini without being able to boot the operating
system so that one can run an editing tool, and not being able to boot the
operating system without adjusting the ARC Paths in /boot.ini
to compensate for the changes in device names caused by moving discs
around.
In contrast, on a real EFI firmware or ARC firmware system it is possible to adjust the device paths in the boot manager menu entries to cope with moving disc units from one place to another, or with changing their IDs, because the firmware itself includes a built-in Boot Manager Maintenance utility for doing so.
Because the data storage provided by the emulated firmware is being supplied by the shim using a private data storage area, only tools that know the private data storage area's location and format can alter the boot configuration data.
In contrast, on a real EFI firmware or ARC firmware system, the locations and formats of the boot configuration data are standardized, and can be manipulated both by the Boot Manager Maintenance utility and by any tools for EFI/ARC firmware configuration on any operating system.
Some operating systems make no attempt whatever to use the facilities that the machine firmware provides, but instead roll their own non-standard configuration mechanisms for explicitly specifying the location of the system volume, and use that mechanism on all systems.
Linux kernels, for example, make no effort to use the device path information
supplied by ARC firmwares and EFI firmwares. Instead, Linux kernels built
without the CONFIG_EDD option have the location of the system
volume hardwired into the in-memory image of the operating system kernel
itself, as a pair of major+minor device numbers.
This pair of values in the kernel image is set manually by the system administrator in one of two ways:
For kernel images that are written directly to a volume,
sector-for-sector, the system administrator is required to run the
rdev configuration utility after writing the kernel image to
the volume. This writes a given major+minor device number pair to the
correct place in the kernel image on the volume.
For kernel images that are written to a file on a volume, and loaded
indirectly via a boot loader such as LILO or GRUB, the boot loader modifies
the kernel image on the fly as it is loading it into memory, taking the
major+minor device number pair to use from the boot loader's own configuration
file (e.g. lilo.conf).
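Both methods amount to patching a small field in the kernel image. The Python sketch below shows the idea on an in-memory image. It assumes the traditional x86 layout, in which a 16-bit root device field sits at byte offset 508 of the image and holds the major number in its high byte and the minor number in its low byte; those details, and the function names, are assumptions made for the illustration.

```python
import struct

ROOT_DEV_OFFSET = 508  # assumed offset of the 16-bit root device field


def encode_root_dev(major: int, minor: int) -> int:
    """Pack a major+minor pair into the assumed 8-bit + 8-bit encoding."""
    if not (0 <= major <= 255 and 0 <= minor <= 255):
        raise ValueError("major and minor must each fit in 8 bits")
    return (major << 8) | minor


def set_root_dev(image: bytearray, major: int, minor: int) -> None:
    """Patch the root device field in place, as rdev or a boot loader would."""
    struct.pack_into("<H", image, ROOT_DEV_OFFSET, encode_root_dev(major, minor))


def get_root_dev(image: bytes):
    """Read the major+minor pair back out of the image."""
    (value,) = struct.unpack_from("<H", image, ROOT_DEV_OFFSET)
    return value >> 8, value & 0xFF


image = bytearray(512)      # stand-in for the first sector of a kernel image
set_root_dev(image, 8, 1)   # major 8, minor 1
```

Nothing in the patched field says which physical disc is meant. It is simply a pair of numbers that the kernel will interpret once its own drivers have assigned device numbers.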
The problems of this approach are severalfold:
Changing disc IDs, moving disc units from one bus to another, or moving discs between machines, results in a chicken-and-egg situation: The operating system cannot be bootstrapped because the non-standard configuration information needs to be adjusted to reflect the altered disc unit number for the system volume, but the configuration information cannot be updated because in order to run the configuration update tool one must first bootstrap the operating system.
Several Linux boot loaders allow the major+minor device number for the system volume to be specified interactively at boot time. However, that does not wholly solve the chicken-and-egg problem. They require that the system administrator supply the major+minor device numbers that will be assigned to the system volume. But those device numbers are not necessarily known by the system administrator until after the operating system has booted.
Changing disc IDs and moving discs are not the only ways to cause this problem. Creating or deleting partitions on the disc unit containing the system volume creates the same chicken-and-egg situation: The operating system cannot be bootstrapped because the non-standard configuration information needs to be adjusted to reflect the altered partition number for the system volume, but the configuration information cannot be updated because in order to run the configuration update tool one must first bootstrap the operating system.
Creating or deleting a partition changes the assignments of minor device
numbers to disc partitions. So whenever a disc partition is created or
deleted (on the disc unit containing the system volume), the kernel
image's major+minor device number pair must be manually updated by the
system administrator (either directly with rdev or indirectly
by re-running LILO's or GRUB's configuration writing utility), otherwise
the kernel will use the wrong disc partition as the system volume.
In contrast, the boot configuration data on EFI firmware and ARC firmware systems have standardized locations and formats, and can be edited when such chicken-and-egg situations occur using either the maintenance utility that is built in to the firmware itself or any EFI/ARC firmware configuration tool from any operating system.
The volume naming scheme is non-standard and peculiar to each operating system. Where multiple such operating systems are involved, the multiplicity of naming schemes can easily become confusing. Further confusion can result when boot loaders introduce their own naming schemes, too.
For example: Whilst to the Linux kernel proper, the system partition may be known as "major device number 8, minor device number 1", Linux system administration utilities and Linux boot loaders all use "user-friendly" names instead. Whilst supposedly relieving the system administrator of the burden of remembering the actual device numbers by introducing an additional layer of indirection, they introduce problems by dint of the fact that there are at least two different schemes for such "user-friendly" names.
Whilst the device file /dev/sda1 may represent that volume in the
filesystem once the operating system is running, and be the way to refer to it
with system administration commands such as dd, to the GRUB boot
loader it is named (hd0,0). To further add to the confusion,
other operating systems may use user-friendly names for the volume that are
similar but not quite the same, such as /dev/sd0s1.
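The three names for the one volume can be related by a small Python sketch. It rests on assumptions that do not always hold: that the volume is on a disc handled by the driver with major number 8, that sixteen minor numbers are reserved per disc, and that the boot loader numbers discs in the same order as the kernel letters them. The function name is invented.

```python
import re


def describe_volume(device: str):
    """Return (major, minor, grub_legacy_name) for a name such as /dev/sda1."""
    m = re.fullmatch(r"/dev/sd([a-p])(\d+)", device)
    if m is None:
        raise ValueError(f"unsupported device name: {device}")
    disc = ord(m.group(1)) - ord("a")
    partition = int(m.group(2))
    if not 1 <= partition <= 15:
        raise ValueError(f"unsupported partition number: {partition}")
    # Assumed conventions: major 8, sixteen minors per disc, GRUB counts from 0.
    return 8, disc * 16 + partition, f"(hd{disc},{partition - 1})"
```

The off-by-one between the kernel's partition numbering and the boot loader's is precisely the kind of detail that a system administrator has to carry in his or her head.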
In contrast, ARC Paths and EFI Device Paths are standardized parts of the machine firmware and are the same across all operating systems.
Some operating systems simply do not separate the concepts of "system volume" and "boot volume". They obtain the location of the boot volume and they use that as the location of the system volume.
This approach is rarely used on EFI firmware or ARC firmware systems, simply because those systems provide well-defined and simple mechanisms for specifying the location of the system volume, making it daft to do anything but use those mechanisms.
This approach is commonly used on IBM PC compatible firmware, however. IBM PC compatible firmwares provide no services to operating system boot loaders for locating the system volume, but they do provide services for locating the boot volume.
When the firmware (or the Master Boot Record) on a machine with IBM PC compatible firmware invokes the first stage of an operating system's boot loader, the Volume Boot Record of the operating system's boot volume, it makes three pieces of information available to the boot loader that the boot loader can use to locate the boot volume:
The firmware's disc unit number for the disc unit containing the boot
volume is passed to the VBR in the DL register.
The offset of the start sector of the volume from the beginning of that disc is recorded in a field in the BIOS Parameter Block contained within the VBR.
The length of the volume in sectors is recorded in a field in the BIOS Parameter Block contained within the VBR.
The boot loader passes these data to the operating system kernel, which in turn uses them to locate the system volume.
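As a rough illustration, the three data can be gathered in Python from a copy of the VBR and the value of the DL register. The field offsets used are those of the conventional DOS BIOS Parameter Block (a 16-bit sector count at 0x13, a 32-bit hidden sector count at 0x1C, and a 32-bit sector count at 0x20); they are stated here as assumptions, and the function name is invented.

```python
import struct


def boot_volume_location(vbr: bytes, dl: int):
    """Return (disc_unit, start_sector, length_in_sectors) for the boot volume."""
    (total16,) = struct.unpack_from("<H", vbr, 0x13)
    hidden, total32 = struct.unpack_from("<II", vbr, 0x1C)
    # The 16-bit count is used when non-zero; otherwise the 32-bit count applies.
    return dl, hidden, total16 or total32


# A made-up VBR: volume starts at sector 63 and is 1024000 sectors long.
vbr = bytearray(512)
struct.pack_into("<II", vbr, 0x1C, 63, 1024000)
location = boot_volume_location(bytes(vbr), 0x80)
```

The disc unit number in the result is the firmware's number, which is meaningful to the kernel only if the kernel numbers its discs the same way.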
Operating systems such as PC-/MS-/DR-DOS use the IBM PC firmware for their I/O to and from the actual disc units. It is therefore a simple matter for them to determine which volume is the system volume given the aforementioned data.
The major problem with this mechanism for determining the system volume is that it requires that the operating system's idea of disc unit numbers exactly matches the machine firmware's idea of disc unit numbers. Thus either the operating system must use the machine firmware for all disc I/O, or the operating system's own disc device drivers must exactly mirror the machine firmware's disc device drivers. In particular:
The operating system must not have device drivers for disc units that the machine firmware does not understand (unless they are assigned disc unit numbers that are unused by the machine firmware).
For example: If the machine firmware only understands how to access ATA discs, but the operating system has disc device drivers for both ATA and SCSI discs, the device driver initialization order must be such that the operating system assigns the same disc unit numbers to the ATA discs that the machine firmware does. If the machine firmware assigns the first disc unit number to an ATA disc, but the operating system disc driver initialization order causes the first disc unit number to be assigned to a SCSI disc, the operating system will attempt to read the system volume from entirely the wrong disc unit.
Even if the operating system and the machine firmware understand the same kinds of disc, the operating system's device drivers must initialize in the same order that the machine firmware's device drivers assign disc unit numbers.
For example: Many ROM firmware extensions for SCSI Host Adapter cards have the configurable option of assigning disc unit numbers to SCSI hard discs either ahead of or following the unit numbers assigned to ATA hard discs. Therefore whenever this option is changed, the operating system's SCSI and ATA device drivers must be re-ordered to initialize in the same order that the machine firmware has been configured.
This scheme does not work at all for operating systems that simply do not employ the same disc unit numbering system as the machine's firmware does, but have a very different native disc device naming scheme. This is the case for Linux, for example. There is no straightforward mapping between IBM PC firmware disc unit numbers and the major+minor device number pairs that Linux uses to identify disc volumes.
In order to determine which volume known to the operating system kernel corresponds to the disc unit number supplied by the machine firmware, such operating systems employ a bodge: The operating system boot loader reads data from the boot volume, via the machine firmware, and then the operating system kernel reads data from each volume in turn, via its own device drivers, until it finds a volume with matching data. This volume is then designated the system volume.
This is the mechanism that is employed by Linux kernels built
with the CONFIG_EDD option.
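In outline, the matching step looks something like the following Python sketch. It is a generic illustration of the idea, not the Linux implementation: volumes are modelled as in-memory byte strings, the volume names are made up, and the comparison simply uses the leading bytes of each volume.

```python
def find_system_volume(sample: bytes, volumes: dict) -> str:
    """Return the name of the one volume whose leading bytes equal the sample."""
    matches = [name for name, read in volumes.items()
               if read(len(sample)) == sample]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one match, found {len(matches)}")
    return matches[0]


# Hypothetical volumes, each readable through the kernel's own drivers.
contents = {"8:1": b"alpha" * 100, "8:2": b"bravo" * 100, "8:17": b"delta" * 100}
volumes = {name: (lambda n, data=data: data[:n]) for name, data in contents.items()}

# The boot loader read this sample from the boot volume via the firmware.
sample = b"bravo" * 4
```

The sketch refuses to choose when the sample matches no volume or more than one, which is where the difficulties with the matching mechanism begin.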
It is a problematic mechanism for several reasons:
The most obvious implementation of this mechanism is for the boot loader to read the Partition GUID from the boot volume's partition table entry, and then for the operating system kernel to look for a volume with that Partition GUID. Partition GUIDs are intended to uniquely identify a partition.
However, this requires that the disc containing the boot/system volume be one that is partitioned using the GUID Partition Table scheme. The MBR Partition Table scheme does not have unique identifiers for partitions. Thus boot/system volumes cannot reside on discs partitioned with the MBR Partition Table scheme, or on non-partitioned discs such as floppy discs.
Furthermore, this implementation requires that the kernel itself understand both the GUID and the MBR partitioning schemes, since it has to read every partition table entry on every disc unit in the system until it finds one with a matching signature. Whilst many operating system kernels implement partition table handling in kernel space, not all do; and for those that do not a mechanism that requires that the kernel understand partition table layouts is not satisfactory.
The second choice for implementing this mechanism, which is the one used
by Linux kernels built with the CONFIG_EDD option, is to
identify the disc unit containing the boot/system volume by
matching a disc signature, but for the kernel to identify the actual
partition on the disc for the volume using the start and length
information supplied by the machine firmware (and passed along by the boot
loader).
However, only discs partitioned with the GUID Partition Table scheme have a Disc GUID that uniquely identifies the disc. Discs partitioned with the MBR Partition Table scheme may have a 32-bit disc signature, but only a few disc management tools (such as Microsoft's Disk Administrator) actually attempt to assign signatures to discs at all. Most such tools do not. Furthermore, non-partitioned discs have no signature fields at all. Thus boot/system volumes can only reside on discs partitioned with the MBR Partition Table scheme if those discs have been given signatures by one of the few tools that assign them, and cannot reside on non-partitioned discs such as floppy discs at all.
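Reading the 32-bit disc signature from an MBR can be sketched as follows in Python. The sketch assumes the conventional position of the signature, at byte offset 0x1B8 of sector 0, and treats an all-zero field as "no signature assigned"; both are assumptions of the example, as is the function name.

```python
import struct

SIGNATURE_OFFSET = 0x1B8  # assumed position of the 32-bit disc signature


def mbr_disc_signature(sector0: bytes):
    """Return the disc signature, or None if the disc does not appear to have one."""
    if len(sector0) < 512 or sector0[510:512] != b"\x55\xAA":
        return None  # not a partitioned disc in the MBR scheme
    (signature,) = struct.unpack_from("<I", sector0, SIGNATURE_OFFSET)
    return signature or None  # zero is treated as "never assigned"


# A made-up MBR carrying a signature, and one that was never given one.
signed = bytearray(512)
signed[510:512] = b"\x55\xAA"
struct.pack_into("<I", signed, SIGNATURE_OFFSET, 0x12345678)
unsigned = bytearray(512)
unsigned[510:512] = b"\x55\xAA"
```

Two discs that both return None here are indistinguishable by this method, which is the weakness described above.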
The third choice for implementing this mechanism is to read disc sectors 0 and 1 and compute a signature from their contents using an algorithm such as MD5.
However, this mechanism breaks because the signature of the disc unit changes if either sector changes in any way. Writing new MBR code will change the disc unit's signature, for example. So too will creating and deleting primary partitions.
Furthermore, choosing a more limited set of data from which to calculate the signature, in order to prevent signatures from changing too readily, in its turn risks duplicate signatures. If just the contents of sector 0 are used to calculate the signature, for example, then all GUID Partition Table discs of the same size will have identical disc signatures, because they all contain nothing but a dummy MBR Partition Table and dummy boot loader code in sector 0.
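Both failure modes can be demonstrated with a short Python sketch. The discs are made-up in-memory images with 512-byte sectors, and the function name is invented for the example.

```python
import hashlib

SECTOR_SIZE = 512


def computed_signature(disc: bytes, sectors=(0, 1)) -> str:
    """Compute an MD5 signature over the chosen sectors of a disc image."""
    digest = hashlib.md5()
    for number in sectors:
        digest.update(disc[number * SECTOR_SIZE:(number + 1) * SECTOR_SIZE])
    return digest.hexdigest()


# Two made-up discs: identical sector 0, different sector 1.
disc_a = bytes(SECTOR_SIZE) + b"A" * SECTOR_SIZE
disc_b = bytes(SECTOR_SIZE) + b"B" * SECTOR_SIZE

# Altering one byte in the partition table area of sector 0 (offset 446).
edited = bytearray(disc_a)
edited[446] ^= 0xFF

changed = computed_signature(bytes(edited)) != computed_signature(disc_a)
collide = computed_signature(disc_a, (0,)) == computed_signature(disc_b, (0,))
```

Both changed and collide come out true: the two-sector signature is too volatile, and the one-sector signature is not distinctive enough.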
Whatever disc signature is used, this mechanism requires that the kernel at the very least read from every single disc unit in the system at system bootstrap.