Useless Operating System

Useless Operating System

Code

In my final project for the advanced operating system class taught by John Regehr, I implemented parts of an operating system inspired by exo/micro kernel design.

The overall goal is to expose hardware as much as possible while providing protection.

I decided to stick with amd64 architecture as porting to aarch64 seems unfeasible given the time frame and I primarily use amd64 CPUs so I want to know more about them. UOS does not do identity mapping of physical memory outside boot thus kernel will be using virtual memory in almost same way as user mode programs.

After implementing small OS I recommend it to all low-level or high performance oriented programmers. More importantly, OS development is joy, especially the design of interfaces and low-level tinkering. Unfortunately, developing a usable operating system is no small task (all those drivers) and user mode applications severely constrain the kernel design. I therefore moved to compilers and programming languages, which are tractable for a single person (in the days of LLVM), and can significantly improve ones life by making programming fun againk

Also, the development made me finally appreciate debugger, where simpler printf can't be used as registers need to be inspected before and after instruction execution, e.g., before and after sysret.

Important

Motivation

In the class we used xv6, a clean Unix clone with approachable code. However, adding VESA BIOS Extensions framebuffer required new system calls and kernel changes which was clunky. I especially disliked that any experiment required tinkering with the kernel, and provided abstractions were quite high level.

Moreover, the xv6 operating system runs on 32-bit architecture, but most modern CPUs are all 64-bit. Futhermore, the Amd64 architecture is much simpler as it removes tons of old cruft such as segmentation and has simple system calls.

The primary inspiration for UOS are MIT's JOS (micro/exo kernel) and a bit L4 (microkernel). For example, almost all exceptions are handled by user and not kernel, including page fault. The timer interrupt is forwarded to user, but in case of buggy/malicious container, kernel will wait few slots and then kill the misbehaving container. Since the user knows more about its use of registers, it can save only the necessary ones, e.g., may not save floating point registers. On context switch, kernel stores only instruction and stack pointers.

I decided to go with capability-based permission system, where each resource is associated with a "lock" and the key to the lock consists of hashed permissions and the lock itself. Kernel/user can easily verify the key valid by hashing the lock and permissions of the key and checking if the key's hash matches. This approaches makes kernel simpler, as memory that can't be exposed to user (e.g., < 1 MB) is simply associated with a lock that only kernel knows. Therefore, no special range checks are required in the kernel when user maps memory. Shared memory is also simple to implement, as a container can easily create keys for the memory it owns and give it to a newly created container. Also, kernel uses same interface as user, i.e., it presents key if it needs to map memory. TODO: how to safely store key if Meltdown is present?

UOS in constrast to Linux does not have a text mode. All assembly is in separate files as I am convinced inline assembly is not good solution. Moreover, kernel consists of only single .c file, which simplifies the build system to a single compile command. It targets desktops and uses 2 MiB pages.

Amd64 Architecture

Most desktops or laptops contain amd64 processor either from Intel or Amd. Since UOS targets personal computers it supports only Amd64 architecture. This choice simplifies the kernel as 32bit architectures have much more legacy cruft. For example, Amd64 has no segmentation, good syscall interface, simpler instruction set and only one necessary task segment. The arguments for C ABI are passed in registers. UOS does not allow kernel to read any user memory without explicit mapping. Moreover, since there is more registers, all syscall arguments must use them.

Any virtual address must be in canonical form, i.e., if bit 47 (counting from 0) is set, then all most significant bits must be also set. This quirk divides the address space into 2 pieces with huge hole in between them. Pretty annoying as UOS is not higher half kernel.

The System V ABI requires first 6 integer or pointer arguments to be passed in registers in order rdi, rsi, rdx, rcx, r8, and r9. The remaining arguments (7 and more) are passed on stack.

Overview of Startup

At boot, Basic Input/Output System (BIOS) loads specific sector of 512 bytes into memory at fixed address and starts execution. Usually, this program is a bootloader that loads the kernel, or in our case second stage of the bootloader as 64 bit assembly takes more space and I could not fit the ELF loader into a single sector. The second stage parses ELF file (kernel) on disk and loads it into memory. The kernel bootstraps itself by setting up physical and virtual memory managers and core struct for the boot processor. It then switches to new page table and stack and continues by starting other processors and loading user init program into a container. We will now discuss individual stages in more detail.

Bootloader

We focus on using BIOS instead of UEFI due to simplicity of the setup as there is no need for additional tools to build BIOS bootable binary. The disadvantage of BIOS is all the legacy cruft, such as starting in Real Mode and having only 512 bytes for bootloader.

The purpose of bootloader is to do initial setup (such as video mode, virtual memory) and load kernel from a disk into a memory. Since we could not squeeze bootloader with ELF loader into 512 bytes like xv6, we split the bootloader into two stages, where the first stage's purpose is to only load the second stage. For now, we use 4 KiB of space for both stages and kernel starts at sector 8.

After the video mode and long mode setup, the bootloader reads kernel from ELF binary into predefined location and jumps into the kernel entry point. IDE driver is used to perform reading from a disk. UOS does not support multiboot standard as xv6, so it will work only with our bootloader.

The first stage must be 512 bytes and the assembly needs to be padded to fill 512 bytes (and adding MBR bytes at 510 offset).

A20 Line

Used to ignore 21st bit of address to emulate wraparound from older architectures. We have to enable this 21st bit so we can address all memory.

xv6 bootloader uses keyboard controller to enable this bit. We opt for BIOS and its functions.

movw $0x2401, %ax
int $0x15 

http://www.win.tue.nl/~aeb/linux/kbd/A20.html

Global Descriptor Table

entry point of bootblock.o must be at the address 0x7C00 which can be double checked by doing objdump -d bootblock.o

The simplest "debugging" can be done by writing to screen; segmentation must be used as CGA address of 0xB8000 is outside 16-bit range. This addressing can be done by setting %ds to 0xB800 (segment is multiplied by 16 on access).

We directly transition from real mode to long mode without going through protected mode. Since the boot sector is small (512 bytes) we have to manually set up page tables instead of loading them from an array.

We can't use .data segment in bootloader C code (for example to store static string) as only .text is copied with objcopy (-j flag; TODO).

Bootstrap Page Table

We map same physical address [0, 1 GiB) to the virtual addresses [0x0, 1 GiB), and [0x8000000000, 0x8000000000 + 1 GiB). This is necessary as we linked kernel to reside at address 0x8000100000 (as we want it to be mapped into each process space) and when we jump to it it must be mapped properly. The first 1 GiB in virtual address space are mapped because it is the space in which bootloader lives and from where we will perform the jump.

The reason for choosing address where all 39 bits are zero is so we can have only top level page directory (PML4) which shares mapping with the [0, 1 GiB) entries. This sharing is simply done by pointing PML4 entry to the same entry in PDP (next level page table) as for the first entry.

direct switch to long mode

Real Hardware

UOS does not boot on my desktop as it would need SATA driver. If booting from USB, tt turns out if the disk lacks partition table the BIOS will set DL register to 0x0, i.e., floppy, and floppy does not support linear base addressing (LBA) which we use in IDE driver.

ELF

Fixed-length ELF header and variable-length program header listing

For some stupid reason .rodata is together with .text in same segment, making .rodata executable. (http://thiƩbaud.fr/rodata.html)

Symbol Table

Symbol table associates names with linear addresses and thus can be used by the loader to find addresses of exception and interrupt handlers (those must have some agreed names).

Drivers

UART

Serial port for debugging. Uses port-mapped I/O which ignores virtual memory (thus even if we do not map first 2 MiB it works).

IDE driver

Reference

We need to read kernel from the primary ATA disk in bootloader to initialize the core OS.

We wrote the function for reading sectors in assembly to avoid inline assembly. Also, we could wrap individual instructions in assembly functions but it was much simpler this way.

Kernel Bootstrap

Mostly disabling legacy stuff and enabling modern features as the processor assumes it woke up in last century.

Memory Management

There are roughly three levels of memory management:

We use recursive page table mapping to avoid the need to remap page tables each time we need to access them or to use contiguous memory storage for the VA to PA and PA to VA conversions which are necessary for updating the page table. The recursive mapping is the simplest as we need these conversions only for page table manipulation. In other cases such as memory mapped IO we would use a simple mapping.

Hmm, recursive mapping allows us to access entries of page table only for currently loaded page table which is really annoying.

There are several approaches:

Recursive Page Table

By mapping pml4 entry to pml4 we can access all paging structures for current process. The main advantage is the simplicity and constant virtual address space used in kernel (since only pml4 is mapped in kernel space). If we need to modify another process's page table we can point it to a pml4 slot and use the recursive trick on pdp (which is pml4 of the other process).

Unfortunately, there are many disadvantages:

Duplicated Page Table

For kernel we have a separate page table data structure that mimics hardware page table.

TODO: use completely different data layout so we can easily traverse the pointers from pml4 -> pdp -> pd.

For user space we may use recursive mapping to allow easy view of the current virtual address space. (TODO: what about TLB pollution? We can't easily map page tables as they live in kernel heap)

Backing pages cannot be moved without knowledge of the heap to avoid having a separate structure for page tables.

Initialization

Since at boot we do not have valid kernel page table we have to build one. This build process is quite different from modifying current page table (one in CR3) as we can't use recursive mapping to access the pml4, pdp, and pd tables. Furthermore, since the page table is not used we do not need to invalidate TLB entries.

We assume the heap is physically contiguous at the boot time and use a simple offset to setup a initial kernel page table. Alternative solution would be to walk the boot page directory and find linear address for a given physical address (the other way is handled by recursive mapping).

Manager

Still go back and forth between the options but leaning towards thin layer in kernel which will provide frame management and ability to map virtual to physical addresses. The kernel would enforce permissions.

The other options are:

Security

If I go with 2 MiB pages than we need at least 4 pages for a program (8 MiB):

Heap

I started with a simple heap that allocates multiple of 4096 bytes, but after watching Switch hacking talk (TODO reference) I plan to have a separate allocator (slab) for each struct type, thus avoiding the potential vulnerability of overwriting different struct that can happen on heap (as two different types of struct cannot possibly overlap, and also there are no pointer casting issues).

Security

The UOS is a exo/micro kernel design and thus it runs all drivers in user mode (ring 3).

Kernel is prohibited to write to any user mapped memory directly and must map the memory to separate address if it wants to write there (for example for the info structure). Kernel never reads any user memory and all syscall arguments must be passed in registers.

The recent side-channel attacks are concerning and I will probably go with the separate page table for kernel and user (with the addition of PCID). The Spectre attack seem much harder to mitigate. The attack allows user to read all memory, thus it could read kernel key and corresponding capabilities created from the key. One solution would be to store the keys and capabilities in a special area not mapped when user space is in flight (probably hard to not make a mistake); could we create kernel where everything is non-sensitive?.

he UOS is a exo/micro kernel design and thus it runs all drivers in user mode (ring 3).

Kernel is prohibited to write to any user mapped memory directly and must map the memory to separate address if it wants to write there (for example for the info structure). Kernel never reads any user memory and all syscall arguments must be passed in registers.

The recent side-channel attacks are concerning and I will probably go with the separate page table for kernel and user (with the addition of PCID). The Spectre attack seem much harder to mitigate. The attack allows user to read all memory, thus it could read kernel key and corresponding capabilities created from the key. One solution would be to store the keys and capabilities in a special area not mapped when user space is in flight (probably hard to not make a mistake); could we create kernel where everything is non-sensitive?.

Capability-based Mechanisms

Instead of user/group based system have fine grained badges that give owner permissions. Access control lists (ACL) store list of things that have access to given resource. Capabilities store list of resources a thing can access.

Strive for simplicity

c-list (capability list), trapdoor functions?

random numbers (rdrand, rdseed)

How to do these?

Anton Burtsev, http://www.ics.uci.edu/~aburtsev/

References:

Design

Initial approach was to have each frame associated with capability, thus when use allocated a frame he would get a key to it. This method was clunky as it would be tricky to allocate multiple frames and it placed large burden onto user. Furthermore, we were associating the object number with the key to identify it, which caused same problems.

Instead, we use user provided key (can be random number) when a frame is acquired, thus the same key can be used with any number of frames. User can create capability by hashing the key and permissions. Then when any access such as mapping the frame is performed, OS performs same hashing and check if the hash in capability matches the computed one.

TODO: test if we can use AES instead of sha256

Multicore

There are at least two ways to detect Application Processors (AP) after Bootstrap Processor (BSP) starts running the bootloader: Multiprocessor Specification tables or ACPI (specifically MADT).

xv6 uses MPS tables which are obsolete, so we focus on the ACPI approach. The ACPI table pointer structure lives at some memory range and can be checksumed to verify if the struct is really an ACPI table. Since Qemu does not support x2APIC we have to memory map the LAPIC registers to query the APIC id (other alternative is to use CPUID). The ACPI table tells us this address and moreover it has a flag if core is working (the table is filled during boot by each core).

After parsing the tables we loop through all core entries and if the LAPIC id is different from the id of BSP we attempt to start the core with interprocess interrupt. Then we wait until the core updates the shared core array with running flag. We could do exponential startup for large core counts, but currently it is not an issue. Note, we use x2APIC which uses MSR as an interface instead of memory mapping.

Each core has its own kernel stack and pml4 with shared entries for the kernel data structures (including heap) and for the entry which holds all core kernel stacks (TODO: should cores see other cores stacks?).

User Mode

Graphics

The simplest way to draw pixels is uisng VGA mode 0x13 (320x200 with 256 colors with linear addressing) with BIOS interrupt 0x10 in protected mode. This mode switch is done in bootloader before protected mode is entered as only real mode can sanely interact with BIOS to manipulate VGA graphics.

# two moves can be combined into movw $0x0013, %ax
movb    $0x0, %ah  # call function 0x0 to set video mode
movb    $0x13, %al # use video mode 0x13
# or movw $0x13, %ax
int     $0x10

After changing video mode, the pixel can be accessed at physical address 0xA0000, which in xv6 is mapped at KERNBASE + 0xA0000.

Unfortunately, the small resolution (and only 256 colors) provided by VGA is restrictive and unsuitable for any graphics based system.

The simplest solution is to use VESA BIOS Extensions (VBE) which can provide more than 1kx1k resolutions with up to 16M colors. Even though modes defined by VESA are considered deprecated, they are likely to work and allow to avoid finding suitable mode. This predefined modes come handy as real mode needs to be used and xv6's boot code can't be more than 512 bytes. Other solution would be to drop down to real mode before switching to long mode (TODO).

For example, setting 1280x1024 with 16M colors and loading its information.

movw    $0x4F02, %ax    # set VBE mode function
movw    $0x411B, %bx    # 1280x1024 24bit mode (with linear framebuffer)
int     $0x10

movw    $0x4F01, %ax    # get mode information function
movw    $0x411B, %bx    # mode
movw    $0x0, %di       # store the mode information struct (256 bytes) at physical address 0x0
int     $0x10

Than we can qeury mode parameters by loading the ModeInfoBlock structure. For simplicity, we load it at fixed physical address 0x0 as kernel code starts on address 0x100000. Since we care only about the PhysBasePtr field (tells us location of linear framebuffer), we could pass it as function parameter to the kernel bootstrap function.

Now, the tricky bit; we need to create virtual memory mapping to the physical address in PhysBasePtr. On my machine in Qemu, the address is 0xFD000000 and we need to map 4 MiB (depends on resolution and bit depth) up to the address 0xFE000000 (defined in memlayout.h as DEVSPACE, i.e., memory mapped I/O).

I used mappages (vm.c) in function main (main.c) to map some fixed virtual address in FRAMEBUFFER to physical address at vbe->physbaseptr before starting other processors.

struct vbemodeinfo const *const vbe = (void *)KERNBASE;
int mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm);
mappages(kpgdir, (void *)FRAMEBUFFER, DEVSPACE - FRAMEBUFFER, vbe->physbaseptr, PTE_W);

Now, pixels can be written at the address FRAMEBUFFER anywhere in kernel. I was surprised I could not find any exported function to perform the memory mapping explicitly, another reason for exokernel.

Floating Points

By default disabled on amd64

We can test in bootloader first if AVX2 is supported, and if not hang the OS. Some OS will trap if user used floating point unit and enable it. This mechanism allows OS to avoid storing/restoring floating point registers during context switch if process does not use those registers. Since UOS kernel does not save context at all (except few registers on timer interrupt), it enables AVX2 at the kernel startup and let's user to do register store/restore if necessary. In fact, this approach is simpler as user knows what registers are being used and can do some tricks. Moreover, since kernel does not use floating point registers, user can pass values in registers to the next scheduled container.

Kernel

We map kernel at high virtual address as user space programs assume loading at address 0.

Global Descriptor Table

The GDT from bootloader has only kernel code segment and no user code segment, nor Task State Segment (TSS). Initially, we had one GDT shared by all cores, but each core needs its own TSS, thus we could either create an entry for each core, or better, duplicate whole GDT to obtain full flexibility at mere cost of extra 32 bytes per core. Keeping the GDT separate is the right step into the direction of sharing as little as possible between cores to scale well on many core architectures (> 32 cores).

Memory Layout

UOS's kernel does not remap physical memory contiguously (higher half) as xv6 does. Instead, the kernel uses the same interfaces as user programs, allocating frame and mapping it. In linear address space, the kernel lives at entry 511 (last) in PML4 page table and thus takes 512 GB. We chose to take whole entry for simplicity as it can be easily shared by all user programs. Moreover, we mark the entry as global to avoid flushing TLB on context switch.

Each core's stack lives in entry 510 and is again marked global. Stacks are one page large (2 MiB), and have guard pages interspersed between the stacks. We do not have kernel stack per container as there are no slow system calls in the kernel (such as IO) that would need to be interrupted. This push of expensive operations to user space makes kernel simpler and easier to reason about.

Linking

Containers

Instead of notion of a traditional UNIX process, UOS uses containers as its execution unit abstraction. A container is akin to a process, but it also has access to exceptions, low-level memory management, and scheduling. For example, user can easily implement preemtive threads inside a container.

In short, container has following entries:

In contrast to xv6, we do not have any scheduler thread, so the kernel thread becomes a container after bootstrap and if container is killed, next container is scheduled (or in lack of container kernel panics).

User Space Bootstrap

The initial user mode binary is embedded into the kernel as there is no disk driver in kernel. All binaries use custom link script so there are only .text, .rodata, and .data, and they have proper permissions so they can be loaded into separate pages.

The ELF file is parsed for timer interrupt handler which needs to be passed to function allocating quantum (so it will be scheduled to run).

System Calls

xv6 uses interrupt 0x10 for system calls. We use the native syscall/sysret instructions present on amd64 architectures specifically designed for system calls.

Since the syscall instruction uses rcx register to store the returning instruction pointer we copy the value in the register to caller saved register and restore it back in the syscall to preserve C calling convention.

We may need to move away from C calling convention due the need of having more than 6 function arguments for the syscall (7-th is on the stack and uOS is prohibited from touching any user memory).

UOS relies on a C calling convention for system calls, but it must clear caller saved registers before returning to the user as it could leak information from the kernel (e.g., private key).

Entering User Mode

Since amd64 has no "direct" instruction to switch to user mode we have to pretend that the container is returning either from interrupt (iretq) or system call (sysretq). We used the system call intruction as it does not touch user's stack, thus we can avoid writing to user's memory (which we avoid for security reasons).

Traps and Interrupts

256 slots total.

Must be acknowledged

Priorities, higher is handled first

TODO: do software excetions need to be acknowledged?

Needs one Task State Segment (TSS)

User space exception list:

User space rare exceptions (which we let double fault):

Kernel space exception list:

Legacy exceptions:

Story

Thanks to #osdev for help.

After implementing divide zero exception the kernel start rebooting. Changing the .text segment to writable stopped the restarting and seemingly fixed the problem. However, exception should not modify the .text segment so after verifying the IDT is correct I had to check what changed in .text segment.

I copied all .text data and run comparison after the fault happened. The mismatch was found:

mismatch at FFFFFF80000000C0: 20990000000000 20980000000000

Which refered to GDT code segment entry and specifically the accessed bit got flipped. The reason is that during fault the code segment is loaded, but for syscall/sysret it is not, so it was confusing.

For now, the solution was to move non-bootstrap GDT to .data segment. NOTE: bootstrap GDT is in .text segment and thus the page for starting APs must be writable.

Division by zero

When asked on IRC to about advancing RIP after division by zero I got this response :).

It's *possible*, but it's also likely to break dozens of known good assumptions about computer science and invoke a metric shitload of undefined behaviour in the process (Kazinsal #osdev)

Timer Interrupt

So far, all non-critical exceptions were handled by currently running user mode container. Unfortunately, I am not sure how to make the LAPIC timer in user mode as it requires writes to a MSR to reset it (TSC deadline), and more importantly kernel has to relinquish control in case of adversary/buggy container that is reluctant to yield on the interrupt.

Initially, I thought setting up two timers, one at higher frequency (user mode) and the other at lower frequency (kernel mode) would be neat, but LAPIC has only one timer interrupt vector. Thus, the current plan is to point the interrupt vector to kernel which will bump up penalty integer (if next quantum is a different container) and returns to present container. Since the registers on context switch are saved by containers, kernel must (I think) use either user mode stack (can be read-only) or registers to at minimum save current RIP of the user container before jumping to the handler. If sysretq is used, RCX and R11 must be saved on user stack in addition, thus to better mimick traditional exception, a user read-only stack seems as a better solution (there is the GS/FS register too, but I rather have user complete freedom and not reserve any registers - probably makes it more portable too). At the end, I decided to create read-only container environment (basically, shared memory with kernel) that can be used to share information such as, the current penalty count, last RIP (on interrupt), next quantum container id, and other different kernel counters for active container. This area requires one frame (2 MiB) and is mapped as write into kernel, and as read into user. The policy of kernel never reading any user mode memory (only registers are allowed) is still enforced.

After kernel returns to the user mode interrupt handler, the container is expected to save its context if it is necessary and yield. There are exceptions, for example if the next quantum is the same container, there is no need to save anything nor yield, the container can restore the stack pointer and jump back to the saved or different RIP. This mechanism can be used to implement user space threads by simply allocating few consecutive quanta (such as 10, each 1 ms long) and yield only on the last one, so the kernel can switch to different container.

Normally, I use null segment for SS, but when returning from CPL < 3 into CPL = 3 via iretq, the SS will get reloaded and will cause general protection fault. Thus a not null selector must be created in GDT and SS must be loaded with that selector index. We use a simple trick where SS gets loaded with proper selector after executing sysretq instruction (starts user mode container). It is possible to use sysretq, but then two extra registers needs to be stored in info (rcx and r11); with iretq we need to store only rip. For user space exceptions we do not follow the manual and avoid iretq as it would set the accesses bit in GDT entry and page fault. (TODO: manual explicitly requires to use iretq)

TODO: handle what happens if container's interrupt handler gets interrupted (I am thinking about unique ID written by kernel into info as RIP can't be used reliably to identify interrupts). The container should be penalized (and eventually killed), but more importantly, there must be a plan for handling the user space stack in sane way (and not losing some space on it, due to RSP being bump in handler, but not decremented).

Protocol:

  1. interrupt jumps into kernel handler
  2. if current container penalty is 3, kill it and schedule next container
  3. increase penalty
  4. update RIP in the info
  5. set next LAPIC deadline
  6. jump into user handler
  7. user is expected to save registers and yield
  8. decrease penalty (and if negative set to 0)
  9. schedule next container

Notice, the penalty indicates if there is any nesting in timer interrupt. A well behaving program should always have penalty equal 1 when entering timer interrupt handler.

Since the user mode bootstrap container must handle timer interrupt, we parse the ELF symbol table and search for timer_interrupt_asm assembly function. This function is then called by kernel timer interrupt handler.

The kernel interrupt handler is fully written in assembly to avoid any surprises in terms of register modifications. The current implementation needs to push and pop only 3 registers, RAX, RCX, and RDX.

Compile Flags

-mno-red-zone (disable red zone as interrupts do not respect convention of leaving extra stack space for leaf functions)

Tasks

References