Useless Operating System
In my final project for the advanced operating system class taught by John Regehr, I implemented parts of an operating system inspired by exo/micro kernel design.
The overall goal is to expose hardware as much as possible while providing protection.
I decided to stick with amd64 architecture as porting to aarch64 seems unfeasible given the time frame and I primarily use amd64 CPUs so I want to know more about them. UOS does not do identity mapping of physical memory outside boot thus kernel will be using virtual memory in almost same way as user mode programs.
After implementing small OS I recommend it to all low-level or high performance oriented programmers. More importantly, OS development is joy, especially the design of interfaces and low-level tinkering. Unfortunately, developing a usable operating system is no small task (all those drivers) and user mode applications severely constrain the kernel design. I therefore moved to compilers and programming languages, which are tractable for a single person (in the days of LLVM), and can significantly improve ones life by making programming fun againk
Also, the development made me finally appreciate debugger, where simpler printf can’t be used as registers need to be inspected before and after instruction execution, e.g., before and after sysret
.
Important
- disable red zone with compiler flag
-mno-red-zone
(can get corrupted in nested interrupts) - make sure pointers are in canonical form
Motivation
In the class we used xv6, a clean Unix clone with approachable code. However, adding VESA BIOS Extensions framebuffer required new system calls and kernel changes which was clunky. I especially disliked that any experiment required tinkering with the kernel, and provided abstractions were quite high level.
Moreover, the xv6 operating system runs on 32-bit architecture, but most modern CPUs are all 64-bit. Futhermore, the Amd64 architecture is much simpler as it removes tons of old cruft such as segmentation and has simple system calls.
The primary inspiration for UOS are MIT’s JOS (micro/exo kernel) and a bit L4 (microkernel). For example, almost all exceptions are handled by user and not kernel, including page fault. The timer interrupt is forwarded to user, but in case of buggy/malicious container, kernel will wait few slots and then kill the misbehaving container. Since the user knows more about its use of registers, it can save only the necessary ones, e.g., may not save floating point registers. On context switch, kernel stores only instruction and stack pointers.
I decided to go with capability-based permission system, where each resource is associated with a “lock” and the key to the lock consists of hashed permissions and the lock itself. Kernel/user can easily verify the key valid by hashing the lock and permissions of the key and checking if the key’s hash matches. This approaches makes kernel simpler, as memory that can’t be exposed to user (e.g., < 1 MB) is simply associated with a lock that only kernel knows. Therefore, no special range checks are required in the kernel when user maps memory. Shared memory is also simple to implement, as a container can easily create keys for the memory it owns and give it to a newly created container. Also, kernel uses same interface as user, i.e., it presents key if it needs to map memory. TODO: how to safely store key if Meltdown is present?
UOS in constrast to Linux does not have a text mode. All assembly is in separate files as I am convinced inline assembly is not good solution. Moreover, kernel consists of only single .c
file, which simplifies the build system to a single compile command. It targets desktops and uses 2 MiB pages.
Amd64 Architecture
Most desktops or laptops contain amd64 processor either from Intel or Amd. Since UOS targets personal computers it supports only Amd64 architecture. This choice simplifies the kernel as 32bit architectures have much more legacy cruft. For example, Amd64 has no segmentation, good syscall interface, simpler instruction set and only one necessary task segment. The arguments for C ABI are passed in registers. UOS does not allow kernel to read any user memory without explicit mapping. Moreover, since there is more registers, all syscall arguments must use them.
Any virtual address must be in canonical form, i.e., if bit 47 (counting from 0) is set, then all most significant bits must be also set. This quirk divides the address space into 2 pieces with huge hole in between them. Pretty annoying as UOS is not higher half kernel.
The System V ABI requires first 6 integer or pointer arguments to be passed in registers in order rdi
, rsi
, rdx
, rcx
, r8
, and r9
. The remaining arguments (7 and more) are passed on stack.
Overview of Startup
At boot, Basic Input/Output System (BIOS) loads specific sector of 512 bytes into memory at fixed address and starts execution. Usually, this program is a bootloader that loads the kernel, or in our case second stage of the bootloader as 64 bit assembly takes more space and I could not fit the ELF loader into a single sector. The second stage parses ELF file (kernel) on disk and loads it into memory. The kernel bootstraps itself by setting up physical and virtual memory managers and core struct for the boot processor. It then switches to new page table and stack and continues by starting other processors and loading user init program into a container. We will now discuss individual stages in more detail.
Bootloader
We focus on using BIOS instead of UEFI due to simplicity of the setup as there is no need for additional tools to build BIOS bootable binary. The disadvantage of BIOS is all the legacy cruft, such as starting in Real Mode and having only 512 bytes for bootloader.
The purpose of bootloader is to do initial setup (such as video mode, virtual memory) and load kernel from a disk into a memory. Since we could not squeeze bootloader with ELF loader into 512 bytes like xv6, we split the bootloader into two stages, where the first stage’s purpose is to only load the second stage. For now, we use 4 KiB of space for both stages and kernel starts at sector 8.
After the video mode and long mode setup, the bootloader reads kernel from ELF binary into predefined location and jumps into the kernel entry point. IDE driver is used to perform reading from a disk. UOS does not support multiboot standard as xv6, so it will work only with our bootloader.
The first stage must be 512 bytes and the assembly needs to be padded to fill 512 bytes (and adding MBR bytes at 510 offset).
A20 Line
Used to ignore 21st bit of address to emulate wraparound from older architectures. We have to enable this 21st bit so we can address all memory.
xv6 bootloader uses keyboard controller to enable this bit. We opt for BIOS and its functions.
movw $0x2401, %ax
int $0x15
http://www.win.tue.nl/~aeb/linux/kbd/A20.html
Global Descriptor Table
entry point of bootblock.o
must be at the address 0x7C00
which can be double checked by doing objdump -d bootblock.o
The simplest “debugging” can be done by writing to screen; segmentation must be used as CGA address of 0xB8000
is outside 16-bit range. This addressing can be done by setting %ds
to 0xB800
(segment is multiplied by 16 on access).
We directly transition from real mode to long mode without going through protected mode. Since the boot sector is small (512 bytes) we have to manually set up page tables instead of loading them from an array.
We can’t use .data segment in bootloader C code (for example to store static string) as only .text is copied with objcopy
(-j
flag; TODO).
Bootstrap Page Table
We map same physical address [0, 1 GiB) to the virtual addresses [0x0, 1 GiB), and [0x8000000000, 0x8000000000 + 1 GiB). This is necessary as we linked kernel to reside at address 0x8000100000 (as we want it to be mapped into each process space) and when we jump to it it must be mapped properly. The first 1 GiB in virtual address space are mapped because it is the space in which bootloader lives and from where we will perform the jump.
The reason for choosing address where all 39 bits are zero is so we can have only top level page directory (PML4) which shares mapping with the [0, 1 GiB) entries. This sharing is simply done by pointing PML4 entry to the same entry in PDP (next level page table) as for the first entry.
direct switch to long mode
Real Hardware
UOS does not boot on my desktop as it would need SATA driver. If booting from USB, tt turns out if the disk lacks partition table the BIOS will set DL
register to 0x0
, i.e., floppy, and floppy does not support linear base addressing (LBA) which we use in IDE driver.
ELF
Fixed-length ELF header and variable-length program header listing
.text
program instructions.rodata
read-only data (we should make it only read-only memory).data
program’s data (e.g., global variables).bss
is a reserved section for unitialized global variables (we need to zero them)VMA
is link addressLMA
is load address of.text
section (as the offsets in code are absolute due to not using Position Independent Code)
For some stupid reason .rodata is together with .text in same segment, making .rodata executable. (http://thiébaud.fr/rodata.html)
Symbol Table
Symbol table associates names with linear addresses and thus can be used by the loader to find addresses of exception and interrupt handlers (those must have some agreed names).
Drivers
UART
Serial port for debugging. Uses port-mapped I/O which ignores virtual memory (thus even if we do not map first 2 MiB it works).
IDE driver
We need to read kernel from the primary ATA disk in bootloader to initialize the core OS.
We wrote the function for reading sectors in assembly to avoid inline assembly. Also, we could wrap individual instructions in assembly functions but it was much simpler this way.
Kernel Bootstrap
Mostly disabling legacy stuff and enabling modern features as the processor assumes it woke up in last century.
Memory Management
- frame allocator (2 MiB)
- avoid mapping physical memory (higher half)
- tried recursive paging (pain to modify other processes, TLB pressure)
- ended up with linear/physical dual mapped page tables
- kernel pages marked as global
- kernel heap (used for most allocations)
There are roughly three levels of memory management:
- physical memory
- virtual memory
- heap
We use recursive page table mapping to avoid the need to remap page tables each time we need to access them or to use contiguous memory storage for the VA to PA and PA to VA conversions which are necessary for updating the page table. The recursive mapping is the simplest as we need these conversions only for page table manipulation. In other cases such as memory mapped IO we would use a simple mapping.
Hmm, recursive mapping allows us to access entries of page table only for currently loaded page table which is really annoying.
There are several approaches:
- remap physical memory to kernel and use simple +-KERNEL_BASE
- use recursive page table (may use it for user space)
- keep small space available and map the page table before modifying it
- store virtual and physical addresses for page table (have to keep them in sync, allows for custom format which gets rid of canonical address issues NX bit, problems with dirty and accessed flags)
Recursive Page Table
By mapping pml4 entry to pml4 we can access all paging structures for current process. The main advantage is the simplicity and constant virtual address space used in kernel (since only pml4 is mapped in kernel space). If we need to modify another process’s page table we can point it to a pml4 slot and use the recursive trick on pdp (which is pml4 of the other process).
Unfortunately, there are many disadvantages:
- since we use 2 MiB pages we have to be careful as the recursive mapping behaves as if we used 4 KiB pages
- large TLB footprint (as we use 2 MiB pages)
- clunky to modify other process page table (the pdp becomes pml4, …) - rare, but needed for container creation
- weird bootstrap during boot
- limited to x86 (not big deal for now)
Duplicated Page Table
For kernel we have a separate page table data structure that mimics hardware page table.
TODO: use completely different data layout so we can easily traverse the pointers from pml4 -> pdp -> pd.
For user space we may use recursive mapping to allow easy view of the current virtual address space. (TODO: what about TLB pollution? We can’t easily map page tables as they live in kernel heap)
Backing pages cannot be moved without knowledge of the heap to avoid having a separate structure for page tables.
Initialization
Since at boot we do not have valid kernel page table we have to build one. This build process is quite different from modifying current page table (one in CR3) as we can’t use recursive mapping to access the pml4, pdp, and pd tables. Furthermore, since the page table is not used we do not need to invalidate TLB entries.
We assume the heap is physically contiguous at the boot time and use a simple offset to setup a initial kernel page table. Alternative solution would be to walk the boot page directory and find linear address for a given physical address (the other way is handled by recursive mapping).
Manager
Still go back and forth between the options but leaning towards thin layer in kernel which will provide frame management and ability to map virtual to physical addresses. The kernel would enforce permissions.
The other options are:
- provide only sbrk to grow heap (pain in ass to map any specific frames)
- allocate fixed number of frames for kernel and give rest of frames to user space memory manager (two context switches, painful)
- I can’t think of anything else
Security
If I go with 2 MiB pages than we need at least 4 pages for a program (8 MiB):
.text
, read-only.rodata
, read-only, no-execute.data
, write, no-execute- stack, write, no-execute
Heap
I started with a simple heap that allocates multiple of 4096 bytes, but after watching Switch hacking talk (TODO reference) I plan to have a separate allocator (slab) for each struct type, thus avoiding the potential vulnerability of overwriting different struct that can happen on heap (as two different types of struct cannot possibly overlap, and also there are no pointer casting issues).
Security
- NX, WP, SMAP, SMEP
- guard pages
- proper permissions for elf segments
- capability-based permission (kernel/user are “same”)
The UOS is a exo/micro kernel design and thus it runs all drivers in user mode (ring 3).
Kernel is prohibited to write to any user mapped memory directly and must map the memory to separate address if it wants to write there (for example for the info
structure). Kernel never reads any user memory and all syscall arguments must be passed in registers.
The recent side-channel attacks are concerning and I will probably go with the separate page table for kernel and user (with the addition of PCID). The Spectre attack seem much harder to mitigate. The attack allows user to read all memory, thus it could read kernel key and corresponding capabilities created from the key. One solution would be to store the keys and capabilities in a special area not mapped when user space is in flight (probably hard to not make a mistake); could we create kernel where everything is non-sensitive?.
he UOS is a exo/micro kernel design and thus it runs all drivers in user mode (ring 3).
Kernel is prohibited to write to any user mapped memory directly and must map the memory to separate address if it wants to write there (for example for the info
structure). Kernel never reads any user memory and all syscall arguments must be passed in registers.
The recent side-channel attacks are concerning and I will probably go with the separate page table for kernel and user (with the addition of PCID). The Spectre attack seem much harder to mitigate. The attack allows user to read all memory, thus it could read kernel key and corresponding capabilities created from the key. One solution would be to store the keys and capabilities in a special area not mapped when user space is in flight (probably hard to not make a mistake); could we create kernel where everything is non-sensitive?.
Capability-based Mechanisms
Instead of user/group based system have fine grained badges that give owner permissions. Access control lists (ACL) store list of things that have access to given resource. Capabilities store list of resources a thing can access.
Strive for simplicity
c-list (capability list), trapdoor functions?
random numbers (rdrand, rdseed)
How to do these?
- capability per frame
- capability per core
Anton Burtsev, http://www.ics.uci.edu/~aburtsev/
References:
- https://www.cs.cornell.edu/courses/cs513/2007fa/L08.html
- https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-software-implementation-guide
Design
Initial approach was to have each frame associated with capability, thus when use allocated a frame he would get a key to it. This method was clunky as it would be tricky to allocate multiple frames and it placed large burden onto user. Furthermore, we were associating the object number with the key to identify it, which caused same problems.
Instead, we use user provided key (can be random number) when a frame is acquired, thus the same key can be used with any number of frames. User can create capability by hashing the key and permissions. Then when any access such as mapping the frame is performed, OS performs same hashing and check if the hash in capability matches the computed one.
TODO: test if we can use AES instead of sha256
Multicore
- ACPI table to get list of cores
- x2APIC (MSR)
- GS register
- one kernel stack per core
- create container in user space (aka fork)
There are at least two ways to detect Application Processors (AP) after Bootstrap Processor (BSP) starts running the bootloader: Multiprocessor Specification tables or ACPI (specifically MADT).
xv6 uses MPS tables which are obsolete, so we focus on the ACPI approach. The ACPI table pointer structure lives at some memory range and can be checksumed to verify if the struct is really an ACPI table. Since Qemu does not support x2APIC we have to memory map the LAPIC registers to query the APIC id (other alternative is to use CPUID). The ACPI table tells us this address and moreover it has a flag if core is working (the table is filled during boot by each core).
After parsing the tables we loop through all core entries and if the LAPIC id is different from the id of BSP we attempt to start the core with interprocess interrupt. Then we wait until the core updates the shared core array with running flag. We could do exponential startup for large core counts, but currently it is not an issue. Note, we use x2APIC which uses MSR as an interface instead of memory mapping.
Each core has its own kernel stack and pml4 with shared entries for the kernel data structures (including heap) and for the entry which holds all core kernel stacks (TODO: should cores see other cores stacks?).
User Mode
- params passed only in registers
- fast syscall/sysret
- low level (map memory, create container)
Graphics
The simplest way to draw pixels is uisng VGA mode 0x13
(320x200 with 256 colors with linear addressing) with BIOS interrupt 0x10
in protected mode. This mode switch is done in bootloader before protected mode is entered as only real mode can sanely interact with BIOS to manipulate VGA graphics.
# two moves can be combined into movw $0x0013, %ax
movb $0x0, %ah # call function 0x0 to set video mode
movb $0x13, %al # use video mode 0x13
# or movw $0x13, %ax
int $0x10
After changing video mode, the pixel can be accessed at physical address 0xA0000
, which in xv6 is mapped at KERNBASE + 0xA0000
.
Unfortunately, the small resolution (and only 256 colors) provided by VGA is restrictive and unsuitable for any graphics based system.
The simplest solution is to use VESA BIOS Extensions (VBE) which can provide more than 1kx1k resolutions with up to 16M colors. Even though modes defined by VESA are considered deprecated, they are likely to work and allow to avoid finding suitable mode. This predefined modes come handy as real mode needs to be used and xv6’s boot code can’t be more than 512 bytes. Other solution would be to drop down to real mode before switching to long mode (TODO).
For example, setting 1280x1024 with 16M colors and loading its information.
movw $0x4F02, %ax # set VBE mode function
movw $0x411B, %bx # 1280x1024 24bit mode (with linear framebuffer)
int $0x10
movw $0x4F01, %ax # get mode information function
movw $0x411B, %bx # mode
movw $0x0, %di # store the mode information struct (256 bytes) at physical address 0x0
int $0x10
Than we can qeury mode parameters by loading the ModeInfoBlock
structure. For simplicity, we load it at fixed physical address 0x0
as kernel code starts on address 0x100000
. Since we care only about the PhysBasePtr
field (tells us location of linear framebuffer), we could pass it as function parameter to the kernel bootstrap function.
Now, the tricky bit; we need to create virtual memory mapping to the physical address in PhysBasePtr
. On my machine in Qemu, the address is 0xFD000000
and we need to map 4 MiB (depends on resolution and bit depth) up to the address 0xFE000000
(defined in memlayout.h
as DEVSPACE
, i.e., memory mapped I/O).
I used mappages
(vm.c
) in function main
(main.c
) to map some fixed virtual address in FRAMEBUFFER
to physical address at vbe->physbaseptr
before starting other processors.
struct vbemodeinfo const *const vbe = (void *)KERNBASE;
int mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm);
mappages(kpgdir, (void *)FRAMEBUFFER, DEVSPACE - FRAMEBUFFER, vbe->physbaseptr, PTE_W);
Now, pixels can be written at the address FRAMEBUFFER
anywhere in kernel. I was surprised I could not find any exported function to perform the memory mapping explicitly, another reason for exokernel.
Floating Points
By default disabled on amd64
We can test in bootloader first if AVX2 is supported, and if not hang the OS. Some OS will trap if user used floating point unit and enable it. This mechanism allows OS to avoid storing/restoring floating point registers during context switch if process does not use those registers. Since UOS kernel does not save context at all (except few registers on timer interrupt), it enables AVX2 at the kernel startup and let’s user to do register store/restore if necessary. In fact, this approach is simpler as user knows what registers are being used and can do some tricks. Moreover, since kernel does not use floating point registers, user can pass values in registers to the next scheduled container.
Kernel
We map kernel at high virtual address as user space programs assume loading at address 0.
Global Descriptor Table
The GDT from bootloader has only kernel code segment and no user code segment, nor Task State Segment (TSS). Initially, we had one GDT shared by all cores, but each core needs its own TSS, thus we could either create an entry for each core, or better, duplicate whole GDT to obtain full flexibility at mere cost of extra 32 bytes per core. Keeping the GDT separate is the right step into the direction of sharing as little as possible between cores to scale well on many core architectures (> 32 cores).
Memory Layout
UOS’s kernel does not remap physical memory contiguously (higher half) as xv6 does. Instead, the kernel uses the same interfaces as user programs, allocating frame and mapping it. In linear address space, the kernel lives at entry 511 (last) in PML4 page table and thus takes 512 GB. We chose to take whole entry for simplicity as it can be easily shared by all user programs. Moreover, we mark the entry as global to avoid flushing TLB on context switch.
Each core’s stack lives in entry 510 and is again marked global. Stacks are one page large (2 MiB), and have guard pages interspersed between the stacks. We do not have kernel stack per container as there are no slow system calls in the kernel (such as IO) that would need to be interrupted. This push of expensive operations to user space makes kernel simpler and easier to reason about.
Linking
Containers
Instead of notion of a traditional UNIX process, UOS uses containers as its execution unit abstraction. A container is akin to a process, but it also has access to exceptions, low-level memory management, and scheduling. For example, user can easily implement preemtive threads inside a container.
In short, container has following entries:
- unique identifier
- PML4 pointer to a page table
- frame for storing
rsp
andrip
during system call - status (running, …)
- penalty (increased on each tick, decreased on yield)
In contrast to xv6, we do not have any scheduler thread, so the kernel thread becomes a container after bootstrap and if container is killed, next container is scheduled (or in lack of container kernel panics).
User Space Bootstrap
The initial user mode binary is embedded into the kernel as there is no disk driver in kernel. All binaries use custom link script so there are only .text, .rodata, and .data, and they have proper permissions so they can be loaded into separate pages.
The ELF file is parsed for timer interrupt handler which needs to be passed to function allocating quantum (so it will be scheduled to run).
System Calls
xv6 uses interrupt 0x10
for system calls. We use the native syscall
/sysret
instructions present on amd64 architectures specifically designed for system calls.
Since the syscall
instruction uses rcx
register to store the returning instruction pointer we copy the value in the register to caller saved register and restore it back in the syscall to preserve C calling convention.
We may need to move away from C calling convention due the need of having more than 6 function arguments for the syscall (7-th is on the stack and uOS is prohibited from touching any user memory).
UOS relies on a C calling convention for system calls, but it must clear caller saved registers before returning to the user as it could leak information from the kernel (e.g., private key).
Entering User Mode
Since amd64 has no “direct” instruction to switch to user mode we have to pretend that the container is returning either from interrupt (iretq
) or system call (sysretq
). We used the system call intruction as it does not touch user’s stack, thus we can avoid writing to user’s memory (which we avoid for security reasons).
Traps and Interrupts
256 slots total.
Must be acknowledged
Priorities, higher is handled first
TODO: do software excetions need to be acknowledged?
Needs one Task State Segment (TSS)
User space exception list:
- #DE - divide by zero (vector 0)
- #PF - page fault (vector 14) - TODO: user can’t read CR2; should kernel push CR2?
- #XF - SIMD floating point (vector 19)
User space rare exceptions (which we let double fault):
- #SS - stack exception (vector 12) - TODO: detects canonical address issue
- #UD - invalid opcode (vector 6) - TODO: mostly legacy instructions and very rare
- #GP - general protection (vector 13) - TODO: mostly for unaligned SSE
- #AC - alignment check (vector 17) - can happen only if CPL = 3
Kernel space exception list:
- #DF - double fault (vector 8)
- #MC - machine check (vector 18)
Legacy exceptions:
- Non-maskable interrupt (vector 2)
- #OF - overflow (vector 4)
- #BR - bound range (vector 5)
- #NM - device not available (vector 7)
- #TS - invalid TSS (vector 10)
- #NP - segment not present (vector 11)
- #MF - x87 floating point (vector 16)
Story
Thanks to #osdev for help.
After implementing divide zero exception the kernel start rebooting. Changing the .text segment to writable stopped the restarting and seemingly fixed the problem. However, exception should not modify the .text segment so after verifying the IDT is correct I had to check what changed in .text segment.
I copied all .text
data and run comparison after the fault happened. The mismatch was found:
mismatch at FFFFFF80000000C0: 20990000000000 20980000000000
Which refered to GDT code segment entry and specifically the accessed bit got flipped. The reason is that during fault the code segment is loaded, but for syscall/sysret it is not, so it was confusing.
For now, the solution was to move non-bootstrap GDT to .data segment. NOTE: bootstrap GDT is in .text segment and thus the page for starting APs must be writable.
Division by zero
When asked on IRC to about advancing RIP after division by zero I got this response :).
It's *possible*, but it's also likely to break dozens of known good assumptions about computer science and invoke a metric shitload of undefined behaviour in the process
(Kazinsal #osdev)
Timer Interrupt
So far, all non-critical exceptions were handled by currently running user mode container. Unfortunately, I am not sure how to make the LAPIC timer in user mode as it requires writes to a MSR to reset it (TSC deadline), and more importantly kernel has to relinquish control in case of adversary/buggy container that is reluctant to yield on the interrupt.
Initially, I thought setting up two timers, one at higher frequency (user mode) and the other at lower frequency (kernel mode) would be neat, but LAPIC has only one timer interrupt vector. Thus, the current plan is to point the interrupt vector to kernel which will bump up penalty integer (if next quantum is a different container) and returns to present container. Since the registers on context switch are saved by containers, kernel must (I think) use either user mode stack (can be read-only) or registers to at minimum save current RIP of the user container before jumping to the handler. If sysretq
is used, RCX and R11 must be saved on user stack in addition, thus to better mimick traditional exception, a user read-only stack seems as a better solution (there is the GS/FS register too, but I rather have user complete freedom and not reserve any registers - probably makes it more portable too). At the end, I decided to create read-only container environment (basically, shared memory with kernel) that can be used to share information such as, the current penalty count, last RIP (on interrupt), next quantum container id, and other different kernel counters for active container. This area requires one frame (2 MiB) and is mapped as write into kernel, and as read into user. The policy of kernel never reading any user mode memory (only registers are allowed) is still enforced.
After kernel returns to the user mode interrupt handler, the container is expected to save its context if it is necessary and yield. There are exceptions, for example if the next quantum is the same container, there is no need to save anything nor yield, the container can restore the stack pointer and jump back to the saved or different RIP. This mechanism can be used to implement user space threads by simply allocating few consecutive quanta (such as 10, each 1 ms long) and yield only on the last one, so the kernel can switch to different container.
Normally, I use null segment for SS, but when returning from CPL < 3 into CPL = 3 via iretq
, the SS will get reloaded and will cause general protection fault. Thus a not null selector must be created in GDT and SS must be loaded with that selector index. We use a simple trick where SS gets loaded with proper selector after executing sysretq
instruction (starts user mode container). It is possible to use sysretq
, but then two extra registers needs to be stored in info
(rcx
and r11
); with iretq
we need to store only rip
. For user space exceptions we do not follow the manual and avoid iretq
as it would set the accesses bit in GDT entry and page fault. (TODO: manual explicitly requires to use iretq
)
TODO: handle what happens if container’s interrupt handler gets interrupted (I am thinking about unique ID written by kernel into info
as RIP can’t be used reliably to identify interrupts). The container should be penalized (and eventually killed), but more importantly, there must be a plan for handling the user space stack in sane way (and not losing some space on it, due to RSP being bump in handler, but not decremented).
Protocol:
- interrupt jumps into kernel handler
- if current container penalty is 3, kill it and schedule next container
- increase penalty
- update RIP in the
info
- set next LAPIC deadline
- jump into user handler
- user is expected to save registers and yield
- decrease penalty (and if negative set to 0)
- schedule next container
Notice, the penalty indicates if there is any nesting in timer interrupt. A well behaving program should always have penalty equal 1 when entering timer interrupt handler.
Since the user mode bootstrap container must handle timer interrupt, we parse the ELF symbol table and search for timer_interrupt_asm
assembly function. This function is then called by kernel timer interrupt handler.
The kernel interrupt handler is fully written in assembly to avoid any surprises in terms of register modifications. The current implementation needs to push and pop only 3 registers, RAX
, RCX
, and RDX
.
Compile Flags
-mno-red-zone (disable red zone as interrupts do not respect convention of leaving extra stack space for leaf functions)
Tasks
- tests
- frame allocator
- sha256
- style
- header only
- Intel-style assembly
- bootloader
- load rest of bootloader with BIOS (sort of 2 stage)
- switch from real mode to long mode
- query video modes and save this information at fixed address
- switch to most suitable video mode
- load ELF binary containing kernel (uses IDE)
- switch to SATA?
- boot on real hardware (needs SATA)
- kernel
- make work on KVM (needed to disable SMAP on Haswell)
- [?] add performance counters to evaluate impact of recursive paging on TLB misses
- abandoned recursive paging due to clunky mapping of other page tables
- start application processors (AP)
- create kernel process for each application processor (AP)
- add spinlock (test it by cores racing on it)
- use RDGSBASE to get pointer to core structure instead of the current offset stuff
- check required feature set with CPUID
- SKINIT instruction (read Security section in manual)
-
restructure so the
end
is at 16 MiB boundary - avoid global variables
- use ILP64 mode (sizeof (int) == 8) - seems hard as there is no compiler support
- timer (local APIC; TSC-Deadline mode)
- kernel double fault handler
- enable SCE (system call extensions) in EFER
- add system call support (syscall)
- syscall user rip security (validate it is canonical?)
- add containers (similar to process abstraction)
- load user space bootstrap ELF embedded inside kernel
- page permissions for ELF segments
- parse ELF symbol table to find timer interrupt handler
- switch to user space (via sysretq)
- use x2APIC (through MSR) instead of old xAPIC
- wait 10 ms on AP startup
- IPC (async to avoid deadlocks…)
- AVX2
- one TSS
-
sleep idle cores (
hlt
instructions) - expose some data structures (and update during task switch?)
- Continuation style system (TODO: read papers on it)
- init system (so servers can reserve resources)
- struct type for physical addresses (what about linear addresses? always a pointer?)
- add HPET support (to compute TSC rate)
- compute TSC rate for each core independently (good idea?)
- destroy container on double fault or other error
- memory management (only 2 MiB pages, thus only PML4, PDP, and PD is used)
- boot page table
- kernel page table
- recursive PML4 (no longer used for kernel space)
- frame allocator
- page allocator
- heap allocator (only alloc, no free)
- duplicated page table (allows easy traversal of any page table - even inactive one)
- page mapping interface (global marked pages for kernel to avoid TLB invalidation; write protect bit CR0)
- alignment checking in user space?
- Supervisor-Mode Execution Prevention (CR4.SMEP)
- non-executable pages (NXE in EFER)
- TLB (tagged)
- shared memory (uses capabilities)
- replace heap with slab allocator
- scheduling
- penalty protocol for wonky containers
- round-robin scheduler
- container migration to other core
- syscalls
- log to UART
- frame alloc/free
- map/unmap
- alloc/free container
- quanta alloc, yield
- userspace register store
- debugging
- setup .gdbinit for automatic GDB startup
- security
- separate page table for user and kernel (KAISER)
- use guard page for each core kernel stack
- proper page flags for .text, .rodata, .data
- capability-based resource management
- separate memory pools for each struct type (attacker can only modify same type of struct)
- MTTR (prevent DMA to write into kernel memory)
- kernel ASLR (both linear and physical address; PIC needed)
- random location for user stack
- finer granularity permissions for capabilities
- check if AES can be used instead of sha256
- namespaces?
- string type with length?
- stack protection
- do not map kernel into user space? (recent patches in NT and Linux do it)
- libs
- sprintf
- sha256 routine
-
random numbers (via
rdrand
instruction) - strcmp
- user space
- simple game of life render test
- exceptions (divide by zero, page fault, …)
- timer interrupt
- compositor
- ACPI
- parse MADT for detecting CPUs
- build parsed struct with LAPICs
- parse x2APIC (Qemu unfortunately has ACPI 1.0 only)
- drivers
- SATA (high priority)
- basic serial port
- IDE
- keyboard (needs USB stack)
- mouse (needs USB stack)
- Intel GPU driver (blitting support)
- graphics
- GUI library
- tiling manager
- file system
- sector-based addressing (similar to page frames)
- memory mapped filesystem?
- some filesystem format
- networking (not likely before Christmas)
- ethernet driver
- packet filter (downloadable?)
- JIT (use wasm or some other typesafe IR and thus run everything at ring 0)
- interpreter of a subset of WASM
References
- Exokernel: An Operating System Architecture for Application-Level Resource Management
- Singularity: Rethinking the Software Stack
- Amoeba: a distributed operating system for the 1990s
- Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism
- A Comparison of Scheduling Algorithms for Multiprocessors
- L4 Microkernels: The Lessons from 20 Years of Research and Deployment
- xv6, JOS, seL4
- osdev wiki
- Operating Systems Design and Implementation
- AMD64 Volume 1, 2, 3
- ATA Interface Reference Manual
- Advanced Configuration and Power Interface (ACPI) Specification
- ELF-64 Object File Format
- VESA BIOS EXTENSION (VBE) Core Functions Standard
- The little book about OS development
- Reading privileged memory with a side-channel
- Kaiser
- Hyperkernel: Push-Button Verification of an OS Kernel
- Dune: Safe User-level Access to Privileged CPU Features
- Modern Operating Systems - 3rd edition, A. Tanenbaum