C Is Not Portable Assembly

02 Aug 2015

C is often referred to as a “portable assembly language.” There’s an implication there (among others) that it’s only high-level enough to have a compiler that can target multiple instruction sets. This is simply untrue — while C is certainly low-level compared to most modern languages, it has a number of features that make it profoundly higher-level than any assembly language. I’m going to talk about just one of them: structs.

I’m using as a case study a piece of code that I’ve had an unreasonable number of bugs in, which I blame in no small part on me “thinking in assembly.” The code in question is the thread creation routine in my toy operating system. What it has to do is the following:

  1. Allocate space for the state of the thread. Mostly this is just a stack, but it also includes a node in a linked list used by the scheduler, and may include space for some accounting information in the future. For the sake of discussion, let’s call this extra chunk of space the “process control block (PCB)” — I don’t have a very specific name for it in my code yet (the struct is just called Thread), but that’s a very common name for this sort of thing.
  2. Set up the thread so that the scheduler can successfully run it. This means putting the above allocated resources into a state such that when the scheduler “restarts” the “interrupted” routine, it jumps to where it’s supposed to and goes on its way.

Right now I’m only scheduling kernel threads (no userspace), and am only dealing with x86. In that environment, when a thread is interrupted:

  1. The CPU saves a few of things onto the stack, then jumps to an interrupt handler (while still using the same stack).
  2. My common interrupt handler code saves a few more things onto the stack (everything else needed to resume execution) before passing the stack pointer to the scheduler.
  3. The scheduler saves the stack pointer in the PCB for the process that was supposedly running, and then pushes it to the back of the queue, to be restarted at a later time.
  4. The scheduler retrieves a previously interrupted thread from the front of the queue, gets the stack pointer from it’s PCB, and returns it.
  5. Without touching the state it saved on the stack in step (2), the common interrupt handler code switches to the newly returned stack.
  6. The interrupt handler restores the state on the (new) stack, and returns from the interrupt. This has the effect of picking up the newly scheduled task where it left off.

Our thread creation routine takes a function pointer and a single pointer as an argument, and the thread “begins” by calling that function with that argument. We therefore need to set up a stack that “looks like” it was interrupted just as it was calling this function. This means we should have the argument just underneath a return address (which can be bogus; we can ask our threads to do something special if they want to exit, and forbid them from returning). Then, we need to manufacture all of that saved state. This includes things like segment registers and EFLAGS, as well as general purpose registers and (importantly) the instruction pointer. We also need to set up frame pointers, so backtraces look reasonable.

Going through the full detail of exactly what each piece of that spoofed stack is supposed to look like isn’t really the point of this post though. What does the code look like?

There’s one other important machine-related detail to pay attention to with this code: the stack grows down, meaning that the stack starts at numerically higher addresses, and as you push things on it the stack pointer decreases. This makes things a little more awkward to work with in C, since it treats most things the other way around — The beginnings of arrays, structs, etc are at low addresses, and the ends are at high addresses.

I went through a couple of iterations before I came to my senses and got something that both worked and seemed reasonably maintainable. Version one was along these lines:

uint32_t *stack = kalloc_align(stack_size, STACK_ALIGNMENT);
Thread *ret = kalloc(sizeof(Thread));

stack = (uint32_t *)((uint32_t)stack + stack_size);
stack[-1] = 0; // end of the backtrace "linked list"
stack[-2] = (uint32_t)data; // argument to pass to the function
stack[-3] = 0; // bogus return adderss; don't need to return

Regs *saved_ctx = (Regs *)&stack[-4];
saved_ctx->eip = (uint32_t)entry_point;
saved_ctx->ebp = (uint32_t)&stack[-1];
saved_ctx->ds = KERNEL_DATA_SEGMENT;
// ...

This had some bugs in it, some of which I fixed before I decided it was too error prone, and sought an alternate design.

Going through the code, it starts off pretty straightforward: grab the memory for the stack, and since the start is at the other side of it move the pointer. Then, start counting down from the top, putting values into it as you go.

There’s something weird going on with the saved_ctx. The Regs struct had been defined quite some time ago, and looked something like this:

typedef struct Regs Regs;
struct Regs {
    uint32_t ds, edi, esi // ... stuff I saved
    uint32_t eip, cs, eflags;   // stuff the cpu saved

I must have based this off of an example at some point, because I think if I had spent much time reflecting on why it was like that in the first place, I would have skipped straight to version two:

typedef struct NewStack NewStack;
struct NewStack {
    Regs saved_ctx;
    uint32_t thread_ret;
    uint32_t thread_arg;
    uint32_t ebp_terminator;

// ...

    void *stack = kalloc_align(stack_size, STACK_ALIGNMENT);
    Thread *ret = kalloc(sizeof(Thread));

    NewStack *stack_begin = (NewStack *)((uint32_t)stack - stack_size);
    stack_begin->ebp_terminator = 0;
    stack_begin->thread_arg = (uint32_t)data;
    stack_begin->thread_ret = 0;
    stack_begin->saved_ctx.eip = (uint32_t)entry_point;
    stack_begin->saved_ctx.ebp = (uint32_t)&stack_begin->ebp_terminator;
    stack_begin->saved_ctx.ds = KERNEL_DATA_SEGMENT;
    // ...

This is a lot nicer. For one thing, cuts down dramatically on the amount of arithmetic book keeping we need to think about. It’s also much more readable, as the parts of the stack have descriptive names. Writing it this way was a lot easier, and it shook out most of the bugs pretty quickly. I want to specifically point out this one line:

    stack_begin->saved_ctx.ebp = (uint32_t)&stack_begin->ebp_terminator;

This was one of the trickier-looking pieces of code before, but with things properly labeled, it suddenly just becomes boring singly-linked-list code.

The astute reader will note that I’ve just spent at least two solid paragraphs talking about the fact that good variable names make code more readable, as if this is a novel discovery. There is of course nothing novel about it, but this is what happens when you’re thinking in assembly, where parts of data types isn’t something you can easily do, because your language doesn’t even have a way to talk about them except as individual memory addresses.

This version is cleaner, but there’s a mistake, can you see it?

We’re putting the pointer for the struct in the wrong place! I subtracted instead of adding, so the stack is going to be constructed outside of the memory I actually allocated. This is bad.

Version three looked almost identical, but instead we were adding the stack size, and then decrementing to get us back inside the bounds of the stack (but at the very end). This version actually worked, but I took the last bug as a lesson, and sought to remove that last little bit of arithmetic from the code. Here’s the final version:

typedef struct NewStackBase NewStackBase;
typedef struct NewStack NewStack;

struct NewStackBase {
    Regs saved_ctx;
    uint32_t thread_ret;
    uint32_t thread_arg;
    uint32_t ebp_terminator;

struct NewStack {
    uint8_t extra_space[STACK_SIZE - sizeof(NewStackBase)];
    NewStackBase base;

// ...

    NewStack *stack = kalloc_align(sizeof(NewStack), STACK_ALIGNMENT);
    Thread *ret = kalloc(sizeof(Thread));

    NewStackBase *base = &stack->base;
    base->ebp_terminator = 0;
    base->thread_arg = (uint32_t)data;
    base->thread_ret = 0;
    base->saved_ctx.eip = (uint32_t)entry_point;
    base->saved_ctx.ebp = (uint32_t)&base->ebp_terminator;
    base->saved_ctx.ds = KERNEL_DATA_SEGMENT;
    // ...

Okay! Now the code actually looks fairly mundane; there’s no pointer arithmetic left at all, and it’s pretty easy to visually see that this won’t touch memory outside of what it’s allocated. This version works, and I’m actually comfortable maintaining it. I couldn’t have done anything like this in assembly — I’d be stuck doing arithmetic and needing to keep careful track of where everything is. Having data declarations built into the language like this makes it much easier to write this sort of code.

You can of course treat C as if it’s a portable assembly language, but doing so is usually unnecessary and ill-advised. The above code is doing one of the lowest-level tasks you’ll ever see, and there’s still no need for pointer arithmetic.