Exploring Go's Runtime - How a Process Bootstraps Itself - Part I

I’ve spent a considerable portion of my professional life dealing with Java's HotSpot virtual machine and decided, after acquiring a strong affinity for Go, that I should attempt to build an equally deep understanding of the Go runtime. Much to my surprise, there is little formal documentation outside of the source, so the opportunity seems ripe to consolidate this in an accessible digest.

Let me speak briefly about the format of this post:

  • This will be a multi-part series with individual posts dedicated to discrete topics. It is unclear to me how many parts there will be, but I anticipate two or three.
  • For the purposes of this series, I will stick to the source found in Go 1.4.1. The runtime variant under examination is GNU/Linux on an amd64 (x86-64) virtual machine.
  • Where possible, I reference source code in Go's runtime as well as a custom-built sandbox to demonstrate the concepts. The sandbox lives here but is not strictly necessary to follow along. This first post does not use the sandbox heavily, but other posts may!

Without further ado, let’s begin Part I of this series: How a Process Bootstraps Itself. We’ll start out with the entry point of a Go binary. This task is critical for tracing the flow of execution to discover how a Go binary bootstraps itself. If you want, you can clone the sandbox and create a local copy of it for self-study:

$ git clone https://github.com/matttproud/golang_runtime_exploration
$ cd golang_runtime_exploration
$ make  # output elided

We turn to a newly generated file called entrypoint/entrypoint.disassembled, which I generated from a vanilla Go program, entrypoint/entrypoint.go, using the objdump tool, and which states

start address 0x0000000000421760

in the prologue. This is the entry point into the generated Go binary. What does this mean, and where does it come from? Since we're running on Linux, and Linux uses the ELF executable format these days, Go’s linker is obligated to define an entrypoint address in the executable files it emits. Specifically, Go’s linker defines this address in the entry member of the Elf32_Ehdr structure. A call inside the linker to entryvalue() populates this member from the static INITENTRY variable, which is set by interpolating a format string _rt0_%s_%s with GOARCH and GOOS constants. Now that we know where and how this entrypoint comes from, let’s return to what it means. The entrypoint.disassembled listing at address 0x421760 looks something like this (again: from objdump):

0000000000421760 <_rt0_amd64_linux>:
  421760:       48 8d 74 24 08          lea    0x8(%rsp),%rsi
  421765:       48 8b 3c 24             mov    (%rsp),%rdi
  421769:       b8 70 17 42 00          mov    $0x421770,%eax
  42176e:       ff e0                   jmpq   *%rax

This is the body of machine code that is executed when the execve call starts the binary.
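To make the two ideas above concrete, here is a minimal Go sketch. The helper names entrySymbol and elfEntry are hypothetical and are not part of the linker: the first merely mirrors the _rt0_%s_%s format string, and the second uses the standard debug/elf package to read the entry address from an ELF header, which should correspond to the start address that objdump reports.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

// entrySymbol mirrors the linker's "_rt0_%s_%s" format string.
// Hypothetical helper, for illustration only.
func entrySymbol(goarch, goos string) string {
	return fmt.Sprintf("_rt0_%s_%s", goarch, goos)
}

// elfEntry returns the entry address recorded in an ELF file's header.
// Hypothetical helper built on the standard debug/elf package.
func elfEntry(path string) (uint64, error) {
	f, err := elf.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	return f.Entry, nil
}

func main() {
	fmt.Println(entrySymbol("amd64", "linux")) // _rt0_amd64_linux

	// Optionally inspect a binary given on the command line.
	if len(os.Args) < 2 {
		return
	}
	entry, err := elfEntry(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "not a readable ELF file:", err)
		return
	}
	fmt.Printf("start address 0x%016x\n", entry)
}
```

Run against the sandbox's entrypoint binary, the second part should print the same address that appears in the objdump prologue.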

Great, now how do I read this, you might ask? I won't pull your leg by saying that I'm a wonk of Assembler, but I will say that we’re in luck with mature, documented prior art: standardized process bootstrapping mechanisms, like the one that glibc defines (cf., i386 implementation). The Ksplice team at Oracle has even produced a nice writeup of their mechanics if you are so curious. The good news is instead of reading raw disassembled output, we can consult Go’s intermediate Assembler format and the canonical source in upstream since the format string above tells us to look at _rt0_amd64_linux. Bingo! The _rt0_amd64_linux symbol is defined in rt0_linux_amd64.s. Let’s trace out the bootstrapping process, focusing on the unique particulars of Go. Here’s that entrypoint again, but this time in Go’s Assembler format:

TEXT _rt0_amd64_linux(SB),NOSPLIT,$-8
 LEAQ 8(SP), SI // argv
 MOVQ 0(SP), DI // argc
 MOVQ $main(SB), AX
 JMP AX

What it is doing here is assigning argv and argc from the stack pointer to the local registers SI and DI respectively before invoking the main procedure. Note: this procedure is not the main function in package main we defined! We’ll get to why in a later post. Rather, this main is defined in the same file as follows:

TEXT main(SB),NOSPLIT,$-8
 MOVQ $runtime·rt0_go(SB), AX
 JMP AX

The definition is pretty straightforward: execute the rt0_go procedure found in package runtime. You may ask, why the boilerplate of having the entrypoint just invoke a simple shell of a main procedure, which, in turn, just delegates to the rt0_go procedure? The answer is that each GOARCH and GOOS has different initial bootstrapping dependencies and requirements (cf., Linux on ARM). At this point, rt0_go is assumed to be safely generic for all GOOS variants since it is defined in asm_amd64.s. Let’s continue the tracing by looking at this procedure closely:

// copy arguments forward on an even stack
 MOVQ DI, AX  // argc
 MOVQ SI, BX  // argv
 SUBQ $(4*8+7), SP  // 2args 2auto
 ANDQ $~15, SP
 MOVQ AX, 16(SP)
 MOVQ BX, 24(SP)

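This is responsible for setting up the stack: it saves argc and argv in scratch registers, reserves room for two arguments and two locals (plus slack for alignment), rounds the stack pointer down to a 16-byte boundary, and stores the program arguments once again, this time at 16(SP) and 24(SP).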
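The SUBQ/ANDQ pair is ordinary integer arithmetic, so it can be replayed in Go. This is only a sketch: alignStack is a hypothetical name, and the starting stack pointer is a made-up value chosen for illustration.

```go
package main

import "fmt"

// alignStack mimics "SUBQ $(4*8+7), SP" followed by "ANDQ $~15, SP".
// Hypothetical helper, for illustration only.
func alignStack(sp uint64) uint64 {
	sp -= 4*8 + 7 // reserve four 8-byte slots plus 7 bytes of slack
	sp &^= 15     // clear the low four bits: round down to a multiple of 16
	return sp
}

func main() {
	sp := uint64(0x7ffd00001008) // made-up initial stack pointer
	aligned := alignStack(sp)
	fmt.Printf("%#x -> %#x (moved down %d bytes)\n", sp, aligned, sp-aligned)
}
```

For this input the pointer moves down 40 bytes to 0x7ffd00000fe0. Whatever the input, the result is 16-byte aligned and leaves at least the 32 bytes needed for the four slots.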

Before we look at the next segment, let’s take a quick diversion into terminology. In the world of Go’s scheduler, we have a cast of acronyms G, M, and P that require explanation. To borrow the words of Daniel Morsing (from his nice blog post on the Go scheduler ca. 1.1 release):

  • G corresponds to a Goroutine (struct G).
  • M corresponds to a Machine, which can be effectively substituted with a thread of execution within the operating system (struct M). Gs are executed on Ms.
  • P corresponds to a Processor, which can be thought of as a resource in which an M runs a G (struct P).

Now that we have that out of the way, we come to this gem:

// create istack out of the given (operating system) stack.
 // _cgo_init may update stackguard.
 MOVQ $runtime·g0(SB), DI
 LEAQ (-64*1024+104)(SP), BX
 MOVQ BX, g_stackguard0(DI)
 MOVQ BX, g_stackguard1(DI)
 MOVQ BX, (g_stack+stack_lo)(DI)
 MOVQ SP, (g_stack+stack_hi)(DI)

Using what we have learned above, we can infer that g0 is the 0th Goroutine of the system and possibly performs various runtime management functions. It is defined by the struct G and has some interesting fields, a few listed here:

Stack stack; // offset known to runtime/cgo 
uintptr stackguard0; // offset known to liblink
uintptr stackguard1; // offset known to liblink

Given the Go Assembler guide, the assembly listing, and the struct G excerpts, we can interpret the istack assembly to mean the following:

  • Assign the address of g0 to DI, which will be used to scope the subsequent references.
  • Compute the address 0xff98 bytes below the stack pointer (64*1024-104 in hex) and assign it to BX. If someone has a good idea on where this value originates, clarification would be appreciated.
  • Assign BX to g0.stackguard0 and g0.stackguard1. In the context of Go, stackguard relates to the logic behind the stack resizing methodology and when more memory is allocated. This may become a topic for a future post but is described in some detail in the Contiguous Stacks Design Document.
  • Assign BX to g0.stack.lo and the stack pointer to g0.stack.hi.

… and there we have the geometry of g0’s stack. I presume a methodology similar to this is used in creation of additional Goroutines once the scheduler has started, but I haven’t verified yet.
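The steps above can be sketched in plain Go to make the resulting numbers visible. The types below are simplified stand-ins for the runtime's Stack and G structures, initG0 is a hypothetical name, and the stack pointer is a made-up value.

```go
package main

import "fmt"

// stack and g are simplified stand-ins for the runtime's structures.
type stack struct {
	lo, hi uint64
}

type g struct {
	stack       stack
	stackguard0 uint64
	stackguard1 uint64
}

// initG0 mirrors the istack block: the low bound sits 64*1024-104 bytes
// below the current stack pointer, and both guards start at that bound.
func initG0(sp uint64) g {
	lo := sp - 64*1024 + 104
	return g{
		stack:       stack{lo: lo, hi: sp},
		stackguard0: lo,
		stackguard1: lo,
	}
}

func main() {
	g0 := initG0(0x7ffd00100000) // made-up stack pointer
	fmt.Printf("lo=%#x hi=%#x size=%#x\n",
		g0.stack.lo, g0.stack.hi, g0.stack.hi-g0.stack.lo)
}
```

The printed size is 0xff98, the same constant that appears in the assembly.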

Continuing, we hit the CPU type and capabilities detection procedure:

// find out information about the processor we're on
 MOVQ $0, AX
 CPUID
 CMPQ AX, $0
 JE nocpuinfo
 MOVQ $1, AX
 CPUID
 MOVL CX, runtime·cpuid_ecx(SB)
 MOVL DX, runtime·cpuid_edx(SB)
nocpuinfo:

If the information is available, the feature set is recorded in cpuid_ecx and cpuid_edx. A quick survey shows that the ECX value is used in selecting the fundamental hashing algorithm that the runtime uses internally:

if (cpuid_ecx&(1<<25)) != 0 && // aes (aesenc)
  (cpuid_ecx&(1<<9)) != 0 && // sse3 (pshufb)
  (cpuid_ecx&(1<<19)) != 0 { // sse4.1 (pinsr{d,q})
  useAeshash = true
  algarray[alg_MEM].hash = aeshash
  algarray[alg_MEM8].hash = aeshash
  algarray[alg_MEM16].hash = aeshash
  algarray[alg_MEM32].hash = aeshash32
  algarray[alg_MEM64].hash = aeshash64
  algarray[alg_MEM128].hash = aeshash
  algarray[alg_STRING].hash = aeshashstr
}

These are used, for instance, when indexing a map type, which is to say that the runtime attempts to select the best hashing method for the detected architecture, which could be AES-based. As for EDX, its value indicates whether SSE2 is present and thus which mechanism is used for the runtime’s memclr and memmove procedures.
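The feature tests are plain bit masks over the values that CPUID leaf 1 returns. Here is a small sketch of the same checks; the constant and function names are my own, not the runtime's. Bit 9 of ECX is SSSE3 in Intel's documentation, even though the runtime comment labels it sse3.

```go
package main

import "fmt"

// Feature bits reported by CPUID leaf 1. Names are hypothetical.
const (
	ecxSSSE3 = 1 << 9  // pshufb
	ecxSSE41 = 1 << 19 // pinsr{d,q}
	ecxAES   = 1 << 25 // aesenc
	edxSSE2  = 1 << 26
)

// canUseAESHash reports whether every feature the AES hash needs is present.
func canUseAESHash(ecx uint32) bool {
	const need = ecxSSSE3 | ecxSSE41 | ecxAES
	return ecx&need == need
}

// hasSSE2 reports whether the SSE2 bit is set in EDX.
func hasSSE2(edx uint32) bool {
	return edx&edxSSE2 != 0
}

func main() {
	full := uint32(ecxAES | ecxSSSE3 | ecxSSE41)
	fmt.Println(canUseAESHash(full))           // true
	fmt.Println(canUseAESHash(full &^ ecxAES)) // false
	fmt.Println(hasSSE2(edxSSE2))              // true
}
```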

You can see from the hashing algorithms discussion above that a number of Go’s core runtime features and behaviors are determined by function pointer. The same is the case with Cgo, which immediately follows the CPU capabilities detection.

_cgo_init is a function pointer, defined as such:

// Filled in by dynamic linker when Cgo is available.
void (*_cgo_init)(void);

If Cgo is bundled into the build artifact, the function pointer is set:

extern void x_cgo_init(G*);
void (*_cgo_init)(G*) = x_cgo_init;

x_cgo_init is defined as follows:

void
x_cgo_init(G* g, void (*setg)(void*))
{
 pthread_attr_t attr;
 size_t size;

 setg_gcc = setg;
 pthread_attr_init(&attr);
 pthread_attr_getstacksize(&attr, &size);
 g->stacklo = (uintptr)&attr - size + 4096;
 pthread_attr_destroy(&attr);
}

So that leads us to an assembly block, which handles the Cgo bootstrapping:

// if there is an _cgo_init, call it.
 MOVQ _cgo_init(SB), AX
 TESTQ AX, AX
 JZ needtls
 // g0 already in DI
 MOVQ DI, CX // Win64 uses CX for first parameter
 MOVQ $setg_gcc<>(SB), SI
 CALL AX
 // update stackguard after _cgo_init
 MOVQ $runtime·g0(SB), CX
 MOVQ (g_stack+stack_lo)(CX), AX
 ADDQ $const_StackGuard, AX
 MOVQ AX, g_stackguard0(CX)
 MOVQ AX, g_stackguard1(CX)

So, if _cgo_init is non-null (TESTQ AX, AX), we prepare our function call, which means g0 as the first parameter and again another function pointer: setg_gcc. Remember: the Go assembler guide will help you read the peculiarities in here! setg_gcc on amd64 is defined as such:

// void setg_gcc(G*); set g called from gcc.
TEXT setg_gcc<>(SB),NOSPLIT,$0

Let’s put this rat’s nest together. If _cgo_init is called, …

All of this is done for the system g0 Goroutine.
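The shape of the Cgo block (a nil check on a function pointer, an optional call, then recomputing the stack guards from the possibly updated low bound) can be sketched in Go. Everything here is illustrative: cgoInit stands in for _cgo_init, the g type is a simplified stand-in, and the stackGuard value of 512 is an assumption made for this example rather than a figure taken from the text.

```go
package main

import "fmt"

// stackGuard is an assumed, illustrative value for const_StackGuard.
const stackGuard = 512

// g is a simplified stand-in for the runtime's G structure.
type g struct {
	stackLo     uint64
	stackguard0 uint64
	stackguard1 uint64
}

// cgoInit stands in for the _cgo_init function pointer; it stays nil
// when Cgo is not part of the build.
var cgoInit func(*g)

// bootstrapCgo mirrors the assembly: skip everything when the pointer is
// nil, otherwise call it and derive both guards from the new low bound.
func bootstrapCgo(g0 *g) bool {
	if cgoInit == nil {
		return false
	}
	cgoInit(g0)
	g0.stackguard0 = g0.stackLo + stackGuard
	g0.stackguard1 = g0.stackguard0
	return true
}

func main() {
	g0 := &g{}
	fmt.Println(bootstrapCgo(g0)) // false: no Cgo, nothing to do

	// Pretend Cgo is linked in and reports a made-up stack low bound.
	cgoInit = func(gp *g) { gp.stackLo = 0x7f0000001000 }
	fmt.Println(bootstrapCgo(g0)) // true
	fmt.Printf("stackguard0=%#x\n", g0.stackguard0)
}
```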

Irrespective of Cgo, the runtime now sets up thread local storage (TLS), which is used to scope stateful data to a given thread, or an M per the scheduler terminology. I will admit that this is entering murky territory for me.

LEAQ runtime·tls0(SB), DI
 CALL runtime·settls(SB)

tls0 refers to a byte array and is passed as the first argument to the settls procedure, which, in turn, invokes system call 158 (sys_arch_prctl) to instruct the kernel to set g0’s thread local storage to that array.

Assuming this has been successful, the runtime performs a blackbox test of TLS capability:

// store through it, to make sure it works
 get_tls(BX)
 MOVQ $0x123, g(BX)
 MOVQ runtime·tls0(SB), AX
 CMPQ AX, $0x123
 JEQ 2(PC)
 MOVL AX, 0 // abort

This merely passes the 0x123 integer literal into g0 TLS, fetches the value out of TLS, and then compares for end-to-end correctness. The Go distribution’s compiler bootstrapper (different from the process we are talking about) defines a set of macros related to TLS that are worth knowing:

"#define get_tls(r) MOVQ TLS, r\n"
  "#define g(r) 0(r)(TLS*1)\n"

If you have made it this far, I congratulate you! It’s been an arduous trek.

At this point, the runtime mutually binds g0 to m0 (m0 definition):

// set the per-goroutine and per-mach "registers"
 LEAQ runtime·g0(SB), CX
 LEAQ runtime·m0(SB), AX
 // save m->g0 = g0
 MOVQ CX, m_g0(AX)
 // save m0 to g0->m
 MOVQ AX, g_m(CX)

One question that I have not answered yet is whether g0 and m0 are eternally bound. My assumption, to be explicitly validated later on, is that this binding is done just to bootstrap the scheduler, which I presume runs on g0.
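The mutual binding is nothing more than two pointer stores. In Go terms, with simplified stand-in types that carry only the fields involved, it looks roughly like this:

```go
package main

import "fmt"

// m and g are simplified stand-ins for the runtime's M and G structures.
type m struct {
	g0 *g
}

type g struct {
	m *m
}

var (
	g0 g
	m0 m
)

// bind mirrors the two MOVQ stores from the assembly.
func bind() {
	m0.g0 = &g0 // save m->g0 = g0
	g0.m = &m0  // save m0 to g0->m
}

func main() {
	bind()
	fmt.Println(m0.g0 == &g0, g0.m == &m0) // true true
}
```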

A set of mundane invariant checks follows, which test environmental sanity.

Eventually argc and argv are persisted for later use in the process:

static int32 argc;

#pragma dataflag NOPTR /* argv not a heap pointer */
static uint8** argv;

void
runtime·args(int32 c, uint8 **v)
{
 argc = c;
 argv = v;
 if(runtime·sysargs != nil)
  runtime·sysargs(c, v);
}

The sysargs function pointer is responsible for platform-specific initializations given the arguments, and we’ll see just that. On AMD64 Linux, linux_setup_vdso is responsible for this:

void (*runtime·sysargs)(int32, uint8**) = runtime·linux_setup_vdso;

VDSO stands for Virtual Dynamic Shared Object. Go gets access to this by using the auxiliary vectors. My friend and colleague Manu Garg has a nice writeup about them; I encourage you to read it if you are unfamiliar.

The reason is to acquire initial random data samples:

if(elf_auxv[i].a_type == AT_RANDOM) {
  runtime·startup_random_data = (byte*)elf_auxv[i].a_un.a_val;
  runtime·startup_random_data_len = 16;
}
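For readers who have not met the auxiliary vector before: on a 64-bit system it is a sequence of (type, value) pairs of 8 bytes each, terminated by an AT_NULL entry. The sketch below scans such a sequence for AT_RANDOM. The function names are hypothetical, and the input is a synthetic vector built in the program rather than the real one the kernel hands to the process.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Auxiliary vector entry types, as defined by Linux.
const (
	atNull   = 0
	atPagesz = 6
	atRandom = 25
)

// findAuxv scans a raw little-endian 64-bit auxiliary vector for the
// entry of the wanted type. Hypothetical helper, for illustration only.
func findAuxv(raw []byte, want uint64) (uint64, bool) {
	for off := 0; off+16 <= len(raw); off += 16 {
		typ := binary.LittleEndian.Uint64(raw[off:])
		val := binary.LittleEndian.Uint64(raw[off+8:])
		if typ == atNull {
			break // end of vector
		}
		if typ == want {
			return val, true
		}
	}
	return 0, false
}

// encodeAuxv builds a synthetic vector from (type, value) pairs.
func encodeAuxv(pairs [][2]uint64) []byte {
	raw := make([]byte, 16*len(pairs))
	for i, p := range pairs {
		binary.LittleEndian.PutUint64(raw[16*i:], p[0])
		binary.LittleEndian.PutUint64(raw[16*i+8:], p[1])
	}
	return raw
}

func main() {
	raw := encodeAuxv([][2]uint64{
		{atPagesz, 4096},
		{atRandom, 0x7ffd1234}, // made-up address of the 16 random bytes
		{atNull, 0},
	})
	addr, ok := findAuxv(raw, atRandom)
	fmt.Printf("found=%v addr=%#x\n", ok, addr)
}
```

On Linux the same scan could be pointed at the contents of /proc/self/auxv, but note that the AT_RANDOM value is an address inside the process, not the random bytes themselves.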

Who uses these samples, and where? If we trace this out, …

get_random_data piggybacks off of this body. It has only one use in the runtime, and that is in the runtime package’s init function if AES support is detected:

var rnd unsafe.Pointer
  var n int32
  get_random_data(&rnd, &n)
  if n > hashRandomBytes {
   n = hashRandomBytes
  }
  memmove(unsafe.Pointer(&aeskeysched[0]), rnd, uintptr(n))
  if n < hashRandomBytes {
   // Not very random, but better than nothing.
   for t := nanotime(); n < hashRandomBytes; n++ {
    aeskeysched[n] = byte(t >> uint(8*(n%8)))
   }
  }
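The runtime excerpt depends on internal helpers, so here is a standalone sketch of the same seeding logic that can be run as-is. The function name is hypothetical, the value 32 for hashRandomBytes is an assumption made for this example, and the timestamp is passed in as a parameter instead of coming from nanotime.

```go
package main

import "fmt"

// hashRandomBytes is the key schedule size; 32 is an assumed value.
const hashRandomBytes = 32

// seedKeySchedule copies up to hashRandomBytes of random data into the
// key schedule, then pads any remainder with bytes of a timestamp.
// Hypothetical helper, for illustration only.
func seedKeySchedule(rnd []byte, now int64) [hashRandomBytes]byte {
	var key [hashRandomBytes]byte
	n := copy(key[:], rnd) // copy never writes past the end of key
	// Not very random, but better than nothing.
	for ; n < hashRandomBytes; n++ {
		key[n] = byte(now >> uint(8*(n%8)))
	}
	return key
}

func main() {
	// 16 bytes, matching startup_random_data_len; contents are made up.
	rnd := make([]byte, 16)
	for i := range rnd {
		rnd[i] = 0xAA
	}
	key := seedKeySchedule(rnd, 0x0807060504030201)
	fmt.Printf("% x\n", key[:])
}
```

With only 16 random bytes available, the second half of the key is filled by cycling through the eight bytes of the timestamp, which is exactly the fallback the source comment apologizes for.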

The side-effects are left in aeskeysched, which is used in the AES hashing procedures aeshashbody, aeshash32, and aeshash64—the latter being provided as reference:

TEXT runtime·aeshash64(SB),NOSPLIT,$0-32
 MOVQ p+0(FP), AX // ptr to data
 // s+8(FP) is ignored, it is always sizeof(int64)
 MOVQ h+16(FP), X0 // seed
 PINSRQ $1, (AX), X0 // data
 AESENC runtime·aeskeysched+0(SB), X0
 AESENC runtime·aeskeysched+16(SB), X0
 AESENC runtime·aeskeysched+0(SB), X0
 MOVQ X0, ret+24(FP)

The AES instruction set reference may be helpful. But back to the original question of who uses this, let’s discuss that:

The answer lies in the index expressions specification for map types found in the Go Language Specification. Why? Because maps use hashes for bucketing KeyType-s. I had originally planned on writing a separate post on this topic—might still—but the cat is out of the bag now! One takeaway, as the blog post Go maps in action indicates, is that the user should not expect stable map iteration order! Even if the iteration orders appeared to be the same across runs on different processes on the same host, there is no guarantee that the order would ever be consistent on another host, especially if it possessed a different GOARCH or GOOS!
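A short Go example of the takeaway: ranging over a map yields keys in an unspecified order, so code that needs a stable order should collect and sort the keys itself. The helper name sortedKeys is my own.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the map's keys in ascending order.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	m := map[string]int{"g": 1, "m": 2, "p": 3}

	// The order of this loop is not guaranteed and may differ between runs.
	for k := range m {
		fmt.Print(k, " ")
	}
	fmt.Println()

	// Sorting the keys first gives a stable order.
	fmt.Println(sortedKeys(m)) // [g m p]
}
```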

The next discussion is the last one, for the article has gotten long, and there is no shortage of further topics to cover!

The runtime then calls the operating system initialization routine. On Linux, it is trivial:

 runtime·ncpu = getproccount();

This ncpu value is used in a few places:

I will now close out this part of the article here and save the remaining topics for a followup post or two. The big things to follow are the scheduler, memory manager, and the developer’s entrypoint of the Go programmer’s binary.

Don’t worry: I will link the posts together! Even if it takes a little bit of time for the following edition to come out (most of the research is done; it just needs converting into prose), I expect the general findings to remain true in spite of the partial rewrite of the Go runtime from C and .goc to Go. Stay tuned!

For extracurricular reading, my friend Jeremie Le Hen suggested a quick discussion on the following (included in the sandbox):

I think it would be worth showing at this point some differences between Go-generated binaries and standard binaries, which are kind of bloated nowadays:
$ ldd entrypoint
 not a dynamic executable
$ cat entrypoint.c 
int main(int argc, char *argv[])
{
  // do nothing
  return 0;
}
$ gcc -o entrypoint.c.bin -static entrypoint.c 
$ ls -l entrypoint.c.bin entrypoint
-rwxr-x--- 1 jlehen jlehen 581704 Feb 25 09:44 entrypoint
-rwxr-x--- 1 jlehen jlehen 876940 Feb 25 09:59 entrypoint.c.bin
Also the C binary has a plethora of ELF sections, whereas the Go executable has only 7 if you exclude debug sections, and 3 of them are Go-specific. This means the Go compiler is really independent from the classic C runtime, and doesn’t even use the C runtime objects (crtX.o stuff). The libc doesn’t seem to be used at all either, which is not surprising as there are no dependencies between it and the C runtime objects. This can be verified with objdump -t on the binaries:
  • on entrypoint.c.bin there is a whole lot of misc mostly-internal symbols, which originate in the standard lib;
  • on entrypoint, you see a nicely named collection of symbols, mainly divided in two big sections gc and runtime.

Now that we really are finished, I would like to say a big thank-you to Manu Garg and Jeremie Le Hen for their editorial review and contributions! Big round of applause!


1 comment :

  1. -64*1024+104 could it be initial stack space + context of goroutine?



None of the content contained herein represents the views of my employer nor should it be construed in such a manner. All content is copyright Matt T. Proud and may not be reproduced in whole or part without express permission.