# Linkers part 4

## Shared Libraries

We’ve talked a bit about what object files and executables look like, so what
do shared libraries look like? I’m going to focus on ELF shared libraries as
used in SVR4 (and GNU/Linux, etc.), as they are the most flexible shared
library implementation and the one I know best.

Windows shared libraries, known as DLLs, are less flexible in that you have to
compile code differently depending on whether it will go into a shared library
or not. You also have to express symbol visibility in the source code. This is
not inherently bad, and indeed ELF has picked up some of these ideas over time,
but the ELF format makes more decisions at link time and is thus more powerful.

When the program linker creates a shared library, it does not yet know which
virtual address that shared library will run at. In fact, in different
processes, the same shared library will run at different addresses, depending on
the decisions made by the dynamic linker. This means that shared library code
must be position independent. More precisely, it must be position independent
after the dynamic linker has finished loading it. It is always possible for the
dynamic linker to convert any piece of code to run at any virtual address,
given sufficient relocation information. However, the reloc computations must
then be performed every time the program starts, which makes it start more
slowly. Therefore, any shared library system seeks to generate
position independent code which requires a minimal number of relocations to be
applied at runtime, while still running at close to the runtime efficiency of
position dependent code.

An additional complexity is that ELF shared libraries were designed to be
roughly equivalent to ordinary archives. This means that by default the main
executable may override symbols in the shared library, such that references in
the shared library will call the definition in the executable, even if the
shared library also defines that same symbol. For example, an executable may
define its own version of `malloc`. The C library also defines `malloc`, and
the C library contains code which calls `malloc`. If the executable defines
`malloc` itself, it will override the function in the C library. When some
other function in the C library calls `malloc`, it will call the definition in
the executable, not the definition in the C library.
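
To make the override rule concrete, here is a minimal sketch in C of a lookup that searches the executable before any shared library. The `struct object` type, the `resolve` function, and the sample symbol lists are hypothetical names invented for illustration; a real dynamic linker uses hash tables and a more involved search order.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one loaded object: a name plus the symbols it defines. */
struct object {
    const char *name;
    const char *const *symbols;  /* NULL-terminated list of defined symbol names */
};

/* Return the name of the first object in `scope` that defines `symbol`,
 * or NULL if no object defines it. The executable is expected to come
 * first in `scope`, which is what lets it override library definitions. */
const char *resolve(const struct object *scope, size_t n, const char *symbol)
{
    for (size_t i = 0; i < n; i++) {
        for (const char *const *s = scope[i].symbols; *s != NULL; s++) {
            if (strcmp(*s, symbol) == 0)
                return scope[i].name;
        }
    }
    return NULL;
}

/* Sample data (assumed for illustration): both objects define malloc. */
static const char *const exe_symbols[]  = { "main", "malloc", NULL };
static const char *const libc_symbols[] = { "malloc", "free", "printf", NULL };

const struct object sample_scope[] = {
    { "executable", exe_symbols },   /* searched first */
    { "libc",       libc_symbols },
};
```

With this sample scope, `resolve(sample_scope, 2, "malloc")` returns `"executable"` even though the C library also defines `malloc`, while `resolve(sample_scope, 2, "free")` returns `"libc"`.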

There are thus different requirements pulling in different directions for any
specific ELF implementation. The right implementation choices will depend on
the characteristics of the processor. That said, most, but not all, processors
make fairly similar decisions. I will describe the common case here. An example
of a processor which uses the common case is the i386; an example of a
processor which makes some different decisions is the PowerPC.

In the common case, code may be compiled in two different modes. By default,
code is position dependent. Putting position dependent code into a shared
library will cause the program linker to generate a lot of relocation
information, and cause the dynamic linker to do a lot of processing at
runtime. Code may also be compiled in position independent mode, typically
with the `-fpic` option. Position independent code is slightly slower when it
calls a non-static function or refers to a global or static variable. However,
it requires much less relocation information, and thus the dynamic linker will
start the program faster.

Position independent code will call non-static functions via the *Procedure
Linkage Table* or *PLT*. This PLT does not exist in .o files. In a .o file, use
of the PLT is indicated by a special relocation. When the program linker
processes such a relocation, it will create an entry in the PLT. It will
adjust the instruction such that it becomes a PC-relative call to the PLT
entry. PC-relative calls are inherently position independent and thus do not
require a relocation entry themselves. The program linker will create a
relocation for the PLT entry which tells the dynamic linker which symbol is
associated with that entry. This process reduces the number of dynamic
relocations in the shared library from one per function call to one per
function called.
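
The saving can be illustrated with a small counting sketch in C. It assumes a hypothetical representation in which each call site is recorded as the integer id of the function it calls; the function names here are made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Without a PLT, every call site needs its own dynamic relocation. */
size_t relocs_without_plt(const int *call_sites, size_t n)
{
    (void)call_sites;  /* only the number of call sites matters */
    return n;
}

/* With a PLT, each distinct callee needs one PLT entry and therefore
 * one dynamic relocation, however many call sites refer to it. */
size_t relocs_with_plt(const int *call_sites, size_t n)
{
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        bool seen = false;
        for (size_t j = 0; j < i; j++) {
            if (call_sites[j] == call_sites[i]) {
                seen = true;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

For six call sites that call functions `0, 1, 0, 2, 1, 0`, the first function returns 6 and the second returns 3.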

Further, PLT entries are normally relocated lazily by the dynamic linker. On
most ELF systems this laziness may be overridden by setting the LD_BIND_NOW
environment variable when running the program. However, by default, the dynamic
linker will not actually apply a relocation to the PLT until some code actually
calls the function in question. This also reduces startup time, in that many
invocations of a program will not call every possible function. This is
particularly true when considering the shared C library, which has many more
function calls than any typical program will execute.

In order to make this work, the program linker initializes the PLT entries to
load an index into some register or push it on the stack, and then to branch to
common code. The common code calls back into the dynamic linker, which uses the
index to find the appropriate PLT relocation, and uses that to find the
function being called. The dynamic linker then initializes the PLT entry with
the address of the function, and then jumps to the code of the function. The
next time the function is called, the PLT entry will branch directly to the
function.
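
The lazy binding scheme can be modelled in portable C with a table of function pointers. This is only a sketch of the idea, not how any real dynamic linker is written: `plt_slot`, `resolve_and_call`, `reloc_targets`, and the two library functions are hypothetical names, and a real resolver jumps to the target rather than calling it.

```c
#include <stddef.h>

typedef int (*func_t)(int);

/* Hypothetical library functions that the PLT entries should eventually reach. */
static int twice(int x)  { return 2 * x; }
static int negate(int x) { return -x; }

/* Stand-in for the dynamic linker's symbol lookup, indexed by PLT relocation. */
static const func_t reloc_targets[] = { twice, negate };

static int stub0(int x);
static int stub1(int x);

/* One slot per PLT entry. Each slot starts out pointing at its lazy stub. */
func_t plt_slot[] = { stub0, stub1 };

/* Counts how often the resolver ran, so the laziness is observable. */
int resolver_calls = 0;

/* Plays the role of the common code plus the dynamic linker: use the
 * index to find the target, patch the slot, then run the function. */
static int resolve_and_call(size_t index, int arg)
{
    resolver_calls++;
    plt_slot[index] = reloc_targets[index];
    return plt_slot[index](arg);
}

/* Each stub supplies its own index and branches to the common code. */
static int stub0(int x) { return resolve_and_call(0, x); }
static int stub1(int x) { return resolve_and_call(1, x); }

/* What a call site does: always go through the slot. */
int call_via_plt(size_t index, int arg)
{
    return plt_slot[index](arg);
}
```

The first `call_via_plt(0, 21)` goes through the resolver and returns 42; a second call through slot 0 reaches `twice` directly and leaves `resolver_calls` at 1. Slot 1 is never resolved unless something calls it.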

Before giving an example, I will talk about the other major data structure in
position independent code, the *Global Offset Table* or *GOT*. This is used for
global and static variables. For every reference to a global variable from
position independent code, the compiler will generate a load from the GOT to
get the address of the variable, followed by a second load to get the actual
value of the variable. The address of the GOT will normally be held in a
register, permitting efficient access. Like the PLT, the GOT does not exist in
a .o file, but is created by the program linker. The program linker will create
the dynamic relocations which the dynamic linker will use to initialize the GOT
at runtime. Unlike the PLT, the dynamic linker always fully initializes the GOT
when the program starts.
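
In C terms, the double load looks roughly like the following sketch. The array `got_model`, the slot numbers, and `init_got` are assumptions made for illustration; in a real program the compiler emits the loads and the dynamic linker fills in the table.

```c
/* Two hypothetical global variables defined somewhere in the program. */
int counter = 7;
int limit = 100;

/* Slot numbers the program linker would have assigned. */
enum { GOT_COUNTER = 0, GOT_LIMIT = 1 };

/* Stand-in for the GOT: one address per global variable. */
static int *got_model[2];

/* Plays the role of the dynamic linker processing the GOT relocations
 * at startup, before any other code runs. */
void init_got(void)
{
    got_model[GOT_COUNTER] = &counter;
    got_model[GOT_LIMIT]   = &limit;
}

/* Reading a global from position independent code takes two loads. */
int read_global(int slot)
{
    int *addr = got_model[slot];  /* first load: the variable's address */
    return *addr;                 /* second load: the variable's value */
}

/* Writing is a load of the address followed by a store. */
void write_global(int slot, int value)
{
    *got_model[slot] = value;
}
```

After `init_got()`, `read_global(GOT_COUNTER)` returns 7, and `write_global(GOT_COUNTER, 8)` changes `counter` itself, since the table only holds its address.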

For example, on the i386, the address of the GOT is held in the register
`%ebx`. This register is initialized at the entry to each function in position
independent code. The initialization sequence varies from one compiler to
another, but typically looks something like this:

```asm
call __i686.get_pc_thunk.bx   # returns with %ebx = address of the next instruction
add $offset,%ebx              # %ebx now holds the address of the GOT
```

The function `__i686.get_pc_thunk.bx` simply looks like this:

```asm
mov (%esp),%ebx   # copy the return address from the top of the stack
ret
```

These instructions form a position independent sequence which obtains the
address at which it is running. Then it uses an offset to get the address of
the GOT. Note that this requires that the GOT always be a fixed offset from the
code, regardless of where the shared library is loaded. That is, the dynamic
linker must load the shared library as a fixed unit; it may not load different
parts at varying addresses.
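
The arithmetic behind that requirement can be sketched in C. The two layout constants are made-up values standing in for whatever the program linker chose; the point is only that their difference is fixed at link time while the load base is not.

```c
#include <stdint.h>

/* Hypothetical link-time layout, as offsets from the start of the library. */
#define CODE_OFFSET 0x1000u  /* offset of the instruction after the call */
#define GOT_OFFSET  0x3000u  /* offset of the GOT */

/* The constant the program linker would place in the add instruction. */
uint32_t got_displacement(void)
{
    return GOT_OFFSET - CODE_OFFSET;
}

/* What the call/add pair computes at runtime for a given load base. */
uint32_t got_address_at_runtime(uint32_t load_base)
{
    uint32_t pc = load_base + CODE_OFFSET;  /* value the thunk reads from (%esp) */
    return pc + got_displacement();         /* add $offset,%ebx */
}
```

Whatever `load_base` is passed in, the result is `load_base + GOT_OFFSET`, which only holds because the code and the GOT moved together.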

Global and static variables are now read or written by first loading the
address via a fixed offset from `%ebx`. The program linker will create dynamic
relocations for each entry in the GOT, telling the dynamic linker how to
initialize the entry. These relocations are of type `GLOB_DAT`.

For function calls, the program linker will set up a PLT entry to look like
this:

```asm
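jmp *offset(%ebx)     # jump through this entry's GOT slot
pushl $index          # lazy path: push the index for this entry
jmp first_plt_entry   # branch to the common code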
```

The program linker will allocate an entry in the GOT for each entry in the
PLT. It will create a dynamic relocation for the GOT entry of type `JMP_SLOT`.
It will initialize the GOT entry to the base address of the shared library plus
the address of the second instruction in the code sequence above. When the
dynamic linker does the initial lazy binding on a `JMP_SLOT` reloc, it will
simply add the difference between the shared library load address and the
shared library base address to the GOT entry. The effect is that the first jmp
instruction will jump to the second instruction, which will push the index
entry and branch to the first PLT entry. The first PLT entry is special, and
looks like this:

```asm
pushl 4(%ebx)   # push the second GOT entry
jmp *8(%ebx)    # jump through the third GOT entry, into the dynamic linker
```

This references the second and third entries in the GOT. The dynamic linker
will initialize them to have appropriate values for a callback into the dynamic
linker itself. The dynamic linker will use the index pushed by the first code
sequence to find the `JMP_SLOT` relocation. When the dynamic linker determines
the function to be called, it will store the address of the function into the
GOT entry referenced by the first code sequence. Thus, the next time the
function is called, the jmp instruction will branch directly to the right code.
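
The two things that happen to a `JMP_SLOT` GOT entry can be sketched as plain address arithmetic in C. The structure and function names below are hypothetical, and the GOT is modelled as an array of 32-bit values rather than real memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified JMP_SLOT relocation: just names a GOT slot. */
struct jmp_slot_reloc {
    size_t got_index;
};

/* Startup, lazy case: each slot already holds the link-time address of
 * the push instruction in its PLT entry, so the dynamic linker only
 * adds the difference between the load address and the base address. */
void relocate_jmp_slots_lazily(uint32_t *got,
                               const struct jmp_slot_reloc *relocs, size_t n,
                               uint32_t base_address, uint32_t load_address)
{
    uint32_t delta = load_address - base_address;
    for (size_t i = 0; i < n; i++)
        got[relocs[i].got_index] += delta;
}

/* First call: once the function has been found, its address replaces
 * the slot contents, so later jumps through the slot reach it directly. */
void bind_jmp_slot(uint32_t *got, const struct jmp_slot_reloc *reloc,
                   uint32_t function_address)
{
    got[reloc->got_index] = function_address;
}
```

For example, a slot that holds `0x1016` at link time with base address 0 becomes `0x40001016` when the library is loaded at `0x40000000`, and is later overwritten with the address of the resolved function.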

That was a fast pass over a lot of details, but I hope that it conveys the
main idea. It means that for position independent code on the i386, every call
to a global function requires one extra instruction after the first time it is
called. Every reference to a global or static variable requires one extra
instruction. Almost every function uses four extra instructions when it starts
to initialize `%ebx` (leaf functions which do not refer to any global variables
do not need to initialize `%ebx`). This all has some negative impact on the
instruction cache. This is the runtime performance penalty paid to let the dynamic
linker start the program quickly.

On other processors, the details are naturally different. However, the general
flavour is similar: position independent code in a shared library starts faster
and runs slightly slower.

More tomorrow.