add stuff
parent
f54c03cf01
commit
bd3524e516
44
README.md
|
@ -1,3 +1,45 @@
|
|||
# airs-notes
|
||||
|
||||
Collection of ELF and GOLD linker notes from AIRS' blog, for easier searching
|
||||
## Source
|
||||
|
||||
https://www.airs.com/blog/index.php?s=linkers+part
|
||||
|
||||
Authored and copyrighted by Ian Lance Taylor; collected here for easy lookup.
|
||||
|
||||
## Index
|
||||
|
||||
[Linkers part 1: introduction](/linkers-1.md)
|
||||
[Linkers part 2: technical introduction](/linkers-2.md)
|
||||
[Linkers part 3: address spaces, object file formats](/linkers-3.md)
|
||||
[Linkers part 4: shared libraries](/linkers-4.md)
|
||||
[Linkers part 5: shared libraries redux, ELF symbols](/linkers-5.md)
|
||||
[Linkers part 6: relocations, position-dependent libraries](/linkers-6.md)
|
||||
[Linkers part 7: thread-local storage](/linkers-7.md)
|
||||
[Linkers part 8: ELF segments and sections](/linkers-8.md)
|
||||
[Linkers part 9: symbol versions, relaxation](/linkers-9.md)
|
||||
[Linkers part 10: parallel linking](/linkers-10.md)
|
||||
[Linkers part 11: archives](/linkers-11.md)
|
||||
[Linkers part 12: symbol resolution](/linkers-12.md)
|
||||
[Linkers part 13: symbol versions redux](/linkers-13.md)
|
||||
[Linkers part 14: link-time optimization, initialization code](/linkers-14.md)
|
||||
[Linkers part 15: COMDAT sections](/linkers-15.md)
|
||||
[Linkers part 16: C++ template instantiation, exception frames](/linkers-16.md)
|
||||
[Linkers part 17: warning symbols](/linkers-17.md)
|
||||
[Linkers part 18: incremental linking](/linkers-18.md)
|
||||
[Linkers part 19: `__start` and `__stop` symbols, byte swapping](/linkers-19.md)
|
||||
[Linkers part 20: ending note](/linkers-20.md)
|
||||
|
||||
Other articles included as well:
|
||||
|
||||
[GCC exception frames](/gcc-exception-frames.md)
|
||||
[Linker combreloc](/linker-combreloc.md)
|
||||
[Linker relro](/linker-relro.md)
|
||||
[Combining versions](/combining-versions.md)
|
||||
[Version scripts](/version-scripts.md)
|
||||
[Protected symbols](/protected-symbols.md)
|
||||
[`.eh_frame`](/eh_frame.md)
|
||||
[`.eh_frame_hdr`](/eh_frame_hdr.md)
|
||||
[`.gcc_except_table`](/gcc_except_table.md)
|
||||
[Executable stack](/executable-stack.md)
|
||||
[Piece of PIE](/piece-of-pie.md)
|
||||
|
||||
|
|
|
@ -0,0 +1,58 @@
|
|||
# Combining versions
|
||||
|
||||
Sun introduced a symbol versioning scheme to use for the linker. Their
|
||||
implementation is relatively simple: symbol versions are defined in a version
|
||||
script provided when a shared library is created. The dynamic linker can
|
||||
verify that all required versions are present. This is useful for ensuring that
|
||||
an application can run with a specific version of the library.
|
||||
|
||||
In the Sun versioning scheme, when a symbol is changed to have an incompatible
|
||||
interface, the library file name must change. This then produces a new
|
||||
`DT_SONAME` entry, which leads to new `DT_NEEDED` entries, and thus manages
|
||||
incompatibility at that level.
|
||||
|
||||
Ulrich Drepper and Eric Youngdale introduced a much more sophisticated symbol
|
||||
versioning scheme, which is used by glibc, the GNU linker, and gold. The
|
||||
key differences are that versions may be specified in object files and that
|
||||
shared libraries may contain multiple independent versions of the same symbol.
|
||||
Versions are specified in object files by naming the symbol `NAME@VERSION` or
|
||||
`NAME@@VERSION`. In the former case the symbol is a hidden version, available
|
||||
only by specific request. In the latter case the symbol is a default version,
|
||||
and references to `NAME` will be linked to `NAME@@VERSION`. Versions may also
|
||||
be specified in version scripts.
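
As a rough illustration of the naming convention, the following C sketch splits a symbol string at its first `@` and reports whether it names a hidden or a default version. The struct and function names are invented for this note; they are not part of gold or the GNU linker.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch; not a real linker API. */
struct versioned_name {
    size_t name_len;      /* length of the NAME part */
    const char *version;  /* points at VERSION, or NULL if unversioned */
    int is_default;       /* 1 for NAME@@VERSION, 0 otherwise */
};

/* Split "NAME", "NAME@VERSION" or "NAME@@VERSION" at the first '@'. */
struct versioned_name split_versioned_name(const char *sym)
{
    struct versioned_name r = { strlen(sym), NULL, 0 };
    const char *at = strchr(sym, '@');

    if (at != NULL) {
        r.name_len = (size_t)(at - sym);
        r.is_default = (at[1] == '@');
        r.version = at + (r.is_default ? 2 : 1);
    }
    return r;
}
```

For example, `split_versioned_name("foo@@VERS_2")` reports a default version named `VERS_2`, while `split_versioned_name("foo@VERS_1")` reports a hidden one.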
|
||||
|
||||
This facility means that in principle it is never necessary to change the
|
||||
library file name. The versioning scheme lets the dynamic linker direct each
|
||||
symbol reference to the appropriate version. This in turn means that in a
|
||||
complicated program with many shared libraries compiled against different
|
||||
versions of the base library, only one instance of the base library needs to be
|
||||
loaded.
|
||||
|
||||
However, this additional complexity leads to additional ambiguity. There are
|
||||
now two possible sources of a symbol version: the name in the object file and
|
||||
an entry in the version script. There is the possibility that two instances of
|
||||
the same name will disagree on whether the name should be globally visible or
|
||||
not. In fact, this is normal, as undefined references will always use
|
||||
`NAME@VERSION`, not `NAME@@VERSION`. Symbol overriding can be confusing: if the
|
||||
main executable defines `NAME` without a version, which versions should it
|
||||
override in the shared library? Which version should be used in the program?
|
||||
Symbol visibility adds an additional wrinkle to this.
|
||||
|
||||
The most important issue for the linker arises when it sees both `NAME` and
|
||||
`NAME@VERSION`, and then sees `NAME@@VERSION`. At that time the linker has seen
|
||||
two separate symbols and has to decide whether to merge them. The rules that
|
||||
gold currently follows are these:
|
||||
|
||||
* If `NAME` is hidden, and `NAME@@VERSION` is in a shared object, they are two
|
||||
independent symbols, and we do not change `NAME` or its version.
|
||||
* If `NAME` already has a version, because we earlier saw `NAME@@VERSION2`,
|
||||
then we produce two separate symbols, and leave `NAME@@VERSION2` as the
|
||||
default symbol.
|
||||
* Otherwise, we change the version of `NAME` to `VERSION`, and do normal symbol
|
||||
resolution.
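
The three rules above can be modelled as a small decision function. This is only a sketch of the rules as stated here, not gold's actual code; the enum, struct, and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Possible outcomes when NAME@@VERSION shows up after NAME was seen. */
enum merge_action {
    KEEP_INDEPENDENT,           /* rule 1: two independent symbols, NAME unchanged */
    KEEP_SEPARATE_OLD_DEFAULT,  /* rule 2: NAME@@VERSION2 stays the default */
    ADOPT_VERSION_AND_RESOLVE   /* rule 3: give NAME the new version, then resolve */
};

/* What the linker already knows about NAME (hypothetical model). */
struct seen_symbol {
    bool is_hidden;       /* NAME has hidden visibility */
    const char *version;  /* version already attached to NAME, or NULL */
};

/* Apply the rules in the order in which they are listed. */
enum merge_action decide_merge(const struct seen_symbol *name,
                               bool new_def_in_shared_object)
{
    if (name->is_hidden && new_def_in_shared_object)
        return KEEP_INDEPENDENT;
    if (name->version != NULL)
        return KEEP_SEPARATE_OLD_DEFAULT;
    return ADOPT_VERSION_AND_RESOLVE;
}
```

The order of the checks matters: a hidden `NAME` is left alone even if it already carries a version.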
|
||||
|
||||
I recently fixed a bug in this code in gold, which was breaking symbol
|
||||
overriding in a specific case. I wouldn’t be surprised if there are more bugs.
|
||||
As far as I know nobody has worked through all the symbol combining issues and
|
||||
defined what should happen.
|
||||
|
|
@ -0,0 +1,124 @@
|
|||
# .eh_frame
|
||||
|
||||
When gcc generates code that handles exceptions, it produces tables that
|
||||
describe how to unwind the stack. These tables are found in the `.eh_frame`
|
||||
section. The format of the `.eh_frame` section is very similar to the format of
|
||||
a DWARF `.debug_frame` section. Unfortunately, it is not precisely identical. I
|
||||
don’t know of any documentation which describes this format. The following
|
||||
should be read in conjunction with the relevant section of the DWARF standard,
|
||||
available from http://dwarfstd.org.
|
||||
|
||||
The `.eh_frame` section is a sequence of records. Each record is either a CIE
|
||||
(Common Information Entry) or an FDE (Frame Description Entry). In general
|
||||
there is one CIE per object file, and each CIE is associated with a list of
|
||||
FDEs. Each FDE is typically associated with a single function. The CIE and the
|
||||
FDE together describe how to unwind to the caller if the current instruction
|
||||
pointer is in the range covered by the FDE.
|
||||
|
||||
There should be exactly one FDE covering each instruction which may be being
|
||||
executed when an exception occurs. By default an exception can only occur
|
||||
during a function call or a throw. When using the `-fnon-call-exceptions` gcc
|
||||
option, an exception can also occur on most memory references and floating
|
||||
point operations. When using `-fasynchronous-unwind-tables`, the FDE will cover
|
||||
every instruction, to permit unwinding from a signal handler.
|
||||
|
||||
The general format of a CIE or FDE starts as follows:
|
||||
|
||||
* Length of record. Read 4 bytes. If they are not `0xffffffff`, they are the
|
||||
length of the CIE or FDE record. Otherwise the next 64 bits hold the length,
|
||||
and this is a 64-bit DWARF format. This is like `.debug_frame`.
|
||||
* A 4 byte ID. For a CIE this is 0. For an FDE it is the byte offset from this
|
||||
field to the start of the CIE with which this FDE is associated. The byte
|
||||
offset goes to the length record of the CIE. A positive value goes backward;
|
||||
that is, you have to subtract the value of the ID field from the current byte
|
||||
position to get the CIE position. This differs from `.debug_frame` in that
|
||||
the offset is relative rather than being an offset into the `.debug_frame`
|
||||
section.
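
The two common fields can be read with a few lines of C. The sketch below assumes the section is already in memory in host byte order, keeps the ID at 4 bytes in both formats as described above, and relies on the DWARF convention that the length does not count the length field itself. The struct and function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decoded common header of a CIE or FDE (hypothetical type). */
struct eh_record_header {
    uint64_t length;    /* record length, excluding the length field(s) */
    size_t id_offset;   /* offset of the ID field within the record: 4 or 12 */
    uint32_t id;        /* 0 for a CIE, backward byte offset for an FDE */
    int is_cie;
    size_t cie_offset;  /* section offset of the associated CIE */
};

/* Read the header of the record at section offset `off`.
   Returns 0 on success, -1 if the section is too short. */
int read_eh_record_header(const uint8_t *sec, size_t size, size_t off,
                          struct eh_record_header *out)
{
    uint32_t len32;

    if (off > size || size - off < 4)
        return -1;
    memcpy(&len32, sec + off, 4);
    out->length = len32;
    out->id_offset = 4;
    if (len32 == 0xffffffffu) {          /* 64-bit DWARF format */
        uint64_t len64;
        if (size - off < 12)
            return -1;
        memcpy(&len64, sec + off + 4, 8);
        out->length = len64;
        out->id_offset = 12;
    }
    if (size - off < out->id_offset + 4)
        return -1;
    memcpy(&out->id, sec + off + out->id_offset, 4);
    out->is_cie = (out->id == 0);
    /* A positive ID points backward from the ID field to the CIE's length. */
    out->cie_offset = out->is_cie ? off : off + out->id_offset - out->id;
    return 0;
}
```

For an FDE at section offset 8 whose ID field holds 12, the ID field sits at offset 12, so the CIE starts at offset 0.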
|
||||
|
||||
A CIE record continues as follows:
|
||||
|
||||
* 1 byte CIE version. As of this writing this should be 1 or 3.
|
||||
* NUL terminated augmentation string. This is a sequence of characters. Very
|
||||
old versions of gcc used the string “eh” here, but I won’t document that.
|
||||
This is described further below.
|
||||
* Code alignment factor, an unsigned LEB128 (LEB128 is a DWARF encoding for
|
||||
numbers which I won’t describe here). This should always be 1 for `.eh_frame`.
|
||||
* Data alignment factor, a signed LEB128. This is a constant factored out of
|
||||
offset instructions, as in `.debug_frame`.
|
||||
* The return address register. In CIE version 1 this is a single byte; in CIE
|
||||
version 3 this is an unsigned LEB128. This indicates which column in the
|
||||
frame table represents the return address.
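
Since several of these fields are LEB128 values, here is a minimal decoder sketch in C for reference. It follows the encoding defined in the DWARF standard (7 payload bits per byte, least significant group first, high bit set on every byte except the last). The function names are my own, and there is no bounds checking.

```c
#include <stdint.h>

/* Decode an unsigned LEB128 value at *p and advance *p past it. */
uint64_t read_uleb128(const uint8_t **p)
{
    uint64_t result = 0;
    unsigned shift = 0;
    uint8_t byte;

    do {
        byte = *(*p)++;
        if (shift < 64)
            result |= (uint64_t)(byte & 0x7f) << shift;
        shift += 7;
    } while (byte & 0x80);
    return result;
}

/* Decode a signed LEB128 value at *p and advance *p past it. */
int64_t read_sleb128(const uint8_t **p)
{
    uint64_t result = 0;
    unsigned shift = 0;
    uint8_t byte;

    do {
        byte = *(*p)++;
        if (shift < 64)
            result |= (uint64_t)(byte & 0x7f) << shift;
        shift += 7;
    } while (byte & 0x80);
    /* Sign-extend when bit 6 of the final byte is set. */
    if (shift < 64 && (byte & 0x40))
        result |= ~(uint64_t)0 << shift;
    return (int64_t)result;
}
```

A data alignment factor of -8, for instance, is the single byte `0x78`.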
|
||||
|
||||
The next fields of the CIE depend on the augmentation string.
|
||||
|
||||
* If the augmentation string starts with ‘z’, we now find an unsigned LEB128
|
||||
which is the length of the augmentation data, rounded up so that the CIE ends
|
||||
on an address boundary. This is used to skip to the end of the augmentation
|
||||
data if an unrecognized augmentation character is seen.
|
||||
* If the next character in the augmentation string is ‘L’, the next byte in the
|
||||
CIE is the LSDA (Language Specific Data Area) encoding. This is a
|
||||
`DW_EH_PE_xxx` value (described later). The default is `DW_EH_PE_absptr`.
|
||||
* If the next character in the augmentation string is ‘R’, the next byte in the
|
||||
CIE is the FDE encoding. This is a `DW_EH_PE_xxx` value. The default is
|
||||
`DW_EH_PE_absptr`.
|
||||
* The character ‘S’ in the augmentation string means that this CIE represents a
|
||||
stack frame for the invocation of a signal handler. When unwinding the stack,
|
||||
signal stack frames are handled slightly differently: the instruction pointer
|
||||
is assumed to be before the next instruction to execute rather than after it.
|
||||
* If the next character in the augmentation string is ‘P’, the next byte in the
|
||||
CIE is the personality encoding, a `DW_EH_PE_xxx` value. This is followed by
|
||||
a pointer to the personality function, encoded using the personality
|
||||
encoding. I’ll describe the personality function some other day.
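
One possible way to walk the augmentation string is sketched below in C. It assumes the caller has already consumed the augmentation length that follows a leading 'z', that the personality pointer uses a fixed-size format, and that the target pointer size equals the host's. All names are hypothetical, and `DW_EH_PE_aligned` padding is not handled.

```c
#include <stddef.h>
#include <stdint.h>

/* What the augmentation string and data say about a CIE (hypothetical). */
struct cie_augmentation {
    int has_size;                 /* string starts with 'z' */
    int is_signal_frame;          /* 'S' seen */
    int has_lsda_encoding;        /* 'L' seen */
    int has_fde_encoding;         /* 'R' seen */
    uint8_t lsda_encoding;
    uint8_t fde_encoding;
    uint8_t personality_encoding; /* valid only if personality != NULL */
    const uint8_t *personality;   /* still-encoded pointer, or NULL */
};

/* Size in bytes of a fixed-size DW_EH_PE_xxx format; 0 for LEB128/unknown. */
static size_t fixed_size(uint8_t enc)
{
    switch (enc & 0x0f) {
    case 0x00: return sizeof(void *);   /* absptr */
    case 0x02: case 0x0a: return 2;     /* udata2, sdata2 */
    case 0x03: case 0x0b: return 4;     /* udata4, sdata4 */
    case 0x04: case 0x0c: return 8;     /* udata8, sdata8 */
    default:   return 0;
    }
}

/* `data` points just past the augmentation length.
   Returns 0 on success, -1 on anything this sketch does not understand. */
int parse_cie_augmentation(const char *aug, const uint8_t *data,
                           struct cie_augmentation *out)
{
    *out = (struct cie_augmentation){0};
    if (*aug == 'z') {
        out->has_size = 1;
        aug++;
    }
    for (; *aug != '\0'; aug++) {
        switch (*aug) {
        case 'L':
            out->has_lsda_encoding = 1;
            out->lsda_encoding = *data++;
            break;
        case 'R':
            out->has_fde_encoding = 1;
            out->fde_encoding = *data++;
            break;
        case 'S':
            out->is_signal_frame = 1;
            break;
        case 'P': {
            size_t n;
            out->personality_encoding = *data++;
            n = fixed_size(out->personality_encoding);
            if (n == 0)
                return -1;
            out->personality = data;
            data += n;
            break;
        }
        default:
            return -1;   /* a real reader would skip using the 'z' length */
        }
    }
    return 0;
}
```

The characters are processed in string order because each one consumes its bytes from the augmentation data in that same order.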
|
||||
|
||||
The remaining bytes are an array of `DW_CFA_xxx` opcodes which define the
|
||||
initial values for the frame table. This is then followed by `DW_CFA_nop`
|
||||
padding bytes as required to match the total length of the CIE.
|
||||
|
||||
An FDE starts with the length and ID described above, and then continues as
|
||||
follows.
|
||||
|
||||
* The starting address to which this FDE applies. This is encoded using the FDE
|
||||
encoding specified by the associated CIE.
|
||||
* The number of bytes after the start address to which this FDE applies. This
|
||||
is encoded using the FDE encoding.
|
||||
* If the CIE augmentation string starts with ‘z’, the FDE next has an unsigned
|
||||
LEB128 which is the total size of the FDE augmentation data. This may be used
|
||||
to skip data associated with unrecognized augmentation characters.
|
||||
* If the CIE does not specify `DW_EH_PE_omit` as the LSDA encoding, the FDE
|
||||
next has a pointer to the LSDA, encoded as specified by the CIE.
|
||||
|
||||
The remaining bytes in the FDE are an array of `DW_CFA_xxx` opcodes which set
|
||||
values in the frame table for unwinding to the caller.
|
||||
|
||||
The `DW_EH_PE_xxx` encodings describe how to encode values in a CIE or FDE. The
|
||||
basic encoding is as follows:
|
||||
|
||||
* `DW_EH_PE_absptr = 0x00`: An absolute pointer. The size is determined by
|
||||
whether this is a 32-bit or 64-bit address space, and will be 32 or 64 bits.
|
||||
* `DW_EH_PE_omit = 0xff`: The value is omitted.
|
||||
* `DW_EH_PE_uleb128 = 0x01`: The value is an unsigned LEB128.
|
||||
* `DW_EH_PE_udata2 = 0x02`, `DW_EH_PE_udata4 = 0x03`, `DW_EH_PE_udata8 = 0x04`:
|
||||
The value is stored as unsigned data with the specified number of bytes.
|
||||
* `DW_EH_PE_signed = 0x08`: A signed number. The size is determined by whether
|
||||
this is a 32-bit or 64-bit address space. I don’t think this ever appears in
|
||||
a CIE or FDE in practice.
|
||||
* `DW_EH_PE_sleb128 = 0x09`: A signed LEB128. Not used in practice.
|
||||
* `DW_EH_PE_sdata2 = 0x0a`, `DW_EH_PE_sdata4 = 0x0b`, `DW_EH_PE_sdata8 = 0x0c`:
|
||||
The value is stored as signed data with the specified number of bytes. Not
|
||||
used in practice.
|
||||
|
||||
In addition to the above basic encodings, there are modifiers.
|
||||
|
||||
* `DW_EH_PE_pcrel = 0x10`: Value is PC relative.
|
||||
* `DW_EH_PE_textrel = 0x20`: Value is text relative.
|
||||
* `DW_EH_PE_datarel = 0x30`: Value is data relative.
|
||||
* `DW_EH_PE_funcrel = 0x40`: Value is relative to start of function.
|
||||
* `DW_EH_PE_aligned = 0x50`: Value is aligned: padding bytes are inserted as
|
||||
required to make value be naturally aligned.
|
||||
* `DW_EH_PE_indirect = 0x80`: This is actually the address of the real value.
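
Given the values listed above, an encoding byte can be taken apart with three masks: the low four bits select the basic format, bits `0x70` select the relative-to modifier, and `0x80` is the indirect flag. The C sketch below shows that split; the struct and function names are my own.

```c
#include <stdint.h>

#define DW_EH_PE_omit     0xff
#define DW_EH_PE_sdata4   0x0b
#define DW_EH_PE_pcrel    0x10
#define DW_EH_PE_datarel  0x30
#define DW_EH_PE_indirect 0x80

/* The parts of a DW_EH_PE_xxx byte (hypothetical type). */
struct eh_pe {
    int omitted;          /* the whole byte was DW_EH_PE_omit */
    uint8_t format;       /* basic encoding, e.g. DW_EH_PE_sdata4 */
    uint8_t application;  /* modifier, e.g. DW_EH_PE_pcrel */
    int indirect;         /* value is the address of the real value */
};

struct eh_pe split_eh_pe(uint8_t enc)
{
    struct eh_pe r = { 0, 0, 0, 0 };

    if (enc == DW_EH_PE_omit) {
        r.omitted = 1;
        return r;
    }
    r.format = enc & 0x0f;
    r.application = enc & 0x70;
    r.indirect = (enc & DW_EH_PE_indirect) != 0;
    return r;
}
```

`DW_EH_PE_omit` has to be tested first, because `0xff` would otherwise decode as a combination of format, modifier, and indirect bits.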
|
||||
|
||||
If you follow all that, and also read up on `.debug_frame`, then you have
|
||||
enough information to unwind the stack at runtime, e.g. to implement glibc’s
|
||||
backtrace function. Later I’ll describe the LSDA and the personality function,
|
||||
which work together to implement exception catching on top of stack unwinding.
|
||||
|
|
@ -0,0 +1,49 @@
|
|||
# .eh_frame_hdr
|
||||
|
||||
If you followed my last post, you will see that in order to unwind the stack
|
||||
you have to find the FDE associated with a given program counter value. There
|
||||
are two steps to this problem. The first one is finding the CIEs and FDEs at
|
||||
all. The second one is, given the set of FDEs, finding the one you need.
|
||||
|
||||
The old way this worked was that gcc would create a global constructor which
|
||||
called the function `__register_frame_info`, passing a pointer to the
|
||||
`.eh_frame` data and a pointer to the object. The latter pointer would indicate
|
||||
the shared library, and was used to deregister the information after a dlclose.
|
||||
When looking for an FDE, the unwinder would walk through the registered frames,
|
||||
and sort them. Then it would use the sorted list to find the desired FDE.
|
||||
|
||||
The old way still works, but these days, at least on GNU/Linux, the sorting is
|
||||
done at link time, which is better than doing it at runtime. Both gold and the
|
||||
GNU linker support an option `--eh-frame-hdr` which tells them to construct a
|
||||
header for all the `.eh_frame` sections. This header is placed in a section named
|
||||
`.eh_frame_hdr` and also in a `PT_GNU_EH_FRAME` segment. At runtime the unwinder
|
||||
can find all the `PT_GNU_EH_FRAME` segments by calling `dl_iterate_phdr`.
|
||||
|
||||
The format of the `.eh_frame_hdr` section is as follows:
|
||||
|
||||
* A 1 byte version number, currently 1.
|
||||
* A 1 byte encoding of the pointer to the exception frames. This is a
|
||||
`DW_EH_PE_xxx` value. It is normally `DW_EH_PE_pcrel | DW_EH_PE_sdata4`,
|
||||
meaning a 4 byte relative offset.
|
||||
* A 1 byte encoding of the count of the number of FDEs in the lookup table.
|
||||
This is a `DW_EH_PE_xxx` value. It is normally `DW_EH_PE_udata4`, meaning a 4
|
||||
byte unsigned count.
|
||||
* A 1 byte encoding of the entries in the lookup table. This is a
|
||||
`DW_EH_PE_xxx` value. It is normally `DW_EH_PE_datarel | DW_EH_PE_sdata4`,
|
||||
meaning a 4 byte offset from the start of the `.eh_frame_hdr` section. That
|
||||
is the only encoding that gcc’s current unwind library supports.
|
||||
* A pointer to the contents of the `.eh_frame` section, encoded as indicated by
|
||||
the second byte in the header. This pointer is only used if the format of the
|
||||
lookup table is not supported or is for some reason omitted.
|
||||
* The number of FDE pointers in the table, encoded as indicated by the third
|
||||
byte in the header. If there are no FDEs, the encoding can be `DW_EH_PE_omit`
|
||||
and this number will not be present.
|
||||
* The lookup table itself, starting at a 4-byte aligned address in memory.
|
||||
Assuming the fourth byte in the header is `DW_EH_PE_datarel | DW_EH_PE_sdata4`,
|
||||
each entry in the table is 8 bytes long. The first four bytes are an offset
|
||||
to the initial PC value for the FDE. The last four bytes are an offset to the
|
||||
FDE data itself. The table is sorted by starting PC.
|
||||
|
||||
Since FDEs do not overlap, this table is sufficient for the stack unwinder to
|
||||
quickly find the relevant FDE if there is one.
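
Because the table is sorted by starting PC, the lookup is a plain binary search. The C sketch below assumes the usual `DW_EH_PE_datarel | DW_EH_PE_sdata4` table encoding, so both fields are signed 32-bit offsets from the start of `.eh_frame_hdr`. The type and function names are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One 8-byte lookup table entry under the assumed encoding. */
struct hdr_entry {
    int32_t initial_pc;  /* offset of the FDE's first PC */
    int32_t fde;         /* offset of the FDE itself */
};

/* Return the index of the last entry with initial_pc <= pc_off,
   or -1 if pc_off lies before the first entry. */
ptrdiff_t find_fde_entry(const struct hdr_entry *table, size_t count,
                         int32_t pc_off)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].initial_pc <= pc_off)
            lo = mid + 1;   /* candidate is at mid or later */
        else
            hi = mid;
    }
    return (ptrdiff_t)lo - 1;
}
```

The table holds only starting addresses, so a caller would still need to read the FDE it finds and check that the PC falls within that FDE's range.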
|
||||
|
|
@ -0,0 +1,104 @@
|
|||
# Executable stack
|
||||
|
||||
The gcc compiler implements an extension to C: nested functions. A trivial example:
|
||||
|
||||
```c
|
||||
int f() {
|
||||
int i = 2;
|
||||
int g(int j) { return i + j; }
|
||||
return g(3);
|
||||
}
|
||||
```
|
||||
|
||||
The function `f` will return 5. Note in particular that the nested function `g`
|
||||
refers to the variable `i` defined in the enclosing function.
|
||||
|
||||
You can mostly treat nested functions as ordinary functions. In particular, you
|
||||
can take the address of a nested function, and you can pass the resulting
|
||||
function pointer to another function; that function can make a call through the
|
||||
function pointer to the nested function, and the nested function will correctly
|
||||
refer to variables in its caller’s stack frame. I’m not going to go into
|
||||
the details of how this is implemented. What I will say is that gcc currently
|
||||
implements this by writing instructions to the stack and using a pointer to
|
||||
those instructions. This requires that the stack be executable.
|
||||
|
||||
This approach was implemented many years ago, before computers were routinely
|
||||
attacked. In the hostile Internet environment of today, an area of memory that
|
||||
is both writable and executable is dangerous, because it gives an attacker
|
||||
space to create brand new instructions to execute. Since the stack must be
|
||||
writable, this means that we want to make the stack non-executable if possible.
|
||||
Since very few programs use nested functions, this is normally possible. But we
|
||||
don’t want to break those few programs either.
|
||||
|
||||
This is how the GNU tools do it on ELF systems such as GNU/Linux. The compiler
|
||||
adds a new section to all code that it compiles. The section is named
|
||||
`.note.GNU-stack`. It is empty and not allocated, which means that it takes up
|
||||
no space at runtime. If the code being compiled does not require an executable
|
||||
stack—the normal case—the compiler doesn’t set any flags for the section. If
|
||||
the code does require an executable stack, the compiler sets the
|
||||
`SHF_EXECINSTR` flag.
|
||||
|
||||
When the linker links a program, it checks each input object for a
|
||||
`.note.GNU-stack` section. If there is no such section, the linker assumes that
|
||||
the object must be old, and therefore may require an executable stack. If there
|
||||
is such a section, the linker checks the section flags to see whether the code
|
||||
requires an executable stack. The linker discards the `.note.GNU-stack`
|
||||
sections, and creates a `PT_GNU_STACK` segment in the output executable. The
|
||||
`PT_GNU_STACK` segment is empty and is not part of any `PT_LOAD` segment. The
|
||||
segment flags `PF_R` and `PF_W` are always set. If the linker has determined
|
||||
that the program requires an executable stack, it also sets the `PF_X` flag.
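
To see what the linker decided for a running program, one option is to look up the `PT_GNU_STACK` segment with `dl_iterate_phdr`. This is a sketch assuming glibc on GNU/Linux; the two function names are my own, and it relies on the documented behaviour that the main program is the first object visited.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>

/* dl_iterate_phdr callback: record the PT_GNU_STACK flags of one object. */
static int find_gnu_stack(struct dl_phdr_info *info, size_t size, void *data)
{
    int *flags = data;

    (void)size;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        if (info->dlpi_phdr[i].p_type == PT_GNU_STACK) {
            *flags = (int)info->dlpi_phdr[i].p_flags;
            break;
        }
    }
    return 1;   /* nonzero stops the walk after the main program */
}

/* Return the PF_* flags of the main program's PT_GNU_STACK segment,
   or -1 if it has none. */
int main_program_stack_flags(void)
{
    int flags = -1;

    dl_iterate_phdr(find_gnu_stack, &flags);
    return flags;
}
```

A caller can then test `flags & PF_X` to learn whether the program was marked as requiring an executable stack.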
|
||||
|
||||
When the Linux kernel starts a program, it looks for a `PT_GNU_STACK` segment.
|
||||
If it does not find one, it sets the stack to be executable (if appropriate for
|
||||
the architecture). If it does find a `PT_GNU_STACK` segment, it marks the stack
|
||||
as executable if the segment flags call for it. (It’s possible to override this
|
||||
and force the kernel to never use an executable stack.) Similarly, the dynamic
|
||||
linker looks for a `PT_GNU_STACK` in any executable or shared library that it
|
||||
loads, and changes the stack to be executable if any of them require it.
|
||||
|
||||
When this all works smoothly, most programs wind up with a non-executable
|
||||
stack, which is what we want. The most common reason that this fails these days
|
||||
is that part of the program is written in assembler, and the assembler code
|
||||
does not create a `.note.GNU-stack` section. If you write assembler code for
|
||||
GNU/Linux, you must always be careful to add the appropriate line to your file.
|
||||
For most targets, the line you want is:
|
||||
|
||||
```asm
|
||||
.section .note.GNU-stack,"",@progbits
|
||||
```
|
||||
|
||||
There are some linker options to control this. The `-z execstack` option tells
|
||||
the linker to mark the program as requiring an executable stack, regardless of
|
||||
the input files. The `-z noexecstack` option marks it as not requiring an
|
||||
executable stack. The gold linker has a `--warn-execstack` option which will
|
||||
cause the linker to warn about any object which is missing a `.note.GNU-stack`
|
||||
section or which has an executable `.note.GNU-stack` section.
|
||||
|
||||
The `execstack` program may also be used to query whether a program requires an
|
||||
executable stack, and to change its setting.
|
||||
|
||||
These days we could probably change the default: we could probably say that if
|
||||
an object file does not have a `.note.GNU-stack` section, then it does not
|
||||
require an executable stack. That would avoid the problem of files written in
|
||||
assembler which do not create the section. It’s possible that this would cause
|
||||
some programs to incorrectly get a non-executable stack, but I think that would
|
||||
be quite unlikely in practice. An advantage of changing the default would be
|
||||
that the compiler would not have to create an empty `.note.GNU-stack` section
|
||||
in all object files.
|
||||
|
||||
By the way, there is one thing you can do with a normal function that you can
|
||||
not do with a nested function: if the nested function refers to any variables
|
||||
in the enclosing function, you cannot return a pointer to the nested function
|
||||
to the caller. If you do, the variable will disappear, so the variable
|
||||
reference in the nested function will be a dangling reference. It’s worth noting
|
||||
here that the Go language supports nested function literals which may refer to
|
||||
variables in the enclosing function, and when using Go this works correctly.
|
||||
The compiler creates variables on the heap if necessary, so they do not
|
||||
disappear until the garbage collector determines that nothing refers to them
|
||||
any more.
|
||||
|
||||
Finally, I’ll mention that there are some plans to implement a different scheme
|
||||
for nested functions in C, one which does not require any memory to be both
|
||||
writable and executable, but these plans have not yet been implemented. I’ll
|
||||
leave the implementation as an exercise for the reader.
|
||||
|
|
@ -0,0 +1,56 @@
|
|||
# GCC Exception Frames
|
||||
|
||||
When an exception is thrown in C++ and caught by one of the calling functions,
|
||||
the supporting libraries need to unwind the stack. With gcc this is done using
|
||||
a variant of DWARF debugging information. The unwind information is loaded at
|
||||
runtime, but is not read unless an exception is thrown. That means that the
|
||||
unwind library needs to have some way of finding the appropriate unwind
|
||||
information at runtime.
|
||||
|
||||
On some systems, this is done by registering the exception frame information
|
||||
when the program starts. The registration is done with a variant of the
|
||||
handling of C++ constructors. This becomes interesting when one shared library
|
||||
can throw an exception which is caught by another shared library. It is
|
||||
possible for such a case to arise when the executable itself never throws
|
||||
exceptions and therefore has no frames to register. Obviously the unwinder
|
||||
needs to be able to find the unwind information for both shared libraries,
|
||||
which means that both shared libraries need to use the same registration
|
||||
functions. With gcc this is normally ensured by putting the unwind code in a
|
||||
shared library, `libgcc_s.so`. Each shared library, and sometimes the
|
||||
executable, will use `libgcc_s.so`. That ensures a single copy of the
|
||||
registration and unwind functions, so the library will be able to reliably
|
||||
unwind across shared libraries. With gcc the use of `libgcc_s.so` can be
|
||||
controlled with the `-shared-libgcc` and `-static-libgcc` options. Normally the
|
||||
right thing will happen by default.
|
||||
|
||||
That approach has a cost: there is an extra shared library, and there is a
|
||||
small cost of registering the unwind information at program startup or library
|
||||
load time (and unregistering it if a shared library is unloaded via dlclose).
|
||||
There is now a better way, which requires linker support.
|
||||
|
||||
Both gold and the GNU linker support the command line option `--eh-frame-hdr`.
|
||||
With this option, when the linker sees the `.eh_frame` sections used to hold
|
||||
the unwind information, it automatically builds a header. This header is a
|
||||
sorted array mapping program counter addresses to unwind information. The
|
||||
header is recorded as a program segment of type `PT_GNU_EH_FRAME`. (This is a
|
||||
little bit ugly since the `.eh_frame` sections are recognized only by name;
|
||||
ideally they should have a special section type.)
|
||||
|
||||
At runtime, the unwind library can use the `dl_iterate_phdr` function to find
|
||||
the program segments of the executable and all currently loaded shared
|
||||
libraries. It can use that to find the `PT_GNU_EH_FRAME` segments, and use the
|
||||
sorted array in those segments to quickly find the unwind information.
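
A rough C sketch of that lookup follows, assuming glibc's `dl_iterate_phdr` on GNU/Linux. For each loaded object it checks whether a `PT_LOAD` segment contains the program counter and, if so, returns the runtime address of that object's `PT_GNU_EH_FRAME` segment. The struct and function names are hypothetical and are not taken from gcc's unwinder.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>
#include <stdint.h>

/* Query passed through dl_iterate_phdr (hypothetical type). */
struct hdr_lookup {
    uintptr_t pc;              /* address we want unwind info for */
    const void *eh_frame_hdr;  /* result: start of the sorted header */
};

static int find_hdr_for_pc(struct dl_phdr_info *info, size_t size, void *data)
{
    struct hdr_lookup *q = data;
    const ElfW(Phdr) *hdr = NULL;
    int contains_pc = 0;

    (void)size;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        uintptr_t start = info->dlpi_addr + ph->p_vaddr;

        if (ph->p_type == PT_LOAD && q->pc >= start
            && q->pc - start < ph->p_memsz)
            contains_pc = 1;
        else if (ph->p_type == PT_GNU_EH_FRAME)
            hdr = ph;
    }
    if (!contains_pc || hdr == NULL)
        return 0;               /* not this object; keep iterating */
    q->eh_frame_hdr = (const void *)(info->dlpi_addr + hdr->p_vaddr);
    return 1;                   /* found; stop iterating */
}

/* Return the header covering `pc`, or NULL if none was found. */
const void *eh_frame_hdr_for(const void *pc)
{
    struct hdr_lookup q = { (uintptr_t)pc, NULL };

    dl_iterate_phdr(find_hdr_for_pc, &q);
    return q.eh_frame_hdr;
}
```

No registration step is involved: everything the lookup needs comes from the program headers that the dynamic linker already tracks.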
|
||||
|
||||
This approach means that no registration functions are required. It also means
|
||||
that it is not necessary to have a single shared library, since
|
||||
`dl_iterate_phdr` is available no matter which shared library throws the
|
||||
exception.
|
||||
|
||||
This all only works if you have a linker which supports generating
|
||||
`PT_GNU_EH_FRAME` segments, if all the shared libraries and the executable are
|
||||
linked by such a linker, and if you have a working `dl_iterate_phdr` function
|
||||
in your C library or dynamic linker. I think that pretty much restricts this
|
||||
approach to GNU/Linux and possibly other free operating systems. For those
|
||||
scenarios, I hope that gcc will soon be able to stop using `libgcc_s.so` by
|
||||
default.
|
||||
|
|
@ -0,0 +1,157 @@
|
|||
# .gcc_except_table
|
||||
|
||||
Throwing an exception in C++ requires more than unwinding the stack. As the
|
||||
program unwinds, local variable destructors must be executed. Catch clauses
|
||||
must be examined to see if they should catch the exception. Exception
|
||||
specifications must be checked to see if the exception should be redirected to
|
||||
the unexpected handler. Similar issues arise in Go, Java, and even C when using
|
||||
gcc’s cleanup function attribute.
|
||||
|
||||
As I described earlier, each CIE in the unwind data may contain a pointer to a
|
||||
personality function, and each FDE may contain a pointer to the LSDA, the
|
||||
Language Specific Data Area. Each language has its own personality function.
|
||||
The LSDA is only used by the personality function, so it could in principle
|
||||
differ for each language. However, at least for gcc, every language uses the
|
||||
same format, since the LSDA is generated by the language-independent
|
||||
middle-end.
|
||||
|
||||
The personality function takes five arguments:
|
||||
|
||||
1. An int version number, currently 1.
|
||||
2. A bitmask of actions.
|
||||
3. An exception class, a 64-bit unsigned integer which is specific to a language.
|
||||
4. A pointer to information about the specific exception being thrown.
|
||||
5. Unwinder state information.
|
||||
|
||||
The exception class permits code written in one language to work correctly when
|
||||
an exception is thrown by code written in a different language. The value for
|
||||
g++ is “GNUCC++\0” (or “GNUCC++\1” for a dependent exception, which is used
|
||||
when rethrowing an exception). The value for Go is “GNUCGO\0\0”. The exception
|
||||
specific information can only be examined if the exception class is recognized.
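
The exception class is simply those eight characters packed into a 64-bit integer, with the first character in the most significant byte. A small C sketch (the function name is my own):

```c
#include <stdint.h>

/* Pack eight bytes into an exception class, first byte most significant. */
uint64_t make_exception_class(const char bytes[8])
{
    uint64_t c = 0;

    for (int i = 0; i < 8; i++)
        c = (c << 8) | (uint8_t)bytes[i];
    return c;
}
```

Packed this way, the g++ class comes out as `0x474e5543432b2b00`, so a personality function can compare classes as plain integers.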
|
||||
|
||||
Unwinding the stack for an exception is done in two phases. In the first phase,
|
||||
the unwinder walks up the stack passing the action `_UA_SEARCH_PHASE` (which
|
||||
has the value 1) to each personality function that it finds. The personality
|
||||
function should examine the LSDA to see if there is a handler for the exception
|
||||
being thrown. It should return `_URC_HANDLER_FOUND` (`6`) if there is or
|
||||
`_URC_CONTINUE_UNWIND` (`8`) if there isn’t. The search phase will continue
|
||||
until a handler is found or until the top of the stack is reached. The unwinder
|
||||
will not actually change anything while walking. If the top of the stack is
|
||||
reached the unwinder will simply return, and the calling code will take the
|
||||
appropriate action, which for C++ is to call `std::terminate`. Because of the
|
||||
two phase unwinding approach, if `std::terminate` dumps core, a backtrace will
|
||||
show the code which threw the exception.
|
||||
|
||||
If a handler is found, the second phase begins. The unwinder walks up the stack
|
||||
passing the action `_UA_CLEANUP_PHASE` (`2`) to each personality function. The
|
||||
unwinder will also set `_UA_FORCE_UNWIND` (`8`) in the actions bitmask if the
|
||||
personality function may not catch the exception, because the unwinding is
|
||||
happening due to some event like thread cancellation. The unwinder will walk up
|
||||
the stack until it finds the handler—the stack frame for which the personality
|
||||
function returned `_URC_HANDLER_FOUND`. When it calls that function, the
|
||||
unwinder will pass `_UA_HANDLER_FRAME` (`4`) in the actions bitmask. This time,
|
||||
the unwinder will change things as it goes, removing stack frames.
|
||||
|
||||
In order to run destructors, the personality function will call `_Unwind_SetIP`
|
||||
on the context parameter to set the program counter to point to the cleanup
|
||||
routine, and then return `_URC_INSTALL_CONTEXT` (`7`) to tell the unwinder to
|
||||
branch to the current context. The address which starts the cleanup is known as
|
||||
a landing pad. The cleanup should do whatever it needs to do, and then call
|
||||
`_Unwind_Resume`. The exception information needs to be passed to
|
||||
`_Unwind_Resume`. The personality routine arranges to pass the exception
|
||||
information to the cleanup by calling `_Unwind_SetGR` passing
|
||||
`__builtin_eh_return_data_regno(0)` and the exception information passed to the
|
||||
personality routine. Each target which supports this approach has to dedicate
|
||||
two registers to holding exception information. This is the first one.
|
||||
|
||||
The personality function which finds the handler works pretty much the same
|
||||
way. It may also use `_Unwind_SetGR` to set a value in
|
||||
`__builtin_eh_return_data_regno(1)` to indicate which exception was found. The
|
||||
exception handler may rethrow the exception via `_Unwind_RaiseException` or it
|
||||
may simply continue a normal execution path.
|
||||
|
||||
At this point we’ve seen everything except how the personality function decides
|
||||
whether it needs to run a cleanup or catch an exception. The personality
|
||||
function makes this decision based on the LSDA. As mentioned above, while the
|
||||
LSDA could be language dependent, in practice it is not. There is a different
|
||||
personality function for each language, but they all do more or less the same
|
||||
thing, omitting aspects which are not relevant for the language (e.g., there is
|
||||
a personality function for C, but it only runs cleanups and does not bother to
|
||||
look for exception handlers).
|
||||
|
||||
The LSDA is found in the section `.gcc_except_table` (the personality function
|
||||
is just a function and lives in the `.text` section as usual). The personality
|
||||
function gets a pointer to it by calling `_Unwind_GetLanguageSpecificData`. The
|
||||
LSDA starts with the following fields:
|
||||
|
||||
1. A 1 byte encoding of the following field (a `DW_EH_PE_xxx` value).
|
||||
2. If the encoding is not `DW_EH_PE_omit`, the landing pad base. This is the
|
||||
base from which landing pad offsets are computed. If this is omitted, the
|
||||
base comes from calling `_Unwind_GetRegionStart`, which returns the beginning
|
||||
of the code described by the current FDE. In practice this field is normally
|
||||
omitted.
|
||||
3. A 1 byte encoding of the entries in the type table (a `DW_EH_PE_xxx` value).
|
||||
4. If the encoding is not `DW_EH_PE_omit`, the types table pointer. This is an
|
||||
unsigned LEB128 value, and is the byte offset from this field to the start
|
||||
of the types table used for exception matching.
|
||||
5. A 1 byte encoding of the fields in the call-site table (a `DW_EH_PE_xxx`
|
||||
value).
|
||||
6. An unsigned LEB128 value holding the length in bytes of the call-site table.
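
The header layout above can be read with a few lines of Python. This is a minimal sketch, assuming the landing pad base is omitted (the common case noted in item 2); `parse_lsda_header` and the dictionary keys are hypothetical names, and the types table offset is returned raw rather than resolved to an address.

```python
DW_EH_PE_omit = 0xFF


def read_uleb128(data, pos):
    """Decode an unsigned LEB128 value; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos


def parse_lsda_header(data):
    """Parse the six header fields of an LSDA held in a bytes object."""
    pos = 0
    lpstart_encoding = data[pos]
    pos += 1
    if lpstart_encoding != DW_EH_PE_omit:
        # An explicit landing pad base needs a DW_EH_PE decoder; not sketched here.
        raise NotImplementedError("explicit landing pad base")
    ttype_encoding = data[pos]
    pos += 1
    ttype_offset = None
    if ttype_encoding != DW_EH_PE_omit:
        ttype_offset, pos = read_uleb128(data, pos)
    call_site_encoding = data[pos]
    pos += 1
    call_site_length, pos = read_uleb128(data, pos)
    return {
        "ttype_encoding": ttype_encoding,
        "ttype_offset": ttype_offset,
        "call_site_encoding": call_site_encoding,
        "call_site_length": call_site_length,
        "header_size": pos,  # the call-site table starts here
    }
```

For example, the bytes `ff 9b 15 01 0c` describe an omitted landing pad base, a types table 0x15 bytes away, and a 12 byte call-site table.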
|
||||
|
||||
This header is immediately followed by the call-site table. Each entry in the
|
||||
call-site table has four fields. The length field in the header gives the
|
||||
total size of the table in bytes. Each entry describes a particular sequence
|
||||
of instructions within the function that the FDE describes.
|
||||
|
||||
1. The start of the instructions for the current call site, a byte offset from
|
||||
the landing pad base. This is encoded using the encoding from the header.
|
||||
2. The length of the instructions for the current call site, in bytes. This is
|
||||
encoded using the encoding from the header.
|
||||
3. A pointer to the landing pad for this sequence of instructions, or 0 if
|
||||
there isn’t one. This is a byte offset from the landing pad base. This is
|
||||
encoded using the encoding from the header.
|
||||
4. The action to take, an unsigned LEB128. This is 1 plus a byte offset into
|
||||
the action table. The value zero means that there is no action.
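
As an illustration of the four fields, here is a sketch that decodes a call-site table. It assumes that the header's call-site encoding is unsigned LEB128 for the first three fields, which is an assumption of this example rather than something the format requires; `CallSite` and `parse_call_sites` are hypothetical names.

```python
from collections import namedtuple

CallSite = namedtuple("CallSite", "start length landing_pad action")


def read_uleb128(data, pos):
    """Decode an unsigned LEB128 value; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos


def parse_call_sites(table):
    """Decode every entry in `table`, the bytes of the call-site table only."""
    entries = []
    pos = 0
    while pos < len(table):
        start, pos = read_uleb128(table, pos)        # offset from landing pad base
        length, pos = read_uleb128(table, pos)       # bytes of code covered
        landing_pad, pos = read_uleb128(table, pos)  # 0 means no landing pad
        action, pos = read_uleb128(table, pos)       # 0 means no action
        entries.append(CallSite(start, length, landing_pad, action))
    return entries
```

The length of `table` is the value read from the last header field, so the loop stops exactly at the end of the call-site table.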
|
||||
|
||||
The call-site table is sorted by the start address field. If the personality
|
||||
function finds that there is no entry for the current PC in the call-site
|
||||
table, then there is no exception information. This should not happen in normal
|
||||
operation, and in C++ will lead to a call to `std::terminate`. If there is an
|
||||
entry in the call-site table, but the landing pad is zero, then there is
|
||||
nothing to do: there are no destructors to run or exceptions to catch. This is
|
||||
a normal case, and the unwinder will simply continue. If the action record is
|
||||
zero, then there are destructors to run but no exceptions to catch. The
|
||||
personality function will arrange to run the destructors as described above,
|
||||
and unwinding will continue.
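
The decision described in this paragraph can be written as a small lookup. This is a sketch under simplifying assumptions: the call-site entries are already decoded into tuples, `pc_offset` is the program counter expressed as an offset from the landing pad base, and `CallSite` and `classify` are hypothetical names.

```python
from collections import namedtuple

CallSite = namedtuple("CallSite", "start length landing_pad action")


def classify(call_sites, pc_offset):
    """Return (verdict, landing_pad, action_table_offset) for the given PC."""
    for site in call_sites:
        if site.start <= pc_offset < site.start + site.length:
            if site.landing_pad == 0:
                # Nothing to run or catch in this frame.
                return ("continue", None, None)
            if site.action == 0:
                # Destructors only: run the cleanup, then keep unwinding.
                return ("cleanup", site.landing_pad, None)
            # The stored value is 1 plus the offset into the action table.
            return ("check-actions", site.landing_pad, site.action - 1)
    # No entry at all: in C++ this ends in std::terminate.
    return ("terminate", None, None)
```

Because the table is sorted by start address, a real implementation could stop early or use a binary search; the linear scan here is only for clarity.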
|
||||
|
||||
Otherwise, we have an offset into the action table. Each entry in the action
|
||||
table is a pair of signed LEB128 values. The first number is a type filter. The
|
||||
second number is a byte offset to the next entry in the action table. A byte
|
||||
offset of 0 ends the current set of actions.
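
Walking a chain of action records might look like the following sketch. One detail is an assumption here: the "byte offset to the next entry" is taken to be relative to the position of the offset field itself, which is my understanding of how GCC's runtime reads it. `walk_actions` is a hypothetical name.

```python
def read_sleb128(data, pos):
    """Decode a signed LEB128 value; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            if byte & 0x40:  # sign bit of the final byte
                result -= 1 << shift
            return result, pos


def walk_actions(action_table, call_site_action):
    """Return the type filters reachable from a nonzero call-site action value."""
    filters = []
    pos = call_site_action - 1  # the call-site table stores 1 + byte offset
    while True:
        type_filter, offset_field = read_sleb128(action_table, pos)
        displacement, _ = read_sleb128(action_table, offset_field)
        filters.append(type_filter)
        if displacement == 0:
            return filters
        # Assumption: the displacement is relative to the offset field itself.
        pos = offset_field + displacement
```

With the table `01 00 02 7d`, a call-site action of 3 starts at the second record (filter 2), whose displacement of -3 leads back to the first record (filter 1), which ends the chain.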
|
||||
|
||||
A type filter of zero indicates a cleanup, which is the same as an action
|
||||
record of zero in the call-site table. This means that there is a cleanup to be
|
||||
called even if none of the types match.
|
||||
|
||||
A positive type filter is an index into the types table. This is a negative
|
||||
index: the value 1 means the entry preceding the types table base, 2 means the
|
||||
entry before that, etc. The size of entries in the types table comes from the
|
||||
encoding in the header, as does the base of the types table. Each entry in the
|
||||
types table is a pointer to a type information structure. If this type
|
||||
information structure matches the type of the exception, then we have found a
|
||||
handler for this exception. The type filter value is a switch value which will be
|
||||
passed to the handler in exception register 1. The actual comparison of the
|
||||
type information, and determining the type information from the exception
|
||||
pointer, really is language dependent. In C++ this is a pointer to a
|
||||
`std::type_info` structure. A `NULL` pointer in the types table is a catch-all
|
||||
handler.
|
||||
|
||||
A negative type filter is a byte offset into the types table of a `NULL`
|
||||
terminated list of pointers to type information structures. If the type of the
|
||||
current exception does not match any of the entries in the list, then there is
|
||||
an exception specification error. This is treated as an exception handler with
|
||||
a negative switch value.
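
The three kinds of type filter can be summarized in one function. This is a loose model rather than the real data layout: the types table is a Python list whose last element is the entry just before the table base, exception specifications are a dictionary keyed by the negative filter value, and string equality stands in for the language-dependent type comparison. All of these, and the name `match_filter`, are assumptions made for illustration.

```python
def match_filter(type_filter, types, spec_lists, thrown):
    """Return (kind, switch_value) if this action applies, else None.

    types[-1] models the entry immediately preceding the types table base.
    """
    if type_filter == 0:
        return ("cleanup", 0)
    if type_filter > 0:
        entry = types[-type_filter]  # 1 is the entry just before the base
        if entry is None or entry == thrown:  # None models a catch-all
            return ("handler", type_filter)
        return None
    # Negative filter: an exception specification.
    allowed = spec_lists[type_filter]
    if thrown not in allowed:
        # A violation is handled like a handler with a negative switch value.
        return ("handler", type_filter)
    return None
```

The switch value in the result is what would be placed in exception register 1 for the landing pad to dispatch on.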
|
||||
|
||||
I think that covers everything about how gcc unwinds the stack and throws
|
||||
exceptions.
|
||||
|
|
@ -0,0 +1,23 @@
|
|||
# Linker combreloc
|
||||
|
||||
The GNU linker has a `-z combreloc` option, which is enabled by default (it can
|
||||
be turned off via `-z nocombreloc`). I just implemented this in gold as well.
|
||||
This option directs the linker to sort the dynamic relocations. The sorting is
|
||||
done in order to optimize the dynamic linker.
|
||||
|
||||
The dynamic linker in glibc uses a one element cache when processing relocs: if
|
||||
a relocation refers to the same symbol as the previous relocation, then the
|
||||
dynamic linker reuses the value rather than looking up the symbol again. Thus
|
||||
the dynamic linker gets the best results if the dynamic relocations are sorted
|
||||
so that all dynamic relocations for a given dynamic symbol are adjacent.
|
||||
|
||||
Other than that, the linker sorts together all relative relocations, which
|
||||
don’t have symbols. Two relative relocations, or two relocations against the
|
||||
same symbol, are sorted by the address in the output file. This tends to
|
||||
optimize paging and caching when there are two references from the same page.
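
A small model shows why the ordering helps. This is a sketch, not the linker's code: `Reloc`, `combreloc_sort`, and `symbol_lookups` are hypothetical names, symbols are compared by name where a real linker would use the dynamic symbol index, and placing the relative relocations first is an assumption of this example.

```python
from collections import namedtuple

Reloc = namedtuple("Reloc", "address symbol is_relative")


def combreloc_sort(relocs):
    """Group relative relocs together, then group by symbol, then by address."""
    def key(reloc):
        if reloc.is_relative:
            return (0, "", reloc.address)
        return (1, reloc.symbol, reloc.address)
    return sorted(relocs, key=key)


def symbol_lookups(relocs):
    """Count symbol lookups made by a dynamic linker with a one-element cache."""
    lookups = 0
    cached = None
    for reloc in relocs:
        if reloc.is_relative:
            continue  # relative relocations need no symbol lookup
        if reloc.symbol != cached:
            lookups += 1
            cached = reloc.symbol
    return lookups
```

For a list that alternates between two symbols, the unsorted order misses the cache on every relocation, while the sorted order needs only one lookup per symbol.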
|
||||
|
||||
This may seem like a micro-optimization, but it can have a real effect on
|
||||
program startup time, especially if the program has lots of shared libraries.
|
||||
I’ve seen a case where a program starts up 16% faster because the relocations
|
||||
were sorted.
|
||||
|
|
@ -0,0 +1,56 @@
|
|||
# Linker relro
|
||||
|
||||
gcc, the GNU linker, and the glibc dynamic linker cooperate to implement an
|
||||
idea called read-only relocations, or relro. This permits the linker to
|
||||
designate a part of an executable or (more commonly) a shared library as being
|
||||
read-only after dynamic relocations have been applied.
|
||||
|
||||
This may be used for read-only global variables which are initialized to
|
||||
something which requires a relocation, such as the address of a function or a
|
||||
different global variable. Because the global variable requires a runtime
|
||||
initialization in the form of a dynamic relocation, it can not be placed in a
|
||||
read-only segment. However, because it is declared to be constant, and
|
||||
therefore may not be changed by the program, the dynamic linker can mark it as
|
||||
read-only after the dynamic relocation has been applied.
|
||||
|
||||
For some targets this technique may also be used for the PLT or parts of the
|
||||
GOT.
|
||||
|
||||
Making these pages read-only helps catch some cases of memory corruption, and
|
||||
making the PLT in particular read-only helps prevent some types of buffer
|
||||
overflow exploits.
|
||||
|
||||
The first step is in gcc. When gcc sees a variable which is constant but
|
||||
requires a dynamic relocation, it puts it into a section named `.data.rel.ro`
|
||||
(this functionality unfortunately relies on magic section names). A variable
|
||||
which requires a dynamic relocation against a local symbol is put into a
|
||||
`.data.rel.ro.local` section; this helps group such variables together, so that
|
||||
the dynamic linker may apply the relocations, which will always be `RELATIVE`
|
||||
relocations, more efficiently, especially when using `combreloc`.
|
||||
|
||||
The linker groups `.data.rel.ro` and `.data.rel.ro.local` sections as usual.
|
||||
The new step is that the linker then emits a `PT_GNU_RELRO` program segment
|
||||
which covers these sections. If the PLT and/or GOT can be read-only after
|
||||
dynamic relocations, they are put next to the `.data.rel.ro` sections and also
|
||||
become part of the new segment. This segment will be enclosed within a `PT_LOAD`
|
||||
segment. The `p_vaddr` field of the `PT_GNU_RELRO` segment gives the virtual
|
||||
address of the start of the read-only after dynamic relocations code, and the
|
||||
`p_memsz` field gives its length.
|
||||
|
||||
When the dynamic linker sees a `PT_GNU_RELRO` segment, it uses `mprotect` to mark
|
||||
the pages as read-only after the dynamic relocations have been applied. Of
|
||||
course this only works if the segment does in fact cover an entire page. The
|
||||
linker will try to force this to happen.
|
||||
|
||||
Note that the current dynamic linker code will only work correctly if the
|
||||
`PT_GNU_RELRO` segment starts on a page boundary. This is because the dynamic
|
||||
linker rounds the `p_vaddr` field down to the previous page boundary. If there is
|
||||
anything on the page which should not be read-only, the program is likely to
|
||||
fail at runtime. So in effect the linker must only emit a `PT_GNU_RELRO`
|
||||
segment if it ensures that it starts on a page boundary.
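
The page arithmetic behind this restriction is easy to check. The sketch below assumes 4096 byte pages, and it also assumes that the end of the range is rounded down to a page boundary, which is my understanding of glibc's behaviour and is not stated above. The function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size for this example


def relro_protect_range(p_vaddr, p_memsz, page_size=PAGE_SIZE):
    """Return the (start, end) range a dynamic linker would make read-only."""
    start = p_vaddr & ~(page_size - 1)            # rounded down, as described
    end = (p_vaddr + p_memsz) & ~(page_size - 1)  # assumption: also rounded down
    return start, end


def extra_bytes_protected(p_vaddr, page_size=PAGE_SIZE):
    """Bytes before the segment that become read-only because of the rounding."""
    return p_vaddr - (p_vaddr & ~(page_size - 1))
```

A segment starting at 0x3e10 would cause 0xe10 bytes of whatever precedes it on the page to become read-only as well, which is why the linker has to start the segment on a page boundary.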
|
||||
|
||||
I see this as a relatively minor security benefit. It is not an optimization as
|
||||
far as I can see. I am documenting it here as part of my general documentation
|
||||
of obscure linker features. The current description of this feature in the GNU
|
||||
linker manual is rather obscure.
|
||||
|
|
@ -0,0 +1,83 @@
|
|||
# Linkers part 1
|
||||
|
||||
I’ve been working on and off on a new linker. To my surprise, I’ve discovered
|
||||
in talking about this that some people, even some computer programmers, are
|
||||
unfamiliar with the details of the linking process. I’ve decided to write some
|
||||
notes about linkers, with the goal of producing an essay similar to my existing
|
||||
one about the GNU configure and build system.
|
||||
|
||||
As I only have the time to write one thing a day, I’m going to do this on my
|
||||
blog over time, and gather the final essay together later. I believe that I may
|
||||
be up to five readers, and I hope y’all will accept this digression into stuff
|
||||
that matters. I will return to random philosophizing and minding other people’s
|
||||
business soon enough.
|
||||
|
||||
## A Personal Introduction
|
||||
|
||||
Who am I to write about linkers?
|
||||
|
||||
I wrote my first linker back in 1988, for the AMOS operating system which ran
|
||||
on Alpha Micro systems. (If you don’t understand the following description,
|
||||
don’t worry; all will be explained below). I used a single global database to
|
||||
register all symbols. Object files were checked into the database after they
|
||||
had been compiled. The link process mainly required identifying the object file
|
||||
holding the main function. Other object files were pulled in by reference. I
|
||||
reverse engineered the object file format, which was undocumented but quite
|
||||
simple. The goal of all this was speed, and indeed this linker was much faster
|
||||
than the system one, mainly because of the speed of the database.
|
||||
|
||||
I wrote my second linker in 1993 and 1994. This linker was designed and
|
||||
prototyped by Steve Chamberlain while we both worked at Cygnus Support (later
|
||||
Cygnus Solutions, later part of Red Hat). This was a complete reimplementation
|
||||
of the BFD based linker which Steve had written a couple of years before.
|
||||
The primary target was a.out and COFF. Again the goal was speed, especially
|
||||
compared to the original BFD based linker. On SunOS 4 this linker was almost as
|
||||
fast as running the cat program on the input .o files.
|
||||
|
||||
The linker I am now working on, called gold, will be my third. It is
|
||||
exclusively an ELF linker. Once again, the goal is speed, in this case being
|
||||
faster than my second linker. That linker has been significantly slowed down
|
||||
over the years by adding support for ELF and for shared libraries. This support
|
||||
was patched in rather than being designed in. Future plans for the new linker
|
||||
include support for incremental linking–which is another way of increasing
|
||||
speed.
|
||||
|
||||
There is an obvious pattern here: everybody wants linkers to be faster. This is
|
||||
because the job which a linker does is uninteresting. The linker is a speed
|
||||
bump for a developer, a process which takes a relatively long time but adds no
|
||||
real value. So why do we have linkers at all? That brings us to our next topic.
|
||||
|
||||
## A Technical Introduction
|
||||
|
||||
What does a linker do?
|
||||
|
||||
It’s simple: a linker converts object files into executables and shared
|
||||
libraries. Let’s look at what that means. For cases where a linker is used,
|
||||
the software development process consists of writing program code in some
|
||||
language: e.g., C or C++ or Fortran (but typically not Java, as Java normally
|
||||
works differently, using a loader rather than a linker). A compiler translates
|
||||
this program code, which is human readable text, into another form of
|
||||
human readable text known as assembly code. Assembly code is a readable form of
|
||||
the machine language which the computer can execute directly. An assembler is
|
||||
used to turn this assembly code into an object file. For completeness, I’ll
|
||||
note that some compilers include an assembler internally, and produce an object
|
||||
file directly. Either way, this is where things get interesting.
|
||||
|
||||
In the old days, when dinosaurs roamed the data centers, many programs were
|
||||
complete in themselves. In those days there was generally no compiler–people
|
||||
wrote directly in assembly code–and the assembler actually generated an
|
||||
executable file which the machine could execute directly. As languages like
|
||||
Fortran and Cobol started to appear, people began to think in terms of
|
||||
libraries of subroutines, which meant that there had to be some way to run the
|
||||
assembler at two different times, and combine the output into a single
|
||||
executable file. This required the assembler to generate a different type of
|
||||
output, which became known as an object file (I have no idea where this name
|
||||
came from). And a new program was required to combine different object files
|
||||
together into a single executable. This new program became known as the linker
|
||||
(the source of this name should be obvious).
|
||||
|
||||
Linkers still do the same job today. In the decades that followed, one new
|
||||
feature has been added: shared libraries.
|
||||
|
||||
More tomorrow.
|
||||
|
|
@ -0,0 +1,37 @@
|
|||
# Linkers part 10
|
||||
|
||||
## Parallel Linking
|
||||
|
||||
It is possible to parallelize the linking process somewhat. This can help hide
|
||||
I/O latency and can take better advantage of modern multi-core systems. My
|
||||
intention with gold is to use these ideas to speed up the linking process.
|
||||
|
||||
The first area which can be parallelized is reading the symbols and relocation
|
||||
entries of all the input files. The symbols must be processed in order;
|
||||
otherwise, it will be difficult for the linker to resolve multiple definitions
|
||||
correctly. In particular all the symbols which are used before an archive must
|
||||
be fully processed before the archive is processed, or the linker won’t know
|
||||
which members of the archive to include in the link (I guess I haven’t talked
|
||||
about archives yet). However, despite these ordering requirements, it can be
|
||||
beneficial to do the actual I/O in parallel.
|
||||
|
||||
After all the symbols and relocations have been read, the linker must complete
|
||||
the layout of all the input contents. Most of this can not be done in parallel,
|
||||
as setting the location of one type of contents requires knowing the size of
|
||||
all the preceding types of contents. While doing the layout, the linker can
|
||||
determine the final location in the output file of all the data which needs to
|
||||
be written out.
|
||||
|
||||
After layout is complete, the process of reading the contents, applying
|
||||
relocations, and writing the contents to the output file can be fully
|
||||
parallelized. Each input file can be processed separately.
|
||||
|
||||
Since the final size of the output file is known after the layout phase, it is
|
||||
possible to use `mmap` for the output file. When not doing relaxation, it is
|
||||
then possible to read the input contents directly into place in the output
|
||||
file, and to relocate them in place. This reduces the number of system calls
|
||||
required, and ideally will permit the operating system to do optimal disk I/O
|
||||
for the output file.
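
The split between a sequential layout phase and a parallel output phase can be sketched as follows. This is a toy model: a `bytearray` stands in for the `mmap`ed output file, upper-casing the bytes stands in for applying relocations, and `layout`, `relocate`, and `link` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def layout(inputs):
    """Sequential: each offset depends on the sizes of everything before it."""
    offsets = []
    position = 0
    for contents in inputs:
        offsets.append(position)
        position += len(contents)
    return offsets, position


def relocate(contents):
    """Stand-in for applying relocations; it keeps the size unchanged."""
    return contents.upper()


def link(inputs):
    """Lay out the inputs, then copy and relocate each one in parallel."""
    offsets, total_size = layout(inputs)
    output = bytearray(total_size)  # stands in for the mmap'ed output file

    def emit(index):
        data = relocate(inputs[index])
        start = offsets[index]
        output[start:start + len(data)] = data  # disjoint region per input

    with ThreadPoolExecutor() as pool:
        list(pool.map(emit, range(len(inputs))))
    return bytes(output)
```

Each worker writes only to the region assigned to its input during layout, so the workers never need to coordinate with each other.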
|
||||
|
||||
Just a short entry tonight. More tomorrow.
|
||||
|
|
@ -0,0 +1,49 @@
|
|||
# Linkers part 11
|
||||
|
||||
## Archives
|
||||
|
||||
Archives are a traditional Unix package format. They are created by the `ar`
|
||||
program, and they are normally named with a `.a` extension. Archives are passed
|
||||
to a Unix linker with the `-l` option.
|
||||
|
||||
Although the `ar` program is capable of creating an archive from any type of
|
||||
file, it is normally used to put object files into an archive. When it is used
|
||||
in this way, it creates a symbol table for the archive. The symbol table lists
|
||||
all the symbols defined by any object file in the archive, and for each symbol
|
||||
indicates which object file defines it. Originally the symbol table was created
|
||||
by the `ranlib` program, but these days it is always created by `ar` by default
|
||||
(despite this, many Makefiles continue to run `ranlib` unnecessarily).
|
||||
|
||||
When the linker sees an archive, it looks at the archive’s symbol table. For
|
||||
each symbol the linker checks whether it has seen an undefined reference to
|
||||
that symbol without seeing a definition. If that is the case, it pulls the
|
||||
object file out of the archive and includes it in the link. In other words, the
|
||||
linker pulls in all the object files which define symbols which are referenced
|
||||
but not yet defined.
|
||||
|
||||
This operation repeats until no more symbols can be defined by the archive.
|
||||
This permits object files in an archive to refer to symbols defined by other
|
||||
object files in the same archive, without worrying about the order in which
|
||||
they appear.
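
The repeated scan can be sketched as a loop that runs until nothing changes. This is a simplified model: the archive is a dictionary mapping each member name to the sets of symbols it defines and references, where a real linker would consult the archive symbol table. `select_members` is a hypothetical name, and common symbols are ignored.

```python
def select_members(undefined, defined, archive):
    """Return (members pulled in, symbols still undefined).

    archive maps a member name to a (defines, references) pair of sets.
    """
    undefined = set(undefined)
    defined = set(defined)
    included = []
    changed = True
    while changed:  # repeat until no more symbols can be defined
        changed = False
        for name, (defines, references) in archive.items():
            if name in included:
                continue
            if defines & undefined:  # member defines something we need
                included.append(name)
                defined |= defines
                undefined |= references
                undefined -= defined
                changed = True
    return included, undefined
```

Because the scan repeats, a member that is only needed by a later member of the same archive is still found, whatever order the members are stored in.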
|
||||
|
||||
Note that the linker considers an archive in its position on the command line
|
||||
relative to other object files and archives. If an object file appears after an
|
||||
archive on the command line, that archive will not be used to define symbols
|
||||
referenced by the object file.
|
||||
|
||||
In general the linker will not include archives if they provide a definition
|
||||
for a common symbol. You will recall that if the linker sees a common symbol
|
||||
followed by a defined symbol with the same name, it will treat the common
|
||||
symbol as an undefined reference. That will only happen if there is some other
|
||||
reason to include the defined symbol in the link; the defined symbol will not
|
||||
be pulled in from the archive.
|
||||
|
||||
There was an interesting twist for common symbols in archives on old
|
||||
`a.out`-based SunOS systems. If the linker saw a common symbol, and then saw a
|
||||
common symbol in an archive, it would not include the object file from the
|
||||
archive, but it would change the size of the common symbol to the size in the
|
||||
archive if that were larger than the current size. The C library relied on this
|
||||
behaviour when implementing the `stdin` variable.
|
||||
|
||||
My next posting should be on Monday.
|
||||
|
|
@ -0,0 +1,110 @@
|
|||
# Linkers part 12
|
||||
|
||||
I apologize for the pause in posts. We moved over the weekend. Last Friday AT&T
|
||||
told me that the new DSL was working at our new house. However, it did not
|
||||
actually start working outside the house until Wednesday. Then a problem with
|
||||
the internal wiring meant that it was not working inside the house until today.
|
||||
I am now finally back online at home.
|
||||
|
||||
## Symbol Resolution
|
||||
|
||||
I find that symbol resolution is one of the trickier aspects of a linker.
|
||||
Symbol resolution is what the linker does the second and subsequent times that
|
||||
it sees a particular symbol. I’ve already touched on the topic in a few
|
||||
previous entries, but let’s look at it in a bit more depth.
|
||||
|
||||
Some symbols are local to a specific object file. We can ignore these for the
|
||||
purposes of symbol resolution, as by definition the linker will never see them
|
||||
more than once. In ELF these are the symbols with a binding of `STB_LOCAL`.
|
||||
|
||||
In general, symbols are resolved by name: every symbol with the same name is
|
||||
the same entity. We’ve already seen a few exceptions to that general rule. A
|
||||
symbol can have a version: two symbols with the same name but different
|
||||
versions are different symbols. A symbol can have non-default visibility: a
|
||||
symbol with hidden visibility in one shared library is not the same as a symbol
|
||||
with the same name in a different shared library.
|
||||
|
||||
The characteristics of a symbol which matter for resolution are:
|
||||
|
||||
* The symbol name.
|
||||
* The symbol version.
|
||||
* Whether the symbol is the default version or not.
|
||||
* Whether the symbol is a definition or a reference or a common symbol.
|
||||
* The symbol visibility.
|
||||
* Whether the symbol is weak or strong (i.e., non-weak).
|
||||
* Whether the symbol is defined in a regular object file being included in the
|
||||
output, or in a shared library.
|
||||
* Whether the symbol is thread local.
|
||||
* Whether the symbol refers to a function or a variable.
|
||||
|
||||
The goal of symbol resolution is to determine the final value of the symbol.
|
||||
After all symbols are resolved, we should know the specific object file or
|
||||
shared library which defines the symbol, and we should know the symbol’s type,
|
||||
size, etc. It is possible that some symbols will remain undefined after all the
|
||||
symbol tables have been read; in general this is only an error if some
|
||||
relocation refers to that symbol.
|
||||
|
||||
At this point I’d like to present a simple algorithm for symbol resolution, but
|
||||
I don’t think I can. I’ll try to hit all the high points, though. Let’s assume
|
||||
that we have two symbols with the same name. Let’s call the symbol we saw first
|
||||
A and the new symbol B. (I’m going to ignore symbol visibility in the algorithm
|
||||
below; the effects of visibility should be obvious, I hope.)
|
||||
|
||||
1. If A has a version:
|
||||
* If B has a version different from A, they are actually different symbols.
|
||||
* If B has the same version as A, they are the same symbol; carry on.
|
||||
* If B does not have a version, and A is the default version of the symbol,
|
||||
they are the same symbol; carry on.
|
||||
* Otherwise B is probably a different symbol. But note that if A and B are
|
||||
both undefined references, then it is possible that A refers to the default
|
||||
version of the symbol but we don’t yet know that. In that case, if B does
|
||||
not have a version, A and B really are the same symbol. We can’t tell until
|
||||
we see the actual definition.
|
||||
2. If A does not have a version:
|
||||
* If B does not have a version, they are the same symbol; carry on.
|
||||
* If B has a version, and it is the default version, they are the same
|
||||
symbol; carry on.
|
||||
* Otherwise, B is probably a different symbol, as above.
|
||||
3. If A is thread local and B is not, or vice-versa, then we have an error.
|
||||
4. If A is an undefined reference:
|
||||
* If B is an undefined reference, then we can complete the resolution, and
|
||||
more or less ignore B.
|
||||
* If B is a definition or a common symbol, then we can resolve A to B.
|
||||
5. If A is a strong definition in an object file:
|
||||
* If B is an undefined reference, then we resolve B to A.
|
||||
* If B is a strong definition in an object file, then we have a multiple
|
||||
definition error.
|
||||
* If B is a weak definition in an object file, then A overrides B. In effect,
|
||||
B is ignored.
|
||||
* If B is a common symbol, then we treat B as an undefined reference.
|
||||
* If B is a definition in a shared library, then A overrides B. The dynamic
|
||||
linker will change all references to B in the shared library to refer to A
|
||||
instead.
|
||||
6. If A is a weak definition in an object file, we act just like the strong
|
||||
definition case, with one exception: if B is a strong definition in an
|
||||
object file. In the original SVR4 linker, this case was treated as a
|
||||
multiple definition error. In the Solaris and GNU linkers, this case is
|
||||
handled by letting B override A.
|
||||
7. If A is a common symbol in an object file:
|
||||
* If B is a common symbol, we set the size of A to be the maximum of the size
|
||||
of A and the size of B, and then treat B as an undefined reference.
|
||||
* If B is a definition in a shared library with function type, then A
|
||||
overrides B (this oddball case is required to correctly handle some Unix
|
||||
system libraries).
|
||||
* Otherwise, we treat A as an undefined reference.
|
||||
8. If A is a definition in a shared library, then if B is a definition in a
|
||||
regular object (strong or weak), it overrides A. Otherwise we act as though
|
||||
A were defined in an object file.
|
||||
9. If A is a common symbol in a shared library, we have a funny case. Symbols
|
||||
in shared libraries must have addresses, so they can’t be common in the same
|
||||
sense as symbols in an object file. But ELF does permit symbols in a shared
|
||||
library to have the type `STT_COMMON` (this is a relatively recent
|
||||
addition). For purposes of symbol resolution, if A is a common symbol in a
|
||||
shared library, we still treat it as a definition, unless B is also a common
|
||||
symbol. In the latter case, B overrides A, and the size of B is set to the
|
||||
maximum of the size of A and the size of B.
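
Steps 3 through 8 can be condensed into a short function. This is a deliberately reduced sketch: it ignores versions and visibility, it leaves out the function-in-shared-library oddball of step 7 and the shared-library common case of step 9, and it follows the GNU behaviour for step 6. The `Symbol` class and `resolve` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Symbol:
    kind: str             # "undef", "def", or "common"
    weak: bool = False
    shared: bool = False  # defined in a shared library, not a regular object
    tls: bool = False
    size: int = 0


def resolve(a, b):
    """Return the symbol that wins when A was seen first and B is new."""
    if a.tls != b.tls:
        raise ValueError("thread-local mismatch")         # step 3
    if a.kind == "undef":
        return a if b.kind == "undef" else b              # step 4
    if b.kind == "undef":
        return a                                          # B just refers to A
    if a.kind == "common":                                # step 7
        if b.kind == "common":
            return replace(a, size=max(a.size, b.size))
        return b                                          # A acts as a reference
    if b.kind == "common":
        return a                                          # step 5: B acts as a reference
    if a.shared:
        return a if b.shared else b                       # step 8
    if b.shared:
        return a                                          # step 5: object beats library
    if a.weak:
        return a if b.weak else b                         # step 6, GNU behaviour
    if b.weak:
        return a
    raise ValueError("multiple definition")               # two strong definitions
```

For example, resolving a weak definition against a later strong one returns the strong definition, while two strong definitions in regular objects raise an error.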
|
||||
|
||||
I hope I got all that right.
|
||||
|
||||
More tomorrow, assuming the Internet connection holds up.
|
||||
|
|
@ -0,0 +1,91 @@
|
|||
# Linkers part 13
|
||||
|
||||
## Symbol Versions Redux
|
||||
|
||||
I’ve talked about symbol versions from the linker’s point of view. I think it’s
|
||||
worth discussing them a bit from the user’s point of view.
|
||||
|
||||
As I’ve discussed before, symbol versions are an ELF extension designed to
|
||||
solve a specific problem: making it possible to upgrade a shared library
|
||||
without changing existing executables. That is, they provide backward
|
||||
compatibility for shared libraries. There are a number of related problems
|
||||
which symbol versions do not solve. They do not provide forward compatibility
|
||||
for shared libraries: if you upgrade your executable, you may need to upgrade
|
||||
your shared library also (it would be nice to have a feature to build your
|
||||
executable against an older version of the shared library, but that is
|
||||
difficult to implement in practice). They only work at the shared library
|
||||
interface: they do not help with a change to the ABI of a system call, which is
|
||||
at the kernel interface. They do not help with the problem of sharing
|
||||
incompatible versions of a shared library, as may happen when a complex
|
||||
application is built out of several different existing shared libraries which
|
||||
have incompatible dependencies.
|
||||
|
||||
Despite these limitations, shared library backward compatibility is an
|
||||
important issue. Using symbol versions to ensure backward compatibility
|
||||
requires a careful and rigorous approach. You must start by applying a version
|
||||
to every symbol. If a symbol in the shared library does not have a version,
|
||||
then it is impossible to change it in a backward compatible fashion. Then you
|
||||
must pay close attention to the ABI of every symbol. If the ABI of a symbol
|
||||
changes for any reason, you must provide a copy which implements the old ABI.
|
||||
That copy should be marked with the original version. The new symbol must be
|
||||
given a new version.
|
||||
|
||||
The ABI of a symbol can change in a number of ways. Any change to the parameter
|
||||
types or the return type of a function is an ABI change. Any change in the type
|
||||
of a variable is an ABI change. If a parameter or a return type is a struct or
|
||||
class, then any change in the type of any field is an ABI change–i.e., if a
|
||||
field in a struct points to another struct, and that struct changes, the ABI
|
||||
has changed. If a function is defined to return an instance of an enum, and a
|
||||
new value is added to the enum, that is an ABI change. In other words, even
|
||||
minor changes can be ABI changes. The question you need to ask is: can existing
|
||||
code which has already been compiled continue to use the new symbol with no
|
||||
change? If the answer is no, you have an ABI change, and you must define a new
|
||||
symbol version.
|
||||
|
||||
You must be very careful when writing the symbol implementing the old ABI, if
|
||||
you don’t just copy the existing code. You must be certain that it really does
|
||||
implement the old ABI.
|
||||
|
||||
There are some special challenges when using C++. Adding a new virtual method
|
||||
to a class can be an ABI change for any function which uses that class.
|
||||
Providing the backward compatible version of the class in such a situation is
|
||||
very awkward–there is no natural way to specify the name and version to use for
|
||||
the virtual table or the RTTI information for the old version.
|
||||
|
||||
Naturally, you must never delete any symbols.
|
||||
|
||||
Getting all the details correct, and verifying that you got them correct,
|
||||
requires great attention to detail. Unfortunately, I don’t know of any tools to
|
||||
help people write correct version scripts, or to verify them. Still, if
|
||||
implemented correctly, the results are good: existing executables will continue
|
||||
to run.
|
||||
|
||||
## Static Linking vs. Dynamic Linking
|
||||
|
||||
There is, of course, another way to ensure that existing executables will
|
||||
continue to run: link them statically, without using any shared libraries. That
|
||||
will limit their ABI issues to the kernel interface, which is normally
|
||||
significantly smaller than the library interface.
|
||||
|
||||
There is a performance tradeoff with static linking. A statically linked
|
||||
program does not get the benefit of sharing libraries with other programs
|
||||
executing at the same time. On the other hand, a statically linked program does
|
||||
not have to pay the performance penalty of position independent code when
|
||||
executing within the library.
|
||||
|
||||
Upgrading the shared library is only possible with dynamic linking. Such an
|
||||
upgrade can provide bug fixes and better performance. Also, the dynamic linker
|
||||
can select a version of the shared library appropriate for the specific
|
||||
platform, which can also help performance.
|
||||
|
||||
Static linking permits more reliable testing of the program. You only need to
|
||||
worry about kernel changes, not about shared library changes.
|
||||
|
||||
Some people argue that dynamic linking is always superior. I think there are
|
||||
benefits on both sides, and which choice is best depends on the specific
|
||||
circumstances.
|
||||
|
||||
More on Monday. If you think I should write about any specific linker related
|
||||
topics which have not already been mentioned in the comments, please let me
|
||||
know.
|
||||
|
|
@ -0,0 +1,92 @@
|
|||
# Linkers part 14
|
||||
|
||||
## Link Time Optimization
|
||||
|
||||
I’ve already mentioned some optimizations which are peculiar to the linker:
|
||||
relaxation and garbage collection of unwanted sections. There is another class
|
||||
of optimizations which occur at link time, but are really related to the
|
||||
compiler. The general name for these optimizations is link time optimization or
|
||||
whole program optimization.
|
||||
|
||||
The general idea is that the compiler optimization passes are run at link time.
|
||||
The advantage of running them at link time is that the compiler can then see
|
||||
the entire program. This permits the compiler to perform optimizations which
|
||||
cannot be done when source files are compiled separately. The most obvious
|
||||
such optimization is inlining functions across source files. Another is
|
||||
optimizing the calling sequence for simple functions–e.g., passing more
|
||||
parameters in registers, or knowing that the function will not clobber all
|
||||
registers; this can only be done when the compiler can see all callers of the
|
||||
function. Experience shows that these and other optimizations can bring
|
||||
significant performance benefits.
|
||||
|
||||
Generally these optimizations are implemented by having the compiler write a
|
||||
version of its intermediate representation into the object file, or into some
|
||||
parallel file. The intermediate representation will be the parsed version of
|
||||
the source file, and may already have had some local optimizations applied.
|
||||
Sometimes the object file contains only the compiler intermediate
|
||||
representation, sometimes it also contains the usual object code. In the former
|
||||
case link time optimization is required, in the latter case it is optional.
|
||||
|
||||
I know of two typical ways to implement link time optimization. The first
|
||||
approach is for the compiler to provide a pre-linker. The pre-linker examines
|
||||
the object files looking for stored intermediate representation. When it finds
|
||||
some, it runs the link time optimization passes. The second approach is for the
|
||||
linker proper to call back into the compiler when it finds intermediate
|
||||
representation. This is generally done via some sort of plugin API.
|
||||
|
||||
Although these optimizations happen at link time, they are not part of the
|
||||
linker proper, at least not as I defined it. When the compiler reads the stored
|
||||
intermediate representation, it will eventually generate an object file, one
|
||||
way or another. The linker proper will then process that object file as usual.
|
||||
These optimizations should be thought of as part of the compiler.
|
||||
|
||||
## Initialization Code
|
||||
|
||||
C++ permits global variables to have constructors and destructors. The global
|
||||
constructors must be run before main starts, and the global destructors must be
|
||||
run after exit is called. Making this work requires the compiler and the linker
|
||||
to cooperate.
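
A small sketch of the requirement itself, using only standard C++. The `Tracer` type and the `event_log` helper are hypothetical names; the point is simply that by the time any code in `main` runs, the constructor of a global has already been called.

```cpp
#include <string>
#include <vector>

// Records events. A function-local static avoids depending on the
// initialization order between globals.
std::vector<std::string>& event_log() {
  static std::vector<std::string> log;
  return log;
}

// Hypothetical type whose constructor has a visible effect.
struct Tracer {
  Tracer() { event_log().push_back("global constructor"); }
  ~Tracer() { /* runs after main returns or exit is called */ }
};

// A global with a constructor: the compiler and linker must arrange for
// this to run before main starts.
Tracer global_tracer;

// Returns true if the global constructor has already run.
bool constructor_ran_before_now() {
  return !event_log().empty() && event_log().front() == "global constructor";
}
```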
|
||||
|
||||
The a.out object file format is rarely used these days, but the GNU a.out
|
||||
linker has an interesting extension. In a.out symbols have a one byte type
|
||||
field. This encodes a bunch of debugging information, and also the section in
|
||||
which the symbol is defined. The a.out object file format only supports three
|
||||
sections–text, data, and bss. Four symbol types are defined as sets: text set,
|
||||
data set, bss set, and absolute set. A symbol with a set type is permitted to
|
||||
be defined multiple times. The GNU linker will not give a multiple definition
|
||||
error, but will instead build a table with all the values of the symbol. The
|
||||
table will start with one word holding the number of entries, and will end with
|
||||
a zero word. In the output file the set symbol will be defined as the address
|
||||
of the start of the table.
|
||||
|
||||
For each C++ global constructor, the compiler would generate a symbol named
|
||||
`__CTOR_LIST__` with the text set type. The value of the symbol in the object
|
||||
file would be the global constructor function. The linker would gather together
|
||||
all the `__CTOR_LIST__` functions into a table. The startup code supplied by
|
||||
the compiler would walk down the `__CTOR_LIST__` table and call each function.
|
||||
Global destructors were handled similarly, with the name `__DTOR_LIST__`.
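
The table layout described above can be modelled in a few lines. This is a toy sketch, not the actual startup code: the function names are hypothetical, the table is built by hand rather than by a linker, and the order in which entries are called is an arbitrary choice made for the example.

```cpp
#include <cstddef>
#include <cstdint>

using Constructor = void (*)();

// Model of the table: word 0 holds the entry count, then come the
// entries, then a terminating zero word.  Calls every entry and returns
// how many were called, or -1 if the terminator is missing.
int run_constructor_table(const std::uintptr_t* table) {
  const std::uintptr_t count = table[0];
  if (table[count + 1] != 0) return -1;
  for (std::uintptr_t i = 1; i <= count; ++i) {
    reinterpret_cast<Constructor>(table[i])();
  }
  return static_cast<int>(count);
}

// Two stand-in "global constructors" (hypothetical).
int g_calls = 0;
void ctor_a() { g_calls += 1; }
void ctor_b() { g_calls += 10; }

// Builds by hand the table a linker would have built, then runs it.
int demo_run_table() {
  const std::uintptr_t table[] = {
      2,
      reinterpret_cast<std::uintptr_t>(&ctor_a),
      reinterpret_cast<std::uintptr_t>(&ctor_b),
      0,
  };
  g_calls = 0;
  run_constructor_table(table);
  return g_calls;
}
```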
|
||||
|
||||
Anyhow, so much for a.out. In ELF, global constructors are handled in a fairly
|
||||
similar way, but without using magic symbol types. I’ll describe what gcc does.
|
||||
An object file which defines a global constructor will include a `.ctors`
|
||||
section. The compiler will arrange to link special object files at the very
|
||||
start and very end of the link. The one at the start of the link will define a
|
||||
symbol for the `.ctors` section; that symbol will wind up at the start of the
|
||||
section. The one at the end of the link will define a symbol for the end of the
|
||||
`.ctors` section. The compiler startup code will walk between the two symbols,
|
||||
calling the constructors. Global destructors work similarly, in a `.dtors`
|
||||
section.
|
||||
|
||||
ELF shared libraries work similarly. When the dynamic linker loads a shared
|
||||
library, it will call the function at the `DT_INIT` tag if there is one. By
|
||||
convention the ELF program linker will set this to the function named `_init`,
|
||||
if there is one. Similarly the `DT_FINI` tag is called when a shared library is
|
||||
unloaded, and the program linker will set this to the function named `_fini`.
|
||||
|
||||
As I mentioned earlier, there are also `DT_INIT_ARRAY`, `DT_PREINIT_ARRAY`, and
|
||||
`DT_FINI_ARRAY` tags, which are set based on the `SHT_INIT_ARRAY`,
|
||||
`SHT_PREINIT_ARRAY`, and `SHT_FINI_ARRAY` section types. This is a newer
|
||||
approach in ELF, and does not require relying on special symbol names.
|
||||
|
||||
More tomorrow.
|
||||
|
|
@ -0,0 +1,66 @@
|
|||
# Linkers part 15
|
||||
|
||||
## COMDAT sections
|
||||
|
||||
In C++ there are several constructs which do not clearly live in a single
|
||||
place. Examples are inline functions defined in a header file, virtual tables,
|
||||
and typeinfo objects. There must be only a single instance of each of these
|
||||
constructs in the final linked program (actually we could probably get away
|
||||
with multiple copies of a virtual table, but the others must be unique since it
|
||||
is possible to take their address). Unfortunately, there is not necessarily a
|
||||
single object file in which they should be generated. These types of constructs
|
||||
are sometimes described as having vague linkage.
|
||||
|
||||
Linkers implement these features by using *COMDAT* sections (there may be other
|
||||
approaches, but this is the only one I know of). COMDAT sections are a special type
|
||||
of section. Each COMDAT section has a special string. When the linker sees
|
||||
multiple COMDAT sections with the same special string, it will only keep one of
|
||||
them.
|
||||
|
||||
For example, when the C++ compiler sees an inline function `f1` defined in a
|
||||
header file, but the compiler is unable to inline the function in all uses
|
||||
(perhaps because something takes the address of the function), the compiler
|
||||
will emit `f1` in a COMDAT section associated with the string `f1`. After the
|
||||
linker sees a COMDAT section `f1`, it will discard all subsequent `f1` COMDAT
|
||||
sections.
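
The keep-the-first, discard-the-rest rule can be sketched as a toy model. Everything here is hypothetical (the `InputSection` type, the field names, the file names); a real linker works on section headers and symbol tables, not on strings in a vector.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy model of an input section.
struct InputSection {
  std::string object_file;  // which object file the section came from
  std::string comdat_key;   // the "special string"; empty means not COMDAT
};

// Keeps the first section seen for each COMDAT key, discards later ones,
// and keeps every non-COMDAT section.
std::vector<InputSection> discard_duplicate_comdat(
    const std::vector<InputSection>& inputs) {
  std::set<std::string> seen;
  std::vector<InputSection> kept;
  for (const InputSection& s : inputs) {
    if (s.comdat_key.empty() || seen.insert(s.comdat_key).second) {
      kept.push_back(s);
    }
  }
  return kept;
}

// Three objects each emitted `f1`; only the copy from a.o survives.
std::vector<InputSection> demo_comdat() {
  return discard_duplicate_comdat({
      {"a.o", "f1"}, {"b.o", "f1"}, {"b.o", ""}, {"c.o", "f1"}, {"c.o", "f2"}});
}
```

Note that the model never compares section contents, which is why two different functions that happen to share a key would go unnoticed.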
|
||||
|
||||
This obviously raises the possibility that there will be two entirely different
|
||||
inline functions named `f1`, defined in different header files. This would be
|
||||
an invalid C++ program, violating the One Definition Rule (often abbreviated
|
||||
ODR). Unfortunately, if no source file included both header files, the
|
||||
compiler would be unable to diagnose the error. And, unfortunately, the linker
|
||||
would simply discard the duplicate COMDAT sections, and would not notice the
|
||||
error either. This is an area where some improvements are needed (at least in
|
||||
the GNU tools; I don’t know whether any other tools diagnose this error
|
||||
correctly).
|
||||
|
||||
The Microsoft PE object file format provides COMDAT sections. These sections
|
||||
can be marked so that duplicate COMDAT sections which do not have identical
|
||||
contents cause an error. That is not as helpful as it seems, as different
|
||||
compiler options may cause valid duplicates to have different contents. The
|
||||
string associated with a COMDAT section is stored in the symbol table.
|
||||
|
||||
Before I learned about the Microsoft PE format, I introduced a different type
|
||||
of COMDAT sections into the GNU ELF linker, following a suggestion from Jason
|
||||
Merrill. Any section whose name starts with “.gnu.linkonce.” is a COMDAT
|
||||
section. The associated string is simply the section name itself. Thus the
|
||||
inline function `f1` would be put into the section “.gnu.linkonce.f1”. This
|
||||
simple implementation works well enough, but it has a flaw in that some
|
||||
functions require data in multiple sections; e.g., the instructions may be in
|
||||
one section and associated static data may be in another section. Since
|
||||
different instances of the inline function may be compiled differently, the
|
||||
linker cannot reliably and consistently discard duplicate data (I don’t know
|
||||
how the Microsoft linker handles this problem).
|
||||
|
||||
Recent versions of ELF introduce section groups. These implement an officially
|
||||
sanctioned version of COMDAT in ELF, and avoid the problem of “.gnu.linkonce”
|
||||
sections. I described these briefly in an earlier blog entry. A special section
|
||||
of type `SHT_GROUP` contains a list of section indices in the group. The group
|
||||
is retained or discarded as a whole. The string associated with the group is
|
||||
found in the symbol table. Putting the string in the symbol table makes it
|
||||
awkward to retrieve, but since the string is generally the name of a symbol it
|
||||
means that the string only needs to be stored once in the object file; this is
|
||||
a minor optimization for C++ in which symbol names may be very long.
|
||||
|
||||
More tomorrow.
|
||||
|
|
@ -0,0 +1,87 @@
|
|||
# Linkers part 16
|
||||
|
||||
## C++ Template Instantiation
|
||||
|
||||
There is still more C++ fun at link time, though somewhat less related to the
|
||||
linker proper. A C++ program can declare templates, and instantiate them with
|
||||
specific types. Ideally those specific instantiations will only appear once in
|
||||
a program, not once per source file which instantiates the templates. There are
|
||||
a few ways to make this work.
|
||||
|
||||
For object file formats which support COMDAT and vague linkage, which I
|
||||
described yesterday, the simplest and most reliable mechanism is for the
|
||||
compiler to generate all the template instantiations required for a source file
|
||||
and put them into the object file. They should be marked as COMDAT, so that the
|
||||
linker discards all but one copy. This ensures that all template instantiations
|
||||
will be available at link time, and that the executable will have only one
|
||||
copy. This is what gcc does by default for systems which support it. The
|
||||
obvious disadvantages are the time required to compile all the duplicate
|
||||
template instantiations and the space they take up in the object files. This is
|
||||
sometimes called the Borland model, as this is what Borland’s C++ compiler did.
|
||||
|
||||
Another approach is to not generate any of the template instantiations at
|
||||
compile time. Instead, when linking, if we need a template instantiation which
|
||||
is not found, invoke the compiler to build it. This can be done either by
|
||||
running the linker and looking for error messages or by using a linker plugin
|
||||
to handle an undefined symbol error. The difficulties with this approach are to
|
||||
find the source code to compile and to find the right options to pass to the
|
||||
compiler. Typically the source code is placed into a repository file of some
|
||||
sort at compile time, so that it is available at link time. The complexities of
|
||||
getting the compilation steps right are why this approach is not the default.
|
||||
When it works, though, it can be faster than the duplicate instantiation
|
||||
approach. This is sometimes called the Cfront model.
|
||||
|
||||
gcc also supports explicit template instantiation, which can be used to control
|
||||
exactly where templates are instantiated. This approach can work if you have
|
||||
complete control over your source code base, and can instantiate all required
|
||||
templates in some central place. This approach is used for gcc’s C++ library,
|
||||
libstdc++.
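
For readers who have not seen the syntax, here is a minimal sketch of explicit instantiation. The `square` template and the file split described in the comments are hypothetical; in a real project the declaration would sit in a header and the definition in exactly one source file. The `extern template` form is standard as of C++11.

```cpp
// --- what a header (hypothetical "square.h") would contain ---
template <typename T>
T square(T value) {
  return value * value;
}

// Explicit instantiation declaration: tells every file that includes the
// header not to instantiate square<int> implicitly.
extern template int square<int>(int);

// --- what one central source file would contain ---
// Explicit instantiation definition: the single place where the code for
// square<int> is generated.
template int square<int>(int);
```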
|
||||
|
||||
C++ defines a keyword `export` which is supposed to permit exporting template
|
||||
definitions in such a way that they can be read back in by the compiler. gcc
|
||||
does not support this keyword. If it worked, it could be a slightly more
|
||||
reliable way of using a repository when using the Cfront model.
|
||||
|
||||
## Exception Frames
|
||||
|
||||
C++ and other languages support exceptions. When an exception is thrown in one
|
||||
function and caught in another, the program needs to reset the stack pointer
|
||||
and registers to the point where the exception is caught. While resetting the
|
||||
stack pointer, the program needs to identify all local variables in the part of
|
||||
the stack being discarded, and run their destructors if any. This process is
|
||||
known as unwinding the stack.
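
The visible effect of unwinding can be shown with standard C++ alone. In this sketch the `Guard` type and the function names are hypothetical; the log records that the locals in the discarded frames are destroyed, innermost first, before the handler runs.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical local object that logs when it is destroyed.
struct Guard {
  Guard(std::vector<std::string>& log, std::string name)
      : log_(log), name_(std::move(name)) {}
  ~Guard() { log_.push_back("destroy " + name_); }
  std::vector<std::string>& log_;
  std::string name_;
};

void inner(std::vector<std::string>& log) {
  Guard g(log, "inner");
  throw std::runtime_error("boom");  // starts unwinding
}

void outer(std::vector<std::string>& log) {
  Guard g(log, "outer");
  inner(log);
}

// Throws through two frames and returns the order of events.
std::vector<std::string> unwind_demo() {
  std::vector<std::string> log;
  try {
    outer(log);
  } catch (const std::runtime_error&) {
    log.push_back("caught");
  }
  return log;
}
```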
|
||||
|
||||
The information needed to unwind the stack is normally stored in tables in the
|
||||
program. Supporting library code is used to read the tables and perform the
|
||||
necessary operations. I’m not going to describe the details of those tables
|
||||
here. However, there is a linker optimization which applies to them.
|
||||
|
||||
The support libraries need to be able to find the exception tables at runtime
|
||||
when an exception occurs. An exception can be thrown in one shared library and
|
||||
caught in a different shared library, so finding all the required exception
|
||||
tables can be a nontrivial operation. One approach that can be used is to
|
||||
register the exception tables at program startup time or shared library load
|
||||
time. The registration can be done at the right time using the global
|
||||
constructor mechanism.
|
||||
|
||||
However, this approach imposes a runtime cost for exceptions, in that it takes
|
||||
longer for the program to start. Therefore, this is not ideal. The linker can
|
||||
optimize this by building tables which can be used to find the exception
|
||||
tables. The tables built by the GNU linker are sorted for fast lookup by the
|
||||
runtime library. The tables are put into a `PT_GNU_EH_FRAME` segment. The
|
||||
supporting libraries then need a way to look up a segment of this type. This is
|
||||
done via the `dl_iterate_phdr` API provided by the GNU dynamic linker.
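
As a rough illustration of that lookup, the sketch below counts the `PT_GNU_EH_FRAME` segments visible in the running process. It assumes Linux with glibc, where `<link.h>` declares `dl_iterate_phdr`; the helper names are hypothetical, and a real unwinder would go on to parse the segment rather than just count it.

```cpp
#include <link.h>  // dl_iterate_phdr, dl_phdr_info, PT_GNU_EH_FRAME

#include <cstddef>

// Callback invoked once per loaded object (the executable and each shared
// library).  Counts the PT_GNU_EH_FRAME segments it sees.
static int count_eh_frame_cb(dl_phdr_info* info, std::size_t, void* data) {
  int* count = static_cast<int*>(data);
  for (int i = 0; i < info->dlpi_phnum; ++i) {
    if (info->dlpi_phdr[i].p_type == PT_GNU_EH_FRAME) {
      ++*count;
    }
  }
  return 0;  // zero means "keep iterating"
}

// Returns the number of PT_GNU_EH_FRAME segments in the running process.
int count_eh_frame_segments() {
  int count = 0;
  dl_iterate_phdr(count_eh_frame_cb, &count);
  return count;
}
```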
|
||||
|
||||
Note that if the compiler believes that the linker will generate a
|
||||
`PT_GNU_EH_FRAME` segment, it won’t generate the startup code to register the
|
||||
exception tables. Thus the linker must not fail to create this segment.
|
||||
|
||||
Since the GNU linker needs to look at the exception tables in order to generate
|
||||
the `PT_GNU_EH_FRAME` segment, it will also optimize by discarding duplicate
|
||||
exception table information.
|
||||
|
||||
I know this section is rather short on details. I hope the general idea is
|
||||
clear.
|
||||
|
||||
More tomorrow.
|
||||
|
|
@ -0,0 +1,29 @@
|
|||
# Linkers part 17
|
||||
|
||||
## Warning Symbols
|
||||
|
||||
The GNU linker supports a weird extension to ELF used to issue warnings when
|
||||
symbols are referenced at link time. This was originally implemented for a.out
|
||||
using a special symbol type. For ELF, I implemented it using a special section
|
||||
name.
|
||||
|
||||
If you create a section named `.gnu.warning.SYMBOL`, then if and when the
|
||||
linker sees an undefined reference to `SYMBOL`, it will issue a warning. The
|
||||
warning is triggered by seeing an undefined symbol with the right name in an
|
||||
object file. Unlike the warning about an undefined symbol, it is not triggered
|
||||
by seeing a relocation entry. The text of the warning is simply the contents of
|
||||
the `.gnu.warning.SYMBOL` section.
|
||||
|
||||
The GNU C library uses this feature to warn about references to symbols like
|
||||
`gets` which are required by standards but are generally considered to be
|
||||
unsafe. This is done by creating a section named `.gnu.warning.gets` in the
|
||||
same object file which defines `gets`.
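
A sketch of how such a section might be created with GCC or Clang on an ELF target, using the `section` attribute. The function `legacy_sum`, the suggested replacement `checked_sum`, and the message text are all hypothetical, and this is not how the GNU C library spells it internally. The warning would only appear when another object file in the link refers to `legacy_sum`.

```cpp
// A function we would like callers to stop using (hypothetical name).
extern "C" int legacy_sum(int a, int b) { return a + b; }

// The contents of this section are the warning text.  The `used`
// attribute keeps the compiler from dropping the otherwise unreferenced
// array.
static const char legacy_sum_warning[]
    __attribute__((used, section(".gnu.warning.legacy_sum"))) =
        "legacy_sum is unsafe; consider checked_sum instead";
```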
|
||||
|
||||
The GNU linker also supports another type of warning, triggered by sections
|
||||
named `.gnu.warning` (without the symbol name). If an object file with a
|
||||
section of that name is included in the link, the linker will issue a warning.
|
||||
Again, the text of the warning is simply the contents of the `.gnu.warning`
|
||||
section. I don’t know if anybody actually uses this feature.
|
||||
|
||||
Short entry today, more tomorrow.
|
||||
|
|
@ -0,0 +1,53 @@
|
|||
# Linkers part 18
|
||||
|
||||
## Incremental Linking
|
||||
|
||||
Often a programmer will change a single source file, then recompile and
|
||||
relink the application. A standard linker will need to read all the input
|
||||
objects and libraries in order to regenerate the executable with the change.
|
||||
For a large application, this is a lot of work. If only one input object file
|
||||
changed, it is a lot more work than really needs to be done. One solution is to
|
||||
use an incremental linker. An incremental linker makes incremental changes to
|
||||
an existing executable or shared library, rather than rebuilding them from
|
||||
scratch.
|
||||
|
||||
I’ve never actually written or worked on an incremental linker, but the general
|
||||
idea is straightforward enough. When the linker writes the output file, it must
|
||||
attach additional information.
|
||||
|
||||
* The linker must create a mapping of object files to areas in the output file,
|
||||
so that an incremental link will know what to remove when replacing an object
|
||||
file.
|
||||
* The linker must retain all the relocations for each input object which refer
|
||||
to symbols defined in other objects, so that it can reprocess them when
|
||||
symbols change. The linker should store the relocations mapped by symbol, so
|
||||
that it can quickly find the relevant relocations.
|
||||
* The linker should leave extra space in the text and data segments, to allow
|
||||
for object files to grow to a limited extent without requiring rewriting the
|
||||
whole executable. It must keep a map of where this extra space is, as it will
|
||||
tend to move over the course of successive incremental links.
|
||||
* The linker should keep a list of object file timestamps in the output file,
|
||||
so that it can quickly determine which objects have changed.
|
||||
|
||||
With this information, the linker can identify which object files have changed
|
||||
since the last time the output file was linked, and replace them in the
|
||||
existing output file. When an object file changes, the linker can identify all
|
||||
the relocations which refer to symbols defined in the object file, and
|
||||
reprocess them.
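
The bookkeeping listed above might be organized along these lines. This is a sketch under stated assumptions, not a description of any existing linker: every type and field name is hypothetical, and timestamps are plain integers.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical record stored in the output file for each input object.
struct ObjectRecord {
  std::uint64_t output_offset;  // where the object's contents were placed
  std::uint64_t size;           // bytes currently used
  std::uint64_t capacity;       // bytes reserved, including extra space
  std::int64_t timestamp;       // modification time at the last link
};

struct IncrementalState {
  std::map<std::string, ObjectRecord> objects;
  // Symbol name -> output offsets that must be re-relocated when the
  // symbol's value changes.
  std::map<std::string, std::vector<std::uint64_t>> relocs_by_symbol;
};

// Objects whose timestamp differs from the one recorded at the last link,
// plus objects that were not part of the last link at all.
std::vector<std::string> changed_objects(
    const IncrementalState& state,
    const std::map<std::string, std::int64_t>& current_timestamps) {
  std::vector<std::string> changed;
  for (const auto& entry : current_timestamps) {
    auto it = state.objects.find(entry.first);
    if (it == state.objects.end() || it->second.timestamp != entry.second) {
      changed.push_back(entry.first);
    }
  }
  return changed;
}

// True if the rebuilt object still fits in the space reserved for it.
bool fits_in_place(const ObjectRecord& record, std::uint64_t new_size) {
  return new_size <= record.capacity;
}
```

When `fits_in_place` returns false, the linker is in the situation described next: it needs either a new segment or a full relink.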
|
||||
|
||||
When an object file gets too large to fit in the available space in a text or
|
||||
data segment, then the linker has the option of creating additional text or
|
||||
data segments at different addresses. This requires some care to ensure that
|
||||
the new code does not collide with the heap, depending upon how the local
|
||||
malloc implementation works. Alternatively, the incremental linker could fall
|
||||
back on doing a full link, and allocating more space again.
|
||||
|
||||
Incremental linking can greatly speed up the edit/compile/debug cycle.
|
||||
Unfortunately it is not implemented in most common linkers. Of course an
|
||||
incremental link is not equivalent to a final link, and in particular some
|
||||
linker optimizations are difficult to implement while acting incrementally. An
|
||||
incremental link is really only suitable for use during the development cycle,
|
||||
which is of course the time when the speed of the linker is most important.
|
||||
|
||||
More on Monday.
|
||||
|
|
@ -0,0 +1,139 @@
|
|||
# Linkers part 19
|
||||
|
||||
I’ve pretty much run out of linker topics. Unless I think of something new, I’ll make tomorrow’s post be the last one, for a total of 20.
|
||||
|
||||
## `__start` and `__stop` Symbols
|
||||
|
||||
A quick note about another GNU linker extension. If the linker sees a section
|
||||
in the output file which can be part of a C variable name–the name contains
|
||||
only alphanumeric characters or underscore–the linker will automatically define
|
||||
symbols marking the start and stop of the section. Note that this is not true
|
||||
of most section names, as by convention most section names start with a period.
|
||||
But the name of a section can be any string; it doesn’t have to start with a
|
||||
period. And when that happens for section `NAME`, the GNU linker will define
|
||||
the symbols `__start_NAME` and `__stop_NAME` to the address of the beginning
|
||||
and the end of the section, respectively.
|
||||
|
||||
This is convenient for collecting some information in several different object
|
||||
files, and then referring to it in the code. For example, the GNU C library
|
||||
uses this to keep a list of functions which may be called to free memory. The
|
||||
`__start` and `__stop` symbols are used to walk through the list.
|
||||
|
||||
In C code, these symbols should be declared as something like
`extern char __start_NAME[]`. For an extern array the value of the symbol and the value of
|
||||
the variable are the same.
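
Here is a minimal sketch of the pattern, assuming GCC or Clang on an ELF target with a linker that implements this extension. The section name `demo_hooks` and all function names are hypothetical. The entries are declared with a pointer type rather than `char` so the loop can step from one entry to the next.

```cpp
using Hook = int (*)();

// Defined by the linker, not by any source file.
extern "C" {
extern Hook __start_demo_hooks[];
extern Hook __stop_demo_hooks[];
}

static int hook_one() { return 1; }
static int hook_two() { return 2; }

// Each entry could live in a different object file; the linker places
// them next to each other in the output section.
static Hook entry_one __attribute__((used, section("demo_hooks"))) = hook_one;
static Hook entry_two __attribute__((used, section("demo_hooks"))) = hook_two;

// Walks the section from start to stop and sums the hook results.
int run_demo_hooks() {
  int total = 0;
  for (Hook* h = __start_demo_hooks; h != __stop_demo_hooks; ++h) {
    total += (*h)();
  }
  return total;
}
```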
|
||||
|
||||
## Byte Swapping
|
||||
|
||||
The new linker I am working on, gold, is written in C++. One of the attractions
|
||||
was to use template specialization to do efficient byte swapping. Any linker
|
||||
which can be used in a cross-compiler needs to be able to swap bytes when
|
||||
writing them out, in order to generate code for a big-endian system while
|
||||
running on a little-endian system, or vice-versa. The GNU linker always stores
|
||||
data into memory a byte at a time, which is unnecessary for a native linker.
|
||||
Measurements from a few years ago showed that this took about 5% of the
|
||||
linker’s CPU time. Since the native linker is by far the most common case, it
|
||||
is worth avoiding this penalty.
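
The two ways of reading a value can be put side by side. This sketch is not taken from either linker; the function names are hypothetical, and `__BYTE_ORDER__` and `__builtin_bswap16` are GCC and Clang extensions.

```cpp
#include <cstdint>
#include <cstring>

// Portable approach: assemble the value one byte at a time.  Works on any
// host, at the cost of extra work when host and target already agree.
std::uint16_t read_be16_bytewise(const unsigned char* p) {
  return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Whole-word approach: load the value in one go and swap only when the
// host byte order differs from the byte order of the data.
std::uint16_t read_be16_wordwise(const unsigned char* p) {
  std::uint16_t v;
  std::memcpy(&v, p, sizeof v);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  v = __builtin_bswap16(v);
#endif
  return v;
}
```

On a big-endian host the second function compiles down to a plain load, which is the saving a native linker is after.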
|
||||
|
||||
In C++, this can be done using templates and template specialization. The idea
|
||||
is to write a template for writing out the data. Then provide two
|
||||
specializations of the template, one for a linker of the same endianness and
|
||||
one for a linker of the opposite endianness. Then pick the one to use at
|
||||
compile time. The code looks like this; I’m only showing the 16-bit case for
|
||||
simplicity.
|
||||
|
||||
```cpp
// glibc headers: bswap_16, __BYTE_ORDER / __BIG_ENDIAN, and uint16_t.
#include <byteswap.h>
#include <endian.h>
#include <stdint.h>

// Endian simply indicates whether the host is big endian or not.

struct Endian
{
 public:
  // Used for template specializations.
  static const bool host_big_endian = __BYTE_ORDER == __BIG_ENDIAN;
};

// Valtype_base is a template based on size (8, 16, 32, 64) which
// defines the type Valtype as the unsigned integer of the specified
// size.

template<int size>
struct Valtype_base;

template<>
struct Valtype_base<16>
{
  typedef uint16_t Valtype;
};

// Convert_endian is a template based on size and on whether the host
// and target have the same endianness. It defines the type Valtype
// as Valtype_base does, and also defines a function convert_host
// which takes an argument of type Valtype and returns the same value,
// but swapped if the host and target have different endianness.

template<int size, bool same_endian>
struct Convert_endian;

template<int size>
struct Convert_endian<size, true>
{
  typedef typename Valtype_base<size>::Valtype Valtype;

  static inline Valtype
  convert_host(Valtype v)
  { return v; }
};

template<>
struct Convert_endian<16, false>
{
  typedef Valtype_base<16>::Valtype Valtype;

  static inline Valtype
  convert_host(Valtype v)
  { return bswap_16(v); }
};

// Convert is a template based on size and on whether the target is
// big endian. It defines Valtype and convert_host like
// Convert_endian. That is, it is just like Convert_endian except in
// the meaning of the second template parameter.

template<int size, bool big_endian>
struct Convert
{
  typedef typename Valtype_base<size>::Valtype Valtype;

  static inline Valtype
  convert_host(Valtype v)
  {
    return Convert_endian<size, big_endian == Endian::host_big_endian>
      ::convert_host(v);
  }
};

// Swap is a template based on size and on whether the target is big
// endian. It defines the type Valtype and the functions readval and
// writeval. The functions read and write values of the appropriate
// size out of buffers, swapping them if necessary.

template<int size, bool big_endian>
struct Swap
{
  typedef typename Valtype_base<size>::Valtype Valtype;

  static inline Valtype
  readval(const Valtype* wv)
  { return Convert<size, big_endian>::convert_host(*wv); }

  static inline void
  writeval(Valtype* wv, Valtype v)
  { *wv = Convert<size, big_endian>::convert_host(v); }
};
```
|
||||
|
||||
Now, for example, the linker reads a 16-bit big-endian value using
|
||||
`Swap<16,true>::readval`. This works because the linker always knows how much
|
||||
data to swap in, and it always knows whether it is reading big- or
|
||||
little-endian data.
|
||||
|
|
@ -0,0 +1,107 @@
|
|||
# Linkers part 2
|
||||
|
||||
I’m back, and I’m still doing the linker technical introduction.
|
||||
|
||||
Shared libraries were invented as an optimization for virtual memory systems
|
||||
running many processes simultaneously. People noticed that there is a set of
|
||||
basic functions which appear in almost every program. Before shared libraries,
|
||||
in a system which runs multiple processes simultaneously, that meant that
|
||||
almost every process had a copy of exactly the same code. This suggested that
|
||||
on a virtual memory system it would be possible to arrange that code so that a
|
||||
single copy could be shared by every process using it. The virtual memory
|
||||
system would be used to map the single copy into the address space of each
|
||||
process which needed it. This would require less physical memory to run
|
||||
multiple programs, and thus yield better performance.
|
||||
|
||||
I believe the first implementation of shared libraries was on SVR3, based on
|
||||
COFF. This implementation was simple, and basically assigned each shared
|
||||
library a fixed portion of the virtual address space. This did not require any
|
||||
significant changes to the linker. However, requiring each shared library to
|
||||
reserve an appropriate portion of the virtual address space was inconvenient.
|
||||
|
||||
SunOS4 introduced a more flexible version of shared libraries, which was later
|
||||
picked up by SVR4. This implementation postponed some of the operation of the
|
||||
linker to runtime. When the program started, it would automatically run a
|
||||
limited version of the linker which would link the program proper with the
|
||||
shared libraries. The version of the linker which runs when the program starts
|
||||
is known as the dynamic linker. When it is necessary to distinguish them, I
|
||||
will refer to the version of the linker which creates the program as the
|
||||
program linker. This type of shared libraries was a significant change to the
|
||||
traditional program linker: it now had to build linking information which could
|
||||
be used efficiently at runtime by the dynamic linker.
|
||||
|
||||
That is the end of the introduction. You should now understand the basics of
|
||||
what a linker does. I will now turn to how it does it.
|
||||
|
||||
## Basic Linker Data Types
|
||||
|
||||
The linker operates on a small number of basic data types: symbols,
|
||||
relocations, and contents. These are defined in the input object files. Here is
|
||||
an overview of each of these.
|
||||
|
||||
A symbol is basically a name and a value. Many symbols represent static objects
|
||||
in the original source code–that is, objects which exist in a single place for
|
||||
the duration of the program. For example, in an object file generated from C
|
||||
code, there will be a symbol for each function and for each global and static
|
||||
variable. The value of such a symbol is simply an offset into the contents.
|
||||
This type of symbol is known as a defined symbol. It’s important not to confuse
|
||||
the value of the symbol representing the variable `my_global_var` with the
|
||||
value of `my_global_var` itself. The value of the symbol is roughly the address
|
||||
of the variable: the value you would get from the expression
|
||||
`&my_global_var` in C.
|
||||
|
||||
Symbols are also used to indicate a reference to a name defined in a different
|
||||
object file. Such a reference is known as an undefined symbol. There are other
|
||||
less commonly used types of symbols which I will describe later.
|
||||
|
||||
During the linking process, the linker will assign an address to each defined
|
||||
symbol, and will resolve each undefined symbol by finding a defined symbol with
|
||||
the same name.
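
Those two steps can be sketched with a simple map. The `Symbol` type and both functions are hypothetical simplifications: real symbols carry a section, a binding, and more, and the base address here stands in for whatever placement the linker has chosen.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Simplified view of a symbol read from an object file.
struct Symbol {
  std::string name;
  bool defined;         // false for an undefined (reference-only) symbol
  std::uint64_t value;  // offset into the contents when defined
};

// Assigns a final address to each defined symbol.
std::map<std::string, std::uint64_t> assign_addresses(
    const std::vector<Symbol>& symbols, std::uint64_t base) {
  std::map<std::string, std::uint64_t> table;
  for (const Symbol& s : symbols) {
    if (s.defined) table[s.name] = base + s.value;
  }
  return table;
}

// Names of undefined symbols for which no definition was found.
std::vector<std::string> unresolved(
    const std::vector<Symbol>& symbols,
    const std::map<std::string, std::uint64_t>& table) {
  std::vector<std::string> missing;
  for (const Symbol& s : symbols) {
    if (!s.defined && table.count(s.name) == 0) missing.push_back(s.name);
  }
  return missing;
}
```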
|
||||
|
||||
A relocation is a computation to perform on the contents. Most relocations
|
||||
refer to a symbol and to an offset within the contents. Many relocations will
|
||||
also provide an additional operand, known as the addend. A simple, and commonly
|
||||
used, relocation is “set this location in the contents to the value of this
|
||||
symbol plus this addend.” The types of computations that relocations do are
|
||||
inherently dependent on the architecture of the processor for which the linker
|
||||
is generating code. For example, RISC processors which require two or more
|
||||
instructions to form a memory address will have separate relocations to be
|
||||
used with each of those instructions; for example, “set this location in the
|
||||
contents to the lower 16 bits of the value of this symbol.”
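
The two example relocations can be sketched as follows. This is only an illustration under stated assumptions: the function names are made up, the target is assumed to be little-endian, and the 16-bit case writes a plain halfword rather than a field inside a real instruction encoding.

```python
import struct

def apply_abs32(contents, offset, symbol_value, addend):
    """Store (symbol value + addend) as a 32-bit little-endian word."""
    value = (symbol_value + addend) & 0xFFFFFFFF
    struct.pack_into("<I", contents, offset, value)

def apply_lo16(contents, offset, symbol_value, addend):
    """Store only the lower 16 bits of (symbol value + addend)."""
    value = (symbol_value + addend) & 0xFFFF
    struct.pack_into("<H", contents, offset, value)

contents = bytearray(8)
apply_abs32(contents, 0, 0x08049F00, 4)   # full address at offset 0
apply_lo16(contents, 4, 0x08049F00, 4)    # low half only at offset 4
```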
|
||||
|
||||
During the linking process, the linker will perform all of the relocation
|
||||
computations as directed. A relocation in an object file may refer to an
|
||||
undefined symbol. If the linker is unable to resolve that symbol, it will
|
||||
normally issue an error (but not always: for some symbol types or some
|
||||
relocation types an error may not be appropriate).
|
||||
|
||||
The contents are what memory should look like during the execution of the
|
||||
program. Contents have a size, an array of bytes, and a type. They contain the
|
||||
machine code generated by the compiler and assembler (known as text). They
|
||||
contain the values of initialized variables (data). They contain static
|
||||
unnamed data like string constants and switch tables (read-only data or rdata).
|
||||
They contain uninitialized variables, in which case the array of bytes is
|
||||
generally omitted and assumed to contain only zeroes (bss). The compiler and
|
||||
the assembler work hard to generate exactly the right contents, but the linker
|
||||
really doesn’t care about them except as raw data. The linker reads the
|
||||
contents from each file, concatenates them all together sorted by type,
|
||||
applies the relocations, and writes the result into the executable file.
|
||||
|
||||
## Basic Linker Operation
|
||||
|
||||
At this point we already know enough to understand the basic steps used by
|
||||
every linker.
|
||||
|
||||
* Read the input object files. Determine the length and type of the contents.
|
||||
Read the symbols.
|
||||
* Build a symbol table containing all the symbols, linking undefined symbols to
|
||||
their definitions.
|
||||
* Decide where all the contents should go in the output executable file, which
|
||||
means deciding where they should go in memory when the program runs.
|
||||
* Read the contents data and the relocations. Apply the relocations to the
|
||||
contents. Write the result to the output file.
|
||||
* Optionally write out the complete symbol table with the final values of the
|
||||
symbols.
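
The steps above can be sketched as a very small linker. Everything here is a simplification of my own choosing: the object files are hypothetical Python dictionaries rather than a real file format, there is a single relocation type (symbol plus addend, 32-bit little-endian), and the base address `0x1000` is arbitrary.

```python
import struct

# Hypothetical "object files": contents keyed by type, symbols as
# name -> (type, offset), relocations as (type, offset, symbol, addend).
objects = [
    {"contents": {"text": bytearray(8), "data": bytearray(4)},
     "symbols": {"main": ("text", 0)},
     "relocs": [("text", 4, "counter", 0)]},
    {"contents": {"text": bytearray(4), "data": bytearray(b"\x2a\x00\x00\x00")},
     "symbols": {"counter": ("data", 0)},
     "relocs": []},
]

def link(objects, base=0x1000):
    # Measure the contents and decide where each piece goes, grouping
    # pieces of the same type together.
    placement, address = {}, base
    for kind in ("text", "data"):
        for index, obj in enumerate(objects):
            placement[(index, kind)] = address
            address += len(obj["contents"][kind])
    # Build the symbol table with final addresses.
    symtab = {}
    for index, obj in enumerate(objects):
        for name, (kind, offset) in obj["symbols"].items():
            symtab[name] = placement[(index, kind)] + offset
    # Apply the relocations; an unresolved name raises KeyError here.
    for obj in objects:
        for kind, offset, name, addend in obj["relocs"]:
            struct.pack_into("<I", obj["contents"][kind], offset,
                             symtab[name] + addend)
    # Write the result: concatenate the contents in placement order.
    image = bytearray()
    for kind in ("text", "data"):
        for obj in objects:
            image += obj["contents"][kind]
    return bytes(image), symtab

image, symtab = link(objects)
```

Returning `symtab` alongside the image corresponds to the optional last step of writing out the final symbol values.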
|
||||
|
||||
More tomorrow.
|
||||
|
|
@ -0,0 +1,34 @@
|
|||
# Linkers part 20
|
||||
|
||||
This will be my last blog posting on linkers for the time being. Tomorrow my
|
||||
blog will return to its usual trivialities. People who are specifically
|
||||
interested in linker information are warned to stop reading with this post.
|
||||
|
||||
I’ll close the series with a short update on gold, the new linker I’ve been
|
||||
working on. It currently (September 25, 2007) can create executables. It can
|
||||
not create shared libraries or relocatable objects. It has very limited
|
||||
support for linker scripts–enough to read `/usr/lib/libc.so` on a GNU/Linux
|
||||
system. It doesn’t have any interesting new features at this point. It only
|
||||
supports x86. The focus to date has been entirely on speed. It is written to be
|
||||
multi-threaded, but the threading support has not been hooked in yet.
|
||||
|
||||
By way of example, when linking a 900M C++ executable, the GNU linker (version
|
||||
2.16.91 20060118 on an Ubuntu based system) took 700 seconds of user time, 24
|
||||
seconds of system time, and 16 minutes of wall time. gold took 7 seconds of
|
||||
user time, 3 seconds of system time, and 30 seconds of wall time. So while I
|
||||
can’t promise that it will stay as fast as all features are added, it’s in a
|
||||
pretty good position at the moment.
|
||||
|
||||
I’m the main developer on gold, but I’m not the only person working on it. A
|
||||
few other people are also making improvements.
|
||||
|
||||
The goal is to release gold as a free program, ideally as part of the GNU
|
||||
binutils. I want it to be more nearly feature complete before doing this,
|
||||
though. It needs to at least support `-shared` and `-r`. I doubt gold will ever
|
||||
support all of the features of the GNU linker. I doubt it will ever support the
|
||||
full GNU linker script language, although I do plan to support enough to link
|
||||
the Linux kernel.
|
||||
|
||||
Future plans for gold, once it actually works, include incremental linking and
|
||||
more far-reaching speed improvements.
|
||||
|
|
@ -0,0 +1,90 @@
|
|||
# Linkers part 3
|
||||
|
||||
Continuing notes on linkers.
|
||||
|
||||
## Address Spaces
|
||||
|
||||
An address space is simply a view of memory, in which each byte has an address.
|
||||
The linker deals with three distinct types of address space.
|
||||
|
||||
Every input object file is a small address space: the contents have addresses,
|
||||
and the symbols and relocations refer to the contents by addresses.
|
||||
|
||||
The output program will be placed at some location in memory when it runs.
|
||||
This is the output address space, which I generally refer to as using virtual
|
||||
memory addresses.
|
||||
|
||||
The output program will be loaded at some location in memory. This is the load
|
||||
memory address. On typical Unix systems virtual memory addresses and load
|
||||
memory addresses are the same. On embedded systems they are often different;
|
||||
for example, the initialized data (the initial contents of global or static
|
||||
variables) may be loaded into ROM at the load memory address, and then copied
|
||||
into RAM at the virtual memory address.
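
A minimal sketch of the embedded case, assuming made-up addresses and a dictionary standing in for the address space (none of these names or numbers come from a real system):

```python
# Hypothetical addresses chosen for illustration only.
DATA_LMA = 0x0800_4000   # where the initial values are stored in ROM
DATA_VMA = 0x2000_0000   # where the program expects the variables in RAM
DATA_SIZE = 8

memory = {}  # sparse model of the address space: address -> byte
for i, byte in enumerate(b"\x01\x02\x03\x04\x05\x06\x07\x08"):
    memory[DATA_LMA + i] = byte  # initial values sit at the load address

def copy_data_to_ram(memory):
    """Model of startup code copying initialized data from LMA to VMA."""
    for i in range(DATA_SIZE):
        memory[DATA_VMA + i] = memory[DATA_LMA + i]

copy_data_to_ram(memory)
```

All symbols for the variables carry addresses in the `DATA_VMA` range; only the copy loop needs to know about `DATA_LMA`.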
|
||||
|
||||
Shared libraries can normally be run at different virtual memory addresses in
|
||||
different processes. A shared library has a base address when it is created;
|
||||
this is often simply zero. When the dynamic linker copies the shared library
|
||||
into the virtual memory space of a process, it must apply relocations to
|
||||
adjust the shared library to run at its virtual memory address. Shared library
|
||||
systems minimize the number of relocations which must be applied, since they
|
||||
take time when starting the program.
|
||||
|
||||
## Object File Formats
|
||||
|
||||
As I said above, an assembler turns human readable assembly language into an
|
||||
object file. An object file is a binary data file written in a format designed
|
||||
as input to the linker. The linker generates an executable file. This
|
||||
executable file is a binary data file written in a format designed as input for
|
||||
the operating system or the loader (this is true even when linking dynamically,
|
||||
as normally the operating system loads the executable before invoking the
|
||||
dynamic linker to begin running the program). There is no logical requirement
|
||||
that the object file format resemble the executable file format. However,
|
||||
in practice they are normally very similar.
|
||||
|
||||
Most object file formats define sections. A section typically holds memory
|
||||
contents, or it may be used to hold other types of data. Sections generally
|
||||
have a name, a type, a size, an address, and an associated array of data.
|
||||
|
||||
Object file formats may be classed in two general types: record oriented and
|
||||
section oriented.
|
||||
|
||||
A record oriented object file format defines a series of records of varying
|
||||
size. Each record starts with some special code, and may be followed by data.
|
||||
Reading the object file requires reading it from the beginning and processing
|
||||
each record. Records are used to describe symbols and sections. Relocations may
|
||||
be associated with sections or may be specified by other records. IEEE-695
|
||||
and Mach-O are record oriented object file formats used today.
|
||||
|
||||
In a section oriented object file format the file header describes a section
|
||||
table with a specified number of sections. Symbols may appear in a separate
|
||||
part of the object file described by the file header, or they may appear in a
|
||||
special section. Relocations may be attached to sections, or they may appear in
|
||||
separate sections. The object file may be read by reading the section table,
|
||||
and then reading specific sections directly. ELF, COFF, PE, and a.out are
|
||||
section oriented object file formats.
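
To make the difference concrete, here is a sketch of reading each style. Both layouts are toy formats invented for this example (a tag/length/data record stream, and a header followed by a table of id/offset/size entries); they are not IEEE-695, Mach-O, ELF, or any other real format.

```python
import struct

def read_records(blob):
    """Record oriented: walk from the start; each record is tag, length, data."""
    records, pos = [], 0
    while pos < len(blob):
        tag, length = struct.unpack_from("<BH", blob, pos)
        pos += 3
        records.append((tag, blob[pos:pos + length]))
        pos += length
    return records

def read_section(blob, wanted):
    """Section oriented: the header gives the table size; entries give offsets."""
    (count,) = struct.unpack_from("<H", blob, 0)
    for index in range(count):
        ident, offset, size = struct.unpack_from("<HII", blob, 2 + index * 10)
        if ident == wanted:
            return blob[offset:offset + size]  # jump straight to the data
    return None

record_file = struct.pack("<BH", 1, 3) + b"abc" + struct.pack("<BH", 2, 2) + b"xy"
section_file = (struct.pack("<H", 2) + struct.pack("<HII", 7, 22, 3)
                + struct.pack("<HII", 9, 25, 2) + b"abc" + b"xy")
```

The record reader has to touch every byte before it reaches the last record, while the section reader only looks at the table and the one section it wants.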
|
||||
|
||||
Every object file format needs to be able to represent debugging information.
|
||||
Debugging information is generated by the compiler and read by the debugger.
|
||||
In general the linker can just treat it like any other type of data. However,
|
||||
in practice the debugging information for a program can be larger than the
|
||||
actual program itself. The linker can use various techniques to reduce the
|
||||
amount of debugging information, thus reducing the size of the executable.
|
||||
This can speed up the link, but requires the linker to understand the
|
||||
debugging information.
|
||||
|
||||
The a.out object file format stores debugging information using special strings
|
||||
in the symbol table, known as stabs. These special strings are simply the names
|
||||
of symbols with a special type. This technique is also used by some variants of
|
||||
ECOFF, and by older versions of Mach-O.
|
||||
|
||||
The COFF object file format stores debugging information using special fields
|
||||
in the symbol table. This type information is limited, and is completely
|
||||
inadequate for C++. A common technique to work around these limitations is to
|
||||
embed stabs strings in a COFF section.
|
||||
|
||||
The ELF object file format stores debugging information in sections with
|
||||
special names. The debugging information can be stabs strings or the DWARF
|
||||
debugging format.
|
||||
|
||||
More next week.
|
||||
|
|
@ -0,0 +1,177 @@
|
|||
# Linkers part 4
|
||||
|
||||
## Shared Libraries
|
||||
|
||||
We’ve talked a bit about what object files and executables look like, so what
|
||||
do shared libraries look like? I’m going to focus on ELF shared libraries as
|
||||
used in SVR4 (and GNU/Linux, etc.), as they are the most flexible shared
|
||||
library implementation and the one I know best.
|
||||
|
||||
Windows shared libraries, known as DLLs, are less flexible in that you have to
|
||||
compile code differently depending on whether it will go into a shared library
|
||||
or not. You also have to express symbol visibility in the source code. This is
|
||||
not inherently bad, and indeed ELF has picked up some of these ideas over time,
|
||||
but the ELF format makes more decisions at link time and is thus more powerful.
|
||||
|
||||
When the program linker creates a shared library, it does not yet know which
|
||||
virtual address that shared library will run at. In fact, in different
|
||||
processes, the same shared library will run at different addresses, depending on
|
||||
the decisions made by the dynamic linker. This means that shared library code
|
||||
must be position independent. More precisely, it must be position independent
|
||||
after the dynamic linker has finished loading it. It is always possible for the
|
||||
dynamic linker to convert any piece of code to run at any virtual address,
|
||||
given sufficient relocation information. However, performing the reloc
|
||||
computations must be done every time the program starts, implying that it will
|
||||
start more slowly. Therefore, any shared library system seeks to generate
|
||||
position independent code which requires a minimal number of relocations to be
|
||||
applied at runtime, while still running at close to the runtime efficiency of
|
||||
position dependent code.
|
||||
|
||||
An additional complexity is that ELF shared libraries were designed to be
|
||||
roughly equivalent to ordinary archives. This means that by default the main
|
||||
executable may override symbols in the shared library, such that references in
|
||||
the shared library will call the definition in the executable, even if the
|
||||
shared library also defines that same symbol. For example, an executable may
|
||||
define its own version of `malloc`. The C library also defines `malloc`, and
|
||||
the C library contains code which calls `malloc`. If the executable defines
|
||||
`malloc` itself, it will override the function in the C library. When some
|
||||
other function in the C library calls `malloc`, it will call the definition in
|
||||
the executable, not the definition in the C library.
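
The `malloc` example can be modelled as a lookup over a search order. This is a simplified sketch that assumes the executable is consulted before any shared library; the module names and the `lookup` helper are hypothetical.

```python
# Each loaded module exports some global symbols. In this simplified model
# the executable is searched first, then the shared libraries in load order.
search_order = [
    ("app",     {"main": "app:main", "malloc": "app:malloc"}),
    ("libc.so", {"malloc": "libc:malloc", "calloc": "libc:calloc"}),
]

def lookup(name):
    """Return the first definition found in the global search order."""
    for _module, exports in search_order:
        if name in exports:
            return exports[name]
    raise KeyError(f"undefined symbol: {name}")
```

Because every reference goes through the same lookup, a call to `malloc` from inside the C library finds `app:malloc` just as a call from the executable does.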
|
||||
|
||||
There are thus different requirements pulling in different directions for any
|
||||
specific ELF implementation. The right implementation choices will depend on
|
||||
the characteristics of the processor. That said, most, but not all, processors
|
||||
make fairly similar decisions. I will describe the common case here. An example
|
||||
of a processor which uses the common case is the i386; an example of a
|
||||
processor which makes some different decisions is the PowerPC.
|
||||
|
||||
In the common case, code may be compiled in two different modes. By default,
|
||||
code is position dependent. Putting position dependent code into a shared
|
||||
library will cause the program linker to generate a lot of relocation
|
||||
information, and cause the dynamic linker to do a lot of processing at
|
||||
runtime. Code may also be compiled in position independent mode, typically
|
||||
with the `-fpic` option. Position independent code is slightly slower when it
|
||||
calls a non-static function or refers to a global or static variable. However,
|
||||
it requires much less relocation information, and thus the dynamic linker will
|
||||
start the program faster.
|
||||
|
||||
Position independent code will call non-static functions via the *Procedure
|
||||
Linkage Table* or *PLT*. This PLT does not exist in .o files. In a .o file, use
|
||||
of the PLT is indicated by a special relocation. When the program linker
|
||||
processes such a relocation, it will create an entry in the PLT. It will
|
||||
adjust the instruction such that it becomes a PC-relative call to the PLT
|
||||
entry. PC-relative calls are inherently position independent and thus do not
|
||||
require a relocation entry themselves. The program linker will create a
|
||||
relocation for the PLT entry which tells the dynamic linker which symbol is
|
||||
associated with that entry. This process reduces the number of dynamic
|
||||
relocations in the shared library from one per function call to one per
|
||||
function called.
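
The saving is easy to see by counting. The call sites below are invented for illustration:

```python
# Hypothetical call sites in a shared library: (caller, callee).
call_sites = [
    ("parse", "malloc"), ("parse", "strlen"), ("emit", "malloc"),
    ("emit", "malloc"), ("cleanup", "free"),
]

# Without a PLT, every call instruction needs its own dynamic relocation.
relocs_without_plt = len(call_sites)
# With a PLT, only one relocation per distinct function called is needed.
relocs_with_plt = len({callee for _caller, callee in call_sites})
```

Here five call sites collapse to three PLT entries, and the gap grows with the number of repeated calls.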
|
||||
|
||||
Further, PLT entries are normally relocated lazily by the dynamic linker. On
|
||||
most ELF systems this laziness may be overridden by setting the `LD_BIND_NOW`
|
||||
environment variable when running the program. However, by default, the dynamic
|
||||
linker will not actually apply a relocation to the PLT until some code actually
|
||||
calls the function in question. This also speeds up startup time, in that many
|
||||
invocations of a program will not call every possible function. This is
|
||||
particularly true when considering the shared C library, which has many more
|
||||
function calls than any typical program will execute.
|
||||
|
||||
In order to make this work, the program linker initializes the PLT entries to
|
||||
load an index into some register or push it on the stack, and then to branch to
|
||||
common code. The common code calls back into the dynamic linker, which uses the
|
||||
index to find the appropriate PLT relocation, and uses that to find the
|
||||
function being called. The dynamic linker then initializes the PLT entry with
|
||||
the address of the function, and then jumps to the code of the function. The
|
||||
next time the function is called, the PLT entry will branch directly to the
|
||||
function.
|
||||
|
||||
Before giving an example, I will talk about the other major data structure in
|
||||
position independent code, the *Global Offset Table* or *GOT*. This is used for
|
||||
global and static variables. For every reference to a global variable from
|
||||
position independent code, the compiler will generate a load from the GOT to
|
||||
get the address of the variable, followed by a second load to get the actual
|
||||
value of the variable. The address of the GOT will normally be held in a
|
||||
register, permitting efficient access. Like the PLT, the GOT does not exist in
|
||||
a .o file, but is created by the program linker. The program linker will create
|
||||
the dynamic relocations which the dynamic linker will use to initialize the GOT
|
||||
at runtime. Unlike the PLT, the dynamic linker always fully initializes the GOT
|
||||
when the program starts.
|
||||
|
||||
For example, on the i386, the address of the GOT is held in the register
|
||||
`%ebx`. This register is initialized at the entry to each function in position
|
||||
independent code. The initialization sequence varies from one compiler to
|
||||
another, but typically looks something like this:
|
||||
|
||||
```asm
call __i686.get_pc_thunk.bx
add $offset,%ebx
```
|
||||
|
||||
The function `__i686.get_pc_thunk.bx` simply looks like this:
|
||||
|
||||
```asm
mov (%esp),%ebx
ret
```
|
||||
|
||||
This sequence of instructions uses a position independent sequence to get the
|
||||
address at which it is running. Then it uses an offset to get the address of
|
||||
the GOT. Note that this requires that the GOT always be a fixed offset from the
|
||||
code, regardless of where the shared library is loaded. That is, the dynamic
|
||||
linker must load the shared library as a fixed unit; it may not load different
|
||||
parts at varying addresses.
|
||||
|
||||
Global and static variables are now read or written by first loading the
|
||||
address via a fixed offset from `%ebx`. The program linker will create dynamic
|
||||
relocations for each entry in the GOT, telling the dynamic linker how to
|
||||
initialize the entry. These relocations are of type `GLOB_DAT`.
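
Here is a small model of that access pattern, assuming made-up offsets and a dictionary in place of memory. The constants and both helper functions are hypothetical; the point is only that the same link-time constants work at any load address.

```python
GOT_OFFSET_IN_LIBRARY = 0x2000   # link-time constant: GOT relative to library start
GOT_SLOT_OF_VAR = 0x0C           # link-time constant: slot offset within the GOT

def load_library(load_address, var_address, value):
    """Model the dynamic linker filling in one GLOB_DAT slot."""
    got = load_address + GOT_OFFSET_IN_LIBRARY
    return {got + GOT_SLOT_OF_VAR: var_address, var_address: value}

def read_global(memory, load_address, pc_offset):
    pc = load_address + pc_offset                    # what the thunk returns
    got = pc + (GOT_OFFSET_IN_LIBRARY - pc_offset)   # the add to %ebx
    address = memory[got + GOT_SLOT_OF_VAR]          # first load: the address
    return memory[address]                           # second load: the value

values = [read_global(load_library(base, 0x5000, 99), base, 0x40)
          for base in (0x10000, 0x7F000000)]
```

The offset added to the program counter never mentions the load address, which is why the library has to be loaded as one fixed unit.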
|
||||
|
||||
For function calls, the program linker will set up a PLT entry to look like
|
||||
this:
|
||||
|
||||
```asm
jmp *offset(%ebx)
pushl $index
jmp first_plt_entry
```
|
||||
|
||||
The program linker will allocate an entry in the GOT for each entry in the
|
||||
PLT. It will create a dynamic relocation for the GOT entry of type `JMP_SLOT`.
|
||||
It will initialize the GOT entry to the base address of the shared library plus
|
||||
the address of the second instruction in the code sequence above. When the
|
||||
dynamic linker does the initial lazy binding on a `JMP_SLOT` reloc, it will
|
||||
simply add the difference between the shared library load address and the
|
||||
shared library base address to the GOT entry. The effect is that the first jmp
|
||||
instruction will jump to the second instruction, which will push the index
|
||||
entry and branch to the first PLT entry. The first PLT entry is special, and
|
||||
looks like this:
|
||||
|
||||
```asm
pushl 4(%ebx)
jmp *8(%ebx)
```
|
||||
|
||||
This references the second and third entries in the GOT. The dynamic linker
|
||||
will initialize them to have appropriate values for a callback into the dynamic
|
||||
linker itself. The dynamic linker will use the index pushed by the first code
|
||||
sequence to find the `JMP_SLOT` relocation. When the dynamic linker determines
|
||||
the function to be called, it will store the address of the function into the
|
||||
GOT entry referenced by the first code sequence. Thus, the next time the
|
||||
function is called, the jmp instruction will branch directly to the right code.
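
The lazy binding sequence can be imitated in a few lines. This is a toy model, not how any dynamic linker is written: Python callables stand in for code addresses, the `LazyPlt` class and its fields are hypothetical, and the resolver is just a dictionary lookup.

```python
class LazyPlt:
    """Toy model of one library's PLT/GOT pair with lazy binding."""

    def __init__(self, imports, resolve):
        self.imports = imports    # index -> symbol name (the JMP_SLOT relocs)
        self.resolve = resolve    # stands in for the dynamic linker's lookup
        self.lookups = 0
        # Each GOT slot initially points back at the "push index" stub.
        self.got = [self._make_stub(i) for i in range(len(imports))]

    def _make_stub(self, index):
        def stub(*args):
            self.lookups += 1
            target = self.resolve(self.imports[index])  # find the function
            self.got[index] = target                    # patch the GOT slot
            return target(*args)                        # then jump to it
        return stub

    def call(self, index, *args):
        return self.got[index](*args)   # the indirect jmp through the GOT

functions = {"double": lambda x: 2 * x, "negate": lambda x: -x}
plt = LazyPlt(["double", "negate"], functions.__getitem__)
first = plt.call(0, 21)    # goes through the stub and the resolver
second = plt.call(0, 4)    # goes straight to the function
```

After the two calls the resolver has run once, and the slot for `negate` has never been resolved at all.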
|
||||
|
||||
That was a fast pass over a lot of details, but I hope that it conveys the
|
||||
main idea. It means that for position independent code on the i386, every call
|
||||
to a global function requires one extra instruction after the first time it is
|
||||
called. Every reference to a global or static variable requires one extra
|
||||
instruction. Almost every function uses four extra instructions when it starts
|
||||
to initialize `%ebx` (leaf functions which do not refer to any global variables
|
||||
do not need to initialize `%ebx`). This all has some negative impact on the
|
||||
program cache. This is the runtime performance penalty paid to let the dynamic
|
||||
linker start the program quickly.
|
||||
|
||||
On other processors, the details are naturally different. However, the general
|
||||
flavour is similar: position independent code in a shared library starts faster
|
||||
and runs slightly slower.
|
||||
|
||||
More tomorrow.
|
||||
|
|
@ -0,0 +1,184 @@
|
|||
# Linkers part 5
|
||||
|
||||
## Shared Libraries Redux
|
||||
|
||||
Yesterday I talked about how shared libraries work. I realized that I should
|
||||
say something about how linkers implement shared libraries. This discussion
|
||||
will again be ELF specific.
|
||||
|
||||
When the program linker puts position dependent code into a shared library, it
|
||||
has to copy more of the relocations from the object file into the shared
|
||||
library. They will become dynamic relocations computed by the dynamic linker at
|
||||
runtime. Some relocations do not have to be copied; for example, a PC relative
|
||||
relocation to a symbol which is local to the shared library can be fully resolved
|
||||
by the program linker, and does not require a dynamic reloc. However, note that
|
||||
a PC relative relocation to a global symbol does require a dynamic relocation;
|
||||
otherwise, the main executable would not be able to override the symbol. Some
|
||||
relocations have to exist in the shared library, but do not need to be actual
|
||||
copies of the relocations in the object file; for example, a relocation which
|
||||
computes the absolute address of a symbol which is local to the shared library
|
||||
can often be replaced with a `RELATIVE` reloc, which simply directs the dynamic
|
||||
linker to add the difference between the shared library’s load address and its
|
||||
base address. The advantage of using a `RELATIVE` reloc is that the dynamic
|
||||
linker can compute it quickly at runtime, because it does not require
|
||||
determining the value of a symbol.
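
A `RELATIVE` reloc is simple enough to sketch directly. The assumptions here are a 32-bit little-endian target and a relocation list that is just a list of offsets; the function name is my own.

```python
import struct

def apply_relative_relocs(image, reloc_offsets, base_address, load_address):
    """Add (load address - base address) to each 32-bit word listed."""
    bias = load_address - base_address
    for offset in reloc_offsets:
        (value,) = struct.unpack_from("<I", image, offset)
        struct.pack_into("<I", image, offset, (value + bias) & 0xFFFFFFFF)

# A library linked at base address 0 holding one pointer to its offset 0x1234,
# followed by a word that has no relocation and must stay untouched.
image = bytearray(struct.pack("<II", 0x1234, 0xDEADBEEF))
apply_relative_relocs(image, [0], base_address=0, load_address=0x40000000)
```

No symbol table is consulted anywhere in the loop, which is where the speed comes from.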
|
||||
|
||||
For position independent code, the program linker has a harder job. The
|
||||
compiler and assembler will cooperate to generate special relocs for position
|
||||
independent code. Although details differ among processors, there will
|
||||
typically be a `PLT` reloc and a `GOT` reloc. These relocs will direct the program
|
||||
linker to add an entry to the PLT or the GOT, as well as performing some
|
||||
computation. For example, on the i386 a function call in position independent
|
||||
code will generate a `R_386_PLT32` reloc. This reloc will refer to a symbol as
|
||||
usual. It will direct the program linker to add a PLT entry for that symbol,
|
||||
if one does not already exist. The computation of the reloc is then a
|
||||
PC-relative reference to the PLT entry. (The `32` in the name of the reloc
|
||||
refers to the size of the reference, which is 32 bits). Yesterday I described
|
||||
how on the i386 every PLT entry also has a corresponding GOT entry, so the
|
||||
`R_386_PLT32` reloc actually directs the program linker to create both a PLT
|
||||
entry and a GOT entry.
|
||||
|
||||
When the program linker creates an entry in the PLT or the GOT, it must also
|
||||
generate a dynamic reloc to tell the dynamic linker about the entry. This will
|
||||
typically be a `JMP_SLOT` or `GLOB_DAT` relocation.
|
||||
|
||||
This all means that the program linker must keep track of the PLT entry and the
|
||||
GOT entry for each symbol. Initially, of course, there will be no such entries.
|
||||
When the linker sees a PLT or GOT reloc, it must check whether the symbol
|
||||
referenced by the reloc already has a PLT or GOT entry, and create one if it
|
||||
does not. Note that it is possible for a single symbol to have both a PLT entry
|
||||
and a GOT entry; this will happen for position independent code which both
|
||||
calls a function and also takes its address.
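
That bookkeeping might look roughly like this. The `EntryTables` class is a hypothetical sketch; it tracks only indexes and the dynamic relocs to emit, and leaves out the GOT slot that each i386 PLT entry also owns.

```python
class EntryTables:
    """Toy bookkeeping for which symbols already have PLT or GOT entries."""

    def __init__(self):
        self.plt = {}              # symbol name -> PLT index
        self.got = {}              # symbol name -> GOT index
        self.dynamic_relocs = []   # (dynamic reloc type, symbol name)

    def need_plt(self, name):
        if name not in self.plt:   # create the entry only the first time
            self.plt[name] = len(self.plt)
            self.dynamic_relocs.append(("JMP_SLOT", name))
        return self.plt[name]

    def need_got(self, name):
        if name not in self.got:
            self.got[name] = len(self.got)
            self.dynamic_relocs.append(("GLOB_DAT", name))
        return self.got[name]

tables = EntryTables()
for reloc_kind, symbol in [("PLT", "puts"), ("PLT", "puts"), ("GOT", "errno_ptr"),
                           ("PLT", "qsort"), ("GOT", "qsort")]:
    if reloc_kind == "PLT":
        tables.need_plt(symbol)
    else:
        tables.need_got(symbol)
```

In this run `qsort` ends up with both a PLT entry and a GOT entry, matching the case of a function that is both called and has its address taken.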
|
||||
|
||||
The dynamic linker’s job for the PLT and GOT tables is to simply compute the
|
||||
`JMP_SLOT` and `GLOB_DAT` relocs at runtime. The main complexity here is the
|
||||
lazy evaluation of PLT entries which I described yesterday.
|
||||
|
||||
The fact that C permits taking the address of a function introduces an
|
||||
interesting wrinkle. In C you are permitted to take the address of a function,
|
||||
and you are permitted to compare that address to another function address. The
|
||||
problem is that if you take the address of a function in a shared library, the
|
||||
natural result would be to get the address of the PLT entry. After all, that is
|
||||
the address to which a call to the function will jump. However, each shared library
|
||||
has its own PLT, and thus the address of a particular function would differ in
|
||||
each shared library. That means that comparisons of function pointers generated
|
||||
in different shared libraries may be different when they should be the same.
|
||||
This is not a purely hypothetical problem; when I did a port which got it
|
||||
wrong, before I fixed the bug I saw failures in the Tcl shared library when it
|
||||
compared function pointers.
|
||||
|
||||
The fix for this bug on most processors is a special marking for a symbol which
|
||||
has a PLT entry but is not defined. Typically the symbol will be marked as
|
||||
undefined, but with a non-zero value–the value will be set to the address of
|
||||
the PLT entry. When the dynamic linker is searching for the value of a symbol
|
||||
to use for a reloc other than a `JMP_SLOT` reloc, if it finds such a specially
|
||||
marked symbol, it will use the non-zero value. This will ensure that all
|
||||
references to the symbol which are not function calls will use the same value.
|
||||
To make this work, the compiler and assembler must make sure that any reference
|
||||
to a function which does not involve calling it will not carry a standard PLT
|
||||
reloc. This special handling of function addresses needs to be implemented in
|
||||
both the program linker and the dynamic linker.
|
||||
|
||||
## ELF Symbols
|
||||
|
||||
OK, enough about shared libraries. Let’s go over ELF symbols in more detail.
|
||||
I’m not going to lay out the exact data structures–go to the ELF ABI for that.
|
||||
I’m going to talk about the different fields and what they mean. Many of the
|
||||
different types of ELF symbols are also used by other object file formats, but
|
||||
I won’t cover that.
|
||||
|
||||
An entry in an ELF symbol table has eight pieces of information: a name, a
|
||||
value, a size, a section, a binding, a type, a visibility, and undefined
|
||||
additional information (currently there are six undefined bits, though more may
|
||||
be added). An ELF symbol defined in a shared object may also have an associated
|
||||
version name.
|
||||
|
||||
The name is obvious.
|
||||
|
||||
For an ordinary defined symbol, the section is some section in the file
|
||||
(specifically, the symbol table entry holds an index into the section table).
|
||||
For an object file the value is relative to the start of the section. For an
|
||||
executable the value is an absolute address. For a shared library the value is
|
||||
relative to the base address.
|
||||
|
||||
For an undefined reference symbol, the section index is the special value
|
||||
`SHN_UNDEF` which has the value `0`. A section index of `SHN_ABS` (`0xfff1`)
|
||||
indicates that the value of the symbol is an absolute value, not relative to
|
||||
any section.
|
||||
|
||||
A section index of `SHN_COMMON` (`0xfff2`) indicates a common symbol. Common
|
||||
symbols were invented to handle Fortran common blocks, and they are also often
|
||||
used for uninitialized global variables in C. A common symbol has unusual
|
||||
semantics. Common symbols have a value of zero, but set the size field to the
|
||||
desired size. If one object file has a common symbol and another has a
|
||||
definition, the common symbol is treated as an undefined reference. If there is
|
||||
no definition for a common symbol, the program linker acts as though it saw a
|
||||
definition initialized to zero of the appropriate size. Two object files may
|
||||
have common symbols of different sizes, in which case the program linker will
|
||||
use the largest size. Implementing common symbol semantics across shared
|
||||
libraries is a touchy subject, somewhat helped by the recent introduction of a
|
||||
type for common symbols as well as a special section index (see the discussion
|
||||
of symbol types below).
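
The common symbol rules for a single name can be sketched like this. The function and the `"allocated-zeroed"` label are invented for the example, and shared libraries are ignored completely.

```python
def resolve_common(entries):
    """entries: (kind, size) pairs seen for one name across the object files,
    where kind is 'common' or 'defined'. Returns what the linker ends up with.
    """
    definitions = [entry for entry in entries if entry[0] == "defined"]
    if len(definitions) > 1:
        raise ValueError("multiple definition")
    if definitions:
        # A real definition exists: the commons act as undefined references.
        return definitions[0]
    # No definition: act as though a zero-initialized definition of the
    # largest requested size had been seen.
    return ("allocated-zeroed", max(size for _kind, size in entries))
```

For example, `resolve_common([("common", 4), ("common", 16)])` yields a zeroed allocation of 16 bytes, while adding a `("defined", 8)` entry makes that definition win regardless of the common sizes.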
|
||||
|
||||
The size of an ELF symbol, other than a common symbol, is the size of the
|
||||
variable or function. This is mainly used for debugging purposes.
|
||||
|
||||
The binding of an ELF symbol is global, local, or weak. A global symbol is
|
||||
globally visible. A local symbol is only locally visible (e.g., a static
|
||||
function). Weak symbols come in two flavors. A weak undefined reference is like
|
||||
an ordinary undefined reference, except that it is not an error if a relocation
|
||||
refers to a weak undefined reference symbol which has no defining symbol.
|
||||
Instead, the relocation is computed as though the symbol had the value zero.
|
||||
|
||||
A weak defined symbol is permitted to be linked with a non-weak defined symbol
|
||||
of the same name without causing a multiple definition error. Historically
|
||||
there are two ways for the program linker to handle a weak defined symbol. On
|
||||
SVR4 if the program linker sees a weak defined symbol followed by a non-weak
|
||||
defined symbol with the same name, it will issue a multiple definition error.
|
||||
However, a non-weak defined symbol followed by a weak defined symbol will not
|
||||
cause an error. On Solaris, a weak defined symbol followed by a non-weak
|
||||
defined symbol is handled by causing all references to attach to the non-weak
|
||||
defined symbol, with no error. This difference in behaviour is due to an
|
||||
ambiguity in the ELF ABI which was read differently by different people. The
|
||||
GNU linker follows the Solaris behaviour.
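
Here is a sketch of the Solaris-style rules for one symbol name. The helper names are hypothetical, and one detail is my own assumption rather than something stated above: when only weak definitions exist, the sketch picks the first one in link order.

```python
def pick_definition(bindings):
    """bindings: 'global' or 'weak' for each definition of one name, in link
    order. Returns the index of the winning definition, or None."""
    strong = [i for i, binding in enumerate(bindings) if binding == "global"]
    if len(strong) > 1:
        raise ValueError("multiple definition")
    if strong:
        return strong[0]              # a non-weak definition wins, in any order
    return 0 if bindings else None    # assumption: otherwise the first weak one

def value_for_reference(bindings, values, weak_reference):
    """Value a relocation against this name would be computed with."""
    winner = pick_definition(bindings)
    if winner is not None:
        return values[winner]
    if weak_reference:
        return 0                      # unresolved weak reference acts as zero
    raise ValueError("undefined reference")
```

Under these rules `pick_definition(["weak", "global"])` and `pick_definition(["global", "weak"])` both select the global definition; the SVR4 reading would reject the first ordering instead.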
|
||||
|
||||
The type of an ELF symbol is one of the following:
|
||||
|
||||
* `STT_NOTYPE`: no particular type.
|
||||
* `STT_OBJECT`: a data object, such as a variable.
|
||||
* `STT_FUNC`: a function.
|
||||
* `STT_SECTION`: a local symbol associated with a section. This type of symbol
|
||||
is used to reduce the number of local symbols required, by changing all
|
||||
relocations against local symbols in a specific section to use the
|
||||
`STT_SECTION` symbol instead.
|
||||
* `STT_FILE`: a special symbol whose name is the name of the source file which
|
||||
produced the object file.
|
||||
* `STT_COMMON`: a common symbol. This is the same as setting the section index
|
||||
to `SHN_COMMON`, except in a shared object. The program linker will normally
|
||||
have allocated space for the common symbol in the shared object, so it will
|
||||
have a real section index. The `STT_COMMON` type tells the dynamic linker
|
||||
that although the symbol has a regular definition, it is a common symbol.
|
||||
* `STT_TLS`: a symbol in the Thread Local Storage area. I will describe this in
|
||||
more detail some other day.
|
||||
|
||||
ELF symbol visibility was invented to provide more control over which symbols
|
||||
were accessible outside a shared library. The basic idea is that a symbol may
|
||||
be global within a shared library, but local outside the shared library.
|
||||
|
||||
* `STV_DEFAULT`: the usual visibility rules apply: global symbols are visible
|
||||
everywhere.
|
||||
* `STV_INTERNAL`: the symbol is not accessible outside the current executable
|
||||
or shared library.
|
||||
* `STV_HIDDEN`: the symbol is not visible outside the current executable or
|
||||
shared library, but it may be accessed indirectly, probably because some code
|
||||
took its address.
|
||||
* `STV_PROTECTED`: the symbol is visible outside the current executable or
|
||||
shared object, but it may not be overridden. That is, if a protected symbol
|
||||
in a shared library is referenced by other code in the shared library, that
|
||||
other code will always reference the symbol in the shared library, even if
|
||||
the executable defines a symbol with the same name.
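
The rules in this list can be summarized in a few lines of C. This is only a sketch: the visibility is stored in the low two bits of the symbol's `st_other` field, and the numeric values are the standard ones, but the two predicate functions are hypothetical names chosen for this example, not part of any linker's API.

```c
#include <stdint.h>

enum { STV_DEFAULT = 0, STV_INTERNAL = 1, STV_HIDDEN = 2, STV_PROTECTED = 3 };

/* Visibility lives in the low two bits of st_other. */
static unsigned sym_visibility(uint8_t st_other) { return st_other & 0x3; }

/* True if a global symbol with this visibility can be named from
   outside the executable or shared library which defines it. */
static int visible_outside(uint8_t st_other)
{
    unsigned v = sym_visibility(st_other);
    return v == STV_DEFAULT || v == STV_PROTECTED;
}

/* True if a definition elsewhere, such as in the executable, may
   override references made from within the defining library. */
static int may_be_overridden(uint8_t st_other)
{
    return sym_visibility(st_other) == STV_DEFAULT;
}
```

With gcc, the visibility of an individual symbol is usually chosen in the source with `__attribute__((visibility("hidden")))` and similar, rather than by editing the symbol table.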
|
||||
|
||||
I’ll describe symbol versions later.
|
||||
|
||||
More tomorrow.
|
||||
|
|
@ -0,0 +1,127 @@
|
|||
# Linkers part 6
|
||||
|
||||
So many things to talk about. Let’s go back and cover relocations in some more
|
||||
detail, with some examples.
|
||||
|
||||
## Relocations
|
||||
|
||||
As I said back in part 2, a relocation is a computation to perform on the
|
||||
contents. And as I said yesterday, a relocation can also direct the linker to
|
||||
take other actions, like creating a PLT or GOT entry. Let’s take a closer look
|
||||
at the computation.
|
||||
|
||||
In general a relocation has a type, a symbol, an offset into the contents, and
|
||||
an addend. From the linker’s point of view, the contents are simply an
|
||||
uninterpreted series of bytes. A relocation changes those bytes as necessary to
|
||||
produce the correct final executable. For example, consider the C code
|
||||
`g = 0;` where `g` is a global variable. On the i386, the compiler will turn
|
||||
this into an assembly language instruction, which will most likely be
|
||||
`movl $0, g` (for position dependent code–position independent code would
|
||||
load the address of `g` from the GOT). Now, the `g` in the C code is a
|
||||
global variable, and we all more or less know what that means. The `g` in the
|
||||
assembly code is not that variable. It is a symbol which holds the address of
|
||||
that variable.
|
||||
|
||||
The assembler does not know the address of the global variable `g`, which is
|
||||
another way of saying that the assembler does not know the value of the symbol
|
||||
`g`. It is the linker that is going to pick that address. So the assembler has
|
||||
to tell the linker that it needs to use the address of `g` in this instruction.
|
||||
The way the assembler does this is to create a relocation. We don’t use a
|
||||
separate relocation type for each instruction; instead, each processor will
|
||||
have a natural set of relocation types which are appropriate for the machine
|
||||
architecture. Each type of relocation expresses a specific computation.
|
||||
|
||||
In the i386 case, the assembler will generate these bytes:
|
||||
|
||||
```
|
||||
c7 05 00 00 00 00 00 00 00 00
|
||||
```
|
||||
|
||||
The `c7 05` are the instruction (movl constant to address). The first four `00`
|
||||
bytes are the address. The second four `00` bytes are the 32-bit constant 0.
|
||||
The assembler tells the linker to put the value of the symbol `g` into the
|
||||
address bytes by generating (in this case) a `R_386_32` relocation. For this
|
||||
relocation the symbol will be `g`, the offset will be to the four address bytes
|
||||
of the instruction, the type will be `R_386_32`, and the addend will be 0 (in the
|
||||
case of the i386 the addend is stored in the contents rather than in the
|
||||
relocation itself, but this is a detail). The type `R_386_32` expresses a
|
||||
specific computation, which is: put the 32-bit sum of the value of the symbol
|
||||
and the addend into the offset. Since for the i386 the addend is stored in the
|
||||
contents, this can also be expressed as: add the value of the symbol to the
|
||||
32-bit field at the offset. When the linker performs this computation, the
|
||||
address in the instruction will be the address of the global variable `g`.
|
||||
Regardless of the details, the important point to note is that the relocation
|
||||
adjusts the contents by applying a specific computation selected by the type.
|
||||
|
||||
An example of a simple case which does use an addend would be
|
||||
|
||||
```c
|
||||
char a[10]; // A global array.
|
||||
char* p = &a[1]; // In a function.
|
||||
```
|
||||
|
||||
The assignment to `p` will wind up requiring a relocation for the symbol `a`.
|
||||
Here the addend will be 1, so that the resulting instruction references `a + 1`
|
||||
rather than `a + 0`.
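
The `R_386_32` computation, with the addend stored in the contents as described above, can be sketched in C as follows. This is an illustration of the arithmetic, not code from any real linker; `apply_r_386_32` and the two little-endian helpers are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value from the contents. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Write a 32-bit little-endian value into the contents. */
static void write_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* R_386_32: add the value of the symbol to the 32-bit field at the
   offset. Whatever the assembler left in that field is the addend. */
static void apply_r_386_32(uint8_t *contents, size_t offset,
                           uint32_t symbol_value)
{
    uint32_t addend = read_le32(contents + offset);
    write_le32(contents + offset, symbol_value + addend);
}
```

For the `&a[1]` example, the assembler would leave `01 00 00 00` in the field. If the linker placed `a` at the (made-up) address `0x0804a020`, the field would hold `0x0804a021` after the relocation is applied.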
|
||||
|
||||
To point out how relocations are processor dependent, let’s consider `g = 0;`
|
||||
on a RISC processor: the PowerPC (in 32-bit mode). In this case, multiple
|
||||
assembly language instructions are required:
|
||||
|
||||
```asm
|
||||
li 1,0 // Set register 1 to 0
|
||||
lis 9,g@ha // Load high-adjusted part of g into register 9
|
||||
stw 1,g@l(9) // Store register 1 to address in register 9 plus the low part of g
|
||||
```
|
||||
|
||||
The `lis` instruction loads a value into the upper 16 bits of register 9,
|
||||
setting the lower 16 bits to zero. The `stw` instruction adds a signed 16 bit
|
||||
value to register 9 to form an address, and then stores the value of register 1
|
||||
at that address. The `@ha` part of the operand directs the assembler to
|
||||
generate a `R_PPC_ADDR16_HA` reloc. The `@l` produces a `R_PPC_ADDR16_LO`
|
||||
reloc. The goal of these relocs is to compute the value of the symbol `g` and
|
||||
use it as the store address.
|
||||
|
||||
That is enough information to determine the computations performed by these
|
||||
relocs. The `R_PPC_ADDR16_HA` reloc computes
|
||||
`(SYMBOL >> 16) + ((SYMBOL & 0x8000) ? 1 : 0)`. The `R_PPC_ADDR16_LO` reloc computes
|
||||
`SYMBOL & 0xffff`. The extra computation for `R_PPC_ADDR16_HA` is because the
|
||||
`stw` instruction adds the signed 16-bit value, which means that if the low 16
|
||||
bits appear negative we have to adjust the high 16 bits accordingly. The
|
||||
offsets of the relocations are such that the 16-bit resulting values are stored
|
||||
into the appropriate parts of the machine instructions.
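
To see that the two halves really do recombine into the symbol value, here is a small C sketch of both computations, plus a model of what the `lis`/`stw` pair computes at run time. The function names are invented for this example.

```c
#include <stdint.h>

/* R_PPC_ADDR16_LO: the low 16 bits of the symbol value. */
static uint16_t ppc_lo(uint32_t symbol)
{
    return (uint16_t)(symbol & 0xffff);
}

/* R_PPC_ADDR16_HA: the high 16 bits, plus one when the low half
   will look negative once the processor sign-extends it. */
static uint16_t ppc_ha(uint32_t symbol)
{
    return (uint16_t)((symbol >> 16) + ((symbol & 0x8000) ? 1 : 0));
}

/* Model of lis + stw: the high part shifted up by 16, plus the
   sign-extended low part, in 32-bit wraparound arithmetic. */
static uint32_t ppc_effective_address(uint16_t ha, uint16_t lo)
{
    return ((uint32_t)ha << 16) + (uint32_t)(int32_t)(int16_t)lo;
}
```

For a symbol at `0x10018000` the low half is `0x8000`, which sign-extends to a negative number, so the high half becomes `0x1002` rather than `0x1001`; adding the two back together yields `0x10018000` again.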
|
||||
|
||||
The specific examples of relocations I’ve discussed here are ELF specific, but
|
||||
the same sorts of relocations occur for any object file format.
|
||||
|
||||
The examples I’ve shown are for relocations which appear in an object file. As
|
||||
discussed in part 4, these types of relocations may also appear in a shared
|
||||
library, if they are copied there by the program linker. In ELF, there are also
|
||||
specific relocation types which never appear in object files but only appear in
|
||||
shared libraries or executables. These are the `JMP_SLOT`, `GLOB_DAT`, and
|
||||
`RELATIVE` relocations discussed earlier. Another type of relocation which only
|
||||
appears in an executable is a `COPY` relocation, which I will discuss later.
|
||||
|
||||
## Position Dependent Shared Libraries
|
||||
|
||||
I realized that in part 4 I forgot to say one of the important reasons that ELF
|
||||
shared libraries use PLT and GOT tables. The idea of a shared library is to
|
||||
permit mapping the same shared library into different processes. This only
|
||||
works at maximum efficiency if the shared library code looks the same in each
|
||||
process. If it does not look the same, then each process will need its own
|
||||
private copy, and the savings in physical memory and sharing will be lost.
|
||||
|
||||
As discussed in part 4, when the dynamic linker loads a shared library which
|
||||
contains position dependent code, it must apply a set of dynamic relocations.
|
||||
Those relocations will change the code in the shared library, and it will no
|
||||
longer be sharable.
|
||||
|
||||
The advantage of the PLT and GOT is that they move the relocations elsewhere,
|
||||
to the PLT and GOT tables themselves. Those tables can then be put into a
|
||||
read-write part of the shared library. This part of the shared library will be
|
||||
much smaller than the code. The PLT and GOT tables will be different in each
|
||||
process using the shared library, but the code will be the same.
|
||||
|
||||
I’ll be taking a vacation for the long weekend. My next post will most likely
|
||||
be on Tuesday.
|
||||
|
|
@ -0,0 +1,176 @@
|
|||
# Linkers part 7
|
||||
|
||||
As we’ve seen, what linkers do is basically quite simple, but the details can
|
||||
get complicated. The complexity is because smart programmers can see small
|
||||
optimizations to speed up their programs a little bit, and sometimes the only
|
||||
place those optimizations can be implemented is the linker. Each such
|
||||
optimization makes the linker a little more complicated. At the same time, of
|
||||
course, the linker has to run as fast as possible, since nobody wants to sit
|
||||
around waiting for it to finish. Today I’ll talk about a classic small
|
||||
optimization implemented by the linker.
|
||||
|
||||
## Thread Local Storage
|
||||
|
||||
I’ll assume you know what a thread is. It is often useful to have a global
|
||||
variable which can take on a different value in each thread (if you don’t see
|
||||
why this is useful, just trust me on this). That is, the variable is global to
|
||||
the program, but the specific value is local to the thread. If thread A sets
|
||||
the thread local variable to 1, and thread B then sets it to 2, then code
|
||||
running in thread A will continue to see the value 1 for the variable while
|
||||
code running in thread B sees the value 2. In Posix threads this type of
|
||||
variable can be created via `pthread_key_create` and accessed via
|
||||
`pthread_getspecific` and `pthread_setspecific`.
|
||||
|
||||
Those functions work well enough, but making a function call for each access is
|
||||
awkward and inconvenient. It would be more useful if you could just declare a
|
||||
regular global variable and mark it as thread local. That is the idea of Thread
|
||||
Local Storage (TLS), which I believe was invented at Sun. On a system which
|
||||
supports TLS, any global (or static) variable may be annotated with `__thread`.
|
||||
The variable is then thread local.
|
||||
|
||||
Clearly this requires support from the compiler. It also requires support from
|
||||
the program linker and the dynamic linker. For maximum efficiency–and why do
|
||||
this if you aren’t going to get maximum efficiency?–some kernel support is also
|
||||
needed. The design of TLS on ELF systems fully supports shared libraries,
|
||||
including having multiple shared libraries, and the executable itself, use the
|
||||
same name to refer to a single TLS variable. TLS variables can be initialized.
|
||||
Programs can take the address of a TLS variable, and pass the pointers between
|
||||
threads, so the address of a TLS variable is a dynamic value and must be
|
||||
globally unique.
|
||||
|
||||
How is this all implemented? First step: define different storage models for
|
||||
TLS variables.
|
||||
|
||||
* Global Dynamic: Fully general access to TLS variables from an executable or a
|
||||
shared object.
|
||||
* Local Dynamic: Permits access to a variable which is bound locally within the
|
||||
executable or shared object from which it is referenced. This is true for all
|
||||
static TLS variables, for example. It is also true for protected symbols–I
|
||||
described those back in part 5.
|
||||
* Initial Executable: Permits access to a variable which is known to be part of
|
||||
the TLS image of the executable. This is true for all TLS variables defined
|
||||
in the executable itself, and for all TLS variables in shared libraries
|
||||
explicitly linked with the executable. This is not true for accesses from a
|
||||
shared library, nor for accesses to TLS variables defined in shared libraries
|
||||
opened by `dlopen`.
|
||||
* Local Executable: Permits access to TLS variables defined in the executable
|
||||
itself.
|
||||
|
||||
These storage models are defined in decreasing order of flexibility. Now, for
|
||||
efficiency and simplicity, a compiler which supports TLS will permit the
|
||||
developer to specify the appropriate TLS model to use (with gcc, this is done
|
||||
with the `-ftls-model` option, although the Global Dynamic and Local Dynamic
|
||||
models also require using `-fpic`). So, when compiling code which will be in an
|
||||
executable and never be in a shared library, the developer may choose to set
|
||||
the TLS storage model to Initial Executable.
|
||||
|
||||
Of course, in practice, developers often do not know where code will be used.
|
||||
And developers may not be aware of the intricacies of TLS models. The program
|
||||
linker, on the other hand, knows whether it is creating an executable or a
|
||||
shared library, and it knows whether the TLS variable is defined locally. So
|
||||
the program linker gets the job of automatically optimizing references to TLS
|
||||
variables when possible. These references take the form of relocations, and the
|
||||
linker optimizes the references by changing the code in various ways.
|
||||
|
||||
The program linker is also responsible for gathering all TLS variables together
|
||||
into a single TLS segment (I’ll talk more about segments later, for now think
|
||||
of them as a section). The dynamic linker has to group together the TLS
|
||||
segments of the executable and all included shared libraries, resolve the
|
||||
dynamic TLS relocations, and has to build TLS segments dynamically when `dlopen`
|
||||
is used. The kernel has to make it possible for access to the TLS segments to be
|
||||
efficient.
|
||||
|
||||
That was all pretty general. Let’s do an example, again for i386 ELF. There are
|
||||
three different implementations of i386 ELF TLS; I’m going to look at the gnu
|
||||
implementation. Consider this trivial code:
|
||||
|
||||
```c
|
||||
__thread int i;
|
||||
int foo() { return i; }
|
||||
```
|
||||
|
||||
In global dynamic mode, this generates i386 assembler code like this:
|
||||
|
||||
```asm
|
||||
leal i@TLSGD(,%ebx,1), %eax
|
||||
call ___tls_get_addr@PLT
|
||||
movl (%eax), %eax
|
||||
```
|
||||
|
||||
Recall from part 4 that `%ebx` holds the address of the GOT table. The first
|
||||
instruction will have a `R_386_TLS_GD` relocation for the variable `i`; the
|
||||
relocation will apply to the offset of the `leal` instruction. When the program
|
||||
linker sees this relocation, it will create two consecutive entries in the GOT
|
||||
table for the TLS variable `i`. The first one will get a `R_386_TLS_DTPMOD32`
|
||||
dynamic relocation, and the second will get a `R_386_TLS_DTPOFF32` dynamic
|
||||
relocation. The dynamic linker will set the `DTPMOD32` GOT entry to hold the
|
||||
module ID of the object which defines the variable. The module ID is an index
|
||||
within the dynamic linker’s tables which identifies the executable or a
|
||||
specific shared library. The dynamic linker will set the `DTPOFF32` GOT entry
|
||||
to the offset within the TLS segment for that module. The `__tls_get_addr`
|
||||
function will use those values to compute the address (this function also takes
|
||||
care of lazy allocation of TLS variables, which is a further optimization
|
||||
specific to the dynamic linker). Note that `__tls_get_addr` is actually
|
||||
implemented by the dynamic linker itself; it follows that global dynamic TLS
|
||||
variables are not supported (and not necessary) in statically linked
|
||||
executables.
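
Conceptually, the lookup boils down to "find this thread's TLS block for the module, then add the offset". The C model below shows only that idea. It is a simplified sketch under stated assumptions: the `thread_tls` structure and `model_tls_get_addr` are hypothetical, and a real dynamic linker keeps more state and allocates blocks lazily.

```c
#include <stddef.h>
#include <stdint.h>

/* The two consecutive GOT entries created for one R_386_TLS_GD. */
struct tls_index {
    uint32_t module;  /* filled in via R_386_TLS_DTPMOD32 */
    uint32_t offset;  /* filled in via R_386_TLS_DTPOFF32 */
};

/* Hypothetical per-thread table: one TLS block per module ID. */
struct thread_tls {
    size_t nmodules;
    uint8_t **blocks;  /* blocks[id] is this thread's copy for module id */
};

/* Model of the address computation: block for the module + offset. */
static void *model_tls_get_addr(const struct thread_tls *self,
                                const struct tls_index *ti)
{
    if (ti->module >= self->nmodules || self->blocks[ti->module] == NULL)
        return NULL;  /* a real implementation would allocate here */
    return self->blocks[ti->module] + ti->offset;
}
```

Because every thread has its own `blocks` table, the same pair of GOT entries yields a different address in each thread, which is exactly the property a TLS variable needs.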
|
||||
|
||||
At this point you are probably wondering what is so inefficient
|
||||
about `pthread_getspecific`. The real advantage of TLS shows when you see what
|
||||
the program linker can do. The `leal; call` sequence shown above is canonical:
|
||||
the compiler will always generate the same sequence to access a TLS variable in
|
||||
global dynamic mode. The program linker takes advantage of that fact. If the
|
||||
program linker sees that the code shown above is going into an executable, it
|
||||
knows that the access does not have to be treated as global dynamic; it can be
|
||||
treated as initial executable. The program linker will actually rewrite the
|
||||
code to look like this:
|
||||
|
||||
```asm
|
||||
movl %gs:0, %eax
|
||||
subl $i@GOTTPOFF(%ebx), %eax
|
||||
```
|
||||
|
||||
Here we see that the TLS system has coopted the `%gs` segment register, with
|
||||
cooperation from the operating system, to point to the TLS segment of the
|
||||
executable. For each processor which supports TLS, some such efficiency hack is
|
||||
made. Since the program linker is building the executable, it builds the TLS
|
||||
segment, and knows the offset of `i` in the segment. The `GOTTPOFF` is not a
|
||||
real relocation; it is created and then resolved within the program linker. It
|
||||
is, of course, the offset from the GOT table to the address of `i` in the TLS
|
||||
segment. The `movl (%eax), %eax` from the original sequence remains to actually
|
||||
load the value of the variable.
|
||||
|
||||
Actually, that is what would happen if `i` were not defined in the executable
|
||||
itself. In the example I showed, `i` is defined in the executable, so the
|
||||
program linker can actually go from a global dynamic access all the way to a
|
||||
local executable access. That looks like this:
|
||||
|
||||
```asm
|
||||
movl %gs:0,%eax
|
||||
subl $i@TPOFF,%eax
|
||||
```
|
||||
|
||||
Here `i@TPOFF` is simply the known offset of `i` within the TLS segment. I’m
|
||||
not going to go into why this uses `subl` rather than `addl`; suffice it to say
|
||||
that this is another efficiency hack in the dynamic linker.
|
||||
|
||||
If you followed all that, you’ll see that when an executable accesses a TLS
|
||||
variable which is defined in that executable, it requires two instructions to
|
||||
compute the address, typically followed by another one to actually load or
|
||||
store the value. That is significantly more efficient than calling
|
||||
`pthread_getspecific`. Admittedly, when a shared library accesses a TLS
|
||||
variable, the result is not much better than `pthread_getspecific`, but it
|
||||
shouldn’t be any worse, either. And the code using `__thread` is much easier to
|
||||
write and to read.
|
||||
|
||||
That was a real whirlwind tour. There are three separate but related TLS
|
||||
implementations on i386 (known as sun, gnu, and gnu2), and 23 different
|
||||
relocation types are defined. I’m certainly not going to try to describe all
|
||||
the details; I don’t know them all in any case. They all exist in the name of
|
||||
efficient access to the TLS variables for a given storage model.
|
||||
|
||||
Is TLS worth the additional complexity in the program linker and the dynamic
|
||||
linker? Since those tools are used for every program, and since the C standard
|
||||
global variable `errno` in particular can be implemented using TLS, the answer
|
||||
is most likely yes.
|
||||
|
|
@ -0,0 +1,193 @@
|
|||
# Linkers part 8
|
||||
|
||||
## ELF Segments
|
||||
|
||||
Earlier I said that executable file formats were normally the same as object
|
||||
file formats. That is true for ELF, but with a twist. In ELF, object files are
|
||||
composed of sections: all the data in the file is accessed via the section
|
||||
table. Executables and shared libraries normally contain a section table, which
|
||||
is used by programs like `nm`. But the operating system and the dynamic linker
|
||||
do not use the section table. Instead, they use the segment table, which
|
||||
provides an alternative view of the file.
|
||||
|
||||
All the contents of an ELF executable or shared library which are to be loaded
|
||||
into memory are contained within a segment (an object file does not have
|
||||
segments). A segment has a type, some flags, a file offset, a virtual address,
|
||||
a physical address, a file size, a memory size, and an alignment. The file
|
||||
offset points to a contiguous set of bytes which are the contents of the
|
||||
segment, the bytes to load into memory. When the operating system or the
|
||||
dynamic linker loads a file, it will do so by walking through the segments and
|
||||
loading them into memory (typically by using the mmap system call). All the
|
||||
information needed by the dynamic linker–the dynamic relocations, the dynamic
|
||||
symbol table, etc.–is accessed via information stored in special segments.
|
||||
|
||||
Although an ELF executable or shared library does not, strictly speaking,
|
||||
require any sections, it normally does have them. The contents of a loadable
|
||||
section will fall entirely within a single segment.
|
||||
|
||||
The program linker reads sections from the input object files. It sorts and
|
||||
concatenates them into sections in the output file. It maps all the loadable
|
||||
sections into segments in the output file. It lays out the section contents in
|
||||
the output file segments respecting alignment and access requirements, so that
|
||||
the segments may be mapped directly into memory. The sections are mapped to
|
||||
segments based on the access requirements: normally all the read-only sections
|
||||
are mapped to one segment and all the writable sections are mapped to another
|
||||
segment. The address of the latter segment will be set so that it starts on a
|
||||
separate page in memory, permitting `mmap` to set different permissions on the
|
||||
mapped pages.
|
||||
|
||||
The segment flags are a bitmask which defines access requirements. The defined
|
||||
flags are `PF_R`, `PF_W`, and `PF_X`, which mean, respectively, that the
|
||||
contents must be made readable, writable, or executable.
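
As a small sketch, the flags can be turned into the familiar `rwx` notation like this. The bit values are the standard ELF ones; `segment_flags_string` is a name invented for the example.

```c
#include <stdint.h>

enum { PF_X = 0x1, PF_W = 0x2, PF_R = 0x4 };

/* Render segment flags as a three-character string such as "r-x".
   The caller supplies room for three characters plus the NUL. */
static void segment_flags_string(uint32_t p_flags, char out[4])
{
    out[0] = (p_flags & PF_R) ? 'r' : '-';
    out[1] = (p_flags & PF_W) ? 'w' : '-';
    out[2] = (p_flags & PF_X) ? 'x' : '-';
    out[3] = '\0';
}
```

A typical code segment would come out as `r-x` and a typical data segment as `rw-`.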
|
||||
|
||||
The segment virtual address is the memory address at which the segment contents
|
||||
are loaded at runtime. The physical address is officially undefined, but is
|
||||
often used as the load address when using a system which does not use virtual
|
||||
memory. The file size is the size of the contents in the file. The memory size
|
||||
may be larger than the file size when the segment contains uninitialized data;
|
||||
the extra bytes will be filled with zeroes. The alignment of the segment is
|
||||
mainly informative, as the address is already specified.
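
The relationship between file size and memory size is easiest to see in code. The following sketch copies a segment into a caller-supplied buffer instead of using `mmap`, purely to keep the example self-contained. The `struct segment` here is a trimmed-down, hypothetical stand-in for a real program header, and `load_segment` is not any loader's actual routine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Only the program header fields this example needs. */
struct segment {
    uint32_t offset;  /* file offset of the contents */
    uint32_t filesz;  /* bytes present in the file */
    uint32_t memsz;   /* bytes occupied in memory */
};

/* Copy the file-backed bytes to dest, then zero-fill up to memsz.
   dest must have room for memsz bytes. Returns 0 on success, or -1
   if the header is inconsistent with the file. */
static int load_segment(const uint8_t *file, size_t file_len,
                        const struct segment *seg, uint8_t *dest)
{
    if (seg->filesz > seg->memsz)
        return -1;
    if (seg->offset > file_len || seg->filesz > file_len - seg->offset)
        return -1;
    memcpy(dest, file + seg->offset, seg->filesz);
    memset(dest + seg->filesz, 0, seg->memsz - seg->filesz);
    return 0;
}
```

The zero-filled tail is where uninitialized data ends up, which is why it takes no space in the file.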
|
||||
|
||||
The ELF segment types are as follows:
|
||||
|
||||
* `PT_NULL`: A null entry in the segment table, which is ignored.
|
||||
* `PT_LOAD`: A loadable entry in the segment table. The operating system or
|
||||
dynamic linker loads all segments of this type. All other segments with
|
||||
contents will have their contents contained completely within a `PT_LOAD`
|
||||
segment.
|
||||
* `PT_DYNAMIC`: The dynamic segment. This points to a series of dynamic tags
|
||||
which the dynamic linker uses to find the dynamic symbol table, dynamic
|
||||
relocations, and other information that it needs.
|
||||
* `PT_INTERP`: The interpreter segment. This appears in an executable. The
|
||||
operating system uses it to find the name of the dynamic linker to run for
|
||||
the executable. Normally all executables will have the same interpreter name,
|
||||
but on some operating systems different interpreters are used in different
|
||||
emulation modes.
|
||||
* `PT_NOTE`: A note segment. This contains system dependent note information
|
||||
which may be used by the operating system or the dynamic linker. On
|
||||
GNU/Linux systems shared libraries often have an ABI tag note which may be
|
||||
used to specify the minimum version of the kernel which is required for the
|
||||
shared library. The dynamic linker uses this when selecting among different
|
||||
shared libraries.
|
||||
* `PT_SHLIB`: This is not used as far as I know.
|
||||
* `PT_PHDR`: This indicates the address and size of the segment table. This is
|
||||
not too useful in practice as you have to have already found the segment
|
||||
table before you can find this segment.
|
||||
* `PT_TLS`: The TLS segment. This holds the initial values for TLS variables.
|
||||
* `PT_GNU_EH_FRAME` (`0x6474e550`): A GNU extension used to hold a sorted table
|
||||
of unwind information. This table is built by the GNU program linker. It is
|
||||
used by gcc’s support library to quickly find the appropriate handler for an
|
||||
exception, without requiring exception frames to be registered when the
|
||||
program starts.
|
||||
* `PT_GNU_STACK` (`0x6474e551`): A GNU extension used to indicate whether the
|
||||
stack should be executable. This segment has no contents. The dynamic linker
|
||||
sets the permission of the stack in memory to the permissions of this segment.
|
||||
* `PT_GNU_RELRO` (`0x6474e552`): A GNU extension which tells the dynamic linker
|
||||
to set the given address and size to be read-only after applying dynamic
|
||||
relocations. This is used for const variables which require dynamic
|
||||
relocations.
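
As one example of how a segment type is consumed, here is a hedged sketch of the `PT_GNU_STACK` check: walk the segment table, find the entry, and look at its execute flag. The `struct phdr` below keeps only two fields and is not the real program header layout; `stack_wants_exec` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

enum { PT_GNU_STACK = 0x6474e551, PF_X = 0x1 };

/* Trimmed model of a segment table entry: type and flags only. */
struct phdr {
    uint32_t type;
    uint32_t flags;
};

/* Returns 1 if the file asks for an executable stack, 0 if it asks
   for a non-executable stack, and -1 if there is no PT_GNU_STACK
   segment, in which case the choice is left to the system. */
static int stack_wants_exec(const struct phdr *ph, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ph[i].type == PT_GNU_STACK)
            return (ph[i].flags & PF_X) ? 1 : 0;
    }
    return -1;
}
```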
|
||||
|
||||
## ELF Sections
|
||||
|
||||
Now that we’ve done segments, let’s take a quick look at the details of ELF
|
||||
sections. ELF sections are more complicated than segments, in that there are
|
||||
more types of sections. Every ELF object file, and most ELF executables and
|
||||
shared libraries, have a table of sections. The first entry in the table,
|
||||
section 0, is always a null section.
|
||||
|
||||
ELF sections have several fields.
|
||||
|
||||
* Name.
|
||||
* Type. I discuss section types below.
|
||||
* Flags. I discuss section flags below.
|
||||
* Address. This is the address of the section. In an object file this is
|
||||
normally zero. In an executable or shared library it is the virtual address.
|
||||
Since executables are normally accessed via segments, this is essentially
|
||||
documentation.
|
||||
* File offset. This is the offset of the contents within the file.
|
||||
* Size. The size of the section.
|
||||
* Link. Depending on the section type, this may hold the index of another
|
||||
section in the section table.
|
||||
* Info. The meaning of this field depends on the section type.
|
||||
* Address alignment. This is the required alignment of the section. The program
|
||||
linker uses this when laying out the section in memory.
|
||||
* Entry size. For sections which hold an array of data, this is the size of one
|
||||
data element.
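
Laid out as a C structure, the fields above look roughly like this for 32-bit ELF. The field order and widths follow the ELF specification, and the names use the conventional `sh_` prefix; the structure name and the `section_entry_count` helper are made up for this sketch.

```c
#include <stdint.h>

/* A 32-bit ELF section header: ten 32-bit words, 40 bytes in all. */
struct section_header32 {
    uint32_t sh_name;       /* name, as an offset into a string table */
    uint32_t sh_type;       /* one of the SHT_ values */
    uint32_t sh_flags;      /* a mask of SHF_ bits */
    uint32_t sh_addr;       /* address */
    uint32_t sh_offset;     /* file offset of the contents */
    uint32_t sh_size;       /* size of the section */
    uint32_t sh_link;       /* index of a related section, if any */
    uint32_t sh_info;       /* meaning depends on the type */
    uint32_t sh_addralign;  /* required alignment */
    uint32_t sh_entsize;    /* size of one element, or 0 */
};

/* For sections which hold an array, the number of elements. */
static uint32_t section_entry_count(const struct section_header32 *sh)
{
    return sh->sh_entsize ? sh->sh_size / sh->sh_entsize : 0;
}
```

For instance, a 32-bit symbol table section of 160 bytes with an entry size of 16 holds 10 symbols.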
|
||||
|
||||
These are the types of ELF sections which the program linker may see.
|
||||
|
||||
* `SHT_NULL`: A null section. Sections with this type may be ignored.
|
||||
* `SHT_PROGBITS`: A section holding bits of the program. This is an ordinary
|
||||
section with contents.
|
||||
* `SHT_SYMTAB`: The symbol table. This section actually holds the symbol table
|
||||
itself. The section contents are an array of ELF symbol structures.
|
||||
* `SHT_STRTAB`: A string table. This type of section holds null-terminated
|
||||
strings. Sections of this type are used for the names of the symbols and the
|
||||
names of the sections themselves.
|
||||
* `SHT_RELA`: A relocation table. The link field holds the index of the section
|
||||
to which these relocations apply. These relocations include addends.
|
||||
* `SHT_HASH`: A hash table used by the dynamic linker to speed symbol lookup.
|
||||
* `SHT_DYNAMIC`: The dynamic tags used by the dynamic linker. Normally the
|
||||
`PT_DYNAMIC` segment and the `SHT_DYNAMIC` section will point to the same
|
||||
contents.
|
||||
* `SHT_NOTE`: A note section. This is used in system dependent ways. A loadable
|
||||
`SHT_NOTE` section will become a `PT_NOTE` segment.
|
||||
* `SHT_NOBITS`: A section which takes up memory space but has no associated
|
||||
contents. This is used for zero-initialized data.
|
||||
* `SHT_REL`: A relocation table, like `SHT_RELA` but the relocations have no
|
||||
addends.
|
||||
* `SHT_SHLIB`: This is not used as far as I know.
|
||||
* `SHT_DYNSYM`: The dynamic symbol table. Normally the `DT_SYMTAB` dynamic tag
|
||||
will point to the same contents as this section (I haven’t discussed dynamic
|
||||
tags yet, though).
|
||||
* `SHT_INIT_ARRAY`: This section holds a table of function addresses which
|
||||
should each be called at program startup time, or, for a shared library, when
|
||||
the library is opened by `dlopen`.
|
||||
* `SHT_FINI_ARRAY`: Like `SHT_INIT_ARRAY`, but called at program exit time or
|
||||
`dlclose` time.
|
||||
* `SHT_PREINIT_ARRAY`: Like `SHT_INIT_ARRAY`, but called before any shared
|
||||
libraries are initialized. Normally shared library initializers are run
|
||||
before the executable initializers. This section type may only be linked into
|
||||
an executable, not into a shared library.
|
||||
* `SHT_GROUP`: This is used to group related sections together, so that the
|
||||
program linker may discard them as a unit when appropriate. Sections of this
|
||||
type may only appear in object files. The contents of this type of section
|
||||
are a flag word followed by a series of section indices.
|
||||
* `SHT_SYMTAB_SHNDX`: ELF symbol table entries only provide a 16-bit field for
|
||||
the section index. For a file with more than 65536 sections, a section of
|
||||
this type is created. It holds one 32-bit word for each symbol. If a symbol’s
|
||||
section index is `SHN_XINDEX`, the real section index may be found by looking
|
||||
in the `SHT_SYMTAB_SHNDX` section.
|
||||
* `SHT_GNU_LIBLIST` (`0x6ffffff7`): A GNU extension used by the prelinker to
|
||||
hold a list of libraries found by the prelinker.
|
||||
* `SHT_GNU_verdef` (`0x6ffffffd`): A Sun and GNU extension used to hold version
|
||||
definitions (I’ll talk about symbol versions at some point).
|
||||
* `SHT_GNU_verneed` (`0x6ffffffe`): A Sun and GNU extension used to hold
|
||||
versions required from other shared libraries.
|
||||
* `SHT_GNU_versym` (`0x6fffffff`): A Sun and GNU extension used to hold the
|
||||
versions for each symbol.
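
The `SHT_SYMTAB_SHNDX` escape mechanism described in the list above amounts to one conditional. A minimal sketch, assuming the extended index table has already been read into an array of 32-bit words (`real_section_index` is a hypothetical helper name):

```c
#include <stddef.h>
#include <stdint.h>

enum { SHN_XINDEX = 0xffff };

/* Return the real section index for symbol number sym_index.
   shndx_table is the contents of the SHT_SYMTAB_SHNDX section, one
   word per symbol, or NULL if the file has no such section. */
static uint32_t real_section_index(uint16_t st_shndx, size_t sym_index,
                                   const uint32_t *shndx_table)
{
    if (st_shndx == SHN_XINDEX && shndx_table != NULL)
        return shndx_table[sym_index];
    return st_shndx;
}
```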
|
||||
|
||||
These are the types of section flags.
|
||||
|
||||
* `SHF_WRITE`: Section contains writable data.
|
||||
* `SHF_ALLOC`: Section contains data which should be part of the loaded program
|
||||
image. For example, this would normally be set for a `SHT_PROGBITS` section
|
||||
and not set for a `SHT_SYMTAB` section.
|
||||
* `SHF_EXECINSTR`: Section contains executable instructions.
|
||||
* `SHF_MERGE`: Section contains constants which the program linker may merge
|
||||
together to save space. The compiler can use this type of section for
|
||||
read-only data whose address is unimportant.
|
||||
* `SHF_STRINGS`: In conjunction with `SHF_MERGE`, this means that the section
|
||||
holds null terminated string constants which may be merged.
|
||||
* `SHF_INFO_LINK`: This flag indicates that the info field in the section holds
|
||||
a section index.
|
||||
* `SHF_LINK_ORDER`: This flag tells the program linker that when it combines
|
||||
sections, this section must appear in the same relative order as the section
|
||||
in the link field. This can be used to ensure that address tables are built
|
||||
in the expected order.
|
||||
* `SHF_OS_NONCONFORMING`: If the program linker sees a section with this flag,
|
||||
and does not understand the type or all other flags, then it must issue an
|
||||
error.
|
||||
* `SHF_GROUP`: This section appears in a group (see `SHT_GROUP`, above).
|
||||
* `SHF_TLS`: This section holds TLS data.
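
When dumping a section table it is handy to show the flags as a short string of letters. The sketch below does that with the standard bit values for the flags listed above; the letters mirror the key that `readelf` prints, but the function itself is only an illustration, not `readelf`'s code.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag bits in ascending order, each with a one-letter abbreviation. */
static const struct { uint32_t bit; char letter; } flag_letters[] = {
    {0x001, 'W'},  /* SHF_WRITE */
    {0x002, 'A'},  /* SHF_ALLOC */
    {0x004, 'X'},  /* SHF_EXECINSTR */
    {0x010, 'M'},  /* SHF_MERGE */
    {0x020, 'S'},  /* SHF_STRINGS */
    {0x040, 'I'},  /* SHF_INFO_LINK */
    {0x080, 'L'},  /* SHF_LINK_ORDER */
    {0x100, 'O'},  /* SHF_OS_NONCONFORMING */
    {0x200, 'G'},  /* SHF_GROUP */
    {0x400, 'T'},  /* SHF_TLS */
};

/* Write one letter per set flag into out, followed by a NUL.
   out needs room for ten letters plus the terminator. */
static void section_flags_string(uint32_t sh_flags, char out[11])
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof flag_letters / sizeof flag_letters[0]; i++) {
        if (sh_flags & flag_letters[i].bit)
            out[n++] = flag_letters[i].letter;
    }
    out[n] = '\0';
}
```

A typical code section has `SHF_ALLOC` and `SHF_EXECINSTR` set and so prints as `AX`; a mergeable string section prints as `AMS`.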
|
||||
|
|
@ -0,0 +1,104 @@
|
|||
# Linkers part 9
|
||||
|
||||
## Symbol Versions
|
||||
|
||||
A shared library provides an API. Since executables are built with a specific
|
||||
set of header files and linked against a specific instance of the shared
|
||||
library, it also provides an ABI. It is desirable to be able to update the
|
||||
shared library independently of the executable. This permits fixing bugs in the
|
||||
shared library, and it also permits the shared library and the executable to be
|
||||
distributed separately. Sometimes an update to the shared library requires
|
||||
changing the API, and sometimes changing the API requires changing the ABI.
|
||||
When the ABI of a shared library changes, it is no longer possible to update
|
||||
the shared library without updating the executable. This is unfortunate.
|
||||
|
||||
For example, consider the system C library and the `stat` function. When file
|
||||
systems were upgraded to support 64-bit file offsets, it became necessary to
|
||||
change the type of some of the fields in the stat struct. This is a change in
|
||||
the ABI of `stat`. New versions of the system library should provide a `stat`
|
||||
which returns 64-bit values. But old existing executables call `stat` expecting
|
||||
32-bit values. This could be addressed by using complicated macros in the
|
||||
system header files. But there is a better way.
|
||||
|
||||
The better way is symbol versions, which were introduced at Sun and extended by
|
||||
the GNU tools. Every shared library may define a set of symbol versions, and
|
||||
assign specific versions to each defined symbol. The versions and symbol
|
||||
assignments are done by a script passed to the program linker when creating the
|
||||
shared library.
|
||||
|
||||
When an executable or shared library A is linked against another shared library
|
||||
B, and A refers to a symbol S defined in B with a specific version, the
|
||||
undefined dynamic symbol reference S in A is given the version of the symbol S
|
||||
in B. When the dynamic linker sees that A refers to a specific version of S, it
|
||||
will link it to that specific version in B. If B later introduces a new version
|
||||
of S, this will not affect A, as long as B continues to provide the old version
|
||||
of S.
|
||||
|
||||
For example, when `stat` changes, the C library would provide two versions of
|
||||
`stat`, one with the old version (e.g., `LIBC_1.0`), and one with the new version
(`LIBC_2.0`). The new version of `stat` would be marked as the default: the
program linker would use it to satisfy references to `stat` in object files.
|
||||
Executables linked against the old version would require the `LIBC_1.0` version
|
||||
of `stat`, and would therefore continue to work. Note that it is even possible
|
||||
for both versions of `stat` to be used in a single program, accessed from
|
||||
different shared libraries.
|
||||
|
||||
As you can see, the version effectively is part of the name of the symbol. The
|
||||
biggest difference is that a shared library can define a specific version which
|
||||
is used to satisfy an unversioned reference.
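
The lookup described above can be modelled with a small table. This is only a sketch of the idea, not how any dynamic linker is implemented: the `versioned_def` structure, the `resolve` function, and the `impl_id` field (standing in for a symbol address) are all invented for illustration, and the version names are the ones from the example in the text.

```c
#include <stddef.h>
#include <string.h>

struct versioned_def {
    const char *name;
    const char *version;
    int is_default;   /* non-zero: satisfies unversioned references */
    int impl_id;      /* stand-in for the address of this definition */
};

/* Hypothetical contents of the C library from the stat example. */
static const struct versioned_def libc_defs[] = {
    { "stat", "LIBC_1.0", 0, 1 },   /* old stat, 32-bit fields */
    { "stat", "LIBC_2.0", 1, 2 },   /* new stat, 64-bit fields, default */
};

/* Resolve a reference. A NULL version means an unversioned reference,
   which is satisfied by the default version. Returns NULL if the
   library has no matching definition. */
const struct versioned_def *resolve(const char *name, const char *version)
{
    for (size_t i = 0; i < sizeof libc_defs / sizeof libc_defs[0]; i++) {
        const struct versioned_def *d = &libc_defs[i];
        if (strcmp(d->name, name) != 0)
            continue;
        if (version == NULL ? d->is_default
                            : strcmp(d->version, version) == 0)
            return d;
    }
    return NULL;
}
```

In this model an old executable that recorded `LIBC_1.0` keeps getting the first entry, while a fresh link with no recorded version picks up the second.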
|
||||
|
||||
Versions can also be used in an object file (this is a GNU extension to the
|
||||
original Sun implementation). This is useful for specifying versions without
|
||||
requiring a version script. When a symbol name contains the `@` character, the
|
||||
string before the `@` is the name of the symbol, and the string after the `@`
|
||||
is the version. If there are two consecutive `@` characters, then this is the
|
||||
default version.
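
The naming convention is simple enough to decode by hand. As a rough sketch (the three function names are hypothetical, and real linkers handle more cases than this), splitting a name at the first `@` could look like:

```c
#include <stddef.h>
#include <string.h>

/* Length of the plain symbol name: everything before the first '@'. */
size_t symbol_name_length(const char *raw)
{
    const char *at = strchr(raw, '@');
    return at ? (size_t)(at - raw) : strlen(raw);
}

/* Pointer to the version string, or NULL if the name carries no version. */
const char *symbol_version(const char *raw)
{
    const char *at = strchr(raw, '@');
    if (at == NULL)
        return NULL;
    return at[1] == '@' ? at + 2 : at + 1;
}

/* Non-zero if the name uses "@@", which marks the default version. */
int symbol_is_default_version(const char *raw)
{
    const char *at = strchr(raw, '@');
    return at != NULL && at[1] == '@';
}
```

With these helpers, `stat@LIBC_1.0` decodes to the name `stat` with the non-default version `LIBC_1.0`, and `stat@@LIBC_2.0` to the same name with the default version `LIBC_2.0`.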
|
||||
|
||||
## Relaxation
|
||||
|
||||
Generally the program linker does not change the contents other than applying
|
||||
relocations. However, there are some optimizations which the program linker can
|
||||
perform at link time. One of them is relaxation.
|
||||
|
||||
Relaxation is inherently processor specific. It consists of optimizing code
|
||||
sequences which can become smaller or more efficient when final addresses are
|
||||
known. The most common type of relaxation is for `call` instructions. A
|
||||
processor like the m68k supports different PC relative `call` instructions: one
|
||||
with a 16-bit offset, and one with a 32-bit offset. When calling a function
|
||||
which is within range of the 16-bit offset, it is more efficient to use the
|
||||
shorter instruction. The optimization of shrinking these instructions at link
|
||||
time is known as relaxation.
|
||||
|
||||
Relaxation is applied based on relocation entries. The linker looks for
|
||||
relocations which may be relaxed, and checks whether they are in range. If they
|
||||
are, the linker applies the relaxation, probably shrinking the size of the
|
||||
contents. The relaxation can normally only be done when the linker recognizes
|
||||
the instruction being relocated. Applying a relaxation may in turn bring other
|
||||
relocations within range, so relaxation is typically done in a loop until there
|
||||
are no more opportunities.
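
The loop can be illustrated with a toy model. Everything here is an assumption made for the sketch: the instruction sizes, the `call_site` structure, and measuring the offset from the start of the call instruction (a real processor usually measures from some later PC value). The point is only that shrinking one call can bring another within range, so the scan repeats until nothing changes.

```c
#include <stddef.h>

#define LONG_CALL  6   /* hypothetical size of a call with a 32-bit offset */
#define SHORT_CALL 4   /* hypothetical size of a call with a 16-bit offset */

struct call_site {
    long gap_before;   /* bytes of unrelated code preceding this call */
    int target;        /* index of the call site whose address is called */
    int size;          /* current encoding size of this call */
};

/* Address of call site `index`, recomputed from the current sizes. */
static long address_of(const struct call_site *calls, int index)
{
    long addr = 0;
    for (int i = 0; i < index; i++)
        addr += calls[i].gap_before + calls[i].size;
    return addr + calls[index].gap_before;
}

/* Total size of the modelled contents. */
long section_size(const struct call_site *calls, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += calls[i].gap_before + calls[i].size;
    return total;
}

/* Shrink every call whose target fits a signed 16-bit offset, repeating
   until a pass finds nothing. Returns the number of passes that changed
   something. Sizes only ever decrease, so an earlier relaxation is never
   invalidated by a later one. */
int relax_calls(struct call_site *calls, int n)
{
    int productive_passes = 0;
    int changed;

    do {
        changed = 0;
        for (int i = 0; i < n; i++) {
            if (calls[i].size == SHORT_CALL)
                continue;
            long offset = address_of(calls, calls[i].target)
                        - address_of(calls, i);
            if (offset >= -32768 && offset <= 32767) {
                calls[i].size = SHORT_CALL;
                changed = 1;
            }
        }
        productive_passes += changed;
    } while (changed);
    return productive_passes;
}

/* Demo: call 0 targets call 2 and is initially 32768 bytes away. */
struct call_site demo_calls[3] = {
    { 0,     2, LONG_CALL },
    { 32756, 0, LONG_CALL },
    { 0,     0, LONG_CALL },
};
```

In the demo data the first call is just out of range on the first pass; it only becomes relaxable after the call between it and its target has shrunk, which is why a second pass is needed.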
|
||||
|
||||
When the linker relaxes a relocation in the middle of a contents, it may need
|
||||
to adjust any PC relative references which cross the point of the relaxation.
|
||||
Therefore, the assembler needs to generate relocation entries for all PC
|
||||
relative references. When not relaxing, these relocations may not be required,
|
||||
as a PC relative reference within a single contents will be valid wherever the
contents winds up. When relaxing, though, the linker needs to look through all
the other relocations that apply to the contents, and adjust PC relative ones
|
||||
where appropriate. This adjustment will simply consist of recomputing the PC
|
||||
relative offset.
|
||||
|
||||
Of course it is also possible to apply relaxations which do not change the size
|
||||
of the contents. For example, on the MIPS the position independent calling
|
||||
sequence is normally to load the address of the function into the `$25`
|
||||
register and then to do an indirect call through the register. When the target
|
||||
of the call is within the 18-bit range of the branch-and-call instruction, it
|
||||
is normally more efficient to use branch-and-call, since then the processor
|
||||
does not have to wait for the load of `$25` to complete before starting the
|
||||
call. This relaxation changes the instruction sequence without changing the
|
||||
size.
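
The range test behind such a relaxation is a one-liner. The sketch below assumes that the 18-bit range comes from a signed 16-bit field counted in 4-byte instruction words, and that the offset passed in is already measured from whatever address the processor uses as the branch base; `can_use_branch_and_call` is a hypothetical name.

```c
#include <stdbool.h>

/* A signed 16-bit word offset covers an 18-bit signed byte range. */
#define BAL_MIN_OFFSET (-(1L << 17))
#define BAL_MAX_OFFSET ((1L << 17) - 4)

/* True if a target at byte_offset can be reached by branch-and-call. */
bool can_use_branch_and_call(long byte_offset)
{
    if (byte_offset % 4 != 0)      /* instructions are word aligned */
        return false;
    return byte_offset >= BAL_MIN_OFFSET && byte_offset <= BAL_MAX_OFFSET;
}
```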
|
||||
|
||||
More tomorrow. I apologize for the haphazard arrangement of these linker notes.
|
||||
I’m just writing about ideas as I think of them, rather than being organized
|
||||
about that. If I do collect these notes into an essay, I’ll try to make them
|
||||
more structured.
|
||||
|
|
@ -0,0 +1,49 @@
|
|||
# Piece of PIE
|
||||
|
||||
Modern ELF systems can randomize the address at which shared libraries are
|
||||
loaded. This is generally referred to as Address Space Layout Randomization, or
|
||||
ASLR. Shared libraries are always position independent, which means that they
|
||||
can be loaded at any address. Randomizing the load address makes it slightly
|
||||
harder for attackers of a running program to exploit buffer overflows or
|
||||
similar problems, because they have no fixed addresses that they can rely on.
|
||||
ASLR is part of defense in depth: it does not by itself prevent any attacks,
|
||||
but it makes it slightly more difficult for attackers to exploit certain kinds
|
||||
of programming errors in a useful way beyond simply crashing the program.
|
||||
|
||||
Although it is straightforward to randomize the load address of a shared
|
||||
library, an ELF executable is normally linked to run at a fixed address that
|
||||
can not be changed. This means that attackers have a set of fixed addresses
|
||||
they can rely on. Permitting the kernel to randomize the address of the
|
||||
executable itself is done by generating a Position Independent Executable, or
|
||||
PIE.
|
||||
|
||||
It turns out to be quite simple to create a PIE: a PIE is simply an executable
|
||||
shared library. To make a shared library executable you just need to give it a
|
||||
`PT_INTERP` segment and appropriate startup code. The startup code can be the
|
||||
same as the usual executable startup code, though of course it must be compiled
|
||||
to be position independent.
|
||||
|
||||
When compiling code to go into a shared library, you use the `-fpic` option.
|
||||
When compiling code to go into a PIE, you use the `-fpie` option. Since a PIE
|
||||
is just a shared library, these options are almost exactly the same. The only
|
||||
difference is that since `-fpie` implies that you are building the main
|
||||
executable, there is no need to support symbol interposition for defined
|
||||
symbols. In a shared library, if function `f1` calls `f2`, and `f2` is globally
|
||||
visible, the code has to consider the possibility that `f2` will be interposed.
|
||||
Thus, the call must go through the PLT. In a PIE, `f2` can not be interposed,
|
||||
so the call may be made directly, though of course still in a position
|
||||
independent manner. Similarly, if the processor can do PC-relative loads and
|
||||
stores, all global variables can be accessed directly rather than going through
|
||||
the GOT.
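
A small experiment makes the difference visible. The function names `f1` and `f2` come from the text; the `counter` variable is added here for illustration. Compiling this file to assembly once with `-fpic` and once with `-fpie` and comparing the output should show the effect on a typical GNU/Linux system, though the exact instructions depend on the compiler version, target, and optimization level.

```c
/* Globally visible variable: under -fpic an access may go through the GOT. */
int counter = 0;

/* Globally visible function: in a shared library it could be interposed. */
int f2(int x)
{
    counter++;
    return x * 2;
}

/* Under -fpic the compiler has to assume f2 may be interposed, so this
   call would normally go through the PLT. Under -fpie the definition
   above is known to be the one used, so the call can be direct. */
int f1(int x)
{
    return f2(x) + 1;
}
```

The behaviour of the program is the same either way; only the generated access sequences differ.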
|
||||
|
||||
Other than that ability to avoid the PLT and GOT in some cases, a PIE is really
|
||||
just a shared library. The dynamic linker will ask the kernel to map it at a
|
||||
random address and will then relocate it as usual.
|
||||
|
||||
This does imply that a PIE must be dynamically linked, in the sense of using
|
||||
the dynamic linker. Since the dynamic linker and the C library are closely
|
||||
intertwined, linking the PIE statically with the C library is unlikely to work
|
||||
in general. It is possible to design a statically linked PIE, in which the
|
||||
program relocates itself at startup time. The dynamic linker itself does this.
|
||||
However, there is no general mechanism for this at present.
|
||||
|
|
@ -0,0 +1,91 @@
|
|||
# Protected symbols
|
||||
|
||||
Now for something really controversial: what’s wrong with protected symbols?
|
||||
|
||||
In an ELF shared library, an ordinary global symbol may be overridden if a
|
||||
symbol of the same name is defined in the executable or in a shared library
|
||||
which appears earlier in the runtime search path. This is called symbol
|
||||
interposition. It is often used with functions such as `malloc`. A shared
|
||||
library can define `malloc` and it can have code which calls `malloc`. If the
|
||||
executable linked with the shared library defines `malloc` itself, then the
|
||||
version in the executable will be used rather than the version in the shared
|
||||
library. This permits the executable to control the memory allocation done by
|
||||
the shared library, perhaps for debugging or logging purposes. In this regard,
|
||||
shared libraries act much as static archives do.
|
||||
|
||||
This has a few consequences. One of them is that within a shared library, all
|
||||
references to a global symbol must use the GOT and PLT, to make the overriding
|
||||
possible. That means that all function calls and variable accesses are slightly
|
||||
slower. Also, some compiler optimizations are forbidden: the compiler can not
|
||||
inline a call to a global symbol, since that symbol might be overridden at run
|
||||
time.
|
||||
|
||||
When building a shared library, you can provide a version script which
|
||||
indicates that some symbols are actually not global. That can eliminate the GOT
|
||||
and PLT accesses, but it does not permit the compiler optimizations, and you do
|
||||
have to write that version script and keep it up to date.
|
||||
|
||||
When compiling code that goes into a shared library, you can set the visibility
|
||||
of symbols. You can use hidden visibility, which means that the symbol is not
|
||||
visible outside the shared library. You can use internal visibility, which is a
|
||||
lot like hidden—I’ll skip the difference here. Or you can use protected
|
||||
visibility. Protected visibility means that the symbol is visible outside of
|
||||
the shared library, and can be accessed as usual. However, all references from
|
||||
within the shared library will use the definition in the shared library. In
|
||||
other words, the symbol acts more or less as usual, but it can not be
|
||||
overridden. This means that accesses to the symbol avoid the GOT and PLT, and
|
||||
it permits compiler optimizations.
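
With GCC and compatible compilers, visibility can be set per symbol with the `visibility` attribute. The function names below are invented for this sketch; only the attribute itself is the compiler's.

```c
/* Hidden: usable anywhere inside the shared library, invisible outside. */
__attribute__((visibility("hidden")))
int helper_hidden(int x)
{
    return x + 1;
}

/* Protected: visible outside the library, but references from inside
   the library always bind to this definition. */
__attribute__((visibility("protected")))
int api_protected(int x)
{
    return helper_hidden(x) * 2;
}

/* Default: an ordinary global symbol, which may be interposed. */
__attribute__((visibility("default")))
int api_default(int x)
{
    return api_protected(x) + 1;
}
```

The attributes do not change what the functions compute; they change how references to them are bound, which is what the rest of this note is about.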
|
||||
|
||||
So, what’s wrong with them? It turns out that protected symbols are slower at
|
||||
dynamic link time, which means that programs which use the shared library start
|
||||
up slower. This happens because of the C rule that two pointers to the same
|
||||
function must compare as equal. Since protected symbols are globally visible,
|
||||
you can get a pointer to a protected function in the main executable. You can
|
||||
also get a pointer to that same function in the shared library, of course.
|
||||
Those pointers have to be equal, or the C rule will break.
|
||||
|
||||
As noted, the access to the function in the shared library will not use the GOT
|
||||
or PLT. The access in the main executable obviously will use the PLT. How can
|
||||
we make those function pointers equal? We can’t. The executable will have a
|
||||
direct reference to the PLT. The shared library will have a direct reference to
|
||||
the function itself. In neither case will there be a relocation for the
|
||||
reference. So there is no way to make the results equal. (This can work for
|
||||
some targets, but not for ones with simple function references like the x86
|
||||
targets.)
|
||||
|
||||
So, I must have lied. The lie was that there is a case where you need to use
|
||||
the GOT for a protected symbol: when compiling position independent code for a
|
||||
shared library, and taking the address of a protected function, you need to use
|
||||
the GOT. Unfortunately, gcc for the x86_64 target, surely the most widely used
|
||||
gcc target today, gets this wrong: http://gcc.gnu.org/PR19520. This generally
|
||||
reveals itself as an error report when you go to create a shared library:
|
||||
relocation R_X86_64_PC32 against protected symbol `NAME` can not be used when
|
||||
making a shared object.
|
||||
|
||||
In any case, when the compiler gets it right, the dynamic linker has to fill in
|
||||
that GOT entry. In order to make the function pointers compare as equal, it has
|
||||
to fill in the entry with the address of the PLT in the executable (or the
|
||||
earlier shared library). But remember, this is a protected symbol, and
|
||||
protected symbols don’t support symbol interposition. So the dynamic linker
|
||||
must only use the PLT of the executable if the reference in the executable
|
||||
refers to the definition in the shared library. That means that when the
|
||||
dynamic linker sees a reloc against a protected symbol in a shared library, it
|
||||
has to do another walk through the executable and earlier shared libraries to
|
||||
see if any of them have a definition for the symbol, in which case the GOT
|
||||
entry must not be set to that earlier PLT entry but must instead be set to the
|
||||
address of the symbol in the shared library itself. This check has to be done
|
||||
for every symbol in the shared library.
|
||||
|
||||
Those extra symbol resolution passes mean a slowdown for every program which
|
||||
uses the shared library, and that is what is wrong with protected symbols.
|
||||
|
||||
So how do you get the compiler and linker speedups available by avoiding symbol
|
||||
interpositioning? Unfortunately, you have to give your symbols hidden
|
||||
visibility, which means that they can not be accessed from other modules.
|
||||
Assuming you do want them to be accessed, you need to define symbol aliases for
|
||||
the ones which should be publicly visible. That means that you need to use
|
||||
different names for the hidden symbols. This is awkward at best. Unfortunately
|
||||
I have nothing better to offer. ELF is designed to support symbol
|
||||
interpositioning, and there is no very good way to avoid that without causing
|
||||
other consequences.
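
The hidden-symbol-plus-alias arrangement might look like the following with GCC on an ELF target. The names are hypothetical, and this is a sketch of the pattern rather than a recommendation: the library's own code calls the hidden name, while the public name is an alias for the same code.

```c
/* Hidden implementation: calls to it from inside the library bind
   directly, with no PLT entry and no possibility of interposition. */
__attribute__((visibility("hidden")))
int mylib_frob_internal(int x)
{
    return x * 3;
}

/* Public name for other modules: an alias for the same function.
   The alias attribute requires the target to be defined in this file. */
int mylib_frob(int x) __attribute__((alias("mylib_frob_internal")));

/* Code inside the library uses the hidden name. */
int mylib_caller(int x)
{
    return mylib_frob_internal(x) + 1;
}
```

The awkwardness mentioned above is visible even at this size: every exported function needs two names, and the library's own code has to remember to use the internal one.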
|
||||
|
|
@ -0,0 +1,120 @@
|
|||
# Version Scripts
|
||||
|
||||
I recently spent some time sorting through linker version script issues, so I’m
|
||||
going to document what I discovered.
|
||||
|
||||
Linker symbol versioning was invented at Sun. The Solaris linker lets you use a
|
||||
version script when you create a shared library. This script assigns versions
|
||||
to specific named symbols, and defines a version hierarchy. When an executable
|
||||
is linked against the shared library, the versions that it uses are recorded in
|
||||
the executable. If you later try to dynamically link the executable with a
|
||||
shared library which does not provide the required versions, you get a sensible
|
||||
error message.
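
The check amounts to comparing two lists of version names. The sketch below models it with NULL-terminated arrays; the function name and the version names are made up for illustration and are not taken from any real library.

```c
#include <stddef.h>
#include <string.h>

/* Returns the first version the executable requires that the library
   does not provide, or NULL if every requirement is met. Both arrays
   are NULL-terminated. */
const char *first_missing_version(const char *const *required,
                                  const char *const *provided)
{
    for (size_t i = 0; required[i] != NULL; i++) {
        int found = 0;
        for (size_t j = 0; provided[j] != NULL; j++) {
            if (strcmp(required[i], provided[j]) == 0) {
                found = 1;
                break;
            }
        }
        if (!found)
            return required[i];
    }
    return NULL;
}

/* Hypothetical data: versions recorded in an executable at link time,
   and the versions provided by an older and a newer library release. */
static const char *const exe_requires[] = { "MYLIB_1.1", "MYLIB_1.2", NULL };
static const char *const old_lib[] = { "MYLIB_1.1", NULL };
static const char *const new_lib[] = { "MYLIB_1.1", "MYLIB_1.2", "MYLIB_1.3", NULL };
```

Against `old_lib` the function reports `MYLIB_1.2` as missing, which is the information a sensible error message needs; against `new_lib` it returns NULL.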
|
||||
|
||||
Sun’s scheme (as I understand it) only permits you to add new versions and new
|
||||
symbols. Once a symbol has been defined at a specific version, you can not
|
||||
change that in later releases. If you change the behaviour of a symbol, you
don’t change the version of the symbol itself; instead you add a new version to
|
||||
the library even if it does not define any symbols. That is sufficient to
|
||||
ensure that an executable will not be dynamically linked against a version of
|
||||
the shared library which is too old.
|
||||
|
||||
Eric Youngdale and Ulrich Drepper introduced a more sophisticated symbol
|
||||
versioning scheme in the GNU linker and the GNU/Linux dynamic linker. The GNU
|
||||
linker permits symbols to have multiple versions, of which only one is the
|
||||
default. These versions are specified in the object files linked together to
|
||||
form the shared library. The assembler `.symver` directive is used to assign a
|
||||
version to a symbol (the version is simply encoded in the name of the symbol).
|
||||
This scheme permits using symbol versioning to actually change the behaviour of
|
||||
a symbol; older executables will continue to use the old version. This also
|
||||
permits deleting symbols, by removing the default version. The older versions
|
||||
of the symbol remain but are inaccessible.
|
||||
|
||||
That is all fine. The problems come in with the extensions to the version
|
||||
script language. First, the GNU linker permits wildcards in version scripts.
|
||||
Second, the GNU linker permits symbols to match against demangled names, again
|
||||
typically using wildcards. Third, the GNU linker permits the version script to
|
||||
hide symbols which have explicit versions in input object files.
|
||||
|
||||
Every symbol can only have one version. When the linker asks for the version of
|
||||
a symbol, there can only be one answer. The support for wildcards and matching
|
||||
of demangled names in GNU linker version scripts means that there may not be a
|
||||
unique answer for the version to use for a given name. The fact that the GNU
|
||||
linker permits version scripts to hide symbols with explicit versions means
|
||||
that in some cases you absolutely must list a symbol two times in a version
|
||||
script (because you might have a `local: *;` entry which must not match your
|
||||
symbol with an old version). This potential confusion means that using version
scripts correctly with wildcards requires a clear understanding of exactly how
|
||||
the linker parses a version script.
|
||||
|
||||
Unfortunately, this was never documented. Until now. Here are the rules which
|
||||
the GNU linker uses to parse version scripts, as of 2010-01-11.
|
||||
|
||||
The GNU linker walks through the version tags in the order in which they appear
|
||||
in the version script. For each tag, it first walks through the global patterns
|
||||
for that tag, then the local patterns. When looking at a single pattern, it
|
||||
first applies any language specific demangling as specified for the pattern,
|
||||
and then matches the resulting symbol name to the pattern. If it finds an exact
|
||||
match for a literal pattern (a pattern enclosed in quotes or with no wildcard
|
||||
characters), then that is the match that it uses. If it finds a match with a
|
||||
wildcard pattern, then it saves it and continues searching. Wildcard patterns
|
||||
that are exactly “*” are saved separately.
|
||||
|
||||
If no exact match with a literal pattern is ever found, then if a wildcard
|
||||
match with a global pattern was found it is used, otherwise if a wildcard match
|
||||
with a local pattern was found it is used.
|
||||
|
||||
This is the result:
|
||||
|
||||
* If there is an exact match, then we use the first tag in the version script
  where it matches.
  * If the exact match in that tag is global, it is used.
  * Otherwise the exact match in that tag is local, and is used.
* Otherwise, if there is any match with a global wildcard pattern:
  * If there is any match with a wildcard pattern which is not `*`, then we use
    the tag in which the last such pattern appears.
  * Otherwise, we matched `*`. If there is no match with a local wildcard
    pattern which is not `*`, then we use the last match with a global `*`.
    Otherwise, continue.
* Otherwise, if there is any match with a local wildcard pattern:
  * If there is any match with a wildcard pattern which is not `*`, then we use
    the tag in which the last such pattern appears.
  * Otherwise, we matched `*`, and we use the tag in which the last such match
    occurred.
|
||||
|
||||
As mentioned above, there is an additional wrinkle. When the GNU linker finds a
|
||||
symbol with a version defined in an object file due to a `.symver` directive, it
|
||||
looks up that symbol name in that version tag. If it finds it, it matches the
|
||||
symbol name against the patterns for that version. If there is no match with a
|
||||
global pattern, but there is a match with a local pattern, then the GNU linker
|
||||
marks the symbol as local.
|
||||
|
||||
I want gold to be compatible, but I also want gold to be efficient. I’ve
|
||||
introduced a hash table in gold to do fast lookups for exact matches. That
|
||||
makes it impossible for gold to follow the exact rules when matching demangled
|
||||
names. Currently gold does not do the final lookup to see if a symbol with an
|
||||
explicit version should be forced local; I don’t understand why that is useful.
|
||||
It is possible that I will be forced to add that to gold at some later date.
|
||||
|
||||
Here are the current rules for gold:
|
||||
|
||||
* If there is an exact match for the mangled name, we use it.
  * If there is more than one exact match, we give a warning, and we use the
    first tag in the script which matches.
  * If a symbol has an exact match as both global and local for the same
    version tag, we give an error.
* Otherwise, we look for an extern C++ or an extern Java exact match. If we
  find an exact match, we use it.
  * If there is more than one exact match, we give a warning, and we use the
    first tag in the script which matches.
  * If a symbol has an exact match as both global and local for the same
    version tag, we give an error.
* Otherwise, we look through the wildcard patterns, ignoring `*` patterns. We
  look through the version tags in reverse order. For each version tag, we look
  through the global patterns and then the local patterns. We use the first
  match we find (i.e., the last matching version tag in the file).
* Otherwise, we use the `*` pattern if there is one. We give a warning if there
  are multiple `*` patterns.
|
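
The gold rules above can be sketched in a few dozen lines. This is not gold's code: it leaves out the demangled (extern C++ and Java) step and the warnings and errors, it uses a linear scan where gold uses a hash table, and it uses POSIX `fnmatch` as a stand-in for the linker's glob matching. All structure and function names are invented, and picking the first `*` pattern in script order when there are several is an assumption of the sketch.

```c
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

struct pattern { const char *text; int is_local; };
struct version_tag { const char *name; const struct pattern *patterns; size_t count; };
struct match { const char *tag; int is_local; };   /* tag == NULL: no match */

static int is_star(const char *p) { return strcmp(p, "*") == 0; }
static int is_wildcard(const char *p) { return strpbrk(p, "*?[") != NULL; }

struct match gold_lookup(const struct version_tag *tags, size_t ntags,
                         const char *symbol)
{
    struct match none = { NULL, 0 };

    /* Exact matches on the mangled name: first tag in script order. */
    for (size_t t = 0; t < ntags; t++)
        for (size_t p = 0; p < tags[t].count; p++) {
            const struct pattern *pat = &tags[t].patterns[p];
            if (!is_wildcard(pat->text) && strcmp(pat->text, symbol) == 0) {
                struct match m = { tags[t].name, pat->is_local };
                return m;
            }
        }

    /* Wildcards other than "*": tags in reverse order, and within each
       tag the global patterns before the local ones. */
    for (size_t t = ntags; t-- > 0; )
        for (int local = 0; local <= 1; local++)
            for (size_t p = 0; p < tags[t].count; p++) {
                const struct pattern *pat = &tags[t].patterns[p];
                if (pat->is_local != local || !is_wildcard(pat->text)
                    || is_star(pat->text))
                    continue;
                if (fnmatch(pat->text, symbol, 0) == 0) {
                    struct match m = { tags[t].name, pat->is_local };
                    return m;
                }
            }

    /* Finally a bare "*" pattern, if the script has one. */
    for (size_t t = 0; t < ntags; t++)
        for (size_t p = 0; p < tags[t].count; p++)
            if (is_star(tags[t].patterns[p].text)) {
                struct match m = { tags[t].name, tags[t].patterns[p].is_local };
                return m;
            }
    return none;
}

/* Models: VERS_1 { global: foo; bar*; local: *; };
           VERS_2 { global: foo_new; ba*; };          */
static const struct pattern vers1_patterns[] = { { "foo", 0 }, { "bar*", 0 }, { "*", 1 } };
static const struct pattern vers2_patterns[] = { { "foo_new", 0 }, { "ba*", 0 } };
static const struct version_tag demo_script[] = {
    { "VERS_1", vers1_patterns, 3 },
    { "VERS_2", vers2_patterns, 2 },
};
```

In the demo script a symbol such as `bar_x` matches wildcards in both tags, and the reverse-order rule assigns it to `VERS_2`; a symbol that matches nothing else falls through to the local `*` in `VERS_1`.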
||||
|
||||
I hope for your sake that this information never actually matters to you.
|
||||
|