Tuesday, February 19, 2013

Linking and Loading on Windows


I had earlier jotted down a brief description of compilation, linking and loading on recent versions of Linux here. It's worth seeing briefly how the situation compares on Windows. This article presupposes that you have read either that post or the source materials listed there.

PE Format
In contrast to the ELF format used on Linux, Windows uses the PE format (for both programs and DLLs). This format has evolved since the Windows 95 days. PE and ELF essentially have a large feature set in common.

DLL files have separate export and import sections (.edata and .idata). These sections are used for symbol lookup. On Linux, by contrast, information on all non-hidden symbols is available in a single symbols section.


Preferred Load Address
Windows compilers generate binaries assuming a certain load address. This is not a problem for an exe, as the entire virtual address space is available at the time the exe is loaded. But for a DLL this is not the case, and the preferred load address may not be available. If the load address changes, the loader takes care of adding a 'correction offset' to every location in the instructions that references the address of data or a function.

If the preferred load address is available in the logical address space, there is nothing to relocate, so this process can be quite fast in the average case.

There is nothing similar on Linux.


Import Directory Table, Import Lookup Table & Import Address Table (IAT)
There is one Import Lookup Table and one Import Address Table per imported DLL, plus one entry per DLL in the Import Directory Table. A directory entry has info on the imported DLL and points to that DLL's lookup and address tables.

The Lookup Table and the Address Table are identical before the corresponding DLL is loaded. Both are arrays of entries containing an RVA (relative virtual address) that points to the name and ordinal hint of the referenced symbol. When the DLL is loaded, the loader overwrites the Address Table entries with the addresses of the imported symbols. (The IAT tables are located in the .idata section of the binary, and that is how the loader finds them.)

Note that this involves going over the Export Table entries in the imported DLL, matching names (the export tables are sorted by symbol name so that binary search can be used), finding the ordinal and fetching the address for each. This process is costly and may be done off-line, under the assumption of a preferred load address, in a process called 'binding'.

Calling a function or accessing data thus translates to an indirection through the IAT.
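The lookup-then-indirect mechanism can be modelled in plain C. This is a toy sketch with hypothetical names: the "loader" binary-searches a sorted export table for a symbol name and overwrites the importer's IAT slot with the resolved address, after which every call goes through the slot.

```c
/* Toy model of IAT binding. All names here are hypothetical. */
#include <stdlib.h>
#include <string.h>

typedef int (*fn_t)(int);

/* Two functions standing in for the DLL's exported routines. */
static int dbl(int x)  { return 2 * x; }
static int half(int x) { return x / 2; }

/* The "DLL" export table, sorted by name so binary search works. */
struct export_entry { const char *name; fn_t addr; };
static struct export_entry exports[] = {
    { "dbl",  dbl  },
    { "half", half },
};

static int cmp(const void *key, const void *elem) {
    return strcmp((const char *)key,
                  ((const struct export_entry *)elem)->name);
}

/* The importer's IAT: one slot per imported symbol. */
fn_t iat[1];

/* What the loader does at DLL load time. */
void loader_bind(const char *name, fn_t *slot) {
    struct export_entry *e =
        bsearch(name, exports, 2, sizeof exports[0], cmp);
    if (e)
        *slot = e->addr;   /* overwrite the IAT entry with the real address */
}

/* Every imported call is one indirection through the IAT. */
int call_imported(int x) {
    return iat[0](x);
}
```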

There are two ways in which an instruction can access the IAT; that is, there are two ways to generate code.

This is close to the GOT/PLT mechanism used on Linux.


Accessing Imported Symbols
The instruction that uses an imported symbol naturally needs the address of an IAT entry. This is arranged in one of two ways:

a. Import library: this library provides stubs which satisfy the otherwise-missing imported symbol references. The generated stub is a JMP instruction carrying an address constant that points to an entry in the IAT; the instruction fetches the contents of that address and jumps to it.

CALL 0x0040100C
. . .
0x0040100C:
    JMP DWORD PTR [0x00405030]

Although inefficient, this is the method the compiler uses by default, because the compiler cannot distinguish a call to a normal function from a call to an imported one.

b. Direct addressing:
CALL DWORD PTR [0x00405030]
The address constant above points to an entry in the IAT. This is more efficient than going through a stub function. You can direct the compiler to generate this type of code by prefixing the declaration with __declspec(dllimport).
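A sketch of what that looks like in source. The function name is hypothetical, and on non-Windows compilers the keyword is defined away so the sketch still builds, with a local definition standing in for the DLL's side:

```c
/* Sketch of the dllimport hint. 'imported_func' is hypothetical. */
#ifdef _WIN32
#define IMPORT __declspec(dllimport)   /* tells MSVC: call via the IAT slot */
#else
#define IMPORT                         /* other toolchains: no-op */
#endif

IMPORT int imported_func(void);

/* Stand-in for the DLL's implementation, so this sketch is self-contained.
   (In a real build this would live in the DLL, not in the importer.) */
int imported_func(void) { return 7; }
```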

There is nothing equivalent to import library method on Linux.


Binding
An executable is said to be bound when the IAT in the file has been overwritten with the actual addresses of the symbols in other DLLs (assuming, of course, that each DLL is loaded at its preferred address). This can give a performance boost depending on how large the imported DLLs are; for large DLLs with thousands of APIs, it makes sense.

If the load address turns out to be different, the loader fixes up the addresses. The bound executable also stores the timestamp and checksum of each DLL, so that when either changes, the binding is invalidated and redone.

Binding is sometimes done during installation.

There is nothing equivalent to Binding on Linux.


Explicit and Implicit linking
Implicit linking is the default method. Explicit linking means explicitly ensuring that the target DLL is loaded and then looking up the addresses of the APIs; this is almost always done via the LoadLibrary and GetProcAddress APIs.


Ordinals
Ordinals are numbers which uniquely identify imported/exported symbols. They also form (directly or indirectly) indices into the actual entries in the import/export tables, which contain more information on the symbols, such as name and RVA.


A program can access an imported API directly by ordinal via explicit linking. But this should only be done if it is guaranteed that the symbol always occupies that ordinal; otherwise, one should use GetProcAddress to look up symbols by name. Ordinals are sometimes used in plugin systems to locate a factory method that instantiates the core class: the plugin is built so that the factory method always occurs at a fixed ordinal, such as 1.
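The fixed-ordinal convention can be sketched as a table indexed by ordinal. This is a toy model with hypothetical names; the point is that lookup by ordinal needs no name matching at all:

```c
/* Toy model of ordinal-based lookup; all names are hypothetical.
   The plugin's exports are indexed by ordinal, with the factory
   method pinned at ordinal 1 by build convention. */
#include <stddef.h>

static int plugin_core = 123;               /* stands in for the core object */

static void *create_core(void) { return &plugin_core; }

/* Ordinal 0 is unused; the factory is guaranteed to sit at ordinal 1. */
static void *(*export_by_ordinal[])(void) = { NULL, create_core };

void *lookup_by_ordinal(size_t ordinal) {
    return export_by_ordinal[ordinal]();    /* no name matching needed */
}
```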


Sources:
An In-Depth Look into the Win32 Portable Executable File Format - Part 1 and Part 2
Linkers and Loaders by John Levine

References:
What Goes On Inside Windows 2000: Solving the Mysteries of the Loader

Tuesday, January 10, 2012

Linkers and loaders

I had various reasons for setting out to understand more about the internals of linking and loading. What goes on under the hood when I hit the build button in the IDE? How do I create a plugin system in C++ where plugins are built with different compilers? What about C++ ABI compatibility?

I had studied linkers and loaders in an academic course (System Software by Leland L. Beck). But that was a long time ago, it was way too detailed, and it discussed the topic using an abstract machine. I could not go back to that.

I found some other good resources :
[1] Linkers and Loaders by John Levine (buy it, it's worth it)
[2] Linkers by Ian Taylor
[3] Linkers and Loaders by Sandeep Grover
[4] DSO Howto by Ulrich Drepper
Although they were all good, none of them was as concise as I wanted. Sometimes I got lost in long descriptions and varying terminologies. So the aim of this post is to capture my understanding, and perhaps to help anyone with the same goal.

- - - 
Let me assume a modern OS where each process has its own address space and, for the most part, a Linux platform using the ELF file format.

As you might know, the following kinds of files exist during the transition from the source code of a language to the point where it is ready to be run by the OS.
  • Object files
    • files with .o extension -> the result of compiling an individual source file or translation unit
  • Libraries
    • Static libraries 
      • Conceptually, it's a collection (archive) of object files with a master symbol table. These are meant for static linking with a program interested in their routines. When statically linked, the library's object files are included in the output executable, so these libraries need not be deployed on the target machine.
        These libraries can have unresolved references; the program using such a library must provide the necessary dependent libraries to the linker to resolve them.
        Also called archives, they have the extension .a on Linux and .LIB on Windows or Symbian.
    • Shared libraries or shared objects
      Shared objects are executables, in the sense that they can be loaded into memory and run, unlike static libraries. They are dynamically loaded, and come in two kinds:
      • Statically aware
        • An executable might statically (at build time) depend on certain routines and data in shared libraries. In such a case, the linker needs to be provided with the shared objects which resolve them. When the executable is started, the shared objects are automatically loaded into memory and dynamically linked.
          They have the .so extension on Linux, .DLL on Windows.
      • Statically unaware 
        • A program might depend on routines & data in SOs only at runtime. In such a case, it can explicitly call system APIs to load and use them. They have the .so extension on Linux, .DLL on Windows.
  • Executable
    • a.out on Linux or .exe on Windows.
- - -

*.c => *.o => a.out

Let's look at the contents of an object file.

An object file has various parts such as .text, .data, .bss, .symtab, etc., generated by the compiler/assembler. These are called 'sections' in ELF file format terminology. An ELF object file is thus a concatenation of these sections, with a header plus some bookkeeping to be able to traverse them.

Since the compiler has no idea where the program will eventually run in memory, it assumes an address space starting at 0 to place text, data, etc. So, for example, the first instruction starts at address 0x100 in the object file. If the code refers to a global variable 'p', its instruction code might refer to the address 0x2000, if that is the start of the data segment. But such references would be incorrect when the object code is placed somewhere else in memory. So instead, the compiler just places 0 there and delegates the task of filling in the correct address to a later stage, by leaving a 'relocation entry'. This entry notes (a) the address 'A' in the .text section, in the middle of the instruction code, where the address of 'p' has to be updated, and (b) the symbol 'p' whose address needs to be placed at 'A' (this is just a reference to an index in the symbol table). Let's call this a 'program relocation entry', as we are going to see another type later.

You can see this for yourself by creating a .o file and analysing the contents using objdump:
gcc -c mytest.c
objdump -dhsx mytest.o

Relocation entries can be viewed using:
readelf -r mytest.o
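The fixup that the linker eventually performs for such an entry can be sketched in C. This is a toy model with made-up 32-bit addresses: a relocation record says where the placeholder sits and which symbol's address to write there.

```c
/* Toy model of a program relocation entry: the compiler left a zero
   placeholder inside the instruction stream; the linker patches in the
   final address of 'p'. Addresses here are made-up 32-bit values. */
#include <stdint.h>
#include <string.h>

struct reloc {
    uint32_t offset;     /* (a) where in .text the fixup goes ('A')   */
    int      sym_index;  /* (b) which symbol's address to place there */
};

/* Resolved symbol addresses, indexed like a symbol table. */
static const uint32_t symbol_addr[] = { 0x08049000 };  /* address of 'p' */

void apply_reloc(uint8_t *text, const struct reloc *r) {
    uint32_t a = symbol_addr[r->sym_index];
    memcpy(text + r->offset, &a, sizeof a);  /* overwrite the placeholder */
}
```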

Let's look at the contents of an executable file. On Linux, the same ELF format is used for executables.
gcc *.o -o myprog.out

Only linking is done here, by the 'program linker'. The program linker scans all the object files and their sections and merges them: all .text sections from the object files are brought together, all .data sections are brought together, and likewise all relocation entries, symbol tables and so on. The linker then recalculates the collective target address space and starts laying out the merged sections accordingly. (Needless to say, any references across object files are resolved here.)

Since modern OSes provide a separate address space for each process, the program can be laid out at a fixed location, for instance 0x8000000 (hypothetical).

The symbol tables and relocation entries are updated accordingly in the linker's in-memory bookkeeping tables. Then, symbol references in the instruction code are fixed with the correct addresses from the updated symbol table.

The resultant .text section and other read-only sections together form a segment. Each segment is just a collection of contiguous sections with similar properties (e.g., read-only, executable, etc.).

While sections are used by the linker for symbol resolution and relocation, segments are used by the program loader. The program loader copies the segments from the executable file on disk into memory pages, starting at the desired logical address (0x8000000). If a segment ends in the middle of a page, the next segment starts at the next page.
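The page-boundary arithmetic behind "the next segment starts at the next page" can be sketched as follows (assuming 4 KiB pages):

```c
/* If a segment ends mid-page, the next segment begins on the next page
   boundary. Assume 4 KiB pages. */
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Round an address up to the next page boundary. */
uint32_t page_round_up(uint32_t addr) {
    return (addr + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}
```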

- - -

*c => *.o => mylib.so

The process of creating an SO is almost the same as that of creating an executable. A few differences:
  1. SOs do not have a fixed load address, since a fixed address might already be used up by another SO (so, invariably, an SO is linked starting at 0).
    1. That is, SOs have to be relocated again on loading. Let's call this 'dynamic relocation'. The program linker creates 'dynamic relocation entries' for the dynamic linker. The semantics of the entries are still the same: add the address at which the SO is loaded to the address identified by each dynamic relocation entry.
    2. So, a fixed offset (the load address) has to be added to all those addresses on loading. But this is mitigated by using PIC, as we will see.
  2. The relocation entries and symbol tables are retained in the SO file, for dynamic linking.
  3. Position-independent code (PIC) is used.
When a program that uses a shared object is built, the library needs to be provided to the linker. The linker notes the SO library's name in a special section called NEEDED (it does not store the full path, just the soname).
gcc main.c -L/home/user/lib -lmytest //-L specifies the library search folder

There is also an 'interpreter' section in the program which identifies the dynamic linker program that links the shared object at runtime.

When the program is loaded, the loader finds that there are dependencies and launches the interpreter. The dynamic linker then looks up the needed libraries and loads them. If an SO is not found, the program terminates. If found, it scans the reloc and symtab sections and loads them into its data structures. It then loads the SO at an available logical address and performs relocations based on the dynamic relocation entries.

Unlike program relocation, which is done after compilation on a build machine, dynamic relocation has a performance impact directly proportional to the number of references to global, static and extern data and functions. A technique called position-independent code (PIC) adds an indirection to such references, limiting relocations to one per variable or one per function, rather than one per use of a variable or per function call. The indirection also helps support a feature called 'function interposition', where a function in the executable can override one in the shared library. Such indirections are localized in a table; separate tables exist for data and functions, called the Global Offset Table (GOT) and the Procedure Linkage Table (PLT) respectively.

So, loading a variable from memory is equivalent to
addr = GOT[offset]
variable = *addr
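The two-step access above can be written out in C. This is a toy model; in reality the dynamic linker fills the GOT slot at load time, which is exactly why there is one relocation per variable rather than one per use:

```c
/* GOT-style indirection for data access (toy model). */
static int shared_var = 99;   /* a variable living in the SO's data segment */

static int *got_table[1];     /* one GOT slot per referenced variable */

void fill_got(void) {         /* what the dynamic linker does at load time */
    got_table[0] = &shared_var;
}

int read_shared_var(void) {
    int *addr = got_table[0]; /* addr = GOT[offset] */
    return *addr;             /* variable = *addr   */
}
```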

Now, the table itself is placed at a fixed offset from the code, so that the code can find it. In other words, the offset from the code (the PC) to the GOT (or PLT) is fixed when a shared object is built.

While the GOT has a straightforward design, the PLT is somewhat complicated; you can refer to [1] or [2]. In brief, the first call to any function is arranged to be a call into the dynamic linker via a GOT entry. The dynamic linker then finds the correct address of that procedure and modifies the GOT entry so that subsequent calls land directly in the correct routine.
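That patch-on-first-call trick can be simulated with a function pointer slot. This is a toy model of lazy binding: the slot starts out pointing at a resolver, and the first call replaces it with the real target.

```c
/* Toy model of PLT lazy binding. */
typedef int (*fn_t)(void);

static int resolver(void);

static fn_t plt_slot = resolver;  /* the GOT entry used by the PLT stub */
static int resolve_count = 0;

static int target(void) { return 42; }   /* the routine in the shared library */

static int resolver(void) {
    resolve_count++;          /* dynamic-linker work happens only once */
    plt_slot = target;        /* patch the slot with the real address  */
    return target();          /* complete the first call               */
}

int call_via_plt(void) {
    return plt_slot();        /* later calls land directly in target() */
}
```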

The scheme on Windows/Symbian for shared libraries is similar, except that Windows links the program against an import library rather than against the DLL itself. The data and procedure symbols to be 'exported' by the DLL are placed in an ordinal table with indices. The import library has this binding information and statically resolves those symbols using stub code during program linking. At runtime, the DLL is found and loaded (like SOs are) and the jump tables from the import library take care of binding each call to the routine in the DLL.

- Statically unaware shared objects:
In this variant, an executable program is only loosely dependent on shared libraries; that is, it does not directly refer to data or routines in them. Instead, it uses system APIs (like dlopen()) to load SO libraries on demand and look up routines for use. Obviously, the developer must know the nature of the routines beforehand, and use mechanisms like function pointers or virtual interface classes to interact with the data/routines in the SO. This kind of shared object is mainly used for plugins in plug-and-play frameworks.

- - -

-- Reduced memory footprint from using SOs
Shared libraries save a lot of disk space compared to static libraries. But some redundancy remains when SOs are loaded into memory: various applications using the same SO might end up with multiple copies of it in memory. The solution that has been devised involves separating the read-only parts from the rest. With this, the system can load the RO parts only once, while making a new copy of the writable (W) parts for each process.

Since the addresses in the RO parts are bound to data in the W parts, the entire SO has to be placed as a block in the logical address space. So, for each process, paging is done such that the RO and W segments are contiguous only in the LAS, while physically they may be apart. When paging is done for a new process, the RO physical pages are mapped into the LAS of that process.