Some notes about compilers and libraries
========================================

* What's a compiler?
  ------------------
When we write a program, we usually do it in some high-level language like
C or Fortran.  We might write a line like this:

	printf ("Hello, world!\n");

That's not something that our computer's processor can understand,
though.  The C statement needs to be translated into a different,
low-level language of ones and zeros that expresses the same thing in
terms of instructions that the CPU can understand.  The result might
look like:

	1001110001010111011110001001111001.......

To get from the more-or-less readable C statement to the opaque binary
statement, we need some kind of translator.  That translator is called
a "compiler".


* How do compilers work?
  ----------------------
Compilers actually do this job in several discrete steps.  Let's say
we've written a C program called "hello.c", and we'd like to compile it.
We might type:

	gcc -o hello hello.c

The resulting binary file, "hello", contains instructions that our CPU
can understand.

Consider the following diagram, which shows the steps that gcc goes
through to create a binary version of our program:



The first thing gcc does is run our "hello.c" file through a "pre-
processor".  This is actually a separate program, called "cpp", which
can be used by itself if you like.  The pre-processor looks for statements
beginning with '#' and for macros defined by them.  For example, in
a real "hello world" program, we'd probably write something like this:

	#include <stdio.h>
	int main () {
		printf("Hello, world!\n");
	}

The "#include" statement is a pre-processor directive, and is handled
by cpp.  When this part of our file is run through the pre-processor,
the contents of the file stdio.h are found and inserted at this point,
just as though we'd typed them directly into the hello.c file.  

If you've defined any macros with #define, cpp will go through your
program and replace each macro with its value.  If you've used any
#ifdefs or similar pre-processor conditional statements, cpp will
evaluate them and insert the appropriate content into your program.
When cpp is done with the program, it should contain pure C code, with
no pre-processor statements left.  (If you want to spy on what the
program looks like at this stage, try running your program through cpp
yourself: type "cpp -P hello.c > hello.out".)
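
As a small illustration of macros and conditionals, here is a sketch of
a file you could feed to cpp.  The macro names (GREETING, SQUARE,
SHOUT) and the function names are made up for this example.  If you
run it through "cpp -P", you should see SQUARE(side) replaced by its
expansion, and only one of the two "return" lines in greeting() should
remain.

```c
/* Hypothetical macros, invented for this illustration. */
#define GREETING "Hello, world!"
#define SQUARE(x) ((x) * (x))

/* After cpp runs, the body reads: return ((side) * (side)); */
int square_area(int side)
{
    return SQUARE(side);
}

/* Only one of the two return statements survives pre-processing,
   depending on whether SHOUT was defined (for example, by passing
   -DSHOUT on the command line). */
const char *greeting(void)
{
#ifdef SHOUT
    return GREETING "!!";
#else
    return GREETING;
#endif
}
```

There's no main() here, so this is only a fragment: it's enough for
cpp or "gcc -c", but it won't link into a program by itself.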

Now the main work of the compiler happens.  gcc takes this "purified"
C code and converts it into a binary form that's digestible by the
CPU.  But what about functions that aren't defined in our program,
like "printf"?  How can the C compiler write CPU instructions for
these functions?  In fact, it can't: instead, it just inserts
placeholders in the code for now.  (What does your program look like
at this stage?  It's called an "object" file, and you can tell gcc to
stop at this stage and write the output to a file if you like: "gcc -c
hello.c" will create an object file called "hello.o".)
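
To make the idea of placeholders concrete, here is a sketch of a tiny
file (call it "count.c"; the file and function names are hypothetical).
It defines one function of its own and calls one function, strlen,
that is only declared in a header.  If you compile it with "gcc -c",
the object file should contain the code for count_chars, but typically
only a placeholder for strlen.

```c
#include <string.h>

/* count_chars is defined in this file, so its machine code goes
   into the object file. */
size_t count_chars(const char *s)
{
    /* strlen is only declared (in string.h), not defined here, so
       the compiler typically leaves a placeholder for it that the
       linker fills in later.  (With optimization turned on, some
       compilers may expand strlen inline instead.) */
    return strlen(s);
}
```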

The placeholders referring to things that aren't in your hello.c file
are resolved in the final step, which is called "linking".  In this
stage, gcc invokes another program, called "ld", which looks through a
set of standard libraries, trying to find a function called printf.
We'll talk about how libraries are created soon, but for now you just
need to know that a library contains pre-compiled chunks of code that
correspond to functions like printf.  The linker copies any chunks it
needs from the libraries, and inserts them into the appropriate places
in your program.

There are three important things to note about linking:

First, the chunks of code in the libraries are pre-compiled, so
they're already binary code that's ready to be used by your CPU.

Second, if the linker can't find a chunk of code corresponding to a
function that you've used, it will spit out an error message telling
you that it has run into an unresolved reference (your program refers
to a function that can't be found).  This may mean that you need to
tell the compiler to look elsewhere, in other libraries besides the
standard ones.  (Or it may mean that you have a typo in your program!)

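Third, the linker only copies the parts of a library that your program
really uses (strictly speaking, the object files inside the library
that define the functions you need).  It doesn't insert a copy of the
whole library into your program.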

* Static versus Dynamic Libraries:
  --------------------------------
There are actually two different kinds of libraries: static and
dynamic libraries.  Everything we've said so far applies only to
static libraries.  A static library has a name like "libsomething.a",
with ".a" standing for "archive".  Static libraries are used by the
linker as we described above.

Dynamic libraries are slightly different.  When a program uses dynamic
libraries, the binary code for the functions you use isn't physically
inserted into the binary file created by the linker.  Instead, a
reference is inserted into the file.  This reference says that, when
the program is run, the function should be loaded as needed from a
dynamic library.  Dynamic libraries are files that have names like
"libsomething.so", where the ".so" stands for "shared object".  (These
names may have a version number appended onto them, also.)

When you run a program that uses dynamic libraries, it's actually
being run by something called the "dynamic linker/loader", which is
just another program called "ld.so" or "ld-linux.so".  As the program
is running, ld.so looks for references to functions in dynamic
libraries and fetches the necessary binary code from those libraries.

Why would you want to use dynamic libraries instead of static
libraries?  There are several reasons:

- Dynamic libraries save disk space.  If every program contained its
  own copy of "printf" and every other function in the C libraries, a
  lot of space would be wasted.  With dynamic libraries, there's only
  one copy of these functions.

- Dynamic libraries make upgrades easy.  Imagine that there's a
  serious bug in an old version of a library, and you want to install
  a newer version.  If it's a static library, it's not sufficient to
  just install the new "libsomething.a" file.  Programs that were
  compiled with the old static library will still have buggy functions
  inside them.  In order to give all of your programs the benefit of
  the new library, you'd need to recompile all of them, so that the
  new, un-buggy functions from the library would be copied into the
  new binary files.

  With dynamic libraries, all you need to do is install a new
  "libsomething.so" file.  Any programs that use the library will
  automatically, immediately, see the benefit of the upgrade, without
  your needing to do anything else.  This can be very important if the
  bug is a security hole.

- Dynamic libraries save memory.  When you run a program, it gets
  copied into memory.  Just as with disk space, multiple copies
  of library functions waste memory.  When programs use dynamic
  libraries, the operating system is smart enough to load only
  one copy of each library into memory.  This copy is shared by all
  of the programs that need that library.

For all of those reasons, most of the programs installed on your
computer use at least some dynamic libraries.

If you wanted to see what dynamic libraries your "hello" program
used, you could type:

	ldd hello

You'd probably see something like this:

        linux-gate.so.1 =>  (0x0020d000)
        libc.so.6 => /lib/libc.so.6 (0x00337000)
        /lib/ld-linux.so.2 (0x001da000)

In the example above, we see that our "hello" program refers to
"libc.so.6".  ldd tells us that this can be found on disk in
/lib/libc.so.6.  ldd also says that our program refers to
"/lib/ld-linux.so.2", in this case specifying exactly where the
dynamic library lives on disk.  Finally, there's a reference to
something called "linux-gate.so.1".  This isn't a real library on disk
anywhere.  Instead, it's a fictitious library that the Linux kernel
creates in memory in order to give programs access to some kernel
functions.

What about the numbers in parentheses?  That's a little more
complicated.  They're memory addresses, but they don't point directly
to the locations in physical memory where these dynamic libraries are
loaded.  For an explanation see:

 http://unix.stackexchange.com/questions/116327/loading-of-shared-libraries-and-ram-usage

OK, so dynamic libraries sound great.  But what's the down side?
Well, here's one: What happens if you copy your program to another
computer that doesn't have all of the dynamic libraries that the
program needs?  It probably won't run.  If we ran "ldd hello" on the
other computer, we'd find that some of the required libraries couldn't
be found.  When compiling a program with dynamic libraries, it's also
possible to specify a particular version of a library, or even to say
where we're going to expect to find the library on disk.  These things
can also make a binary file un-portable if another computer has a
different version of a library, or if the library is stored in a
different location on disk.

* Creating and Using Static Libraries:
  ------------------------------------
It's very easy to create a static library.  Say, for example, that
I have a file called graph.c that contains a lot of spiffy graphics
functions that I've written.  The file doesn't contain a program
(no "main()"), just the graphics functions.
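
For concreteness, a file like graph.c might look something like the
sketch below.  The two function names (graph_scale and graph_bar) and
their behavior are invented for this example: they just do some
simple text-mode "graphics".

```c
/* graph.c: hypothetical text-mode "graphics" functions; no main() here. */
#include <stddef.h>

/* Scale value from the range [0, max] to a bar length in [0, width].
   Values outside the range are clamped.  Returns 0 if max is not
   positive. */
int graph_scale(int value, int max, int width)
{
    if (max <= 0)
        return 0;
    if (value < 0)
        value = 0;
    if (value > max)
        value = max;
    return value * width / max;
}

/* Fill buf with a bar of len '#' characters, truncated so that the
   bar plus its terminating NUL fits in size bytes.  Returns the
   number of '#' characters actually written. */
size_t graph_bar(char *buf, size_t size, size_t len)
{
    size_t i;

    if (size == 0)
        return 0;
    if (len > size - 1)
        len = size - 1;
    for (i = 0; i < len; i++)
        buf[i] = '#';
    buf[len] = '\0';
    return len;
}
```

A program that wanted to use these functions would need to declare
them (usually by including a small header file) and then be linked
with the compiled code.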

The first step in turning this into a library is to convert our
C code into binary code.  This isn't a whole program, so we're
going to skip the "linking" step that gcc did in the example above.
We can do this by typing:

	gcc -c graph.c

This tells gcc to just do the pre-processor and compile steps
and then stop.  It produces an output file called graph.o,
where the .o stands for "object".  An object file contains
binary code that has been compiled, and is ready to be 
inserted into a program.

The "ar" command can be used to pack object files into a 
static library and index them for later use.  For example, we
could create a new library containing our graphing functions:

	ar -csr libgraph.a graph.o

where "c" means "create the library if it doesn't exist",
"s" means "generate an index", and "r" means "replace anything
of the same name that is already in the library".

Now we have our new library, libgraph.a, and we can use it
when we compile programs.  Say, for example, that we want to
use one of our fancy new graphics functions in the hello.c
program.  If libgraph.a is in the current working directory,
we might type:

	gcc -o hello hello.c -L. -lgraph

The "-L" qualifier tells gcc to look in an additional
directory when trying to find libraries.  (In this case, the
directory is ".", the current working directory.)  The "-l"
qualifier says to link the program with the following library,
where we leave off the "lib" prefix and the ".a" suffix on
the library's name.  (In the early days of the GNU project
there was a library called "libiberty.a", so you could type
"-liberty".)

Alternatively, gcc lets you add other directories onto the linker's
search path by defining the environment variable "LIBRARY_PATH".  Just
put a colon-separated list of directories into this variable, and gcc
will add these directories to the standard list of places where it
looks for static libraries.

You can add more than one object file to a given library.
The command "ar -t libsomething.a" will show you the names
of the object files that were put into the library.  Try
this with a large library like CERN's libpacklib.a, and you'll
see thousands of object files.

To see the names of functions and symbols in the library's index, you
can use the "nm" command.  Each name will be shown with a one-letter
symbol.  The names of the functions in this library will be identified
by a "T".  The nm command is often useful when you're trying to figure
out which library contains a particular function.
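
If you'd like to experiment with nm, here is a sketch of a small file
(call it "symbols.c"; all of the names in it are hypothetical) that
should produce several different letters.  The letters in the comments
are what I'd typically expect from "gcc -c symbols.c" followed by
"nm symbols.o" when compiling without optimization; your compiler may
differ.

```c
/* symbols.c: hypothetical file for experimenting with nm. */
#include <string.h>

/* Initialized global variable: typically "D" (data). */
int default_max = 80;

/* File-local (static) helper: typically a lowercase "t". */
static size_t clamp(size_t n, size_t max)
{
    return n > max ? max : n;
}

/* Global function: "T". */
size_t clamped_length(const char *s)
{
    /* strlen lives in the C library, not in this file:
       typically "U" (undefined). */
    return clamp(strlen(s), (size_t)default_max);
}
```

Uppercase letters generally mark global symbols that other files can
refer to, and lowercase letters mark symbols that are local to one
object file.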

* Creating and Using Dynamic Libraries:
  -------------------------------------
Making a dynamic library requires a different procedure.  Let's start
again with our "graph.c" file, but this time turn it into a dynamic
library instead of a static one.  As with the static library, we begin
by compiling the code, but this time we'll add another qualifier to our
gcc command:

	gcc -c -fpic graph.c

The "-fpic" tells gcc to create an object file with
"position-independent code".  This is necessary if we want to make a
dynamic library.

Then, we pack our object file, graph.o, into a library, but this time,
instead of "ar", we use gcc itself to do the job:

	gcc --shared -o libgraph.so graph.o

We now have a new dynamic library!  We can link a program with the
library in the same way we linked to the static library:

	gcc -o hello hello.c -L. -lgraph

(assuming that libgraph.so lives in the current working directory).

We still have some work to do, though.  At this stage, if we tried to
run our program we'd probably see an error message saying that the
shared library libgraph.so couldn't be found.  This is because
the dynamic linker/loader (ld.so) doesn't know about our newly-created
library, since we probably didn't put the library into any of the
places where ld.so normally looks.

We can add new directories to ld.so's search path by defining the
LD_LIBRARY_PATH environment variable.  Like the LIBRARY_PATH variable
I mentioned earlier, this is a colon-separated list of directories
that contain extra libraries.  You might, for example, put your
newly-created library into $HOME/lib (a subdirectory called "lib"
under your home directory).  Then you could say:

	setenv LD_LIBRARY_PATH $HOME/lib

if you're using tcsh as your login shell, or:

	export LD_LIBRARY_PATH=$HOME/lib

if you're using bash.

You should then be able to run your program without errors.

* Telling gcc to Choose Static or Dynamic Libraries:
  --------------------------------------------------
If both static and dynamic versions of a library are available,
gcc will, by default, use the dynamic version.  If we want to
force gcc to use the static versions of libraries, we can
give the "--static" qualifier.  For example, if we said:

	gcc --static -o hello hello.c -L. -lgraph

gcc would look for libgraph.a instead of libgraph.so.

Note that you won't necessarily have static versions of
all libraries installed on your computer.  Most operating systems
install the dynamic libraries only, by default.

* Congratulations!
  ----------------
That's it!  You now know how to make and use libraries!