BetterOS.org : An attempt to make computer machines run better



Tutorials

C Tutorial

The things you should learn

Tutorial 1: Overview


Introduction
 There are many tutorials available online that attempt to teach people how to program in C. That is how I learned, and many others have learned the same way. However, many of these tutorials fail to teach the underlying mechanisms behind the concepts they cover. As a result, many programmers see programming as "typing magic words", and the quality of their code suffers. Programming has nothing to do with "magic words". This tutorial attempts to explain C at a level beyond the "magic words" type of tutorial, in an effort to give new programmers a better understanding of programming and help them write better code.


What is C?
 C is a programming language, of course. More specifically, it is a compiled programming language. This means that when you build a C program, the compiler reads the code and converts it into assembly language (compile), then converts the assembly language code into native machine code (assemble), and finally finds all the other code that your program uses and adds the necessary links into the executable (link). At no point is there any magic involved.


What is C not?
 C is not a program and it is not a piece of software; it is a programming language, and there are many compilers that understand it. C is not an old, obsolete version of C++ or C#; C is a separate language that is still relevant and useful today and, in my opinion, superior. C is also not magic; everything that can be done in C can be understood from the highest level down to the lowest level. Nothing happens that cannot be understood.


Hello World
 The famous "hello world" program is the first program every programmer writes. It is one of the simplest possible programs to write, but there is still a lot to learn about what happens below the surface.
 The hello world code in C:

 #include <stdio.h>

 int main(int argc, char **argv)
 {
  char str[]="Hello world!!";
  printf("%s\n",str);

  return 0;
 }

There it is, very simple to look at, only 5 real lines of meaningful code. However, all 5 of those lines are worth looking at in detail.


Standard C library and the C preprocessor
 The first line is #include <stdio.h>. Most tutorials just tell you that you need this in your code. Some better ones might tell you that you need it to use certain functions, or that it defines those functions. What does it really do though? Surely this line is magic?
 Wrong. Most functions that you call in C are not actually part of the C language itself, but are instead part of the standard C library, which is normally provided by your operating system or compiler toolchain. In order for your program to call these functions correctly, your compiler needs to know how each function is declared. That is what the file stdio.h is for. It contains function prototypes for standard C library functions. Once a prototype is included in your program, you can safely call the function it declares, and your compiler will be able to determine the correct way to pass all the arguments to it. Older compilers will still let you call a function without a prototype (C99 and later require a declaration, so modern compilers warn or refuse), but if you then call it with the wrong type of argument (a double or float instead of an int or char), something bad will happen (undefined behavior). The functions themselves (what they do) aren't defined in that file, just how to call them. The actual code is in the standard C library.
 That line is still more interesting though. The #include part isn't a magic word either. It is what is called a C preprocessor directive. During the compilation process, there are a few stages that take place. The very first one is handled by the C preprocessor. The C preprocessor finds all the lines that start with # (preprocessor directives), and it interprets them, removes them, or replaces them with something else. The #include directive finds the file specified (by searching the C compiler's include path), and when it finds it, it removes the line and replaces it with the entire contents of the specified file. This sounds scary, but most header files contain mostly function prototypes, type definitions, macros and constants (macros and constants are defined with preprocessor directives such as #define), and those things don't use any resources in the final executable.
 Most compilers have an option that lets you see what your code looks like after it has been preprocessed, but not yet compiled.


Function definitions and pointers
 In the next significant line, int main(int argc, char **argv), we are defining our main function. You have probably heard before that main is the start point of your program and that it has to be defined like this. However, that is partially untrue and doesn't explain why.
 main is not the start point of a program; it is just the entry point of your code. Programs start wherever the linker says the program should start. On many systems (Linux, for example) this is a function called _start. _start sets up important things behind the scenes and then calls a function in the standard C library, and that function then calls main in your program. _start is defined in an object code file which your compiler will automatically find and link into your final executable.
 One thing that _start does is find out what arguments were passed to the program when it was called. It then passes these to the C library, which passes them to your main function in the format and order you know and love (or will learn). This is why it is important to declare main the same way in all your programs: if you don't, the C library will still try to call a function that returns an integer and takes two arguments, and if your function doesn't match that, then bad things will happen (undefined behavior).
 Let's also look at the arguments. The first argument is an integer; we defined it as an integer with int. This argument represents the number of arguments that have been passed to the program from the command line. Remember that this includes the name of the program as well. The second is **argv. These are the actual arguments that have been passed to the program. The type of the argument is a bit misleading at first: we called it a char, but then we added two *s before the name of the argument. Those asterisks (*) denote that it is a pointer to the type mentioned. Since there are two of them, and we used the char type, the argument is a pointer to a pointer to a character. It is often described as an array of strings; strictly speaking it points to the first element of an array of char pointers, one per argument, which is not the same thing as a true 2D array of characters. Pointers are something that we will get into later as they are a very important aspect of C. For now you should at least have an idea of what a pointer is. Normally a variable contains a value of the type that the variable was declared as; a pointer instead contains the memory address of a value of the type declared. So a pointer to a pointer to a character (char **) contains the memory address where a memory address is stored, and the second memory address is the location of a character value. In C, we usually use character pointers (or arrays) as strings.


Assignment, initialization, and strings
 In the next line, char str[]="Hello world!!", we create a new variable called str. We declared it as a character, but we also added [] after it, which makes it an array of characters. An array is not quite a pointer, but the two are closely related: in most expressions, str is automatically converted to a pointer to its first character. The main difference between declaring char *str and char str[] is that *str creates only a pointer and does not allocate any memory for characters, while str[] allocates memory for the characters themselves. Usually, we would explicitly state how much memory we want by placing the number of characters between the brackets; this time we didn't. This is OK because we initialized the string at the same time as we declared it. Since we specified an initializer, the compiler can figure out how much memory needs to be allocated. In this case it will allocate 14 bytes (13 characters, plus one character to terminate the string). The assembly code that this gets compiled to is really cool, but outside the scope of this tutorial (this is for C, not assembly).
 Another thing that you should know is that str, when used as a pointer, does not point to the whole string; it points only to the first character in the string. It's a character pointer, not a string pointer. C programmers usually use character arrays as strings anyway, and this works because the memory for the whole string is allocated in one continuous block. This means that although str only points to the first character, str+1 points to the second character, str+2 points to the third character, str+3 points to the fourth character, and str+i points to the character at index i (counting from zero). However, it is important to realize that str+3 in this case does not equal the character 'l'; it equals the memory location where 'l' is stored. If we want to access the character stored there, we need to dereference the pointer (*(str+3), which can also be written str[3]). We aren't doing that in this example, so I will save it for later.


Function calls
 The next line is the line that actually displays the string. It calls the function printf. printf is part of the standard C library on your system, so we are actually calling other code which is located in the standard C library (libc). A declaration of printf, its prototype, is located in stdio.h, which we included earlier. This prototype is not the code that will be executed when you call the function; it just tells the compiler how the function needs to be called. The actual code is located in libc.
 Notice that we passed two arguments to printf, and both are strings. We know that strings in C are actually arrays of characters. However, we were able to just enter a string directly as an argument for printf ("%s\n"). This works because the C compiler knows that it needs to store this string in memory and pass a pointer to its first character into printf instead. We could have done the same thing for the "Hello world!!" string, which would have made our C code 1 line shorter. However, it wouldn't increase efficiency because the C compiler will essentially do the same thing either way, and it would give me less of a chance to explain things.
 printf is also a bit of an unusual function; it takes a variable number of arguments. In C this is unusual: a function normally takes a fixed number of arguments, and function overloading is not allowed. printf is called a variadic function. Let's not worry about how this works for now; we will get into it later. Just remember that this is unusual and you will rarely create this type of function.


Returning
 The final line of code is the return statement. In main, this is not strictly necessary, but you should always write it anyway. Return is a very important instruction in assembly code (which C gets compiled to), and an assembly routine will usually crash or misbehave if it is not there. In C, the statement can be left out of void functions, and out of main (since C99, reaching the end of main is treated the same as return 0). The reason is again that the compiler knows that all functions need a return instruction at the end, and when it compiles your code, it adds the return instruction automatically. In any other function that returns a value, leaving out the return statement and then using the result is undefined behavior.


Run-down
 Let's quickly take a look at the entire compilation and execution process for our simple "Hello World" program. From beginning to end.

1.  Programmer writes the code. The code is a simple text file, nothing fancy.
2.  Compiler preprocesses the code (C Preprocessor).
 This searches through the code for preprocessor directives (lines starting with '#') and removes them, often replacing them with some other valid C code. The preprocessor starts parsing from the first line and continues moving down until the end. The C preprocessor does not parse any of the C code, which means it skips right over any C flow control like loops or if/else statements.
3.  Compiler compiles the code (C Compiler).
 This takes the output from the C Preprocessor and compiles it. This means that it reads the C code and converts the C code into assembly language.
4.  Compiler assembles the code (Assembler)
 This takes the output from the compiler and converts it into object code. Object code is almost machine code, except that it usually has some missing parts and some extra information which tell the linker how to fill in those missing parts (with code from libraries).
5.  Compiler links the object code (Linker)
 This takes the object code and creates links in the object code to the various library code that you used in your C program. The linker also fills in the missing parts, turning the object code into complete native machine code. Machine code is directly executable by your computer's processor (CPU). This is your final executable file.
6.  User runs the executable
7.  Operating system loads the machine code into memory
8.  Operating system passes control to the executable code
9.  Processor executes instructions one by one until the program exits (program passes control back to the OS)


Compilation walk through
 Let's take a quick look now at how exactly the "Hello World" program gets compiled and executed, step-by-step and line-by-line.

 Preprocessor:
 The preprocessor starts from the top of the file. The first line it reads is:
#include <stdio.h>
The preprocessor will then locate this file (assuming it is in the compiler's include path) and copy the entire contents of that file into the place of the line #include <stdio.h>. The preprocessor will then sequentially search through the rest of the code for more preprocessor directives. Since there are none, it will give everything over to the compiler.

 Compiler:
 The compiler starts from the top of the file. The first thing it reads is all that code that got added by the preprocessor (from stdio.h). This includes all the function prototypes, most importantly, this one:
int printf (char *__format, ...);
*simplified for better understanding.
The compiler remembers this in case you call printf later in your code (which we will).
 Then it moves on, reads and remembers all the prototypes, and gets to our main function.
int main(int argc, char **argv)
It sees how we declared our function: returning an int, and taking an int and a pointer to a pointer to char as arguments. It will remember this definition in case something calls this function.
 Then it moves on to the next line which is the curly bracket ({), at which point the compiler figures out that you are starting the main body of your main function.
 Next, the compiler reads the next line, which is:
char str[]="Hello world!!";
It sees that you are declaring and initializing a character array. So it creates code to store that data in memory, and remembers that you called that variable "str".
 Next line is:
printf("%s\n",str);
Now the compiler sees that you want to call a function. It sees that the function is called "printf". The compiler remembers that it was already given a prototype for that function, so it checks to make sure the way you are calling it is correct. It checks the first argument, "%s\n", which is a string literal and is passed as a char *, so that one is correct. Next it checks the second argument, str. It saw your declaration on the previous line, so the compiler remembers what str is. It is a character array, which is passed as a char * as well, but the printf prototype says ... after the first argument. That ... basically means "any number of arguments of any type"; that's how a variadic function is declared. So your call matches the declaration, which means that there is no problem and the compiler generates the necessary assembly code for the function call.
 Then it moves on, skipping the whitespace, until it gets to:
return 0;
At this point, it knows that the function is over and it should generate the code needed to return control back to where your main function got control. However, first the compiler goes back and checks your function declaration and compares what you are returning (0) with what you declared you would return. Since 0 is an integer, and you declared that you would return an int, there is no problem and the compiler generates the necessary assembly code.
 Next it moves on to the next line, which is the closing curly bracket (}), indicating that you are now finished defining your function. At this point, the compiler will also forget about str: because you declared str inside the function, the compiler assumes that you only want to use it inside the function.
 Then it goes to the next line and sees that the file is finished, so it passes everything over to the assembler.


Quiz
 Now let's do a quick quiz to see how much you remember.
Answer the following questions:

1. Where is the function printf defined (the actual code)?
2. What does a function prototype tell the compiler?
3. What exactly does #include do when it is parsed?
4. If a line starts with #, what reads that line and when does it happen?
5. In the hello world program, what does str point to?
6. What is a character array usually used for in C?
7. What does the compiler do if you forget the return statement at the end of a function?
8. What is in stdio.h?
9. What character is always at the end of a string?
10. How magical is C?


I know we didn't learn very much in this tutorial, but the next tutorial will focus on actually writing code.
Goodbye, see you next time!