The compilation process of a C source file

Compiling a C source file into an executable program involves multiple steps. They are as follows:


Preparation of the source files

The first step in compiling a C source file is to prepare it for preprocessing.

First, the characters of the physical source file are mapped to the source character set. This means that multibyte or other encodings used in the file are converted to the source character set.

Next, trigraphs are replaced by the characters they represent. A trigraph is formed of two question marks followed by a third character, and it stands in for a character that may be missing from some keyboards or character sets. For example, ??( can be used in place of [ .
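
The replacement described above can be sketched in C itself. This is only an illustration of the mapping, not how any particular compiler implements it, and the function names replace_trigraphs and trigraph_replacement are made up for this example. The table lists the nine trigraphs defined by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the character that a trigraph ending in c stands for,
   or 0 if c does not complete one of the nine trigraphs. */
static char trigraph_replacement(char c)
{
    switch (c) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Copies src into dst, replacing every trigraph by the single
   character it represents. dst must have room for strlen(src) + 1
   bytes, because the output is never longer than the input. */
void replace_trigraphs(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        char r = 0;
        if (src[0] == '?' && src[1] == '?')
            r = trigraph_replacement(src[2]);
        if (r != 0) {
            *dst++ = r;       /* emit the replacement character */
            src += 3;         /* skip the three trigraph characters */
        } else {
            *dst++ = *src++;  /* ordinary character, copy as is */
        }
    }
    *dst = '\0';
}
```

When calling it from C code, the question marks in a test string are best written with the \? escape, so that a compiler with trigraphs enabled does not replace them first. With that precaution, passing the characters a??(0??) stores a[0] in the destination buffer.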

Finally, every backslash that is immediately followed by a newline is deleted together with that newline, so the two physical lines are joined into one. This makes it possible to write a preprocessor directive, such as #define, on multiple lines.
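
As a small illustration of this line joining, here is a #define written on three physical lines. The macro and function names are invented for this example.

```c
#include <assert.h>

/* One logical #define spread over three physical lines.
   Each backslash and the newline after it are deleted first,
   so the preprocessor sees a single line. */
#define SUM_OF_SQUARES(a, b) \
    ((a) * (a) +             \
     (b) * (b))

int sum_of_squares(int a, int b)
{
    return SUM_OF_SQUARES(a, b);
}

/* Self-check: the macro behaves as if written on one line. */
void check_splicing(void)
{
    assert(sum_of_squares(3, 4) == 25);
}
```

The backslash must be the very last character on its line. If anything follows it, even a space, the lines are not joined.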

Preprocessing

The source file is now decomposed into preprocessing tokens and sequences of whitespace characters, including comments.

Next, each comment is replaced by a single space.

After that, the preprocessing directives are executed and macros are expanded. A conditional directive such as #ifdef selects which lines are kept, and a macro defined with #define x 1 causes each later occurrence of x to be replaced by 1. Finally, each #include directive causes the referenced header or source file to go through the same preparation and preprocessing steps, and the result is inserted in place of the directive.
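
A minimal sketch of conditional directives and macro expansion working together might look like this. All macro and function names here are made up for the illustration.

```c
#include <assert.h>

#define FEATURE_LOGGING   /* no value needed; #ifdef only tests existence */
#define MAX_USERS 8

#ifdef FEATURE_LOGGING
#define LOG_LEVEL 2       /* kept, because FEATURE_LOGGING is defined */
#else
#define LOG_LEVEL 0       /* this branch is dropped */
#endif

int log_level(void)
{
    return LOG_LEVEL;     /* the compiler sees: return 2; */
}

int user_count_limit(void)
{
    return MAX_USERS;     /* the compiler sees: return 8; */
}

/* Self-check of what the preprocessor produced. */
void check_directives(void)
{
    assert(log_level() == 2);
    assert(user_count_limit() == 8);
}
```

Removing the #define FEATURE_LOGGING line would make the preprocessor keep the #else branch instead, and log_level would then return 0.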

Once preprocessing is done, all preprocessing directives are deleted.

The preprocessing step can be performed on its own by issuing one of the following commands:

$ gcc -E source.c > name_of_preprocessed_file.i
# If using the gcc compiler.

$ cc -E source.c > name_of_preprocessed_file.i
# If using the cc compiler.

$ cpp -E source.c > name_of_preprocessed_file.i
# If using the C preprocessor directly.

As an example, consider this C source file:

/* This is a comment */
#define	x 0
int y = 1,/* Comments are replaced by a single space*/y;


int z = x

And this is the output of preprocessing this file:

$ gcc -E source.c
int y = 1, y;


int z = 0

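The command gcc -E source.c preprocesses the source.c file and writes the result to standard output. Comments are replaced by a single space, and preprocessor directives are executed. No C syntax checking is performed, which is why the missing semicolon after int z = x is not reported. The actual output of gcc also contains line markers, which are lines starting with # , and they are omitted here.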

Getting ready for the execution environment

The third step is to get ready for the execution environment. Character constants and string literals, including any escape sequences such as \n, are translated from the source character set into the execution character set.

Adjacent string literals, such as "a" "b", are concatenated into one.
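
Both effects can be observed from within a program. The following is a small sketch, and the variable and function names are invented for this example.

```c
#include <assert.h>
#include <string.h>

/* The two adjacent literals are concatenated into the single
   string "ab" during translation. */
const char *joined = "a" "b";

/* The escape sequence \n becomes one character of the execution
   character set, so this string has length 1, not 2. */
const char *newline = "\n";

/* Self-check of both translations. */
void check_literals(void)
{
    assert(strcmp(joined, "ab") == 0);
    assert(strlen(joined) == 2);
    assert(strlen(newline) == 1);
}
```

Concatenation is convenient for splitting a long string over several source lines without using a backslash.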

The file resulting from this step is called a translation unit.

Translating into assembly

The translation unit produced by the first three steps is formed of tokens and whitespace.

The tokens are analyzed syntactically and semantically according to the rules of the C standard, and the high-level C code is translated into low-level assembly language.

Each CPU architecture can have its own assembly language, for example x86-64 assembly or ARM assembly.

As such, a target architecture can be specified when compiling.

Compiling for an architecture different from the one on which the compiler is running is called cross compiling.
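
A program can find out which architecture it was compiled for by testing macros that the compiler predefines. The sketch below assumes GCC or Clang, which predefine __x86_64__, __aarch64__, __arm__ and __i386__ for the corresponding targets. Other compilers may use different names, and TARGET_ARCH and describe_target are names made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* These macros describe the target architecture, not the machine
   on which the compiler itself is running. */
#if defined(__x86_64__)
#define TARGET_ARCH "x86-64"
#elif defined(__aarch64__)
#define TARGET_ARCH "AArch64"
#elif defined(__arm__)
#define TARGET_ARCH "32-bit ARM"
#elif defined(__i386__)
#define TARGET_ARCH "32-bit x86"
#else
#define TARGET_ARCH "unknown"
#endif

/* Writes a short description of the target into buf and returns
   the number of characters that the full description needs. */
int describe_target(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "compiled for %s", TARGET_ARCH);
}
```

When cross compiling, the description reflects the architecture the code was built for, even though the compiler ran on a different one.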

The translation into assembly can be performed by issuing one of the following commands:

$ gcc -S source.c -o name_of_assembly_file.s
# If using the gcc compiler.

$ cc -S source.c -o name_of_assembly_file.s
# If using the cc compiler.

As an example, the following source file:

int main(void){ 
	int x =0; 
}

is converted to assembly:

$ cc -S source.c
# Translate source.c into source.s

$ cat source.s
# output the content of source.s

	.section	__TEXT,__text,regular,pure_instructions
	.macosx_version_min 10, 12
	.globl	_main
	.p2align	4, 0x90
_main:                                  ## @main
	.cfi_startproc
## BB#0:
	pushq	%rbp
Lcfi0:
	.cfi_def_cfa_offset 16
Lcfi1:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Lcfi2:
	.cfi_def_cfa_register %rbp
	xorl	%eax, %eax
	movl	$0, -4(%rbp)
	popq	%rbp
	retq
	.cfi_endproc


.subsections_via_symbols

Assembling

In this step, the generated assembly language is translated into machine language, which is the binary instruction encoding that the processor executes directly.

The file resulting from this step is known as an object file, and it contains object code. Object code is not yet executable.

The assembling step can be performed by issuing one of the following commands:

$ as source.s -o source.o
# If using as, assemble an
# assembly file into an
# object file.

$ gcc -c source.c -o source.o
# If using gcc, preprocess,
# compile, and assemble
# source.c into object code.

$ cc -c source.c -o source.o
# If using cc, preprocess,
# compile, and assemble
# source.c into object code.

Linking

In this step, an executable file is created from object files. Multiple object files are combined, the needed parts of static libraries are merged in, and external references are resolved. Each operating system has its own executable file format, for example ELF on Linux, Mach-O on macOS, and PE on Windows.
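
To illustrate what an external reference is, the sketch below shows a function that is declared in one place and defined in another. In practice the two parts would live in separate source files, and the file names main.c and util.c and the function names are hypothetical. They are shown in one block here only so that the example compiles on its own.

```c
#include <assert.h>

/* --- what would normally be in main.c --- */

/* Declaration only: it tells the compiler that add exists somewhere.
   In the object file made from main.c, the call below stays an
   unresolved external reference. */
extern int add(int a, int b);

int compute(void)
{
    int result = add(2, 3);
    assert(result == 5);
    return result;
}

/* --- what would normally be in util.c --- */

/* Definition: the linker connects the reference above to this code. */
int add(int a, int b)
{
    return a + b;
}
```

If the two parts were in separate files and only the first were given to the linker, linking would fail with an undefined reference to add.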

Linking can be performed by using the ld command, or by passing the object files to gcc or cc. For example, the following source file:

/*source.c file */
#include<math.h>
int main(void){
  double number	= sqrt(2.9);
}

can be converted to object code using:

$ gcc -c source.c

The object code can then be linked against the C math library and made into an executable file by issuing the command:

$ gcc source.o -lm -o executable_file_name

Final notes

A compiler can perform all these steps at once. For example, issuing gcc source.c or cc source.c translates the source file directly into an executable file, which is named a.out by default when the -o option is not given. Multiple source files can be passed to gcc or cc.
