Compiler

A compiler is a computer program that translates code written in one programming language (the source language) into another language (the target language).

4 phases of compilation (for C++)

  1. Preprocessing: Expands the code (macro expansion, #include expansion, comment removal and conditional compilation).
  2. Compilation: Translates the preprocessed code into assembly language, which an assembler can understand.
  3. Assembly: The assembler translates the assembly code into machine code, stored in an object file.
  4. Linking: Links the various modules of the code (e.g. resolves function calls across object files and libraries) and finally delivers the executable. Pay attention to .a (static libraries) and .so (shared libraries).
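
To make the four phases concrete, here is a minimal sketch. The file names phases.cc and main.o are assumptions for illustration; the g++ flags -E, -S and -c stop the driver after preprocessing, compilation and assembly respectively.

```cpp
// phases.cc (hypothetical file name)
// Stopping g++ after each phase:
//   g++ -E phases.cc -o phases.ii      # 1. preprocess only
//   g++ -S phases.cc -o phases.s       # 2. compile to assembly
//   g++ -c phases.cc -o phases.o       # 3. assemble into an object file
//   g++ phases.o main.o -o phases      # 4. link (main.o is a hypothetical object file defining main)

#define SQUARE(x) ((x) * (x))   // macro: expanded textually by the preprocessor

#ifdef USE_DOUBLE               // conditional compilation: only one branch survives
using number = double;
#else
using number = int;
#endif

// After preprocessing, the return statement reads: return ((n) * (n));
number square(number n) {
    return SQUARE(n);
}
```

Looking at the output of the -E step is a quick way to see that the macro and the #ifdef are gone before the compiler proper ever runs.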

A frontend parses the code and produces an intermediate representation, which then gets transformed. Optimization and code generation come later.

What is a Compiler?

It is a two-sided problem: humans need to write code, and machines need to run the programs that the humans wrote. For lots of reasons, humans don't want to write in binary or think about every piece of hardware, and there are many kinds of hardware. Compilers are the art of letting humans think at the level of abstraction they want and then getting that program to run on a specific piece of hardware. On one side there are many kinds of hardware (x86, ARM, PowerPC, GPUs), and on the other side many programming languages. The compiler covers the whole path from one end to the other.

Compilation happens in multiple phases, and the software engineering challenge is to get maximum reuse out of the code you write, because compilers are very complicated. First there is the frontend, or parser, which is language specific: a C or C++ parser (that's what Clang is), a Python parser, and so on. Then there is a middle part, which is often the optimizer. Finally there is a late part, which is hardware specific. There can be many layers, but these three big groups are very common.

LLVM tries to standardize the middle and last parts. Many languages compile through it: Swift, Julia, Rust, and Clang for C, C++ and Objective-C. They can all use the same optimization infrastructure, which gives better performance, and the same code generation for hardware support. LLVM is the common layer that all these language-specific compilers can share.
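
The reuse argument can be sketched with a toy pipeline. Everything here is hypothetical and unrelated to LLVM's real API: two frontends for two made-up syntaxes produce the same tiny IR, one shared "optimizer" folds the constant operation, and two backends print made-up pseudo-assembly.

```cpp
#include <string>

// Toy shared IR: a single binary operation on two single-digit constants.
struct IR { char op; int lhs; int rhs; };

// Two hypothetical frontends: each understands one surface syntax only.
IR parse_infix(const std::string& s)  { return {s[1], s[0] - '0', s[2] - '0'}; }  // e.g. "2+3"
IR parse_prefix(const std::string& s) { return {s[0], s[1] - '0', s[2] - '0'}; }  // e.g. "+23"

// Shared middle part: evaluate the constant operation at compile time.
int fold(const IR& ir) {
    switch (ir.op) {
        case '+': return ir.lhs + ir.rhs;
        case '*': return ir.lhs * ir.rhs;
        default:  return 0;  // other operators are out of scope for this sketch
    }
}

// Two hypothetical backends: made-up target syntaxes, not real instruction sets.
std::string emit_target_a(int value) { return "mov r0, " + std::to_string(value); }
std::string emit_target_b(int value) { return "LOAD_CONST " + std::to_string(value); }
```

With two frontends and two backends there are four language/target combinations, but fold is written only once. That is the sense in which a common middle layer maximizes reuse.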

Why does it produce the assembly code first?

Note

Note that neither direct compilation nor assembly actually produces an executable. That is done by the linker, which takes the various object files produced by compilation/assembly, resolves all the names they contain, and produces the final executable binary.

A compiler does recognition, but it also translates the program into a different output language.

How to turn C++ program down to the machine

C++ by itself is really complicated: intricate semantics, a huge amount of history, and compounding decisions as more features keep being added.

Frontend challenges (Clang): the project looked at GCC, which was at the time the industry-standard compiler. GCC was full of global variables, which made it hard to reuse. The goals were to make error messages better than GCC's, to improve compile time (make it efficient), and to make new tools and other analysis tools possible. Clang pushed C and C++ forward in terms of tooling.

There are different phases on the way from C++ to machine code. The code goes through 100+ passes, organized in very complicated ways, which affects the generated code, its performance, the compile time and many other things.

The parser usually produces a tree, called an abstract syntax tree (AST). The idea is that you have a node for each construct: a node for a +, or for a function call a node holding the function being called and the arguments passed. This gets lowered into what's called an intermediate representation. LLVM has one, organized as a control flow graph. Each operation in the program is represented as something very simple: add two numbers, multiply, or call. These get put into what are called blocks, so you get blocks of straight-line operations instead of a tree. This representation is much more language independent.
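
A minimal sketch of lowering a tree into straight-line operations, assuming a toy AST with only variables and binary operators. The Node type, the helper names and the %t temporaries are made up for illustration; the output only loosely resembles LLVM IR and covers a single block with no branches.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// AST node: either a leaf (variable name) or a binary operator with two children.
struct Node {
    std::string text;                // variable name or operator mnemonic
    std::unique_ptr<Node> lhs, rhs;  // both null for a leaf
};

std::unique_ptr<Node> leaf(std::string name) {
    auto n = std::make_unique<Node>();
    n->text = std::move(name);
    return n;
}

std::unique_ptr<Node> binary(std::string op, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    auto n = std::make_unique<Node>();
    n->text = std::move(op);
    n->lhs = std::move(l);
    n->rhs = std::move(r);
    return n;
}

// Lowers the tree into one block of straight-line instructions.
// Returns the name that holds the value of node n.
std::string lower(const Node& n, std::vector<std::string>& block) {
    if (!n.lhs) return n.text;  // leaf: already a named value
    std::string a = lower(*n.lhs, block);
    std::string b = lower(*n.rhs, block);
    std::string tmp = "%t" + std::to_string(block.size());
    block.push_back(tmp + " = " + n.text + " " + a + ", " + b);
    return tmp;
}
```

For the expression a + b * c, built as binary("add", leaf("a"), binary("mul", leaf("b"), leaf("c"))), the nested tree becomes a flat sequence: the multiply is emitted first, then the add that consumes its result.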

Hard thing

The hard thing about compilers is the software engineering: making it possible for hundreds of people to collaborate on really detailed low-level work. That is really hard, and LLVM has done it well.

C++ Separate Compilation

In C++, the separation of interface (declaration) and implementation allows the compiler to work in separate translation units:

  1. During the compilation process, each .cc file is compiled independently into an object file .o (which contains machine code). The compiler only needs to know the function declarations.
  2. During the linking phase, the linker combines all the object files and resolves the references to functions and variables. This is where the linker finds the actual implementations of the functions.
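
The two steps can be sketched as follows. The file names (math_utils.h, user.cc, math_utils.cc) and function names are assumptions for illustration. The three parts are shown in one listing so it compiles as a single file; in a real project each would live in its own file.

```cpp
// ---- math_utils.h (hypothetical): interface only ----
int add(int a, int b);   // declaration: enough for any caller to compile

// ---- user.cc (hypothetical): compiled with only the declaration visible ----
//   g++ -std=c++14 -c user.cc         -> user.o, with an unresolved reference to add
int twice_sum(int a, int b) {
    return 2 * add(a, b);
}

// ---- math_utils.cc (hypothetical): the implementation, compiled separately ----
//   g++ -std=c++14 -c math_utils.cc   -> math_utils.o, which defines add
int add(int a, int b) {
    return a + b;
}

// Linking the object files (together with one that defines main) is the step
// that resolves the reference to add and produces the executable.
```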

object file .o

Note that the object file is not directly executable. This is because it hasn’t been linked yet.

This separation allows for modular development and makes it possible to compile and link code independently, even when functions are declared in one translation unit and implemented in another.

Example seen in CS247

Specifically in CS247 Lecture 6.

In List.h (original version), we provided only a forward declaration of struct Node and provided its definition in the .cc file.

This is especially useful for separate compilation.
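
A sketch of what this could look like. The lecture code itself is not reproduced here, so the nesting of Node inside List and the member names (head, push_front, front) are assumptions. Both files are shown in one listing.

```cpp
// ---- List.h (sketch) ----
class List {
    struct Node;            // forward declaration: clients never see Node's layout
    Node* head = nullptr;   // a pointer to an incomplete type is allowed
public:
    List() = default;
    List(const List&) = delete;             // copying is left out of this sketch
    List& operator=(const List&) = delete;
    ~List();
    void push_front(int value);
    int front() const;      // precondition: the list is not empty
};

// ---- List.cc (sketch) ----
struct List::Node {         // the full definition lives only in the .cc file
    int data;
    Node* next;
};

List::~List() {
    while (head) {
        Node* rest = head->next;
        delete head;
        head = rest;
    }
}

void List::push_front(int value) { head = new Node{value, head}; }

int List::front() const { return head->data; }
```

Because Node's fields appear only in List.cc, changing them requires recompiling List.cc alone; files that include List.h are unaffected.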

g++ -std=c++14 List.cc -c
  • The -c compiler flag lets us produce an object file .o

We use object files to store the result of compilation and reuse them if the .cc files haven't changed. With many files, this gives a significant speedup while developing. We prefer to put implementations in .cc files for the following reasons:

  1. List.cc changes → recompile List.cc into List.o
  2. List.h changes → all .cc files that include List.h must recompile
  3. A.h changes, which includes List.h → any .cc files that include A.h must recompile

At the end, relink all .o files together.
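
The rules above amount to a reachability check over the include graph. Here is a minimal sketch of that idea, not how any real build tool is implemented. The IncludeGraph type and function names are hypothetical, and the recursion assumes the include graph has no cycles.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Direct includes of each file (hypothetical project layout).
using IncludeGraph = std::map<std::string, std::vector<std::string>>;

// True if `file` is `changed` itself or includes it, directly or through other headers.
// Assumes there are no include cycles.
bool depends_on(const IncludeGraph& g, const std::string& file, const std::string& changed) {
    if (file == changed) return true;
    auto it = g.find(file);
    if (it == g.end()) return false;
    for (const auto& inc : it->second) {
        if (depends_on(g, inc, changed)) return true;
    }
    return false;
}

// All .cc files that must be recompiled after `changed` is edited.
std::set<std::string> to_recompile(const IncludeGraph& g, const std::string& changed) {
    std::set<std::string> out;
    for (const auto& entry : g) {
        const std::string& file = entry.first;
        bool is_source = file.size() > 3 && file.compare(file.size() - 3, 3, ".cc") == 0;
        if (is_source && depends_on(g, file, changed)) out.insert(file);
    }
    return out;
}
```

With List.cc including List.h, and A.cc and main.cc including A.h (which includes List.h), editing List.cc marks only List.cc, while editing List.h marks all three .cc files.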

Forward declaration is preferred

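We prefer forward declarations of classes where possible, because a header that does not include another header is not affected when that header changes. This minimizes recompilation.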
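
A small sketch of when a forward declaration is enough. The classes A and B and their members are hypothetical. B only stores a pointer to A, so B.h does not need A's definition and does not have to include A.h.

```cpp
// ---- B.h (sketch) ----
class A;                  // forward declaration instead of #include "A.h"

class B {
    A* a_ = nullptr;      // the size of a pointer is known without A's definition
public:
    void attach(A* a) { a_ = a; }
    bool attached() const { return a_ != nullptr; }
};

// ---- A.h (sketch) ----
class A {
public:
    int value = 0;
};
```

If B held an A by value, or called member functions of A inside B.h, the full definition would be required and the include could not be avoided.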