
Harshit Singh

Jan 17, 2025 • 5 min read

How Does a Compiler Work?

The functionality behind a compiler and its architecture

We have all been using compilers since we started programming, but we rarely stop to ask how the tool that turns our code into a running program actually works. A compiler is specialized software that translates source code written in a high-level programming language (like C, Java, or Python) into lower-level code that a computer can run, typically machine code (or bytecode for a virtual machine, as with Java and Python). This process enables a computer to understand and execute the instructions written by a programmer. Compilers play a critical role in software development by bridging the gap between human-readable source code and the binary instructions a machine executes.

The process of compiling code is complex and involves multiple stages. In this article, we’ll walk through the steps involved in compilation, focusing on key concepts like lexical analysis, syntax analysis, and assembly.

Step 1: Lexical Analysis (Scanning)

The first stage of the compilation process is lexical analysis or scanning. In this step, the compiler takes the raw source code and breaks it down into a series of tokens. Tokens are the smallest units of meaningful code, such as keywords (if, for, while), operators (+, -, *), variables, constants, and punctuation (like parentheses and semicolons).

Example:

Consider the following simple line of code in C:

int a = 5 + 3;

Lexical analysis breaks this down into tokens:

int (keyword)
a (identifier)
= (assignment operator)
5 (integer literal)
+ (addition operator)
3 (integer literal)
; (semicolon)

The purpose of lexical analysis is to ensure that the input source code is correctly structured at the lowest level. The lexical analyzer will report an error if the code contains illegal characters (like unrecognized symbols).
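
The scanning step can be sketched in a few lines of Python. This is a minimal illustration for a tiny C-like subset, not how a production compiler is written; the names TOKEN_SPEC, MASTER, and tokenize, and the token categories themselves, are assumptions made for this example.

```python
import re

# Hypothetical token categories for a tiny C-like subset.
# Order matters: keywords must be tried before identifiers.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|if|for|while|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("INTEGER",    r"\d+"),
    ("OPERATOR",   r"[+\-*/=]"),
    ("PUNCT",      r"[;(){}]"),
    ("SKIP",       r"\s+"),   # whitespace is matched but discarded
    ("MISMATCH",   r"."),     # any other character is illegal
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (kind, text) pairs, or raise on an illegal character."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"illegal character {text!r} at offset {match.start()}")
        tokens.append((kind, text))
    return tokens

print(tokenize("int a = 5 + 3;"))
```

Calling tokenize("int a = 5 @ 3;") raises a SyntaxError for the "@", which mirrors the illegal-character error described above.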

Step 2: Syntax Analysis (Parsing)

After the lexical analyzer has identified the tokens, the next step is syntax analysis, or parsing. In this phase, the compiler checks whether the sequence of tokens follows the grammatical rules of the programming language, that is, whether the tokens can be combined into valid constructs such as declarations, expressions, and statements.

A syntax tree is created during this step. It represents the hierarchical structure of the code according to the language's grammar.

For example, the following C code:

int a = 5 + 3;

After syntax analysis, the parse tree might look like this:

Assignment
 ├── Type: int
 ├── Variable: a
 └── Expression
      ├── Integer: 5
      ├── Operator: +
      └── Integer: 3

The parse tree helps the compiler understand the structure and the relationships between different parts of the code. If the code contains any syntax errors (e.g., missing parentheses, unmatched braces), the syntax analyzer will raise an error.
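
As a rough sketch of how a parser builds such a tree, the Python function below handles only declarations of the form "int name = integer (+ or - integer)... ;". It takes a list of (kind, text) token pairs as input; the token kinds, the function name parse_declaration, and the dictionary layout of the tree are all assumptions chosen for this example.

```python
def parse_declaration(tokens):
    """Parse `int <name> = <int> ((+|-) <int>)* ;` into a nested dict."""
    pos = 0

    def expect(kind, text=None):
        # Consume one token of the given kind (and text, if given) or fail.
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError(f"unexpected end of input, expected {text or kind}")
        tok_kind, tok_text = tokens[pos]
        if tok_kind != kind or (text is not None and tok_text != text):
            raise SyntaxError(f"expected {text or kind}, found {tok_text!r}")
        pos += 1
        return tok_text

    def parse_expression():
        # Left-associative chain of integer literals joined by + or -.
        node = {"node": "Integer", "value": int(expect("INTEGER"))}
        while pos < len(tokens) and tokens[pos] in (("OPERATOR", "+"), ("OPERATOR", "-")):
            op = expect("OPERATOR")
            right = {"node": "Integer", "value": int(expect("INTEGER"))}
            node = {"node": "BinaryOp", "op": op, "left": node, "right": right}
        return node

    var_type = expect("KEYWORD", "int")
    name = expect("IDENTIFIER")
    expect("OPERATOR", "=")
    expression = parse_expression()
    expect("PUNCT", ";")
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos][1]!r}")
    return {"node": "Assignment", "type": var_type, "variable": name,
            "expression": expression}

tokens = [("KEYWORD", "int"), ("IDENTIFIER", "a"), ("OPERATOR", "="),
          ("INTEGER", "5"), ("OPERATOR", "+"), ("INTEGER", "3"), ("PUNCT", ";")]
print(parse_declaration(tokens))
```

Dropping the final semicolon token from the list makes the function raise a SyntaxError, which is the kind of error a syntax analyzer reports.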

Step 3: Semantic Analysis

Semantic analysis checks that the code is meaningful once its syntax is known to be valid. In this phase, the compiler verifies that the operations in the program make sense under the language's rules. For example, it checks for:

Type mismatches (e.g., adding a string to an integer).
Variable scope (e.g., using a variable before declaring it).
Correct function calls and returns.

If a program violates semantic rules, the compiler will flag these errors.

Example:

int x = "Hello";

In this case, the compiler will flag a type mismatch because "Hello" is a string literal, not an integer. Depending on the C compiler and its settings, this is reported as an error or as a warning.
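
A simplified way to picture these checks is a symbol table that records each declared name and its type. The Python sketch below is an illustration only: the statement format, the function name analyze, and the type names are assumptions, and real compilers perform these checks on the syntax tree rather than on tuples.

```python
def analyze(statements):
    """Check a list of statements and return a list of error messages.

    Each statement is either ("declare", type_name, variable, literal)
    or ("use", variable). This format is hypothetical.
    """
    literal_types = {int: "int", str: "string"}
    symbols = {}   # variable name -> declared type
    errors = []
    for statement in statements:
        if statement[0] == "declare":
            _, declared_type, name, literal = statement
            symbols[name] = declared_type
            actual_type = literal_types[type(literal)]
            if actual_type != declared_type:
                errors.append(
                    f"type mismatch: '{name}' is {declared_type} "
                    f"but initializer is {actual_type}")
        elif statement[0] == "use":
            _, name = statement
            if name not in symbols:
                errors.append(f"'{name}' used before declaration")
    return errors

# Models `int x = "Hello";` followed by a use of an undeclared variable y.
print(analyze([("declare", "int", "x", "Hello"), ("use", "y")]))
```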

Step 4: Intermediate Code Generation

Once the code is semantically valid, the compiler translates it into an intermediate representation (IR). The purpose of the intermediate code is to provide a platform-independent representation of the source code that can be further optimized and then translated into machine code.

There are different types of intermediate representations, such as:

Three-address code: A low-level representation where each instruction uses at most three operands.
Abstract Syntax Tree (AST): A tree structure that abstracts the code's syntax.

For example, the code a = 5 + 3 might be represented as:

t1 = 5 + 3
a = t1

This intermediate code simplifies optimization and translation into machine code later in the compilation process.
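
To show how a nested expression can be flattened into three-address code, here is a small Python sketch. The tuple-based expression format and the function name to_three_address are assumptions for this example; temporaries are named t1, t2, and so on, following the notation used above.

```python
import itertools

def to_three_address(target, expr):
    """Flatten `target = expr` into a list of three-address instructions.

    `expr` is an int literal, a variable name (str), or a tuple
    (op, left, right). This representation is hypothetical.
    """
    counter = itertools.count(1)
    code = []

    def emit(node):
        if not isinstance(node, tuple):
            return str(node)   # a literal or variable is already an operand
        op, left, right = node
        left_operand = emit(left)
        right_operand = emit(right)
        temp = f"t{next(counter)}"
        code.append(f"{temp} = {left_operand} {op} {right_operand}")
        return temp

    result = emit(expr)
    code.append(f"{target} = {result}")
    return code

print(to_three_address("a", ("+", 5, 3)))
# ['t1 = 5 + 3', 'a = t1']
print(to_three_address("x", ("*", ("+", "a", "b"), 2)))
# ['t1 = a + b', 't2 = t1 * 2', 'x = t2']
```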

Step 5: Optimization

At this point, the compiler attempts to optimize the intermediate code. Optimization can occur at multiple levels:

Loop optimization: Improving how loops are executed.
Constant folding: Precomputing constant expressions like 5 + 3 into 8.
Dead code elimination: Removing parts of the code that do not affect the program’s output.

Optimizations can improve the generated machine code, for example by making it run faster or use less memory.
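
Constant folding is the easiest of these to sketch. The Python function below walks an expression and replaces any operation on two integer constants with its result. The tuple format (op, left, right) and the name fold_constants are assumptions for illustration; division is left out on purpose to sidestep division by zero and C's integer-division rules.

```python
import operator

# Operators this sketch knows how to evaluate at compile time.
FOLDABLE = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(expr):
    """Return `expr` with constant subexpressions precomputed."""
    if not isinstance(expr, tuple):
        return expr   # a literal or a variable name: nothing to fold
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, int) and isinstance(right, int) and op in FOLDABLE:
        return FOLDABLE[op](left, right)
    return (op, left, right)

print(fold_constants(("+", 5, 3)))               # 8
print(fold_constants(("*", "x", ("+", 5, 3))))   # ('*', 'x', 8)
```

The second call shows that folding applies to constant subexpressions even when the whole expression depends on a variable.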

Step 6: Code Generation

The next phase is code generation, where the compiler translates the optimized intermediate code into machine code or assembly code. The output of this phase is specific to the target architecture, such as x86, ARM, or MIPS.

For example, the intermediate instruction:

a = 5 + 3

might be translated into the following x86 assembly code:

MOV EAX, 5      ; Load 5 into the EAX register
ADD EAX, 3      ; Add 3 to EAX
MOV [a], EAX    ; Store the result (8) into variable a

The generated assembly code consists of low-level instructions that can be directly executed by the CPU (after being assembled into binary code).
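
A very reduced picture of code generation is a function that maps one intermediate instruction to a fixed pattern of assembly lines. The Python sketch below emits x86-style text for "dest = left op right" using only the EAX register. The names generate, operand, and MNEMONICS are assumptions, and a real code generator also handles register allocation, instruction selection, and addressing modes, none of which appear here.

```python
# Hypothetical mapping from IR operators to x86 mnemonics.
MNEMONICS = {"+": "ADD", "-": "SUB"}

def operand(value):
    """Format an integer as an immediate and a variable name as a memory operand."""
    return str(value) if isinstance(value, int) else f"[{value}]"

def generate(dest, left, op, right):
    """Emit x86-style assembly text for `dest = left op right`."""
    return [
        f"MOV EAX, {operand(left)}",               # load the left operand
        f"{MNEMONICS[op]} EAX, {operand(right)}",  # apply the operation
        f"MOV [{dest}], EAX",                      # store the result
    ]

print("\n".join(generate("a", 5, "+", 3)))
```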

Step 7: Assembly and Linking

In the final stages, the assembly code is assembled into an object file (.obj or .o) containing the machine code for the program. However, an object file alone is not a complete, executable program. It needs to be linked with other object files and libraries to produce a final executable file.

The linker takes care of resolving references between different parts of the program (e.g., function calls or variable references) and combining multiple object files into a single executable. The linker also links in system libraries, which provide essential functionality like input/output or memory management.
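
The core of symbol resolution can be modeled in Python as matching every referenced name to exactly one definition. This is a toy model under stated assumptions: object files are plain dictionaries with "defines" and "references" lists, and the file names, symbol names, and the function name link are hypothetical. A real linker also relocates addresses and merges sections.

```python
def link(object_files):
    """Return {symbol: defining file}, or raise on a missing or duplicate symbol."""
    definitions = {}
    for file_name, obj in object_files.items():
        for symbol in obj["defines"]:
            if symbol in definitions:
                raise ValueError(
                    f"duplicate symbol '{symbol}' in {definitions[symbol]} and {file_name}")
            definitions[symbol] = file_name
    for file_name, obj in object_files.items():
        for symbol in obj["references"]:
            if symbol not in definitions:
                raise ValueError(f"undefined reference to '{symbol}' in {file_name}")
    return definitions

objects = {
    "main.o": {"defines": ["main"], "references": ["add", "printf"]},
    "math.o": {"defines": ["add"], "references": []},
    "libc":   {"defines": ["printf"], "references": []},
}
print(link(objects))
```

Removing "math.o" from the dictionary makes link raise an error for the unresolved "add", similar in spirit to the "undefined reference" errors linkers report.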

Conclusion

A compiler is a powerful tool that converts high-level source code into machine-executable code. Compilation involves several stages, each with its own purpose:

Lexical analysis (tokenization) breaks down the source code into tokens.
Syntax analysis ensures the code follows the grammar of the programming language.
Semantic analysis checks that the code is meaningful, for example that types and variable scopes are consistent.
Intermediate code generation creates a platform-independent representation of the code.
Optimization improves the performance of the intermediate code.
Code generation produces machine-specific assembly or machine code.
Assembly and linking convert the code into a final executable program.

Each of these phases plays a critical role in producing efficient, correct, and optimized code, making compilers an essential component of modern software development.
