courses/compiler/chapter_3.md
Simon From Jakobsen 153c71d78e add to chapter 3
2024-08-28 13:46:00 +00:00

Parser

In this chapter I'll show how I would make a parser.

A parser, together with our lexer, transforms the input program from text, meaning an unstructured sequence of characters, into a structured representation. Structured means the representation tells us about the different constructs in the program, such as if statements and expressions.

Abstract Syntax Tree (AST)

The result of parsing is a tree structure representing the input program.

This structure is a recursive acyclic structure storing the different parts of the program.

This is how I would define an AST data type.

type Stmt = {
    kind: StmtKind,
    pos: Pos,
};

type StmtKind =
    | { type: "error" }
    // ...
    | { type: "let", ident: string, value: Expr }
    // ...
    ;

type Expr = {
    kind: ExprKind,
    pos: Pos,
};

type ExprKind =
    | { type: "error" }
    // ...
    | { type: "int", value: number }
    // ...
    ;

Both Stmt (statement) and Expr (expression) are polymorphic types, meaning an expression, for example, can be either an addition operation containing two inner expressions or an integer expression containing the integer value, and so on. This can also be implemented with classes and subclasses.

For both Stmt and Expr there's an error kind. This makes the parser simpler, as we won't need to handle parsing failures differently from successful parses.
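To make the shape of these types concrete, here is a minimal sketch that builds a let statement by hand and prints it in a debug format. The fields of Pos are an assumption (the chapter only mentions index, line and col), the variants are cut down to the ones shown above, and exprToString and stmtToString are hypothetical helpers that are not part of the parser.

```typescript
// Assumed shape of Pos: the chapter only mentions index, line and col.
type Pos = { index: number; line: number; col: number };

type Expr = { kind: ExprKind; pos: Pos };
type ExprKind =
    | { type: "error" }
    | { type: "int"; value: number };

type Stmt = { kind: StmtKind; pos: Pos };
type StmtKind =
    | { type: "error" }
    | { type: "let"; ident: string; value: Expr };

// Hypothetical helper: renders an expression in a debug format.
function exprToString(expr: Expr): string {
    const kind = expr.kind;
    if (kind.type === "int") {
        return `Int(${kind.value})`;
    }
    // The error kind is handled like any other variant.
    return "Error";
}

// Hypothetical helper: renders a statement in a debug format.
function stmtToString(stmt: Stmt): string {
    const kind = stmt.kind;
    if (kind.type === "let") {
        return `Let(${kind.ident}, ${exprToString(kind.value)})`;
    }
    return "Error";
}

const pos: Pos = { index: 0, line: 1, col: 1 };

// A let statement whose value is the integer 5.
const good: Stmt = {
    kind: { type: "let", ident: "x", value: { kind: { type: "int", value: 5 }, pos } },
    pos,
};

// The same statement, but the value failed to parse.
const broken: Stmt = {
    kind: { type: "let", ident: "x", value: { kind: { type: "error" }, pos } },
    pos,
};

console.log(stmtToString(good)); // Let(x, Int(5))
console.log(stmtToString(broken)); // Let(x, Error)
```

Note that the broken statement is still a well-formed tree. Code that walks the AST only needs one extra case for the error kind, rather than a separate failure path.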

Consumer of lexer

To start, we'll implement a Parser class, which for now is simply a consumer of a token iterator, meaning the lexer. In simple terms, whereas the lexer is a transformation from text to tokens, the parser is a transformation from tokens to an AST, except that the parser is not an iterator.

class Parser {
    private currentToken: Token | null;

    public constructor(private lexer: Lexer) {
        this.currentToken = lexer.next();
    }
    // ...
    private step() { this.currentToken = this.lexer.next() }
    private done(): boolean { return this.currentToken == null; }
    private current(): Token { return this.currentToken!; }
    // ...
}

This implementation should look familiar from the lexer. We use currentToken as a 'buffer', and then just call .next() on the lexer.

Just like the lexer, we'll have a .pos() method, returning the current position.

For convenience, although there are other ways of doing it, we'll implement another public method on Lexer, which will return the lexer's current position.

class Lexer {
    // ...
    public currentPos(): Pos { return this.pos(); }
    // ...
}

The reason is that when the lexer has reached the end of the file, the .next() method will return null instead of a token with a position, meaning we won't get the position after the last token.

class Parser {
    // ...
    private pos(): Pos {
        if (this.done())
            return this.lexer.currentPos();
        return this.current().pos;
    }
    // ...
}

The parser does not need to keep track of index, line and col, as those are stored in the tokens. The token's position is preferred to the lexer's.
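The fallback can be seen in isolation with a small sketch. Everything here other than the pos(), step(), done() and current() logic is an assumption made for illustration: StubLexer is a hypothetical stand-in for the real Lexer that hands out prepared tokens, the Pos and Token shapes are guessed from the text above, and positions() is a hypothetical public helper that exists only so the result of pos() can be observed.

```typescript
// Assumed shapes: the chapter only mentions index, line and col.
type Pos = { index: number; line: number; col: number };
type Token = { type: string; pos: Pos };

// Hypothetical stand-in for the real Lexer: hands out prepared tokens.
class StubLexer {
    private tokens: Token[];
    private endPos: Pos;
    private i = 0;

    public constructor(tokens: Token[], endPos: Pos) {
        this.tokens = tokens;
        this.endPos = endPos;
    }

    public next(): Token | null {
        const token = this.tokens[this.i];
        if (token === undefined) return null;
        this.i += 1;
        return token;
    }

    // Once all tokens are consumed, this is the end-of-input position.
    public currentPos(): Pos {
        return this.tokens[this.i]?.pos ?? this.endPos;
    }
}

class Parser {
    private lexer: StubLexer;
    private currentToken: Token | null;

    public constructor(lexer: StubLexer) {
        this.lexer = lexer;
        this.currentToken = lexer.next();
    }

    private step() { this.currentToken = this.lexer.next(); }
    private done(): boolean { return this.currentToken == null; }
    private current(): Token { return this.currentToken!; }

    private pos(): Pos {
        if (this.done())
            return this.lexer.currentPos();
        return this.current().pos;
    }

    // Hypothetical helper: records pos() at every token and once more at the end.
    public positions(): Pos[] {
        const result: Pos[] = [];
        while (!this.done()) {
            result.push(this.pos());
            this.step();
        }
        result.push(this.pos()); // no token left, so this comes from the lexer
        return result;
    }
}

const lexer = new StubLexer(
    [
        { type: "int", pos: { index: 0, line: 1, col: 1 } },
        { type: "int", pos: { index: 2, line: 1, col: 3 } },
    ],
    { index: 3, line: 1, col: 4 },
);
const positions = new Parser(lexer).positions();
const joined = positions.map((p) => `${p.line}:${p.col}`).join(" ");
console.log(joined); // 1:1 1:3 1:4
```

The first two positions come from the tokens themselves. Only the last one, taken after .next() has returned null, comes from the lexer.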

Also like the lexer, we'll have a .test() method in the parser, which will test for token type rather than strings or regex.

class Parser {
    // ...
    private test(type: string): boolean {
        return !this.done() && this.current().type === type;
    }
    // ...
}

When testing, we first check that we have not reached the end. Either we do that here, or every caller has to write something like !this.done() && this.test(...), and it's easy to do it here.

We'll also want a method for reporting errors.

class Parser {
    // ...
    private report(pos: Pos, msg: string) {
        console.log(`Parser: ${msg} at ${pos.line}:${pos.col}`);
    }
    // ...
}

Operands

Operands are the individual parts of an operation. For example, in the math expression a + b (which would be + a b in the input language), a and b are the operands, while + is the operator. In the expression a + b * c, the operands are a, b and c. But in the expression a * (b + c), the operands of the multiplication are a and (b + c). (b + c) is an operand because it is enclosed on both sides. This is how we'll define operands.

We'll make a public method in Parser called parseOperand.

class Parser {
    // ...
    public parseOperand(): Expr {
        const pos = this.pos();
        if (this.test("int")) {
            const value = this.current().intValue;
            this.step();
            return { kind: { type: "int", value }, pos };
        }
        this.report(pos, "expected expr");
        this.step();
        return { kind: { type: "error" }, pos };
    }
    // ...
}
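To see how parseOperand behaves on both valid and invalid input, the pieces so far can be put together in one runnable sketch. Several things here are assumptions rather than the chapter's actual definitions: StubLexer is a hypothetical replacement for the real Lexer, Token is assumed to carry an optional intValue field, the token type "+" is made up for the example, and the `?? 0` fallback exists only to satisfy the type checker.

```typescript
// Assumed shapes, not the chapter's actual definitions.
type Pos = { index: number; line: number; col: number };
type Token = { type: string; pos: Pos; intValue?: number };
type Expr = { kind: ExprKind; pos: Pos };
type ExprKind = { type: "error" } | { type: "int"; value: number };

// Hypothetical stand-in for the real Lexer: hands out prepared tokens.
class StubLexer {
    private tokens: Token[];
    private endPos: Pos;
    private i = 0;
    public constructor(tokens: Token[], endPos: Pos) {
        this.tokens = tokens;
        this.endPos = endPos;
    }
    public next(): Token | null {
        const token = this.tokens[this.i];
        if (token === undefined) return null;
        this.i += 1;
        return token;
    }
    public currentPos(): Pos {
        return this.tokens[this.i]?.pos ?? this.endPos;
    }
}

class Parser {
    private lexer: StubLexer;
    private currentToken: Token | null;
    public constructor(lexer: StubLexer) {
        this.lexer = lexer;
        this.currentToken = lexer.next();
    }

    public parseOperand(): Expr {
        const pos = this.pos();
        if (this.test("int")) {
            // Assumption: int tokens always carry intValue; 0 is only a fallback.
            const value = this.current().intValue ?? 0;
            this.step();
            return { kind: { type: "int", value }, pos };
        }
        this.report(pos, "expected expr");
        this.step();
        return { kind: { type: "error" }, pos };
    }

    private step() { this.currentToken = this.lexer.next(); }
    private done(): boolean { return this.currentToken == null; }
    private current(): Token { return this.currentToken!; }
    private pos(): Pos {
        if (this.done())
            return this.lexer.currentPos();
        return this.current().pos;
    }
    private test(type: string): boolean {
        return !this.done() && this.current().type === type;
    }
    private report(pos: Pos, msg: string) {
        console.log(`Parser: ${msg} at ${pos.line}:${pos.col}`);
    }
}

// Hypothetical helper for building positions on line 1.
const at = (col: number): Pos => ({ index: col - 1, line: 1, col });

const parser = new Parser(new StubLexer(
    [
        { type: "int", pos: at(1), intValue: 5 },
        { type: "+", pos: at(3) }, // made-up token type that is not an operand
    ],
    at(4),
));

const first = parser.parseOperand(); // int expression with value 5
const second = parser.parseOperand(); // reports at 1:3, returns an error expression
const third = parser.parseOperand(); // end of input: reports at 1:4, returns an error expression
console.log(first.kind.type, second.kind.type, third.kind.type);
```

In every case the caller gets back an Expr, and the parser has either moved past the token it could not use or already reached the end. That is what lets later parsing code continue after an error instead of stopping at the first one.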

Integer

Parsing an integer is a 1:1 translation between the integer token and an integer expression.

type ExprKind =
    // ...
    | { type: "int", value: number }
    // ...
    ;
class Parser {
    // ...
    public parseOperand(): Expr {
        // ...
        if (this.test("int")) {
            const value = this.current().intValue;
            this.step();
            return { kind: { type: "int", value }, pos };
        }
        // ...
    }
    // ...
}