courses/compiler/chapter_3.md
Simon From Jakobsen 0528b6d80d prefix expressions
2024-09-13 13:02:29 +00:00

3 Parser

In this chapter I'll show how I would make a parser.

A parser, together with our lexer, transforms the input program from text, meaning an unstructured sequence of characters, into a structured representation. Structured means the representation tells us about the different constructs, such as if statements and expressions.

3.1 Abstract Syntax Tree (AST)

The result of parsing is a tree structure representing the input program.

This structure is a recursive acyclic structure storing the different parts of the program.

This is how I would define an AST data type.

type Stmt = {
    kind: StmtKind,
    pos: Pos,
};

type StmtKind =
    | { type: "error" }
    // ...
    | { type: "let", ident: string, value: Expr }
    // ...
    ;

type Expr = {
    kind: ExprKind,
    pos: Pos,
};

type ExprKind =
    | { type: "error" }
    // ...
    | { type: "int", value: number }
    // ...
    ;

Both Stmt (statement) and Expr (expression) are polymorphic types, meaning an expression, for example, can be either an addition operation containing 2 inner expressions or an integer expression containing the integer value, etc. This can also be implemented with classes and subclasses.

For both Stmt and Expr there's an error kind. This makes the parser simpler, as we won't need to handle parsing failures differently from successful parses.
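To make the shape concrete, here is a small sketch of how the statement `let a = 5;` could be represented with these types. The fields of `Pos` (`index`, `line`, `col`), the position values and the `describeStmt`/`describeExpr` helpers are assumptions made for illustration; only the "let", "int" and "error" kinds come from the definitions above.

```typescript
// Assumed position type, mirroring the index, line and col tracked by the lexer.
type Pos = { index: number; line: number; col: number };

type Expr = { kind: ExprKind; pos: Pos };
type ExprKind =
    | { type: "error" }
    | { type: "int"; value: number };

type Stmt = { kind: StmtKind; pos: Pos };
type StmtKind =
    | { type: "error" }
    | { type: "let"; ident: string; value: Expr };

// Hand-built AST for the source text `let a = 5;` (positions are made up).
const letStmt: Stmt = {
    kind: {
        type: "let",
        ident: "a",
        value: { kind: { type: "int", value: 5 }, pos: { index: 8, line: 1, col: 9 } },
    },
    pos: { index: 0, line: 1, col: 1 },
};

// Hypothetical helpers: switching on `type` narrows the union, so each
// branch only sees the fields that belong to that variant.
function describeExpr(expr: Expr): string {
    const kind = expr.kind;
    switch (kind.type) {
        case "error":
            return "<error>";
        case "int":
            return `${kind.value}`;
    }
}

function describeStmt(stmt: Stmt): string {
    const kind = stmt.kind;
    switch (kind.type) {
        case "error":
            return "<error>";
        case "let":
            return `let ${kind.ident} = ${describeExpr(kind.value)}`;
    }
}
```

Because the error kind is just another variant, code that walks the tree handles it in the same switch as every other node.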

3.2 Consumer of lexer

To start, we'll implement a Parser class, which for now is simply a consumer of a token iterator, meaning the lexer. In simple terms, whereas the lexer is a transformation from text to tokens, the parser is a transformation from tokens to an AST, except that the parser is not an iterator.

class Parser {
    private currentToken: Token | null;

    public constructor(private lexer: Lexer) {
        this.currentToken = lexer.next();
    }
    // ...
    private step() { this.currentToken = this.lexer.next() }
    private done(): boolean { return this.currentToken == null; }
    private current(): Token { return this.currentToken!; }
    // ...
}

This implementation should look familiar compared to the lexer. We use the currentToken as a 'buffer', and then just use the .next() on the lexer.

Just like the lexer, we'll have a .pos() method, returning the current position.

For convenience, although there are other ways of doing it, we'll implement another public method on Lexer, which will return the lexer's current position.

class Lexer {
    // ...
    public currentPos(): Pos { return this.pos(); }
    // ...
}

The reason is that when the lexer has reached the end of the file, the .next() method will return null instead of a token with a position, meaning we wouldn't otherwise be able to get the position after the last token.

class Parser {
    // ...
    private pos(): Pos {
        if (this.done())
            return this.lexer.currentPos();
        return this.current().pos;
    }
    // ...
}

The parser does not need to keep track of index, line and col, as those are stored in the tokens. The token's position is preferred to the lexer's.

Also like the lexer, we'll have a .test() method in the parser, which will test for token type rather than strings or regex.

class Parser {
    // ...
    private test(type: string): boolean {
        return !this.done() && this.current().type === type;
    }
    // ...
}

When testing, we first check that we have not reached the end. Either we have to do that here, or the caller will have to write something like !this.done() && this.test(...), and it's easy to do it here.
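The pieces so far can be tried out without a real lexer. The following is a sketch under a few assumptions: `ArrayLexer` is a hypothetical stand-in that hands out pre-made tokens and reports a fixed end position, `TokenCursor` holds only the stepping and testing logic of the parser, and the methods are public so they can be called from outside the class.

```typescript
type Pos = { index: number; line: number; col: number };
type Token = { type: string; pos: Pos };

// Hypothetical stand-in for the lexer: returns the given tokens one at a
// time and null once they run out. `endPos` plays the role of the lexer's
// position after the last token.
class ArrayLexer {
    private i = 0;
    public constructor(private tokens: Token[], private endPos: Pos) {}
    public next(): Token | null {
        const token = this.tokens[this.i];
        if (token === undefined) return null;
        this.i += 1;
        return token;
    }
    public currentPos(): Pos { return this.endPos; }
}

// The token-handling part of the parser, made public for the demo.
class TokenCursor {
    private currentToken: Token | null;
    public constructor(private lexer: ArrayLexer) {
        this.currentToken = lexer.next();
    }
    public step() { this.currentToken = this.lexer.next(); }
    public done(): boolean { return this.currentToken === null; }
    public current(): Token { return this.currentToken!; }
    public pos(): Pos {
        // Past the last token there is no token position to use,
        // so we fall back to the lexer's position.
        if (this.done()) return this.lexer.currentPos();
        return this.current().pos;
    }
    public test(type: string): boolean {
        return !this.done() && this.current().type === type;
    }
}

const cursor = new TokenCursor(
    new ArrayLexer(
        [
            { type: "let", pos: { index: 0, line: 1, col: 1 } },
            { type: "ident", pos: { index: 4, line: 1, col: 5 } },
        ],
        { index: 5, line: 1, col: 6 },
    ),
);
const sawLet = cursor.test("let");         // true
cursor.step();
const sawIdent = cursor.test("ident");     // true
cursor.step();
const atEnd = cursor.done();               // true
const testAfterEnd = cursor.test("ident"); // false, and no crash
const endCol = cursor.pos().col;           // 6, taken from the lexer
```

Note how .test() can safely be called after the last token, which is the reason for doing the .done() check inside it.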

We'll also want a method for reporting errors.

class Parser {
    // ...
    private report(msg: string, pos = this.pos()) {
        console.log(`Parser: ${msg} at ${pos.line}:${pos.col}`);
    }
    // ...
}

3.3 Operands

Operands are the individual parts of an operation. For example, in the math expression a + b (which would be + a b in the input language), a and b are the operands, while + is the operator. In the expression a + b * c, the operands are a, b and c. But in the expression a * (b + c), the operands of the multiply operation are a and (b + c). (b + c) is an operand because it is enclosed on both sides. This is how we'll define operands.

We'll make a public method in Parser called parseOperand.

class Parser {
    // ...
    public parseOperand(): Expr {
        const pos = this.pos();
        // ...
        this.report("expected expr", pos);
        this.step();
        return { kind: { type: "error" }, pos };
    }
    // ...
}

3.3.1 Identifiers and literals

Identifiers and literals (integers, strings) are single-token constructs, meaning the parsing consists of translating a token into an AST node with the value.

type ExprKind =
    // ...
    | { type: "ident", value: string }
    | { type: "int", value: number }
    | { type: "string", value: string }
    // ...
    ;

class Parser {
    // ...
    public parseOperand(): Expr {
        // ...
        if (this.test("ident")) {
            const value = this.current().identValue;
            this.step();
            return { kind: { type: "ident", value }, pos };
        }
        if (this.test("int")) {
            const value = this.current().intValue;
            this.step();
            return { kind: { type: "int", value }, pos };
        }
        if (this.test("string")) {
            const value = this.current().stringValue;
            this.step();
            return { kind: { type: "string", value }, pos };
        }
        // ...
    }
    // ...
}

3.3.2 Group expressions

A group expression is an expression enclosed in parentheses, eg. (1 + 2). Because the expression is enclosed, meaning it starts with a (-token and ends with a )-token, we will treat it like an operand.

type ExprKind =
    // ...
    | { type: "group", expr: Expr }
    // ...
    ;

If we find a (-token in .parseOperand(), we know that we should parse a group expression. We do this by ignoring the (-token, parsing an expression using .parseExpr() and checking that we find a )-token afterwards.

class Parser {
    // ...
    public parseOperand(): Expr {
        // ...
        if (this.test("(")) {
            this.step();
            const expr = this.parseExpr();
            if (!this.test(")")) {
                this.report("expected ')'");
                return { kind: { type: "error" }, pos };
            }
            this.step();
            return { kind: { type: "group", expr }, pos };
        }
        // ...
    }
    // ...
}

If we do not find the closing )-token, we report an error and return an error expression.
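Both outcomes can be seen in a stripped-down sketch. It makes several simplifying assumptions: tokens come from an array instead of the lexer, positions are left out, the AST nodes are flattened (no separate kind object), `parseExpr` simply forwards to `parseOperand`, and errors are collected in an `errors` array rather than printed. `MiniParser` is a hypothetical name.

```typescript
type Token = { type: string; intValue?: number };

// Flattened AST without positions, for illustration only.
type Expr =
    | { type: "error" }
    | { type: "int"; value: number }
    | { type: "group"; expr: Expr };

class MiniParser {
    private i = 0;
    public errors: string[] = [];
    public constructor(private tokens: Token[]) {}

    private test(type: string): boolean {
        const token = this.tokens[this.i];
        return token !== undefined && token.type === type;
    }
    private step() { this.i += 1; }

    // Assumption: expressions are just operands in this sketch.
    public parseExpr(): Expr { return this.parseOperand(); }

    public parseOperand(): Expr {
        if (this.test("int")) {
            const value = this.tokens[this.i]?.intValue ?? 0;
            this.step();
            return { type: "int", value };
        }
        if (this.test("(")) {
            this.step();
            const expr = this.parseExpr();
            if (!this.test(")")) {
                this.errors.push("expected ')'");
                return { type: "error" };
            }
            this.step();
            return { type: "group", expr };
        }
        this.errors.push("expected expr");
        this.step();
        return { type: "error" };
    }
}

// "( 1 )" parses to a group containing an int.
const ok = new MiniParser([{ type: "(" }, { type: "int", intValue: 1 }, { type: ")" }]);
const okExpr = ok.parseExpr();

// "( 1" is missing the closing token, so we get an error node and a report.
const broken = new MiniParser([{ type: "(" }, { type: "int", intValue: 1 }]);
const brokenExpr = broken.parseExpr();
```

The caller receives an Expr in both cases, which is what the error kind buys us: no separate failure path is needed.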

3.3.3 Block, if and loop operands

We want to be able to use blocks, if and loop constructs as expressions.

Example:

let temperature_feeling = if > temperature 20 { "hot" } else { "cold" };

Each construct will have its own .parse...()-method, so we'll just look for the first {-, if-, or loop-token and call the relevant method.

class Parser {
    // ...
    public parseOperand(): Expr {
        // ...
        if (this.test("{"))
            return this.parseBlock();
        if (this.test("if"))
            return this.parseIf();
        if (this.test("loop"))
            return this.parseLoop();
        // ...
    }
    // ...
}

3.4 Postfix operators

Postfix operations are expressions where the operators come after the subject expression. This includes field expressions (eg. subject.field), index expressions (eg. subject[index]) and call expressions (eg. subject(...args)).

A notable detail is that postfix operations are chainable, eg. subject[index].field is valid, likewise with subject.method(arg) and matrix[y][x].

We'll make a method .parsePostfix() to parse postfix operators.

class Parser {
    // ...
    public parsePostfix(): Expr {
        let subject = this.parseOperand();
        while (true) {
            const pos = this.pos();
            // ...
            break;
        }
        return subject;
    }
    // ...
}

We start by parsing an operand. Then we enter a loop, which runs until we no longer find any relevant operator tokens. When we parse a postfix expression, the subject will be replaced with the new parsed expression.

Notice we don't define pos at the start, but after we've parsed the subject. That's because we want pos to reflect the start of the postfix operator, not the start of the subject.

3.4.1 Field expression

A field expression is for accessing fields on an object, and consists of a .-token and an identifier, eg. .field.

type ExprKind =
    // ...
    | { type: "field", subject: Expr, value: string }
    // ...
    ;

class Parser {
    // ...
    public parsePostfix(): Expr {
        // ...
        while (true) {
            // ...
            if (this.test(".")) {
                this.step();
                if (!this.test("ident")) {
                    this.report("expected ident");
                    return { kind: { type: "error" }, pos };
                }
                const value = this.current().identValue;
                this.step();
                subject = { kind: { type: "field", subject, value }, pos };
                continue;
            }
            // ...
        }
        // ...
    }
    // ...
}

If we find a .-token, we step over it, and make sure that we've hit an identifier. We save the identifier value and step over the identifier. Then we replace subject with a new field expression containing the previous subject value. Then we continue to look for the next postfix operator.
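To see how replacing subject builds up a chain, here is a minimal sketch that only knows identifiers and field expressions. It is written as a standalone function over a token array, with flattened AST nodes and no positions or error reports; these simplifications, and the `showExpr` helper, are assumptions for illustration.

```typescript
type Token = { type: string; identValue?: string };

// Flattened AST without positions, for illustration only.
type Expr =
    | { type: "error" }
    | { type: "ident"; value: string }
    | { type: "field"; subject: Expr; value: string };

function parsePostfix(tokens: Token[]): Expr {
    let i = 0;
    const test = (type: string): boolean => tokens[i]?.type === type;

    // Stand-in for parseOperand(): only identifiers are supported here.
    if (!test("ident")) return { type: "error" };
    let subject: Expr = { type: "ident", value: tokens[i]?.identValue ?? "" };
    i += 1;

    while (true) {
        if (test(".")) {
            i += 1;
            if (!test("ident")) return { type: "error" };
            const value = tokens[i]?.identValue ?? "";
            i += 1;
            // The previous subject becomes the inner node of the new one.
            subject = { type: "field", subject, value };
            continue;
        }
        break;
    }
    return subject;
}

// Hypothetical helper that makes the nesting visible with parentheses.
function showExpr(expr: Expr): string {
    switch (expr.type) {
        case "error":
            return "<error>";
        case "ident":
            return expr.value;
        case "field":
            return `(${showExpr(expr.subject)}).${expr.value}`;
    }
}

// Tokens for `a.b.c`.
const chain = parsePostfix([
    { type: "ident", identValue: "a" },
    { type: "." },
    { type: "ident", identValue: "b" },
    { type: "." },
    { type: "ident", identValue: "c" },
]);
const shown = showExpr(chain); // "((a).b).c"
```

The innermost node is the first operand, and each later operator wraps everything before it, so the chain nests from left to right.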

3.4.2 Index expression

An index operation consists of the subject and an index. The index is an expression, and it is contained in [- and ]-tokens, eg. subject[value].

type ExprKind =
    // ...
    | { type: "index", subject: Expr, value: Expr }
    // ...
    ;

class Parser {
    // ...
    public parsePostfix(): Expr {
        // ...
        while (true) {
            // ...
            if (this.test("[")) {
                this.step();
                const value = this.parseExpr();
                if (!this.test("]")) {
                    this.report("expected ']'");
                    return { kind: { type: "error" }, pos };
                }
                this.step();
                subject = { kind: { type: "index", subject, value }, pos };
                continue;
            }
            // ...
        }
        // ...
    }
    // ...
}

If we find a [-token, we parse the index part exactly the same way we parse a group expression.

3.4.3 Call expression

A call expression is like an index expression, except that it uses ( and ) instead of [ and ], and that there can be 0 or more expressions (arguments or args) inside the ( and ). The arguments are separated by ,.

type ExprKind =
    // ...
    | { type: "call", subject: Expr, args: Expr[] }
    // ...
    ;

class Parser {
    // ...
    public parsePostfix(): Expr {
        // ...
        while (true) {
            // ...
            if (this.test("(")) {
                this.step();
                let args: Expr[] = [];
                if (!this.test(")")) {
                    args.push(this.parseExpr());
                    while (this.test(",")) {
                        this.step();
                        if (this.test(")"))
                            break;
                        args.push(this.parseExpr());
                    }
                }
                if (!this.test(")")) {
                    this.report("expected ')'");
                    return { kind: { type: "error" }, pos };
                }
                this.step();
                subject = { kind: { type: "call", subject, args }, pos };
                continue;
            }
            // ...
        }
        // ...
    }
    // ...
}

Similarly to index expressions, if we find a (-token, we step over it, parse the arguments, check for a ) and replace subject with a call expression containing the previous subject.

When parsing the arguments, we start by testing whether we've reached a ), to check if there are any arguments. If we haven't, we parse the first argument. Each subsequent argument is preceded by a ,-token, so we test for , to check whether we should keep parsing arguments. After stepping over a separating ,, we check if we've reached a ) and break if so. This allows for a trailing comma.

func(
    a,
    b, // trailing comma
)
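The argument loop can be tested in isolation. The sketch below assumes the arguments are plain integers (standing in for .parseExpr()), reads tokens from an array, and returns null instead of reporting an error; `parseArgs`, `int` and `tok` are hypothetical names.

```typescript
type Token = { type: string; intValue?: number };

// Parses an argument list of the form ( int, int, ... ) with an optional
// trailing comma. Returns the argument values, or null if the opening
// or closing parenthesis is missing.
function parseArgs(tokens: Token[]): number[] | null {
    let i = 0;
    const test = (type: string): boolean => tokens[i]?.type === type;
    // Stand-in for parseExpr(): takes the current token as an integer.
    const parseArg = (): number => {
        const value = tokens[i]?.intValue ?? 0;
        i += 1;
        return value;
    };

    if (!test("(")) return null;
    i += 1;
    const args: number[] = [];
    if (!test(")")) {
        args.push(parseArg());
        while (test(",")) {
            i += 1;
            if (test(")")) break; // trailing comma
            args.push(parseArg());
        }
    }
    if (!test(")")) return null;
    return args;
}

// Small helpers for building token arrays.
const int = (intValue: number): Token => ({ type: "int", intValue });
const tok = (type: string): Token => ({ type });

const noArgs = parseArgs([tok("("), tok(")")]);                               // []
const twoArgs = parseArgs([tok("("), int(1), tok(","), int(2), tok(")")]);    // [1, 2]
const trailing = parseArgs([tok("("), int(1), tok(","), int(2), tok(","), tok(")")]); // [1, 2]
const unclosed = parseArgs([tok("("), int(1), tok(","), int(2)]);             // null
```

With and without the trailing comma, the same two arguments come out, which is the behaviour described above.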

3.5 Prefix expressions

In contrast to postfix expressions, prefix expressions are operations where the operator comes first, followed by the operands. In some languages, operations such as negation (eg. -value) and not-operations (eg. !value) are prefix operations. In the language we're making, all binary and unary arithmetic operations are prefix. This includes both expressions with a single operand, such as not (eg. not value), and expressions with 2 operands, such as addition (eg. + a b) and equality (eg. == a b).

This is because infix operators (eg. a + b) make parsing more complicated, as they require reasoning about operator precedence, eg. why 2 + 3 * 4 != (2 + 3) * 4.

Operations with 1 operand are called unary expressions. Operations with 2 are called binary expressions.

type ExprKind =
    // ...
    | { type: "unary", unaryType: UnaryType, subject: Expr }
    | { type: "binary", binaryType: BinaryType, left: Expr, right: Expr }
    // ...
    ;

type UnaryType = "not" /*...*/;
type BinaryType = "+" | "*" | "==" /*...*/;

class Parser {
    // ...
    public parsePrefix(): Expr {
        const pos = this.pos();
        // ...
        return this.parsePostfix();
    }
    // ...
}

We again get the position immediately, because the operation, eg. + a b, starts at the first +-token.

If we don't find any operators, we proceed to try to parse a postfix expression.

3.5.1 Unary expressions

class Parser {
    // ...
    public parsePrefix(): Expr {
        // ...
        if (this.test("not")) {
            this.step();
            const subject = this.parsePrefix();
            return { kind: { type: "unary", unaryType: "not", subject }, pos };
        }
        // ...
    }
    // ...
}

If we find a not-token, we ignore it, parse a prefix expression recursively, and return a unary expression with the subject and unary type.

3.5.2 Binary expressions

class Parser {
    // ...
    public parsePrefix(): Expr {
        // ...
        if (this.test("+")) {
            this.step();
            const left = this.parsePrefix();
            const right = this.parsePrefix();
            return { kind: { type: "binary", binaryType: "+", left, right }, pos };
        }
        // ...
    }
    // ...
}

Just as with unary, if we find a +-token, we ignore it and parse a prefix expression recursively for the left operand. Then we parse the second operand by parsing another prefix expression. Finally, we return a binary expression with the left and right operands and the binary type.
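The recursion is also what removes the need for precedence rules: the grouping follows directly from the order of the tokens. The following sketch shows this with only integers, + and *. It assumes the source is split on spaces instead of being run through the lexer, uses flattened AST nodes without positions, and adds a hypothetical `evaluate` function purely to make the grouping observable.

```typescript
// Flattened AST without positions, for illustration only.
type Expr =
    | { type: "error" }
    | { type: "int"; value: number }
    | { type: "binary"; binaryType: "+" | "*"; left: Expr; right: Expr };

function parseSource(source: string): Expr {
    // Assumption: tokens are separated by spaces; this replaces the lexer.
    const tokens = source.split(" ").filter((t) => t.length > 0);
    let i = 0;
    const parsePrefix = (): Expr => {
        const token = tokens[i];
        i += 1;
        if (token === undefined) return { type: "error" };
        if (token === "+" || token === "*") {
            // Both operands are themselves prefix expressions.
            const left = parsePrefix();
            const right = parsePrefix();
            return { type: "binary", binaryType: token, left, right };
        }
        if (/^[0-9]+$/.test(token)) {
            return { type: "int", value: parseInt(token, 10) };
        }
        return { type: "error" };
    };
    return parsePrefix();
}

// Hypothetical evaluator, used only to show how the operands were grouped.
function evaluate(expr: Expr): number {
    switch (expr.type) {
        case "error":
            return NaN;
        case "int":
            return expr.value;
        case "binary": {
            const left = evaluate(expr.left);
            const right = evaluate(expr.right);
            return expr.binaryType === "+" ? left + right : left * right;
        }
    }
}

const sumFirst = evaluate(parseSource("+ 2 * 3 4"));     // 2 + (3 * 4) = 14
const productFirst = evaluate(parseSource("* + 2 3 4")); // (2 + 3) * 4 = 20
```

The two inputs correspond to the infix expressions 2 + 3 * 4 and (2 + 3) * 4, yet neither needs parentheses or a precedence table to be parsed unambiguously.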

3.6 Expressions

Lastly for expressions, we'll make a method .parseExpr() for parsing an expression.

class Parser {
    // ...
    public parseExpr(): Expr {
        return this.parsePrefix();
    }
    // ...
}

The method just proceeds to try and parse a prefix expression.

3.7 If