Compare commits

...

2 Commits

Author SHA1 Message Date
Simon From Jakobsen
2da04e73f0 add chapter 2 2024-08-23 12:35:53 +00:00
Simon From Jakobsen
13c81ed05d chapter 2 kinda sucked 2024-08-22 13:16:57 +00:00
2 changed files with 398 additions and 191 deletions

View File

@ -1,220 +1,425 @@
## 2 Specifying a language
# 2 Lexer
**You can skip this chapter and come back to it, if you need.**
In this chapter I'll propose a lexer for a compiler.
The language will at first consist of these features.
The objective of the lexer is to take text as input and ouput tokens.
- Literals
- Expressions
- Variables
- Blocks
- If-conditions
- Loops
- Function calls
- Function definitions
## 2.1 Tokens
## 2.1 Literals
A token is a piece of our input program, eg. an integer, a plus operator, etc.
Integers (ints) are numbers without decimal point.
Tokens are the result of grouping together relevant characters, but without incorporating structure. For example, if the input text is the characters `1`, then `2`, then ` `, then `+`, the tokens would be an integer token, then a plus operator token. Whitespace is usually filtered out, meaning the lexer will not emit whitespace tokens, it'll just skip over.
Strings are text enclosed in double-quotes `" "`. Backslash `\` can be used to escape `"` and invoke specialt characters `\n` and `\t` for newline and tab characters.
Tokens serve 2 purposes. 1) Categorize characters or groups of characters and 2) provide a way to retrieve the text/value the characters represent.
Booleans (bools) represent logical true and false and are the symbols `true` and `false`.
I'll propose this Typescript type for representing tokens:
```ts
type Token = {
type: string,
pos: Pos,
length: number,
identValue?: string,
intValue?: number,
stringValue?: string,
};
Null represents a non-value (null value) and is the symbol `null`.
Examples:
```
0
123
"hello"
"{ \"key\": 123 }"
true
false
null
```
## 2.2 Expressions
I'll be reusing the prefix notation expressions from chapter 1. These are also called S-expressions.
I'll be using logical expressions (`or`, `and` and `not`), comparison expressions (`==`, `!=`, `<` and `>`), arithmetic expressions (`+`, `-`, `*`, `/`) and group expressions (`(...)`).
The grammer, meaning the rules for the structure of the code, is the following:
```
expr ->
...
| "or" expr expr
| "and" expr expr
| "not" expr
| "==" expr expr
| "!=" expr expr
| "<" expr expr
| ">" expr expr
| "+" expr expr
| "-" expr expr
| "*" expr expr
| "/" expr expr
| "(" expr ")"
...
| literal
```
Notice `not` only has 1 operand.
Examples:
```
+ 1 * 2 3
+ + + 1 2 3 4
* 3 (* 3 3)
```
## 2.3 Variables
A variable is a storage container for a value.
A variable is defined using a `let`-statement. A variable is used by refering to its name. A variable can be reassigned after definition.
Variables are block-scoped, more on this later.
Grammar:
```
let -> "let" ident "=" expr ";"
assign -> ident "=" expr ";"
expr ->
...
| ident
...
```
Examples:
```
let a = 5;
a = 10;
+ a 10
```
## 2.4 Blocks
A block is code enclosed in curly-braces (`{` and `}`). They can be used as both statements and expressions. Single-line statements in a block need semicolon (`;`) at the end. Expressions can be used as statements.
If the last entry in a block is an expression and the semicolon is omitted, the block as a whole yields that value.
Grammar:
```
block -> "{" statements expr:? "}"
expr ->
...
| block
...
```
Examples:
```
{}
{
let a = 5;
}
let a = {
let b = 5;
+ b 5
type Pos = {
index: number,
line: number,
col: number,
};
```
## 2.5 If-statements
The `type` field contians the *token type*, eg. `"int"`, `"+"`, etc. The `pos` field is used primarily for error messages and with `length` for retrieving the raw text. The `identValue` and 2 other value fields contain the values of tokens when relevant, for example with integers we're also interested in the value.
An `if`-statement has a condition and a block it runs, if the condition evaluates to `true` ("truthy"), and an optional `else`-block, if it the condition evaluates to `false`.
## 2.2 Transformation iterator
Grammar:
```
if -> "if" expr block ("else" block):?
The lexer I propose is implemented as a 'transformation' of one iterator, a text character iterator, into a token iterator.
expr ->
...
| if
...
```
Let's start with the lexer code from chapter 1.
I've added `if` to `expr`, because, in addition to a statement, I want to be able to use `if` as an expression just like blocks.
Examples:
```
let a = 5;
if == b 5 {
a = 5;
}
let value = if > a b { c } else { d };
```
## 2.6 Loops
`loop`-statements are blocks that run continually (from top to bottom, then from the top again) until a `break`-statement is evaluated.
Loops can also be expressions, as a value can be supplied when breaking.
Grammar:
```
loop -> "loop" block
break -> "break" expr:? ";"
expr ->
...
| loop
...
```
Examples:
```
let i = 0;
let v = 1;
loop {
if not < i 10 {
break;
```ts
function lex(text: string): Token[] {
let index = 0;
const tokens: Token[] = [];
while (index < text.length) {
// ...
}
v = * v 2;
i = + i 1;
return tokens;
}
let v = loop {
break 5;
};
```
## 2.7 Function calls
Because it'll make the lexer's code a bit cleaner, I'll refactor the lexer using the iterator pattern.
Grammar:
```
expr ->
...
| expr "(" args ")"
...
```ts
class Lexer {
private index = 0;
public constructor (private text: string) {}
public next(): Token | null { /*...*/ }
}
```
Examples:
```
let a = pow(2, 3);
The difference is that before when calling `lex` you would get an array of all tokens, now you instanciate a `Lexer` and call the `.next()` function for each token, until the end-of-file (EOF).
exit();
I'll add 3 functions for iterating through characters of the text:
```ts
class Lexer {
// ...
private step() { /*...*/ }
private done(): bool { return this.index >= this.text.length; }
private current(): string { return this.text[this.index]; }
// ...
}
```
## 2.8 Function definition
The `.done()` is the same as in the while loop condition above. The `.current()` is instead of indexing directly into the text. `.step()` isn't implemented yet, that's because I also want line and column number in the tokens. To do this I'll add `line` and `column` as properties on `Lexer` in addition to `index`:
Grammar:
```
fn -> "fn" ident "(" params ")" block
return -> "return" expr:? ";"
```ts
class Lexer {
private index = 0;
private line = 1;
private col = 1;
// ...
}
```
Line and column numbers both start at one. Text in a file goes from left to right, top to bottom in that order. The column number should therefore increase when we step, until we step over a newline, then the line number should be incremented and column number reset to `1`. The index should just increment as we step. In case the lexer is done (hit the end of the file, usually called "EOF"), we won't increment anything.
```ts
class Lexer {
// ...
private step() {
if (this.done())
return;
if (this.current() === "\n") {
this.line += 1;
this.col = 1;
} else {
this.col += 1;
}
this.index += 1;
}
// ...
}
```
I'll also propose a method for optaining the current position:
```ts
class Lexer {
// ...
private pos(): Pos {
return {
index: this.index,
line: this.line,
col: this.col,
};
}
// ...
}
```
And a method for creating valueless tokens:
```ts
class Lexer {
// ...
private token(type: string, pos: Pos): Token {
return {
index: this.index,
line: this.line,
col: this.col,
};
}
// ...
}
```
And a method for testing/matching the `.current()` against a regex pattern or string:
```ts
class Lexer {
// ...
private test(pattern: RegExp | string): Token {
if (typeof pattern === "string")
return this.current === pattern;
else
return pattern.test(this.current);
}
// ...
}
```
Lastly I'll make `.next()` return `null` when done and handle whenever an unrecognized/illegal character is reached.
```ts
class Lexer {
// ...
public next(): Token | null {
if (this.done())
return null;
const pos = this.pos();
// ...
console.error(`Lexer: illegal character '${this.current()}' at ${pos.line}:${pos.col}`);
this.step();
return next();
}
// ...
}
```
When we've checked, we haven't hit EOF yet, we get the current position. In `.next()` we should return after each valid token. This means, when we hit the end, we could not make any token, therefore we report an error. After reporting, we step over the character, essentially ignoring it, and call `.next()` recursively to start again. We don't need to stop the compilation, just because we hit an invalid character.
The scaffolding is now complete.
## 2.3 Skipping whitespace
We don't need to know anything about whitespace, so we'll skip over it without making a token. This is what I'll propose adding to `.next()` to make that happen:
```ts
class Lexer {
// ...
public next(): Token | null {
if (this.done())
return null;
const pos = this.pos();
if (this.test(/[ \t\n]/)) {
while (!this.done() && this.test(/[ \t\n]/))
this.step();
return next();
}
// ...
console.error(
`Lexer: illegal character '${this.current()}'`
+ ` at ${pos.line}:${pos.col}`,
);
this.step();
return next();
}
// ...
}
```
We test if the current character matches the regex `[ \t\n]`, meaning space, tab and newline characters. And then we `.step()` over all the whitespace characters. Then we return a call to `.next()`, instead of just continuing, because we need to check `.done()` again.
## 2.4 Comments
Same as in Python, Bash and so many others, comments in this language is a hashtag `#` and then all the characters afterwards is ignored, until a newline `\n` is reached.
```ts
class Lexer {
// ...
public next(): Token | null {
// ...
if (this.test("#")) {
while (!this.done() && !this.test("\n"))
this.step();
return next();
}
// ...
}
// ...
}
```
## 2.5 Identifiers and keywords
Identifiers, also called names or symbols, are what we use to refer to functions, variables, etc. Keywords are the special words we use in specialt syntax, such as `if`-statements.
To lex identifiers and keywords, we'll look for identifier characters, find all the characters in an identifier, save the text value.
Identifers can start with a letter or an underscore `[a-zA-Z_]`, but not numbers `[0-9]`, because that would be hard to implement. Every character after the first may also contain numbers, ie. `[a-zA-Z_0-9]`.
Lastly, we check if the identifier is in fact not an identifier, but one of the hardcoded keywords.
```ts
class Lexer {
// ...
public next(): Token | null {
// ...
if (this.test(/[a-zA-Z_]/)) {
let value = "";
while (!this.done() && this.test(/[a-zA-Z0-9_]/)) {
value += this.current();
this.step();
}
switch (value) {
case "if":
return this.token("if", pos);
case "else":
return this.token("else", pos);
default:
return { ...this.token("ident", pos), identValue: value };
}
}
// ...
}
// ...
}
```
Again we use the if-while pattern to match multiple characters. Each character is appended to the `value` variable, so that we have the entire value at the end. If then, the value matches one of the keywords, a keyword specific token is returned, else an `ident` token is returned and the `identValue` field is added on.
## 2.6 Integers
Integers, just like identifers, are sequences of characters. Unlike identifiers, there are no keywords (keynumbers) we need to be aware of. An integer, in this language, is a sequence consisting of and starting with any number characters `[0-9]`. (Sidenote: In most other languages, a base 10 integer is __either__ a `0` character or a sequence starting with any number __except zero__ `[1-9]`, after the first character any number is allowed.)
After matching the integer, we also parse the string representing a number into the number value itself.
```ts
class Lexer {
// ...
public next(): Token | null {
// ...
if (this.test(/[0-9]/)) {
let textValue = "";
while (!this.done() && this.test(/[0-9]/)) {
textValue += this.current();
this.step();
}
return { ...this.token("int", pos), intValue: parseInt(textValue) };
}
// ...
}
// ...
}
```
This implementation is a bit naive, but I've prioritized simplicity.
## 2.7 Strings
Strings, like identifiers and integers, are sequences of characters represent a value. Like integers, we don't need to check for keywords.
Strings can contain special/escaped characters such as `\"`, `\\`, `\n`. For `\\` and `\"`, the resulting value should just contain the character after the backslash `\`. For `\n`, `\t` and `\0` the resulting value should contain a newline, tab and null-byte character respectively. The implementation will make this transformation while matching the string.
The resulting value also does not contain the surrounding double quotes `" "`
```ts
class Lexer {
// ...
public next(): Token | null {
// ...
if (this.test("\"")) {
this.step();
let value = "";
while (!this.done() && !this.test("\"")) {
if (this.test("\\")) {
this.step();
if (this.done())
break;
value += {
"n": "\n",
"t": "\t",
"0": "\0",
}[this.current()] ?? this.current();
} else {
value += this.current();
}
this.step();
}
if (this.done() || !this.test("\"")) {
console.error(
`Lexer: unclosed/malformed string`
+ ` at ${pos.line}:${pos.col}`,
);
return this.token("error", pos);
}
this.step();
return { ...this.token("string", pos), stringValue: value };
}
// ...
}
// ...
}
```
A string starts with a double quote `"`. This character is ignored. We then look at every character up until an __unescaped__ double quote `"`, meaning a string may contain escaped double quote characters `\"`. When the loop is done, we check whether we've hit a closing double quote `"` or if we've reached EOF. If we've reached EOF, we emit an error and return a special `error` token. If we've reached the enclosing quote `"`, we ignore and step over it and return a string token.
While looking at each character, we check if it's a backslash `\`. If so, we ignore the escaping backslash and handle the next character like either specialt characters, eg. `n` should be a newline character, or just an escaped character, eg. an escaped double quote.
## 2.8 Static tokens
Operators and punctionation such as plus `+`, left curlybrace `{` and semicolon `;` are single character static tokens. Operators such as double equal `==` are multi character static tokens.
```ts
class Lexer {
// ...
public next(): Token | null {
// ...
if (this.test(/[\+\{\};]/)) {
const first = this.current();
this.step();
if (first === "=" && !this.done() && this.test("=")) {
this.step();
return this.token("==", pos);
}
return this.token(first, pos);
}
// ...
}
// ...
}
```
The clause checks for any one of the characters. For single character tokens, it saves the character for use as the token type. Then it `.step()`s and returns a token with the character as the token type. For multi character tokens, we check the first character again for specific characters, and then match another token if necessary.
Note: The reason `+`, `{`, `}` are escaped, is because they are special regex characters. In Javascript all non-letters can safely be escaped and all letters are safe to use.
The above code can also be writte like this:
```ts
class Lexer {
// ...
public next(): Token | null {
// ...
if (this.test("=")) {
this.step()
if (!this.done() && this.test("=")) {
this.step();
return this.token("==", pos);
}
return this.token("=", pos);
}
if (this.test("{")) {
this.step();
return this.token("{", pos);
}
// ...
}
// ...
}
```
The former is shorter the latter is more readable.
## 2.9 Testing
To test the lexer, give it some text and observe the result and errors.
```ts
const text = `
a1 123 +
# comment
"hello"
"escaped\"\nnewline"
`;
const lexer = new Lexer(text);
let token = lexer.next();
while (token !== null) {
console.log(`Lexed ${token}`);
token = lexer.next();
}
```
## 2.10 Exercises
1. Implement the operators: `-`, `*`, `/`, `(`, `)`, `[`, `]`, `!=`, `<`, `>`, `<=` and `>=`.
2. Implement the keywords: `true`, `false`, `null`, `or`, `and`, `not`, `loop`, `break`, `let`, `fn` and `return`.
3. \* Implement single line comments using `//` and multiline comments using `\*` and `*\` (\*\* extra points if multiline comments can be nested, eg. `/* ... /* ... */ ... */`).
4. \* Reimplement integers such that integers are either `0` or start with `[1-9]`.
5. \* Implement Hex integers, eg. `0xAA15`, using base 16.

View File

@ -6,6 +6,8 @@ This course will walk you through how to build a interpreter/compiler.
The first chapter will show how to make a simple polish notation calculator.
The latter chapters will show how to specify and implement a complete programming language.
I will be using Javascript and Typescript as reference implementation language. This is not required. I'd highly recommend you try to implement the concepts in another language, so that you understand the concepts, not just copy the source code.
## Exercises
You should attempt and should be able to do all exercises not marked \*.