Creating your own language (Part 1)

Hello,

I intend to write a series of articles explaining how to implement a programming language, be it functional, imperative, object-oriented, statically or dynamically typed. We will implement it in pseudocode. For learning purposes, we will implement everything by hand (Bison and Flex do a lot of the work, and we don't want to assume this work is done automagically).

Theory

The first thing to do is lexical analysis (done by a lexer). When you write a simple line like this:

print(3 + 4)

your computer sees this:

0000000 7270 6e69 2874 2033 202b 2934 000a
000000d
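
You can reproduce this dump yourself. Here is a small Python sketch; the grouping into little-endian 16-bit words matches what a tool like `od -x` prints:

```python
# Show the raw bytes of our example line as hex, grouped into
# little-endian 16-bit words (the same layout a tool like `od -x` prints).
line = b"print(3 + 4)\n"
words = [line[i:i + 2] for i in range(0, len(line), 2)]
print(" ".join(f"{int.from_bytes(w, 'little'):04x}" for w in words))
# → 7270 6e69 2874 2033 202b 2934 000a
```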

So we have to translate this data into a form our program can work with more easily. We want these characters to be understood at a higher abstraction layer: distinguishing identifiers, operators, and numbers. Our line will be translated into:

[ IDENTIFIER("print"), O_PARENTHESIS, NUMBER(3), OP_PLUS, NUMBER(4), C_PARENTHESIS ]
(Just assume this is a kind of enumeration.) This is the lexer's only job.
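
One concrete way to represent such tokens, sketched here in Python (the type names mirror the enumeration above; the exact representation is up to you):

```python
from dataclasses import dataclass
from enum import Enum, auto

# A token has a type (one value of an enumeration) and, for tokens like
# identifiers and numbers, the text it was lexed from.
class TokenType(Enum):
    IDENTIFIER = auto()
    NUMBER = auto()
    O_PARENTHESIS = auto()
    C_PARENTHESIS = auto()
    OP_PLUS = auto()

@dataclass
class Token:
    type: TokenType
    value: str = ""

# print(3 + 4) lexes to:
tokens = [
    Token(TokenType.IDENTIFIER, "print"),
    Token(TokenType.O_PARENTHESIS),
    Token(TokenType.NUMBER, "3"),
    Token(TokenType.OP_PLUS),
    Token(TokenType.NUMBER, "4"),
    Token(TokenType.C_PARENTHESIS),
]
```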

In our example

Usually, we express tokens as regular expressions (don't expect this to always be true: Shell is an exception and cannot express its tokens with regexes). These expressions are simply written in the bison/flex configuration file, and the lexer is generated.

Because we start from scratch, it's easier if each token can be disambiguated from its first character alone, so keep that in mind when choosing your tokens.

Here are the usual tokens (I will express the grammar using an EBNF-like syntax):

Identifier ::= [a-zA-Z][a-zA-Z0-9]*
Number  ::= '0' | [1-9][0-9]* /* '-' is considered as a unary operator */
Plus    ::= '+'
Minus   ::= '-'
Div     ::= '/'
Mul     ::= '*'
Modulo  ::= '%'
OParen  ::= '('
CParen  ::= ')'
Eq      ::= '='
Neq     ::= '!='
Less    ::= '<'
...
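
Before writing the lexer by hand, it can help to check the token grammar against some input. Here is a quick Python sketch using the `re` module, in the spirit of what flex generates for you (the `TOKEN_SPEC` table and `tokenize` helper are illustrative names, not part of the lexer we build below):

```python
import re

# Each token is a named regular expression, mirroring the grammar above.
# The alternatives are tried in order at the current position.
TOKEN_SPEC = [
    ("NUMBER",     r"0|[1-9][0-9]*"),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("OP",         r"!=|<=|[+\-*/%()=<]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    # Note: in this sketch, characters matching no rule are silently skipped.
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("print(3 + 4)"))
# → [('IDENTIFIER', 'print'), ('OP', '('), ('NUMBER', '3'),
#    ('OP', '+'), ('NUMBER', '4'), ('OP', ')')]
```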

Now that we have written our token grammar, we have to represent our text input (user input or a script file) as a stream of characters. Our lexer will read it, character by character, and return the corresponding tokens.

Lex(Stream s) : Token
    while Head(s) is whitespace
        Pop(s)
    if Head(s) belongs to [a-zA-Z]
        return LexIdentifier(s)
    if Head(s) belongs to [0-9]
        return LexNumber(s)
    if Head(s) belongs to [-+*/%()=<!]
        return LexOperator(s)
    error("Unexpected character: " + Head(s))

LexIdentifier(Stream s) : Token
    String str = ""
    while Head(s) belongs to [a-zA-Z0-9]
        str += Head(s)
        Pop(s)
    return Token(TOKEN_IDENTIFIER, str)

LexNumber(Stream s) : Token
    String str = ""
    while Head(s) belongs to [0-9]
        str += Head(s)
        Pop(s)
    return Token(TOKEN_NUMBER, str)

LexOperator(Stream s) : Token
    /* try longer operators first, so "<=" is matched before "<" */
    foreach OperatorToken op in operators, longest first
        if s starts with op.str
            Pop(s) once per character of op.str
            return Token(op.type)
    error("Invalid operator token: " + Head(s))

const operators =
[
    { str: "(", type: O_PAREN },
    { str: "+", type: PLUS },
    { str: "<=", type: LESSEQUAL },
    ...
]
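
Putting the pieces together, here is a runnable Python translation of the pseudocode above. It is a sketch: the stream is modeled as a string plus an index, the token names are direct translations, and operators are tried longest first so that "<=" wins over "<":

```python
# Operators sorted with the longest strings first, so "<=" and "!=" are
# tried before "<" and "=".
OPERATORS = [
    ("!=", "NEQ"), ("<=", "LESSEQUAL"),
    ("(", "O_PAREN"), (")", "C_PAREN"),
    ("+", "PLUS"), ("-", "MINUS"), ("*", "MUL"), ("/", "DIV"),
    ("%", "MODULO"), ("=", "EQ"), ("<", "LESS"),
]
OP_CHARS = set("+-*/%()=<!")

def lex(text):
    """Turn the whole input into a list of (type, text) tokens."""
    tokens, i = [], 0
    while i < len(text):
        c = text[i]
        if c.isspace():                     # skip whitespace
            i += 1
        elif c.isalpha():                   # LexIdentifier
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            tokens.append(("IDENTIFIER", text[i:j]))
            i = j
        elif c.isdigit():                   # LexNumber
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(("NUMBER", text[i:j]))
            i = j
        elif c in OP_CHARS:                 # LexOperator, longest match first
            for op, kind in OPERATORS:
                if text.startswith(op, i):
                    tokens.append((kind, op))
                    i += len(op)
                    break
            else:
                raise SyntaxError("Invalid operator token: " + c)
        else:
            raise SyntaxError("Unexpected character: " + c)
    return tokens

print(lex("print(3 + 4)"))
# → [('IDENTIFIER', 'print'), ('O_PAREN', '('), ('NUMBER', '3'),
#    ('PLUS', '+'), ('NUMBER', '4'), ('C_PAREN', ')')]
```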

Now we can correctly lex our input. In the next article, we will see how to assemble these tokens with a parser and recognize our language.

If you want to create your own language, I encourage you to write down the tokens it will recognize and then write the lexer. Have fun!