# Creating your own language (Part 1)

Hello,

I intend to write a serie of articles explaining how to implement a programming language, may it be functionnal, imperative, object oriented, statically or dynamically typed. We will implement it in pseudo code. For learning purposes, we will implement everything by hand (Bison and Flew does a lot of work, and we don't want to assume this work done automagically).

## Theory

The first thing to do is lexical analysis (done by a Lexer): when you write something like a simple line like this:

``print(3 + 4)``

``````0000000 7270 6e69 2874 2033 202b 2934 000a
000000d``````

So we have to translate these datas in a way our computer will feel better. We want these characters to be understood at a higher abstraction layer: distinguish identifiers, operators, and numbers. Our same line will be translated in:

``[ IDENTIFIER("print"), O_PARENTHESIS, NUMBER(4), OP_PLUS, NUMBER(3), C_PARENTHESIS ]``
(Just assume this is a kind of enumeration), and this is the only lexer's job.

## In our example

Usually, we express tokens as regular expression (don't expect this to be always true, Shell is an exception, and cannot express its token with a regex). These expression are simply written in the bison/flex configuration file, and the lexer is generated.

Because start from scratch, it's easier to have a disambiguation on the first character, so, be worry that

$\mbox{prefix}(\mbox{token}_{0}) \neq \mbox{prefix}(\mbox{token}_{1}) \neq ... \neq \mbox{prefix}(\mbox{token}_{i})$

Here are usual tokens (I will express my grammar using an EBNF syntax)

``````Identifier ::= [a-zA-Z][a-zA-Z0-9]*
Number  ::= [1-9][0-9]* /* '-' is considered as a unary operator */
Plus    ::= '+'
Minus   ::= '-'
Div     ::= '/'
Mul     ::= '*'
Modulo  ::= '%'
OParen  ::= '('
CParen  ::= ')'
Eq      ::= '='
Neq     ::= '!='
Less    ::= '<'
...``````

Since we wrote our tokens grammar, we now have to represent our text input (user input or script file) as a stream of characters. Our lexer will read it, character by character, and return the corresponding token.

``````Lex(Stream s) : Token
return LexIdentifier(s)
return LexOperator(s)

LexIdentifier(Stream s) : Token
String str = ""
Pop(s)

LexOperator(Stream s) : Lexer
String s = ""
Pop(s)
foreach OperatorToken op in operators
if str == op.str