Everything About Felys

Felys is a dynamically strongly typed scripting language with a syntax similar to Rust. It provides minimal features to support data grouping with dynamic length, i.e., tuples and lists. The entire project is intended to be a toy language for demonstration purposes only. I have never considered turning this into a real product; instead, it serves as a tribute to Elysia from Honkai Impact 3rd.

Feel to try Felys using the online playground.

Topics

In short, there are two main topics that worth sharing: the design principle and the parser generator. The former covers all the considerations involved in designing Felys, while the latter discusses how to implement a PEG/Packrat parser along with a DFA-driven lexer. Some parts may be a bit academic, but I believe this book should be accessible to any ECE/CS undergraduate beyond their first year.

Legal Statement

Other properties and any right, title, and interest thereof and therein (intellectual property rights included) not derived from Honkai Impact 3rd belong to their respective owners.

Quickstart

The playground also has some sample programs.

Literal

Felys has built-in support for float, integer, string, tuple, and list. Their underlying implementation uses f64, isize, String, Vec, and Vec in Rust standard library.

int = 42;
float = 3.14;
string = "elysia";
tuple = ("elysia", 11.11);
list = [1, 2, 3, 4, 5];

Assignment

Beyond standard assignment, you also unpack a tuple and the assign it. Felys also had syntactic sugar like +=, -=, etc.

(name, birthday) = ("elysia", 11.0);
birthday += 0.11;

Operator

Just like any other languages, Felys has arithmatic, comparsion, and logical operator. Be aware that Felys is strong typed, which means that things like 1 and 1.0, are not same and 1 + 1.0 does not evaluate.

sum = 1 + 1;
conjunction = true and true or false;
equality = 1 == 1;

Flow control

Felys flow control is similar to Rust, where most of them have a return value.

Condition

Unlike condition in other languages, else in Felys allows any type of expression to follow by.

one = if true {
    1
} else 0;

one = if true {
    1
} else loop {
    break 1;
}

Loops

There are three types of loops using keywords loop, while, and for, along with break and continue. The break keyword appeared in loop can carry a return value. for loop will go through an iterable, i.e. list.

one = loop {
    break 1;
};

while true {
    if one {
        break;
    }
}

for x in [1, 2, 3] {
    if x == 2 {
        break;
    }
}

Block

The last statement of a block is also the return value of the block. All statements before it must not have a return value. You can to use ; to make them void, and the best practice is to always add the ; except for expression ends with } and returns void.

one = { 1 };

Statement

If semicolon shows up after an expression, this expression will have void return value, i.e. no return value. Most expressions have a return value except for assignment, for loop, while loop, break, continue, and return.

void = { 1; }
one = 1;

Main

All statements in a Felys program must not have return values. IThe program has a default return value void, but you can return anything using the return keyword. This would also early terminates the program, and is also the only interface to output something.

return __author__;

The program usually spawn another thread to timeout the runtime, becuase it is designed such that it can run safely on the website. The callstack is also limited to protect the server. Felys is slow because the backend just walks through the syntax tree and copy-pastes everything, which leads to lots of overhead.

Design Principle

A programming language is a product, so we need to understand what makes up a programming language. Understanding this will explain why some features are designed the way they are. This section will serve as important guidance when implementing the language.

Programming Language

Before we get into any details, it is crucial to understand differences between languages.

Compilation vs Interpretation

The key to determining whether a language is compiled or interpreted is to check if it is eventually translated into machine code. Machine code refers to the executable that runs directly without requiring an external runtime. Traditional languages like C/C++ and modern ones like Rust and Go are well-known examples.

Interpreted languages always require a runtime to execute the code, even if the source code goes through a compilation process. Python, Java, and JavaScript fall into this category. In terms of performance, compiled languages are usually much faster than interpreted ones, and some can reach speeds close to assembly language. However, Java can be an exception, as it is statically typed and benefits from advanced runtime optimizations, making its performance highly competitive.

Interpreted languages are usually more flexible and can easily integrate libraries written in compiled languages. For example, most machine learning libraries in CPython are actually written in C and C++, and CPython is just the interface.

Explicit Memory Management vs Garbage Collection

Many programming languages use garbage collection to hide the complexity of memory management from programmers, although the implementations can vary significantly. For example, Java relies on the JVM (Java Virtual Machine), which manages its own stack, heap, program counter, and other resources. In contrast, Go uses a built-in runtime and does not rely on an external platform.

C, C++, and Rust are some popular languages that do not require a garbage collection runtime. The former usually requires manual memory management (although C++ supports smart pointers), while the latter uses a borrow checker and lifetime analysis to ensure memory safety at compile time. Explicit memory management is usually faster and more suitable for embedded systems due to its determinism.

Functional vs Imperative

Functional programming is what many CS students learn in their first year, and it has been trending in recent years. Pure functional programming has no side effects, and everything is immutable. One benefit is that it makes logic clear and easier to debug, although it is not very beginner friendly. Languages of this type, such as Haskell and Racket, are commonly used in academia.

In contrast, imperative languages focus on flow control, introducing constructs like if-else clauses and loops. They are generally more readable for everyday use. To take advantage of functional programming, imperative languages often include partial support for it. This allows programmers to use functional thinking to write logic and easily assemble components to build maintainable and complex products.

Static Typing vs Dynamic Typing

Static typing means that a variable is bound to a specific data type at compile time and cannot be reassigned to a value of a different type. In contrast, dynamic typing allows variables to hold values of different types during runtime without explicit declarations. If we look deep enough into how computers work, we find that data types are an abstraction used by higher-level languages. At the hardware level, data is just a sequence of bits. The processor has no concept of types and only executes instructions. It’s the assembler and compiler that ensure data is interpreted correctly, e.g., IEEE 754. When we arrange data in contiguous memory and define how it should be interpreted, we create data structures. This close relationship between memory layout and type interpretation makes static typing natural in many compiled languages, where knowing the type is essential for generating correct machine code.

As interpreted languages became more popular, dynamic typing emerged as a more flexible alternative as the runtime can handle types dynamically. One major benefit of dynamic typing is its flexibility and speed during development, especially in scripting and prototyping. But for large projects, developers often reintroduce type systems, e.g., Python or TypeScript.

Strong Typing vs Weak Typing

The key difference between strong and weak typing is type coercion. Some languages, such as JavaScript and C/C++, automatically convert data types when performing operations. For example, in C, the expression 1 + '1' is valid because the character '1' is implicitly converted to its corresponding ASCII value. It's important not to confuse weak typing with operator overloading, as they are entirely different mechanisms. Weak typing can sometimes lead to unexpected behavior, so modern programmers tend to avoid relying on implicit conversions and instead write them out explicitly to improve code clarity and maintainability.

Technical Decisions

Comming soon...

Grammar File

Felys grammar file uses customized syntax to generate the parser and lexer. However, due to limitations of the parser generator, the non-terminals ident and eof are still written by hand. Their implementations can be found in the source code.

{
    use crate::ast::*;
    #[allow(unused)]
    use std::rc::Rc;
}

grammar -> { Grammar }:
    / stmts=(stmt=stmt $)* #eof { Grammar(stmts) }
    ;

@(memo)
pat -> { Pat }:
    / @(token)('_' !IB) { Pat::Any }
    / ident=ident { Pat::Ident(ident) }
    / lit=lit { Pat::Lit(lit) }
    / '(' first=pat ',' second=#pat more=(',' pat=#pat)* #')' { {
        let mut elements = more;
        elements.insert(0, second);
        elements.insert(0, first);
        Pat::Tuple(elements)
    } }
    ;

stmt -> { Stmt }:
    / expr=expr ';' { Stmt::Semi(expr) }
    / expr=expr { Stmt::Expr(expr) }
    / ';' { Stmt::Empty }
    ;

block -> { Block }:
    / '{' stmts=stmt* #'}' { Block(stmts) }
    ;

@(memo)
expr -> { Expr }:
    / assignment=assignment
    / disjunction=disjunction
    / block=block { Expr::Block(block) }
    / @(token)('break' !IB) expr=expr? { Expr::Break(expr.map(Rc::new)) }
    / @(token)('continue' !IB) { Expr::Continue }
    / @(token)('for' !IB) pat=#pat @(token)(#'in' !IB) expr=#expr block=#block {
        Expr::For(pat, expr.into(), block)
    }
    / @(token)('if' !IB) expr=#expr block=#block otherwise=otherwise? {
        Expr::If(expr.into(), block, otherwise.map(Rc::new))
    }
    / @(token)('loop' !IB) block=#block { Expr::Loop(block) }
    / @(token)('return' !IB) expr=expr? { Expr::Return(expr.map(Rc::new)) }
    / @(token)('while' !IB) expr=#expr block=#block { Expr::While(expr.into(), block) }
    ;

otherwise -> { Expr }: @(token)('else' !IB) expr=#expr ;

assignment -> { Expr }:
    / pat=pat '=' !'=' expr=#expr { Expr::Assign(pat, AssOp::Eq, expr.into()) }
    / pat=pat '+=' expr=#expr { Expr::Assign(pat, AssOp::AddEq, expr.into()) }
    / pat=pat '-=' expr=#expr { Expr::Assign(pat, AssOp::SubEq, expr.into()) }
    / pat=pat '*=' expr=#expr { Expr::Assign(pat, AssOp::MulEq, expr.into()) }
    / pat=pat '/=' expr=#expr { Expr::Assign(pat, AssOp::DivEq, expr.into()) }
    / pat=pat '%=' expr=#expr { Expr::Assign(pat, AssOp::ModEq, expr.into()) }
    ;

disjunction -> { Expr }:
    / lhs=disjunction @(token)('or' !IB) rhs=#conjunction {
        Expr::Binary(lhs.into(), BinOp::Or, rhs.into())
    }
    / conjunction=conjunction
    ;

conjunction -> { Expr }:
    / lhs=conjunction @(token)('and' !IB) rhs=#inversion {
        Expr::Binary(lhs.into(), BinOp::And, rhs.into())
    }
    / inversion=inversion
    ;

inversion -> { Expr }:
    / @(token)('not' !IB) inversion=#inversion { Expr::Unary(UnaOp::Not, inversion.into()) }
    / equality=equality
    ;

equality -> { Expr }:
    / lhs=equality '==' rhs=#comparison { Expr::Binary(lhs.into(), BinOp::Eq, rhs.into()) }
    / lhs=equality '!=' rhs=#comparison { Expr::Binary(lhs.into(), BinOp::Ne, rhs.into()) }
    / comparison=comparison
    ;

comparison -> { Expr }:
    / lhs=comparison '>=' rhs=#term { Expr::Binary(lhs.into(), BinOp::Ge, rhs.into()) }
    / lhs=comparison '<=' rhs=#term { Expr::Binary(lhs.into(), BinOp::Le, rhs.into()) }
    / lhs=comparison '>' rhs=#term { Expr::Binary(lhs.into(), BinOp::Gt, rhs.into()) }
    / lhs=comparison '<' rhs=#term { Expr::Binary(lhs.into(), BinOp::Lt, rhs.into()) }
    / term=term
    ;

term -> { Expr }:
    / lhs=term '+' rhs=#factor { Expr::Binary(lhs.into(), BinOp::Add, rhs.into()) }
    / lhs=term '-' rhs=#factor { Expr::Binary(lhs.into(), BinOp::Sub, rhs.into()) }
    / factor=factor
    ;

factor -> { Expr }:
    / lhs=factor '*' rhs=#unary { Expr::Binary(lhs.into(), BinOp::Mul, rhs.into()) }
    / lhs=factor '/' rhs=#unary { Expr::Binary(lhs.into(), BinOp::Div, rhs.into()) }
    / lhs=factor '%' rhs=#unary { Expr::Binary(lhs.into(), BinOp::Mod, rhs.into()) }
    / unary=unary
    ;

unary -> { Expr }:
    / '+' unary=#unary { Expr::Unary(UnaOp::Pos, unary.into()) }
    / '-' unary=#unary { Expr::Unary(UnaOp::Neg, unary.into()) }
    / call=call
    ;

call -> { Expr }:
    / call=call '(' args=args? #')' {
        Expr::Call(call.into(), args.unwrap_or_default())
    }
    / primary=primary
    ;

args -> { Vec<Expr> }:
    / first=expr more=(',' expr=#expr)* { {
      let mut elements = more;
      elements.insert(0, first);
      elements
    } }
    ;

primary -> { Expr }:
    / lit=lit { Expr::Lit(lit) }
    / ident=ident { Expr::Ident(ident) }
    / '(' expr=#expr ')' { Expr::Paren(expr.into()) }
    / '(' first=#expr #',' second=#expr more=(',' expr=#expr)* #')' { {
        let mut elements = more;
        elements.insert(0, second);
        elements.insert(0, first);
        Expr::Tuple(elements)
    } }
    / '[' args=args? #']' { Expr::List(args.unwrap_or_default()) }
    / '|' params=params? #'|' expr=#expr {
        Expr::Closure(params.unwrap_or_default(), expr.into())
    }
    ;

params -> { Vec<Ident> }:
    / first=ident more=(',' ident=#ident)* { {
        let mut elements = more;
        elements.insert(0, first);
        elements
    } }
    ;

lit -> { Lit }:
    / float=FLOAT { Lit::Float(float) }
    / int=INT { Lit::Int(int) }
    / str=str { Lit::Str(str) }
    / bool=bool { Lit::Bool(bool) }
    ;

@(token)
str -> { Vec<Chunk> }: '"' chunks=chunk* #'"' ;

@(token)
chunk -> { Chunk }:
    / slice=SLICE { Chunk::Slice(slice) }
    / '\\u{' hex=#HEX #'}' { Chunk::Unicode(hex) }
    / '\\' escape=#ESCAPE { Chunk::Escape(escape) }
    ;

@(token)
bool -> { Bool }:
    / 'true' !IB { Bool::True }
    / 'false' !IB { Bool::False }
    ;

IB: [a-zA-Z0-9_] ;

@(intern) {
    INT: '0' | [1-9][0-9]* ;
    FLOAT: [1-9][0-9]* '.' [0-9]+ | '0.' [0-9]+;
    IDENT: [a-zA-Z_] IB* ;
    SLICE: [^"\\]+ ;
    HEX: [0-9a-f]* ;
    ESCAPE: ['ntr\\] ;
}

@(ws) {
    WS: [\u{20}\n\t\r]+ ;
    COMMENT: '//' [^\n]* ;
}

Parser Generator

Comming soon...

PEG/Packrat

Comming soon...

Whitespace Handling

Comming soon...

Algorithm 3.36

Finite automata come in two types: deterministic (DFA) and non-deterministic (NFA). While DFAs are generally faster, NFAs allow more flexible transitions and is more extensive. However, our main parsing strategy is based on PEG, which supports infinite backtracking. Since the lexer only needs to recognize simple lexemes and efficiency is important, DFAs are preferred for lexical analysis. Algorithm 3.36 introduced by the Dragon Book can be used to construct DFAs directly.

Introduction to Language

This is a highly academic topic and too lengthy for this book, so I will briefly explain it in plain language. In short, a formal language is a set of strings over a given alphabet, and a regular expression is a syntactic tool used to describe certain types of formal languages, specifically regular languages. If s is a symbol from an alphabet, then L(s) = {s} is a language consisting of a single string. Let c1 and c2 be languages. The core operations used in regular expressions are:

(c1)|(c2): denotes the union of c1 and c2, i.e., strings in either c1 or c2
(c1)(c2): denotes concatenation, i.e., strings formed by c1 followed by c2
(c)*: denotes the Kleene star, i.e., zero or more repetitions of strings from c
(c): parentheses are used for grouping and are semantically neutral

Note: All these are still languages, so that induction works.

Preparator

Before generating them, we need to compute followpos using nullable, firstpos, and lastpos. To make this table look nicer, I will remove all pos postfixes. Note that parenthesis is not mentioned, because it only change precedence, so we can just recursively call into it.

Node: `n`	`nullable(n)`	`first(n)`	`last(n)`
Position: `i`	`false`	`{i}`	`{i}`
Union: `c1\|c2`	`nullable(c1) \|\| nullable(c2)`	`first(c1) \| first(c2)`	`last(c2) \| last(c1)`
Concat: `c1c2`	`nullable(c1) && nullable(c2)`	`first(c1) \| first(c2) if nullable(c1) else first(c1)`	`last(c2) \| last(c1) if nullable(c2) else last(c2)`
Kleene: `c*`	`true`	`first(c)`	`last(c)`

Once we have these formulae, we can compute followpos, which can be represented as a graph G(V, E), where V is the set of positions, and E is the set of edges corresponding to followpos relationships. To compute this, we need to traverse the syntax tree. There are only two cases in which one position can be made to follow another:

Concat(c1c2): all positions in first(c2) are in follow(i) for i in last(c1)
Kleene(c*): all positions in first(c) are in follow(i) for i in last(c)

If your usecase involves constructing very large transition table, it would be helpful to cache all the reasults of nullable, firstpos, and lastpos. For Felys, the transition table are usually small, so there is no need to bring extra overhead.

Construction

Before building everything, we need to append a special pound symbol to the end of the language so that we can identify the terminal state. It is recommended to do this at the syntax tree level, rather than appending it directly to the regular expression.

Comming soon...

Code Generation

Comming soon...

Meta Syntax

This is a fully functional meta-syntax that can bootstrap the parser generator. The non-terminals action and eof are handwritten. The former requires embedding a foreign language, so it needs to maintain a brace counter, which is non-standard.

{ use crate::ast::*; }

grammar -> { Grammar }:
    / import=action? $ callables=(callable=callable $)* #eof {
        Grammar { import, callables }
    }
    ;

callable -> { Callable }:
    / decorator=decorator? name=NAME '->' ty=#action #':' rule=#rule #';' {
        Callable::Rule(decorator, name, ty, rule)
    }
    / decorator=decorator? name=NAME #':' regex=#union #';' {
        Callable::Regex(decorator, name, regex)
    }
    / decorator=decorator #'{' shared=#(name=NAME #':' regex=#union #';')+ #'}' {
        Callable::Shared(decorator, shared)
    }
    ;

decorator -> { Decorator }:
    / '@' #'(' first=#tag more=(',' tag=#tag)* #')' { {
        let mut tags = more;
        tags.insert(0, first);
        Decorator { tags }
    } }
    ;

tag -> { Tag }:
    / 'memo' { Tag::Memo }
    / 'left' { Tag::Left }
    / 'token' { Tag::Token }
    / 'intern' { Tag::Intern }
    / 'ws' { Tag::Whitespace }
    ;

rule -> { Rule }:
    / '/'? first=#alter more=('/' alter=#alter)* { {
        let mut alters = more;
        alters.insert(0, first);
        Rule { alters }
    } }
    ;

alter -> { Alter }:
    / assignments=#assignment+ action=action? { Alter { assignments, action } }
    ;

assignment -> { Assignment }:
    / name=NAME '=' item=#item { Assignment::Named(name, item) }
    / lookahead=lookahead { Assignment::Lookahead(lookahead) }
    / item=item { Assignment::Anonymous(item) }
    / '$' { Assignment::Clean }
    ;

lookahead -> { Lookahead }:
    / '&' atom=#atom { Lookahead::Positive(atom) }
    / '!' atom=#atom { Lookahead::Negative(atom) }
    ;

item -> { Item }:
    / atom=atom '?' { Item::Optional(atom) }
    / atom=atom '*' { Item::ZeroOrMore(atom) }
    / eager='#'? atom=atom '+' { Item::OnceOrMore(eager.is_some(), atom) }
    / eager='#'? atom=atom { Item::Name(eager.is_some(), atom) }
    ;

atom -> { Atom }:
    / decorator=decorator? '(' rule=#rule #')' { Atom::Nested(decorator, rule) }
    / string=string { Atom::String(string) }
    / name=NAME { Atom::Name(name) }
    ;

union -> { Regex }:
    / lhs=union '|' rhs=#concat { Regex::Union(lhs.into(), rhs.into()) }
    / concat=concat { concat }
    ;

concat -> { Regex }:
    / lhs=concat rhs=repeat { Regex::Concat(lhs.into(), rhs.into()) }
    / repeat=repeat { repeat }
    ;

repeat -> { Regex }:
    / inner=repeat '+' { Regex::OnceOrMore(inner.into()) }
    / inner=repeat '*' { Regex::ZeroOrMore(inner.into()) }
    / primary=primary { Regex::Primary(primary) }
    ;

primary -> { Primary }:
    / '(' regex=#union #')' { Primary::Parentheses(regex.into()) }
    / string=string { Primary::Literal(string) }
    / name=NAME { Primary::Name(name) }
    / set=set { set }
    ;

@(token)
set -> { Primary }:
    / '[' '^' set=#range+ #']' { Primary::Exclude(set) }
    / '[' set=#range+ #']' { Primary::Include(set) }
    ;

@(token)
string -> { Vec<usize> }:
    / '\'' string=#LCH+ #'\'' { string }
    ;

@(token)
range -> { (usize, usize) }:
    / start=SCH '-' end=#SCH { (start, end) }
    / ch=SCH { (ch, ch) }
    ;

@(intern, memo)
NAME: [a-zA-Z_][a-zA-Z0-9_]* ;

@(intern) {
    LCH: UNICODE | ESCAPE | [^\'\\] ;
    SCH: UNICODE | ESCAPE | [^\]\\] ;
}

@(ws) {
    WS: [\u{20}\n\t\r]+ ;
    COMMENT: '//' [^\n]* ;
}

UNICODE: '\\' 'u' '{' [0-9a-f]* '}' ;
ESCAPE: '\\' [\'\[\]ntr\\] ;

Afterword

Comming soon...

Keyboard shortcuts

Everything About Felys