Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Everything About Felys

Felys is a dynamically strongly typed scripting language with a syntax similar to Rust. It provides minimal features to support data grouping with dynamic length, i.e., tuples and lists. The entire project is intended to be a toy language for demonstration purposes only. I have never considered turning this into a real product; instead, it serves as a tribute to Elysia from Honkai Impact 3rd.

Feel to try Felys using the online playground.

Topics

In short, there are three main topics that worth sharing:

  • Design principle that covers all the considerations involved in designing Felys.
  • Implementation of a PEG/Packrat parser along with a DFA-driven lexer.
  • Neural network framework with automatic differentiation and dynamic computation graph.

Some of these are a bit academic, but I believe this book should be accessible to most EE/CS undergraduates beyond their first year.

Most of these are not done yet. Writing takes time.

Other properties and any right, title, and interest thereof and therein (intellectual property rights included) not derived from Honkai Impact 3rd belong to their respective owners.

Quickstart

The playground also has some sample programs.

Literal

Felys has built-in support for float, integer, string, tuple, list, and matrix. Their underlying implementation uses f64, isize, String, Vec, Vec, and Vec in Rust standard library.

int = 42;
float = 3.14;
string = "Elysia";
tuple = ("Elysia", 11.11);
list = [1, 2, 3, 4, 5];
matrix = [
    0.0, 0.0, 0.0;
    0.0, 0.0, 0.0;
];
learnable = <2, 3>;

Comment

Just like many programming languages, Felys uses double slash for comments:

// This is a comment

Assignment

Beyond standard assignment, you also unpack a tuple and the assign it. Felys also had syntactic sugar like +=, -=, etc.

(name, birthday) = ("Elysia", 11.0);
birthday += 0.11;

Felys also provides built-in identifier storing values in global scope. You can use rust keyword to distinguish them from user defined identifiers.

author = rust __author__;

Operator

Just like any other languages, Felys has arithmetic, comparison, and logical operator. Be aware that Felys is strong typed, which means that things like 1 and 1.0, are not same and 1 + 1.0 does not evaluate.

sum = 1 + 1;
dot = <1, 2> @ <2, 1>;
conjunction = true and true or false;
equality = 1 == 1;

Flow control

Unlike condition in other languages, else in Felys allows any type of expression to follow by.

one = if true {
    1
} else 0;

one = if true {
    1
} else loop {
    break 1;
}

There are three types of loops using keywords loop, while, and for, along with break and continue. The break keyword appeared in loop can carry a return value. for loop will go through an iterable, i.e., list.

one = loop {
    break 1;
};

while true {
    if one {
        break;
    }
}

for x in [1, 2, 3] {
    if x == 2 {
        break;
    }
}

The last statement of a block is also the return value of the block. All statements before it must not have a return value. You can to use ; to make them void, and the best practice is to always add the ; except for expression ends with } and returns void.

one = { 1 };

Statement

If semicolon shows up after an expression, this expression will have void return value, i.e. no return value. Most expressions have a return value except for assignment, for loop, while loop, break, continue, and return.

void = { 1; };
one = 1;

Neural Network

All neural network related operation must be done on matrix containing float values only. Declaration of a matrix is similar to MATLAB:

matrix = [
    0.0, 0.0, 0.0;
    0.0, 0.0, 0.0;
];

If you want a learnable parameter that is initialized randomly, you can declare it like this:

matrix = <2, 3>;

The runtime can get the correct value when evaluating this, instead of initializing it again. For unmodified programs, you can optionally load previous parameters. If you are using the playground, this can be achieved by clicking the lock icon.

Once you want to back propagate, you can call:

step loss by 0.01;

The loss must be a matrix, and the 0.01 is the learning rate. Felys implemented a SGD optimizer with momentum as backend.

Main

All statements in a Felys program must not have return values. The program has a default return value void, but you can return anything using the return keyword. This would also early terminates the program, and is also the only interface to output something.

return "Elysia";

Design Principle

A programming language is a product, so we need to understand what makes up a programming language. Understanding this will explain why some features are designed the way they are. This section will serve as important guidance when implementing the language.

Programming Language

Before we get into any details, it is crucial to understand differences between languages.

Compilation vs Interpretation

The key to determining whether a language is compiled or interpreted is to check if it is eventually translated into machine code. Machine code refers to the executable that runs directly without requiring an external runtime. Traditional languages like C/C++ and modern ones like Rust and Go are well-known examples.

Interpreted languages always require a runtime to execute the code, even if the source code goes through a compilation process. Python, Java, and JavaScript fall into this category. In terms of performance, compiled languages are usually much faster than interpreted ones, and some can reach speeds close to assembly language. However, Java can be an exception, as it is statically typed and benefits from advanced runtime optimizations, making its performance highly competitive.

Interpreted languages are usually more flexible and can easily integrate libraries written in compiled languages. For example, most machine learning libraries in CPython are actually written in C and C++, and CPython is just the interface.

Explicit Memory Management vs Garbage Collection

Many programming languages use garbage collection to hide the complexity of memory management from programmers, although the implementations can vary significantly. For example, Java relies on the JVM (Java Virtual Machine), which manages its own stack, heap, program counter, and other resources. In contrast, Go uses a built-in runtime and does not rely on an external platform.

C, C++, and Rust are some popular languages that do not require a garbage collection runtime. The former usually requires manual memory management (although C++ supports smart pointers), while the latter uses a borrow checker and lifetime analysis to ensure memory safety at compile time. Explicit memory management is usually faster and more suitable for embedded systems due to its determinism.

Functional vs Imperative

Functional programming is what many CS students learn in their first year, and it has been trending in recent years. Pure functional programming has no side effects, and everything is immutable. One benefit is that it makes logic clear and easier to debug, although it is not very beginner friendly. Languages of this type, such as Haskell and Racket, are commonly used in academia.

In contrast, imperative languages focus on flow control, introducing constructs like if-else clauses and loops. They are generally more readable for everyday use. To take advantage of functional programming, imperative languages often include partial support for it. This allows programmers to use functional thinking to write logic and easily assemble components to build maintainable and complex products.

Static Typing vs Dynamic Typing

Static typing means that a variable is bound to a specific data type at compile time and cannot be reassigned to a value of a different type. In contrast, dynamic typing allows variables to hold values of different types during runtime without explicit declarations. If we look deep enough into how computers work, we find that data types are an abstraction used by higher-level languages. At the hardware level, data is just a sequence of bits. The processor has no concept of types and only executes instructions. It’s the assembler and compiler that ensure data is interpreted correctly, e.g., IEEE 754. When we arrange data in contiguous memory and define how it should be interpreted, we create data structures. This close relationship between memory layout and type interpretation makes static typing natural in many compiled languages, where knowing the type is essential for generating correct machine code.

As interpreted languages became more popular, dynamic typing emerged as a more flexible alternative as the runtime can handle types dynamically. One major benefit of dynamic typing is its flexibility and speed during development, especially in scripting and prototyping. But for large projects, developers often reintroduce type systems, e.g., Python or TypeScript.

Strong Typing vs Weak Typing

The key difference between strong and weak typing is type coercion. Some languages, such as JavaScript and C/C++, automatically convert data types when performing operations. For example, in C, the expression 1 + '1' is valid because the character '1' is implicitly converted to its corresponding ASCII value. It's important not to confuse weak typing with operator overloading, as they are entirely different mechanisms. Weak typing can sometimes lead to unexpected behavior, so modern programmers tend to avoid relying on implicit conversions and instead write them out explicitly to improve code clarity and maintainability.

Technical Decisions

It is necessary to clarify what is Felys built for, so that decisions make sense.

What is Felys

Every modern programming language has its target. For example, Rust aims for memory safety without garbage collection, TypeScript adds typing to JavaScript, and Go targets simplicity and concurrency. It is not practical for a single person to develop and maintain an entire programming language, so at the very beginning, Felys does not aim at anything practical. Other than the reasons I stated in stories behind the screen, Felys is:

A DEMONSTRABLE PROJECT TO ANYONE REVIEWING MY RESUME

The keyword here is demonstrable which means that Felys should have a online playground where everyone can run and edit sample code. At the same time, I want Felys core package to be free from external dependencies to make it more challenging and also for gimmick purpose.

Design and Reasoning

Corresponding to last section, Felys is designed to be interpreted, garbage collected, imperative, static typing, and strong typing.

Interpreted and Garbage Collected

Interpreted languages typically offer greater control over resources, which allows me to make online playgrounds safer and more robust. Additionally, compilation would require Felys to either adapt to many different platforms or use LLVM as a backend. The first option is unstable and time-consuming, while LLVM represents an external dependency. Thus, interpretation and garbage collection are the only viable choices.

Imperative

Many programmers, myself included, are not accustomed to functional programming. It's very common for people to forget it after taking an introductory course in their first year of study. To ensure that users of the online playground can understand and modify the sample code, Felys should be imperative and use syntax that people are familiar with.

Static and Strong Typing

Enforcing static and strong typing will make programs less prone to silly mistakes and faster. The compiler can perform extensive checks to ensure type safety, allowing Felys to also support a powerful match statement. However, this feature is not yet supported, as it will be more natural to implement when building the custom virtual machine in the future.

Choice of the Language

Rust is the chosen language for its type system, but there are many other alternatives. For example, C/C++ are popular choices because they are fast and allow direct memory management, while Go is another option due to its modern design and ease of use. Other interpreted languages like Python and TypeScript are not considered, as they are overly abstracted from low-level operations.

Rust is a very interesting language. There are countless modern features that I enjoy, such as the match statement, the trait system, and its functional statements. All of these make code clean and allow it to accurately express the underlying logic. However, the lifetime system can sometimes make things too complicated, eventually compromising that cleanliness. In general, the advantages outweigh the disadvantages, so Felys will continue using Rust until a better alternative emerges.

Parser Generator

There are two ways to write a parser: a hand-written recursive descent parser or one generated by a parser generator such as ANTLR4 or YACC/Bison. The former is flexible but more prone to errors and harder to maintain, while the latter offers better maintainability but requires in-depth knowledge of parser generation. Felys originally used a hand-written parser but later migrated to a customized parser generator that also bootstraps itself. This section will discuss how to build a PEG parser generator along with a lexer generator.

Algorithm 3.36

Finite automata come in two types: deterministic (DFA) and non-deterministic (NFA). While DFAs are generally faster, NFAs allow more flexible transitions and is more extensive. However, our main parsing strategy is based on PEG, which supports infinite backtracking. Since the lexer only needs to recognize simple lexemes and efficiency is important, DFAs are preferred for lexical analysis. Algorithm 3.36 introduced by the Dragon Book can be used to construct DFAs directly.

Introduction to Language

This is a highly academic topic and too lengthy for this book, so I will briefly explain it in plain language. In short, a formal language is a set of strings over a given alphabet, and a regular expression is a syntactic tool used to describe certain types of formal languages, specifically regular languages. If s is a symbol from an alphabet, then L(s) = {s} is a language consisting of a single string. Let c1 and c2 be languages. The core operations used in regular expressions are:

  1. (c1)|(c2): denotes the union of c1 and c2, i.e., strings in either c1 or c2
  2. (c1)(c2): denotes concatenation, i.e., strings formed by c1 followed by c2
  3. (c)*: denotes the Kleene star, i.e., zero or more repetitions of strings from c
  4. (c): parentheses are used for grouping and are semantically neutral

Note: All these are still languages, so that induction works.

Preparator

Before generating them, we need to compute followpos using nullable, firstpos, and lastpos. To make this table look nicer, I will remove all pos postfixes. Note that parenthesis is not mentioned, because it only change precedence, so we can just recursively call into it.

Node: nnullable(n)first(n)last(n)
Position: ifalse{i}{i}
Union: c1|c2nullable(c1) || nullable(c2)first(c1) | first(c2)last(c2) | last(c1)
Concat: c1c2nullable(c1) && nullable(c2)first(c1) | first(c2) if nullable(c1) else first(c1)last(c2) | last(c1) if nullable(c2) else last(c2)
Kleene: c*truefirst(c)last(c)

Once we have these formulae, we can compute followpos, which can be represented as a graph G(V, E), where V is the set of positions, and E is the set of edges corresponding to followpos relationships. To compute this, we need to traverse the syntax tree. There are only two cases in which one position can be made to follow another:

  1. Concat(c1c2): all positions in first(c2) are in follow(i) for i in last(c1)
  2. Kleene(c*): all positions in first(c) are in follow(i) for i in last(c)

If your use case involves constructing very large transition table, it would be helpful to cache all the results of nullable, firstpos, and lastpos. For Felys, the transition table are usually small, so there is no need to bring extra overhead.

Construction

Here are the syntax tree nodes:

enum Language {
    Union(Box<Language>, Box<Language>),
    Concat(Box<Language>, Box<Language>),
    Kleene(Box<Language>),
    Nested(Box<Language>),
    Position(Position, usize),
}

enum Position {
    Set(Vec<(usize, usize)>),
    Pound,
}

This is pretty straight forward, but I do want clarify two designs here. The usize in Language::Position is its label or i mentioned in the table. Every position must have a unique i. Secondly, instead of a single character, we want to use Position::Set that represent all characters that acceptable for this position. This is a necessary modification to make this algorithm practical, because we can treat a range of characters as a whole. For instances (pseudocode only):

  • single character '0' is equivalent to [('0', '0')]
  • inclusive set [0-9] is equivalent to [('0', '9')]
  • exclusive set [^0-9] is equivalent to [(MIN, '/'), (':', MAX)]

Core Algorithm

Before building everything, we need to append a special pound symbol at the end of the language to identify the terminal state. It's recommended to do this at the syntax tree level rather than directly appending it to the regular expression.

Once the syntax tree is built and the pound is appended, we can compute the followpos graph, and also collect all the position indices.

Then we can use apply the following algorithm (the original pseudocode):

initialize Dstates to contain only the unmarked state firstpos(n0),
    where n0 is the root of syntax tree T for (r)#;
while ( there is an unmarked state S in Dstates ) {
    mark S;
    for ( each input symbol a ) {
        let U be the union of followpos(p) for all p
            in S that correspond to a;
        if ( U is not in Dstates )
            add U as an unmarked state to Dstates;
        Dtran[S, a] = U;
    }
}

The accepting states are those that contain the pound symbol.

Improve the Algorithm

However, when performing each input symbol a, there are millions of Unicode characters, so it's not practical to iterate through all of them. The previously mentioned modification, Position::Set, is designed to address this issue. Nevertheless, it introduces another problem: to compute U, we cannot simply use a set to store the data, because the ranges may overlap. Therefore, additional splitting and merging are required to produce the minimal number of atomic ranges. There are three steps involved: compute boundaries, create atomic ranges, and select valid ranges.

Here's an implementation in Rust:

let mut saturated = false;
let mut boundaries = Vec::with_capacity(symbols.len() * 2);
for &(start, end) in &symbols {
    if end == usize::MAX {
        saturated = true;
    }
    boundaries.push(start);
    boundaries.push(end.saturating_add(1));
}
boundaries.sort_unstable();
boundaries.dedup();

let mut atomic = boundaries
    .windows(2)
    .map(|x| (x[0], x[1].saturating_sub(1)))
    .collect::<Vec<_>>();
if saturated {
    atomic.last_mut().unwrap().1 = usize::MAX;
}

let ranges = atomic
    .into_iter()
    .filter(|x| {
        symbols
            .iter()
            .any(|&(start, end)| start <= x.0 && x.1 <= end)
    })
    .collect::<Vec<_>>();

Then we can safely iterate through ranges instead of all unicode characters.

Representation

Unlike standard transition table represented in matrix, our version takes range and transfer into another state. There are many ways to do it, but for Rust we can use a match expression with multiple (start..=end) => state ended with a _ => break inside a loop. Once the loop breaks, we can check the acceptance of the state.

Next Step

Finite automata is a well-studied topic in computer science, and I have only covered the tip of the iceberg. If you're interested, please refer to the Dragon Book. It contains an entire section on lexical analysis techniques, presented in highly academic languages.

Grammar File

Felys grammar file uses customized syntax to generate the parser and lexer. However, due to limitations of the parser generator, the non-terminals ident and eof are still written by hand. Their implementations can be found in the source code.

{
    use crate::ast::*;
    #[allow(unused)]
    use std::rc::Rc;
}

peg grammar -> { Grammar }:
    / stmts=(stmt=stmt $)* #eof { Grammar(stmts) }
    ;

@(memo)
peg pat -> { Pat }:
    / UNDERSCORE { Pat::Any }
    / ident=ident { Pat::Ident(ident) }
    / lit=lit { Pat::Lit(lit) }
    / '(' first=pat ',' second=pat more=(',' pat=pat)* ','? ')' {
        Pat::Tuple(BufVec::new([first, second], more))
    }
    ;

peg stmt -> { Stmt }:
    / expr=expr ';' { Stmt::Semi(expr) }
    / expr=expr { Stmt::Expr(expr) }
    / ';' { Stmt::Empty }
    ;

peg block -> { Block }:
    / '{' stmts=stmt* #'}' { Block(stmts) }
    ;

@(memo)
peg expr -> { Expr }:
    / assignment=assignment
    / disjunction=disjunction
    / block=block { Expr::Block(block) }
    / BREAK expr=expr? { Expr::Break(expr.map(Rc::new)) }
    / CONTINUE { Expr::Continue }
    / FOR pat=#pat #IN expr=#expr block=#block { Expr::For(pat, expr.into(), block) }
    / IF expr=#expr block=#block otherwise=(ELSE expr=#expr)? {
        Expr::If(expr.into(), block, otherwise.map(Rc::new))
    }
    / LOOP block=#block { Expr::Loop(block) }
    / RETURN expr=expr? { Expr::Return(expr.map(Rc::new)) }
    / WHILE expr=#expr block=#block { Expr::While(expr.into(), block) }
    / STEP loss=#expr #BY lr=#expr { Expr::Step(loss.into(), lr.into()) }
    / PRINT expr=#expr { Expr::Print(expr.into()) }
    ;

peg assignment -> { Expr }:
    / pat=pat '=' !'=' expr=#expr { Expr::Assign(pat, AssOp::Eq, expr.into()) }
    / pat=pat '+=' expr=#expr { Expr::Assign(pat, AssOp::AddEq, expr.into()) }
    / pat=pat '-=' expr=#expr { Expr::Assign(pat, AssOp::SubEq, expr.into()) }
    / pat=pat '*=' expr=#expr { Expr::Assign(pat, AssOp::MulEq, expr.into()) }
    / pat=pat '/=' expr=#expr { Expr::Assign(pat, AssOp::DivEq, expr.into()) }
    / pat=pat '%=' expr=#expr { Expr::Assign(pat, AssOp::ModEq, expr.into()) }
    ;

peg disjunction -> { Expr }:
    / lhs=disjunction OR rhs=#conjunction {
        Expr::Binary(lhs.into(), BinOp::Or, rhs.into())
    }
    / conjunction=conjunction
    ;

peg conjunction -> { Expr }:
    / lhs=conjunction AND rhs=#inversion {
        Expr::Binary(lhs.into(), BinOp::And, rhs.into())
    }
    / inversion=inversion
    ;

peg inversion -> { Expr }:
    / NOT inversion=#inversion { Expr::Unary(UnaOp::Not, inversion.into()) }
    / equality=equality
    ;

peg equality -> { Expr }:
    / lhs=equality '==' rhs=#comparison { Expr::Binary(lhs.into(), BinOp::Eq, rhs.into()) }
    / lhs=equality '!=' rhs=#comparison { Expr::Binary(lhs.into(), BinOp::Ne, rhs.into()) }
    / comparison=comparison
    ;

peg comparison -> { Expr }:
    / lhs=comparison '>=' rhs=#term { Expr::Binary(lhs.into(), BinOp::Ge, rhs.into()) }
    / lhs=comparison '<=' rhs=#term { Expr::Binary(lhs.into(), BinOp::Le, rhs.into()) }
    / lhs=comparison '>' rhs=#term { Expr::Binary(lhs.into(), BinOp::Gt, rhs.into()) }
    / lhs=comparison '<' rhs=#term { Expr::Binary(lhs.into(), BinOp::Lt, rhs.into()) }
    / term=term
    ;

peg term -> { Expr }:
    / lhs=term '+' rhs=#factor { Expr::Binary(lhs.into(), BinOp::Add, rhs.into()) }
    / lhs=term '-' rhs=#factor { Expr::Binary(lhs.into(), BinOp::Sub, rhs.into()) }
    / factor=factor
    ;

peg factor -> { Expr }:
    / lhs=factor '*' rhs=#dot { Expr::Binary(lhs.into(), BinOp::Mul, rhs.into()) }
    / lhs=factor '/' rhs=#dot { Expr::Binary(lhs.into(), BinOp::Div, rhs.into()) }
    / lhs=factor '%' rhs=#dot { Expr::Binary(lhs.into(), BinOp::Mod, rhs.into()) }
    / dot=dot
    ;

peg dot -> { Expr }:
    / lhs=dot '@' rhs=#unary { Expr::Binary(lhs.into(), BinOp::Dot, rhs.into()) }
    / unary=unary
    ;

peg unary -> { Expr }:
    / '+' unary=#unary { Expr::Unary(UnaOp::Pos, unary.into()) }
    / '-' unary=#unary { Expr::Unary(UnaOp::Neg, unary.into()) }
    / call=call
    ;

peg call -> { Expr }:
    / call=call '(' args=args? #')' {
        Expr::Call(call.into(), args)
    }
    / primary=primary
    ;

peg args -> { BufVec<Expr, 1> }:
    / first=expr more=(',' expr=expr)* ','? { BufVec::new([first], more) }
    ;

peg primary -> { Expr }:
    / lit=lit { Expr::Lit(lit) }
    / ident=ident { Expr::Ident(ident) }
    / RUST ident=#ident { Expr::Rust(ident) }
    / '(' expr=#expr ')' { Expr::Paren(expr.into()) }
    / '(' first=#expr #',' second=#expr more=(',' expr=expr)* ','? #')' {
        Expr::Tuple(BufVec::new([first, second], more))
    }
    / '[' first=row more=row* #']' { Expr::Matrix(BufVec::new([first], more)) }
    / '[' args=args? #']' { Expr::List(args) }
    / '|' params=params? #'|' expr=#expr { Expr::Closure(params, expr.into()) }
    / '<' rows=#INT #',' cols=#INT #'>' { Expr::Param(rows, cols, x.stream.cursor) }
    ;

peg row -> { BufVec<Float, 1> }:
    / first=FLOAT more=(',' f=FLOAT)* ';' { BufVec::new([first], more) }
    ;

peg params -> { BufVec<Ident, 1> }:
    / first=ident more=(',' ident=ident)* ','? { BufVec::new([first], more) }
    ;

peg lit -> { Lit }:
    / float=FLOAT { Lit::Float(float) }
    / int=INT { Lit::Int(int) }
    / str=STR { Lit::Str(str) }
    / bool=BOOL { Lit::Bool(bool) }
    ;

lex FLOAT -> { Float }:
    / float=UFX { Float::Positive(float) }
    / '-' float=UFX { Float::Negative(float) }
    ;

lex INT -> { Int }: int=USIZE { Int(int) } ;

lex STR -> { Vec<Chunk> }: '"' chunks=CHUNK* #'"' ;

lex CHUNK -> { Chunk }:
    / slice=SLICE { Chunk::Slice(slice) }
    / '\\u{' hex=#HEX #'}' { Chunk::Unicode(hex) }
    / '\\' escape=#ESCAPE { Chunk::Escape(escape) }
    ;

lex BOOL -> { Bool }:
    / "true" !TAIL { Bool::True }
    / "false" !TAIL { Bool::False }
    ;

lex UNDERSCORE -> { () }: "_" !TAIL ;
lex BREAK -> { () }: "break" !TAIL ;
lex CONTINUE -> { () }: "continue" !TAIL ;
lex FOR -> { () }: "for" !TAIL ;
lex IN -> { () }: "in" !TAIL ;
lex IF -> { () }: "if" !TAIL ;
lex LOOP -> { () }: "loop" !TAIL ;
lex RETURN -> { () }: "return" !TAIL ;
lex WHILE -> { () }: "while" !TAIL ;
lex ELSE -> { () }: "else" !TAIL ;
lex OR -> { () }: "or" !TAIL ;
lex AND -> { () }: "and" !TAIL ;
lex NOT -> { () }: "not" !TAIL ;
lex STEP -> { () }: "step" !TAIL ;
lex BY -> { () }: "by" !TAIL ;
lex PRINT -> { () }: "print" !TAIL ;
lex RUST -> { () }: "rust" !TAIL ;

@(intern, memo)
IDENT: [a-zA-Z_] TAIL* ;
TAIL: [a-zA-Z0-9_] ;

@(intern) {
    USIZE: '0' | [1-9][0-9]* ;
    UFX: [1-9][0-9]* '.' [0-9]+ | '0.' [0-9]+;
    SLICE: [^"\\]+ ;
    HEX: [0-9a-f]* ;
    ESCAPE: ['ntr\\] ;
}

@(ws) {
    WS: [\u{20}\n\t\r]+ ;
    COMMENT: '//' [^\n]* ;
}

Meta Syntax

This is a fully functional meta-syntax that can bootstrap the parser generator. The non-terminals action and eof are handwritten. The former requires embedding a foreign language, so it needs to maintain a brace counter, which is non-standard.

{
    use crate::ast::*;
    #[allow(unused_imports)]
    use crate::builder::common::s2c;
}

peg grammar -> { Grammar }:
    / import=action? $ callables=(callable=callable $)* #eof {
        Grammar { import, callables }
    }
    ;

peg callable -> { Callable }:
    / decorator=decorator? prefix=PREFIX name=#NAME #'->' ty=#action #':' rule=#rule #';' {
        Callable::Rule(decorator, prefix, name, ty, rule)
    }
    / decorator=decorator? name=NAME #':' regex=#union #';' {
        Callable::Regex(decorator, name, regex)
    }
    / decorator=decorator #'{' shared=#(name=NAME #':' regex=#union #';')+ #'}' {
        Callable::Shared(decorator, shared)
    }
    ;

peg decorator -> { Decorator }:
    / '@' #'(' first=#tag more=(',' tag=#tag)* #')' { Decorator { first, more } }
    ;

peg tag -> { Tag }:
    / 'memo' { Tag::Memo }
    / 'left' { Tag::Left }
    / 'intern' { Tag::Intern }
    / 'ws' { Tag::Whitespace }
    ;

peg rule -> { Rule }:
    / '/'? first=#alter more=('/' alter=#alter)* {
        Rule { first, more }
    }
    ;

peg alter -> { Alter }:
    / assignments=#assignment+ action=action? { Alter { assignments, action } }
    ;

peg assignment -> { Assignment }:
    / name=NAME '=' item=#item { Assignment::Named(name, item) }
    / lookahead=lookahead { Assignment::Lookahead(lookahead) }
    / item=item { Assignment::Anonymous(item) }
    / '$' { Assignment::Clean }
    ;

peg lookahead -> { Lookahead }:
    / '&' atom=#atom { Lookahead::Positive(atom) }
    / '!' atom=#atom { Lookahead::Negative(atom) }
    ;

peg item -> { Item }:
    / atom=atom '?' { Item::Optional(atom) }
    / atom=atom '*' { Item::ZeroOrMore(atom) }
    / eager='#'? atom=atom '+' { Item::OnceOrMore(eager.is_some(), atom) }
    / eager='#'? atom=atom { Item::Name(eager.is_some(), atom) }
    ;

peg atom -> { Atom }:
    / '(' rule=#rule #')' { Atom::Nested(rule) }
    / expect=EXPECT { Atom::Expect(expect) }
    / name=NAME { Atom::Name(name) }
    ;

peg union -> { Regex }:
    / lhs=union '|' rhs=#concat { Regex::Union(lhs.into(), rhs.into()) }
    / concat=concat { concat }
    ;

peg concat -> { Regex }:
    / lhs=concat rhs=repeat { Regex::Concat(lhs.into(), rhs.into()) }
    / repeat=repeat { repeat }
    ;

peg repeat -> { Regex }:
    / inner=repeat '+' { Regex::OnceOrMore(inner.into()) }
    / inner=repeat '*' { Regex::ZeroOrMore(inner.into()) }
    / primary=primary { Regex::Primary(primary) }
    ;

peg primary -> { Primary }:
    / '(' regex=#union #')' { Primary::Parentheses(regex.into()) }
    / string=STRING { Primary::Literal(string) }
    / name=NAME { Primary::Name(name) }
    / set=SET { set }
    ;

lex PREFIX -> { Prefix }:
    / 'peg' !TAIL { Prefix::Peg }
    / 'lex' !TAIL { Prefix::Lex }
    ;

lex SET -> { Primary }:
    / '[' '^' set=#RANGE+ #']' { Primary::Exclude(set) }
    / '[' set=#RANGE+ #']' { Primary::Include(set) }
    ;

lex EXPECT -> { Expect }:
    / '\'' expect=#SQC+ #'\'' { {
        let expect = expect
            .iter()
            .map(|c| s2c(x.intern.get(c).unwrap()))
            .collect::<String>();
        let id = x.intern.id(expect.as_str());
        Expect::Once(id)
    } }
    / '"' expect=#DQC+ #'"' { {
        let expect = expect
            .iter()
            .map(|c| s2c(x.intern.get(c).unwrap()))
            .collect::<String>();
        let id = x.intern.id(expect.as_str());
        Expect::Keyword(id)
    } }
    ;

lex STRING -> { Vec<usize> }:
    / '\'' string=#SQC+ #'\'' { string }
    ;

lex RANGE -> { (usize, usize) }:
    / start=BC '-' end=#BC { (start, end) }
    / ch=BC { (ch, ch) }
    ;

@(intern, memo)
NAME: [a-zA-Z_] TAIL* ;
TAIL: [a-zA-Z0-9_] ;

@(intern) {
    DQC: UNICODE | ESCAPE | [^\"\\] ;
    SQC: UNICODE | ESCAPE | [^\'\\] ;
    BC: UNICODE | ESCAPE | [^\]\\] ;
}

@(ws) {
    WS: [\u{20}\n\t\r]+ ;
    COMMENT: '//' [^\n]* ;
}

UNICODE: '\\' 'u' '{' [0-9a-f]* '}' ;
ESCAPE: '\\' [\'\"\[\]ntr\\] ;

Neural Network

The embedded neural network training functionality consists of two core components: automatic differentiation and dynamic computation graph construction. The former is essentially an application of the multivariate chain rule. The latter is more interesting, as it can consistently construct a connected graph even when control flows make the computation non-functional. This section will cover both components, supported by relevant calculus.

一些岁月史书

整个项目至今已经经过了两年的迭代,而这一切的一切源于两件事。我在十年级时在是接触的 Python,当时只是跟着网上的课程学习,但课程的作者特别喜欢讲解相对底层的概念,导致我对此过于感兴趣,甚至在很长一段时间里忽略了编程语言本身的意义,种子在那时便已经埋下。高中后两年的生活相对独来独往,大部分感性生活是依靠原神和崩坏三这两个游戏支持,而后者塑造的爱莉希雅则是最浓墨重彩的一笔,这便是第二颗种子。

回想高中毕业的那个暑假里,靠着对 Python 的理解再配合上后缀式作为核心,用 C 搓出了第一版 Felys,因为完全没有听说过 AST 这个东西,为了达成控制流只能强行解析控制语句的跳行关系,然后由运行时执行跳行,现在来看简直就是在模拟汇编。后来才学习到了如何手写一个解析器,并且改用 Rust 重写了整个项目,轻而易举的实现了以前不敢想的功能,甚至当时还支持了纯中文的编程,不过维护太麻烦砍掉了。

事实上在那个时间点,这个项目已经摸到了非学术水准的极限了,再想进一步必然需要引入更专业的知识,所以我从最了解的编程语言 Python 下手,吃透了 PEG 解析器以及代码生成的概念,并且自己实现了一版解析器生成器,又因为 PEG 不适合处理分词的任务,我额外阅读并且实现了龙书里关于正则机相关的内容。至此语言的前端已经让我满足,但是后端目前只能依赖递归运行,运行速度慢还占内存,这也是这个项目未来的迭代方向。除此之外,由于实习的原因,接触到了很多机器学习的知识,并且发现自动求导和动态计算图构建这两个功能非常适合嵌入进编程语言中,因此 Felys 现在便内置了神经网络的支持。

最后一言蔽之:为了一碟醋,包了盘饺子。