I had long wanted to work through the TinySQL path in PingCAP's Talent Plan, but never got started. Recently I came across the Talent Plan introduction thread by PingCAP's founder on the forum, where a commenter suggested that this "writing a database from scratch" blog series is a good entry point, so I am writing this post.
Frontend
Like a compiler, a database system is split into a frontend and a backend. The frontend is responsible for parsing SQL statements and providing a REPL.
Lexing tokens
```go
type token struct {
	value string
	kind  tokenKind
	loc   location
}
```
The lexer simulates a cursor scanning the input and splits the SQL statement into tokens; each token records its kind, location, and value.
Each lexer function scans a single token. On success it advances the cursor and returns a pointer to the new token, the updated cursor position, and ok=true.
On failure it returns nil, the unchanged initial cursor position, and ok=false.
Running the lexers over the whole statement yields a slice of token pointers.
```go
// main lexing loop
func lex(source string) ([]*token, error) {
	tokens := []*token{}
	cur := cursor{}
	for cur.pointer < uint(len(source)) {
		lexers := []lexer{lexKeyword, lexSymbol, lexString, lexNumeric, lexIdentifier}
		isMatched := false
		for _, l := range lexers {
			if token, newCursor, ok := l(source, cur); ok {
				cur = newCursor
				// Omit nil tokens for valid, but empty syntax like newlines
				if token != nil {
					tokens = append(tokens, token)
				}
				isMatched = true
				break
			}
		}
		if isMatched {
			// match the next one
			continue
		}
		hint := ""
		if len(tokens) > 0 {
			hint = " after " + tokens[len(tokens)-1].value
		}
		return nil, fmt.Errorf("Unable to lex token%s, at %d:%d", hint, cur.loc.row, cur.loc.col)
	}
	return tokens, nil
}
```
Building the AST
The second step is to parse an abstract syntax tree from the tokens.
The AST is made up of a slice of statements; for now the parser only handles SELECT, INSERT, and CREATE statements.
```go
type Ast struct {
	Statement []*Statement
}

type Statement struct {
	SelectStatement *SelectStatement
	CreateStatement *CreateStatement
	InsertStatement *InsertStatement
	Kind            AstKind
}
```
The idea is to parse each kind of SQL statement into its grammatical components and assemble them into the final syntax tree.
SELECT statements
Grammar:
SELECT $expression[]$ FROM $tableName$
Example: select id, age, name from students
1. Parse the keyword SELECT
2. Parse $expression[]$
3. Parse the keyword FROM
4. Parse the identifier $tableName$
The key part is understanding and parsing the expressions.
```go
// parse the token at cursor and return an expression containing literal val and its kind
func parseExpression(tokens []*token, initialCursor uint, _ token) (*expression, uint, bool) {
	cursor := initialCursor
	kinds := []tokenKind{identifierKind, numericKind, stringKind}
	for _, kind := range kinds {
		t, newCursor, ok := parseToken(tokens, cursor, kind)
		if ok {
			return &expression{
				literal: t,
				kind:    literalKind,
			}, newCursor, true
		}
	}
	return nil, initialCursor, false
}
```
An expression here is limited to tokens of stringKind, numericKind, and identifierKind. parseExpression reads a single token: if the cursor has run past the end of tokens, or the token's kind is none of these three, it returns failure. Note that the delimiter parameter is ignored in this simple version (hence the `_`); the caller stops the expression list when it reaches the delimiter.
CREATE statements
Grammar:
CREATE TABLE $tableName$ ( $columnDef$ )
Example: CREATE TABLE students(id string, age int, name string)
1. Parse the keyword CREATE
2. Parse the keyword TABLE
3. Parse the identifier $tableName$
4. Parse the symbol (
5. Parse $columnDef$
6. Parse the symbol )
The key to parsing a CREATE statement is parsing $columnDef$.
```go
// a column definition: name plus data type
type columnDefinition struct {
	name     token
	datatype token
}
```
Parsing columnDefinitions:
1. Loop over the tokens until the cursor runs past the end or a parse error occurs:
- parse the identifier $name$ and the keyword $datatype$
- parse the separator symbol ,
```go
// parse the column definitions in a CREATE TABLE statement
func parseColumnDefinitions(tokens []*token, initialCursor uint, delimiter token) (*[]*columnDefinition, uint, bool) {
	cursor := initialCursor
	cds := []*columnDefinition{}
	for {
		if cursor >= uint(len(tokens)) {
			return nil, initialCursor, false
		}
		// Look for the closing delimiter
		current := tokens[cursor]
		if delimiter.equals(current) {
			break
		}
		// Columns are divided by commas, so after the first
		// definition expect a comma before each new one
		if len(cds) > 0 {
			if !expectToken(tokens, cursor, tokenFromSymbol(commaSymbol)) {
				helpMessage(tokens, cursor, "Expected comma")
				return nil, initialCursor, false
			}
			cursor++
		}
		// Look for a column name
		id, newCursor, ok := parseToken(tokens, cursor, identifierKind)
		if !ok {
			helpMessage(tokens, cursor, "Expected column name")
			return nil, initialCursor, false
		}
		cursor = newCursor
		// Look for a column type, as in "col_name type"
		ty, newCursor, ok := parseToken(tokens, cursor, keywordKind)
		if !ok {
			helpMessage(tokens, cursor, "Expected column type")
			return nil, initialCursor, false
		}
		cursor = newCursor
		cds = append(cds, &columnDefinition{
			name:     *id,
			datatype: *ty,
		})
	}
	return &cds, cursor, true
}
```
INSERT statements
Grammar:
INSERT INTO $tableName$ VALUES ( $expression[]$ )
Example: INSERT INTO students VALUES(1, 20, 'Kate')
1. Parse the keyword INSERT
2. Parse the keyword INTO
3. Parse the identifier $tableName$
4. Parse the keyword VALUES
5. Parse the symbol (
6. Parse $expression[]$
7. Parse the symbol )
The main Parse function
The main function parses the three kinds of SQL statements, using the semicolon as the statement delimiter.
```go
func Parse(source string) (*Ast, error) {
	tokens, err := lex(source)
	if err != nil {
		return nil, err
	}
	ast := Ast{}
	cursor := uint(0)
	// note: iterate over tokens, not the raw source string
	for cursor < uint(len(tokens)) {
		// parse each statement divided by ';' and add it to the ast
		stmt, newCursor, ok := parseStatement(tokens, cursor, tokenFromSymbol(semicolonSymbol))
		if !ok {
			helpMessage(tokens, cursor, "Expected statement")
			return nil, errors.New("Failed to parse, exit")
		}
		cursor = newCursor
		ast.Statement = append(ast.Statement, stmt)
		atLeastOneSemicolon := false
		for expectToken(tokens, cursor, tokenFromSymbol(semicolonSymbol)) {
			cursor++
			atLeastOneSemicolon = true
		}
		if !atLeastOneSemicolon {
			helpMessage(tokens, cursor, "Expected semi-colon delimiter between statements")
			return nil, errors.New("Missing semi-colon between statements")
		}
	}
	return &ast, nil
}
```
A minimal backend
The focus of this post is parsing SQL and the design of the frontend, so for now the backend is just an in-memory map.
```go
// in-memory backend
type MemoryBackend struct {
	tables map[string]*table
}

// a table: column names, column types, and rows of cells
type table struct {
	columns     []string
	columnTypes []ColumnType
	rows        [][]MemoryCell
}

// query result
type Results struct {
	Columns []struct {
		Type ColumnType
		Name string
	}
	Rows [][]Cell
}
```
REPL
Below is a simple REPL design.
```go
func main() {
	mb := NewMemoryBackend()
	_ = mb // the backend will execute each parsed statement below
	reader := bufio.NewReader(os.Stdin)
	fmt.Println("Welcome to ")
	for {
		fmt.Print("# ")
		text, err := reader.ReadString('\n')
		if err != nil {
			panic(err)
		}
		text = strings.Replace(text, "\n", "", -1)
		ast, err := Parse(text)
		if err != nil {
			panic(err)
		}
		for _, stmt := range ast.Statement {
			switch stmt.Kind {
			case CreateKind:
				// hand stmt.CreateStatement to the backend
			case InsertKind:
				// hand stmt.InsertStatement to the backend
			case SelectKind:
				// hand stmt.SelectStatement to the backend and print the Results
			}
		}
	}
}
```
The rest of the code is on GitHub.
References
Writing a SQL database from scratch in Go
PostgreSQL documentation
gosql