一直想去完成pingcap的tanlent Plan里面的tinySql路径,但一直没有开始。最近论坛上发现了pingcap创始人的tanlent plan入坑贴,下面的学长提示了这个从零实现数据库博客入门比较合适,于是写下这个博客。

前端

与编译器类似,一个数据库软件也分为前端和后端。前端部分主要是SQL语句的解析以及一个REPL。

获取tokens

1
2
3
4
5
type token struct {
	value string
	kind  tokenKind
	loc   location
}

模拟cursor扫描语句,将sql语句解析为一个一个token,token中包括了单词的种类,位置,值

获取单个token时,每次扫描结束,如果成功,更新光标,并返回一个新token地址以及更新后的光标位置,ok=true。

如果失败,则返回nil,传入的初始化的光标位置不变,ok=false。

最终处理所有语句,得到token地址数组

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
// 主要流程
func lex(source string) ([]*token, error) {
	tokens := []*token{}
	cur := cursor{}

	for cur.pointer < uint(len(source)) {
		lexers := []lexer{lexKeyword, lexSymbol, lexString, lexNumeric, lexIdentifier}
		isMatched := false
		for _, l := range lexers {
			if token, newCursor, ok := l(source, cur); ok {
				cur = newCursor

				// Omit nil tokens for valid, but empty syntax like newlines
				if token != nil {
					tokens = append(tokens, token)
				}
				isMatched = true
				break
			}
		}

		if isMatched {
			// match the next one
			continue
		}

		hint := ""
		if len(tokens) > 0 {
			hint = " after " + tokens[len(tokens)-1].value
		}
		return nil, fmt.Errorf("Unable to lex token%s, at %d:%d", hint, cur.loc.row, cur.loc.col)
	}
	return tokens, nil
}

获得Ast

第二步,从tokens中,解析出一个抽象语法树。

Ast有statement[]组成,目前的解析器只解析select,insert,create三种语句。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
type Ast struct {
	Statement []*Statement
}

type Statement struct {
	SelectStatement *SelectStatement
	CreateStatement *CreateStatement
	InsertStatement *InsertStatement
	Kind            AstKind
}

解析的思路是按照不同的sql语句,分析出相应的语法成分,最后形成最后的语法树

SELECT语句

语法成分:

SELECT ${ expression[] }$ FROM $ tableName $

example:select id,age,name from students

1.解析出keyword SELECT

2.解析expression[]

3.解析出keyword FROM

4.解析identifier $tableName$

其中重点在于对expressions的理解与解析

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
// parse the token at cursor and return an expression containing literal val and its kind
func parseExpression(tokens []*token, initialCursor uint, _ token) (*expression, uint, bool) {
	cursor := initialCursor

	kinds := []tokenKind{identifierKind, numericKind, stringKind}
    
	for _, kind := range kinds {
		t, newCursor, ok := parseToken(tokens, cursor, kind)
		if ok {
			return &expression{
				literal: t,
				kind:    literalKind,
			}, newCursor, true
		}
	}

	return nil, initialCursor, false
}

expression 只包括了stringKind,numericKind,identifierKind的字符集合。parseexpression读入一个token,如果cursor超出了tokens的长度或者Kind不是这三种,则返回错误的结果,遇到分隔符delimiter则正常返回。

CREATE语句

语法成分

CREATE TABLE $tableName$ ( $columnDef$ )

example:CREATE TABLE students(id string, age int, name string)

1.解析出keyword CREATE

2.解析出keyword TABLE

3.解析identifier $tableName$

4.解析 symbol (

5.解析$columndef$

6.解析symbol )

解析CREATE的语句关键是对$columnDef$的解析

1
2
3
4
5
//column
type columnDefinition struct {
	name     token
	datatype token
}

解析columnDefinitions

1.循环解析token,直到cur超出tokens范围或者解析出现错误

  • 首先解析identifier $name$,keyword $datatype$
  • 解析分隔符 symbol ,
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
// parse col def
//
func parseColumnDefinitions(tokens []*token, initialCursor uint, delimiter token) (*[]*columnDefinition, uint, bool) {
	cursor := initialCursor

	cds := []*columnDefinition{}
	for {
		if cursor >= uint(len(tokens)) {
			return nil, initialCursor, false
		}

		// Look for a delimiter
		current := tokens[cursor]
		if delimiter.equals(current) {
			break
		}

		// Look for a comma
		// col devided by comma
		if len(cds) > 0 {
			if !expectToken(tokens, cursor, tokenFromSymbol(commaSymbol)) {
				helpMessage(tokens, cursor, "Expected comma")
				return nil, initialCursor, false
			}

			cursor++
		}

		// Look for a column name
		id, newCursor, ok := parseToken(tokens, cursor, identifierKind)
		if !ok {
			helpMessage(tokens, cursor, "Expected column name")
			return nil, initialCursor, false
		}
		cursor = newCursor

		// Look for a column type
		//like col_name type
		ty, newCursor, ok := parseToken(tokens, cursor, keywordKind)
		if !ok {
			helpMessage(tokens, cursor, "Expected column type")
			return nil, initialCursor, false
		}
		cursor = newCursor

		cds = append(cds, &columnDefinition{
			name:     *id,
			datatype: *ty,
		})
	}

	return &cds, cursor, true
}

INSERT语句

语法成分

INSERT INTO VALUES ( $expression[]$ )

example: INSERT INTO VALUES(1,20,Kate)

1.解析keyword INSERT

2.解析keyword INTO

3.解析keyword VALUES

4.解析symbol (

5.解析 $expression[]$

6.解析symbol )

主函数parse

主函数分号作为分隔符解析三种sql语句

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
func Parse(source string) (*Ast, error) {
	tokens, err := lex(source)
	if err != nil {
		return nil, err
	}

	ast := Ast{}
	cursor := uint(0)
	for cursor < uint(len(source)) {
		//parse each statement divide by ';' and add to ast
		stmt, newCursor, ok := parseStatement(tokens, cursor, tokenFromSymbol(semicolonSymbol))
		if !ok {
			helpMessage(tokens, cursor, "Expected statement")
			return nil, errors.New("Failed to parse,exit")
		}
		cursor = newCursor
		ast.Statement = append(ast.Statement, stmt)

		atLeastOneSemicolon := false
		for expectToken(tokens, cursor, tokenFromSymbol(semicolonSymbol)) {
			cursor++
			atLeastOneSemicolon = true
		}

		if !atLeastOneSemicolon {
			helpMessage(tokens, cursor, "Expected semi-colon delimiter between statements")
			return nil, errors.New("Missing semi-colon between statements")
		}
	}
	return &ast, nil
}

简易后端

本文重点在于解析sql语句以及前端的编写思路,这次仅仅使用一个map in memory 作为后端。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
// backend in memory
type MemoryBackend struct {
	tables map[string]*table
}

// table
type table struct {
	columns     []string
	columnTypes []ColumnType
	rows        [][]MemoryCell
}

//Result
type Results struct {
	Columns []struct {
		Type ColumnType
		Name string
	}
	Rows [][]Cell
}

REPL

下面是一个简单的REPL设计思路

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
func main() {
	mb := NewMemoryBackend()

	reader := bufio.NewReader(os.Stdin)
	fmt.Println("Welcome to ")
	for {
		fmt.Print("# ")
		text, err := reader.ReadString('\n')
		text = strings.Replace(text, "\n", "", -1)

		ast, err := Parse(text)
		if err != nil {
			panic(err)
		}

		for _, stmt := range ast.Statements {
			switch stmt.Kind {
			case CreateKind:
				
			case InsertKind:
				
			case SelectKind:
						
			}
	}
}

其他剩余代码见github

参考

Writing a SQL database from scratch in Go

PostgreSQL documentation

gosql