Complexities of parsing SQL
SQL (structured query language) is a language databases use to do everything - control server behavior, security, authentication, schema, programming, data manipulation, data querying - and more.
Writing a parser for that has been... challenging. In my inexperience, I set out to write parsers for 4 SQL dialects at the same time (MySQL, SQL Server, Oracle, Postgres), as well as a translation engine for transpilation between those dialects.
I managed to pull it off, but it wasn't pretty, and it took a long time to get right.
Problem 1: SQL is huge
Let's take a quick look at ANTLR grammars for some programming languages:
- Ruby grammar is 414 lines
- C grammar is 896 lines
- C++ grammar is 822 lines
Now let's take a quick look at some SQL dialect grammars:
- SQL Server grammar is 4655 lines
- MySQL grammar is 5021 lines (its original server YACC grammar is 17863 lines)
- Oracle PL/SQL grammar is 6768 lines
Note that above SQL grammars are for the same language and we are just talking about deviations from and additions to the standard. Which leads us to...
Problem 2: each database uses own dialect of SQL
Database vendors often innovated faster than the standard. Different vendors would use different syntax to achieve roughly the same thing, so we need a different grammar for each database.
Problem 3: SQL is complex
Figuring out where statements begin and where do they end is far from trivial as statements can be nested. One can use common table expressions as a prefix for other statements so we can't begin and stop parsing on a set of specific words.
Also, there are so many concepts! We have servers, databases, schemas, tables, columns, data types, default expressions, column constraints, table constraints, indexes, foreign keys, views, functions, procedures, triggers, roles, permissions, variables, cursors...