Ways to create Spark grammar from SAS job using ANTLR - sas

I am trying to use ANTLR to parse SAS jobs and build a parser from the result.
I am using the SAS grammar from https://github.com/xueqilsj/sas-grammar, with ANTLR generating the lexer and parser. I am also following this write-up: https://shijinglu.wordpress.com/2015/01/22/write-a-primitive-sas-grammar-in-antlr4/ .
Just to clarify, every SAS grammar file has this format:
grammar AbortStmt;
import CommonLexerRules;
abort_main
: (abort_stmt)* EOF
;
abort_stmt
: ABORT (ABEND | CANCEL (file_spec)? | RETURN )? INT? NOLIST? ';'
;
file_spec
: STRINGLITERAL
;
I am having a problem with the import statement. After generating the classes with ANTLR, I get the following error:
Can't load AbortStmtBaseListener.class as lexer or parser.
Can’t import the Rules.
I am limited by the imports defined in the SAS grammar files (every grammar file has its own import). Are there any other ways to parse the grammar files and create decision trees?

Once you run ANTLR on the grammar, the result is a group of .java files. Did you then run javac to compile them into .class files? That might account for your problem. I've forgotten to do that before. After that you can run TestRig to see your tokens or the graphical representation of your parse.

How are Path Expressions used in BigQuery

BigQuery documentation describes path expressions, which look like this:
foo.bar
foo.bar/25
foo/bar:25
foo/bar/25-31
/foo/bar
/25/foo/bar
But it doesn't say a lot about how and where these path expressions are used. It only briefly mentions:
A path expression describes how to navigate to an object in a graph of objects.
But what is this graph of objects?
How would you use this syntax with a graph of objects?
What's the meaning of a path expression like foo/bar/25-31?
My question is: what are these Path Expressions the official documentation describes?
I've searched through BigQuery docs but haven't managed to find any other mention of these path expressions. Is this syntax actually part of BigQuery SQL at all?
What I've found out so far
There is an existing question, which asks roughly the same thing, but for some reason it's downvoted and none of the answers are correct. Though the question it asks is more about a specific detail of the path expression syntax.
Anyway, the answers there propose a few hypotheses as to what path expressions are:
It's not a syntax for referencing tables
The BigQuery Legacy SQL uses syntax that's similar to path expressions for referencing tables:
SELECT state, year FROM [bigquery-public-data:samples.natality]
But that syntax is only valid in BigQuery Legacy SQL. In the new Google Standard SQL it produces a syntax error. There's a separate documentation for table path syntax, which is different from path expression syntax.
It's not JSONPath syntax
JSONPath syntax is documented elsewhere and looks like:
SELECT JSON_QUERY(json_text, '$.class.students[0]')
It's not a syntax for accessing JSON object graph
There's a separate JSON subscript operator syntax, which looks like so:
SELECT json_value.class.students[0]['name']
My current hypothesis
My best guess is that BigQuery doesn't actually support such syntax, and the description in the docs is a mistake.
But please, prove me wrong. I'd really like to know because I'm trying to write a parser for BigQuery SQL, and to do so, I need to understand the whole syntax that BigQuery allows.
I believe that a "path expression" is the combination of identifiers that points to a specific object, table, column and so on. So `project.dataset.table.struct.column` is a path expression comprising 5 identifiers. I also think that alias.column within the context of a query is a path expression with 2 identifiers (although the alias is probably expanded behind the scenes).
If you scroll up a bit in your link, there is a section with some examples of valid path expressions, which also happens to be right after the identifiers section.
With this in mind, I think a JSON path expression is a certain type of path expression, as parsing JSON requires a specific set of identifiers to get to a specific data element.
As for the "graph" terminology, perhaps BQ parses the query and accesses data using a graph methodology behind the scenes, I can't really say. I would guess "path expressions" probably makes more sense to the team working on BigQuery rather than users using BigQuery. I don't think there is any special syntax for you to "use" path expressions.
If you are writing a parser, maybe take some inspiration from this ZetaSQL parser, which has several references to path expressions.
It looks like this syntax comes from the ZetaSQL parser, which includes the exact same documentation. BigQuery most likely uses ZetaSQL internally as its parser (ZetaSQL supports all of BigQuery syntax and they're both from Google).
According to the ZetaSQL grammar, a path expression beginning with / and containing : and - can be used for referencing tables in the FROM clause. It looks like the / and : are simply part of identifier names, just as the - is part of identifier names in BigQuery.
But the support for the : and / characters in ZetaSQL path expressions can be toggled on or off, and it seems that in BigQuery it's been toggled off. BigQuery doesn't allow : and / characters in table names, not even when they're quoted.
ZetaSQL also lets you toggle the support for - in identifier names, and BigQuery has that one switched on.
My conclusion: it's a ZetaSQL parser feature, the documentation of which has been mistakenly copy-pasted to BigQuery documentation.
Thanks to rtenha for pointing out the ZetaSQL parser, of which I wasn't aware before.
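For anyone writing their own parser, here is a rough C++ sketch of the interpretation above: split on . only, and treat /, : and - as ordinary identifier characters when the matching toggle is on. The function name splitPath and the two flags are made up for illustration; they are not ZetaSQL's real option names, and the sketch ignores the real identifier rules (backtick quoting, restrictions on the first character, and so on).

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical model of the toggles discussed above (not ZetaSQL's API).
// Splits a path expression on '.' and checks each character against the
// enabled flags. Returns the segments, or an empty vector if the input is
// not a valid path under the given flags.
std::vector<std::string> splitPath(const std::string& text,
                                   bool allowSlashAndColon,
                                   bool allowDash) {
    std::vector<std::string> segments;
    std::string current;
    for (char c : text) {
        if (c == '.') {
            // An empty segment (leading '.' or "a..b") makes the path invalid.
            if (current.empty()) return std::vector<std::string>();
            segments.push_back(current);
            current.clear();
            continue;
        }
        const bool ok = std::isalnum(static_cast<unsigned char>(c)) ||
                        c == '_' ||
                        (allowSlashAndColon && (c == '/' || c == ':')) ||
                        (allowDash && c == '-');
        if (!ok) return std::vector<std::string>();
        current.push_back(c);
    }
    // Empty input or a trailing '.' is also invalid.
    if (current.empty()) return std::vector<std::string>();
    segments.push_back(current);
    return segments;
}
```

With both flags on, splitPath("foo.bar/25", true, true) gives the two segments foo and bar/25. With the slash/colon flag off, which is what BigQuery seems to do, the same input is rejected and an empty vector comes back.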

Combining Clang AST

I'm trying to work on the ASTs of multiple files in one go using RecursiveASTVisitor, and I found the method buildASTs on ClangTool, which is documented as "Create an AST for each file specified in the command line and append them to ASTs."
However, I am unable to find usage examples or guides.
Does anyone have experience with combining ASTs from multiple sources?
What I've done now is this
ClangTool Tool(OptionsParser.getCompilations(), OptionsParser.getSourcePathList());
std::vector<std::unique_ptr<clang::ASTUnit>> AST;
Tool.buildASTs(AST);
But I don't know how to proceed with the analysis from here.
If you need to combine ASTs, you can merge parts of an AST into another one using clang::ASTImporter.
However the most common strategy is to analyze each AST independently and then merge the results together.

Compatibility issue with old lex-yacc code in new flex-bison

I am on a migration project to move a C++ application from HP-UX to a Red Hat Enterprise Linux (RHEL) 6.4 server. The application includes a parser written with lex and yacc, which works fine on HP-UX. We moved the lex specification file (the .l file) and the yacc specification file (the .y file) to the RHEL 6.4 server and compiled the code on the new system without much change. But the generated parser does not work: every time, it reports a syntax error on the same input file that is parsed correctly on HP-UX. Based on some reference material on lex and flex incompatibilities, I see the following points in the .l file:
It redefines the input, unput and output methods.
The yylineno variable is initialized, and then incremented in the redefined input method whenever a '\n' character is found.
The data in lex is read from standard input (cin), which looks to be in scanner mode.
How can I find out the possible incompatibilities and remedies for this issue? And is there any way other than using gdb to debug the parser?
Both flex and yacc/bison have useful trace features which can aid in debugging grammars.
For flex, simply regenerate the scanner with the -d option, which will cause a trace line to be written to stderr every time a pattern is matched (whether or not it generates a token). I'm not sure how line number tracking by the debugging option will work with your program's explicit yylineno manipulation, but I guess that is just cosmetic. (Personally, unless you have a good reason not to, I'd just let flex track line numbers.)
For bison, you need to both include tracing code in the parser, and enable tracing in the executable. You do the former with the -t command-line option, and the latter by assigning a non-zero value to the global variable yydebug. See the bison manual for more details and options.
These options may or may not work with the HP-UX tools, but it would be worth trying because that will give you two sets of traces which you can compare.
You don't want to debug the generated code, you want to debug the parser (lex/yacc code).
I would first verify the lexer is returning the same stream of tokens on both platforms.
Then reduce the problem. You know the line of input that the syntax error occurs on. Create a stripped down parser that supports parsing the contents of that line, and if you can't figure out what is going on from that, post the reduced code.

SQL parser library - get table names from query

I'm looking for a C/C++ SQL parsing library which is able to provide me with the names of the tables the query depends on.
What I expect:
SELECT * FROM TABLEA NATURAL JOIN TABLEB
Result: TABLEA, TABLEB
Of course, the example above is extremely simple. I've already written my own parser (based on Boost.Spirit) which handles a subset of the SQL grammar, but what I need is a parser that can handle complicated (recursive etc.) queries.
Do you know anything useful for this purpose?
What I found is http://www.sqlparser.com - it's commercial but does exactly what I need.
I also dug into the PostgreSQL sources, to no avail.
Antlr can produce a nice SQL parser for you (the generated parser source can be C++), and there are a few SQL grammars available for it: http://www.antlr3.org/grammar/list.html
If all you are interested in are the table names, then taking one of those grammars and adding a semantic action collecting those names should be fairly easy.
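Just to show what "collecting those names" amounts to, here is a deliberately naive C++ sketch that does it without any grammar: it scans the tokens and records the identifier that follows FROM or JOIN. This is not a substitute for a real parser, and the function names are made up. It misses comma-separated lists (FROM a, b), does not understand quoted identifiers or CTE names, and will be fooled by constructs like EXTRACT(YEAR FROM col). With a proper grammar you would do the same collection inside the table-reference rule instead.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Splits SQL text into identifier-like words (letters, digits, '_', '.')
// and single punctuation characters.
std::vector<std::string> tokenize(const std::string& sql) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < sql.size()) {
        const unsigned char c = static_cast<unsigned char>(sql[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isalnum(c) || c == '_') {
            const std::size_t start = i;
            while (i < sql.size() &&
                   (std::isalnum(static_cast<unsigned char>(sql[i])) ||
                    sql[i] == '_' || sql[i] == '.')) {
                ++i;
            }
            tokens.push_back(sql.substr(start, i - start));
        } else {
            tokens.push_back(std::string(1, sql[i]));
            ++i;
        }
    }
    return tokens;
}

// Upper-cases a copy of the string so keywords compare case-insensitively.
std::string upper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char ch) {
        return static_cast<char>(std::toupper(ch));
    });
    return s;
}

// Records the token following each FROM or JOIN keyword, provided it
// starts like an identifier (so "FROM (" before a subquery is skipped).
std::set<std::string> tableNames(const std::string& sql) {
    std::set<std::string> names;
    const std::vector<std::string> tokens = tokenize(sql);
    for (std::size_t i = 0; i + 1 < tokens.size(); ++i) {
        const std::string keyword = upper(tokens[i]);
        if (keyword != "FROM" && keyword != "JOIN") continue;
        const std::string& next = tokens[i + 1];
        const unsigned char first = static_cast<unsigned char>(next[0]);
        if (std::isalpha(first) || first == '_') names.insert(next);
    }
    return names;
}
```

For the query in the question, tableNames("SELECT * FROM TABLEA NATURAL JOIN TABLEB") yields the set containing TABLEA and TABLEB.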
Having some experience with Antlr and Bison/Yacc & Lex/Flex, I definitely recommend Antlr. It is written in Java, but the target language can be C++, and the generated code is actually readable; it looks as if it were written by a human. Debugging Antlr-generated parsers is quite OK, which cannot be said about those generated by Bison.
There are other options, for example Lemon with the sqlite grammar. Have a look at this question if you like: SQL parser in C

building c++ config file parser using lex and yacc

I am trying to build a config file parser (a C++ application) from scratch using tools like lex and yacc. The parser should be able to parse files like
# Sub group example
petName = Tommy
Owner = {
pet = "%petName%"
}
Is there a step-by-step guide or a link to articles on how to achieve this using tools like lex and yacc? The idea is that I will write a class, say Config (C++), with methods like getConfig(string propName). If I call config.getConfig("Owner.pet"), it should return Tommy.
Boost Property Tree
It was designed for configuration files. It reads and writes the following formats:
INI
INFO
XML
JSON
Here is the five minute tutorial page which should give you a good idea:
http://www.boost.org/doc/libs/1_47_0/doc/html/boost_propertytree/tutorial.html