SQL parser library - get table names from query - c++

I'm looking for a C/C++ SQL parsing library which is able to provide me with the names of the tables the query depends on.
What I expect:
SELECT * FROM TABLEA NATURAL JOIN TABLEB
Result: TABLEA, TABLEB
Certainly, the provided example is extremely simple. I've already written my own parser (based on Boost.Spirit) which handles a subset of the SQL grammar, but what I need is a parser that can handle complicated (recursive etc.) queries.
Do you know anything useful for this purpose?
What I found is http://www.sqlparser.com - it's commercial but does exactly what I need.
I also dug into the PostgreSQL sources, to no effect.

Antlr can produce a nice SQL parser for you (the generated parser can be C++), and there are a few SQL grammars available for it: http://www.antlr3.org/grammar/list.html
If all you are interested in are the table names, then taking one of those grammars and adding a semantic action collecting those names should be fairly easy.
Having some experience with Antlr and Bison/Yacc & Lex/Flex, I definitely recommend Antlr. It is written in Java, but the target language can be C++ - the generated code is actually readable and looks as if it were written by a human. Debugging Antlr-generated parsers is quite OK, which cannot be said about those generated by Bison.
There are other options, for example Lemon with the sqlite grammar; have a look at this question if you like: SQL parser in C
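To make the "semantic action" approach concrete, the C++ side can be as small as a collector object that the grammar action calls whenever it reduces a table reference. A minimal sketch under that assumption (this is not ANTLR-generated code, and the wiring into the grammar is up to you):

#include <set>
#include <string>

// Hypothetical collector: the grammar rule that recognizes a table
// reference calls onTableName() from its semantic action.
class TableNameCollector {
public:
    void onTableName(const std::string& name) { names_.insert(name); }
    const std::set<std::string>& names() const { return names_; }
private:
    std::set<std::string> names_;
};

// After parsing "SELECT * FROM TABLEA NATURAL JOIN TABLEB",
// names() would contain { "TABLEA", "TABLEB" }.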

How are Path Expressions used in BigQuery

BigQuery documentation describes path expressions, which look like this:
foo.bar
foo.bar/25
foo/bar:25
foo/bar/25-31
/foo/bar
/25/foo/bar
But it doesn't say a lot about how and where these path expressions are used. It only briefly mentions:
A path expression describes how to navigate to an object in a graph of objects.
But what is this graph of objects?
How would you use this syntax with a graph of objects?
What's the meaning of a path expression like foo/bar/25-31?
My question is: what are these Path Expressions the official documentation describes?
I've searched through BigQuery docs but haven't managed to find any other mention of these path expressions. Is this syntax actually part of BigQuery SQL at all?
What I've found out so far
There is an existing question which asks roughly the same thing, but for some reason it's downvoted and none of its answers are correct. That question, though, is more about a specific detail of the path expression syntax.
Anyway, the answers there propose a few hypotheses as to what path expressions are:
It's not a syntax for referencing tables
The BigQuery Legacy SQL uses syntax that's similar to path expressions for referencing tables:
SELECT state, year FROM [bigquery-public-data:samples.natality]
But that syntax is only valid in BigQuery Legacy SQL. In the new Google Standard SQL it produces a syntax error. There's a separate documentation for table path syntax, which is different from path expression syntax.
It's not JSONPath syntax
JSONPath syntax is documented elsewhere and looks like:
SELECT JSON_QUERY(json_text, '$.class.students[0]')
It's not a syntax for accessing JSON object graph
There's a separate JSON subscript operator syntax, which looks like so:
SELECT json_value.class.students[0]['name']
My current hypothesis
My best guess is that BigQuery doesn't actually support such syntax, and the description in the docs is a mistake.
But please, prove me wrong. I'd really like to know because I'm trying to write a parser for BigQuery SQL, and to do so, I need to understand the whole syntax that BigQuery allows.
I believe that a "path expression" is the combination of identifiers that points to a specific object/table/column/etc. So `project.dataset.table.struct.column` is a path expression comprising 5 identifiers. I also think that alias.column within the context of a query is a path expression with 2 identifiers (although the alias is probably expanded behind the scenes).
If you scroll up a bit in your link, there is a section with some examples of valid path expressions, which also happens to be right after the identifiers section.
With this in mind, I think a JSON path expression is a certain type of path expression, as parsing JSON requires a specific set of identifiers to get to a specific data element.
As for the "graph" terminology, perhaps BQ parses the query and accesses data using a graph methodology behind the scenes; I can't really say. I would guess "path expressions" makes more sense to the team working on BigQuery than to the users of BigQuery. I don't think there is any special syntax for you to "use" path expressions.
If you are writing a parser, maybe take some inspiration from this ZetaSQL parser, which has several references to path expressions.
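To illustrate the "sequence of identifiers" reading, here is a minimal C++ sketch that splits a dotted path expression into its component identifiers. It is purely illustrative: real BigQuery identifiers can be backtick-quoted and may contain dots inside the quotes, which this deliberately ignores.

#include <iostream>
#include <string>
#include <vector>

// Split e.g. "project.dataset.table.struct.column" into its identifiers.
std::vector<std::string> splitPathExpression(const std::string& path) {
    std::vector<std::string> parts;
    std::string current;
    for (char c : path) {
        if (c == '.') { parts.push_back(current); current.clear(); }
        else          { current += c; }
    }
    parts.push_back(current);
    return parts;
}

int main() {
    for (const auto& id : splitPathExpression("project.dataset.table.struct.column"))
        std::cout << id << "\n";   // prints the 5 identifiers in order
}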
It looks like this syntax comes from the ZetaSQL parser, which includes the exact same documentation. BigQuery most likely uses ZetaSQL internally as its parser (ZetaSQL supports all of BigQuery's syntax, and they're both from Google).
According to the ZetaSQL grammar, a path expression beginning with / and containing : and - can be used for referencing tables in a FROM clause. It looks like the / and : are simply part of identifier names, just as - is part of identifier names in BigQuery.
But the support for the : and / characters in ZetaSQL path expressions can be toggled on or off, and it seems that in BigQuery it's been toggled off. BigQuery doesn't allow : and / characters in table names - not even when they're quoted.
ZetaSQL also allows toggling support for - in identifier names, which BigQuery does allow.
My conclusion: it's a ZetaSQL parser feature whose documentation has been mistakenly copy-pasted into the BigQuery documentation.
Thanks to rtenha for pointing out the ZetaSQL parser, of which I wasn't aware before.

Parsing a CSV string while ignoring commas inside the individual columns

I am trying to split a csv string with comma as delimiter.
val string = """A,B,"Hi,There",C,D"""
I cannot use string.split(",") because it would split "Hi,There" into two different columns. Can I use a regex to solve this? I came across the scala-csv parser, which I don't want to use. I hope there is a better way to solve this problem. I know this is not a trivial problem, so it would be helpful if people could share their approaches to solving it.
I agree with Jeronimo Backes: CSV parsing is not trivial, and it is much better to use a library than to reinvent the wheel.
Besides uniVocity-parsers there are some other, more Scala-oriented libraries available (underlying parser indicated):
product-collections (native / scala)
scala-csv (native / java-scala)
mighty-csv (opencsv)
PureCSV (opencsv)
object-csv (scala-csv)
product-collections, my own project, is well tested against the same data as uniVocity and also against csv-spectrum. It is strongly typed, reflection-free and compatible with scala-js. It's tested for performance against most of the Java equivalents.
The other projects listed all have their strengths. scala-csv is very difficult to call from Java without shims, so although I've tested it internally I was not able to make a pull request against csv-parsers-comparison.
product-collections used to leverage opencsv, but in order to make it scala-js compatible it now contains a native parser. The parser performs better than opencsv (speed, correctness) in all the scenarios I tested.
Use the uniVocity-parsers CsvParser for that instead of parsing it by hand. CSV is much harder than you think and there are many corner cases to cover. You just found one. In short, you NEED a library to read CSV reliably. uniVocity-parsers is used by other Scala projects (e.g. spark-csv).
I'll put an example using plain Java here, because I don't know Scala, but you'll get the idea:
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public static void main(String ... args){
    CsvParserSettings settings = new CsvParserSettings(); // many options here, check the documentation
    CsvParser parser = new CsvParser(settings);

    // parse a single line; the quoted field keeps its comma
    String[] row = parser.parseLine("A,B,\"Hi,There\",C,D");
    for(String value : row){
        System.out.println(value);
    }
}
Output:
A
B
Hi,There
C
D
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
This regex covers your example, and possibly others, but it is certainly not robust:
(?:,?(".+?"))|(?:,?(.+?),?)
Here's a demo on regex101: https://regex101.com/r/wM7uW4/1
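If a library really is off the table, a hand-rolled quote-aware split is still more predictable than a regex. A minimal C++ sketch of the idea (the Scala version would be analogous); note that a real CSV parser additionally handles escaped quotes, embedded newlines and malformed input:

#include <iostream>
#include <string>
#include <vector>

// Split one CSV line on commas, but keep commas that appear inside
// double quotes. Quote characters themselves are dropped.
std::vector<std::string> splitCsvLine(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    bool inQuotes = false;
    for (char c : line) {
        if (c == '"')                     inQuotes = !inQuotes;
        else if (c == ',' && !inQuotes) { fields.push_back(field); field.clear(); }
        else                              field += c;
    }
    fields.push_back(field);
    return fields;
}

int main() {
    for (const auto& f : splitCsvLine("A,B,\"Hi,There\",C,D"))
        std::cout << f << "\n";   // A  B  Hi,There  C  D
}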

Regex to exclude elements in an XML file

I am comparing two XML files using WinMerge. The files are deployment files and I'm looking for variations between the environments. The main issue is that the XML files are littered with tags that indicate a change in an underlying id, e.g. <tableId>123</tableId>, but this is unimportant for the comparison.
I want to create a regex that I can use in WinMerge to exclude those elements and compare only the interesting ones, e.g. the <name> element in the example below.
Environment 1
<table>
  <tableInfo>
    <tableId>293</tableId>
    <name>Table Name New</name>
    <repositoryId>0</repositoryId>
Environment 2
<table>
  <tableInfo>
    <tableId>965</tableId>
    <name>Table Name Old</name>
    <repositoryId>0</repositoryId>
Please note that the application producing the XML writes these out line by line, so it is not a true XML comparison.
I would not recommend using a regex for this: to do it accurately, you would effectively need to parse the XML, which is not something you want to do with a regex.
WinMerge is a line-based diff tool, which isn't necessarily effective for XML. I would recommend trying an XML-based diff tool, which has more of a concept of XML's tree structure. Most XML-based diff tools appear to be commercial products, but there is diffxml, which is open source and may be worth a look.
An XML-based diff of the files should inherently be more accurate, since it is not wholly line-based and takes the tree structure into account. You could then delve further into the diffs using an XML parser, such as ElementTree in Python, specifically targeting the tags you consider interesting and comparing them to each other to see if they differ.
If diffxml proves too unwieldy, it may be worth doing the parsing with ElementTree or similar (e.g. lxml) and doing the comparison yourself against the two sources, targeted just at the tags you are interested in.
In short, I think XML parsers, perhaps in combination with an XML-aware diff tool, will be more useful than pure regexes in this case.
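As an illustration of "targeting the interesting tags" in C++, here is a small sketch using TinyXML as a stand-in for the Python parsers mentioned above; the file names and the assumption that only the <name> element matters are made up for the example:

#include <iostream>
#include <string>
#include "tinyxml.h"

// Pull out the one element the comparison cares about.
const char* tableName(TiXmlDocument& doc) {
    TiXmlElement* table = doc.FirstChildElement("table");
    TiXmlElement* info  = table ? table->FirstChildElement("tableInfo") : nullptr;
    TiXmlElement* name  = info  ? info->FirstChildElement("name") : nullptr;
    return name ? name->GetText() : nullptr;
}

int main() {
    TiXmlDocument env1("environment1.xml"), env2("environment2.xml");
    if (!env1.LoadFile() || !env2.LoadFile()) return 1;

    const char* a = tableName(env1);
    const char* b = tableName(env2);
    if (a && b && std::string(a) != b)
        std::cout << "name differs: " << a << " vs " << b << "\n";
    return 0;
}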

How to start using xml with C++

(Not sure if this should be CW or not, you're welcome to comment if you think it should be).
At my workplace, we have many many different file formats for all kinds of purposes. Most, if not all, of these file formats are just written in plain text, with no consistency. I'm only a student working part-time, and I have no experience with using xml in production, but it seems to me that using xml would improve productivity, as we often need to parse, check and compare these outputs.
So my questions are: given that I can only control one small application and its output (output only - the inputs are formats that are used in other applications as well), is it worth trying to change the output to be XML-based? If so, what are the best known ways to do that in C++ (i.e., XML parsers/writers, etc.)? Also, should I provide a plain-text output as well, to make it easy for the users (who are also programmers) to get used to XML? Should I provide a script to translate between XML and plain text? What are your experiences with this subject?
Thanks.
Don't just use XML because it's XML.
Use XML because:
other applications (that only accept XML) are going to read your output
you have a hierarchical data structure that lends itself perfectly to XML
you want to transform the data to other formats using XSL (e.g. to HTML)
EDIT:
A nice personal experience:
Customer: your application MUST be able to read XML.
Me: Er, OK, I will adapt my application so it can read XML.
Same customer (a few days later): your application MUST be able to read fixed width files, because we just realized our mainframe cannot generate XML.
Amir, to parse XML you can use TinyXML, which is incredibly easy to use and start with. Check its documentation for a quick brief, and read the "what it does not do" clause carefully. I've been using it for reading, and all I can say is that this tiny library does the job very well.
As for writing - if your XML files aren't complex you might build them manually with a string object. "Aren't complex" for me means that you're only going to store text at most.
For more complex XML reading/writing you'd better check Xerces, which is heavier than TinyXML. I haven't used it myself, but I've seen it in production and it does deliver.
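For the "build it manually with a string object" case, a sketch along these lines is usually enough (the element names are made up, and only minimal escaping of element text is shown):

#include <sstream>
#include <string>

// Escape the characters that may not appear verbatim in element text.
std::string xmlEscape(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;
        }
    }
    return out;
}

// Build a trivial document by hand; fine as long as the structure stays flat.
std::string makeRecord(const std::string& name, int id) {
    std::ostringstream xml;
    xml << "<record>\n"
        << "  <name>" << xmlEscape(name) << "</name>\n"
        << "  <id>" << id << "</id>\n"
        << "</record>\n";
    return xml.str();
}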
You can try using the boost::property_tree class.
http://www.boost.org/doc/libs/1_43_0/doc/html/property_tree.html
http://www.boost.org/doc/libs/1_43_0/doc/html/boost_propertytree/tutorial.html
http://www.boost.org/doc/libs/1_43_0/doc/html/boost_propertytree/parsers.html#boost_propertytree.parsers.xml_parser
It's pretty easy to use, but the page does warn that it doesn't support the XML format completely. If you do use this though, it gives you the freedom to easily use XML, INI, JSON, or INFO files without changing more than just the read_xml line.
If you want that ability though, you should avoid XML attributes. To use an attribute, you have to look at the <xmlattr> key, which won't transfer between file types (although you can manually create your own subnodes).
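A minimal sketch of what that looks like in practice, assuming a hypothetical config.xml shaped like <config version="2"><name>...</name></config>:

#include <iostream>
#include <string>
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/xml_parser.hpp>

int main() {
    namespace pt = boost::property_tree;
    pt::ptree tree;

    pt::read_xml("config.xml", tree);

    // Dotted paths address nested elements; attributes live under <xmlattr>.
    std::string name = tree.get<std::string>("config.name");
    int version = tree.get("config.<xmlattr>.version", 1); // default if absent

    std::cout << name << " v" << version << "\n";

    // Switching to JSON or INI would mostly mean replacing read_xml
    // with read_json or read_ini (and dropping the attribute access).
}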
That said, using TinyXML is probably better. I've seen it used in a couple of projects I've worked on, but I don't have any experience with it myself.
Another approach to handling XML in your application is to use a data binding tool, such as CodeSynthesis XSD. Such a tool will generate C++ classes that hide all the gory details of parsing/serializing XML -- all that you see are objects corresponding to your XML vocabulary and functions that you can call to get/set the data, for example:
// "Person" and the person() parsing/serialization functions are
// generated by the data binding tool from the XML schema.
Person p = person ("person.xml");  // parse person.xml into an object
cout << p.name ();                 // read the data
p.name ("John");                   // modify the data
p.age (30);
ofstream ofs ("person.xml");
person (ofs, p);                   // serialize the object back to XML
Here's what previous SO threads have said on the topic. Please add others you know of that are relevant:
What is the best open XML parser for C++?
What is XML good for and when should i be using it?
What are good alternative data formats to XML?
BTW, before you decide on an XML parser, you may want to make sure that it will actually be able to parse all XML documents instead of just the "simple" ones, as discussed in this article:
Are you using a real XML parser?

library for doing diffs

I've been tasked with creating a tool that can diff and merge the configuration files for my company's product. The configurations are stored as either XML or URL-encoded strings. I'm looking for a library, preferably open source with a license compatible with commercial software, that can do these diffs. Our app is written in C++, so C++ libraries would be best, but I'm willing to look at libraries that are C#-specific since I can write a wrapper that exposes it to C++ via COM. Three-way diffs would be ideal, but two-way is acceptable. If it has an understanding of XML, that would also be a plus (since XML nodes can be reordered without changing the document, etc). Any library suggestions? Should I even consider writing my own diff tools in the hopes of giving it semantic knowledge of our formats?
Thanks to this similar question, I've already discovered this google library, which seems really great, but I'm still looking for other options. It also seems to be able to output the diffs in HTML format (using the <ins> and <del> tags that I didn't know existed before I discovered it), which could be really handy, but it seems to be a unified diff only. I'm going to need to display the results in a web browser, and probably have to build an interface for doing the merges in the browser as well. I don't expect a library to be able to help with these tasks, but it must produce output in a format that is amenable to me building this on top of it. I'm currently envisioning something along the lines of TortoiseMerge (side-by-side diffs, not unified), except browser-based. Any tips/tricks/design ideas on how to present this would be appreciated too.
Subversion comes with libsvn_diff and libsvn_delta licensed under Apache Software License.
Here is a C++ library that can diff what the author calls semistructured data. It deals nicely with HTML and XML. Since your data is XML it would make a lot of sense to use this instead of plain text diff. This is especially the case when the files are machine generated.
I am currently trying to use this library to build a tool that diffs Visual Studio project files. These are basically XML files and using a plain diff tool like Winmerge is too painful because Visual Studio pretty much mucks up the whole file by crazy reordering. The idea is to do some kind of a structured diff to address the problem.
For diffing the XML I would propose that you normalize it first: sort all the elements in alphabetical order, then generate a stream of tokens/XML that represents the original document but is independent of the original formatting. After running the diff, parse the result to get a tree containing what was added/removed.
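A sketch of that normalize-then-diff idea in C++, using a toy element structure rather than a real XML parser (building the tree from actual XML is left to whatever parser you already use):

#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// Toy element tree used only to illustrate the normalization step.
struct Element {
    std::string name;
    std::string text;
    std::vector<Element> children;
};

// Recursively sort children by (name, text) so that two documents that
// differ only in element order serialize identically.
void normalize(Element& e) {
    for (auto& c : e.children) normalize(c);
    std::sort(e.children.begin(), e.children.end(),
              [](const Element& a, const Element& b) {
                  return std::tie(a.name, a.text) < std::tie(b.name, b.text);
              });
}

// Emit one line per element so a line-based diff sees stable, comparable input.
void serialize(const Element& e, std::string& out, int depth = 0) {
    out += std::string(depth * 2, ' ') + e.name + ": " + e.text + "\n";
    for (const auto& c : e.children) serialize(c, out, depth + 1);
}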