How are Path Expressions used in BigQuery - google-cloud-platform

BigQuery documentation describes path expressions, which look like this:
foo.bar
foo.bar/25
foo/bar:25
foo/bar/25-31
/foo/bar
/25/foo/bar
But it doesn't say a lot about how and where these path expressions are used. It only briefly mentions:
A path expression describes how to navigate to an object in a graph of objects.
But what is this graph of objects?
How would you use this syntax with a graph of objects?
What's the meaning of a path expression like foo/bar/25-31?
My question is: what are these Path Expressions the official documentation describes?
I've searched through BigQuery docs but haven't managed to find any other mention of these path expressions. Is this syntax actually part of BigQuery SQL at all?
What I've found out so far
There is an existing question, which asks roughly the same thing, but for some reason it's downvoted and none of the answers are correct. Though the question it asks is more about a specific detail of the path expression syntax.
Anyway, the answers there propose a few hypotheses as to what path expressions are:
It's not a syntax for referencing tables
The BigQuery Legacy SQL uses syntax that's similar to path expressions for referencing tables:
SELECT state, year FROM [bigquery-public-data:samples.natality]
But that syntax is only valid in BigQuery Legacy SQL. In the new Google Standard SQL it produces a syntax error. There's a separate documentation for table path syntax, which is different from path expression syntax.
It's not JSONPath syntax
JSONPath syntax is documented elsewhere and looks like:
SELECT JSON_QUERY(json_text, '$.class.students[0]')
It's not a syntax for accessing JSON object graph
There's a separate JSON subscript operator syntax, which looks like so:
SELECT json_value.class.students[0]['name']
My current hypothesis
My best guess is that BigQuery doesn't actually support such syntax, and the description in the docs is a mistake.
But please, prove me wrong. I'd really like to know because I'm trying to write a parser for BigQuery SQL, and to do so, I need to understand the whole syntax that BigQuery allows.

I believe that a "path expression" is the combination of identifiers that points to specific objects/tables/columns/etc. So `project.dataset.table.struct.column` is a path expression comprising of 5 identifiers. I also think that alias.column within the context of a query is a path expression with 2 identifiers (although the alias is probably expanded behind the scenes).
If you scroll up a bit in your link, there is a section with some examples of valid path expressions, which also happens to be right after the identifiers section.
With this in mind, I think a JSON path expression is a certain type of path expression, as parsing JSON requires a specific set of identifiers to get to a specific data element.
As for the "graph" terminology, perhaps BQ parses the query and accesses data using a graph methodology behind the scenes, I can't really say. I would guess "path expressions" probably makes more sense to the team working on BigQuery rather than users using BigQuery. I don't think there is any special syntax for you to "use" path expressions.
If you are writing a parser, maybe take some inspiration from this ZetaSQL parser, which has several references to path expressions.

Looks this syntax comes from ZetaSQL parser, which includes the exact same documentation. BigQuery most likely uses ZetaSQL internally as its parser (ZetaSQL supports all of BigQuery syntax and they're both from Google).
According to ZetaSQL grammar a path expression beginning with / and containing : and - can be used for referencing tables in FROM clause. Looks like the / and : are simply part of identifier names, like the - is part of identifier names in BigQuery.
But the support for the : and / characters in ZetaSQL path expressions can be toggled on or off, and it seems that in BigQuery it's been toggled off. BigQuery doesn't allow : and / characters in table names - not even when they're quoted.
ZetaSQL also allows to toggle the support of - in identifier names, which BigQuery does allow.
My conclusion: it's a ZetaSQL parser feature, the documentation of which has been mistakenly copy-pasted to BigQuery documentation.
Thanks to rtenha for pointing out the ZetaSQL parser, of which I wasn't aware before.

Related

Apache Calcite - Parse a query with ## variables

Not sure where else to go so thought I would ask the group.
I have Apache Calcite working as a SQL parser for my application - I am trying to parse MySQL. However it is unable to handle certain SQL statements. The issue seems to be with any SQL statement that contains a variable denoted by "##" so something like :
SELECT ##session.auto_increment_increment AS auto_increment_increment
fails in the parser. I appreciate this is MySQL specific but was wondering if there is a way to "handle" these ##'s to at least get them into the Node tree so I can provide a more useful response than throw an exception.
There is an open request for this feature, CALCITE-5066. Probably the best way to "handle" these ## variables is to implement the feature.
I'm not being facetious. A quick 'hack' solution will likely trip up if ## characters appear in comments or character literals. So it's better to handle this by modifying the parser. And once you've modified the parser, if you want it to stay working you should write tests and contribute it back to the project.

Get runtime ColdFusion syntax trees?

Is it possible to get access to / modify ColdFusion syntax trees at run time?
I'd wager not, and a 10 minute google search didn't find anything. Fiddling with closures and writing metadata dumps, we can see stringified versions of objects like [runtime expression], for example in the following:
function x(a=b+1) {}
WriteDump(getMetaData(x).parameters[1]["default"]);
Does it allow us to go no deeper than this, or perhaps someone knows how to keep digging and start walking trees?
Default UDF parameter expressions aren't available in function metadata as you've found. Other libraries that have implemented some form of CFML parser are
CFLint (written in Java and using ANTLR)
https://github.com/cflint/CFLint
CFFormat (also uses a binary compiled from Rust)
https://www.forgebox.io/view/commandbox-cfformat
Function LineNums (pure CFML)
https://www.forgebox.io/view/funclinenums
There is also a function callStackGet() docs: https://cfdocs.org/callstackget which might be useful to whatever you are trying to do.
And another CFML parser (written in CFML) here: https://github.com/foundeo/cfmlparser

OpenLDAP regex search with shell script

For an OpenLDAP database I need to find all users that have a telephone number matching a regex pattern, and are in a given Organizational Unit.
According to this: LDAP search using regular expression it is impossible by an ldapsearch (what would have been my first choice otherwise).
I would like to do the least possible work on clientside, and querying all users from an organizational unit and filter them by a grep or something similar seems too resource consuming. Is there a better way to do it?
Also I'm not very familiar with shell, so I'm a little afraid of "sed", but I heard it's powerful and performs well in a regex filtering. If I'd need to do the filtering client side which would be the easiest way (not compromising performance)?
And about batched inputs. If I get a lot of partial phone numbers in a CSV file, and each partial number could have the type "prefix"/"postfix"/"regex" (so it's tow coloumns: type, and partialnumber), what would be the best performance-wise?
Should I just get all the users in the organization unit and filter them by the shell script (iterating through all the users and trying to match any of the numbers)?
Or should I make a query for every number (this is only a viable option if regex filter for attributes is possible in an ldap query).
At my level of knowledge the first one is the way to go but is there a better solution?
I'm using OpenLDAP 2.4.23 it that matters in any way.
The results of using regular expressions with LDAP data might not be what you expect. LDAP data is not strings, but specific types of data defined by the schema, and application must always retrieve the schema to learn how to deal with attribute values. The telephoneNumber attribute has a specific syntax, and regular expressions may not work. In general, matching rules must be used by LDAP clients to compare and match data sooted in a directory server. In fact, best practices are that applications must alway sure matching rules, not native language comparison operators or regular expressions. For more information, please see LDAP: Programming Practices and LDAP: Using Matching Rules.

File system regular expression search tool

What is the best tool to make complex (multi-line) regular expression file contents searches with good reporting capabilities?
I need to make a report over large Java/JSP code base and I have to make some charts afterward.
Eclipse is rather good at searches, but it does not provide good report of what is found. It just shows the tree of files, but I would like to see a table with columns corresponding to full match, each group, file name, file path, file date, may some version control information etc. Then I can transfer this table to Excel and make some graphs that I want.
Is there some generic file system search tool that has such capabilities? Or maybe there is some Eclispe plugin that can give better reports (note that I'm stuck on eclipse 3.1.2)?
Agent Ransack, TextPad, and UltraEdit allow you to perform regular expression searches against the file system. My favorite is Agent Ransack as you can specify regular expressions for the file names and for the content.
PowerGREP (on Windows) can be used to do (most of) that. You can define the format of your search results quite freely. I haven't tried yet to also add file meta information to the search results, but that should work. Not sure if you can add version control information (where would that come from?) - perhaps if you could be a bit more specific, I could check.
Other than that, why not write a small Python/Ruby/Perl script like JasonTrue suggested?
For searches over code bases with queries that understand the language structure, look at SD Search Engine. This tool indexes larges source base to provide very fast query response.
Queries are stated in terms of langauge elements (identifiers, operators, strings, ...) with constraints over the language elements (including wildcards and regexps on identifiers, strings and comments, as well as range constraints on numbers). Language whitespace and linebreaks (and comments unless you insist) are ignored.
If you want to do a plain regexp search on file character content, you can do that too but you don't get the speed advantage of the index, runs more like regular grep.
The interactive query result is shown in a hit window with other hits; by clicking, you can go to window containg the full source code of a hit.
In logging mode, all hits found are written to a log file with N lines of context, where you configure N. That's probably the report you want.
um... grep -r ?
Or ruby/perl/python, if you want to have more control over the final output; it sounds like what you're after would only be a few lines.

Allowing code snippets in form input while preventing XSS and SQL injection attacks

How can one allow code snippets to be entered into an editor (as stackoverflow does) like FCKeditor or any other editor while preventing XSS, SQL injection, and related attacks.
Part of the problem here is that you want to allow certain kinds of HTML, right? Links for example. But you need to sanitize out just those HTML tags that might contain XSS attacks like script tags or for that matter even event handler attributes or an href or other attribute starting with "javascript:". And so a complete answer to your question needs to be something more sophisticated than "replace special characters" because that won't allow links.
Preventing SQL injection may be somewhat dependent upon your platform choice. My preferred web platform has a built-in syntax for parameterizing queries that will mostly prevent SQL-Injection (called cfqueryparam). If you're using PHP and MySQL there is a similar native mysql_escape() function. (I'm not sure the PHP function technically creates a parameterized query, but it's worked well for me in preventing sql-injection attempts thus far since I've seen a few that were safely stored in the db.)
On the XSS protection, I used to use regular expressions to sanitize input for this kind of reason, but have since moved away from that method because of the difficulty involved in both allowing things like links while also removing the dangerous code. What I've moved to as an alternative is XSLT. Again, how you execute an XSL transformation may vary dependent upon your platform. I wrote an article for the ColdFusion Developer's Journal a while ago about how to do this, which includes both a boilerplate XSL sheet you can use and shows how to make it work with CF using the native XmlTransform() function.
The reason why I've chosen to move to XSLT for this is two fold.
First validating that the input is well-formed XML eliminates the possibility of an XSS attack using certain string-concatenation tricks.
Second it's then easier to manipulate the XHTML packet using XSL and XPath selectors than it is with regular expressions because they're designed specifically to work with a structured XML document, compared to regular expressions which were designed for raw string-manipulation. So it's a lot cleaner and easier, I'm less likely to make mistakes and if I do find that I've made a mistake, it's easier to fix.
Also when I tested them I found that WYSIWYG editors like CKEditor (he removed the F) preserve well-formed XML, so you shouldn't have to worry about that as a potential issue.
The same rules apply for protection: filter input, escape output.
In the case of input containing code, filtering just means that the string must contain printable characters, and maybe you have a length limit.
When storing text into the database, either use query parameters, or else escape the string to ensure you don't have characters that create SQL injection vulnerabilities. Code may contain more symbols and non-alpha characters, but the ones you have to watch out for with respect to SQL injection are the same as for normal text.
Don't try to duplicate the correct escaping function. Most database libraries already contain a function that does correct escaping for all characters that need escaping (e.g. this may be database-specific). It should also handle special issues with character sets. Just use the function provided by your library.
I don't understand why people say "use stored procedures!" Stored procs give no special protection against SQL injection. If you interpolate unescaped values into SQL strings and execute the result, this is vulnerable to SQL injection. It doesn't matter if you are doing it in application code versus in a stored proc.
When outputting to the web presentation, escape HTML-special characters, just as you would with any text.
The best thing that you can do to prevent SQL injection attacks is to make sure that you use parameterized queries or stored procedures when making database calls. Normally, I would also recommend performing some basic input sanitization as well, but since you need to accept code from the user, that might not be an option.
On the other end (when rendering the user's input to the browser), HTML encoding the data will cause any malicious JavaScript or the like to be rendered as literal text rather than executed in the client's browser. Any decent web application server framework should have the capability.
I'd say one could replace all < by <, etc. (using htmlentities on PHP, for example), and then pick the safe tags with some sort of whitelist. The problem is that the whitelist may be a little too strict.
Here is a PHP example
$code = getTheCodeSnippet();
$code = htmlentities($code);
$code = str_ireplace("<br>", "<br>", $code); //example to whitelist <br> tags
//One could also use Regular expressions for these tags
To prevent SQL injections, you could replace all ' and \ chars by an "innofensive" equivalent, like \' and \, so that the following C line
#include <stdio.h>//'); Some SQL command--
Wouldn't have any negative results in the database.