What is a "path_expression" in BigQuery

BigQuery describes a path_expression in the Syntax page as follows:
A path expression describes how to navigate to an object in a graph of objects and generally follows this structure...
Examples:
foo.bar
foo.bar/25
foo/bar:25
foo/bar/25-31
/foo/bar
/25/foo/bar
What are some actual examples of a Path Expression with a valid table, for example in a CTE? My thinking was that a path expression would be something that explicitly qualifies a field or struct sub-field, such as:
myTable.myField.mySubfield
But from the above syntax, which allows for:
/:-
I'm not exactly sure what it is or how it would be used. Could someone show a real-world example of how a path expression would be used?

Path expressions describe where to search for (or store) data. Consider this example from the docs:
#legacySQL
SELECT
weight_pounds, state, year, gestation_weeks
FROM
[bigquery-public-data:samples.natality]
ORDER BY
weight_pounds DESC
LIMIT
10;
In this example, the portion [bigquery-public-data:samples.natality] is the path_expression.
It says to look at the natality table in the samples dataset of the bigquery-public-data project.
But from the above syntax, which allows for: /:-
This would actually not be allowed as a path_expression.
This example would be parsed as:
/:-
{first_part}/{subsequent_part}:{subsequent_part}
{{ unquoted_identifier | quoted_identifier }} /
{{ unquoted_identifier | quoted_identifier | number }} :
{{ unquoted_identifier | quoted_identifier | number }}
An unquoted_identifier must at least begin with a letter or an underscore. A quoted_identifier can contain any character, but cannot be empty. The empty string therefore cannot be considered an unquoted_identifier, quoted_identifier, or number, so this expression is invalid (and is invalid in 3 positions).
A possible minimal path_expression could be something like:
a:b.c
meaning: look in the c table in the b dataset in the a project.
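The rules above can be sketched as a small Python checker. This is only an illustration of the reading given in this answer, not BigQuery's actual parser: the names `PATH` and `looks_like_path_expression` are hypothetical, the separator set (`.`, `/`, `:`, `-`) is inferred from the documentation examples, quoted identifiers are assumed to use backticks, and the leading-slash examples such as /foo/bar are deliberately not handled.

```python
import re

# Building blocks, following the description in the answer (assumptions):
UNQUOTED = r"[A-Za-z_][A-Za-z0-9_]*"   # must begin with a letter or underscore
QUOTED = r"`[^`]+`"                    # any characters, but not empty
NUMBER = r"[0-9]+"

FIRST = rf"(?:{UNQUOTED}|{QUOTED})"
SUBSEQUENT = rf"(?:{UNQUOTED}|{QUOTED}|{NUMBER})"
SEPARATOR = r"[./:-]"

# first_part followed by zero or more (separator, subsequent_part) pairs
PATH = re.compile(rf"{FIRST}(?:{SEPARATOR}{SUBSEQUENT})*")


def looks_like_path_expression(text):
    """Return True if the whole text fits the simplified grammar above."""
    return PATH.fullmatch(text) is not None


for candidate in ["a:b.c", "foo/bar/25-31", "/:-"]:
    print(candidate, looks_like_path_expression(candidate))
```

Under these assumptions, a:b.c and foo/bar/25-31 are accepted, while /:- is rejected because every part would be empty.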

I may be wrong, but I think you are confusing Path Expression (as it is defined in the referenced documentation) with something like (for example) JSONPath.
In my mind, the former is just terminology that was introduced in order to have a consistent reference within the documentation, while the latter (obviously) is a query language for JSON, similar to XPath for XML, etc.
So, if you had asked about something like JSONPath usage, I bet you would get tons of usage examples, but a question about Path Expression usage has little chance of gathering answers, even with a hefty bounty, as the term is ONLY used in that documentation.

Related

monetdb regexp select

I'm doing some testing with MonetDB.
The gist of the query I'm trying perform (using borrowed syntax) goes like this:
SELECT mystring FROM mytable WHERE mystring REGEXP 'myexpression';
MonetDB does not support this syntax, but the docs claim that it supports PCRE, so this may be possible; still, the syntax eludes me.
See "Does MonetDB support regular expression predicates?":
The implementation is there in the MonetDB backend, the module that
implements it is pcre (to be found in MonetDB5 source tree).
I'm not sure whether it is available by default from MonetDB/SQL.
If not, with these two function definitions, you link SQL functions to the
respective implementations in MonetDB5:
-- case sensitive
create function pcre_match(s string, pattern string)
returns BOOLEAN
external name pcre.match;
-- case insensitive
create function pcre_imatch(s string, pattern string)
returns BOOLEAN
external name pcre.imatch;
If you need more, I'd suggest having a look at MonetDB5/src/modules/mal/pcre.mx in the source code.
Use select name from sys.functions; to check if the function exists, otherwise you will need to create it.
As an example, you may use pcre_imatch() like this:
SELECT mystring FROM mytable WHERE pcre_imatch(mystring, 'myexpression');

Query string degenerate cases

I am looking for a correct regular expression for validating URI query strings. I found some answers here and here, but I still have doubts about the edge cases where the key or the value could be empty. For example, should the following be treated as valid query strings?
?&&
?=
?a=
?a=&
?=a
?&=a
I am looking [...] for a correct regular expression for [valid] URI query strings.
Sure thing, no prob. As per RFC 3986, appendix B, here it is:
^([^#]*)$
If you want something more elaborate, you can check section 3.4 for the allowed characters in addition to percent-encoded entities. The regex would look something like this:
^(%[[:xdigit:]]{2}|[[:print:]])*$
As far as RFC 3986 is concerned, all your examples are valid so far. The RFC is telling us how the query string has to be encoded while saying little about how the query string has to be structured. Older RFCs are continuously shifting authority over the structure of query strings between CGI and HTTP without ever formally specifying a grammar (see e.g. RFC 3875, sec. 4.1.7, RFC 2396, sec. 3.4, RFC 1808, sec. 2.1, …).
An interesting note can be found in RFC 7230, section 2.4:
Applications MUST NOT directly specify the syntax of queries, as this can cause operational difficulties for deployments that do not support a particular form of a query.
[…]
HTML constrains the syntax of query strings used in form submission. New form languages SHOULD NOT emulate it, but instead allow creation of a broader variety of URIs
For a full validity check on such query strings, you would have to implement the algorithm for decoding formdata recommended by the W3C. Could be done in regex, but I would advise against it for reasons of sanity.
With regard to your examples: I believe they are all valid. How they are interpreted should be left to the receiving application. Some are not as much of a fringe case as you may think, though: ?&& is simply an empty dictionary while ?=a could map to { "": "a" }.
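As an illustration of leaving the interpretation to the receiving application, here is a minimal Python sketch that decodes the question's examples with the standard library's `urllib.parse.parse_qsl`. The helper name `interpret` is hypothetical, the leading `?` is assumed to have been stripped already, and this shows how one particular form-data decoder behaves, not what any RFC mandates.

```python
from urllib.parse import parse_qsl


def interpret(query):
    """Decode a query string (without the leading '?') into a dict.

    keep_blank_values=True keeps pairs whose value is empty, such as 'a='.
    """
    return dict(parse_qsl(query, keep_blank_values=True))


for query in ["&&", "=", "a=", "a=&", "=a", "&=a"]:
    print(repr(query), "->", interpret(query))
```

With this decoder, `&&` yields an empty dictionary and `=a` yields a single entry whose key is the empty string, in line with the reading above.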

What's the format of a CUID in SAP BI/BO?

I'm interfacing with an SAP BI/BO server, and some webservices require an input id called a "CUID" (Cluster Unique ID). For example, there's a webservice getObjectById which requires a CUID as input.
I'm trying to make my code more robust by checking whether the CUID entered by a user makes sense, but I can't find a regular expression that properly describes what a CUID looks like. There is a lot of documentation for GUIDs, but they're not the same. Below are some examples of CUIDs found in our system; they look well-formatted, but I'm not sure:
AQA9CNo0cXNLt6sZp5Uc5P0
AXiYjXk_6cFEo.esdGgGy_w
AZKmxuHgAgRJiducy2fqmv0
ASSn7jfNPCFDm12sv3muJwU
AUmKm2AjdPRMl.b8rf5ILww
AaratKz7EDFIgZEeI06o8Fc
ATjdf_MjcR9Anm6DgSJzxJ8
AaYbXdzZ.8FGh5Lr1R1TRVM
Afda1n_SWgxKkvU8wl3mEBw
AaZBfzy_S8FBvQKY4h9Pj64
AcfqoHIzrSFCnhDLMH854Qc
AZkMAQWkGkZDoDrKhKH9pDU
AaVI1zfn8gRJqFUHCa64cjg
My guess would be: start with a capital A, then add 22 random characters in the range [0-9A-Za-z_.]. But perhaps the A means something else, and after a while it would start using B...
Is anyone familiar with this type of id's and how they are formatted?
(quick side question: do I need to escape the "dot" in the square brackets like this \. to get the actual dot character?)
The definition of the different ID types and their purpose is described in the SAP KB note 1285103: What are the different types of IDs used in the BusinessObjects Enterprise repository?
However, I couldn't find any description of the format of the CUID. I wouldn't make any assumptions about it though, other than the fact that it's alphanumeric.
I did a quick query on a repository and found CUIDs consisting of up to 35 characters and beginning with the letters A, B, C, F, k and M.
If you look at the repository database, more specifically the table CMS_INFOOBJECTS7, you'll notice that the column SI_CUID is defined as a VARCHAR2, 56 bytes in size (Oracle RDBMS).
Thus, a valid regular expression to match these would be [a-zA-Z0-9\._]+.
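A minimal Python sketch of such a sanity check follows. The names `CUID_PATTERN` and `is_plausible_cuid` are hypothetical, and the upper bound of 56 characters is an assumption taken from the column size mentioned above rather than from any documented CUID format. On the side question: inside square brackets a dot is already literal, so `[.]` and `[\.]` match the same character.

```python
import re

# Assumption: only letters, digits, '.' and '_' occur, and the length is
# capped at 56 to mirror the SI_CUID column size (not a documented rule).
CUID_PATTERN = re.compile(r"[A-Za-z0-9._]{1,56}")


def is_plausible_cuid(value):
    """Return True if the whole value consists of the allowed characters."""
    return CUID_PATTERN.fullmatch(value) is not None


print(is_plausible_cuid("AXiYjXk_6cFEo.esdGgGy_w"))
print(is_plausible_cuid("not a cuid!"))
```

This only rejects obviously malformed input; whether a given value exists in the repository still has to be checked against the server.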

Is there a way to search terms in order with RegexpQuery in lucene?

I would like to search my indexed documents in order using RegexpQuery.
For example, I have 2 documents:
text: Oracle unveils better than expected quarterly results.
text: Research In Motion shares gained almost 13 per cent on the Toronto Stock Exchange Friday, a day after the smartphone maker posted better than expected quarterly results.
So far I have tried this, but with no luck.
Query regexq = new RegexpQuery(new Term("text", "^.+better.+quarterly.+results"));
Is there another way of implementing this?
Thanks
I believe a PhraseQuery fits what you are looking for better. You can use PhraseQuery.setSlop(int) to allow terms to appear between the terms of the query. This would look like:
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("text", "better"));
pq.add(new Term("text", "quarterly"));
pq.add(new Term("text", "results"));
pq.setSlop(10); //Or whatever is an appropriate slop value for you.
This sort of query is also supported by the standard QueryParser, as seen here, like:
text:"better quarterly results"~10
I think a PhraseQuery is most definitely the better implementation here, but...
Regarding RegexpQuery:
I believe it is intended to compare terms against the regex, and since the phrase you are searching for (I am assuming) is tokenized, no single Term matches your whole regex. You would need to index the entire field as a single Term to make this work, using StringField, KeywordAnalyzer, or similar.
I believe it works like Matcher.matches(), rather than Matcher.find(), which is to say, it must match the entire input term, rather than a portion of it. So, if you had specified "text" as a StringField, you would need to add a .* to the end to consume the rest of the input.
On a similar note, I'm not sure if it supports the use of the character "^" as the start of input, given that it is redundant in that case. I don't see it specified in Lucene's Regexp, but I have seen references to its use, so I'm not sure whether it would be accepted or not.
To summarize, a RegexpQuery could work like:
Query regexq = new RegexpQuery(new Term("text", ".+better.+quarterly.+results.*"));
This assumes you used a StringField or KeywordAnalyzer to index the entire field as a single Term.
With the leading wildcard in your regexp, though, you could expect very poor performance from it (See the warning at the top of the RegexpQuery documentation).
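The difference between matching a whole term and finding a match somewhere in the text can be illustrated outside Lucene. The sketch below uses Python's `re` module purely as an analogy: `re.fullmatch` stands in for the "must match the entire term" behaviour described above, and splitting on whitespace is a rough, assumed stand-in for an analyzer's tokenization. None of this is Lucene API.

```python
import re

doc = "Oracle unveils better than expected quarterly results."
pattern = ".+better.+quarterly.+results.*"

# Tokenized field: each term is tested on its own, so no single term
# can satisfy a pattern that spans several words.
tokens = doc.lower().split()
token_hit = any(re.fullmatch(pattern, token) for token in tokens)

# Single-term field: the whole text is one term, so the pattern can match.
whole_hit = re.fullmatch(pattern, doc) is not None

# Without the trailing ".*" the final period is left unconsumed.
no_tail_hit = re.fullmatch(".+better.+quarterly.+results", doc) is not None

print(token_hit, whole_hit, no_tail_hit)
```

Only the whole-text match with the trailing `.*` succeeds, which mirrors why the field would need to be indexed as a single term.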

Ontology-based string classification

I recently started working with ontologies and I am using Protege to build an ontology which I'd also like to use for automatically classifying strings. The following illustrates a very basic class hierarchy:
String
|_ AlphabeticString
|_ CountryName
|_ CityName
|_ AlphaNumericString
|_ PrefixedNumericString
|_ NumericString
Eventually strings like Spain should be classified as CountryName or UE4564 would be a PrefixedNumericString.
However I am not sure how to model this knowledge. Would I have to first define if a character is alphabetic, numeric, etc. and then construct a word from the existing characters or is there a way to use Regexes? So far I only managed to classify strings based on an exact phrase like String and hasString value "UE4565".
Or would it be better to save a regex for each class in the ontology and then classify the string in Java using those regexes?
An approach that might be appropriate here, especially if the ontology is large/complicated or might change in the future, and assuming that some errors are acceptable, is machine learning.
An outline of a process utilizing this approach might be:
Define a feature set you can extract from each string, relating to your ontology (some examples below).
Collect a "train set" of strings and their true matching categories.
Extract features from each string, and train some machine-learning algorithm on this data.
Use the trained model to classify new strings.
Retrain or update your model as needed (e.g. when new categories are added).
To illustrate more concretely, here are some suggestions based on your ontology example.
Some features that might be applicable (mostly boolean): does the string match a regexp (e.g. the ones Qtax suggests); does the string exist in a prebuilt list of known city names; does it exist in a list of known country names; does it contain uppercase letters; string length (not boolean); etc.
So if, for instance, you have a total of 8 features (a match against each of the 4 regular expressions mentioned above, plus the additional 4 suggested here), then "Spain" would be represented as (1,1,0,0,0,1,1,5): it matches the first 2 regular expressions but not the last two, is not a known city name but is a known country name, has an uppercase letter, and has length 5.
This set of features will represent any given string.
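A minimal Python sketch of such a feature extractor is shown here. The tiny `KNOWN_CITIES` and `KNOWN_COUNTRIES` sets and the function name `extract_features` are hypothetical placeholders (a real system would load complete lists), and the four patterns are ASCII-only versions of the regular expressions suggested for this hierarchy.

```python
import re

# Hypothetical lookup lists; a real system would load complete gazetteers.
KNOWN_CITIES = {"Madrid", "Melbourne", "Toronto"}
KNOWN_COUNTRIES = {"Spain", "Australia", "Canada"}

# ASCII-only patterns, applied to the whole string via re.fullmatch.
PATTERNS = [
    r"[A-Za-z]+",        # AlphabeticString
    r"[A-Za-z0-9]+",     # AlphaNumericString
    r"[A-Za-z]+[0-9]+",  # PrefixedNumericString
    r"[0-9]+",           # NumericString
]


def extract_features(s):
    """Return the 8-element feature tuple for one string."""
    regex_flags = [int(re.fullmatch(p, s) is not None) for p in PATTERNS]
    return tuple(regex_flags + [
        int(s in KNOWN_CITIES),            # in the city list?
        int(s in KNOWN_COUNTRIES),         # in the country list?
        int(any(c.isupper() for c in s)),  # contains an uppercase letter?
        len(s),                            # string length (not boolean)
    ])


print(extract_features("Spain"))
print(extract_features("UE4564"))
```

With Spain in the country list and absent from the city list, the tuple for "Spain" is (1, 1, 0, 0, 0, 1, 1, 5). Tuples like these would be the rows fed to the learning algorithm.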
To train and test a machine learning algorithm, you can use WEKA. I would start with rule- or tree-based algorithms, e.g. PART, RIDOR, JRIP or J48.
Then the trained models can be used via Weka either from within Java or as an external command line.
Obviously, the features I suggest have almost 1:1 match with your Ontology, but assuming your taxonomy is larger and more complex, this approach would probably be one of the best in terms of cost-effectiveness.
I don't know anything about Protege, but you can use regex to match most of those cases. The only problem would be differentiating between country and city name, I don't see how you could do that without a complete list of either one.
Here are some expressions that you could use:
AlphabeticString:
^[A-Za-z]+\z (ASCII) or ^\p{Alpha}+\z (Unicode)
AlphaNumericString:
^[A-Za-z0-9]+\z (ASCII) or ^\p{Alnum}+\z (Unicode)
PrefixedNumericString:
^[A-Za-z]+[0-9]+\z (ASCII) or ^\p{Alpha}+\p{N}+\z (Unicode)
NumericString:
^[0-9]+\z (ASCII) or ^\p{N}+\z (Unicode)
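As a rough sketch of how these expressions could drive a classifier, the Python snippet below tries the ASCII patterns from most specific to most general and returns the first class that matches. The ordering, the `RULES` table and the `classify` function are assumptions made for illustration; `re.fullmatch` replaces the `^...\z` anchors, since Python's `re` module has historically spelled the end-of-input anchor `\Z` rather than `\z`.

```python
import re

# Ordered from most specific to most general (an assumed ordering).
RULES = [
    ("NumericString", r"[0-9]+"),
    ("AlphabeticString", r"[A-Za-z]+"),
    ("PrefixedNumericString", r"[A-Za-z]+[0-9]+"),
    ("AlphaNumericString", r"[A-Za-z0-9]+"),
]


def classify(s):
    """Return the first class whose pattern matches the whole string."""
    for name, pattern in RULES:
        if re.fullmatch(pattern, s):
            return name
    return "String"  # fallback: the root of the hierarchy


for sample in ["Spain", "UE4564", "12345", "UA4E30"]:
    print(sample, "->", classify(sample))
```

Telling CountryName from CityName would still need a lookup list on top of this, as noted above.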
A particular string is an instance, so you'll need some code to make the basic assertions about the particular instance. That code itself might contain the use of regular expressions. Once you've got those assertions, you'll be able to use your ontology to reason about them.
The hard part is that you've got to decide what level you're going to model at. For example, are you going to talk about individual characters? You can, but it's not necessarily sensible. You've also got the challenge that arises from the fact that negative information is awkward (as the basic model of such models is intuitionistic, IIRC) which means (for example) that you'll know that a string contains a numeric character but not that it is purely numeric. Yes, you'd know that you don't have an assertion that the instance contains an alphabetic character, but you wouldn't know whether that's because the string doesn't have one or just because nobody's said so yet. This stuff is hard!
It's far easier to write an ontology if you know exactly what problems you intend to solve with it, as that allows you to at least have a go at working out what facts and relations you need to establish in the first place. After all, there's a whole world of possible things that could be said which are true but irrelevant (“if the sun has got his hat on, he'll be coming out to play”).
Responding directly to your question, you start by checking whether a given token is numeric, alphanumeric or alphabetic (you can use regex here) and then you classify it as such. In general, the approach you're looking for is called generalization hierarchy of tokens or hierarchical feature selection (Google it). The basic idea is that you could treat each token as a separate element, but that's not the best approach since you can't cover them all [*]. Instead, you use common features among tokens (for example, 2000 and 1981 are distinct tokens but they share a common feature of being 4 digit numbers and possibly years). Then you have a class for four digit numbers, another for alphanumeric, and so on. This process of generalization helps you to simplify your classification approach.
Frequently, if you start with a string of tokens, you need to preprocess them (for example, remove punctuation or special symbols, remove words that are not relevant, apply stemming, etc.). But maybe you can use some symbols (say, punctuation between cities and countries, e.g. Melbourne, Australia), so you map that set of useful punctuation symbols to another symbol (#) and use that as context. Then, the next time you find an unknown word next to a comma next to a known country, you can use that knowledge to assume that the unknown word is a city.
Anyway, that's the general idea behind classification using an ontology (based on a taxonomy of terms). You may also want to read about part-of-speech tagging.
By the way, if you only want to have 3 categories (numeric, alphanumeric, alphabetic), a viable option would be to use edit distance (what is more likely: that UA4E30 belongs to the alphanumeric or the numeric category, considering that it doesn't correspond to the traditional format of prefixed numeric strings?). So, you assume a cost for each operation (insertion, deletion, substitution) that transforms your unknown token into a known one.
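The edit-distance idea can be sketched in Python as follows. This is a plain Levenshtein distance with configurable operation costs; the function names `edit_distance` and `nearest` and the sample known tokens are hypothetical, and unit costs are assumed by default.

```python
def edit_distance(a, b, ins=1, delete=1, sub=1):
    """Cost of transforming string a into string b (Levenshtein distance)."""
    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * delete]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + delete,                        # drop ca
                curr[j - 1] + ins,                       # insert cb
                prev[j - 1] + (0 if ca == cb else sub),  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def nearest(token, known):
    """Return the known token that is cheapest to reach from token."""
    return min(known, key=lambda k: edit_distance(token, k))


print(edit_distance("UA4E30", "UA4530"))
print(nearest("UA4E30", ["UA4530", "123456"]))
```

The unknown token would then inherit the category of its nearest known token; different costs per operation can be passed in if, say, substitutions should be cheaper than insertions.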
Finally, although you said you're using Protege (which I haven't used) to build your ontology, you may want to look at WordNet.
[*] There are probabilistic approaches that help you to determine a probability for an unknown token, so the probability of such event is not zero. Usually, this is done in the context of Hidden Markov Models. Actually, this could be useful to improve the suggestion given by etov.