BigQuery - JSONpath recursive operator (2/2) - google-cloud-platform

Is there any way to perform a recursive search on a JSON string in BigQuery without the ".." operator, which is apparently not supported?
Motivation: access "name" only when it is located within "students" in the example below.
Query
SELECT JSON_EXTRACT(json_text, '$..students.name') AS first_student
FROM UNNEST([
'{"class" : {"students" : {"name" : "Jane"}}}'
]) AS json_text;
Desired output
+-----------------+
| first_student |
+-----------------+
| "Jane" |
+-----------------+
Current output
Unsupported operator in JSONPath: ..

Consider below approach
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS STRING
LANGUAGE js AS """
return jsonPath(JSON.parse(json), json_path);
"""
OPTIONS (
library="gs://some_bucket/jsonpath-0.8.0.js"
);
SELECT CUSTOM_JSON_EXTRACT(json_text, '$..students.name') AS first_student
FROM UNNEST([
'{"class" : {"students" : {"name" : "Jane"}}}'
]) AS json_text;
with output
Note: to work around BigQuery's current JSONPath limitation, the above solution uses a UDF plus an external library, jsonpath-0.8.0.js, which can be downloaded from https://code.google.com/archive/p/jsonpath/downloads and uploaded to Google Cloud Storage (gs://some_bucket/jsonpath-0.8.0.js in this example).
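For readers unfamiliar with the ".." operator, here is a minimal Python sketch of what a recursive-descent lookup such as '$..students.name' does. It is only an illustration of the idea, not the jsonpath-0.8.0.js implementation, and the helper names find_key and find_path are made up for this example.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` at any depth (roughly `$..key`)."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending so that deeper occurrences are found as well.
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def find_path(node, keys):
    """Roughly `$..a.b`: find `a` at any depth, then follow the remaining keys directly."""
    for match in find_key(node, keys[0]):
        for k in keys[1:]:
            if isinstance(match, dict) and k in match:
                match = match[k]
            else:
                break  # this candidate has no such child; skip it
        else:
            yield match


doc = json.loads('{"class" : {"students" : {"name" : "Jane"}}}')
names_in_students = list(find_path(doc, ["students", "name"]))
names_anywhere = list(find_key(doc, "name"))
print(names_in_students, names_anywhere)
```

The first call only returns a "name" that sits directly under a "students" object, while the second returns every "name" regardless of its parent.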

Related

BigQuery - JSONpath recursive operator (1/2)

Is there any way to perform a recursive search on a JSON string in BigQuery without the ".." operator, which is apparently not supported?
Motivation: access "name" without knowing about "students" and "class" in the example below.
Query
SELECT JSON_EXTRACT(json_text, '$..name') AS first_student
FROM UNNEST([
'{"class" : {"students" : {"name" : "Jane"}}}'
]) AS json_text;
Desired output
+-----------------+
| first_student |
+-----------------+
| "Jane" |
+-----------------+
Current output
Unsupported operator in JSONPath: ..
Try below
SELECT REGEXP_EXTRACT(json_text, r'"name" : "(\w+)"') AS first_student
FROM UNNEST([
'{"class" : {"students" : {"name" : "Jane"}}}'
]) AS json_text;
with output

Extract domain from url using PostgreSQL

I need to extract the domain name for a list of urls using PostgreSQL. In the first version, I tried using REGEXP_REPLACE to replace unwanted characters like www., biz., sports., etc. to get the domain name.
SELECT REGEXP_REPLACE(url, '^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain",
COUNT(DISTINCT(user)) AS "Unique Users"
FROM db
GROUP BY 1
ORDER BY 2 DESC;
This approach is hard to maintain, because the query has to be updated whenever the list of unwanted prefixes changes.
I did try https://stackoverflow.com/a/21174423/10174021 to extract from the end of the line using PostgreSQL REGEXP_SUBSTR, but I'm getting blank rows in return. Is there a better way of doing this?
A dataset sample to try with:
CREATE TABLE sample (
url VARCHAR(100) NOT NULL);
INSERT INTO sample (url)
VALUES
('sample.co.uk'),
('www.sample.co.uk'),
('www3.sample.co.uk'),
('biz.sample.co.uk'),
('digital.testing.sam.co'),
('sam.co'),
('m.sam.co');
Desired output
+------------------------+--------------+
| url | domain |
+------------------------+--------------+
| sample.co.uk | sample.co.uk |
| www.sample.co.uk | sample.co.uk |
| www3.sample.co.uk | sample.co.uk |
| biz.sample.co.uk | sample.co.uk |
| digital.testing.sam.co | sam.co |
| sam.co | sam.co |
| m.sam.co | sam.co |
+------------------------+--------------+
So, I've found a solution based on Jeremy's and Rémy Baron's answers:
1) Extract all the public suffixes from the public suffix list and store them in a table, which I named tlds.
2) Get the unique urls in the dataset and match each one to its TLD.
3) Extract the domain name using regexp_replace (used in this query) or, alternatively, regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld").
The SQL query is as below:
WITH stored_tld AS(
SELECT
DISTINCT(s.url),
FIRST_VALUE(t.domain) over (PARTITION BY s.url ORDER BY length(t.domain) DESC
rows between unbounded preceding and unbounded following) AS "tld"
FROM sample s
JOIN tlds t
ON (s.url like '%%'||domain))
SELECT
t1.url,
CASE WHEN t1."tld" IS NULL THEN t1.url ELSE regexp_replace(t1.url,'(.*\.)((.[a-z]*).*'||replace(t1."tld",'.','\.')||')','\2')
END AS "extracted_domain"
FROM(
SELECT a.url,st."tld"
FROM sample a
LEFT JOIN stored_tld st
ON a.url = st.url
)t1
Links to try: SQL Tester
You can try this :
with tlds as (
select * from (values('.co.uk'),('.co'),('.uk')) a(tld)
) ,
sample as (
select * from (values ('sample.co.uk'),
('www.sample.co.uk'),
('www3.sample.co.uk'),
('biz.sample.co.uk'),
('digital.testing.sam.co'),
('sam.co'),
('m.sam.co')
) a(url)
)
select url,regexp_replace(url,'(.*\.)(.*'||replace(tld,'.','\.')||')','\2') "domain" from (
select distinct url,first_value(tld) over (PARTITION BY url order by length(tld) DESC) tld
from sample join tlds on (url like '%'||tld)
) a
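The same longest-suffix idea can be sketched outside the database. The Python below is only an illustration under stated assumptions: the three-entry TLDS list mirrors the values(...) in the query above (a real list would come from the public suffix list), and extract_domain is a hypothetical helper name.

```python
# Tiny stand-in for the tlds table; assumed for illustration only.
TLDS = [".co.uk", ".co", ".uk"]


def extract_domain(url, tlds=TLDS):
    """Keep the label immediately before the longest matching suffix, plus that suffix."""
    matches = [t for t in tlds if url.endswith(t)]
    if not matches:
        return url  # no known suffix: leave the value untouched
    tld = max(matches, key=len)  # same role as ORDER BY length(tld) DESC
    head = url[: -len(tld)]  # e.g. "www3.sample"
    label = head.rsplit(".", 1)[-1]  # e.g. "sample"
    return label + tld


for u in ["sample.co.uk", "www3.sample.co.uk", "digital.testing.sam.co", "m.sam.co"]:
    print(u, "->", extract_domain(u))
```

Picking the longest suffix matters: with only '.uk' matched, 'www.sample.co.uk' would wrongly reduce to 'co.uk'.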
I use split_part(url,'/',3) for this :
select split_part('https://stackoverflow.com/questions/56019744', '/', 3) ;
output
stackoverflow.com
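Note that split_part(url,'/',3) relies on the value having a scheme such as https://, and it returns the full host rather than the registered domain. A rough Python equivalent using the standard library's urllib.parse.urlsplit shows the same behaviour; this is a sketch for comparison, not part of the original answer.

```python
from urllib.parse import urlsplit

# With a scheme, the host is the third '/'-separated field, i.e. the netloc.
full_url_host = urlsplit("https://stackoverflow.com/questions/56019744").netloc

# Without a scheme (as in the sample table), there is no netloc to extract.
bare_host = urlsplit("www.sample.co.uk").netloc

print(repr(full_url_host), repr(bare_host))
```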

dc:Creator string literal vs. regex FILTER in SPARQL

I am using Europeana's Virtuoso SPARQL Endpoint.
I have been trying to search in SPARQL for content about a specific contributor. To my understanding, this could be carried out this way:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title
WHERE {
?objectInfo dc:title ?title .
?objectInfo dc:creator 'Picasso' .
}
Nevertheless, I get nothing in return.
Alternatively, I used FILTER regex to search for the literal.
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title ?creator
WHERE {
?objectInfo dc:title ?title .
?objectInfo dc:creator ?creator .
FILTER regex(?creator, 'Picasso')
}
This actually worked very well and returned correctly the results.
My question is: Is it possible to produce the SPARQL query without using FILTER to search the work of a particular artist?
Many thanks.
I don't think there are any objects with 'Picasso' literally as the creator. So a regex filter is a good choice, but slow.
Here's a way to find the strings your regex is matching:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?creator (count(?creator) as ?ccount)
WHERE {
?objectInfo dc:title ?title .
?objectInfo dc:creator ?creator .
FILTER regex(?creator, 'Picasso')
}
group by ?creator
order by ?ccount
It might have been easier for you to see that if you had displayed all variables in the select statement:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT *
WHERE {
?objectInfo dc:title ?title .
?objectInfo dc:creator ?creator .
FILTER regex(?creator, 'Picasso')
}
If you don't want to use a regex filter, you could enumerate all of the Picasso variants you are looking for:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT *
WHERE {
values ?creator { "Picasso, Pablo" "Pablo Picasso" } .
?objectInfo dc:title ?title .
?objectInfo dc:creator ?creator
}
bif:contains works on this endpoint and is pretty fast:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT *
WHERE {
?objectInfo dc:title ?title .
?objectInfo dc:creator ?creator .
?creator bif:contains 'Picasso'
#FILTER regex(?creator, 'Picasso')
}
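To make the difference between the approaches concrete, here is a small Python sketch. The creator strings are hypothetical stand-ins (the endpoint's actual values are not shown in this thread); the point is only that an exact literal match, a regex match and an enumerated VALUES list select different things.

```python
import re

# Hypothetical dc:creator literals, assumed for illustration only.
creators = ["Picasso, Pablo", "Pablo Picasso", "Braque, Georges"]

# ?objectInfo dc:creator 'Picasso' : the whole literal must equal 'Picasso'.
exact = [c for c in creators if c == "Picasso"]

# FILTER regex(?creator, 'Picasso') : the pattern may match anywhere in the string.
by_regex = [c for c in creators if re.search("Picasso", c)]

# values ?creator { ... } : exact match against each enumerated variant.
variants = {"Picasso, Pablo", "Pablo Picasso"}
enumerated = [c for c in creators if c in variants]

print(exact, by_regex, enumerated)
```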
1) Your first query has unconnected triple patterns.
2) Judging by the vocabulary description, I guess dc:creator expects a resource, i.e. a URI. Doesn't using the URI of the entity Picasso work?
+--------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Term Name: creator | |
| URI: | http://purl.org/dc/elements/1.1/creator |
| Label: | Creator |
| Definition: | An entity primarily responsible for making the resource. |
| Comment: | Examples of a Creator include a person, an organization, or a service. Typically, the name of a Creator should be used to indicate the entity. |
+--------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
It would be good to see your data in order to decide whether FILTER on literals is necessary or not.

Amazon Data Pipeline: How to use a script argument in a SqlActivity?

When trying to use a Script Argument in the sqlActivity:
{
"id" : "ActivityId_3zboU",
"schedule" : { "ref" : "DefaultSchedule" },
"scriptUri" : "s3://location_of_script/unload.sql",
"name" : "unload",
"runsOn" : { "ref" : "Ec2Instance" },
"scriptArgument" : [ "'s3://location_of_unload/#{format(minusDays(@scheduledStartTime,1),'YYYY/MM/dd/hhmm/')}'", "'aws_access_key_id=????;aws_secret_access_key=*******'" ],
"type" : "SqlActivity",
"dependsOn" : { "ref" : "ActivityId_YY69k" },
"database" : { "ref" : "RedshiftCluster" }
}
where the unload.sql script contains:
unload ('
select *
from tbl1
')
to ?
credentials ?
delimiter ',' GZIP;
or :
unload ('
select *
from tbl1
')
to ?::VARCHAR(255)
credentials ?::VARCHAR(255)
delimiter ',' GZIP;
the process fails with:
syntax error at or near "$1" Position
Any idea what I'm doing wrong?
This is the script that works fine from psql shell :
insert into tempsdf select * from source where source.id = '123';
Here are some of my tests on SqlActivity using Data-Pipelines :
Test 1 : Using ?'s
insert into mytable select * from source where source.id = ?; - works fine with both the 'script' and the 'scriptUri' option of the SqlActivity object
where "scriptArgument" : "123"
Here ? can replace the value in the condition, but not the condition itself.
Test 2 : Using parameters (works only when the command is specified using the 'script' option)
insert into #{myTable} select * from source where source.id = ?; - works fine if used via the 'script' option only
insert into #{myTable} select * from source where source.id = #{myId}; - works fine if used via the 'script' option only
where #{myTable} and #{myId} are parameters whose values can be declared in the template.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-custom-templates.html
(When you are using only parameters, make sure you delete any unused scriptArguments; otherwise it will still throw an error.)
FAILED TESTS and inferences:
insert into ? select * from source where source.id = ?;
insert into ? select * from source where source.id = '123';
Neither of the above commands works, because table names cannot be supplied through script-argument placeholders. ?'s can only be used to pass values for a comparison condition and column values.
insert into #{myTable} select * from source where source.id = #{myId}; - does not work if used via 'scriptUri'
insert into tempsdf select * from source where source.id = #{myId}; - does not work if used via 'scriptUri'
Neither of the above two commands works, because parameters cannot be evaluated if the script is stored in S3.
insert into tempsdf select * from source where source.id = $1 ; - does not work with 'scriptUri'
insert into tempsdf values ($1,$2,$3); - does not work
Using $'s does not work in any combination.
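The rule that ? can carry values but not table names is common to parameterized SQL in general. The sketch below reproduces it with Python's built-in sqlite3 module, which also uses ? placeholders. SQLite is only an assumed stand-in here: it is not Redshift and not what Data Pipeline runs, so treat it as an analogy rather than a reproduction of SqlActivity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id TEXT)")
conn.execute("CREATE TABLE tempsdf (id TEXT)")
conn.execute("INSERT INTO source VALUES ('123')")

# A placeholder in a value position is accepted.
conn.execute(
    "INSERT INTO tempsdf SELECT * FROM source WHERE source.id = ?", ("123",)
)
value_rows = conn.execute("SELECT id FROM tempsdf").fetchall()

# A placeholder in an identifier position (the table name) is rejected.
try:
    conn.execute(
        "INSERT INTO ? SELECT * FROM source WHERE source.id = ?",
        ("tempsdf", "123"),
    )
    table_placeholder_ok = True
except sqlite3.OperationalError as exc:
    table_placeholder_ok = False
    print("rejected:", exc)

print(value_rows, table_placeholder_ok)
```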
Other tests :
"ScriptArgument" : "123"
"ScriptArgument" : "456"
"ScriptArgument" : "789"
insert into tempsdf values (?,?,?); - works with both scriptUri and script, and translates to insert into tempsdf values ('123','456','789');
The scriptArguments replace the ?'s in the script in the order in which you list them.
In a ShellCommandActivity, we specify two scriptArguments and access them as $1 and $2 in the shell script (.sh):
"scriptArgument" : "'s3://location_of_unload/#{format(minusDays(@scheduledStartTime,1),'YYYY/MM/dd/hhmm/')}'", # can be accessed using $1
"scriptArgument" : "'aws_access_key_id=????;aws_secret_access_key=*******'" # can be accessed using $2
I don't know whether this will work for you.
I believe you are using this SqlActivity for Redshift. Can you modify your SQL script to refer to the parameters using their positional notation?
To refer to the parameters in the sql statement itself, use $1, $2, etc.
See http://www.postgresql.org/docs/9.1/static/sql-prepare.html

retrieving the class name of a specific subclass in owl

I am an rdflib beginner. I have an ontology with classes and sub-classes, and I need to look for a specific word in a subclass and, if it is found, return its class name.
I have the following code:
import rdflib
from rdflib import plugin
from rdflib.graph import Graph
g = Graph()
g.parse("test.owl")
from rdflib.namespace import Namespace
# Register the rdfextras SPARQL processor (only needed on old RDFLib versions)
plugin.register(
    'sparql', rdflib.query.Processor,
    'rdfextras.sparql.processor', 'Processor')
plugin.register(
    'sparql', rdflib.query.Result,
    'rdfextras.sparql.query', 'SPARQLQueryResult')
qres = g.query("""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?subject ?object
    WHERE { ?subject rdfs:subClassOf ?object }
    """)
# n is a subclass name; its class name is good-behaviour, which I want to be the result
n = "pity"
# The query selects two variables, so each result row unpacks into two values
for subj, obj in qres:
    if n in subj:
        print(obj)
    else:
        print("not found")
When I print the result of qres it returns a complete URL, and I need only the names of the sub-class and the class.
Can anyone help with this?
You can get your answer with RDFLib alone, without SPARQL, plus some Python string manipulation. If you prefer to use SPARQL, the Joshua Taylor answer to this question would be the way to go. You also don't need the SPARQL processor plugin with recent versions (4+) of RDFLib - see the "Querying with SPARQL" documentation.
To get the answer you are looking for, you can use the RDFLib Graph method subject_objects to get a generator of subjects and objects for the predicate you are interested in, rdfs:subClassOf. Each subject and object will be an RDFLib URIRef, which is also a Python unicode object that can be manipulated using standard Python string methods. To get the suffix of the IRI, call the split method of the object and take the last item in the returned list.
Here is your code reworked to do as described. Without the data, I can't fully test it, but this did work for me when using a different ontology.
from rdflib import Graph
from rdflib.namespace import RDFS
g = Graph()
g.parse("test.owl")
# n is a subclass name and its class name is good-behaviour
# which i want to be the result
n = "pity"
for subj, obj in g.subject_objects(predicate=RDFS.subClassOf):
    if n in subj:
        # Keep only the part of the IRI after the '#'
        print(obj.rsplit('#')[-1])
    else:
        print('not found')
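The rsplit('#') step assumes the ontology uses '#' IRIs. A slightly more defensive variant, sketched below in plain Python with a hypothetical helper name local_name, also handles IRIs whose last separator is '/'. It needs no RDFLib, since a URIRef behaves like a string.

```python
def local_name(iri):
    """Return the part of an IRI after the last '#', or else after the last '/'."""
    iri = str(iri)
    if "#" in iri:
        return iri.rsplit("#", 1)[-1]
    return iri.rstrip("/").rsplit("/", 1)[-1]


print(local_name("http://www.semanticweb.org/raya/ontologies/test6#Good-behaviour"))
print(local_name("http://www.w3.org/2002/07/owl#Thing"))
```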
You haven't shown your data, so I can't use your exact query or data, but based on your comments, it sounds like you're getting IRIs (e.g., http://www.semanticweb.org/raya/ontologies/test6#Good-behaviour) as results, and you want just the string Good-behaviour. You can use strafter to do that. For instance, if you had data like this:
@prefix : <http://stackoverflow.com/questions/20830056/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
:retrieving-the-class-name-of-a-specific-subclass-in-owl
rdfs:label "retrieving the class name of a specific subclass in owl"@en .
Then a query like this will return results that have full IRIs:
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?question where {
?question rdfs:label ?label .
}
---------------------------------------------------------------------------------------------------------
| question |
=========================================================================================================
| <http://stackoverflow.com/questions/20830056/retrieving-the-class-name-of-a-specific-subclass-in-owl> |
---------------------------------------------------------------------------------------------------------
You can use strafter to get the part of a string after some other string. E.g.,
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?q where {
?question rdfs:label ?label .
bind(strafter(str(?question),"http://stackoverflow.com/questions/20830056/") as ?q)
}
-------------------------------------------------------------
| q |
=============================================================
| "retrieving-the-class-name-of-a-specific-subclass-in-owl" |
-------------------------------------------------------------
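If the results are post-processed in Python rather than in the query, the behaviour of strafter is easy to mimic. The function below is a sketch of my reading of the SPARQL 1.1 semantics (empty string when the marker is absent, whole string when the marker is empty); it is not taken from any library.

```python
def strafter(text, marker):
    """Mimic SPARQL STRAFTER: the part of `text` after the first `marker`."""
    if marker == "":
        return text  # an empty marker matches at the very start
    _, sep, after = text.partition(marker)
    return after if sep else ""  # no match yields the empty string


question = (
    "http://stackoverflow.com/questions/20830056/"
    "retrieving-the-class-name-of-a-specific-subclass-in-owl"
)
q = strafter(question, "http://stackoverflow.com/questions/20830056/")
print(q)
```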
If you define the prefix in the query, e.g., as a so:, then you can also use str(so:) instead of the string form. If you prefer, you can also do the string manipulation in the variable list rather than the graph pattern. That would look like this:
prefix so: <http://stackoverflow.com/questions/20830056/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select (strafter(str(?question),str(so:)) as ?q) where {
?question rdfs:label ?label .
}
-------------------------------------------------------------
| q |
=============================================================
| "retrieving-the-class-name-of-a-specific-subclass-in-owl" |
-------------------------------------------------------------