BigQuery - JSONpath recursive operator (1/2) - google-cloud-platform

Is there any way to perform a recursive search on a JSON string object in BigQuery in absence of the operator ".." which is apparently not supported ?
Motivation: access "name" without knowing "students" and "class" in the below.
Query
SELECT JSON_EXTRACT(json_text, '$..name') AS first_student
FROM UNNEST([
'{"class" : {"students" : {"name" : "Jane"}}}'
]) AS json_text;
Desired output
+-----------------+
| first_student |
+-----------------+
| "Jane" |
+-----------------+
Current output
Unsupported operator in JSONPath: ..

Try below
SELECT REGEXP_EXTRACT(json_text, r'"name" : "(\w+)"') AS first_student
FROM UNNEST([
'{"class" : {"students" : {"name" : "Jane"}}}'
]) AS json_text;
with output

Related

BigQuery - JSONpath recursive operator (2/2)

Is there any way to realize a recursive search on a JSON string object in BigQuery in absence of the operator "..", which is apparently not supported ?
Motivation: access "name" only when located within "students" in the below.
Query
SELECT JSON_EXTRACT(json_text, '$..students.name') AS first_student
FROM UNNEST([
'{"class" : {"students" : {"name" : "Jane"}}}'
]) AS json_text;
Desired output
+-----------------+
| first_student |
+-----------------+
| "Jane" |
+-----------------+
Current output
Unsupported operator in JSONPath: ..
Is there any way to realize a recursive search on a JSON string object in BigQuery in absence of the operator "..", which is apparently not supported ?
Consider below approach
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS STRING
LANGUAGE js AS """
return jsonPath(JSON.parse(json), json_path);
"""
OPTIONS (
library="gs://some_bucket/jsonpath-0.8.0.js"
);
SELECT CUSTOM_JSON_EXTRACT(json_text, '$..students.name') AS first_student
FROM UNNEST([
'{"class" : {"students" : {"name" : "Jane"}}}'
]) AS json_text;
with output
Note: to overcome current BigQuery's "limitation" for JsonPath, above solution uses UDF + external library - jsonpath-0.8.0.js that can be downloaded from https://code.google.com/archive/p/jsonpath/downloads and uploaded to Google Cloud Storage - gs://some_bucket/jsonpath-0.8.0.js

AWS cli JMESpath query tag values

As a follow-up to my previous question, I am having trouble getting tags to work right in this type of output. I want to print a table with each property as a row. This is how I expect it to look:
% aws --output table ec2 describe-instances --instance-id $id --query "Reservations[].Instances[0].[{ Property: 'Type', Value: InstanceType }]"
-------------------------------
| DescribeInstances |
+-----------+-----------------+
| Property | Value |
+-----------+-----------------+
| Type | g4dn.12xlarge |
+-----------+-----------------+
But with tag names it looks like this:
% aws --output table ec2 describe-instances --instance-id $id --query "Reservations[].Instances[0].[{ Property: 'Name', Value: Tags[?Key =='Name'].Value }]"
-------------------
|DescribeInstances|
+-----------------+
| Property |
+-----------------+
| Name |
+-----------------+
|| Value ||
|+---------------+|
|| Elliott-TKD ||
|+---------------+|
The tag value is correct, but the formatting is weird, and when combined with more other rows the table gets really ugly.
The filter part of your query ([?Key == 'Name']) is creating what JMESPath is describing as a projection.
You will have to reset this projection in order to extract a single string out of it.
Resetting a projection can be achieved using pipes.
Projections are an important concept in JMESPath. However, there are times when projection semantics are not what you want. A common scenario is when you want to operate of the result of a projection rather than projecting an expression onto each element in the array.
For example, the expression people[*].first will give you an array containing the first names of everyone in the people array. What if you wanted the first element in that list? If you tried people[*].first[0] that you just evaluate first[0] for each element in the people array, and because indexing is not defined for strings, the final result would be an empty array, []. To accomplish the desired result, you can use a pipe expression, <expression> | <expression>, to indicate that a projection must stop.
Source: https://jmespath.org/tutorial.html#pipe-expressions
So your issue is very close to what they are describing here in the documentation and the reset of that projection can be achieved using:
Tags[?Key =='Name']|[0].Value
or, with:
Tags[?Key =='Name'].Value | [0]
which are two strictly identical queries.
Given the JSON:
{
"Reservations": [
{
"Instances": [
{
"Tags": [
{
"Key": "Name",
"Value": "Elliott-TKD"
},
{
"Key": "Foo",
"Value": "Bar"
}
]
}
]
}
]
}
The query
Reservations[].Instances[0].[{ Property: `Name`, Value: Tags[?Key == `Name`]|[0].Value }]
Will give the expected
[
[
{
"Property": "Name",
"Value": "Elliott-TKD"
}
]
]
So it will render properly in your table
------------------------------
| DescribeInstance |
+------------+---------------+
| Property | Value |
+------------+---------------+
| Name | Elliott-TKD |
+------------+---------------+

Extract domain from url using PostgreSQL

I need to extract the domain name for a list of urls using PostgreSQL. In the first version, I tried using REGEXP_REPLACE to replace unwanted characters like www., biz., sports., etc. to get the domain name.
SELECT REGEXP_REPLACE(url, ^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain",
COUNT(DISTINCT(user)) AS "Unique Users"
FROM db
GROUP BY 1
ORDER BY 2 DESC;
This seems unfavorable as the query needs to be constantly updated for list of unwanted words.
I did try https://stackoverflow.com/a/21174423/10174021 to extract from the end of the line using PostgreSQL REGEXP_SUBSTR but, I'm getting blank rows in return. Is there a more better way of doing this?
A dataset sample to try with:
CREATE TABLE sample (
url VARCHAR(100) NOT NULL);
INSERT INTO sample url)
VALUES
("sample.co.uk"),
("www.sample.co.uk"),
("www3.sample.co.uk"),
("biz.sample.co.uk"),
("digital.testing.sam.co"),
("sam.co"),
("m.sam.co");
Desired output
+------------------------+--------------+
| url | domain |
+------------------------+--------------+
| sample.co.uk | sample.co.uk |
| www.sample.co.uk | sample.co.uk |
| www3.sample.co.uk | sample.co.uk |
| biz.sample.co.uk | sample.co.uk |
| digital.testing.sam.co | sam.co |
| sam.co | sam.co |
| m.sam.co | sam.co |
+------------------------+--------------+
So, I've found the solution using Jeremy and Rémy Baron's answer.
Extract all the public suffix from public suffix and store into
a table which I labelled as tlds.
Get the unique urls in the dataset and match to its TLD.
Extract the domain name using regexp_replace (used in this query) or alternative regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld"). The final output:
The SQL query is as below:
WITH stored_tld AS(
SELECT
DISTINCT(s.url),
FIRST_VALUE(t.domain) over (PARTITION BY s.url ORDER BY length(t.domain) DESC
rows between unbounded preceding and unbounded following) AS "tld"
FROM sample s
JOIN tlds t
ON (s.url like '%%'||domain))
SELECT
t1.url,
CASE WHEN t1."tld" IS NULL THEN t1.url ELSE regexp_replace(t1.url,'(.*\.)((.[a-z]*).*'||replace(t1."tld",'.','\.')||')','\2')
END AS "extracted_domain"
FROM(
SELECT a.url,st."tld"
FROM sample a
LEFT JOIN stored_tld st
ON a.url = st.url
)t1
Links to try: SQL Tester
You can try this :
with tlds as (
select * from (values('.co.uk'),('.co'),('.uk')) a(tld)
) ,
sample as (
select * from (values ('sample.co.uk'),
('www.sample.co.uk'),
('www3.sample.co.uk'),
('biz.sample.co.uk'),
('digital.testing.sam.co'),
('sam.co'),
('m.sam.co')
) a(url)
)
select url,regexp_replace(url,'(.*\.)(.*'||replace(tld,'.','\.')||')','\2') "domain" from (
select distinct url,first_value(tld) over (PARTITION BY url order by length(tld) DESC) tld
from sample join tlds on (url like '%'||tld)
) a
I use split_part(url,'/',3) for this :
select split_part('https://stackoverflow.com/questions/56019744', '/', 3) ;
output
stackoverflow.com

django rawsql postgres nested json/jsonb query

I have a jsonb structure on postgres named data where each row (there are around 3 million of them) looks like this:
[
{
"number": 100,
"key": "this-is-your-key",
"listr": "20 Purple block, THE-CITY, Columbia",
"realcode": "LA40",
"ainfo": {
"city": "THE-CITY",
"county": "Columbia",
"street": "20 Purple block",
"var_1": ""
},
"booleanval": true,
"min_address": "20 Purple block, THE-CITY, Columbia LA40"
},
.....
]
I would like to query the min_address field in the fastest possible way. In Django I tried to use:
APModel.objects.filter(data__0__min_address__icontains=search_term)
but this takes ages to complete (also, "THE-CITY" is in uppercase, so, I have to use icontains here. I tried dropping to rawsql like so:
cursor.execute("""\
SELECT * FROM "apmodel_ap_model"
WHERE ("apmodel_ap_model"."data"
#>> array['0', 'min_address'])
#> %s \
""",\
[json.dumps([{'min_address': search_term}])]
)
but this throws me strange errors like:
LINE 4: #> '[{"min_address": "some lane"}]'
^
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.
I am wondering what is the fastest way I can query the field min_address by using rawsql cursors.
Late answer, probably it won't help OP anymore. Also I'm not at all an expert in Postgres/JSONB, so this might be a terrible idea.
Given this setup;
so49263641=# \d apmodel_ap_model;
Table "public.apmodel_ap_model"
Column | Type | Collation | Nullable | Default
--------+-------+-----------+----------+---------
data | jsonb | | |
so49263641=# select * from apmodel_ap_model ;
data
-------------------------------------------------------------------------------------------
[{"number": 1, "min_address": "Columbia"}, {"number": 2, "min_address": "colorado"}]
[{"number": 3, "min_address": " columbia "}, {"number": 4, "min_address": "California"}]
(2 rows)
The following query "expands" objects from data arrays to individual rows. Then it applies pattern matching to the min_address field.
so49263641=# SELECT element->'number' as number, element->'min_address' as min_address
FROM apmodel_ap_model ap, JSONB_ARRAY_ELEMENTS(ap.data) element
WHERE element->>'min_address' ILIKE '%col%';
number | min_address
--------+---------------
1 | "Columbia"
2 | "colorado"
3 | " columbia "
(3 rows)
However, I doubt it will perform well on large datasets as the min_address values are casted to text before pattern matching.
Edit: Some great advice here on indexing JSONB data for search https://stackoverflow.com/a/33028467/1284043

dc:Creator string literal vs. regex FILTER in SPARQL

I am using Europeana's Virtuoso SPARQL Endpoint.
I have been trying to search in SPARQL for content about a specific contributor. To my understanding, this could be carried out this way:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title
WHERE {
?objectInfo dc:title ?title .
?objectInfo dc:creator 'Picasso' .
}
Nevertheless, I get nothing in return.
Alternatively, I used FILTER regex to search for the literal.
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title ?creator
WHERE {
?objectInfo dc:title ?title .
?objectInfo dc:creator ?creator .
FILTER regex(?creator, 'Picasso')
}
This actually worked very well and returned correctly the results.
My question is: Is it possible to produce the SPARQL query without using FILTER to search the work of a particular artist?
Many thanks.
I don't think there are any objects with 'Picasso' literally as the creator. So a regex filter is a good choice, but slow.
Here's a way to find the strings your regex is matching:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?creator, (count(?creator) as ?ccount)
WHERE {
?objectInfo dc:title ?title .
?objectInfo dc:creator ?creator .
FILTER regex(?creator, 'Picasso')
}
group by ?creator
order by ?ccount
It might have been easier for you to see that if your had displayed all variables in the select statement:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT *
WHERE {
?objectInfo dc:title ?title .
?objectInfo dc:creator ?creator .
FILTER regex(?creator, 'Picasso')
}
If you don't want to use a regex filter, you could enumerate all of the Picasso variants you are looking for:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT *
WHERE {
values ?creator { "Picasso, Pablo" "Pablo Picasso" } .
?objectInfo dc:title ?title .
?objectInfo dc:creator ?creator
}
bif:contains works on this endpoint and is pretty fast:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT *
WHERE {
?objectInfo dc:title ?title .
?objectInfo dc:creator ?creator .
?creator bif:contains 'Picasso'
#FILTER regex(?creator, 'Picasso')
}
1) Your first query has unconnected triple patterns.
2) I guess and according to the vocabulary description, dc:creator expects a resource, i.e. a URI. Using the URI of the entity Picasso doesn't work?
+--------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Term Name: creator | |
| URI: | http://purl.org/dc/elements/1.1/creator |
| Label: | Creator |
| Definition: | An entity primarily responsible for making the resource. |
| Comment: | Examples of a Creator include a person, an organization, or a service. Typically, the name of a Creator should be used to indicate the entity. |
+--------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
It would good to see your data in order to decide whether FILTER on literals is necessary or not.