How to create array of nested fields and arrays in BigQuery - google-cloud-platform

I am trying to create a table in BigQuery according to a JSON schema which I will put in GCS and push to a Pub/Sub topic from there. I need to create some arrays and nested fields in order to achieve that.
By using STRUCT and ARRAY_AGG I can build arrays of structs, but I couldn't figure out how to create a struct of arrays.
Imagine that I have a JSON schema as below:
{
  "vacancies": {
    "id": "12",
    "timestamp": "2019-08-22T04:04:26Z",
    "version": "1.0",
    "positionOpening": {
      "documentId": {
        "value": "505"
      },
      "statusCode": "Closed",
      "registrationDate": "2014-05-07T16:11:22Z",
      "lastUpdated": "2014-05-07T16:14:56Z",
      "positionProfiles": [
        {
          "positionTitle": "Data Scientist for international company",
          "positionQualifications": [
            {
              "experienceSummary": [
                {"measure": {"value": "10", "unitCode": "ANN"}},
                {"measure": {"value": "4", "unitCode": "ANN"}}
              ],
              "educationRequirement": {
                "programs": ["Physics", "Computer Science"],
                "programConcentrations": ["Data Analysis", "Python Programming"]
              },
              "languageRequirement": [
                {
                  "competencyName": "English",
                  "requiredProficiencyLevel": {"scoresNumeric": [{"value": "100"}, {"value": "95"}]}
                },
                {
                  "competencyName": "French",
                  "requiredProficiencyLevel": {"scoresNumeric": [{"value": "95"}, {"value": "70"}]}
                }
              ]
            }
          ]
        }
      ]
    }
  }
}
How can I create a SQL query to get this as a result?
Thanks in advance for the help!

You might have to build a temp table to do this.
The first CREATE statement takes a denormalized table and converts it into a table with an array of structs.
The second CREATE statement takes that temp table and embeds the array inside an (array of) struct(s).
You could remove the internal STRUCT from the first query and the ARRAY_AGG wrapper from the second query to build a strict struct of arrays (a sketch of that variant follows the two statements below). But this should be flexible enough that you can create an array of structs, a struct of arrays, or any combination of the two, as many times as you want, up to the 15 levels of nesting that BigQuery allows.
The final outcome would be a table with one column (column1) of a standard datatype, as well as an array of structs called OutsideArrayOfStructs. That struct has two fields of "standard" datatypes, as well as an array of structs called InsideArrayOfStructs.
CREATE OR REPLACE TABLE dataset.tempTable AS (
  SELECT
    column1,
    column2,
    column3,
    ARRAY_AGG(
      STRUCT(
        ArrayObjectColumn1,
        ArrayObjectColumn2,
        ArrayObjectColumn3
      )
    ) AS InsideArrayOfStructs
  FROM
    sourceDataset.sourceTable
  GROUP BY
    column1,
    column2,
    column3
);

CREATE OR REPLACE TABLE dataset.finalTable AS (
  SELECT
    column1,
    ARRAY_AGG(
      STRUCT(
        column2,
        column3,
        InsideArrayOfStructs
      )
    ) AS OutsideArrayOfStructs
  FROM
    dataset.tempTable
  GROUP BY
    column1
);
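For the struct-of-arrays shape the question asks about, here is a minimal sketch of that variant. It is run through the BigQuery Python client (since the poster is already moving data around with GCS and Pub/Sub), and the dataset, table, and column names are placeholders rather than anything from the original post:
# Minimal sketch of the struct-of-arrays variant described above.
# Assumes the google-cloud-bigquery client library; the dataset, table and
# column names are placeholders to adapt to your own schema.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
CREATE OR REPLACE TABLE dataset.structOfArraysTable AS (
  SELECT
    column1,
    -- A STRUCT whose fields are ARRAYs: the struct-of-arrays shape.
    STRUCT(
      ARRAY_AGG(column2) AS column2Values,
      ARRAY_AGG(column3) AS column3Values
    ) AS StructOfArrays
  FROM sourceDataset.sourceTable
  GROUP BY column1
);
"""

client.query(sql).result()  # wait for the DDL statement to finish
The same pattern nests: put the ARRAY_AGG(STRUCT(...)) from the first statement inside the outer STRUCT if you need arrays of structs inside the struct of arrays.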

Related

How to query list of maps in DynamoDB table

I have a DynamoDB table with InvId (primary partition key) and PgNo (primary sort key). There is an attribute in the table called Details, which is a list of maps, and every map has an attribute called ChargeId. How can I query the map that has a particular ChargeId? Alternatively, how could I design the table so that I can pass the InvId and ChargeId to fetch the matching item from the Details list?
{
  "Anytime": 0,
  "Details": [
    {
      "AccNum": "ACCZ4402255319",
      "Amt": 49.67,
      "ChargeId": 1652849999
    },
    {
      "AccNum": "ACCZ4402255319",
      "Amt": 50,
      "ChargeId": 1652849991
    },
    {
      "AccNum": "ACCZ4402255319",
      "Amt": 49.67,
      "ChargeId": 1652849992
    },
    {
      "AccNum": "ACCZ4402255319",
      "Amt": 50,
      "ChargeId": 1652849993
    }
  ],
  "ExpTime": 253402300800,
  "InvId": "305_40225614",
  "PgNo": 1,
  "SubsId": "406890"
}
You need to use a filter expression. It won't be index-optimized, so be careful.
See DynamoDB: How to use a query filter to check for conditions in a MAP for a code sample.
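As a rough illustration only (the key and attribute names come from the question; the table name and the boto3 usage are assumptions): a filter expression can address a nested path such as a fixed list index, but DynamoDB document paths cannot wildcard across list elements, so matching "any element with this ChargeId" still means picking it out after the item is read.
# Rough sketch (boto3). The table name "Invoices" is an assumption; the key
# and attribute names come from the question.
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("Invoices")

# A filter expression can only address a fixed list index (e.g. Details[0]);
# it is applied after items are read, so read capacity is still consumed.
resp = table.query(
    KeyConditionExpression=Key("InvId").eq("305_40225614") & Key("PgNo").eq(1),
    FilterExpression=Attr("Details[0].ChargeId").eq(1652849999),
)

# To find the map with a particular ChargeId anywhere in the list, fetch the
# item by its keys and pick the element out client-side instead:
item = table.get_item(Key={"InvId": "305_40225614", "PgNo": 1}).get("Item", {})
charge = next(
    (d for d in item.get("Details", []) if d.get("ChargeId") == 1652849999),
    None,
)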

Create a table in AWS athena parsing dynamic keys in nested json

I have JSON files in which each line has the format below, and I would like to parse this data and query it as a table using AWS Athena.
{
  "123": {
    "abc": {
      "id": "test",
      "data": "ipsum lorum"
    },
    "abd": {
      "id": "test_new",
      "data": "lorum ipsum"
    }
  }
}
Can a table with this format be created for the above data? The documentation mentions that struct can be used for parsing nested JSON; however, there are no examples for dynamic keys.
You can cast the JSON to a map or an array and transform it in any way you want. In this case you can use map_values and CROSS JOIN UNNEST to produce rows from the JSON objects:
WITH test AS (
  SELECT '{ "123": { "abc": { "id": "test", "data": "ipsum lorum" }, "abd": { "id": "test_new", "data": "lorum ipsum" } } }' AS str
),
struct_like AS (
  SELECT CAST(json_parse(str) AS map<varchar, map<varchar, map<varchar, varchar>>>) AS m
  FROM test
),
flat AS (
  SELECT item
  FROM struct_like
  CROSS JOIN UNNEST(map_values(m)) AS t(item)
)
SELECT
  key,
  value['id'] AS id,
  value['data'] AS data
FROM flat
CROSS JOIN UNNEST(item) AS t(key, value)
The result:
key  id        data
abc  test      ipsum lorum
abd  test_new  lorum ipsum

Azure Cosmos query to convert into List

This is my JSON data, which is stored in Cosmos DB:
{
  "id": "e064a694-8e1e-4660-a3ef-6b894e9414f7",
  "Name": "Name",
  "keyData": {
    "Keys": [
      "Government",
      "Training",
      "support"
    ]
  }
}
Now I want to write a query to eliminate the keyData and get only the Keys (like below)
{
  "userid": "e064a694-8e1e-4660-a3ef-6b894e9414f7",
  "Name": "Name",
  "Keys": [
    "Government",
    "Training",
    "support"
  ]
}
So far I have tried a query like
SELECT c.id,k.Keys FROM c
JOIN k in c.keyPhraseBatchResult
Which is not working.
Update 1:
After trying Sajeetharan's suggestion I can now get the result, but the issue is that it produces another JSON object inside the array, like:
{
  "id": "ee885fdc-9951-40e2-b1e7-8564003cd554",
  "keys": [
    {
      "serving": "Government"
    },
    {
      "serving": "Training"
    },
    {
      "serving": "support"
    }
  ]
}
Is there any way to extract only the array without the key-value pairs again?
{
  "userid": "e064a694-8e1e-4660-a3ef-6b894e9414f7",
  "Name": "Name",
  "Keys": [
    "Government",
    "Training",
    "support"
  ]
}
You could try this one,
SELECT C.id, ARRAY(SELECT VALUE serving FROM serving IN C.keyData.Keys) AS Keys FROM C
You can use a Cosmos DB stored procedure to produce your desired format, based on Sajeetharan's SQL.
function sample() {
  var collection = getContext().getCollection();
  var isAccepted = collection.queryDocuments(
    collection.getSelfLink(),
    'SELECT C.id, ARRAY(SELECT serving FROM serving IN C.keyData.Keys) AS keys FROM C',
    function (err, feed, options) {
      if (err) throw err;
      if (!feed || !feed.length) {
        var response = getContext().getResponse();
        response.setBody('no docs found');
      } else {
        var response = getContext().getResponse();
        // Flatten each {serving: "..."} object into a plain string array.
        for (var i = 0; i < feed.length; i++) {
          var keyArray = feed[i].keys;
          var array = [];
          for (var j = 0; j < keyArray.length; j++) {
            array.push(keyArray[j].serving);
          }
          feed[i].keys = array;
        }
        response.setBody(feed);
      }
    });
  if (!isAccepted) throw new Error('The query was not accepted by the server.');
}

How to search comma separated data in mongodb

I have a movie database with different fields. The genre field contains a comma-separated string like:
{genre: 'Action, Adventure, Sci-Fi'}
I know I can use a regular expression to find the matches. I also tried:
{'genre': {'$in': genre}}
The problem is the running time: it takes a lot of time to return a query result. The database has about 300K documents and I have a normal index on the 'genre' field.
I would say use map-reduce to create a separate collection that stores the genre as an array, with the values coming from splitting the comma-separated string; you can then run the map-reduce job and query the output collection.
For example, I've inserted some sample documents into the foo collection:
db.foo.insert([
  {genre: 'Action, Adventure, Sci-Fi'},
  {genre: 'Thriller, Romantic'},
  {genre: 'Comedy, Action'}
])
The following map/reduce operation will then produce the collection against which you can run performant queries:
map = function() {
  var array = this.genre.split(/\s*,\s*/);
  emit(this._id, array);
}

reduce = function(key, values) {
  return values;
}

result = db.runCommand({
  "mapreduce": "foo",
  "map": map,
  "reduce": reduce,
  "out": "foo_result"
});
Querying is then straightforward, leveraging a multikey index on the value field:
db.foo_result.createIndex({"value": 1});
var genre = ['Action', 'Adventure'];
db.foo_result.find({'value': {'$in': genre}})
Output:
/* 0 */
{
  "_id" : ObjectId("55842af93cab061ff5c618ce"),
  "value" : [
    "Action",
    "Adventure",
    "Sci-Fi"
  ]
}

/* 1 */
{
  "_id" : ObjectId("55842af93cab061ff5c618d0"),
  "value" : [
    "Comedy",
    "Action"
  ]
}
Well you cannot really do this efficiently so I'm glad you used the tag "performance" on your question.
If you want to do this with the "comma separated" data kept in place in a string, these are your options:
Either with a regex in general, if it suits:
db.collection.find({ "genre": { "$regex": "Sci-Fi" } })
But not really efficient.
Or by JavaScript evaluation via $where:
db.collection.find(function() {
  return this.genre.split(",")
    .map(function(el) {
      return el.replace(/^\s+/, "");
    })
    .indexOf("Sci-Fi") != -1;
})
Not really efficient either, and probably about equal to the above.
Or, better yet, something that can use an index: separate the values into an array and use a basic query:
{
  "genre": [ "Action", "Adventure", "Sci-Fi" ]
}
With an index:
db.collection.ensureIndex({ "genre": 1 })
Then query:
db.collection.find({ "genre": "Sci-Fi" })
When you do it that way, it's that simple. And really efficient.
You make the choice.

Filter duplicates in MongoDB C++

I am looking to find all duplicates in my collection and flag them based on the date. The following was my attempt, but I am not sure how to use cmdResult within update. Any clues?
//filter duplicates
bson::bo cmdResult;
bool ok = c.runCommand(dbcol, BSON("distinct" << "date"), cmdResult);
c.update(dbcol,Query("date"<<cmdResult<<NOT<<"_id"), BSON("$set"<<BSON("noise"<<"true")), false, true);
The "distinct" command will return you a list of all unique "date" values there are in the collection. But what you need is a list of "date" values that occur more than once.
You can get this list using the aggregate command, by grouping by "date" and counting the entries, then matching for counts > 1:
aggregate([
  { $group: { "_id": "$date", "count": { $sum: 1 } } },
  { $match: { "count": { $gt: 1 } } }
])
You would then update your collection (multi:true) by querying for "date" IN that list, setting the "noise" field:
update( {"name": {$in: [<list>]} },{$set: {"noise": true} }, true, false )
For help on aggregation, see http://docs.mongodb.org/manual/reference/aggregation/
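The question is about the C++ driver, but it may help to see the two steps wired together; here is a short pymongo sketch of the same logic (the database and collection names are assumptions, the "date" and "noise" fields follow the question):
# Sketch of the two-step flow in pymongo. The database/collection names are
# assumptions; the "date" and "noise" fields follow the question.
from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]

# Step 1: find the "date" values that occur more than once.
dupes = coll.aggregate([
    {"$group": {"_id": "$date", "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
])
dup_dates = [d["_id"] for d in dupes]

# Step 2: flag every document whose date is in that list (multi-document update).
coll.update_many({"date": {"$in": dup_dates}}, {"$set": {"noise": True}})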