I have JSON files in which each line has the format below, and I would like to parse this data and index it into a table using AWS Athena.
{
"123": {
"abc": {
"id": "test",
"data": "ipsum lorum"
},
"abd": {
"id": "test_new",
"data": "lorum ipsum"
}
}
}
Can a table with this format be created for the above data? The documentation mentions that struct can be used for parsing nested JSON, but there are no examples for dynamic keys.
You could cast the JSON to a map or array and transform it any way you want. In this case you could use map_values and CROSS JOIN UNNEST to produce rows from the JSON objects:
with test AS
(SELECT '{ "123": { "abc": { "id": "test", "data": "ipsum lorum" }, "abd": { "id": "test_new", "data": "lorum ipsum" } } }' AS str),
struct_like AS
(SELECT cast(json_parse(str) AS map<varchar,
map<varchar,
map<varchar,
varchar>>>) AS m
FROM test),
flat AS
(SELECT item
FROM struct_like
CROSS JOIN UNNEST(map_values(m)) AS t(item))
SELECT
key,
value['id'] AS id,
value['data'] AS data
FROM flat
CROSS JOIN unnest(item) AS t(key, value)
The result:
key id data
abc test ipsum lorum
abd test_new lorum ipsum
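As for creating an actual table over files like this in S3: the JSON SerDes map top-level keys to column names, which does not work when the top-level keys are dynamic. A common workaround, sketched here with s3://my-bucket/data/ as a hypothetical location, is to expose each line as a single string column and reuse the cast above:
-- each line lands in the single str column,
-- assuming the data itself contains no tab characters
CREATE EXTERNAL TABLE raw_json (str string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/data/';
The test CTE in the query above then simply becomes SELECT str FROM raw_json.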
Following the documentation, I was trying to load JSON data from S3 to Redshift.
I created a JSONPaths file and validated it (on https://jsonpath.curiousconcept.com/# with the expression $.*):
{
"jsonpaths": [
"$['_record_id']",
"$['_title']",
"$['_server_updated_at']",
"$['_project']",
"$['_assigned_to']",
"$['_updated_by']",
"$['_latitude']",
"$['_longitude']",
"$['date']",
"$['date_received']",
"$['inspection_type']"
]
}
and sample data
[{
"_record_id": "cf68c930-b7c8-4c3f-a04c-58b49f383cca",
"_title": "FAIL, 128",
"_server_updated_at": "2021-08-03T15:06:05.000Z",
"_project": null,
"_assigned_to": null,
"_updated_by": "XYZ",
"_geometry": {
"type": "Point",
"coordinates": [-74.5048900706, 40.3395964363]
},
"_latitude": 40.3395964363,
"_longitude": -74.5048900706,
"date": "2021-08-03T00:00:00.000Z",
"date_received": "2021-07-30T00:00:00.000Z",
"inspection_type": "New Product Inspection"
}, {
"_record_id": "9c8af79a-eaaf-405e-8c42-62560fdf15d5",
"_title": "PASS, 52",
"_server_updated_at": "2021-08-03T14:56:23.000Z",
"_project": null,
"_assigned_to": null,
"_updated_by": "XYZ",
"_geometry": null,
"_latitude": null,
"_longitude": null,
"date": "2021-08-03T00:00:00.000Z",
"date_received": "2021-07-30T00:00:00.000Z",
"inspection_type": "New Product Inspection"
}]
When I run this COPY command
copy rab.rab_dbo.shipmentreceivinglog2
from 's3://<bucket>/data_report.json'
iam_role 'arn:aws:iam::1234567890:role/RedshiftFileTransfer'
json 's3://<bucket>/JSONPaths.json';
I get ERROR: Load into table 'shipmentreceivinglog2' failed. Check 'stl_load_errors' system table for details. When I run select * from stl_load_errors; I see
Invalid JSONPath format: Member is not an object. for s3://<bucket>/data_report.json
What's wrong with my JSONPaths file?
The issue is with your data file. Redshift JSON input data needs to be a set of JSON records simply concatenated together. You have a file that is one JSON array of objects, and an array is one thing. You need to remove the enclosing [] and the commas between elements. Your sample data should look like this:
{
"_record_id": "cf68c930-b7c8-4c3f-a04c-58b49f383cca",
"_title": "FAIL, 128",
"_server_updated_at": "2021-08-03T15:06:05.000Z",
"_project": null,
"_assigned_to": null,
"_updated_by": "XYZ",
"_geometry": {
"type": "Point",
"coordinates": [-74.5048900706, 40.3395964363]
},
"_latitude": 40.3395964363,
"_longitude": -74.5048900706,
"date": "2021-08-03T00:00:00.000Z",
"date_received": "2021-07-30T00:00:00.000Z",
"inspection_type": "New Product Inspection"
}
{
"_record_id": "9c8af79a-eaaf-405e-8c42-62560fdf15d5",
"_title": "PASS, 52",
"_server_updated_at": "2021-08-03T14:56:23.000Z",
"_project": null,
"_assigned_to": null,
"_updated_by": "XYZ",
"_geometry": null,
"_latitude": null,
"_longitude": null,
"date": "2021-08-03T00:00:00.000Z",
"date_received": "2021-07-30T00:00:00.000Z",
"inspection_type": "New Product Inspection"
}
An easy way to get this is to pump the JSON you have through jq:
jq '.[]' file.json
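If you prefer one record per line (which keeps the line numbers in stl_load_errors meaningful), jq's -c compact-output flag does that; fixed.json here is just a hypothetical output name:
jq -c '.[]' data_report.json > fixed.json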
So, I access an API and receive data in JSON format as shown below. I'm looking to draw a chart with the dates on the x axis and the Data values on the y axis. I don't know which dates I will be getting, and there will be plenty of them. I want to save that data in an array/vector of objects, each containing a structure with the date and the value at that date. Because this isn't a fixed key-value relation, I do not know how to parse it. Do you?
Structure
typedef struct
{
    float Data;
    std::string date;
} Data_t;
JSON received
"Information: Data": {
"2021-06-01 16:00:01": {
"Data": "139.5578"
},
"2021-05-28": {
"Data": "137.7645"
},
"2021-05-21": {
"Data": "135.8931"
},
"2021-05-14": {
"Data": "133.6110"
}
...
...
...
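This actually is a key-value relation; the keys are just dates that are not known in advance, so you iterate over the object's members instead of indexing by name. A minimal sketch, assuming the nlohmann/json library and that the fragment above sits under an "Information: Data" key of a complete JSON document:
#include <iostream>
#include <string>
#include <vector>
#include <nlohmann/json.hpp> // assumption: the nlohmann/json single-header library

typedef struct
{
    float Data;
    std::string date;
} Data_t;

int main()
{
    // stand-in for the API response body; the date keys are not known in advance
    std::string body = R"({"Information: Data": {
        "2021-06-01 16:00:01": {"Data": "139.5578"},
        "2021-05-28": {"Data": "137.7645"}
    }})";

    nlohmann::json j = nlohmann::json::parse(body);

    std::vector<Data_t> series;
    // items() yields each (key, value) member, so dynamic keys need no special handling
    for (const auto& item : j["Information: Data"].items())
    {
        Data_t d;
        d.date = item.key();
        d.Data = std::stof(item.value()["Data"].get<std::string>());
        series.push_back(d);
    }

    for (const auto& d : series)
        std::cout << d.date << " -> " << d.Data << "\n";
}
Since nlohmann::json stores object members in sorted key order, this zero-padded date format conveniently comes out in chronological order.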
We are having a problem with the JSON libname engine: basically, our dataset carries over the wrong field value across object hierarchies.
Here is a simple program:
filename resp "dataset.json";
filename pmap "map.json";
run;
libname example JSON fileref=resp map=pmap;
proc datasets lib=example;
run;
data objects;
set example.objects;
run;
The JSON dataset "dataset.json" looks like this:
{
"objects": [
{
"field": "wrong_answer"
},
{
"objectHierarchy": [
{
"map_to_this_level": "demo2"
},
{
"map_to_this_level": "demo1"
},
{
"map_to_this_level": "demo"
}
],
"field": "right_answer"
}
]
}
and the map "map.json" looks like this:
{
"DATASETS": [
{
"DSNAME": "objects",
"TABLEPATH": "/root/objects/objectHierarchy",
"VARIABLES": [
{
"NAME": "map_to_this_level",
"TYPE": "CHARACTER",
"PATH": "/root/objects/objectHierarchy/map_to_this_level",
"CURRENT_LENGTH": 10
},{
"NAME": "field1",
"TYPE": "CHARACTER",
"PATH": "/root/objects/field",
"CURRENT_LENGTH": 12
}
]
}
]
}
The resulting dataset "example.objects" looks like this:
map_to_this_level field
demo2 wrong_answer
demo1
demo right_answer
My question is: why does the wrong_answer value from the field on the first object, which has an empty objectHierarchy, get mapped onto the row of data from the next object, which has actual values for its objectHierarchy field?
The data should look like this:
map_to_this_level field
demo2 right_answer
demo1
demo right_answer
I presume a human expectation of:
field1 map_to_this_level
------------ -----------------
wrong_answer <missing>
right_answer demo2
right_answer demo1
right_answer demo
The JSON library engine is a serial decoder. The order of the properties in the JSON being parsed does not mesh well with the map specified and the operation of the internal map interpreter (i.e. the SAS JSON library engine black box).
Consider this example with these small changes:
in the JSON, field comes before objectHierarchy
in the JSON, wrong_answer has an empty objectHierarchy array. Note: if objectHierarchy were not present, no row would be output for wrong_answer
in the map, the field1 value is retained using the SAS JSON map feature DATASETS/VARIABLES/OPTIONS: ["RETAIN"]
filename response catalog "work.json.sandbox.source";
data _null_;
file response; input; put _infile_;
datalines4;
{
"objects": [
{
"field": "wrong_answer"
,
"objectHierarchy": []
},
{
"field": "right_answer"
,
"objectHierarchy": [
{
"map_to_this_level": "demo2"
},
{
"map_to_this_level": "demo1"
},
{
"map_to_this_level": "demo"
}
]
}
]
}
;;;;
run;
filename pmap catalog "work.json.pmap.source";
data _null_;
file pmap; input; put _infile_;
datalines4;
{
"DATASETS": [
{
"DSNAME": "objects",
"TABLEPATH": "/root/objects/objectHierarchy",
"VARIABLES": [
{
"NAME": "map_to_this_level",
"TYPE": "CHARACTER",
"PATH": "/root/objects/objectHierarchy/map_to_this_level",
"CURRENT_LENGTH": 10
},
{
"NAME": "field1",
"TYPE": "CHARACTER",
"PATH": "/root/objects/field",
"CURRENT_LENGTH": 12
, "OPTIONS": ["RETAIN"]
}
]
}
]
}
;;;;
run;
libname example JSON fileref=response map=pmap;
ods listing; options nocenter nodate nonumber formdlim='-'; title;
dm 'clear output';
proc datasets lib=example;
run;
proc print data=example.alldata;
run;
proc print data=example.objects;
run;
dm 'output';
Output
map_to_
this_
Obs level field1
1 wrong_answer
2 demo2 right_answer
3 demo1 right_answer
4 demo right_answer
If your JSON cannot be trusted to be aligned with the mappings processed by the SAS JSON library engine, you will have to either:
work with the JSON provider, or
find an alternative interpreting mediary (Python, C#, etc.) that can output modified JSON, or an alternate interpreted form such as CSV, that SAS can consume (one possibility is sketched below).
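For instance, jq (used above for the Redshift case) can push the parent field value down into each objectHierarchy element before SAS ever sees the file, so the serial decoder has nothing to misalign. A sketch against the dataset.json shown earlier; treating a missing or empty objectHierarchy as a single blank row is my assumption:
jq -c '.objects[] as $o
  | (($o.objectHierarchy // []) | if length == 0 then [{}] else . end)[]
  | {field: $o.field, map_to_this_level: .map_to_this_level}' dataset.json
Each output line is then a flat record ({"field":"wrong_answer","map_to_this_level":null} first), which a much simpler one-level map can consume without the RETAIN workaround.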
I am trying to create a table in BigQuery according to a JSON schema which I will put in GCS and push to a Pub/Sub topic from there. I need to create some arrays and nested fields in order to achieve that.
By using STRUCT and ARRAY_AGG I can build arrays of structs, but I couldn't figure out how to create a struct of arrays.
Imagine that I have a json schema as below:
{
"vacancies": {
"id": "12",
"timestamp": "2019-08-22T04:04:26Z",
"version": "1.0",
"positionOpening": {
"documentId": {
"value": "505"
},
"statusCode": "Closed",
"registrationDate": "2014-05-07T16:11:22Z",
"lastUpdated": "2014-05-07T16:14:56Z",
"positionProfiles": [
{
"positionTitle": "Data Scientist for international company",
"positionQualifications": [
{
"experienceSummary": [
{"measure": {"value": "10","unitCode": "ANN"}},
{"measure": {"value": "4","unitCode": "ANN"}}
],
"educationRequirement": {
"programs": ["Physics","Computer Science"],
"programConcentrations": ["Data Analysis","Python Programming"]
},
"languageRequirement": [
{
"competencyName": "English",
"requiredProficiencyLevel": {"scoresNumeric": [{"value": "100"},{"value": "95"}]}
},
{
"competencyName": "French",
"requiredProficiencyLevel": {"scoresNumeric": [{"value": "95"},{"value": "70"}]}
}
]
}
]
}
]
}
}
}
How can I create a SQL query to get this as a result?
Thanks in advance for the help!
You might have to build a temp table to do this.
The first CREATE statement takes a denormalized table and converts it to a table with an array of structs.
The second CREATE statement takes that temp table and embeds the array into an (array of) struct(s).
You could remove the internal struct from the first query, and the array wrapper from the second query, to build a struct of arrays. But this should be flexible enough that you can create an array of structs, a struct of arrays, or any combination of the two, as many times as you want, up to the 15 levels of nesting that BigQuery allows.
The final outcome of this code would be a table with one column (column1) of a standard datatype, as well as an array of structs called OutsideArrayOfStructs. That struct has two columns of "standard" datatypes, as well as an array of structs called InsideArrayOfStructs.
CREATE OR REPLACE TABLE dataset.tempTable as (
select
column1,
column2,
column3,
ARRAY_AGG(
STRUCT(
ArrayObjectColumn1,
ArrayObjectColumn2,
ArrayObjectColumn3
)
) as InsideArrayOfStructs
FROM
sourceDataset.sourceTable
GROUP BY
column1,
column2,
column3 )
CREATE OR REPLACE TABLE dataset.finalTable as (
select
column1,
ARRAY_AGG(
STRUCT(
column2,
column3,
InsideArrayOfStructs
)
) as OutsideArrayOfStructs
FROM
dataset.tempTable
GROUP BY
Column1 )
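To sanity-check the nesting, the final table can be flattened back out. A sketch reusing the illustrative column names from the statements above:
SELECT
  t.column1,
  o.column2,
  o.column3,
  i.ArrayObjectColumn1
FROM dataset.finalTable AS t,
  UNNEST(t.OutsideArrayOfStructs) AS o,
  UNNEST(o.InsideArrayOfStructs) AS i
Each row of the output corresponds to one element of InsideArrayOfStructs, with the parent values repeated.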
This is my JSON data, which is stored in Cosmos DB:
{
"id": "e064a694-8e1e-4660-a3ef-6b894e9414f7",
"Name": "Name",
"keyData": {
"Keys": [
"Government",
"Training",
"support"
]
}
}
Now I want to write a query that eliminates keyData and returns only the Keys (like below):
{
"userid": "e064a694-8e1e-4660-a3ef-6b894e9414f7",
"Name": "Name",
"Keys" :[
"Government",
"Training",
"support"
]
}
So far I have tried a query like
SELECT c.id,k.Keys FROM c
JOIN k in c.keyPhraseBatchResult
which is not working.
Update 1:
After trying Sajeetharan's suggestion I am now able to get the result, but the issue is that it produces another JSON object inside the array, like:
{
"id": "ee885fdc-9951-40e2-b1e7-8564003cd554",
"keys": [
{
"serving": "Government"
},
{
"serving": "Training"
},
{
"serving": "support"
}
]
}
Is there any way to extract only the array, without having the key-value pairs again?
{
"userid": "e064a694-8e1e-4660-a3ef-6b894e9414f7",
"Name": "Name",
"Keys" :[
"Government",
"Training",
"support"
]
}
You could try this one; SELECT VALUE returns the bare array elements instead of wrapping each one in a {"serving": ...} object:
SELECT C.id, ARRAY(SELECT VALUE serving FROM serving IN C.keyData.Keys) AS Keys FROM C
Alternatively, you can use a Cosmos DB stored procedure to produce your desired format, based on Sajeetharan's SQL.
function sample() {
var collection = getContext().getCollection();
var isAccepted = collection.queryDocuments(
collection.getSelfLink(),
'SELECT C.id,ARRAY(SELECT serving FROM serving IN C.keyData.Keys) AS keys FROM C',
function (err, feed, options) {
if (err) throw err;
if (!feed || !feed.length) {
var response = getContext().getResponse();
response.setBody('no docs found');
}
else {
var response = getContext().getResponse();
// unwrap each {serving: "..."} object into a plain string array
for(var i=0;i<feed.length;i++){
var keyArray = feed[i].keys;
var array = [];
for(var j=0;j<keyArray.length;j++){
array.push(keyArray[j].serving);
}
feed[i].keys = array;
}
response.setBody(feed);
}
});
if (!isAccepted) throw new Error('The query was not accepted by the server.');
}
Output:
[
    {
        "id": "e064a694-8e1e-4660-a3ef-6b894e9414f7",
        "keys": [
            "Government",
            "Training",
            "support"
        ]
    }
]