Glue JSON serialization and Athena query return the full record in each field - amazon-web-services

I've been trying for a long time to get Glue's crawlers to recognize the .json files in my S3 bucket so they can be queried in Athena, but after many changes to the settings, the best result I've got is still wrong.
Glue's crawler does recognize the column structure of my .json; however, when the table is queried in Athena, the columns are set up correctly but all the items end up on the same row, one item per column, as in the images below.
My Classifier setting is "$[*]".
The .json data structure
[
{ "id": "TMA.fid--4f6e8018_18596f01b4f\_-5e3a", "airspace_p": 1061, "codedistv1": "SFC", "fid": 299 },
{ "id": "TMA.fid--4f6e8018_18596f01b4f\_-5e39", "airspace_p": 408, "codedistv1": "STD", "fid": 766 },
{ "id": "TMA.fid--4f6e8018_18596f01b4f\_-5e38", "airspace_p": 901, "codedistv1": "STD", "fid": 806 },
...
]
Configuration result in Glue: (screenshot)
Result in Athena from this table: (screenshot)
I have already tried different .json structures and different classifiers, and I have changed and added the JsonSerDe.

If you can change the data source, use the JSON lines format instead, then run the Glue crawler without any custom classifier.
{"id": "TMA.fid--4f6e8018_18596f01b4f_-5e3a","airspace_p": 1061,"codedistv1": "SFC","fid": 299}
{"id": "TMA.fid--4f6e8018_18596f01b4f_-5e39","airspace_p": 408,"codedistv1": "STD","fid": 766}
{"id": "TMA.fid--4f6e8018_18596f01b4f_-5e38","airspace_p": 901,"codedistv1": "STD","fid": 806}
The cause of your issue is that Athena doesn't support custom JSON classifiers.
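If the source files are arrays like the one in the question and you can reprocess them, a small one-off script can rewrite them as JSON Lines before the crawler runs. A minimal sketch in Python; the file names are made up:
import json

# Rewrite an array-of-objects file as JSON Lines: one object per line,
# no enclosing brackets and no commas between records.
with open("airspaces.json") as src, open("airspaces.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")
Upload the resulting .jsonl files to the crawled S3 prefix in place of the array files and re-run the crawler without a custom classifier.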

Related

AWS AppFlow - rename source field

I have an AppFlow set up with Salesforce as the source and S3 as the destination. I am able to move all columns over by using a Map_all task type in the flow definition, and leaving the source fields empty.
However, now I want to move just a few columns to S3 and rename them as well. I was trying to do something like this:
"Tasks": [
{
"SourceFields": ["Website"],
"DestinationField": "Website",
"TaskType": "Map",
"TaskProperties": {},
},
{
"SourceFields": ["Account Name"],
"DestinationField": "AccountName",
"TaskType": "Map",
"TaskProperties": {},
},
{
"SourceFields": ["Account ID"],
"DestinationField": "AccountId",
"TaskType": "Map",
"TaskProperties": {},
}
],
but I get the error
Create Flow request failed: [Task Validation Error: You must specify a projection task or a MAP_ALL task].
How can I select a few columns as well as rename them before moving them to S3 without resorting to something like Glue?
Figured it out: I first added a Projection task to fetch the fields needed, and then Map tasks, one per field being renamed.
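For reference, a sketch of what the Tasks section might end up looking like, built here in Python for brevity. The field names come from the question; expressing the projection as a Filter task with the PROJECTION connector operator is my assumption, not something confirmed in the original answer:
# Hypothetical mapping of Salesforce source fields to destination field names.
renames = {"Website": "Website", "Account Name": "AccountName", "Account ID": "AccountId"}

tasks = [
    {
        # Projection task: pull only these fields from the source.
        # The Filter/PROJECTION pairing is an assumption.
        "SourceFields": list(renames),
        "ConnectorOperator": {"Salesforce": "PROJECTION"},
        "TaskType": "Filter",
        "TaskProperties": {},
    }
] + [
    {
        # One Map task per field being renamed, as in the question.
        "SourceFields": [source],
        "DestinationField": destination,
        "TaskType": "Map",
        "TaskProperties": {},
    }
    for source, destination in renames.items()
]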

How to specify attributes to return from DynamoDB through AppSync

I have an AppSync pipeline resolver. The first function queries an ElasticSearch database for the DynamoDB keys. The second function queries DynamoDB using the provided keys. This was all working well until I ran into the 1 MB limit of AppSync. Since most of the data is in a few attributes/columns I don't need, I want to limit the results to just the attributes I need.
I tried adding AttributesToGet and ProjectionExpression (from here) but both gave errors like:
{
    "data": {
        "getItems": null
    },
    "errors": [
        {
            "path": [
                "getItems"
            ],
            "data": null,
            "errorType": "MappingTemplate",
            "errorInfo": null,
            "locations": [
                {
                    "line": 2,
                    "column": 3,
                    "sourceName": null
                }
            ],
            "message": "Unsupported element '$[tables][dev-table-name][projectionExpression]'."
        }
    ]
}
My DynamoDB function's request mapping template looks like this (it returns results as long as the data is less than 1 MB):
#set($ids = [])
#foreach($pResult in ${ctx.prev.result})
    #set($map = {})
    $util.qr($map.put("id", $util.dynamodb.toString($pResult.id)))
    $util.qr($map.put("ouId", $util.dynamodb.toString($pResult.ouId)))
    $util.qr($ids.add($map))
#end
{
    "version" : "2018-05-29",
    "operation" : "BatchGetItem",
    "tables" : {
        "dev-table-name": {
            "keys": $util.toJson($ids),
            "consistentRead": false
        }
    }
}
I contacted AWS, who confirmed that ProjectionExpression is not currently supported here and that it will be a while before they get to it.
Instead, I created a Lambda function to pull the data from DynamoDB.
To limit the results from DynamoDB, I used $ctx.info.selectionSetList in AppSync to get the list of requested columns, then used that list to specify the attributes to pull from DynamoDB. I needed to get multiple results while maintaining order, so I used BatchGetItem and then merged the results with the original list of IDs using LINQ (which put the DynamoDB results back in the correct order, since BatchGetItem in C# does not preserve sort order the way the AppSync version does).
Because I was using C# with a number of libraries, the cold start time was a little long, so I used Lambda Layers pre-JITted for Linux, which brought the cold start down from ~1.8 seconds to ~1 second (with 1,024 MB of RAM allocated to the Lambda).
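The original implementation was C# with LINQ; a rough Python sketch of the same idea is below. The fetch_items helper is hypothetical, the table and key names are the ones from the mapping template above, and BatchGetItem's 100-key limit and UnprocessedKeys handling are omitted:
import boto3

dynamodb = boto3.client("dynamodb")

# keys: list of {"id": ..., "ouId": ...} dicts in the order the caller wants back.
# selection_set: the fields the client asked for ($ctx.info.selectionSetList).
# Both key attributes are assumed to be strings.
def fetch_items(keys, selection_set, table_name="dev-table-name"):
    # Always include the key attributes so results can be re-ordered afterwards,
    # and alias every attribute to avoid clashes with DynamoDB reserved words.
    attributes = list(dict.fromkeys(["id", "ouId", *selection_set]))
    names = {f"#a{i}": attr for i, attr in enumerate(attributes)}

    response = dynamodb.batch_get_item(
        RequestItems={
            table_name: {
                "Keys": [{"id": {"S": k["id"]}, "ouId": {"S": k["ouId"]}} for k in keys],
                "ProjectionExpression": ", ".join(names),
                "ExpressionAttributeNames": names,
            }
        }
    )

    # BatchGetItem does not preserve request order, so re-sort by the original key list.
    found = {(item["id"]["S"], item["ouId"]["S"]): item
             for item in response["Responses"].get(table_name, [])}
    return [found.get((k["id"], k["ouId"])) for k in keys]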
AppSync doesn't support projection, but you can explicitly define which fields to return in the response template instead of returning the entire result set.
{
    "id": "$ctx.result.get('id')",
    "name": "$ctx.result.get('name')",
    ...
}

Store invalid JSON columns as STRING or skip them in BigQuery

I have a JSON data file which looks something like the one below:
{
    "key_a": "value_a",
    "key_b": "value_b",
    "key_c": {
        "c_nested/invalid.key.according.to.bigquery": "valid_value_though"
    }
}
As we know, BigQuery considers c_nested/invalid.key.according.to.bigquery an invalid column name. I have a huge amount of log data exported by Stackdriver into Google Cloud Storage, and it contains a lot of invalid field names (according to BigQuery, fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long).
As a workaround, I am trying to store the value of key_c (the whole {"c_nested/invalid.key.according.to.bigquery": "valid_value_though"} object) as a string in the BigQuery table.
I presume my table definition would look something like the one below:
[
    {
        "mode": "NULLABLE",
        "name": "key_a",
        "type": "STRING"
    },
    {
        "mode": "NULLABLE",
        "name": "key_b",
        "type": "STRING"
    },
    {
        "mode": "NULLABLE",
        "name": "key_c",
        "type": "STRING"
    }
]
When I try to create a table with this schema, I get the error below:
Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1; errors: 1; max bad: 0; error percent: 0
Error while reading data, error message: JSON parsing error in row starting at position 0: Expected key
Assuming that skipping fields is now supported in BigQuery, I thought of simply leaving the key_c column out, with the schema below:
[
    {
        "mode": "NULLABLE",
        "name": "key_a",
        "type": "STRING"
    },
    {
        "mode": "NULLABLE",
        "name": "key_b",
        "type": "STRING"
    }
]
The above schema at least lets me create a permanent table (for querying external data), but when I try to query the data I get the following error:
Error while reading table:
projectname.dataset_name.table_name, error message:
JSON parsing error in row starting at position 0: No such field: key_c.
I understand there is a way, described here, to load each JSON row raw to BigQuery - as if it were a CSV - and then parse it in BigQuery, but that makes the queries too complicated.
Is cleaning the data the only way? How can I tackle this?
I am looking for a way to skip making a column for invalid fields and store them directly as STRING, or simply ignore them fully. Is this possible?
One of the main premises of using BQ (and other cloud databases) is that storage is cheap. In practice, it is often helpful to load 'raw' or 'source' data into BQ and then transform it as needed (with views or other transformation tools). This is a paradigm shift from ETL to ELT.
With that in mind, I would import your "invalid" JSON blob as a string, and then parse it in your transformation steps. Here is one method:
with data as (
  select '{"key_a":"value_a","key_b":"value_b","key_c":{"c_nested/invalid.key.according.to.bigquery":"valid_value_though"}}' as my_string
)
select
  JSON_EXTRACT_SCALAR(my_string, '$.key_a') as key_a,
  JSON_EXTRACT_SCALAR(my_string, '$.key_b') as key_b,
  JSON_EXTRACT_SCALAR(REPLACE(my_string, "c_nested/invalid.key.according.to.bigquery", "custom_key"), '$.key_c.custom_key') as key_c
from data
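If you cannot change the export itself, one way to land each raw JSON row as a single STRING column (so the query above has something to parse) is to load the files as if they were single-column CSV. A rough sketch with the google-cloud-bigquery client; the bucket, dataset, and table names are made up, and the tab delimiter is an assumption (it must never appear in the data):
from google.cloud import bigquery

client = bigquery.Client()

# Treat each line of the export as one opaque CSV field named my_string.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="\t",      # assumed to never appear in the log lines
    quote_character="",        # don't treat quotes inside the JSON specially
    schema=[bigquery.SchemaField("my_string", "STRING")],
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/stackdriver-export/*.json",
    "projectname.dataset_name.raw_logs",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
With the column named my_string, the JSON_EXTRACT_SCALAR query above can then select from the loaded table instead of the inline CTE.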

Flattening JSON file while transferring from S3 to RedShift using AWS Pipeline

I have a JSON file on S3 that I want to transfer to Redshift. One catch is that the file contains entries in the following format:
{
    "user_id": 1,
    "metadata": {
        "connection_type": "WIFI",
        "device_id": "1234"
    }
}
Before I save it to Redshift, I want to flatten the file so that it contains the columns:
user_id | connection_type | device_id
How can I do this using AWS Data Pipeline?
Is there an activity that can transform JSON into the desired form? I do not think the transform SQL will support JSON fields.
You do not need to flatten it. You can load it with the COPY command after defining a jsonpaths config file to extract the column values from each JSON object.
With your structure you'd create a file in S3 (s3://bucket/your_jsonpaths.json) like so:
{
    "jsonpaths": [
        "$.user_id",
        "$.metadata.connection_type",
        "$.metadata.device_id"
    ]
}
Then you'd run something like this in Redshift:
copy your_table
from 's3://bucket/data_objects.json'
credentials '<aws-auth-args>'
json 's3://bucket/your_jsonpaths.json';
If you have issues, see what is in the stv_load_errors table.
Check out the Redshift COPY command documentation and examples.

Analytics in WSO2DAS

I'm getting a "Table Not Found" error while running a select query in the Spark console of WSO2 DAS. I've kept all the default configurations intact after the installation. I'm unable to fetch the data from the event stream even though it is shown in the table dropdown of the Data Explorer.
When data is first moved into WSO2 DAS, it is persisted in the data store you specify.
However, these are not the tables created in Spark. You need to write a Spark query that creates a temporary table in Spark referencing the table you have persisted.
For example, if your stream is:
{
    "name": "sample",
    "version": "1.0.0",
    "nickName": "",
    "description": "",
    "payloadData": [
        {
            "name": "ID",
            "type": "INT"
        },
        {
            "name": "NAME",
            "type": "STRING"
        }
    ]
}
you need to write the following Spark query in the Spark console:
CREATE TEMPORARY TABLE sample_temp USING CarbonAnalytics OPTIONS (tableName "sample", schema "ID INT, NAME STRING");
After executing the above script, try the following:
select * from sample_temp;
This should fetch the data you have pushed into WSO2 DAS.
Happy learning! :)