I have a JSON file on S3 that I want to transfer to Redshift. One catch is that the file contains entries in this format:
{
"user_id":1,
"metadata":
{
"connection_type":"WIFI",
"device_id":"1234"
}
}
Before saving it to Redshift, I want to flatten the file into these columns:
user_id | connection_type | device_id
How can I do this using AWS Data Pipeline?
Is there an activity that can transform the JSON into the desired form? I do not think a transform SQL activity will support JSON fields.
You do not need to flatten it. You can load it with the COPY command after defining a jsonpaths config file to extract the column values from each JSON object.
With your structure you'd create a file in S3 (s3://bucket/your_jsonpaths.json) like so:
{
"jsonpaths": [
"$.user_id",
"$.metadata.connection_type",
"$.metadata.device_id"
]
}
Then you'd run something like this in Redshift:
copy your_table
from 's3://bucket/data_objects.json'
credentials '<aws-auth-args>'
json 's3://bucket/your_jsonpaths.json';
If you have issues, see what is in the stl_load_errors system table.
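For example, a minimal sketch (assuming the psycopg2 driver and placeholder connection details) of inspecting the most recent load errors:
import psycopg2

conn = psycopg2.connect(host='<cluster-endpoint>', port=5439,
                        dbname='<database>', user='<user>', password='<password>')
with conn, conn.cursor() as cur:
    # Most recent COPY failures, newest first
    cur.execute("""
        SELECT starttime, filename, line_number, colname, err_reason
        FROM stl_load_errors
        ORDER BY starttime DESC
        LIMIT 10;
    """)
    for row in cur.fetchall():
        print(row)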
Check out the Redshift COPY command documentation and its examples.
I've been trying for a long time, through Glue crawlers, to get the .json files in my S3 bucket recognized so they can be queried in Athena, but after many changes to the settings, the best result I got is still wrong.
The Glue crawler even recognizes the column structure of my .json; however, when the table is queried in Athena, it sets up the columns it found but throws all items onto the same line, one item per column, as in the images below.
My classifier setting is "$[*]".
The .json data structure
[
{ "id": "TMA.fid--4f6e8018_18596f01b4f\_-5e3a", "airspace_p": 1061, "codedistv1": "SFC", "fid": 299 },
{ "id": "TMA.fid--4f6e8018_18596f01b4f\_-5e39", "airspace_p": 408, "codedistv1": "STD", "fid": 766 },
{ "id": "TMA.fid--4f6e8018_18596f01b4f\_-5e38", "airspace_p": 901, "codedistv1": "STD", "fid": 806 },
...
]
Configuration result in Glue: [screenshot]
Result in Athena from this table: [screenshot]
I have already tried different .json structures and different classifiers, and I changed and added the JsonSerDe.
If you can change the data source, use the JSON lines format instead, then run the Glue crawler without any custom classifier.
{"id": "TMA.fid--4f6e8018_18596f01b4f_-5e3a","airspace_p": 1061,"codedistv1": "SFC","fid": 299}
{"id": "TMA.fid--4f6e8018_18596f01b4f_-5e39","airspace_p": 408,"codedistv1": "STD","fid": 766}
{"id": "TMA.fid--4f6e8018_18596f01b4f_-5e38","airspace_p": 901,"codedistv1": "STD","fid": 806}
The cause of your issue is that Athena doesn't support custom JSON classifiers.
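If you need to convert the existing array-style files, here is a minimal sketch (file names are hypothetical) that rewrites one file as JSON Lines:
import json

with open('airspaces.json') as src, open('airspaces.jsonl', 'w') as dst:
    for obj in json.load(src):   # the source file is a single JSON array
        dst.write(json.dumps(obj) + '\n')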
I have configured an EventBridge rule to output (target) to Kinesis Data Firehose, and Firehose eventually delivers to an S3 bucket. The data is in JSON format.
Data is getting delivered to the S3 bucket with no issues.
I have created a Glue crawler pointing to the S3 bucket, which creates the table schema/metadata in the Glue Catalog so the data can be queried in AWS Athena.
I am currently facing two issues:
1. Firehose writes the data to the S3 bucket as single-line JSON, which means that if there are 5 records in the JSON, AWS Athena queries only the top record because the records are not delimited by a newline (\n). As per the AWS Athena docs, when records aren't separated by a newline character (\n), a query will always return 1 record.
https://aws.amazon.com/premiumsupport/knowledge-center/select-count-query-athena-json-records/
How can I transform these records so there is one record per line, and where should I do this? In Firehose?
2. In the JSON data, some columns have special characters, and the AWS Athena docs say column names cannot contain special characters (forward slash, etc.). How can I remove the special characters or rename the columns, and where should I transform this data? In a Glue ETL job? After the crawler has created the table schema?
I understand that in Kinesis Firehose we can transform source records with an AWS Lambda function, but I have not been able to solve the two issues above with it.
Update:
I was able to resolve the first issue by creating a Lambda function and calling it from Firehose. What logic can I include in the same Lambda to remove special characters from the key names? Code below; a sketch of the key-name cleanup follows it.
Example: if the JSON contains "RelatedAWSResources:0/type": "AWS::Config::ConfigRule", remove the special characters from the key name only, making it "RelatedAWSResources0type": "AWS::Config::ConfigRule".
import json
import base64

def lambda_handler(event, context):
    output = []  # collect the transformed records per invocation
    for record in event['records']:
        # Firehose delivers the record data base64-encoded
        payload = base64.b64decode(record['data']).decode('utf-8')
        print('payload:', payload)

        # Append a newline so each record lands on its own line in S3
        row_w_newline = payload + "\n"
        row_w_newline = base64.b64encode(row_w_newline.encode('utf-8')).decode('utf-8')

        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': row_w_newline
        })

    print('Processed {} records.'.format(len(event['records'])))
    return {'records': output}
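For the second issue, a minimal sketch (assuming each record's payload decodes to a single JSON object; the helper name is hypothetical) of stripping special characters from key names inside the same handler:
import re

def sanitize_keys(obj):
    # Recursively drop everything except letters, digits and underscores from
    # dictionary keys, leaving values untouched, so that
    # "RelatedAWSResources:0/type" becomes "RelatedAWSResources0type".
    if isinstance(obj, dict):
        return {re.sub(r'[^0-9a-zA-Z_]', '', k): sanitize_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [sanitize_keys(v) for v in obj]
    return obj

# Inside the loop, before appending the newline:
# payload = json.dumps(sanitize_keys(json.loads(payload)))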
I have the following JSON data.
{
"recordid":"69",
"recordTimestamp":1558087302591,
"spaceId":"space-cd88557d",
"spaceName":"Kirtipur",
"partnerId":"Kirtipur",
"eventType":"event-location-update",
"eventlocationupdate":{
"event":{
"eventid":"event-qcTUrDAThkbPsXi438rRk",
"userId":"",
"tags":[
],
"mobile":"",
"email":"",
"gender":"OTHER",
"firstName":"",
"lastName":"",
"postalCode":"",
"optIns":[
],
"otherFields":[
],
"macAddress":"55:56:81ð§ðĶa4:6d"
},
"location":{
"locationId":"location-bdfsfsf6a8d96",
"name":"Kirtipur Office - wireless",
"inferredLocationTypes":[
"NETWORK"
],
"parent":{
"locationId":"location-c39ffc49",
"name":"Kirtipur",
"inferredLocationTypes":[
"vianet"
],
"parent":{
"locationId":"location-8b47asdfdsf1c6a",
"name":"Kirtipur",
"inferredLocationTypes":[
"ROOT"
]
}
}
},
"ssid":"",
"rawUserId":"",
"visitId":"visit-ca04ds5secb8d",
"lastSeen":1558087081000,
"deviceClassification":"",
"mapId":"",
"xPos":1.8595887,
"yPos":3.5580606,
"confidenceFactor":0.0,
"latitude":0.0,
"longitude":0.0
}
}
I need to load this from the S3 bucket using the COPY command. I have uploaded the file to my S3 bucket.
I have worked with the COPY command for CSV files but not with JSON files. I researched JSON imports via the COPY command but did not find solid, helpful examples.
I used the following COPY command:
COPY vianet_raw_data
from 's3://vianet-test/vianet.json'
with credentials as ''
format as json 'auto';
This did not insert any data.
Can anyone please help me with the COPY command for such a JSON file?
Thanks and Regards
There are two scenarios (most probably the first):
You want Redshift's auto option to load from the S3 path you provided in line 2 of your COPY command. For that, you do:
COPY vianet_raw_data
from 's3://vianet-test/vianet.json'
with credentials as ''
json 'auto';
You want custom JSON loading paths (i.e. you don't want all paths picked automatically):
COPY vianet_raw_data
from 's3://vianet-test/vianet.json'
with credentials as ''
format as json 's3://vianet-test/vianet_PATHS.json';
Here, 's3://vianet-test/vianet_PATHS.json' is a jsonpaths file listing the specific JSON paths you want to extract from each object.
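As a rough sketch, a jsonpaths file could be generated like this (the paths below are guesses based on your sample document, their order must match the column order of vianet_raw_data, and the file still has to be uploaded to S3):
import json

jsonpaths = {
    "jsonpaths": [
        "$.recordid",
        "$.spaceId",
        "$.eventType",
        "$.eventlocationupdate.event.eventid",
        "$.eventlocationupdate.location.locationId"
    ]
}

# Write locally, then upload to s3://vianet-test/vianet_PATHS.json
with open('vianet_PATHS.json', 'w') as f:
    json.dump(jsonpaths, f, indent=2)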
Refer: https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#r_COPY_command_examples-copy-from-json
One issue I notice is the formatting. It is nicely formatted the way you shared it, which is good for us to read, but when loading into Redshift via the COPY command I generally trim the JSON by removing all newlines and blank spaces.
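A minimal sketch of that trimming step (file names here are hypothetical):
import json

with open('vianet.json') as src:
    record = json.load(src)

# Re-serialize without newlines or extra spaces before uploading to S3
with open('vianet_compact.json', 'w') as dst:
    json.dump(record, dst, separators=(',', ':'))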
I've created a Python script to take a JSON file from a Google Cloud Storage bucket and load it into a BigQuery dataset.
I'm having issues trying to specify the schema, which is stored as a text file in the same bucket.
My schema file is a .txt file with the format Attribute:DataType,Attribute:DataType.
This is what I have:
from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset('<dataset-name>')  # destination dataset

job_config = bigquery.LoadJobConfig()
schema_uri = 'gs://<bucket-name>/FlattenedProduct_schema.txt'
schema = schema_uri
job_config.schema = schema
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON

uri = 'gs://<bucket-name>/FlattenedProduct_JSON.txt'

load_job = client.load_table_from_uri(
    uri,
    dataset_ref.table('us_states'),
    location='US',  # Location must match that of the destination dataset.
    job_config=job_config)  # API request
You'll need to read the text file yourself and turn it into the required schema format, which is List[google.cloud.bigquery.schema.SchemaField], the schema of the destination table.
Example of the required schema:
from google.cloud.bigquery import SchemaField
schem = [
SchemaField('field1','STRING'),
SchemaField('field2', 'INTEGER'),
SchemaField('value', 'FLOAT')
]
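A minimal sketch of building that list from your Attribute:DataType,Attribute:DataType text file (the helper name is hypothetical, and it assumes a reasonably recent google-cloud-storage client; other names are taken from your snippet):
from google.cloud import bigquery, storage

def schema_from_gcs(bucket_name, blob_name):
    # Download the schema text, e.g. "name:STRING,price:FLOAT"
    text = storage.Client().bucket(bucket_name).blob(blob_name).download_as_text().strip()
    fields = []
    for pair in text.split(','):
        name, dtype = pair.split(':')
        fields.append(bigquery.SchemaField(name.strip(), dtype.strip().upper()))
    return fields

job_config = bigquery.LoadJobConfig()
job_config.schema = schema_from_gcs('<bucket-name>', 'FlattenedProduct_schema.txt')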
From your given code, you are loading a .txt file from your bucket, yet you set the source format to JSON (SourceFormat.NEWLINE_DELIMITED_JSON). You can review these lines to check whether that is what you intend:
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
uri = 'gs://<bucket-name>/FlattenedProduct_JSON.txt'
Alternatively, as a workaround, you can try the command below to load a JSON file located either on your local machine or in GCS. The following command specifies a schema when you load your data:
bq --location=[LOCATION] load --source_format=[FORMAT] [PROJECT_ID]:[DATASET].[TABLE] [PATH_TO_DATA_FILE] [PATH_TO_SCHEMA_FILE]
Where:
[LOCATION] is the name of your location. The --location flag is optional if your data is in the US or the EU multi-region location. For example, if you are using BigQuery in the Tokyo region, set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
[FORMAT] is NEWLINE_DELIMITED_JSON or CSV.
[PROJECT_ID] is your project ID.
[DATASET] is the dataset that contains the table into which you're loading data.
[TABLE] is the name of the table into which you're loading data.
[PATH_TO_DATA_FILE] is the location of the CSV or JSON data file on your local machine or in Google Cloud Storage.
[PATH_TO_SCHEMA_FILE] is the path to the schema file on your local machine.
Alternatively you can also specify a schema in the API.
To specify a schema when you load data, call the jobs.insert method and configure the configuration.load.schema property. Specify your region in the location property in the jobReference section.
To specify a schema when you create a table, call the tables.insert method and configure the schema in the table resource using the schema property.
See the BigQuery documentation on specifying a schema for the details of these options.
I am using Kinesis Firehose to transfer data to Redshift via S3.
I have a very simple CSV file that looks like the one below. Firehose puts it into S3, but Redshift errors out with a "Delimiter not found" error.
I have looked at practically every post related to this error, and I have made sure the delimiter is included.
File
GOOG,2017-03-16T16:00:01Z,2017-03-17 06:23:56.986397,848.78
GOOG,2017-03-16T16:00:01Z,2017-03-17 06:24:02.061263,848.78
GOOG,2017-03-16T16:00:01Z,2017-03-17 06:24:07.143044,848.78
GOOG,2017-03-16T16:00:01Z,2017-03-17 06:24:12.217930,848.78
OR
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:48:59.993260","852.12"
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:49:07.034945","852.12"
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:49:12.306484","852.12"
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:49:18.020833","852.12"
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:49:24.203464","852.12"
Redshift Table
CREATE TABLE stockvalue
( symbol VARCHAR(4),
streamdate VARCHAR(20),
writedate VARCHAR(26),
stockprice VARCHAR(6)
);
Error
[screenshot of the COPY error]
Just in case, here's what my Kinesis Firehose stream looks like:
[screenshot of the Firehose configuration]
Can someone point out what may be wrong with the file?
I added a comma between the fields.
All columns in the destination table are varchar, so there should be no datatype errors.
Also, the column lengths match exactly between the file and the Redshift table.
I have tried both with and without embedding the columns in double quotes.
Can you post the full COPY command? It's cut off in the screenshot.
My guess is that you are missing DELIMITER ',' in your COPY command; try adding it.
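To see the error outside Firehose, here is a minimal sketch (assuming psycopg2 and placeholder connection details) of running the COPY manually against the delivered files with DELIMITER ',' added (plus REMOVEQUOTES for the double-quoted variant of the file):
import psycopg2

copy_sql = """
    COPY stockvalue (symbol, streamdate, writedate, stockprice)
    FROM 's3://<bucket-name>/<prefix>/'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
    DELIMITER ','
    REMOVEQUOTES;
"""

conn = psycopg2.connect(host='<cluster-endpoint>', port=5439,
                        dbname='<database>', user='<user>', password='<password>')
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)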
I was stuck on this for hours, and Shahid's answer helped me solve it.
Text Case for Column Names is Important
Redshift will always treat your table's columns as lower-case, so when mapping JSON keys to columns, make sure the JSON keys are lower-case, e.g.
Your JSON file will look like:
{"id": "val1", "name": "val2"}{"id": "val1", "name": "val2"}{"id": "val1", "name": "val2"}{"id": "val1", "name": "val2"}
And the COPY statement will look like
COPY latency(id,name) FROM 's3://<bucket-name>/<manifest>' CREDENTIALS 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' MANIFEST json 'auto';
Settings within Firehose must have the column names specified (again, in lower-case). Also, add the following to Firehose COPY options:
json 'auto' TRUNCATECOLUMNS blanksasnull emptyasnull
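Here is a minimal sketch of where these settings live when the delivery stream is defined in code (the keys follow the boto3 Firehose API; the table and column names match the COPY example above):
copy_command = {
    'DataTableName': 'latency',
    'DataTableColumns': 'id,name',  # lower-case, matching the JSON keys
    'CopyOptions': "json 'auto' TRUNCATECOLUMNS blanksasnull emptyasnull"
}

# Passed as RedshiftDestinationConfiguration['CopyCommand'] to
# firehose_client.create_delivery_stream(...) or update_destination(...)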
How to call put_records from Python:
Below is a snippet showing how to use the put_records function with Kinesis in Python.
'objects', passed into the 'put_to_stream' function, is a list of dictionaries:
import json
import boto3

kinesis_client = boto3.client('kinesis')
kinesis_stream_name = '<your-stream-name>'

def put_to_stream(objects):
    records = []
    for obj in objects:
        record = {
            'Data': json.dumps(obj),
            'PartitionKey': 'swat_report'
        }
        records.append(record)
    print(records)
    put_response = kinesis_client.put_records(StreamName=kinesis_stream_name, Records=records)
    return put_response
1- You need to add FORMAT AS JSON 's3://yourbucketname/aJsonPathFile.txt'. The AWS documentation does not make this obvious. Please note that this only works when your data is in JSON form like:
{"attr1": "val1", "attr2": "val2"} {"attr1": "val1", "attr2": "val2"} {"attr1": "val1", "attr2": "val2"} {"attr1": "val1", "attr2": "val2"}
2- You also need to verify that the column order in Kinesis Firehose matches the order in the CSV file, and try adding:
TRUNCATECOLUMNS blanksasnull emptyasnull
3- An example
COPY testrbl3 ( eventId,serverTime,pageName,action,ip,userAgent,location,plateform,language,campaign,content,source,medium,productID,colorCode,scrolltoppercentage) FROM 's3://bucketname/' CREDENTIALS 'aws_iam_role=arn:aws:iam:::role/' MANIFEST json 'auto' TRUNCATECOLUMNS blanksasnull emptyasnull;