AWS Glue crawler - Getting "Internal Service Exception" on crawling JSON data

I am facing issues crawling data from an S3 bucket.
The file format is JSON.
When I try crawling this data from S3, I get an "Internal Service Exception".
Can you please suggest a fix?
When I try loading the data directly from Athena, I see the following error for a field that is an array of strings:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key
Thanks,

There were spaces in the key names that I was using in the JSON.
{
...
"key Name" : "value"
...
}
I formatted my data to remove spaces from key names and converted all the keys to lower case.
{
...
"keyname" : "value"
...
}
This resolved the issue.
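In case it helps anyone doing the same cleanup, here is a minimal Python sketch of the normalization described above; the file names are placeholders, and it assumes one JSON object per line, so adjust the loading step if your files differ.
import json

def normalize_keys(obj):
    # Recursively strip spaces from key names and lowercase them.
    if isinstance(obj, dict):
        return {k.replace(" ", "").lower(): normalize_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize_keys(v) for v in obj]
    return obj

# Rewrite a newline-delimited JSON file (file names are placeholders).
with open("input.json") as src, open("cleaned.json", "w") as dst:
    for line in src:
        dst.write(json.dumps(normalize_keys(json.loads(line))) + "\n")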

Related

Dealing With Incoming Null Values In Cloud Data Fusion When Building Data Pipeline

I have started trying out Google Cloud Data Fusion as a prospective ETL tool that I may finally decide to use. When building a data pipeline to fetch data from a REST API source and load it into a MySQL database, I am facing this error: "Expected a string but was NULL at line 1 column 221. Please check the system logs for more details." And yes, it's true, I have a field that is null in the JSON response I am seeing:
"systemanswertime": null
How do I deal with null values, since the String type available in the Cloud Data Fusion Studio dropdown is not working? Are there other data types that I can use?
Below are two screenshots showing my current data pipeline structure:
general view
view showing mapping and the output schema
Thank You!!
What you need to do is tell the HTTP plugin that you are expecting a null by checking the null checkbox in front of the output field on the right side.
You might be getting this error because you are defining the value properties in the JSON schema. You should allow the systemanswertime parameter to be NULL.
You could try to parse the JSON value as follows:
"systemanswertime": {
"type": [
"string",
"null"
]
}
In case you don't have access to the JSON file, you could try to use this plugin to let the HTTP source handle nullable values by dynamically substituting the configurations, which can be served by an HTTP server. You will need to construct an accessible HTTP endpoint that can serve content similar to:
{
  "name" : "output.schema", "type" : "schema", "value" :
  [
    { "name" : "id", "type" : "int", "nullable" : true },
    { "name" : "first_name", "type" : "string", "nullable" : true },
    { "name" : "last_name", "type" : "string", "nullable" : true },
    { "name" : "email", "type" : "string", "nullable" : true }
  ]
}
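As an illustration only (not the exact plugin configuration), a throwaway endpoint serving such a schema document could be sketched with the Python standard library; the host, port, and field names below are assumptions that simply mirror the example above.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Schema document mirroring the example above; all names are illustrative.
SCHEMA = {
    "name": "output.schema",
    "type": "schema",
    "value": [
        {"name": "id", "type": "int", "nullable": True},
        {"name": "first_name", "type": "string", "nullable": True},
        {"name": "last_name", "type": "string", "nullable": True},
        {"name": "email", "type": "string", "nullable": True},
    ],
}

class SchemaHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the schema as JSON on every GET request.
        body = json.dumps(SCHEMA).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), SchemaHandler).serve_forever()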
In case you are facing an error such as No matching schema found for union type: ["string","null"], you could try the following workaround. The root cause of this error is that some entries in the response from the API don't have all the fields they are expected to have. For example, some entries may have callerId, channel, last_channel, last data, etc., but other entries may not have last_channel or some other field from the JSON. This leads to a mismatch with the schema provided in the HTTP source, and the pipeline fails right away.
As per this, when nodes encounter null values, logical errors, or other sources of errors, you may use an error handler plugin to catch errors. The approach is as follows:
In the HTTP source plug-in, change the following:
Output schema, to account for the custom field.
JSON/XML field mapping, to account for the custom field.
Non-HTTP Error Handling field, changed to Send to Error. This way it pushes the records through the error collector and the pipeline proceeds with the subsequent records.
Then add an Error Collector and a sink to capture the error records.
With this method you will be able to run the pipeline and have the problematic records detected.
Kind regards,
Manuel

DynamoDB PartiQL SELECT query returns ValidationException: Unexpected from source

I am using Amplify to set up a DynamoDB table with a corresponding Lambda using the Amplify blueprint for DynamoDB.
Accessing the DynamoDB table the "classic" way with KeyConditionExpression etc. works just fine, but today I wanted to try and use PartiQL instead with executeStatement, and I am just not able to get it to work.
I have added the "dynamodb:PartiQLSelect" permission to the CloudFormation template where all the other DynamoDB permissions are, so it looks like:
"Action": [
"dynamodb:DescribeTable",
"dynamodb:GetItem",
"dynamodb:Query",
"dynamodb:Scan",
"dynamodb:PutItem",
"dynamodb:UpdateItem",
"dynamodb:DeleteItem",
"dynamodb:PartiQLSelect"
],
and I do not get any permission error, so I hope that part is OK; it does, however, return the same error even without that line added.
The error that is always returned is:
ValidationException: Unexpected from source
and no matter what I have tried, it does not help.
My code is quite basic so far:
const dynamodb2 = new AWS.DynamoDB();
let tableName = "habits_sensors";
if (process.env.ENV && process.env.ENV !== "NONE") {
  tableName = tableName + '-' + process.env.ENV;
}

app.get(path, function(req, res) {
  let params = {
    Statement: `select * from ${tableName}`
  };
  dynamodb2.executeStatement(params, (err, data) => {
    if (err) {
      res.statusCode = 500;
      res.json({error: `Could not get users from : ${tableName} =>` + err});
    } else {
      res.json(data.Items);
    }
  });
});
The complete error string returned from the lambda is:
{
"error": "Could not get users from : habits_sensors-playground =>ValidationException: Unexpected from source"
}
and I have the table habits_sensors-playground in my AWS account, and I can access it the classic way without problems. That is why the "Unexpected from source" is confusing. I interpret it as saying that the tableName (in the FROM clause) of the SELECT query is not correct, but the name matches what I have in AWS and it works when using the DocumentClient.
Any suggestion on what might be wrong is very appreciated.
Answering myself in case anyone else ends up here.
I got a reply from AWS that if the table name contains dashes, you need to quote the table name with double quotes when using PartiQL (I had tried single quotes and that did not work).
Maybe this will change in a future release of PartiQL.
The exception ValidationException: Unexpected from source (CLI: An error occurred (ValidationException) when calling the ExecuteStatement operation: Unexpected from source) happens when the table name contains dashes and is not quoted.
So change
aws dynamodb execute-statement --statement \
"SELECT * FROM my-table WHERE field='string'"
To:
aws dynamodb execute-statement --statement \
"SELECT * FROM \"my-table\" WHERE field='string'"
or add the double quotes " around the table name in whichever SDK you're using to run PartiQL.
Note that the string value in WHERE field='value' uses single quotes ' while the table name requires double quotes ".
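The same fix applies outside the JavaScript SDK; for example, a rough boto3 (Python) sketch, assuming a table named my-table and a string attribute named field as in the CLI example above, would be:
import boto3

client = boto3.client("dynamodb")

# Double quotes around the dashed table name, single quotes around the string value.
response = client.execute_statement(
    Statement="SELECT * FROM \"my-table\" WHERE field = 'string'"
)
print(response.get("Items", []))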
For anyone else who ends up here as I did: I had a PowerShell script generating a statement for aws dynamodb execute-statement (aws --version 2.x) and was getting the same error. After far too long, I tried the interactive CLI and found that my query worked, so what I ended up needing to do was escape the double quotes both for PowerShell purposes AND again with \ characters for the AWS CLI.
$statement = "SELECT id FROM \`"${tablename}\`" WHERE source = '990doc'"
This double escaping finally got me where I needed to be, and will hopefully save someone else a great deal of frustration.
Adding this solution for anyone who ended up here with this error from the AWS SDK (JavaScript).
const { Items = [] } = await dynamodbClient.executeStatement({
  Statement: `SELECT * FROM "${tablename}" WHERE "_usedId" IS MISSING`
}).promise();
Surround the table name with double quotes, as mentioned in the answers above.

Batch write to DynamoDB

I am trying to write data into DynamoDB using batch_writer in a Lambda function. I am using "A1" as the partition key for my DynamoDB table, and when I pass the following JSON input it works well.
{
  "A1": "001",
  "A2": {
    "B1": "100",
    "B2": "200",
    "B3": "300"
  }
}
When I try to send the following request I get an error.
{
  "A1": {
    "B1": "100",
    "B2": "200",
    "B3": "300"
  }
}
Error -
"errorMessage": "An error occurred (ValidationException) when calling the BatchWriteItem operation: The provided key element does not match the schema"
Is it possible to write this data into DynamoDB using the Lambda function, and what should I change in my code to do that?
My code -
def lambda_handler(event, context):
    with table.batch_writer() as batch:
        batch.put_item(Item=event)
    return {"code": 200, "message": "Data added success"}
It's hard to say without seeing the table definition, but my bet is that "A1" is the primary key of type string. If you try setting it to a map, it will fail.
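Building on that answer, here is a minimal, hypothetical sketch of two possible workarounds, assuming "A1" really is a string partition key; the table name and the choice of replacement key value are placeholders, not part of the original question.
import json  # only needed for the serialization option below
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")  # placeholder table name

def lambda_handler(event, context):
    item = dict(event)
    if not isinstance(item.get("A1"), str):
        # Option 1: move the map to another attribute and derive a string partition key.
        item["A2"] = item.pop("A1")
        item["A1"] = item["A2"].get("B1", "unknown")
        # Option 2 (alternative): keep the map, but store it as a JSON string:
        # item["A1"] = json.dumps(event["A1"])
    with table.batch_writer() as batch:
        batch.put_item(Item=item)
    return {"code": 200, "message": "Data added success"}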

Redshift: COPY command for JSON data from S3

I have the following JSON data.
{
  "recordid": "69",
  "recordTimestamp": 1558087302591,
  "spaceId": "space-cd88557d",
  "spaceName": "Kirtipur",
  "partnerId": "Kirtipur",
  "eventType": "event-location-update",
  "eventlocationupdate": {
    "event": {
      "eventid": "event-qcTUrDAThkbPsXi438rRk",
      "userId": "",
      "tags": [],
      "mobile": "",
      "email": "",
      "gender": "OTHER",
      "firstName": "",
      "lastName": "",
      "postalCode": "",
      "optIns": [],
      "otherFields": [],
      "macAddress": "55:56:81:ba:a4:6d"
    },
    "location": {
      "locationId": "location-bdfsfsf6a8d96",
      "name": "Kirtipur Office - wireless",
      "inferredLocationTypes": [
        "NETWORK"
      ],
      "parent": {
        "locationId": "location-c39ffc49",
        "name": "Kirtipur",
        "inferredLocationTypes": [
          "vianet"
        ],
        "parent": {
          "locationId": "location-8b47asdfdsf1c6a",
          "name": "Kirtipur",
          "inferredLocationTypes": [
            "ROOT"
          ]
        }
      }
    },
    "ssid": "",
    "rawUserId": "",
    "visitId": "visit-ca04ds5secb8d",
    "lastSeen": 1558087081000,
    "deviceClassification": "",
    "mapId": "",
    "xPos": 1.8595887,
    "yPos": 3.5580606,
    "confidenceFactor": 0.0,
    "latitude": 0.0,
    "longitude": 0.0
  }
}
I need to load this from an S3 bucket using the COPY command. I have uploaded this file to my S3 bucket.
I have used the COPY command for CSV files but have not used it with JSON files. I researched JSON import via the COPY command but did not find solid, helpful command examples.
I used the following COPY command:
COPY vianet_raw_data
from 's3://vianet-test/vianet.json'
with credentials as ''
format as json 'auto';
This did not insert any data.
Can anyone please help me with the copy command for such JSON?
Thanks and Regards
There are 2 scenarios (most probably the 1st):
You want AWS's auto option to load from the S3 path you provided in line 2. For that, you do:
COPY vianet_raw_data
from 's3://vianet-test/vianet.json'
with credentials as ''
json 'auto';
Use custom JSON loading paths (i.e. you don't want all paths picked up automatically):
COPY vianet_raw_data
from 's3://vianet-test/vianet.json'
with credentials as ''
format as json 's3://vianet-test/vianet_PATHS.json';
Here, 's3://vianet-test/vianet_PATHS.json' contains the specific JSON paths from the main file that you want to look at.
Refer: https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#r_COPY_command_examples-copy-from-json
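For the second option, the paths file is a single JSON object with a "jsonpaths" array holding one JSONPath expression per target column, in column order. A sketch of generating and uploading such a file with Python follows; the column selection here is only an assumption based on the sample record.
import json
import boto3

# Illustrative subset of columns from the sample record above.
jsonpaths = {
    "jsonpaths": [
        "$.recordid",
        "$.recordTimestamp",
        "$.spaceId",
        "$.spaceName",
        "$.eventlocationupdate.event.macAddress",
        "$.eventlocationupdate.location.locationId",
    ]
}

# Upload the paths file next to the data file referenced in the COPY command.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="vianet-test",
    Key="vianet_PATHS.json",
    Body=json.dumps(jsonpaths),
)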
One issue I notice is the formatting. It is nicely formatted the way you shared it, which is good for us to read, but when loading it into Redshift via the COPY command I generally trim the JSON by removing all newlines and blank spaces.
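If you have the file locally, one quick way to do that trimming is to re-serialize it with Python before uploading; the file names here are placeholders.
import json

# Re-serialize the pretty-printed record as a single compact line.
with open("vianet.json") as src, open("vianet_compact.json", "w") as dst:
    record = json.load(src)
    dst.write(json.dumps(record, separators=(",", ":")) + "\n")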

AWS IoT DynamoDB rule ${value} NoSuchElementException

I'm trying to set up an AWS IoT rule to send data to DynamoDB without the help of a Lambda.
My rule query statement is : SELECT *, topic() AS topic, timestamp() AS timestamp FROM '+/#'
My data is fine in AWS IoT, as I'm successfully retrieving it with a Lambda. However, even after following the developer guide to create the rule and setting the two form fields to ${topic} and ${timestamp}, which should pass the information on to DynamoDB, I get nothing in DynamoDB and I find the following exception in CloudWatch:
MESSAGE:Dynamo Insert record failed. The error received was NoSuchElementException. Message arrived on: myTopic/data, Action: dynamo, Table: myTable, HashKeyField: topic, HashKeyValue: , RangeKeyField: Some(timestamp), RangeKeyValue:
HashKeyValue and RangeKeyValue seem to be empty. Why?
I also posted the question on the AWS forum : https://forums.aws.amazon.com/thread.jspa?threadID=267987
Suppose your device sends this payload:
mess={"reported":
{"light": "blue","Temperature": int(temp_data)),
"timestamp": str(pd.to_datetime(time.time()))}}
args.message=mess
You should query as:
SELECT message.reported.* FROM '#'
Then, set up DynamoDB hash key value as ${MessageID()}
You will get:
MessageID || Data
1527010174562 { "light" : { "S" : "blue" }, "Temperature" : { "N" : "41" }, "timestamp" : { "S" : "1970-01-01 00:00:01.527010174" }}
Then you can easily extract the values using a Lambda and send them to S3 via Data Pipeline, or to Firehose to create a data stream.
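For completeness, a rough boto3 sketch of creating such a rule outside the console is shown below; the rule name, table name, and role ARN are placeholders, and the hash key settings just follow the answer above.
import boto3

iot = boto3.client("iot")

iot.create_topic_rule(
    ruleName="reported_to_dynamo",  # placeholder rule name
    topicRulePayload={
        "sql": "SELECT message.reported.* FROM '#'",
        "awsIotSqlVersion": "2016-03-23",
        "ruleDisabled": False,
        "actions": [
            {
                "dynamoDB": {
                    "tableName": "myTable",  # placeholder table name
                    "roleArn": "arn:aws:iam::123456789012:role/iot-dynamo-role",  # placeholder role
                    "hashKeyField": "MessageID",
                    # Substitution template taken from the answer above.
                    "hashKeyValue": "${MessageID()}",
                }
            }
        ],
    },
)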