Redshift: COPY command for JSON data from S3 - amazon-web-services

I have the following JSON data.
{
  "recordid": "69",
  "recordTimestamp": 1558087302591,
  "spaceId": "space-cd88557d",
  "spaceName": "Kirtipur",
  "partnerId": "Kirtipur",
  "eventType": "event-location-update",
  "eventlocationupdate": {
    "event": {
      "eventid": "event-qcTUrDAThkbPsXi438rRk",
      "userId": "",
      "tags": [],
      "mobile": "",
      "email": "",
      "gender": "OTHER",
      "firstName": "",
      "lastName": "",
      "postalCode": "",
      "optIns": [],
      "otherFields": [],
      "macAddress": "55:56:81:ba:a4:6d"
    },
    "location": {
      "locationId": "location-bdfsfsf6a8d96",
      "name": "Kirtipur Office - wireless",
      "inferredLocationTypes": ["NETWORK"],
      "parent": {
        "locationId": "location-c39ffc49",
        "name": "Kirtipur",
        "inferredLocationTypes": ["vianet"],
        "parent": {
          "locationId": "location-8b47asdfdsf1c6a",
          "name": "Kirtipur",
          "inferredLocationTypes": ["ROOT"]
        }
      }
    },
    "ssid": "",
    "rawUserId": "",
    "visitId": "visit-ca04ds5secb8d",
    "lastSeen": 1558087081000,
    "deviceClassification": "",
    "mapId": "",
    "xPos": 1.8595887,
    "yPos": 3.5580606,
    "confidenceFactor": 0.0,
    "latitude": 0.0,
    "longitude": 0.0
  }
}
I need to load this from the S3 bucket using the COPY command. I have uploaded the file to my S3 bucket.
I have used the COPY command for CSV files, but I have not used it with JSON files. I researched JSON import via the COPY command but did not find solid, helpful examples.
I used the following COPY command.
COPY vianet_raw_data
from 's3://vianet-test/vianet.json'
with credentials as ''
format as json 'auto';
This did not insert any data.
Can anyone please help me with the copy command for such JSON?
Thanks and Regards

There are 2 scenarios (most probably the 1st):
You want AWS's auto option to load from the S3 path you provided in line 2 of your COPY command. For that, you do:
COPY vianet_raw_data
from 's3://vianet-test/vianet.json'
with credentials as ''
json 'auto';
You want custom JSON loading paths (i.e. you don't want all paths picked up automatically):
COPY vianet_raw_data
from 's3://vianet-test/vianet.json'
with credentials as ''
format as json 's3://vianet-test/vianet_PATHS.json';
Here, 's3://vianet-test/vianet_PATHS.json' is a JSONPaths file listing the specific JSON paths you want to extract from each record.
Refer: https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#r_COPY_command_examples-copy-from-json
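For example, here is a Python sketch (using boto3) that builds and uploads a JSONPaths file for the nested structure in the question; the choice of paths and the implied target columns are only assumptions:
import json
import boto3

# Hypothetical selection of nested JSON paths, one per target table column,
# in the same order as the columns of vianet_raw_data.
jsonpaths = {
    "jsonpaths": [
        "$.recordid",
        "$.recordTimestamp",
        "$.spaceId",
        "$.eventType",
        "$.eventlocationupdate.event.eventid",
        "$.eventlocationupdate.location.locationId",
        "$.eventlocationupdate.lastSeen",
    ]
}

# Upload the paths file so COPY ... format as json can reference it.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="vianet-test",
    Key="vianet_PATHS.json",
    Body=json.dumps(jsonpaths).encode("utf-8"),
)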

One issue I notice is the formatting. It is nicely formatted the way you shared it, which is good for us to read, but when loading JSON into Redshift via the COPY command I generally minify it by removing all newlines and blank spaces.
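If you want to try that, here is a minimal Python sketch for minifying the file before uploading it (the file names are placeholders):
import json

# Read the pretty-printed JSON and rewrite it without newlines or extra spaces.
with open("vianet.json") as src:
    record = json.load(src)

with open("vianet_min.json", "w") as dst:
    # separators=(",", ":") also drops the spaces after commas and colons.
    json.dump(record, dst, separators=(",", ":"))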

Related

cloud-init - I am trying to copy files to Windows EC2 instance through cloud init by passing it through user data

I am trying to copy files to a Windows EC2 instance through cloud-init by passing them through user data. The cloud-init template runs and creates a folder, but it does not copy the files. Can you help me understand what I am doing wrong in my code?
This code is passed through the launch configuration of an Auto Scaling group.
data "template_cloudinit_config" "ebs_snapshot_scripts" {
  gzip          = false
  base64_encode = false

  part {
    content_type = "text/cloud-config"
    content      = <<EOF
<powershell>
$path = "C:\aws"
If(!(test-path $path))
{
    New-Item -ItemType Directory -Force -Path $path
}
</powershell>
write_files:
  - content: |
      ${file("${path.module}/../../../../scripts/aws-ebs-snapshot-ps/1-start-ebs-snapshot.ps1")}
    path: C:\aws\1-start-ebs-snapshot.ps1
    permissions: '0744'
  - content: |
      ${file("${path.module}/../../../../scripts/aws-ebs-snapshot-ps/2-run-backup.cmd")}
    path: C:\aws\2-run-backup.cmd
    permissions: '0744'
  - content: |
      ${file("${path.module}/../../../../scripts/aws-ebs-snapshot-ps/3-ebs-snapshot.ps1")}
    path: C:\aws\3-ebs-snapshot.ps1
    permissions: '0744'
EOF
  }
}
Your current approach involves using the Terraform template language to produce YAML by concatenating together strings, some of which are multi-line strings from an external file, and that will always be pretty complex to get right because YAML is a whitespace-sensitive language.
I have two ideas to make this easier. You could potentially do both of them, although doing one or the other could work too.
The first idea is to follow the recommendations about generating JSON and YAML from Terraform's templatefile function documentation. Although your template is inline rather than in a separate file, you can apply a similar principle here to have Terraform itself be responsible for producing valid YAML, and then you can just worry about making the input data structure be the correct shape:
part {
  content_type = "text/cloud-config"

  # JSON is a subset of YAML, so cloud-init should
  # still accept this even though it's jsonencode.
  content = jsonencode({
    write_files = [
      {
        content     = file("${path.module}/../../../../scripts/aws-ebs-snapshot-ps/1-start-ebs-snapshot.ps1")
        path        = "C:\\aws\\1-start-ebs-snapshot.ps1"
        permissions = "0744"
      },
      {
        content     = file("${path.module}/../../../../scripts/aws-ebs-snapshot-ps/2-run-backup.cmd")
        path        = "C:\\aws\\2-run-backup.cmd"
        permissions = "0744"
      },
      {
        content     = file("${path.module}/../../../../scripts/aws-ebs-snapshot-ps/3-ebs-snapshot.ps1")
        path        = "C:\\aws\\3-ebs-snapshot.ps1"
        permissions = "0744"
      },
    ]
  })
}
The jsonencode and yamlencode Terraform functions know how to escape newlines and other special characters automatically, and so you can just include the file content as an attribute in the object and Terraform will encode it into a valid string literal automatically.
The second idea is to use base64 encoding instead of direct encoding. Cloud-init allows passing file contents as base64 if you set the additional property encoding to b64. You can then use Terraform's filebase64 function to read the contents of the file directly into a base64 string which you can then include into your YAML without any special quoting or escaping.
Using base64 also means that the files placed on the remote system should be byte-for-byte identical to the ones on disk in your Terraform module, whereas by using file into a YAML string there is the potential for line endings and other whitespace to be changed along the way.
On the other hand, one disadvantage of using base64 is that the file contents won't be directly readable in the terraform plan output, and so the plan won't be as clear as it would be with just plain YAML string encoding.
You can potentially combine both of these ideas together by using the filebase64 function as part of the argument to jsonencode in the first example:
# ...
{
  encoding    = "b64"
  content     = filebase64("${path.module}/../../../../scripts/aws-ebs-snapshot-ps/1-start-ebs-snapshot.ps1")
  path        = "C:\\aws\\1-start-ebs-snapshot.ps1"
  permissions = "0744"
},
# ...
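As a rough illustration of what filebase64 produces (and what cloud-init undoes on the instance when encoding is "b64"), here is a small Python sketch; the file name is just taken from the question:
import base64

# Read the script as raw bytes and base64-encode it, like Terraform's filebase64.
with open("1-start-ebs-snapshot.ps1", "rb") as f:
    raw = f.read()

encoded = base64.b64encode(raw).decode("ascii")

# cloud-init decodes this back on the instance, so the round trip is byte-for-byte.
assert base64.b64decode(encoded) == raw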
cloud-init only reliably writes files, so you have to provide content for them. I'd suggest storing your files in S3 (for example) and pulling them during boot.
Apologies in advance for the mixed Windows/Linux example.
Using the same write_files, write a short script, e.g.:
#!/bin/bash
wget something.ps1
wget something-else.ps2
Then using runcmd/bootcmd run the files:
bootcmd:
- ./something.ps1
- ./something-else.ps2
Job is done, w/o encoding/character-escaping headache.
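If you take the "pull from S3 at boot" route on Windows, the download step could also be a small boto3 script (assuming Python and boto3 are available on the instance; the bucket name and key prefix below are made up):
import boto3

BUCKET = "my-bootstrap-scripts"   # hypothetical bucket holding the scripts
FILES = [
    "1-start-ebs-snapshot.ps1",
    "2-run-backup.cmd",
    "3-ebs-snapshot.ps1",
]

s3 = boto3.client("s3")
for name in FILES:
    # Download each script into C:\aws, matching the paths used in the question.
    s3.download_file(BUCKET, "scripts/" + name, "C:\\aws\\" + name)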

AWS Glue crawler - Getting "Internal Service Exception" on crawling json data

I am facing issues crawling data from S3 bucket.
The files are in JSON format.
When I try crawling this data from S3 I get "Internal Service Exception".
Can you please suggest a fix?
When I try loading the data directly from Athena, I see the following error for a field which is an array of strings:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key
Thanks,
There were spaces in the key names that I was using in the JSON.
{
...
"key Name" : "value"
...
}
I formatted my data to remove spaces from key names and converted all the keys to lower case.
{
...
"keyname" : "value"
...
}
This resolved the issue.
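If you have many files to fix, a small Python sketch like this could do the renaming before you re-crawl (it recurses into nested objects; the file names are placeholders):
import json

def normalize_keys(value):
    """Recursively lower-case keys and strip spaces out of them."""
    if isinstance(value, dict):
        return {k.replace(" ", "").lower(): normalize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_keys(item) for item in value]
    return value

with open("input.json") as src:
    data = json.load(src)

with open("output.json", "w") as dst:
    json.dump(normalize_keys(data), dst)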

Delimiter not found error - AWS Redshift Load from s3 using Kinesis Firehose

I am using Kinesis Firehose to transfer data to Redshift via S3.
I have a very simple CSV file that looks like this. Firehose puts it into S3, but Redshift errors out with a "Delimiter not found" error.
I have looked at virtually every post related to this error, and I made sure that the delimiter is included.
File
GOOG,2017-03-16T16:00:01Z,2017-03-17 06:23:56.986397,848.78
GOOG,2017-03-16T16:00:01Z,2017-03-17 06:24:02.061263,848.78
GOOG,2017-03-16T16:00:01Z,2017-03-17 06:24:07.143044,848.78
GOOG,2017-03-16T16:00:01Z,2017-03-17 06:24:12.217930,848.78
OR
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:48:59.993260","852.12"
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:49:07.034945","852.12"
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:49:12.306484","852.12"
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:49:18.020833","852.12"
"GOOG","2017-03-17T16:00:02Z","2017-03-18 05:49:24.203464","852.12"
Redshift Table
CREATE TABLE stockvalue
( symbol VARCHAR(4),
streamdate VARCHAR(20),
writedate VARCHAR(26),
stockprice VARCHAR(6)
);
Error
[screenshot of the COPY error, not included]
Just in case, here's what my Kinesis stream looks like:
[Firehose configuration screenshot, not included]
Can someone point out what may be wrong with the file?
I added a comma between the fields.
All columns in the destination table are varchar, so there should be no reason for a datatype error.
Also, the column lengths match exactly between the file and the Redshift table.
I have tried embedding columns in double quotes and without.
Can you post the full COPY command? It's cut off in the screenshot.
My guess is that you are missing DELIMITER ',' in your COPY command. Try adding that to the COPY command.
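For reference, here is a sketch of what the corrected COPY might look like, issued from Python with psycopg2; the cluster endpoint, credentials, IAM role, and S3 prefix are placeholders, and the exact options depend on your Firehose setup (REMOVEQUOTES covers the quoted variant of the file):
import psycopg2

copy_sql = """
    COPY stockvalue
    FROM 's3://my-firehose-bucket/prefix/'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/my-redshift-role'
    DELIMITER ','
    REMOVEQUOTES;
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="dev",
    user="awsuser",
    password="...",
)
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # the transaction commits when the block exits cleanly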
I was stuck on this for hours; Shahid's answer helped me solve it.
Text Case for Column Names is Important
Redshift will always treat your table's columns as lower-case, so when mapping JSON keys to columns, make sure the JSON keys are lower-case, e.g.
Your JSON file will look like:
{"id": "val1", "name": "val2"}{"id": "val1", "name": "val2"}{"id": "val1", "name": "val2"}{"id": "val1", "name": "val2"}
And the COPY statement will look like
COPY latency(id,name) FROM 's3://<bucket-name>/<manifest>' CREDENTIALS 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' MANIFEST json 'auto';
Settings within Firehose must have the column names specified (again, in lower-case). Also, add the following to Firehose COPY options:
json 'auto' TRUNCATECOLUMNS blanksasnull emptyasnull
How to call put_records from Python:
Below is a snippet showing how to use the put_records function with Kinesis in Python.
'objects', passed into the 'put_to_stream' function, is a list of dictionaries:
import json
import boto3

kinesis_client = boto3.client('kinesis')
kinesis_stream_name = 'my-stream-name'  # placeholder: your Kinesis stream name

def put_to_stream(objects):
    records = []
    for obj in objects:
        record = {
            'Data': json.dumps(obj),
            'PartitionKey': 'swat_report'
        }
        records.append(record)
    print(records)
    put_response = kinesis_client.put_records(StreamName=kinesis_stream_name, Records=records)
    return put_response
1- You need to add FORMAT AS JSON 's3://yourbucketname/aJsonPathFile.txt'. AWS does not make this obvious in the documentation. Please note that this only works when your data is in JSON form, like:
{"attr1": "val1", "attr2": "val2"} {"attr1": "val1", "attr2": "val2"} {"attr1": "val1", "attr2": "val2"} {"attr1": "val1", "attr2": "val2"}
2- You also need to verify that the column order in Kinesis Firehose matches the order in the CSV file, and try adding
TRUNCATECOLUMNS blanksasnull emptyasnull
3- An example
COPY testrbl3 ( eventId,serverTime,pageName,action,ip,userAgent,location,plateform,language,campaign,content,source,medium,productID,colorCode,scrolltoppercentage) FROM 's3://bucketname/' CREDENTIALS 'aws_iam_role=arn:aws:iam:::role/' MANIFEST json 'auto' TRUNCATECOLUMNS blanksasnull emptyasnull;

Flattening JSON file while transferring from S3 to RedShift using AWS Pipeline

I have a JSON file on S3 that I want to transfer to Redshift. One catch is that the file contains entries in the following format:
{
    "user_id": 1,
    "metadata": {
        "connection_type": "WIFI",
        "device_id": "1234"
    }
}
Before I save it to Redshift, I want to flatten the file to contain the columns:
user_id | connection_type | device_id
How can I do this using AWS Data Pipeline?
Is there an activity that can transform JSON to the desired form? I do not think that transform SQL will support JSON fields.
You do not need to flatten it. You can load it with the copy command after defining a jsonpaths config file to easily extract the column values from each json object.
With your structure you'd create a file in S3 (s3://bucket/your_jsonpaths.json) like so:
{
    "jsonpaths": [
        "$.user_id",
        "$.metadata.connection_type",
        "$.metadata.device_id"
    ]
}
Then you'd run something like this in Redshift:
copy your_table
from 's3://bucket/data_objects.json'
credentials '<aws-auth-args>'
json 's3://bucket/your_jsonpaths.json';
If you have issues, see what is in the stl_load_errors table.
Check out the Redshift copy command and examples.
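A quick way to look at those load errors from Python is a short psycopg2 query (connection details below are placeholders):
import psycopg2

# Show the most recent COPY failures recorded by Redshift.
query = """
    SELECT starttime, filename, line_number, colname, err_reason
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 20;
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="dev",
    user="awsuser",
    password="...",
)
with conn, conn.cursor() as cur:
    cur.execute(query)
    for row in cur.fetchall():
        print(row)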

How do you full text search an Amazon S3 bucket?

I have a bucket on S3 containing a large number of text files.
I want to search for some text within those text files. They contain raw data only.
Each text file has a different name.
For example, I have objects like:
abc/myfolder/abac.txt
xyx/myfolder1/axc.txt
and I want to search for text like "I am human" in the above text files.
How can I achieve this? Is it even possible?
The only way to do this will be via CloudSearch, which can use S3 as a source. It works using rapid retrieval to build an index. This should work very well but thoroughly check out the pricing model to make sure that this won't be too costly for you.
The alternative is, as Jack said, that you'd otherwise need to transfer the files out of S3 to an EC2 instance and build a search application there.
Since October 1st, 2015, Amazon offers another search service, Amazon Elasticsearch Service; in more or less the same vein as CloudSearch, you can stream data into it from Amazon S3 buckets.
It works with a Lambda function: any new data sent to an S3 bucket triggers an event notification to the Lambda, which updates the ES index.
All steps are well detailed in the Amazon docs, with Java and JavaScript examples; a minimal handler sketch follows the steps below.
At a high level, setting up to stream data to Amazon ES requires the following steps:
Creating an Amazon S3 bucket and an Amazon ES domain
Creating a Lambda deployment package.
Configuring a Lambda function.
Granting authorization to stream data to Amazon ES.
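At its core, the Lambda function is just a handler that reads each newly uploaded S3 object and posts its content to the Amazon ES endpoint. Here is a minimal Python sketch; the endpoint, index name, and unauthenticated requests.post call are simplifying assumptions (the AWS docs show a full signed-request version):
import urllib.parse

import boto3
import requests  # bundled into the Lambda deployment package

ES_ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"  # placeholder domain
INDEX_URL = ES_ENDPOINT + "/s3-files/_doc"  # hypothetical index

s3 = boto3.client("s3")

def handler(event, context):
    # The S3 event notification carries the bucket and key of each new object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Index the file content so it becomes full-text searchable in ES.
        document = {"bucket": bucket, "key": key, "content": body}
        requests.post(INDEX_URL, json=document, timeout=10)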
Although not an AWS native service, there is Mixpeek, which runs text extraction like Tika, Tesseract and ImageAI on your S3 files then places them in a Lucene index to make them searchable.
You integrate it as follows:
Download the module: https://github.com/mixpeek/mixpeek-python
Import the module and your API keys:
from mixpeek import Mixpeek, S3
from config import mixpeek_api_key, aws
Instantiate the S3 class (which uses boto3 and requests):
s3 = S3(
aws_access_key_id=aws['aws_access_key_id'],
aws_secret_access_key=aws['aws_secret_access_key'],
region_name='us-east-2',
mixpeek_api_key=mixpeek_api_key
)
Upload one or more existing S3 files:
# upload all S3 files in bucket "demo"
s3.upload_all(bucket_name="demo")
# upload one single file called "prescription.pdf" in bucket "demo"
s3.upload_one(s3_file_name="prescription.pdf", bucket_name="demo")
Now simply search using the Mixpeek module:
# mixpeek api direct
mix = Mixpeek(
api_key=mixpeek_api_key
)
# search
result = mix.search(query="Heartgard")
print(result)
Where result can be:
[
    {
        "_id": "REDACTED",
        "api_key": "REDACTED",
        "highlights": [
            {
                "path": "document_str",
                "score": 0.8759502172470093,
                "texts": [
                    {
                        "type": "text",
                        "value": "Vetco Prescription\nVetcoClinics.com\n\nCustomer:\n\nAddress: Canine\n\nPhone: Australian Shepherd\n\nDate of Service: 2 Years 8 Months\n\nPrescription\nExpiration Date:\n\nWeight: 41.75\n\nSex: Female\n\n℞ "
                    },
                    {
                        "type": "hit",
                        "value": "Heartgard"
                    },
                    {
                        "type": "text",
                        "value": " Plus Green 26-50 lbs (Ivermectin 135 mcg/Pyrantel 114 mg)\n\nInstructions: Give one chewable tablet by mouth once monthly for protection against heartworms, and the treatment and\ncontrol of roundworms, and hookworms. "
                    }
                ]
            }
        ],
        "metadata": {
            "date_inserted": "2021-10-07 03:19:23.632000",
            "filename": "prescription.pdf"
        },
        "score": 0.13313256204128265
    }
]
Then you can parse the results as needed.
You can use Filestash (Disclaimer: I'm the author): install your own instance and connect it to your S3 bucket. Give it a bit of time to index everything if you have a lot of data, and you should be good.
If you have an EMR cluster, create a Spark application and do the search there. We did this; it works as a distributed search.
I know this is really old, but hopefully someone finds my solution handy.
This is a Python script using boto3.
import boto3

def search_word(info, search_for):
    # Return True if the search string appears anywhere in the file body.
    return search_for in info

aws_access_key_id = 'AKIAWG....'
aws_secret_access_key = 'p9yrNw.....'

client = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key
)

bucket_name = 'my.bucket.name'
bucket_prefix = '2022/05/'
search_for = 'looking#emailaddress.com'

search_results = []
search_results_keys = []

response = client.list_objects_v2(
    Bucket=bucket_name,
    Prefix=bucket_prefix
)

for i in response['Contents']:
    obj = client.get_object(
        Bucket=bucket_name,
        Key=i['Key']
    )
    body = obj['Body'].read().decode("utf-8")
    key = i['Key']
    if search_word(body, search_for):
        search_results.append({key: body})
        search_results_keys.append(key)

# You can either print the keys (file names/directories), or a map where the key is
# the file name/directory and the value is the text of the file.
print(search_results)
print(search_results_keys)
There is a serverless and cheaper option available:
Use AWS Glue to convert the text files into a table.
Use AWS Athena to run SQL queries on top of it.
I would recommend putting the data in Parquet files on S3; this keeps the data size on S3 very small and makes queries super fast!
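As a rough sketch of that last step with boto3, pandas, and pyarrow (the bucket, prefix, and one-row-per-line layout are my assumptions):
import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
BUCKET = "my-text-bucket"   # placeholder source bucket
PREFIX = "2022/05/"         # placeholder source prefix
DEST_KEY = "parquet/text_data.parquet"

# Gather the raw text files into one dataframe: one row per line, plus the source key.
rows = []
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", []):
    body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read().decode("utf-8")
    rows.extend({"key": obj["Key"], "line": line} for line in body.splitlines())

df = pd.DataFrame(rows)

# Write a single Parquet file and upload it; Glue can then crawl it and Athena can query it.
pq.write_table(pa.Table.from_pandas(df), "/tmp/text_data.parquet")
s3.upload_file("/tmp/text_data.parquet", BUCKET, DEST_KEY)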