Run SQL script file with multiple complex queries using Amazon Data Pipeline - amazon-web-services

I have just created an account on AWS and I am going to use Data Pipeline to schedule my queries. Is it possible to run multiple complex SQL queries from a .sql file using the SqlActivity of Data Pipeline?
My overall objective is to process raw data from Redshift/S3 using SQL queries in Data Pipeline and save the results to S3. Is this a feasible way to go?
Any help in this regard will be appreciated.

Yes. If you plan on moving the data from Redshift to S3, you need to use the UNLOAD command, documented here: http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html

The input of your SQL queries will be a single data node and the output will be a single data file. Data Pipeline provides only one "Select Query" field in which you write your extraction/transformation query, so I don't think there is a use case for a file with multiple queries.
However, if you want to make your pipeline configurable, you can do so by adding "parameters" and "values" objects to your pipeline definition JSON:
{
  "objects": [
    {
      "selectQuery": "#{myRdsSelectQuery}"
    }
  ],
  "parameters": [
    {
      "description": "myRdsSelectQuery",
      "id": "myRdsSelectQuery",
      "type": "String"
    }
  ],
  "values": {
    "myRdsSelectQuery": "Select Query"
  }
}
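If you drive the pipeline programmatically, the value for such a parameter can be supplied at activation time. A minimal boto3 sketch, assuming a hypothetical pipeline ID and the myRdsSelectQuery parameter defined above:

import boto3

# Region and pipeline id are illustrative assumptions; replace with your own.
datapipeline = boto3.client("datapipeline", region_name="us-east-1")

datapipeline.activate_pipeline(
    pipelineId="df-0123456789EXAMPLE",  # hypothetical pipeline id
    parameterValues=[
        {
            "id": "myRdsSelectQuery",
            "stringValue": "SELECT * FROM my_table WHERE load_date = '2020-01-01'",
        }
    ],
)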
If you want to execute and schedule multiple SQL scripts, you can do that with a ShellCommandActivity, as sketched below.
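For illustration, the command run by such a ShellCommandActivity could be a small helper script that splits the .sql file and executes each statement in turn. This is only a rough sketch under stated assumptions: psycopg2 is installed on the resource, the Redshift connection details and the file name queries.sql are hypothetical, and statements are split naively on ';' (so it won't handle semicolons inside string literals or stored procedures):

import psycopg2

# Assumed connection details; in a real pipeline these would come from
# environment variables or pipeline parameters.
conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="etl_user",
    password="...",  # placeholder
)
conn.autocommit = True

# Read the multi-statement script and split it into individual statements.
with open("queries.sql") as f:
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

# Execute each statement in order.
with conn.cursor() as cur:
    for stmt in statements:
        cur.execute(stmt)

conn.close()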

I managed to execute a script with multiple INSERT statements with the following AWS Data Pipeline configuration:
{
  "id": "ExecuteSqlScript",
  "name": "ExecuteSqlScript",
  "type": "SqlActivity",
  "scriptUri": "s3://mybucket/inserts.sql",
  "database": { "ref": "rds_mysql" },
  "runsOn": { "ref": "Ec2Instance" }
},
{
  "id": "rds_mysql",
  "name": "rds_mysql",
  "type": "JdbcDatabase",
  "username": "#{myUsername}",
  "*password": "#{*myPassword}",
  "connectionString": "#{myConnStr}",
  "jdbcDriverClass": "com.mysql.jdbc.Driver",
  "jdbcProperties": ["allowMultiQueries=true", "zeroDateTimeBehavior=convertToNull"]
},
It is important to allow the MySQL driver to execute multiple queries with allowMultiQueries=true; the S3 path of the script is provided by scriptUri.

Related

AWS QuickSight can't ingest CSV from S3 but the same data uploaded as a file works

I am new to QuickSight and was just test-driving it (on the QuickSight web console; I'm not using the command line anywhere in this) with some data (which I can't share, as it's confidential business info). I have a strange issue: when I create a dataset by uploading the file, which is only 50 MB, it works fine; I can see a preview of the table and I am able to proceed to the visualization. But when I upload the same file to S3, make a manifest, and submit it using the 'use S3' option in the create-dataset window, I get the INCORRECT_FIELD_COUNT error.
Here's the manifest file:
{
  "fileLocations": [
    {
      "URIs": [
        "s3://testbucket/analytics/mydata.csv"
      ]
    },
    {
      "URIPrefixes": [
        "s3://testbucket/analytics/"
      ]
    }
  ],
  "globalUploadSettings": {
    "format": "CSV",
    "delimiter": ",",
    "containsHeader": "true"
  }
}
I know the data is not fully structured, with some rows where a few columns are missing, but how is it possible for QuickSight to automatically infer the schema and put NULLs into the shorter rows when the file is uploaded from my local machine, but not when it is read from S3 with the manifest? Are there some different settings that I'm missing?
I'm getting the same thing - looks like this is fairly new code. It'd be useful to know what the expected field count is, especially as it doesn't say if it's too few or too many (both are wrong). One of those technologies that looks promising, but I'd say there's a little maturing required.
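This doesn't explain the difference in inference, but a quick local scan of the CSV can at least pinpoint which rows have an unexpected field count before handing the file to QuickSight via S3. A minimal sketch (standard library only; the file name is assumed):

import csv

# Report rows whose column count differs from the header's.
with open("mydata.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    expected = len(header)
    for lineno, row in enumerate(reader, start=2):
        if len(row) != expected:
            print(f"line {lineno}: {len(row)} fields (expected {expected})")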

AWS Glue Crawler updating existing catalog tables is (painfully) slow

I am continuously receiving and storing multiple feeds of uncompressed JSON objects, partitioned by day, in different locations of an Amazon S3 bucket (Hive-style: s3://bucket/object=<object>/year=<year>/month=<month>/day=<day>/object_001.json), and was planning to incrementally batch and load this data into a Parquet data lake using AWS Glue:
Crawlers would update manually created Glue tables, one per object feed, for schema and partition (new files) updates;
Glue ETL Jobs + Job bookmarking would then batch and map all new partitions per object feed to a Parquet location now and then.
This design pattern & architecture seemed to be quite a safe approach as it was backed up by many AWS blogs, here and there.
I have a crawler configured as so:
{
  "Name": "my-json-crawler",
  "Targets": {
    "CatalogTargets": [
      {
        "DatabaseName": "my-json-db",
        "Tables": [
          "some-partitionned-json-in-s3-1",
          "some-partitionned-json-in-s3-2",
          ...
        ]
      }
    ]
  },
  "SchemaChangePolicy": {
    "UpdateBehavior": "UPDATE_IN_DATABASE",
    "DeleteBehavior": "LOG"
  },
  "Configuration": "{\"Version\":1.0,\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
}
And each table was "manually" initialized like so:
{
  "Name": "some-partitionned-json-in-s3-1",
  "DatabaseName": "my-json-db",
  "StorageDescriptor": {
    "Columns": [],  # I'd like the crawler to figure this out on its first crawl
    "Location": "s3://bucket/object=some-partitionned-json-in-s3-1/"
  },
  "PartitionKeys": [
    {
      "Name": "year",
      "Type": "string"
    },
    {
      "Name": "month",
      "Type": "string"
    },
    {
      "Name": "day",
      "Type": "string"
    }
  ],
  "TableType": "EXTERNAL_TABLE"
}
The first run of the crawler is, as expected, an hour-ish long, but it successfully figures out the table schema and the existing partitions. Yet from that point onward, re-running the crawler takes exactly as long as the first crawl, if not longer, which led me to believe that the crawler is not only crawling for new files/partitions, but re-crawling the entire S3 locations each time.
Note that the delta of new files between two crawls is very small (only a few new files are expected each time).
The AWS documentation suggests running multiple crawlers, but I am not convinced that this would solve my problem in the long run. I also considered updating the crawler's exclude patterns after each run, but then I would see too few advantages to using crawlers over manually updating table partitions through some Lambda boto3 magic.
Am I missing something here? Maybe an option I have misunderstood regarding crawlers updating an existing data catalog rather than crawling the data stores directly?
Any suggestions to improve my data cataloging? Indexing these JSON files in Glue tables is only necessary to me because I want my Glue job to use bookmarking.
Thanks!
AWS Glue Crawlers now support Amazon S3 event notifications natively, to solve this exact problem.
See the blog post.
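For reference, a hedged boto3 sketch of what an event-mode crawler definition could look like (the role, SQS queue ARN, and names are assumptions for illustration; check the current Glue API documentation for the exact fields):

import boto3

glue = boto3.client("glue")

# All names and ARNs below are placeholders.
glue.create_crawler(
    Name="my-json-crawler-events",
    Role="arn:aws:iam::123456789012:role/my-glue-crawler-role",
    DatabaseName="my-json-db",
    Targets={
        "S3Targets": [
            {
                "Path": "s3://bucket/object=some-partitionned-json-in-s3-1/",
                # SQS queue receiving the S3 event notifications
                "EventQueueArn": "arn:aws:sqs:eu-west-1:123456789012:my-crawler-events",
            }
        ]
    },
    # Crawl only what the S3 events point at, instead of the full prefix.
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVENT_MODE"},
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)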
Still getting some hits on this unanswered question of mine, so I wanted to share a solution I found adequate at the time: I ended up not using crawlers at all to incrementally update my Glue tables.
Using S3 Events / S3 API calls via CloudTrail / S3 EventBridge notifications (pick one), I ended up writing a Lambda which issues an ALTER TABLE ADD PARTITION DDL query on Athena, updating an already existing Glue table with the newly created partition based on the S3 key prefix. This is a pretty straightforward and low-code approach to maintaining Glue tables, in my opinion; the only downsides are handling service throttling (both Lambda and Athena) and failed queries, so as to avoid any loss of data in the process. A rough sketch of such a Lambda follows.
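A minimal sketch under stated assumptions: the Lambda is triggered by S3 event notifications, the objects follow the Hive-style key layout from the question, and the database name and Athena result location are hypothetical; retries, URL-decoding of keys, and throttling handling are omitted:

import re
import boto3

athena = boto3.client("athena")

# Assumed names; adjust to your catalog and query-result bucket.
DATABASE = "my-json-db"
OUTPUT = "s3://my-athena-results/ddl/"
KEY_RE = re.compile(
    r"object=(?P<table>[^/]+)/year=(?P<year>\d+)/month=(?P<month>\d+)/day=(?P<day>\d+)/"
)

def handler(event, context):
    # One record per newly created S3 object (keys may need URL-decoding).
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        m = KEY_RE.search(key)
        if not m:
            continue
        table, year, month, day = m["table"], m["year"], m["month"], m["day"]
        ddl = (
            f"ALTER TABLE `{table}` ADD IF NOT EXISTS "
            f"PARTITION (year='{year}', month='{month}', day='{day}') "
            f"LOCATION 's3://{bucket}/object={table}/year={year}/month={month}/day={day}/'"
        )
        # Fire the DDL at Athena; the Glue table gains the new partition.
        athena.start_query_execution(
            QueryString=ddl,
            QueryExecutionContext={"Database": DATABASE},
            ResultConfiguration={"OutputLocation": OUTPUT},
        )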
This solution scales up pretty well though, as parallel DDL queries per account is a soft-limit quota that can be increased as your need for updating more and more tables increases; and works pretty well for non-time-critical workflows.
It works even better if you limit S3 writes to your Glue tables' S3 partitions (one file per Glue table partition is ideal in this particular implementation) by batching your data, using a Kinesis Data Firehose delivery stream for example.

Serverless Framework for AWS: Adding initial data into a DynamoDB table

Currently I am using Serverless framework for deploying my applications on AWS.
https://serverless.com/
Using the serverless.yml file, we create the DynamoDB tables which are required for the application. These tables are accessed from the lambda functions.
When the application is deployed, I want a few of these tables to be loaded with an initial set of data.
Is this possible?
Can you provide me some pointers for inserting this initial data?
Is this possible with AWS SAM?
I don't know if there's a specific way to do this in Serverless; however, you can just add a call to the AWS CLI like this to your build pipeline:
aws dynamodb batch-write-item --request-items file://initialdata.json
Where initialdata.json looks something like this:
{
  "Forum": [
    {
      "PutRequest": {
        "Item": {
          "Name": {"S": "Amazon DynamoDB"},
          "Category": {"S": "Amazon Web Services"},
          "Threads": {"N": "2"},
          "Messages": {"N": "4"},
          "Views": {"N": "1000"}
        }
      }
    },
    {
      "PutRequest": {
        "Item": {
          "Name": {"S": "Amazon S3"},
          "Category": {"S": "Amazon Web Services"}
        }
      }
    }
  ]
}
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SampleData.LoadData.html
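If you prefer seeding the data from a script (for example a post-deploy hook) rather than the raw CLI, the same request-items file can be passed to boto3; a minimal sketch:

import json
import boto3

dynamodb = boto3.client("dynamodb")

# The file has the same shape as the CLI's --request-items payload
# (note the 25-item limit per batch_write_item call).
with open("initialdata.json") as f:
    request_items = json.load(f)

response = dynamodb.batch_write_item(RequestItems=request_items)

# Under throttling, some items may come back unprocessed; a production
# script would retry these.
if response.get("UnprocessedItems"):
    print("Some items were not written:", response["UnprocessedItems"])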
A more Serverless Framework-native option is to use a tool like the serverless-plugin-scripts plugin, which lets you hook your own CLI commands into the deploy process:
https://github.com/mvila/serverless-plugin-scripts

Error when updating a Power BI workspace collection from an ARM template

We have deployed a Power BI Embedded workspace collection with the following really simple ARM template:
{
  "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {},
  "variables": {},
  "resources": [
    {
      "comments": "Test Power BI workspace collection",
      "apiVersion": "2016-01-29",
      "type": "Microsoft.PowerBI/workspaceCollections",
      "location": "westeurope",
      "sku": {
        "name": "S1",
        "tier": "Standard"
      },
      "name": "myTestPowerBiCollection",
      "tags": {
        "displayName": "Test Power BI workspace collection"
      }
    }
  ],
  "outputs": {}
}
For deployment we used the well-known PowerShell command New-AzureRmResourceGroupDeployment. After the creation, if we try to execute the command again, it fails with the following message:
New-AzureRmResourceGroupDeployment : Resource Microsoft.PowerBI/workspaceCollections 'myTestPowerBiCollection' failed with message
{
  "error": {
    "code": "BadRequest",
    "message": ""
  }
}
If we delete the collection and execute the deployment again, it succeeds without a problem. I tried both options for the -Mode parameter (Incremental and Complete) and it didn't help, even though Incremental is the default.
This is a major issue for us, as we want to provision the collection as part of our Continuous Delivery pipeline and we execute this deployment several times.
Any ideas on how to bypass this problem?
As you mentioned, if the Power BI workspace collection name already exists, deploying the Power BI workspace collection again throws an exception.
If it is possible to add custom logic, we could use Get-AzureRmPowerBIWorkspaceCollection to check whether the Power BI workspace collection exists: if it does, the cmdlet returns a workspace collection object; otherwise it throws a 'not found' exception.
We could also use the Remove-AzureRmPowerBIWorkspaceCollection command to remove the Power BI workspace collection. So if the workspace collection already exists, we could either skip the deployment or delete and recreate it, according to our own logic.

Want to server-side encrypt S3 data node file created by ShellCommandActivity

I created a ShellCommandActivity with stage = "true". The shell command creates a new file and stores it in ${OUTPUT1_STAGING_DIR}. I want this new file to be server-side encrypted in S3.
According to the documentation, all files created in an S3 data node are server-side encrypted by default. But after my pipeline completes, an unencrypted file gets created in S3. I tried setting s3EncryptionType to SERVER_SIDE_ENCRYPTION explicitly on the S3DataNode, but that doesn't help either. I want this new file to be encrypted.
Here is relevant part of pipeline:
{
  "id": "DataNodeId_Fdcnk",
  "schedule": {
    "ref": "DefaultSchedule"
  },
  "directoryPath": "s3://my-bucket/test-pipeline",
  "name": "s3DataNode",
  "s3EncryptionType": "SERVER_SIDE_ENCRYPTION",
  "type": "S3DataNode"
},
{
  "id": "ActivityId_V1NOE",
  "schedule": {
    "ref": "DefaultSchedule"
  },
  "name": "FileGenerate",
  "command": "echo 'This is a test' > ${OUTPUT1_STAGING_DIR}/foo.txt",
  "workerGroup": "my-worker-group",
  "output": {
    "ref": "DataNodeId_Fdcnk"
  },
  "type": "ShellCommandActivity",
  "stage": "true"
}
Short answer: Your pipeline definition looks correct. You need to ensure you're running the latest version of the Task Runner. I will try to reproduce your issue and let you know.
P.S. Let's keep the conversation within a single thread, either here or in the AWS Data Pipeline forums, to avoid confusion.
Answer on official AWS Data Pipeline Forum page
This issue was resolved when I downloaded the new TaskRunner-1.0.jar; I was running an older version.
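For anyone verifying the same fix, one quick way to confirm whether the staged object actually ended up server-side encrypted is to inspect its metadata with boto3 (bucket and key derived from the data node and command above):

import boto3

s3 = boto3.client("s3")

# directoryPath s3://my-bucket/test-pipeline plus the foo.txt written by the activity.
head = s3.head_object(Bucket="my-bucket", Key="test-pipeline/foo.txt")

# 'ServerSideEncryption' (e.g. 'AES256' or 'aws:kms') is only present if SSE was applied.
print(head.get("ServerSideEncryption", "not encrypted"))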