AWS QuickSight: How to get the latest data from S3

My QuickSight dataset currently ingests everything in the S3 bucket
(S3 sample) https://i.stack.imgur.com/cO8kL.png
However, the bucket's folder names change based on the date (01/, 02/, 03/, and so on). Is there a way to take only the latest data rather than all of it?
This is my current manifest:
{
    "fileLocations": [
        {
            "URIPrefixes": [
                "https://sample-S3bucket.amazonaws.com/"
            ]
        }
    ],
    "globalUploadSettings": {
        "format": "JSON"
    }
}
There might be a simple solution that I'm not aware of.

You could point QuickSight at a separate staging bucket and set up a Lambda function that is triggered whenever a new file is uploaded to your existing bucket. This Lambda would (sketched below):
Remove any files from the bucket QuickSight reads from
Copy the latest file into that bucket
Create a QuickSight SPICE ingestion via the API
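A minimal sketch of such a Lambda covering those three steps, assuming the staging bucket, data set ID, and account ID come from environment variables (all names here are placeholders, not values from the question):

# Rough sketch only: bucket names, data set ID, and account ID are placeholders
# supplied via environment variables.
import os
import uuid
import boto3

s3 = boto3.client("s3")
quicksight = boto3.client("quicksight")

STAGING_BUCKET = os.environ["STAGING_BUCKET"]   # the bucket QuickSight reads from
DATA_SET_ID = os.environ["DATA_SET_ID"]         # the QuickSight data set to refresh
ACCOUNT_ID = os.environ["ACCOUNT_ID"]

def handler(event, context):
    # The object that triggered this Lambda (from the S3 event notification)
    record = event["Records"][0]["s3"]
    source_bucket = record["bucket"]["name"]
    source_key = record["object"]["key"]

    # 1. Remove the previous files from the staging bucket
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=STAGING_BUCKET):
        for obj in page.get("Contents", []):
            s3.delete_object(Bucket=STAGING_BUCKET, Key=obj["Key"])

    # 2. Copy the latest file into the staging bucket
    s3.copy_object(
        Bucket=STAGING_BUCKET,
        Key=source_key,
        CopySource={"Bucket": source_bucket, "Key": source_key},
    )

    # 3. Start a SPICE ingestion so QuickSight picks up the new data
    quicksight.create_ingestion(
        AwsAccountId=ACCOUNT_ID,
        DataSetId=DATA_SET_ID,
        IngestionId=str(uuid.uuid4()),
    )

The SPICE ingestion at the end is what makes QuickSight pick up the new file right away instead of waiting for the next scheduled refresh.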

Related

How to copy data from Amazon S3 to DDB using AWS Glue

I am following the AWS documentation on how to transfer a DDB table from one account to another. There are two steps:
Export DDB table into Amazon S3
Use a Glue job to read the files from the Amazon S3 bucket and write them to the target DynamoDB table
I was able to do the first step. Unfortunately, the instructions don't say how to do the second step. I have worked with Glue a couple of times, but the console UI is very user-unfriendly and I have no idea how to achieve this.
Can somebody please explain how to import the data from S3 into DDB?
You could use Glue Studio to generate a script.
Log into AWS
Go to Glue
Go to Glue Studio
Set up the source, basically pointing it at S3
Then use something like the code below. This example is for a DynamoDB table with pk and sk as a composite primary key; it just applies the mapping to a DynamicFrame and writes it to DynamoDB (a sketch of the surrounding job boilerplate follows the snippet):
# Keep only the key attributes from the exported items
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[
        ("Item.pk.S", "string", "Item.pk.S", "string"),
        ("Item.sk.S", "string", "Item.sk.S", "string"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Write the mapped DynamicFrame to the target DynamoDB table
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": "my-target-table"},
)
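For completeness, a rough sketch of the Glue job boilerplate that would produce S3bucket_node1 used above; the S3 path is a placeholder, and Glue Studio generates something very similar when you point the source at your export bucket:

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the DynamoDB export files from S3 (placeholder path)
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="json",
    connection_options={"paths": ["s3://my-export-bucket/AWSDynamoDB/data/"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)

# ... ApplyMapping and write_dynamic_frame calls from the snippet above go here ...

job.commit()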

Terraform for AWS: How to have multiple events per S3 bucket filtered on object path?

My understanding is that when configuring an S3 bucket notification with Terraform we can only configure a single notification per S3 bucket:
NOTE: S3 Buckets only support a single notification configuration. Declaring multiple aws_s3_bucket_notification resources to the same S3 Bucket will cause a perpetual difference in configuration. See the example "Trigger multiple Lambda functions" for an option.
The application uses a single S3 bucket as a data repository, i.e. when JSON files land there they trigger a lambda which submits a corresponding batch job to ingest from the file into a database.
This works well when we have a single developer deploying the infrastructure, but with multiple developers, each time one of us runs terraform apply it updates the single notification for the bucket, overwriting the previous settings.
What is the best practice for using S3 buckets for notifications? Are they best configured/created per Terraform workspace? How should the buckets be managed so that multiple developers can stand infrastructure up and down against a common S3 bucket via terraform apply? Must you use one bucket per workspace for this use case, as suggested by the docs?
The current Terraform I have for the S3 notification (the code that allows for overwriting with the latest configuration):
data "aws_s3_bucket" "default" {
bucket = var.bucket
}
resource "aws_lambda_permission" "allow_bucket_execution" {
statement_id = "AllowExecutionFromS3Bucket"
action = "lambda:InvokeFunction"
function_name = var.lambda_function_name
principal = "s3.amazonaws.com"
source_arn = data.aws_s3_bucket.default.arn
}
resource "aws_s3_bucket_notification" "bucket_notification" {
bucket = data.aws_s3_bucket.default.bucket
lambda_function {
lambda_function_arn = var.lambda_function_arn
events = ["s3:ObjectCreated:*"]
filter_prefix = var.namespace
filter_suffix = ".json"
}
}
The namespace variable is passed in as "${local.env}-${terraform.workspace}", with local.env as "dev", "uat", "prod", etc.
How can we modify the Terraform code above to allow for multiple notifications per S3 bucket (essentially one per Terraform workspace), or can it just not be done? If not then how is this best handled? Should I just use a bucket per workspace using a namespace variable like above as the S3 bucket name, and have it updated accordingly to the production bucket at deployment?
There are several options, depending on your needs:
Create one bucket per env and workspace. Then the mentioned limitation of Terraform's aws_s3_bucket_notification should no longer be an issue. I could imagine that the process you use to write to your bucket will then still only write to the one bucket you specify. To solve this, you could think about forwarding any objects uploaded to one "master" bucket to all the other buckets (either with a Lambda, itself triggered by an aws_s3_bucket_notification, or possibly by bucket replication).
Create one bucket per env and deploy the aws_s3_bucket_notification without workspaces. Then you no longer have the advantages of workspaces, but this might be a reasonable compromise between the number of buckets and usability.
Keep just this one bucket, keep envs and workspaces, but deploy the aws_s3_bucket_notification resource only once (probably together with the bucket). Then this one aws_s3_bucket_notification resource would need to include the rules for all environments and workspaces.
It really depends on your situation what fits best. If those aws_s3_bucket_notification rules rarely change at all, and most changes happen in the Lambda function, the last option might be the best. If you regularly want to change the aws_s3_bucket_notification and the events it listens for, one of the other options might be more suitable.

AWS Glue - boto3 crawler not creating table

I am trying to create and run an AWS Glue crawler through the boto3 library. The crawler runs against JSON files in an S3 folder. The crawler completes successfully, and when I check the logs there are no errors, but it doesn't create any table in my Glue database.
It's not a permissions issue, as I am able to create the same crawler through a CFT, and when I run that it creates the table as expected. I'm using the same role in my boto3 code as in my CFT.
I have tried using boto3 create_crawler() and start_crawler(). I also tried using boto3 update_crawler() on the crawler created from the CFT and updating the S3 target path.
response = glue.create_crawler(
    Name='my-crawler',
    Role='my-role-arn',
    DatabaseName='glue_database',
    Description='Crawler for generating table from s3 target',
    Targets={
        'S3Targets': [
            {
                'Path': s3_target
            }
        ]
    },
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'
    },
    TablePrefix=''
)
Are you sure you passed the correct region to the Glue client (when creating the glue object)?
I once copied code and forgot to change the region, then spent hours figuring out why no table was created even though there were no errors. Eventually I discovered the table was being created in another region, because I had forgotten to change the region when I copied the code.
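To rule that out, it can help to pin the region explicitly when creating the Glue client, start the crawler, and then check where it actually ran. A small sketch (region, names, and target path are placeholders):

import boto3

# Pin the region explicitly so the crawler and its table land where you expect
glue = boto3.client('glue', region_name='us-east-1')

s3_target = 's3://my-bucket/my-json-data/'  # placeholder

glue.create_crawler(
    Name='my-crawler',
    Role='my-role-arn',
    DatabaseName='glue_database',
    Targets={'S3Targets': [{'Path': s3_target}]},
)
glue.start_crawler(Name='my-crawler')

# Once the run finishes, LastCrawl shows the status in that same region
print(glue.get_crawler(Name='my-crawler')['Crawler'].get('LastCrawl'))

If get_crawler in your intended region raises an EntityNotFoundException, that is a strong hint the crawler was created somewhere else.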

AWS ImportImage operation: S3 bucket does not exist

I'm trying to import a Windows 2012 OVA file into AWS. I'm using this documentation:
AWS VMWare Import
I've created an S3 bucket to store the OVA files, and the OVA files have been uploaded there.
When I try to import the image files into AWS, I get an error:
aws ec2 import-image --description "server1" --disk-containers file://containers.json --profile=company-dlab_us-east-1
An error occurred (InvalidParameter) when calling the ImportImage operation: S3 bucket does not exist: s3://companyvmimport/
Which is strange because I can list the bucket I'm trying to upload to using the aws command line:
aws s3 ls --profile=company-dlab_us-east-1
2016-10-20 09:52:33 companyvmimport
This is my containers.json file:
[
    {
        "Description": "server1",
        "Format": "ova",
        "UserBucket": {
            "S3Bucket": "s3://companyvmimport/",
            "S3Key": "server1.ova"
        }
    }
]
Where am I going wrong? How can I get this to work?
You don't need to include the s3:// protocol in the JSON; the S3Bucket field takes just the bucket name.
Your JSON file should look like:
[
    {
        "Description": "server1",
        "Format": "ova",
        "UserBucket": {
            "S3Bucket": "companyvmimport",
            "S3Key": "server1.ova"
        }
    }
]
If the file is not right at the "root" of the bucket, you need to indicate the full path in S3Key.
I think the comment on the answer, about setting a custom policy on the S3 bucket, contains JSON that may grant access to everyone (which may not be what's desired).
If you add a Principal statement to the JSON, you can limit the access to just yourself.
I ran into this same issue (User does not have access to the S3 object) and, after fighting it for a while, finally figured it out with the help of this post and some further research. I opened a separate post specifically for this "does not have access" issue.

How can you update the storage class for all items in an S3 bucket?

I've found myself wishing I could easily change all items in a bucket to a particular storage class on S3. Often this is because items were uploaded in Standard, and I want them in Reduced Redundancy to save a few bucks.
I don't see a way to do this through the AWS Console.
What is the best way to update all the files in a bucket?
Use the awscli pip package
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
Example:
aws s3 cp \
--storage-class STANDARD_IA \
--region='us-west-2' \
--recursive \
s3://myBucket/logs/ s3://myBucket/logs/
As you are talking about the entire bucket, I think the best way to do this would be to create a lifecycle rule. That can be done through the console:
Open the AWS console, go to the relevant bucket and click on "Management".
Click on "Lifecycle" and then on "Add Lifecycle rule".
Give it a name and click on next.
Select whether you want it to run on current versions or on previous versions (if versioning is configured for this bucket), and then select "One Zone-IA".
Click on next three times, and you're done.
Or you could do it through the AWS CLI:
Create a JSON file called lifecycle.json:
{
    "Rules": [
        {
            "ID": "Move all objects to one zone infrequent access",
            "Prefix": "",
            "Status": "Enabled",
            "Transitions": [
                {
                    "Days": 30,
                    "StorageClass": "ONEZONE_IA"
                }
            ]
        }
    ]
}
and then run:
aws s3api put-bucket-lifecycle-configuration --bucket <Bucket name> --lifecycle-configuration file://./lifecycle.json
In the transition I used 30 days because that is currently the minimum time an object must exist before it can be transitioned to One Zone-IA.
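If you'd rather make the same call from Python, a minimal boto3 sketch (the bucket name is a placeholder; Filter replaces the deprecated top-level Prefix shown in the CLI JSON above):

import boto3

s3 = boto3.client("s3")

# Same rule as lifecycle.json above, applied to a placeholder bucket
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "Move all objects to one zone infrequent access",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "ONEZONE_IA"}],
            }
        ]
    },
)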
There is no way to do this via the AWS Console. You will need to iterate over the objects and update the metadata on each one.
Here's a ruby script that does just that:
https://gist.github.com/mcfadden/b1e564f3323f98720ff2
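If you'd rather not use Ruby, a roughly equivalent boto3 sketch that copies each object onto itself with a new storage class (bucket name and target class are placeholders; note that copy_object only handles objects up to 5 GB):

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"                  # placeholder
target_class = "REDUCED_REDUNDANCY"   # or STANDARD_IA, ONEZONE_IA, ...

# Copying an object onto itself with a new StorageClass rewrites it in place
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        if obj.get("StorageClass", "STANDARD") != target_class:
            s3.copy_object(
                Bucket=bucket,
                Key=obj["Key"],
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
                StorageClass=target_class,
            )

Unlike the lifecycle approach above, this takes effect immediately, but it rewrites every object, so it can be slow and costly on large buckets.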
A few other thoughts:
Set the correct storage class on object creation. You won't want to loop through all the items again.
Some Storage Classes aren't available for all objects. For example, you can't set objects to the Standard - Infrequent Access class until they've been in the bucket for 30 days.
If you are trying to use the Standard - Infrequent Access storage class, you can set up a lifecycle rule to automatically move objects to this storage class after 30 days.