How to create an Amazon S3 job to move big files - amazon-web-services

I need to copy a file from one folder to another inside a single Amazon S3 bucket. However, due to the file size, I can't simply call the copyObject method of the AWS SDK S3 class, since it times out my Lambda function.
That's why I'm trying to create an S3 Batch Operations job to move this file, but I'm getting an Invalid job operation error when I try. I'm using the AWS SDK S3Control class and invoking the createJob method, passing this object as the parameter:
{
  AccountId: '445084039568',
  Manifest: {
    Location: {
      ETag: 'dbe4a392892992491a7124c10f2fbf03',
      ObjectArn: 'arn:aws:s3:::amsp-media-bucket/manifest.csv'
    },
    Spec: {
      Format: 'S3BatchOperations_CSV_20180820',
      Fields: ['Bucket', 'Key']
    }
  },
  Operation: {
    S3PutObjectCopy: {
      TargetResource: 'arn:aws:s3:::amsp-media-bucket/bigtest'
    }
  },
  Report: {
    Enabled: false
  },
  Priority: 10,
  RoleArn: 'arn:aws:iam::445084039568:role/mehoasumsp-sandbox-asumspS3JobRole-64XWYA3CFZF3'
}
To be honest, I'm not sure if I'm specifying the manifest correctly. This is the manifest.csv content:
amsp-media-bucket, temp/37766a92-16ef-4ee2-8e79-3875679dad85.mkv
My doubt is not about the file itself but about the way I define the Spec property in the parameter object.

Single quotes might not be valid in the job spec JSON. I have only seen double quotes.
In boto3 (the Python SDK), using the managed .copy() function instead of .copy_object(), and tuning multipart_chunksize and the concurrency settings, the multiple UploadPartCopy requests may well complete within the Lambda runtime limit. The AWS JS SDK appears to lack an equivalent managed function; you may want to try something like https://github.com/Zooz/aws-s3-multipart-copy
As John Rotenstein said, beware the space in the object key in your CSV file.
S3PutObjectCopy S3 Batch Operation jobs use CopyObject, which has a size limit of 5GiB.
On top of the per-operation costs, S3 Batch Operations jobs cost $0.25 each, which can be expensive if you are only copying a small number of objects.
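If you do stay inside Lambda, the multipart-copy route mentioned above boils down to CreateMultipartUpload, a series of UploadPartCopy calls with byte ranges, and a final CompleteMultipartUpload. Below is a minimal sketch using the AWS SDK for JavaScript v2; the part size is an arbitrary choice, the parts are copied sequentially for clarity (they can be issued concurrently), and retries and error handling are omitted:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Copy a large object within one bucket via multipart UploadPartCopy.
// The 128 MiB part size is illustrative; URL-encode keys that contain special characters.
async function multipartCopy(bucket, sourceKey, targetKey) {
  const { ContentLength } = await s3.headObject({ Bucket: bucket, Key: sourceKey }).promise();
  const partSize = 128 * 1024 * 1024;
  const { UploadId } = await s3.createMultipartUpload({ Bucket: bucket, Key: targetKey }).promise();

  const parts = [];
  for (let start = 0, partNumber = 1; start < ContentLength; start += partSize, partNumber++) {
    const end = Math.min(start + partSize, ContentLength) - 1;
    const { CopyPartResult } = await s3.uploadPartCopy({
      Bucket: bucket,
      Key: targetKey,
      UploadId,
      PartNumber: partNumber,
      CopySource: `${bucket}/${sourceKey}`,
      CopySourceRange: `bytes=${start}-${end}`
    }).promise();
    parts.push({ ETag: CopyPartResult.ETag, PartNumber: partNumber });
  }

  await s3.completeMultipartUpload({
    Bucket: bucket,
    Key: targetKey,
    UploadId,
    MultipartUpload: { Parts: parts }
  }).promise();
}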

Related

Evaluate AWS CDK Stack output to another Stack in different account

I am creating two Stacks using AWS CDK. I use the first Stack to create an S3 bucket and upload a Lambda zip file to the bucket using the BucketDeployment construct, like this.
//FirstStack
const deployments = new BucketDeployment(this, 'LambdaDeployments', {
  destinationBucket: bucket,
  destinationKeyPrefix: '',
  sources: [
    Source.asset(path)
  ],
  retainOnDelete: true,
  extract: false,
  accessControl: BucketAccessControl.PUBLIC_READ,
});
I use the second Stack just to generate a CloudFormation template for my clients. In the second Stack, I want to create a Lambda function whose code is referenced by the S3 bucket name and key of the Lambda zip I uploaded in the first Stack.
//SecondStack
const lambdaS3Bucket = "??"; //TODO
const lambdaS3Key = "??"; //TODO
const bucket = Bucket.fromBucketName(this, "Bucket", lambdaS3Bucket);
const lambda = new Function(this, "LambdaFunction", {
  handler: 'index.handler',
  runtime: Runtime.NODEJS_16_X,
  code: Code.fromBucket(
    bucket,
    lambdaS3Key
  ),
});
How do I reference these parameters automatically in the second Stack?
In addition to that, lambdaS3Bucket needs to take an AWS::Region parameter into account so that my clients can deploy it in any region (I just need to run the first Stack in the region they require).
How do I do that?
I had a similar use case to this one.
The very simple answer is to hardcode the values. The bucket name is obvious.
The lambdaS3Key you can look up in the synthesized template of the first stack.
A more complex answer is to use pipelines for this. I did this, and in the build step of the pipeline I extracted all lambdaS3Keys and exported them as environment variables, so in the second stack I could reuse them in the code, like:
code: Code.fromBucket(
  bucket,
  process.env.MY_LAMBDA_KEY
),
I see you are aware of this PR, because you are using the extract flag.
Knowing that, you can probably reuse this property for the Lambda key.
The problem of sharing the names between the stacks in different accounts nevertheless remains. My suggestion is to use pipelines and export the constants there in the different steps, but a local build script would also do the job. See the sketch below.
Do not forget to update the BucketPolicy and KeyPolicy if you use encryption, otherwise the customer account won't have access to the file.
You could also read about the AWS Service Catalog. It would probably be an easier way to share your CDK products with your customers (the CDK team is going to support out-of-the-box Lambda sharing down the line).
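To make the build-script route concrete, here is one hedged sketch: the first Stack surfaces the deployed object key as a stack output (the construct ID is illustrative, and objectKeys is assumed to be available on BucketDeployment when extract is false, per the PR referenced above); a build script can then read the output with aws cloudformation describe-stacks using credentials for the first account and feed it to the second Stack's synth as the MY_LAMBDA_KEY environment variable used in the snippet above.

// FirstStack (sketch): expose the key of the deployed Lambda zip as a stack output.
const { CfnOutput } = require('aws-cdk-lib');

new CfnOutput(this, 'LambdaObjectKey', {
  // objectKeys lists the keys written by BucketDeployment when extract is false;
  // if it is not available in your CDK version, look the key up in the synthesized template instead.
  value: deployments.objectKeys[0],
});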

AWS EventBridge: Add multiple event details to a target parameter

I created an EventBridge rule that triggers a SageMaker Pipeline when someone uploads a new file to an S3 bucket. As new input files become available, they will be uploaded to the bucket for processing. I'd like the pipeline to process only the uploaded file, so I thought to pass the S3 URL of the file as a parameter to the Pipeline. Since the full URL doesn't exist as a single field value in the S3 event, I was wondering if there is some way to concatenate multiple field values into a single parameter value that EventBridge will pass on to the target.
For example, I know the name of the uploaded file can be sent from EventBridge using $.detail.object.key and the bucket name can be sent using $.detail.bucket.name, so I'm wondering if I can somehow send both combined, passing something like s3://my-bucket/path/to/file.csv to the SageMaker Pipeline.
For what it's worth, I tried splitting the parameter into two (one being s3://bucket-name/ and the other being default_file.csv) when defining the pipeline, but got an error saying Pipeline variables do not support concatenation when combining the two into one.
The relevant pipeline step is:
step_transform = TransformStep(name="Name", transformer=transformer, inputs=TransformInput(data=variable_of_s3_path))
Input transformers manipulate the event payload that EventBridge sends to the target. Transforms consist of (1) an "input path" that maps substitution variable names to JSON-paths in the event and (2) a "template" that references the substitution variables.
Input path:
{
  "detail-bucket-name": "$.detail.bucket.name",
  "detail-object-key": "$.detail.object.key"
}
Input template that concatenates the S3 URL and outputs it along with the original event payload:
{
  "s3Url": "s3://<detail-bucket-name>/<detail-object-key>",
  "original": "$"
}
Define the transform in the EventBridge console by editing the rule: Rule > Select Targets > Additional Settings.
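The same transform can also be attached programmatically. Below is a hedged sketch using the AWS SDK for JavaScript v2; the rule name, target Id, and ARNs are placeholders, and a SageMaker Pipeline target may additionally need its SageMakerPipelineParameters configured:

const AWS = require('aws-sdk');
const eventbridge = new AWS.EventBridge();

// Attach a target to the existing rule with the input path / template shown above.
// Rule name, target Id, and ARNs are illustrative placeholders.
eventbridge.putTargets({
  Rule: 'new-s3-object-rule',
  Targets: [{
    Id: 'sagemaker-pipeline-target',
    Arn: 'arn:aws:sagemaker:us-east-1:123456789012:pipeline/my-pipeline',
    RoleArn: 'arn:aws:iam::123456789012:role/eventbridge-invoke-role',
    InputTransformer: {
      InputPathsMap: {
        'detail-bucket-name': '$.detail.bucket.name',
        'detail-object-key': '$.detail.object.key'
      },
      InputTemplate: '{"s3Url": "s3://<detail-bucket-name>/<detail-object-key>", "original": "$"}'
    }
  }]
}).promise().then(console.log).catch(console.error);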

How to download publicly available pdf and png files from S3 with AppSync

I'm fairly new to GraphQL and AWS AppSync, and I'm running into an issue downloading files (PDFs and PNGs) from a public S3 bucket via AWS AppSync. I've looked at dozens of tutorials and dug through a mountain of documentation, and I'm just not certain what's going on at this point. This may be nothing more than a misunderstanding about the nature of GraphQL or AppSync functionality, but I'm completely stumped.
For reference, I've heavily sourced from other posts like How to upload file to AWS S3 using AWS AppSync (specifically, from the suggestions by the accepted answer author), but none of the solutions (or the variations I've attempted) are working.
The Facts
S3 bucket is publicly accessible – i.e., included folders and files are not tied to individual users with Cognito credentials
Files are uploaded to S3 outside of AppSync (so there's no GraphQL mutation); it's a manual file upload
Schema works for all other queries and mutations
We are using AWS Cognito to authenticate users and queries
Abridged Schema and DynamoDB Items
Here's an abridged version of the relevant GraphQL schema types:
type MetroCard implements TripCard {
  id: ID!
  cardType: String!
  resIds: String!
  data: MetroData!
  file: S3Object
}

type MetroData implements DataType {
  sourceURL: String!
  sourceFileURL: String
  metroName: String!
}

type S3Object {
  bucket: String!
  region: String!
  key: String!
}
Metadata about the files is stored in DynamoDB and looks something like this:
{
  "data": {
    "metroName": "São Paulo Metro",
    "sourceFileURL": "http://www.metro.sp.gov.br/pdf/mapa-da-rede-metro.pdf",
    "sourceURL": "http://www.metro.sp.gov.br/en/your-trip/index.aspx"
  },
  "file": {
    "bucket": "test-images",
    "key": "some_folder/sub_folder/bra-sbgr-metro-map.pdf",
    "region": "us-east-1"
  },
  "id": "info/en/bra/sbgr/metro"
}
VTL Request/Response Resolvers
For our getMetroCard(id: ID!): MetroCard query, the mapping templates are pretty vanilla. The request template is a standard query on a DynamoDB table. The response template is a basic $util.toJson($ctx.result).
For the field-level resolver on MetroCard.file, we've attached a local data source with an empty {} payload for the request and the following for the response (see referenced link for reasoning):
$util.toJson($util.dynamodb.fromS3ObjectJson($context.source.file))
(We've played with this bit in a couple of ways, including simply returning $context.result, but no change.)
Results
All of the query fields resolve appropriately; however, the file field always returns null no matter what the field-level resolver is mapped to. Interestingly, I've noticed in the CloudWatch logs that the value of context.result does change from null to {} with the above mapping template.
Questions
Given the above, I have several questions:
Does AppSync file download require files to be uploaded to S3 with user credentials through a mutation with a complex object handler in order to make them retrievable?
What should a successful response look like in the AppSync console? I have no client implementation (like a React Native app) to test successful file downloads against. More directly, is it actually retrieving the files and I just don't know it? (Note: I did briefly test it with a React Native client, but nothing rendered, so I've been using the AppSync console returns as direction ever since.)
Does it make more sense to remove the file download process entirely from our schema? (I'm assuming the answers I need reveal that AppSync just wasn't built for file transfer like this, and so we'll need to rethink our approach.)
Update
I've started playing around with the data source for MetroCard.file per the suggestion in this recent post: https://stackoverflow.com/a/52142178/5989171. If I make the data source the same as the database storing the file metadata, I now get the error mentioned there, but that solution doesn't seem to be working for me. Specifically, I now get the following:
"message": "Value for field '$[operation]' not found."
Our Solution
For our use case, we've decided to go ahead and use the AWS Amplify Storage module as suggested here: https://twitter.com/presbaw/status/1040800650790002689. Despite that, I'm keeping this question open and unanswered, because I'm just genuinely curious about what I'm not understanding here, and I have a feeling I'm not the only one!
$util.toJson($util.dynamodb.fromS3ObjectJson($context.source.file))
You can only use this if your DynamoDB table saves the file field in the format: {"s3":{"key":"file.jpg","bucket":"bucket_name/folder","region":"us-east-1"}}
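In other words, the answer implies fromS3ObjectJson expects the file attribute to be stored as a JSON string in that {"s3": {...}} shape rather than as the plain map shown in the question's DynamoDB item. A hedged sketch of writing the item that way with the AWS SDK for JavaScript v2 (the table name is an illustrative placeholder; the values are taken from the item above):

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// Store `file` as the stringified S3 object JSON that
// $util.dynamodb.fromS3ObjectJson can unpack; table name is a placeholder.
docClient.put({
  TableName: 'MetroCards',
  Item: {
    id: 'info/en/bra/sbgr/metro',
    file: JSON.stringify({
      s3: {
        key: 'some_folder/sub_folder/bra-sbgr-metro-map.pdf',
        bucket: 'test-images',
        region: 'us-east-1'
      }
    })
  }
}).promise().catch(console.error);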

How to move files from one S3 bucket to another after some time?

How can I move a file from one S3 bucket to another after some time?
I checked this thread: Is it possible to automatically move objects from an S3 bucket to another one after some time?
But I don't want to use the Glacier option, because our files are really small. Is there another option?
EDIT:
The requirement is to mark files as invalid (there is a metadata table where we change an attribute for this) and later on to delete them (invalid meaning, e.g., older than some date, maybe 30 days). After that, these invalid files should be deleted after, say, 120 days.
Why move files at all?
Separating by business requirement (invalid vs. valid files) means we don't have to check this attribute again (whether a file is invalid or valid).
There are fewer files to iterate over if we want to iterate over valid files.
Important: the S3 PUT event of the new bucket can invoke a Lambda function, and this Lambda function can do other stuff, like changing the attribute (valid/invalid/deleted) in our DynamoDB table.
Yes, we could also forgo moving files, but I don't see a way to execute Lambda functions after a delay of 30 days.
Best regards
It is possible to schedule Lambda function execution using CloudWatch scheduled events.
The design should include:
An implementation of a Lambda function which identifies and moves the required objects in S3. A minimal sketch follows this list.
A scheduled-expression rule with the desired schedule. CloudWatch supports cron expressions, e.g. aws events put-rule --schedule-expression "cron(0 0 1 * ? *)" --name ObjectExpirationRule
Assigning the rule to the Lambda: aws events put-targets --rule ObjectExpirationRule --targets '[{"Id": "1", "Arn": "arn:aws:lambda:us-east-1:123456789012:function:MyLambdaFunction"}]'
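A hedged sketch of such a Lambda in Node.js with the AWS SDK v2; the bucket names and the 30-day threshold are placeholders, and pagination, error handling, and the DynamoDB attribute update are omitted for brevity:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Move objects older than MAX_AGE_DAYS from the "valid" bucket to the "invalid" bucket.
// Bucket names are illustrative placeholders.
const SOURCE_BUCKET = 'valid-files-bucket';
const TARGET_BUCKET = 'invalid-files-bucket';
const MAX_AGE_DAYS = 30;

exports.handler = async () => {
  const cutoff = Date.now() - MAX_AGE_DAYS * 24 * 60 * 60 * 1000;
  // listObjectsV2 returns at most 1000 keys per call; a real implementation would paginate.
  const listed = await s3.listObjectsV2({ Bucket: SOURCE_BUCKET }).promise();
  for (const obj of listed.Contents || []) {
    if (obj.LastModified.getTime() < cutoff) {
      await s3.copyObject({
        Bucket: TARGET_BUCKET,
        Key: obj.Key,
        CopySource: `${SOURCE_BUCKET}/${obj.Key}`
      }).promise();
      await s3.deleteObject({ Bucket: SOURCE_BUCKET, Key: obj.Key }).promise();
    }
  }
};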

Can I store config in memory and use it in AWS Lambda?

I have a Lambda function which listens to a DynamoDB stream and processes records for any update or insert in DynamoDB.
Currently this Lambda code has a list of variables which I want to convert to a config, since this list can change.
So I want my Lambda function to read this list from a config, but I don't want any network call, so I can't make a call to S3/DynamoDB every time. I want this config stored locally in memory.
I want to initialize the Lambda and, during this initialization, read the config from a table, store it in some variable, and use it in every invocation.
Can I do this?
I have my Lambda functions (Node.js) read static config from a YAML file. You could do the same with a JSON file as needed. The app also reads dynamic data in from S3 at run time, noting that this is not what you want to do.
This means I was able to move the variables out of the code as hard-coded values and have a separate config file that you can change pre-deployment, per environment, with CI tools or the like. It also means you can exclude your config from your version control if needed.
The only downside is that the config has to be uploaded with the Lambda function when you deploy it, so it's available with the other Lambda assets at run time. AFAIK you can't write back to the config during runtime.
You can see that in the project folder I have a config.yml. I'm using the Node.js module node-yaml-config to load the config file into memory each time the Lambda is instantiated. It doesn't require any network call either.
In the config file I have all the params I need:
# Handler runtime config set
default:
  sourceRssUri: http://www.sourcerss.com/rss.php?key=abcd1234
  bucket: myappbucket
  region: us-east-1
  dataKey: data/rssdata
  dataOutKey: data/rssdata
  rssKey: myrss.xml
I load the config in at runtime and can then reference any config item in my code by its key name. I just so happen to be using it for S3 operations here; you can do whatever.
const yaml_config = require("node-yaml-config");
const config = yaml_config.load(__dirname + "/config.yml");
const aws = require("aws-sdk");
const bbpromise = require("bluebird");
const s3 = bbpromise.promisifyAll(new aws.S3({}));

var params = {
  Bucket: config.bucket,
  Key: config.dataOutKey,
  Body: JSON.stringify(feed.entries),
  ContentType: "application/json"
};

s3.putObjectAsync(params).catch(SyntaxError, function(e) {
  console.log("Error: ", e);
}).catch(function(e) {
  console.log("Catch: ", e);
});
This makes it super easy to add new configuration for the lambda handler, as anything I add to config.yml such as myNewVariable is now available to reference in the handler as config.myNewVariable without any new work.
It allows the config to change per environment, or before each deployment. The config is then loaded before the handler and stored locally in memory during the period of the lambda execution.
No you can't. Lambda is stateless - you can't count on anything you read into memory on one invocation to be available to the next invocation. You will need to store your config information somewhere, and read it back in each time.