AWS S3 copy to bucket from remote location

There is a large dataset on a public server (~0.5TB, in multiple parts here) that I would like to copy into my own S3 bucket. It seems like aws s3 cp only works with local files or objects that are already in S3 buckets?
How can I copy that file (either as a single object or in multiple parts) into S3? Can I use the AWS CLI, or do I need something else?

There's no way to upload it directly to S3 from the remote location. But you can stream the contents of the remote files to your machine and then on to S3. This means you will still have downloaded the entire 0.5TB of data, but your computer will only ever hold a tiny fraction of it in memory at a time (it will not be persisted to disk either). Here is a simple implementation in JavaScript:
const request = require('request')
const async = require('async')
const AWS = require('aws-sdk')
const s3 = new AWS.S3()
const Bucket = 'nyu_depth_v2'
const baseUrl = 'http://horatio.cs.nyu.edu/mit/silberman/nyu_depth_v2/'
const parallelLimit = 5
const parts = [
'basements.zip',
'bathrooms_part1.zip',
'bathrooms_part2.zip',
'bathrooms_part3.zip',
'bathrooms_part4.zip',
'bedrooms_part1.zip',
'bedrooms_part2.zip',
'bedrooms_part3.zip',
'bedrooms_part4.zip',
'bedrooms_part5.zip',
'bedrooms_part6.zip',
'bedrooms_part7.zip',
'bookstore_part1.zip',
'bookstore_part2.zip',
'bookstore_part3.zip',
'cafe.zip',
'classrooms.zip',
'dining_rooms_part1.zip',
'dining_rooms_part2.zip',
'furniture_stores.zip',
'home_offices.zip',
'kitchens_part1.zip',
'kitchens_part2.zip',
'kitchens_part3.zip',
'libraries.zip',
'living_rooms_part1.zip',
'living_rooms_part2.zip',
'living_rooms_part3.zip',
'living_rooms_part4.zip',
'misc_part1.zip',
'misc_part2.zip',
'office_kitchens.zip',
'offices_part1.zip',
'offices_part2.zip',
'playrooms.zip',
'reception_rooms.zip',
'studies.zip',
'study_rooms.zip'
]
async.eachLimit(parts, parallelLimit, (Key, cb) => {
  s3.upload({
    Key,
    Bucket,
    Body: request(baseUrl + Key)
  }, cb)
}, (err) => {
  if (err) console.error(err)
  else console.log('Done')
})

Related

Can write to Google Cloud Storage from one machine but not another

I have a weird bug where I'm able to write to my Cloud Storage bucket from one machine but not another. I can't tell if the issue is Vercel or my configuration, but the app is deployed on Vercel, so it should behave the same no matter where I'm accessing it from.
upload.ts
export const upload = async (req: IncomingMessage, userId: string) => {
  const storage = new Storage({
    // credentials
  });
  const bucket = storage.bucket(process.env.GCS_BUCKET_NAME as string);
  const form = formidable();
  const { files } = await parseForm(form, req);
  const file = files.filepond;
  const { path } = file;
  const options = {
    destination: `products/${userId}/${file.name}`,
    preconditionOpts: {
      ifGenerationMatch: 0
    }
  };
  await bucket.upload(path, options);
};
Again, my app is deployed on Vercel and I'm able to upload images from my own machine, but I can't do it if I try from my phone or another PC/Mac. My Cloud Storage bucket is also public, so I should be able to read/write to it from anywhere. Any clues?

Upload image to Amazon S3 using @aws-sdk/client-s3 and get its location

I am trying to upload an image file to S3 but get this error:
ERROR: MethodNotAllowed: The specified method is not allowed against this resource.
My code uses the @aws-sdk/client-s3 package to upload, with this code:
const s3 = new S3({
  region: 'us-east-1',
  credentials: {
    accessKeyId: config.accessKeyId,
    secretAccessKey: config.secretAccessKey,
  }
});

exports.uploadFile = async options => {
  options.internalPath = options.internalPath || (`${config.s3.internalPath + options.moduleName}/`);
  options.ACL = options.ACL || 'public-read';
  logger.info(`Uploading [${options.path}]`);
  const params = {
    Bucket: config.s3.bucket,
    Body: fs.createReadStream(options.path),
    Key: options.internalPath + options.fileName,
    ACL: options.ACL
  };
  try {
    const s3Response = await s3.completeMultipartUpload(params);
    if (s3Response) {
      logger.info(`Done uploading, uploaded to: ${s3Response.Location}`);
      return { url: s3Response.Location };
    }
  } catch (err) {
    logger.error(err, 'unable to upload:');
    throw err;
  }
};
I am not sure what this error means, and once the file is uploaded I need to get its location in S3.
Thanks for any help.
For uploading a single image file you need to be calling s3.upload(), not s3.completeMultipartUpload().
If you had very large files and wanted to upload them in multiple parts, the workflow would look like the following (a rough sketch is shown after this list):
s3.createMultipartUpload()
s3.uploadPart()
s3.uploadPart()
...
s3.completeMultipartUpload()
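For reference, here is a rough sketch of that multipart flow (not from the original answer) using the aggregated S3 client from @aws-sdk/client-s3; the bucket, key and file path are placeholders, and 5MB is S3's minimum part size:
const { S3 } = require('@aws-sdk/client-s3');
const fs = require('fs');

const s3 = new S3({ region: 'us-east-1' });

async function multipartUpload(bucket, key, filePath, partSize = 5 * 1024 * 1024) {
  // 1. start the multipart upload and remember its UploadId
  const { UploadId } = await s3.createMultipartUpload({ Bucket: bucket, Key: key });
  // 2. upload each chunk as a numbered part and collect the ETags
  const buffer = fs.readFileSync(filePath);
  const parts = [];
  for (let i = 0, partNumber = 1; i < buffer.length; i += partSize, partNumber++) {
    const { ETag } = await s3.uploadPart({
      Bucket: bucket,
      Key: key,
      UploadId,
      PartNumber: partNumber,
      Body: buffer.slice(i, i + partSize)
    });
    parts.push({ ETag, PartNumber: partNumber });
  }
  // 3. tell S3 the upload is complete, listing every part
  return s3.completeMultipartUpload({
    Bucket: bucket,
    Key: key,
    UploadId,
    MultipartUpload: { Parts: parts }
  });
}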
Looking at the official documentation, it looks like the new way to do a simple S3 upload in the JavaScript SDK is this:
s3.send(new PutObjectCommand(uploadParams));
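Put together, a minimal sketch of that approach (assuming the same config object and options shape as in the question) could look like the following; note that PutObject does not return a Location field, so the object URL has to be built by hand:
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const fs = require('fs');

const client = new S3Client({
  region: 'us-east-1',
  credentials: {
    accessKeyId: config.accessKeyId,
    secretAccessKey: config.secretAccessKey,
  }
});

exports.uploadFile = async options => {
  const Key = (options.internalPath || '') + options.fileName;
  const uploadParams = {
    Bucket: config.s3.bucket,
    Key,
    Body: fs.createReadStream(options.path),
    ACL: options.ACL || 'public-read'
  };
  await client.send(new PutObjectCommand(uploadParams));
  // One common form of the object URL (virtual-hosted style)
  return { url: `https://${config.s3.bucket}.s3.amazonaws.com/${Key}` };
};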

How to invoke AWS CLI command from CodePipeline?

I want to copy artifacts from an S3 bucket in Account 1 to an S3 bucket in Account 2. I was able to set up replication, but I want to know whether there is a way to invoke an AWS CLI command from within a pipeline.
Can it be invoked using Lambda function? If yes, any small sample script will be helpful.
Yes, you can add a Lambda Invoke action to your pipeline to call the CopyObject API. The core part of the Lambda function is as follows.
// Setup the snippet omits: the SDK client and the list of destination
// buckets (placeholder values).
const AWS = require('aws-sdk')
const s3 = new AWS.S3()
const prodBuckets = ['your-prod-bucket-1', 'your-prod-bucket-2'] // placeholders

exports.copyRepoToProdS3 = (event, context) => {
  const jobId = event['CodePipeline.job'].id
  const s3Location = event['CodePipeline.job'].data.inputArtifacts[0].location.s3Location
  const cpParams = JSON.parse(event['CodePipeline.job'].data.actionConfiguration.configuration.UserParameters)
  let promises = []
  for (let bucket of prodBuckets) {
    let params = {
      Bucket: bucket,
      CopySource: s3Location['bucketName'] + '/' + s3Location['objectKey'],
      Key: cpParams['S3ObjectKey']
    }
    promises.push(s3.copyObject(params).promise())
  }
  return Promise.all(promises)
    .then((data) => {
      console.log('Successfully copied repo to buckets!')
    }).catch((error) => {
      console.log('Failed to copy repo to buckets!', error)
    })
}
More detailed steps for adding the roles and reporting the processing result back to CodePipeline can be found at the following link: https://medium.com/#codershunshun/how-to-invoke-aws-lambda-in-codepipeline-d7c77457af95
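As a rough sketch (not taken from the linked article), reporting the result back to CodePipeline from the same Lambda usually amounts to calling putJobSuccessResult or putJobFailureResult with the jobId extracted above:
const codepipeline = new AWS.CodePipeline()

function reportResult (jobId, error) {
  if (error) {
    return codepipeline.putJobFailureResult({
      jobId,
      failureDetails: { type: 'JobFailed', message: String(error) }
    }).promise()
  }
  return codepipeline.putJobSuccessResult({ jobId }).promise()
}
// e.g. call reportResult(jobId) in the .then() above and
// reportResult(jobId, error) in the .catch()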

Firehose record format conversion partitions

I tried to use the new Firehose feature "record format conversion" to save my events as Parquet files for Athena or Hive aggregations. You have to select the table from your Glue catalog, but Firehose ignores the defined partitions and instead saves the files in the YYYY/MM/DD/HH/ structure. The data is also missing the defined partition columns. This would be OK if it had used them for partitioning.
Is there an API configuration, or something else, to force it to use the table's partitioning?
I have exactly the same issue, even with the same partitioning.
You have to use AWS Lambda to achieve what you want:
one function to move the files generated by Firehose to the bucket layout that Athena uses,
another one to trigger a refresh of the Athena table, as it will not see the new folders
(I don't show all the triggers, but this should just be a call to 'MSCK REPAIR TABLE your_table_name;'; a hedged sketch of that second function is at the end of this answer).
For the first one I chose Node.js, as it is really simple and really fast:
~3 seconds to move a 120MB file with the minimum AWS-allowed 128MB RAM allocation (files generated by Firehose will be approximately 64MB at most).
Node.js project structure
package.json
{
  "name": "your.project",
  "version": "1.0.0",
  "description": "Copy generated partitioned files by Firehose to valid partitioned files for Athena",
  "main": "index.js",
  "dependencies": {
    "async": "^2.6.1"
  }
}
And index.js
const aws = require('aws-sdk');
const async = require('async');
const util = require('util');

const s3 = new aws.S3();
const dstBucket = 'PUT_YOUR_BUCKET_NAME_HERE';

exports.handler = (event, context, callback) => {
  const srcBucket = event.Records[0].s3.bucket.name;
  const srcKey = event.Records[0].s3.object.key;
  const split = srcKey.split('/');
  const dstKey = `event_year=${split[0]}/event_month=${split[1]}/event_day=${split[2]}/event_hour=${split[3]}/${split[4]}`;

  console.log("Reading options from event:\n", util.inspect(event, {depth: 10}));

  async.waterfall([
    function copy(next) {
      s3.copyObject({
        Bucket: dstBucket,
        CopySource: `${srcBucket}/${srcKey}`,
        Key: dstKey
      }, next);
    },
    function deleteOriginal(copyResult, next) {
      s3.deleteObject({
        Bucket: srcBucket,
        Key: srcKey
      }, next);
    }
  ], function (err) {
    if (err) {
      console.error(`Failed: ${srcBucket}/${srcKey} => ${dstBucket}/${dstKey} to move FireHose partitioned object to Athena partitioned object. Error: ${err}`);
    } else {
      console.log(`Success: ${srcBucket}/${srcKey} => ${dstBucket}/${dstKey} moved FireHose partitioned object to Athena partitioned object`);
    }
    callback(null, 'move success');
  });
};
Just update the values to be valid for your case.
One more issue I ran into: when I built the project with
npm install
and zipped it, something was not correct when AWS unzipped it, so I had to update the handler path to my index.js.
After that it worked.
You will also notice this line:
console.log("Reading options from event:\n", util.inspect(event, {depth: 10}));
It can be removed, but it greatly helps in understanding the details of the object being processed.
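For completeness, here is a minimal sketch of the second Lambda mentioned above: it only submits the MSCK REPAIR TABLE statement through Athena so the new partition folders become visible. The database, table name and results location are placeholders:
const aws = require('aws-sdk');
const athena = new aws.Athena();

exports.handler = async () => {
  // Fire-and-forget: Athena runs the repair asynchronously
  return athena.startQueryExecution({
    QueryString: 'MSCK REPAIR TABLE your_table_name',
    QueryExecutionContext: { Database: 'your_database' },
    ResultConfiguration: { OutputLocation: 's3://your-athena-results-bucket/' }
  }).promise();
};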

AWS SDK connection - How is this working?? (Beginner)

I am working on my AWS cert and I'm trying to figure out how the following bit of js code works:
var AWS = require('aws-sdk');
var uuid = require('node-uuid');

// Create an S3 client
var s3 = new AWS.S3();

// Create a bucket and upload something into it
var bucketName = 'node-sdk-sample-' + uuid.v4();
var keyName = 'hello_world.txt';

s3.createBucket({Bucket: bucketName}, function() {
  var params = {Bucket: bucketName, Key: keyName, Body: 'Hello'};
  s3.putObject(params, function(err, data) {
    if (err)
      console.log(err)
    else
      console.log("Successfully uploaded data to " + bucketName + "/" + keyName);
  });
});
This code successfully uploads a txt file containing the word "Hello". I do not understand how this can identify MY AWS account, but it does! It is somehow able to determine that I want a new bucket inside MY account, yet this code was taken directly from the AWS docs. I don't know how it could figure that out.
As per Class: AWS.CredentialProviderChain, the AWS SDK for JavaScript looks for credentials in the following locations:
AWS.CredentialProviderChain.defaultProviders = [
  function () { return new AWS.EnvironmentCredentials('AWS'); },
  function () { return new AWS.EnvironmentCredentials('AMAZON'); },
  function () { return new AWS.SharedIniFileCredentials(); },
  function () {
    // if AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is set
    return new AWS.ECSCredentials();
    // else
    return new AWS.EC2MetadataCredentials();
  }
]
Environment Variables (useful for testing, or when running code on a local computer)
Local credentials file (useful for running code on a local computer)
ECS credentials (useful when running code in Elastic Container Service)
Amazon EC2 Metadata (useful when running code on an Amazon EC2 instance)
It is highly recommended to never store credentials within an application. If the code is running on an Amazon EC2 instance and a role has been assigned to the instance, the SDK will automatically retrieve credentials from the instance metadata.
The next best method is to store credentials in the ~/.aws/credentials file.
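If you want to see which credentials the chain actually resolved on your machine, a small check like this (using the same v2 SDK) can help; the shared credentials file it may read typically lives at ~/.aws/credentials:
var AWS = require('aws-sdk');

// ~/.aws/credentials usually looks like:
// [default]
// aws_access_key_id = ...
// aws_secret_access_key = ...

AWS.config.getCredentials(function (err) {
  if (err) console.log('No credentials found:', err);
  else console.log('Resolved access key:', AWS.config.credentials.accessKeyId);
});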