I have train.manifest and validation.manifest in my S3 bucket. I need to combine them into a single file, but I don't know how to do that.
Related
My question is about the files saved after running a query in AWS Athena. It saves two files in the S3 bucket connected to the workgroup, but I only want the .csv.metadata file. Is there a way to create only the .csv.metadata file instead of both?
Thanks
I have lots of .tar.gz files (millions) stored in a bucket on Amazon S3.
I'd like to untar them and create the corresponding folders on Amazon S3 (in the same bucket or another).
Is it possible to do this without me having to download/process them locally?
It's not possible with only S3. You'll need something like EC2, ECS or Lambda, preferably running in the same region as the S3 bucket, to download the .tar.gz files, extract them, and upload every extracted file back to S3.
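As a rough sketch of that approach in Python with boto3 (the bucket name, key, and destination prefix below are placeholders, not anything from the question), a worker would pull one archive into a temporary file, extract it, and push every member back to S3:

import tarfile
import tempfile
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # placeholder

def untar_object(key, dest_prefix="extracted/"):
    # Download one .tar.gz into a temporary file, extract it,
    # and upload every regular file it contains back to S3.
    with tempfile.TemporaryFile() as tmp:
        s3.download_fileobj(BUCKET, key, tmp)
        tmp.seek(0)
        with tarfile.open(fileobj=tmp, mode="r:gz") as tar:
            for member in tar:
                if member.isfile():
                    s3.upload_fileobj(tar.extractfile(member), BUCKET, dest_prefix + member.name)

With millions of archives you would drive this from a listing (S3 Inventory or ListObjectsV2) and fan the keys out across many Lambda invocations or ECS tasks.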
With AWS Lambda you can do this! You won't have the extra data-transfer costs of downloading from and uploading back to AWS over the network.
You can follow this blog post, but unzip instead of zipping:
https://www.antstack.io/blog/create-zip-using-lambda-with-files-streamed-from-s3/
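For the Lambda route, a minimal sketch of the streaming idea in Python with boto3 (not the code from the blog post; all names are placeholders) is to open the tar stream directly on the S3 response body, so the whole archive never has to be written to local storage:

import tarfile
import boto3

s3 = boto3.client("s3")

def stream_untar(bucket, key, dest_prefix="extracted/"):
    # Read the .tar.gz as a forward-only stream straight from S3
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    with tarfile.open(fileobj=body, mode="r|gz") as tar:
        for member in tar:
            if member.isfile():
                s3.upload_fileobj(tar.extractfile(member), bucket, dest_prefix + member.name)

Lambda's memory and 15-minute time limits still apply, so very large archives may be better handled on ECS/Fargate or EC2.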
I am using an API (https://scihub.copernicus.eu/userguide/OpenSearchAPI) to download a large number (100+) of large files (~5GB each), and I want to store these files in an AWS S3 bucket.
My first iteration was to download the files locally and use AWS CLI to move them to an S3 bucket: aws s3 cp <local file> s3://<mybucket>, and this works.
To avoid downloading locally, I used an EC2 instance and basically did the same from there. The problem, however, is that the files are quite large, so I'd prefer not to store them at all and instead have my EC2 instance stream the files straight into my S3 bucket.
Is this possible?
You can use a byte array to populate an Amazon S3 bucket. For example, assume you are using the AWS SDK for Java V2. You can put an object into a bucket like this:
// Build the request with the target bucket, key and optional metadata
PutObjectRequest putOb = PutObjectRequest.builder()
        .bucket(bucketName)
        .key(objectKey)
        .metadata(metadata)
        .build();

// Upload the object contents from a byte array
PutObjectResponse response = s3.putObject(putOb,
        RequestBody.fromBytes(getObjectFile(objectPath)));
Notice the RequestBody.fromBytes method. Full example here:
https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/javav2/example_code/s3/src/main/java/com/example/s3/PutObject.java
One thing to note, however: if your files are really large, you may want to consider uploading in parts (a multipart upload). See this example:
https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/javav2/example_code/s3/src/main/java/com/example/s3/S3ObjectOperations.java
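Since the question is about not storing the files at all, a different sketch of the same idea in Python with boto3 may also help (the URL here is a placeholder and authentication against the API is omitted): upload_fileobj reads the HTTP response as a stream and switches to multipart upload automatically for large bodies, so the ~5GB files never touch the local disk.

import boto3
import requests

s3 = boto3.client("s3")

def stream_to_s3(url, bucket, key):
    # Stream an HTTP download straight into S3 without a local copy;
    # upload_fileobj chunks the stream and uses multipart upload for large bodies.
    with requests.get(url, stream=True) as resp:  # add API auth as needed
        resp.raise_for_status()
        s3.upload_fileobj(resp.raw, bucket, key)

stream_to_s3("https://example.com/product.zip", "mybucket", "products/product.zip")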
I have an S3 bucket with a folder called batches. Inside the batches folder, I have 20 CSV files. Using the AWS CLI (or a bash script, to be exact), how can I combine all these CSV files into a single CSV file and move it up one folder?
Normally in terminal this is how I do it:
cd batches && cat ./*csv > combined.csv
What would be a comparable way to do this for an S3 bucket inside AWS CLI?
If you only have 20 CSV files in total, then Marcin's suggestion is probably the right way to go. If you have more (or bigger) CSV files, then I would suggest taking advantage of multipart upload in S3 and processing them in AWS Lambda.
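One way to read that suggestion, sketched here in Python with boto3 (bucket and key names are placeholders), is to let S3 concatenate the objects server-side with UploadPartCopy, so nothing is downloaded at all. The catch is that every part except the last must be at least 5 MB, and repeated CSV header rows are not stripped:

import boto3

s3 = boto3.client("s3")

def concat_in_place(bucket, keys, dest_key):
    # Start a multipart upload and copy each source object in as one part
    upload = s3.create_multipart_upload(Bucket=bucket, Key=dest_key)
    parts = []
    for i, key in enumerate(keys, start=1):
        part = s3.upload_part_copy(
            Bucket=bucket,
            Key=dest_key,
            UploadId=upload["UploadId"],
            PartNumber=i,
            CopySource={"Bucket": bucket, "Key": key},
        )
        parts.append({"PartNumber": i, "ETag": part["CopyPartResult"]["ETag"]})
    s3.complete_multipart_upload(
        Bucket=bucket,
        Key=dest_key,
        UploadId=upload["UploadId"],
        MultipartUpload={"Parts": parts},
    )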
I have many CSV files under different sub-directories in S3.
I am trying to append all the data into one CSV file.
Is there any way to append all the files using S3 or other AWS services?
Thanks
If the resulting CSV is not very large (less than a couple of GB), you can use AWS Lambda to go through all the subdirectories (keys) in S3 and write the result file (for example, back into S3).
You could also use AWS Glue for this operation, but I haven't used it.
In both cases you'll need to write some script to join the files.
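A minimal sketch of such a script in Python with boto3 (bucket, prefix and destination key are placeholders, and it assumes each file starts with the same header row): it keeps only the first file's header and holds the result in memory, so it is only suitable while the combined CSV stays small enough for Lambda.

import io
import boto3

s3 = boto3.client("s3")

def join_csvs(bucket, prefix, dest_key):
    # Concatenate every .csv object under the prefix into one object,
    # keeping the header row from the first file only.
    out = io.BytesIO()
    header_written = False
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith(".csv"):
                continue
            lines = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read().splitlines(keepends=True)
            out.write(b"".join(lines if not header_written else lines[1:]))
            header_written = True
    out.seek(0)
    s3.upload_fileobj(out, bucket, dest_key)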