how to access files in s3 bucket for other commands using cli - amazon-web-services

I want to use some commands with the AWS CLI on large files that are stored in an S3 bucket, without copying the files to the local directory
(I'm familiar with the aws s3 cp command; it's not what I want).
For example, let's say I want to use simple bash commands like "head" or "more".
If I try to use it like this:
head s3://bucketname/file.txt
but then I get:
head: cannot open ‘s3://bucketname/file.txt’ for reading: No such file or directory
How else can I do it?
Thanks in advance.

Whether a command can access a file in an S3 bucket depends entirely on the command. Under the hood, every command is just a program. When you run something like head filename, the argument filename is passed to the head command's main() function. You can check out the source code here: https://github.com/coreutils/coreutils/blob/master/src/head.c
Essentially, since the head command does not support S3 URIs, you cannot do this directly. You can either:
Copy the S3 object to stdout and then pipe it to head: aws s3 cp s3://bucketname/file.txt - | head. This may not be a good option if the file is too big for the pipe buffer.
Use s3curl to copy just a range of bytes.
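For illustration, here is a minimal sketch of both approaches using only standard AWS CLI commands (the bucket and key names are just the placeholders from the question):
# Stream the whole object to stdout and let head take the first lines.
aws s3 cp s3://bucketname/file.txt - | head
# Or download only the first kilobyte with a ranged GET through the
# low-level API, then inspect that small local file.
aws s3api get-object --bucket bucketname --key file.txt --range bytes=0-1023 /tmp/first-kb.txt
head /tmp/first-kb.txt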

Related

AWS S3 ls command returns the folder name too along with files in it

This might seem like a stupid question but I am not able to find a reason for this. When I run the command aws s3 ls on an S3 URI, it gives the name of the parent folder in the output for some of the buckets, and for others it will just list the files in the folder.
Example:
aws s3 ls s3://test-bucket/test_folder/ --recursive --human-readable --summarize
2022-06-28 20:04:36 0 Bytes test_folder/
2022-06-28 20:05:58 3.0 KiB test_folder/file.txt
and for another S3 URI it will just list the contents:
aws s3 ls s3://sample_/sample_test/ --recursive --human-readable --summarize
2021-06-29 03:24:08 5.2 sample_test/file1.txt
2021-06-29 03:24:07 7.0 sample_test/file2.txt
2021-06-29 03:24:08 5.1 sample_test/file3.txt
I am not sure what is causing this behavior. Is there any documentation that I am missing here?
Thanks
This is likely because someone used the S3 console to explicitly create a 'folder' named test_folder but they didn't do that for sample_test. They simply uploaded 3 files to sample_test.
What you see as the folder test_folder/ is simply an S3 object whose key is test_folder/ and whose size is zero. It doesn't need to exist for you to be able to upload files to test_folder/. It's just a visual convenience in the S3 console.
There are typically no real folders in S3. They're virtual, and inferred from the presence of multiple objects with a common prefix ending in forward slash e.g. dogs/bingo.png and dogs/elvis.jpg implies the presence of a virtual folder named dogs/, but it doesn't really exist (typically).
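If you want to see this for yourself, here is a small sketch with the AWS CLI (the bucket, key, and prefix names are illustrative, based on the question):
# This is essentially what the console's "Create folder" button does: it
# creates a zero-byte object whose key ends in a slash.
aws s3api put-object --bucket test-bucket --key test_folder/
# Uploading under a prefix needs no such placeholder object to exist first;
# the "folder" is simply inferred from the object key.
aws s3 cp file.txt s3://test-bucket/some_prefix/file.txt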

AWS CLI - A file containing items to be ignored for S3 copy or sync

Is it possible to have a file containing files and folders to be ignored when uploading items through the AWS CLI?
It has an --exclude flag, as mentioned here. However, what I'm looking for is something like a .gitignore or .dockerignore file rather than listing patterns with a flag.
No, there is no in-built capability within the AWS Command-Line Interface (CLI) to support .ignore file capabilities.
I know it's not exactly what you are looking for, but you could set an alias in your ~/.bash_profile, something like:
alias s3_cp='aws s3 cp --exclude "pattern1" --exclude "pattern2"'
This would at least reduce the need to type the exclusions every time, even though they don't live in a separate file.
Edit: Here is a link showing that the base config file doesn't appear to support what you are looking for: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html
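If you really want the patterns to live in a file, one rough workaround is to keep them in your own list (the .s3ignore name below is just a convention made up here, not an AWS CLI feature) and expand them into repeated --exclude flags:
# Build one --exclude flag per non-empty line of .s3ignore.
excludes=()
while IFS= read -r pattern; do
  [ -n "$pattern" ] && excludes+=(--exclude "$pattern")
done < .s3ignore
# mybucket/prefix is a placeholder destination.
aws s3 cp . s3://mybucket/prefix/ --recursive "${excludes[@]}"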

How to copy only files from many subdirectory under the directory to another project bucket in GCP?

I have a huge amount of data in my Google Cloud Storage bucket and I have to copy all the files to another project's bucket. The main problem is that in this bucket I created some folders, those folders have many sub-folders, and all the sub-folders contain data. So when I use a normal gsutil copy command it copies all the data along with the folder structure.
I need help to resolve this problem, because it is taking too much time to copy from one project's bucket to another.
You can use this command to have all the files in the root path.
gsutil cp 'gs://[YOUR_FIRST_BUCKET_NAME]/*' gs://[YOUR_SECOND_BUCKET_NAME]
If you have nested directories inside your bucket, use this command:
gsutil cp -r 'gs://[YOUR_FIRST_BUCKET_NAME]/*' gs://[YOUR_SECOND_BUCKET_NAME]
Pay attention to the single quotes around the wildcard in the first command; they keep your shell from expanding the * locally.
You can take a look at the Wildcard Names if you need more advanced features.
You can use Google Data Transfer Service
It is the second option in the Google Cloud Storage subcategory.
Use the gsutil cp command without the -r option. From its documentation:
The -R and -r options are synonymous. Causes directories, buckets, and bucket subdirectories to be copied recursively. If you neglect to use this option for an upload, gsutil will copy any files it finds and skip any directories. Similarly, neglecting to specify this option for a download will cause gsutil to copy any objects at the current bucket directory level, and skip any subdirectories.
If I understand correctly, you want to copy all the files from one bucket to another bucket, but you don't want to keep the same hierarchy; instead, you want to have all the files in the root path.
There is currently no way to do that with gsutil alone, but you can do it with a script. Here is my solution:
from google.cloud import storage

client = storage.Client()
bucketOrigin = client.get_bucket("<BUCKET_ID_ORIGIN>")
bucketDestination = client.get_bucket("<BUCKET_ID_DESTINATION>")

# Re-upload every object under its bare file name (everything after the
# last "/"), so all files end up in the root of the destination bucket.
for blob in bucketOrigin.list_blobs():
    strfile = blob.download_as_string()
    blobDest = bucketDestination.blob(blob.name[blob.name.rfind("/") + 1:])
    blobDest.upload_from_string(strfile)
As mentioned by Akash Dathan, you can use the Cloud Storage Transfer Service to move your bucket content. I recommend you take a look at this Moving and Renaming Buckets guide, where you can find the steps required to perform this task.
Bear in mind the following requirements:
The Transfer Service's service account must have permission to read from your source and write to your destination.
If you're deleting the source files, the Transfer Service's service account will need delete access to the source.
If your service account doesn't have these permissions yet, a bucket owner must grant them.
Note: if you have the 'storage.buckets.setIamPolicy' permission on the source and destination buckets, creating a transfer job will grant that service account the required source and destination permissions to complete the transfer.
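If your installed gcloud release includes the transfer command group, the same transfer can be kicked off from the command line; a minimal sketch, with placeholder bucket names:
# Create a Storage Transfer Service job that copies everything from the
# source bucket to the destination bucket.
gcloud transfer jobs create gs://[YOUR_FIRST_BUCKET_NAME] gs://[YOUR_SECOND_BUCKET_NAME]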
You can list all the files from your subfolders and get the file name by using the split() method. Then you can use the copy() method to copy the file to another bucket. The snippet below removes all subfolder prefixes:
const [files] = await storage.bucket(srcBucketName).getFiles();
files.forEach((file) => {
  // Keep only the final path segment so any subfolder prefix is dropped.
  let fileName = file.name.split("/").pop();
  if (fileName)
    file.copy(storage.bucket(destBucketName).file(`${prefix}/${fileName}`));
});

Remove processed source files after AWS Datapipeline completes

A third party sends me a daily upload of log files into an S3 bucket. I'm attempting to use DataPipeline to transform them into a slightly different format with awk, place the new files back on S3, then move the original files aside so that I don't end up processing the same ones again tomorrow.
Is there a clean way of doing this? Currently my shell command looks something like:
#!/usr/bin/env bash
set -eu -o pipefail
aws s3 cp s3://example/processor/transform.awk /tmp/transform.awk
for f in "${INPUT1_STAGING_DIR}"/*; do
basename=${f//+(*\/|.*)}
unzip -p "$f" | awk -f /tmp/transform.awk | gzip > ${OUTPUT1_STAGING_DIR}/$basename.tsv.gz
done
I could use the AWS CLI to move the source file aside on each iteration of the loop, but that seems flaky: if my loop dies halfway through processing, those earlier files are going to get lost.
A few possible solutions:
Create a trigger on your S3 bucket: whenever an object is added to the bucket, invoke a Lambda function (which can be a Python script) that performs the transformation and copies the result to another bucket. On that second bucket, another Lambda function is invoked which deletes the file from the first bucket.
Personally, I feel what you have already is good enough. All you need is exception handling in the shell script, and to delete the source file (never lose data) ONLY when the output file has been created successfully (you can probably also check the size of the output file).
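A rough sketch of that second suggestion, building on the loop from the question; it assumes the inputs are .zip files (as the unzip in the question suggests), and the s3://example/incoming/ and s3://example/processed/ prefixes are made up for illustration:
for f in "${INPUT1_STAGING_DIR}"/*; do
  basename=$(basename "$f" .zip)
  out="${OUTPUT1_STAGING_DIR}/${basename}.tsv.gz"
  if unzip -p "$f" | awk -f /tmp/transform.awk | gzip > "$out" && [ -s "$out" ]; then
    # Only move the original aside once its output exists and is non-empty,
    # so a crash mid-loop never loses an unprocessed file.
    aws s3 mv "s3://example/incoming/$(basename "$f")" "s3://example/processed/$(basename "$f")"
  fi
done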

Sync command for OpenStack Object Storage (like S3 Sync)?

Using the S3 CLI, I can sync a local directory with an S3 bucket using the following command:
aws s3 sync s3://mybucket/ ./local_dir/
This command is a complete sync. It uploads new files, updates changed files, and deletes removed files. I am trying to figure out how to do something equivalent using the OpenStack Object Storage CLI:
http://docs.openstack.org/cli-reference/content/swiftclient_commands.html
The upload command has a --changed option. But I need a complete sync that is also capable of deleting local files that were removed.
Does anyone know if I can do something equivalent to s3 sync?
The link you mentioned has this:
objects – A list of file/directory names (strings) or SwiftUploadObject instances containing a source for the created object, an object name, and an options dict (can be None) to override the options for that individual upload operation.
I'm thinking that if you pass the directory and the --changed option, it should work.
I don't have a Swift deployment to test with. Can you give it a try?
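For reference, a sketch of what that would look like with the python-swiftclient CLI (the container and directory names are placeholders); as far as I know, --changed only skips unchanged files and does not delete anything:
# Upload the directory, skipping files whose content has not changed since
# the last upload.
swift upload --changed mycontainer ./local_dir/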