How to read file names from GCP buckets recursively using a composer DAG - google-cloud-platform

I am trying to read the file names from a GCS bucket recursively, across all folders and subfolders under the bucket, using a Composer DAG. Is it possible? For example, I have a bucket with the folders and subfolders shown below; static is the bucket name.
static/folder1/subfolder1/file1.json
static/folder1/subfolder2/file2.json
static/folder1/subfolder3/file3.json
static/folder1/subfolder3/file4.json
I want to read the files recursively and put the data in two variables like below.
bucketname = static
filepath = static/folder1/subfolder3/file4.json

You can use Airflow's BashOperator to call the GCS CLI tool, gsutil.
An example could be the following:
read_files = BashOperator(
    task_id='read_files',
    bash_command='gsutil ls -r gs://bucket',
    dag=dag,
)
Edit: Since you want to capture the output, and the BashOperator only pushes the last line of stdout to XCom, I would suggest using a PythonOperator that calls a custom Python callable, which uses either the GCS API or the CLI tool via subprocess to collect all the file names and push them to XCom for downstream tasks. If no other tasks need this data, you can process it inside the callable however you like (that part isn't clear from the question).
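For example, a minimal sketch of that PythonOperator approach could look like the following. It assumes an Airflow 2 import path and that the google-cloud-storage client library is available in the Composer environment; the bucket name 'static' is taken from the question.

from airflow.operators.python import PythonOperator
from google.cloud import storage

def list_gcs_files(bucket_name, **context):
    # GCS listings are flat, so this covers every "folder" and "subfolder" in the bucket
    client = storage.Client()
    file_paths = [f'{bucket_name}/{blob.name}' for blob in client.list_blobs(bucket_name)]
    # push the full list to XCom for downstream tasks
    context['ti'].xcom_push(key='file_paths', value=file_paths)
    return file_paths

read_files = PythonOperator(
    task_id='read_files',
    python_callable=list_gcs_files,
    op_kwargs={'bucket_name': 'static'},
    dag=dag,
)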

Related

How can I retrieve a folder from S3 into an AWS SageMaker notebook

I have a folder with several files corresponding to checkpoints of an RL model trained using RLLIB. I want to run an analysis of the checkpoints that requires passing a certain folder as an argument, e.g., analysis_function(folder_path). I have to run this on a SageMaker notebook. I have seen that there are some questions on SO about how to retrieve files from S3, such as this one. However, how can I retrieve a whole folder?
To read the whole folder, you will just have to list all files in the folder and loop through them. You could either do something like -
import boto3

s3_res = boto3.resource("s3")
my_bucket = s3_res.Bucket("<your-bucket-name>")
for obj in my_bucket.objects.filter(Prefix="<your-prefix>"):
    # your code goes here
Or, simply download the files to your local storage with aws s3 cp and loop over them as you see fit:
!aws s3 cp s3://bucket/prefix/ . --recursive
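Alternatively, if you'd rather stay in Python, here is a minimal boto3 sketch along the same lines (the bucket name, prefix, and the local 'checkpoints' directory are placeholders):

import os
import boto3

s3_res = boto3.resource("s3")
my_bucket = s3_res.Bucket("<your-bucket-name>")

for obj in my_bucket.objects.filter(Prefix="<your-prefix>/"):
    if obj.key.endswith("/"):
        continue  # skip "folder" placeholder objects
    # mirror the S3 key layout under a local directory, then download each object
    local_path = os.path.join("checkpoints", obj.key)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    my_bucket.download_file(obj.key, local_path)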

Need to export the path/url of each file in Amazon S3 server

I have an Amazon S3 server filled with multiple buckets, each bucket containing multiple subfolders. There are easily 50,000 files in total. I need to generate an excel sheet that contains the path/url of each file in each bucket.
For eg, If I have a bucket called b1, and it has a file called f1.txt, I want to be able to export the path of f1 as b1/f1.txt.
This needs to be done for every one of the 50,000 files.
I have tried using S3 browsers like Expandrive and Cyberduck; however, they require you to select each and every file to copy its URL.
I also tried exploring the boto3 library in Python, but I did not come across any built-in functions to get the file URLs.
I am looking for any tool I can use, or even a script I can execute, to get all the URLs. Thanks.
Do you have access to the aws cli? aws s3 ls --recursive {bucket} will list all nested files in a bucket.
Eg this bash command will list all buckets, then recursively print all files in each bucket:
aws s3 ls | while read x y bucket; do aws s3 ls --recursive $bucket | while read x y z path; do echo $path; done; done
(the 'read's are just to strip off uninteresting columns).
nb I'm using v1 CLI.
What you should do is have another look at the boto3 documentation, as it has what you are looking for. It is fairly simple to do what you are asking, but it may take a bit of reading if you are new to it. Since there are multiple steps involved, I will try to steer you in the right direction.
In boto3, the S3 method you are looking for is list_objects_v2(). This will give you the 'Key' (object path) of every object. You will notice that it returns the entire JSON blob for each object. Since you are only interested in the Key, you can target it the same way you would access keys/values in a dict. E.g. list_objects_v2()['Contents'][0]['Key'] should return only the object path of the very first object.
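For instance, a minimal sketch of that first step (the bucket name is a placeholder):

import boto3

s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket='im-a-bucket')

# 'Contents' is a list of per-object dicts; 'Key' holds the object path
print(response['Contents'][0]['Key'])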
If you've got that working the next step is to try to loop and get all values. You can either use a for loop to do this or there is an awesome python package I regularly use called jmespath - https://jmespath.org/
Here is how you can retrieve all object paths up to 1000 objects in one line.
import boto3
import jmespath

bucket_name = 'im-a-bucket'
s3_client = boto3.client('s3')
bucket_object_paths = jmespath.search('Contents[*].Key', s3_client.list_objects_v2(Bucket=bucket_name))
Now since your buckets may have more than 1000 objects, you will need to use the paginator to do this. Have a look at this to understand it.
How to get more than 1000 objects from S3 by using list_objects_v2?
Basically, only 1000 objects can be returned per call. To overcome this, we use a paginator, which returns the entire result set by treating the 1000-object limit as page boundaries, so you just need to use it within a for loop to collect all the results you are looking for.
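A minimal sketch of the paginator approach might look like this (again, the bucket name is a placeholder):

import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

object_paths = []
for page in paginator.paginate(Bucket='im-a-bucket'):
    # each page carries up to 1000 objects under 'Contents'
    object_paths.extend(obj['Key'] for obj in page.get('Contents', []))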
Once you get this working for one bucket, store the result in a variable (it will be a list) and repeat for the rest of the buckets. Once you have all this data, you could easily copy-paste it into an Excel sheet or use Python to do it. (I haven't tested the code snippets, but they should work.)
Amazon S3 Inventory can help you with this use case.
Do evaluate that option; see https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html

What is the boto3 equivalent of the following AWS CLI command?

I am trying to write the following AWS CLI command in boto3:
aws s3api list-objects-v2 --bucket bucket --query "Contents[?LastModified>='2020-02-20T17:00:00'] | [?LastModified<='2020-02-20T17:10:00']"
My goal is to be able to list only those s3 files in a specified bucket that were created within a specific interval of time. In the example above, when I run this command it returns only files that were created between 17:00 and 17:10.
When looking at the boto3 documentation, I am not seeing anything resembling the '--query' portion of the above command, so I am not sure how to go about returning only the files that fall within a specified time interval.
(Please note: Listing all objects in the bucket and filtering from the larger list is not an option for me. For my specific use case, there are simply too many files to go this route.)
If a boto3 equivalent does not exist, is there a way to launch this exact command via a Python script?
Thanks in advance!
There's no equivalent in boto3, but s3pathlib solves your problem. Here's a code snippet:
from datetime import datetime
from s3pathlib import S3Path

# define a bucket
p_bucket = S3Path("bucket")

# filter by last modified
for p in p_bucket.iter_objects().filter(
    # any filterable attribute can be used for filtering
    S3Path.last_modified_at.between(
        datetime(2000, 1, 1),
        datetime(2020, 1, 1),
    )
):
    # do whatever you like
    print(p.console_url)  # click the link to open it in the console and inspect the results
If you want to use other S3Path attributes for filtering, use other comparators, or even define your own custom filter, you can follow the s3pathlib documentation.

Rename and Move S3 files based on their folders name in pyspark

I am writing some dataframes using partitionBy to S3. The folder structure that gets created is as below.
root/
date=2018-01-01/
date=2018-01-02/
I want to move these files to another directory in s3 and rename the folders as
root1/
20180101/
20180102/
Is there a way that I can achieve this from pyspark?
Also, I need the files to be renamed in a sequential way inside the directories, e.g.:
root1/
20180101/FILE_1.csv
20180101/FILE_2.csv
You can't directly rename S3 objects.
So one way to achieve this is to copy the objects under the desired names and then delete the original objects.
Also, S3 buckets do not have a directory structure; the "directory structure" is just prefixes in the objects' keys.
You have two options: either call the AWS CLI from Python using subprocess, or use the boto3 library to copy all the files from one "directory" to another.
Solution using subprocess:
import subprocess
subprocess.check_call("aws s3 sync s3://bucket/root/date=2018-01-01/ s3://bucket/root1/20180101/".split())
The sync command will copy recursively. Then you can delete the originals with aws s3 rm --recursive "somepath", calling it via subprocess again.
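For the boto3 option mentioned above, here is a minimal sketch. It assumes the bucket is named 'bucket' and reuses the date folder and sequential FILE_n.csv naming from the question; adjust these to your layout.

import boto3

s3_res = boto3.resource('s3')
my_bucket = s3_res.Bucket('bucket')

old_prefix = 'root/date=2018-01-01/'
new_prefix = 'root1/20180101/'

for i, obj in enumerate(my_bucket.objects.filter(Prefix=old_prefix), start=1):
    # copy each object under the new prefix with a sequential name, then delete the original
    my_bucket.copy({'Bucket': my_bucket.name, 'Key': obj.key}, f'{new_prefix}FILE_{i}.csv')
    obj.delete()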

Sync command for OpenStack Object Storage (like S3 Sync)?

Using the S3 CLI, I can sync a local directory with an S3 bucket using the following command:
aws s3 sync s3://mybucket/ ./local_dir/
This command is a complete sync. It uploads new files, updates changed files, and deletes removed files. I am trying to figure out how to do something equivalent using the OpenStack Object Storage CLI:
http://docs.openstack.org/cli-reference/content/swiftclient_commands.html
The upload command has a --changed option. But I need a complete sync that is also capable of deleting local files that were removed.
Does anyone know if I can do something equivalent to s3 sync?
The link you mentioned has this:
"objects – A list of file/directory names (strings) or SwiftUploadObject instances containing a source for the created object, an object name, and an options dict (can be None) to override the options for that individual upload operation"
I'm thinking, if you pass the directory and the --changed option it should work.
I don't have a Swift environment to test with. Can you try it?
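If it helps, here is a hedged sketch of that idea using python-swiftclient's SwiftService API, which the quoted docs describe. The container name and local directory are placeholders, and the 'changed' option is assumed to be the programmatic counterpart of the CLI's --changed flag.

from swiftclient.service import SwiftService

with SwiftService() as swift:
    # upload the directory, only re-uploading files that have changed
    for result in swift.upload(
        container='mycontainer',
        objects=['./local_dir'],
        options={'changed': True},
    ):
        if not result['success']:
            print(result.get('error'))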