How can I use GCS Delete in Data Fusion Studio? - google-cloud-platform

Apologies if this is very simple but I am a complete beginner at GCP.
I've created a pipeline that picks up multiple CSVs from a bucket, wrangles them, then writes them into BigQuery. I want it to then delete the contents of the bucket folder the files came from. So, let's say I pulled the CSVs using gs://bucket/Data/Country/*.CSV; can I use GCS Delete to get rid of all the CSVs in there?
As a desperate attempt :D, I specified gs://bucket/Data/Country/*.* in the Objects to delete property, but this didn't do a thing.

According to the Google Cloud Storage Delete plugin documentation, it is necessary to list each object to delete, separated by commas.
There is a feature request asking for support for suffixes and prefixes (wildcards) in this plugin; you can use the +1 button there and provide feedback on how this feature would be useful to you.
In the meantime, here is a workaround that could work for you. Using the GCS documentation, I created a script that lists all CSV objects in a bucket; you only have to copy and paste its output into the Objects to Delete property of the plugin. It's important to mention that I used this workaround with roughly 100 files, and I'm not sure how feasible it is with a much larger number of files.
from google.cloud import storage

bucket_name = "MY_BUCKET"
file_format = "csv"

def list_csv(bucket_name):
    # Print every CSV object in the bucket as a comma-terminated gs:// path.
    storage_client = storage.Client()
    blobs = storage_client.list_blobs(bucket_name)
    for blob in blobs:
        # Match .csv / .CSV suffixes rather than any name merely containing "csv".
        if blob.name.lower().endswith("." + file_format):
            # The trailing comma matches the plugin's comma-separated format;
            # remove it from the last entry before pasting.
            print("gs://" + bucket_name + "/" + blob.name + ",")

list_csv(bucket_name)

Related

Update data in csv table which is stored in AWS S3 bucket

I need a solution for adding new data to a CSV that is stored in an S3 bucket in AWS.
At the moment we download the file, edit it and then upload it again to S3, and we would like to automate this process.
We need to add one new row to a three-column CSV.
Thank you in advance!
I think you will be able to do that using Lambda functions. You will need to programmatically make the modifications you need to the CSV, but there are multiple programming languages that allow you to do that. One quick example is using Python and the csv library.
Then you can invoke that Lambda, or add more logic around the operations you want to perform, using an AWS API Gateway.
You can access the CSV file (object) inside the S3 bucket from the Lambda code using the AWS SDK and append the new rows with data you pass as parameters to the function.
There is no way to directly modify the csv stored in S3 (if that is what you're asking). The process will always entail some version of download, modify, upload. There are many examples of how you can do this, for example here
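As a rough illustration (not a drop-in implementation), here is a minimal sketch of that download/modify/upload flow as a Lambda handler using boto3 and the csv module; the bucket name, key and event fields are hypothetical placeholders:

import csv
import io
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Hypothetical bucket/key; replace with your own.
    bucket = "my-bucket"
    key = "data/report.csv"

    # Download the current CSV into memory.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = list(csv.reader(io.StringIO(body)))

    # Append one new three-column row passed in via the event payload.
    rows.append([event["col1"], event["col2"], event["col3"]])

    # Write the modified CSV back to S3.
    out = io.StringIO()
    csv.writer(out).writerows(rows)
    s3.put_object(Bucket=bucket, Key=key, Body=out.getvalue())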

Need to export the path/url of each file in Amazon S3 server

I have an Amazon S3 server filled with multiple buckets, each bucket containing multiple subfolders. There are easily 50,000 files in total. I need to generate an excel sheet that contains the path/url of each file in each bucket.
For eg, If I have a bucket called b1, and it has a file called f1.txt, I want to be able to export the path of f1 as b1/f1.txt.
This needs to be done for every one of the 50,000 files.
I have tried using S3 browsers like Expandrive and Cyberduck; however, they require you to select each and every file to copy its URL.
I also tried exploring the boto3 library in Python, but I did not come across any built-in function to get the file URLs.
I am looking for any tool I can use, or even a script I can execute, to get all the URLs. Thanks.
Do you have access to the aws cli? aws s3 ls --recursive {bucket} will list all nested files in a bucket.
E.g. this bash command will list all buckets, then recursively print all files in each bucket:
aws s3 ls | while read x y bucket; do aws s3 ls --recursive $bucket | while read x y z path; do echo $path; done; done
(the 'read's are just to strip off uninteresting columns).
NB: I'm using the v1 CLI.
What you should do is have another look at the boto3 documentation, as it has what you are looking for. It is fairly simple to do what you are asking, but it may take a bit of reading if you are new to it. Since there are multiple steps involved, I will try to steer you in the right direction.
In boto3 for S3, the method you are looking for is list_objects_v2(). This will give you the 'Key' (object path) of every object. You will notice that it returns the entire JSON blob for each object. Since you are only interested in the Key, you can target it the same way you would access key/values in a dict. E.g. list_objects_v2()['Contents'][0]['Key'] should return only the object path of the very first object.
If you've got that working, the next step is to loop and get all the values. You can either use a for loop to do this, or use an awesome Python package I regularly use called jmespath - https://jmespath.org/
Here is how you can retrieve all object paths, up to 1000 objects, in one line:
import boto3
import jmespath

bucket_name = 'im-a-bucket'
s3_client = boto3.client('s3')
# Pull just the object keys out of the list_objects_v2 response.
bucket_object_paths = jmespath.search('Contents[*].Key', s3_client.list_objects_v2(Bucket=bucket_name))
Now, since your buckets may have more than 1000 objects, you will need to use a paginator. Have a look at this to understand it:
How to get more than 1000 objects from S3 by using list_objects_v2?
Basically, only 1000 objects can be returned per call. To overcome this, a paginator treats the 1000-object limit as a page boundary and returns the results page by page, so you just need to use it within a for loop to collect all the results you are looking for.
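For example, a minimal paginator sketch (reusing the placeholder bucket name from the snippet above) could look like this:

import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

object_paths = []
# Each page holds up to 1000 objects; the paginator keeps fetching pages
# until the listing is exhausted.
for page in paginator.paginate(Bucket='im-a-bucket'):
    for obj in page.get('Contents', []):
        object_paths.append(obj['Key'])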
Once you get this working for one bucket, store the result in a variable (it will be a list) and repeat for the rest of the buckets. Once you have all this data, you could just copy and paste it into an Excel sheet, or use Python to write it out. (I haven't tested the code snippets, but they should work.)
Amazon S3 Inventory can help with this use case.
Do evaluate that option; see https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html

How to migrate data from s3 bucket to glacier?

I have a TB-sized S3 bucket with PDF files. I need to migrate the old files to Glacier. I know that I can create a lifecycle rule to migrate files which are older than a certain number of days. But in my case the bucket currently contains both old and new PDF files, and they were all added at the same time, so they may have the same upload date. In this case a lifecycle rule won't be useful.
In the PDF files there is a field called capture_date, so I need to migrate the files based on the capture_date (i.e. migrate all PDF files where capture_date < 2015-05-21, and so on).
Would a Fargate job be useful here? If so, please give a brief idea.
Please suggest your ideas. Thanks in advance.
S3 by itself will not read your PDF files, so you have to read them yourself, extract the data that determines which ones are old and which are new, and use the AWS SDK (or CLI) to move them to Glacier.
Since the files are not too big, you could use S3 Batch Operations along with a Lambda function which would change the storage class to Glacier.
Alternatively, you could do this on an EC2 instance, using S3 Inventory's CSV list of your objects (given the large number of them).
The most traditional way is to simply list your bucket and iterate over each object.
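As a rough sketch of the "list and iterate" approach (not a drop-in solution): the bucket name and cutoff date below are placeholders, and get_capture_date is a hypothetical helper you would implement with a PDF library of your choice. Copying an object over itself with a new StorageClass is what transitions it to Glacier.

import boto3

s3 = boto3.client("s3")
BUCKET = "my-pdf-bucket"  # placeholder

def get_capture_date(bucket, key):
    # Hypothetical helper: download the PDF and read its capture_date field,
    # returning an ISO-formatted date string like "2014-03-01".
    raise NotImplementedError

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if get_capture_date(BUCKET, key) < "2015-05-21":
            # Copy the object over itself, changing only the storage class.
            s3.copy_object(
                Bucket=BUCKET,
                Key=key,
                CopySource={"Bucket": BUCKET, "Key": key},
                StorageClass="GLACIER",
            )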

S3 move files year/month wise

I have a bucket (s3://Bucket1) and there are millions of files in it with keys in a format like this:
s3://Bucket1/yyyy-mm-dd/
I want to move these files to a structure like
s3://Bucket1/year/mm
Any help, script or method would be really appreciated.
I have tried aws s3 cp s3://Bucket1/ s3://Bucket1/ --include "2017-01-01*", but this is not working well, and I would also have to do extra work to delete the old files.
The basic steps are:
Get a list of objects
Copy the objects to the new name
Delete the old objects
Get a list of objects
Given that you have millions of files, the best way to start is to use Amazon S3 Inventory to obtain a CSV file of all the objects.
Copy the objects to the new name
Then, write a script that reads the CSV file and issues a copy() command to copy each object to its new location. This could be written in any language that has an AWS SDK (e.g. Python).
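For example, a minimal copy script could look like the sketch below. It assumes the inventory has been reduced to a plain CSV with one object key per row (real S3 Inventory files also contain the bucket name and URL-encode the keys, so adjust the parsing accordingly):

import csv
import boto3

s3 = boto3.client("s3")
BUCKET = "Bucket1"  # placeholder

with open("inventory.csv", newline="") as f:
    for row in csv.reader(f):
        old_key = row[0]                       # e.g. "2017-01-01/file.csv"
        date_prefix, filename = old_key.split("/", 1)
        year, month, _ = date_prefix.split("-")
        new_key = "{}/{}/{}".format(year, month, filename)  # "2017/01/file.csv"
        # Copy each object to its year/month location.
        s3.copy_object(
            Bucket=BUCKET,
            Key=new_key,
            CopySource={"Bucket": BUCKET, "Key": old_key},
        )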
Delete the old objects
Rather than individually deleting the objects, use S3 object lifecycle management to delete the old files. The benefits of using this method are:
There is no charge for the delete (whereas issuing millions of delete commands would involve a charge)
It can be done after the files have been copied, providing a chance to verify that all the files have been correctly copied (by checking the next S3 inventory output)
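As a rough sketch, such a lifecycle rule could be created with boto3 roughly as follows; the bucket name, rule ID and prefix are placeholders, and you would add one rule per old date-style prefix you want to expire:

import boto3

s3 = boto3.client("s3")
# Note: this call replaces any existing lifecycle configuration on the bucket.
s3.put_bucket_lifecycle_configuration(
    Bucket="Bucket1",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-2017-date-prefixes",  # placeholder
                # Matches old keys like "2017-01-01/...", not new "2017/01/..." keys.
                "Filter": {"Prefix": "2017-"},
                "Status": "Enabled",
                "Expiration": {"Days": 1},
            }
        ]
    },
)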
You could use the AWS CLI to issue an aws s3 mv command, which combines the copy and the delete, effectively providing a rename function. However, shell scripts aren't that easy to manage, and if things fail half-way the files will be left in a mixed state. That's why I prefer the "copy all objects, and only then delete" method.

python boto for aws s3, how to get sorted and limited files list in bucket?

If there are too many files in a bucket and I want to get only the 100 newest files, how can I get just that list?
s3.bucket.list doesn't seem to have that capability. Does anybody know how to do this?
Please let me know. Thanks.
There is no way to do this type of filtering on the service side. The S3 API does not support it. You might be able to accomplish something like this by using prefixes in your object names. For example, if you named all of your objects using a pattern like this:
YYYYMMDD/<objectname>
20140618/foobar (as an example)
you could use the prefix parameter of the ListBucket request in S3 to return only the objects that were stored today. In boto, this would look like:
import boto

s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket')
for key in bucket.list(prefix='20140618'):
    # do something with the key object
    print(key.name)
You would still have to retrieve all of the objects with that prefix and then sort them locally based on their last_modified date, but that would be much easier than listing all of the objects in the bucket and then sorting.
The other option would be to store metadata about the S3 objects in a database like DynamoDB and then query that database to find the objects to retrieve from S3.
You can find out more about hierarchical listing in S3 here
Can you try this code? It worked for me.
import boto
import time

con = boto.connect_s3()
key_repo = []
bucket = con.get_bucket('<your bucket name>')
bucket_keys = bucket.get_all_keys()

# Pair each key name with its parsed last_modified timestamp.
for obj in bucket_keys:
    t = (obj.key, time.strptime(obj.last_modified[:19], "%Y-%m-%dT%H:%M:%S"))
    key_repo.append(t)

# Sort newest first.
key_repo.sort(key=lambda item: item[1], reverse=True)

for key in key_repo[:10]:  # top 10 items in the list
    print(key[0], key[1])
PS: I am a beginner at Python, so the code might not be optimized. Feel free to edit the answer to provide better code.