Automate uploading files stored locally to Cloud Storage using gsutil - google-cloud-platform

I'm new to GCP, I'm trying to build an ETL stream that will upload data from files to BigQuery. It seems to me that the best solution would be to use gsutil. The steps I see today are:
(done) Downloading the .zip file from the SFTP server to the virtual machine
(done) Unpacking the file
Uploading files from VM to Cloud Storage
(done) Automatically upload files from Cloud Storage to BigQuery
Steps 1 and 2 would be performed on a schedule, but I would like step 3 to be event driven: when files are copied to a specific folder, gsutil should send them to the specified bucket in Cloud Storage. Any ideas how this can be done?

Assuming you're running on a Linux VM, you might want to check out inotifywait, as mentioned in this question -- you can run this as a background process to try it out, e.g. bash /path/to/my/inotify/script.sh &, and then set it up as a daemon once you've tested it out and got something working to your liking.
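If you'd rather avoid maintaining a shell script, another rough option is to poll the landing folder on a schedule and let gsutil rsync push anything new to the bucket. Here is a minimal Python sketch of that idea (the folder path, bucket name and polling interval are placeholders, and this polls rather than reacting to inotify events):

import subprocess
import time

LOCAL_DIR = "/home/etl/unzipped"           # hypothetical folder the zip is extracted into
BUCKET_URI = "gs://my-landing-bucket/raw"  # hypothetical destination bucket/prefix

while True:
    # gsutil rsync only copies files that are new or changed since the last run
    subprocess.run(["gsutil", "-m", "rsync", "-r", LOCAL_DIR, BUCKET_URI], check=True)
    time.sleep(60)  # poll once a minute; inotifywait gives true event-driven behaviour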

Related

Strapi - how to switch and migrate from Cloudinary to S3 in production

Given the rather steep cost of Cloudinary as a multimedia hosting service (images and videos), our client decided that they want to switch to AWS S3 for file hosting.
The problem is that there are a lot of files (thousands of images and videos) already in the app, so merely switching the provider is not enough - we need to also migrate all the files and make it look like nothing really changed for the end user.
This topic is somewhat covered on the Strapi forum: https://forum.strapi.io/t/switch-from-cloudinary-to-s3/15285, but there is no solution posted besides a vaguely described procedure.
Is there a way to reliably perform the migration, without losing any data and without the need to change anything on client (apps that communicate with Strapi by REST/GraphQL API) side?
There are three steps to perform the migration:
switch provider from Cloudinary to S3 in Strapi
migrate files from Cloudinary to S3
perform database update to reroute Strapi from Cloudinary to S3
Switching provider
This is the only step that is actually well documented, so I will be brief here.
First, you need to uninstall the Cloudinary Strapi provider by running yarn remove @strapi/provider-upload-cloudinary and install the S3 provider by running yarn add @strapi/provider-upload-aws-s3.
After you do that, you need to create your AWS infrastructure (an S3 bucket and an IAM user with sufficient permissions). Please follow the official Strapi S3 provider documentation https://market.strapi.io/providers/@strapi-provider-upload-aws-s3 and this guide https://dev.to/kevinadhiguna/how-to-setup-amazon-s3-upload-provider-in-your-strapi-app-1opc for the steps to follow.
Check that you've done everything correctly by logging in to your Strapi Admin Panel and opening the Media Library. If everything went well, all images should be missing (you will see all metadata like sizes and extensions, but not the actual images). Try to upload a new image by clicking the 'Add new assets' button. This image should upload successfully and also appear in your S3 bucket.
After everything works as described above, proceed to actual data migration.
Files migration
The simplest (and most error-resistant) way to migrate files from Cloudinary to S3 is to download them locally, then use the AWS Console to upload them. If you have only hundreds (or low thousands) of files to migrate, you might actually use the Cloudinary Web UI to download them all (there is a limit of 1000 files per download from the Cloudinary Web App).
If this is not suitable for you, there is a CLI available that can easily download all files using your terminal:
pip3 install cloudinary-cli (download CLI)
cld config -url {CLOUDINARY_API_ENV} (the API environment variable can be found on the first page you see when you log into Cloudinary)
cld -C {CLOUD_NAME} sync --pull . / (This step begins the download. Based on how many files you have, it might take a while. Run this command from the directory you want to download the files into. {CLOUD_NAME} can be found just above {CLOUDINARY_API_ENV} on the Cloudinary dashboard; you should also see it after running the second command in your terminal. For me, this command failed several times in the middle of the download, but you can just run it again and it will continue without any problem.)
After you download the files to your computer, simply use the S3 drag-and-drop feature to upload them into your S3 bucket, or script the upload as sketched below.
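If drag and drop is impractical for the number of files you have, here is a minimal boto3 sketch (the bucket name and local directory are hypothetical) that walks the downloaded folder and pushes each file to the bucket, using just the file name as the key to match the {S3_BUCKET_LINK}/{FILE} format described below:

import os
import boto3

s3 = boto3.client("s3")
local_root = "cloudinary-export"  # directory the cld sync --pull command was run in
bucket = "my-strapi-uploads"      # hypothetical S3 bucket name

for dirpath, _, filenames in os.walk(local_root):
    for name in filenames:
        # key is just the file name, matching the url format Strapi will expect after the DB update
        s3.upload_file(os.path.join(dirpath, name), bucket, name)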
Update database
Strapi saves links to all files in the database. This means that even though you switched your provider to S3 and copied all the files, Strapi still doesn't know where to find them, as the links in the database point to the Cloudinary server.
You need to update three columns in the Strapi database (this approach is tested on a Postgres database; there might be minor changes when using other databases). Look into the 'files' table; there should be url, formats and provider columns.
The provider column is trivial: just replace cloudinary with aws-s3.
The url and formats columns are harder, as you need to replace only part of the string. To be more precise, Cloudinary stores urls in {CLOUDINARY_LINK}/{VERSION}/{FILE} format, while S3 uses {S3_BUCKET_LINK}/{FILE} format.
My friend and colleague came up with the following SQL query to perform the update:
UPDATE files SET
formats = REGEXP_REPLACE(formats::TEXT, '\"https:\/\/res\.cloudinary\.com\/{CLOUDINARY_PROJECT}\/((image)|(video))\/upload\/v\d{10}\/([\w\.]+)\"', '"https://{BUCKET_NAME}.s3.{REGION}/\4"', 'g')::JSONB,
url = REGEXP_REPLACE(url, 'https:\/\/res\.cloudinary\.com\/{CLOUDINARY_PROJECT}\/((image)|(video))\/upload\/v\d{10}\/([\w\.]+)', 'https://{BUCKET_NAME}.s3.{REGION}/\4', 'g')
Just don't forget to replace {CLOUDINARY_PROJECT}, {BUCKET_NAME} and {REGION} with the correct strings (the easiest way to see those values is to access the database, go to the files table, and compare one of the old urls with the url of the file you uploaded at the end of the Switching provider step).
Also, before running the query, don't forget to back up your database! Even better, make a copy of the production database and run the query on it before you touch production.
And that's all! Strapi is now uploading files to S3 bucket and you also have access to all the data you previously had on Cloudinary.

Data streaming from raspberry pi CSV file to BigQuery table

I have some CSV files generated by a Raspberry Pi that need to be pushed into BigQuery tables.
Currently, we have a Python script using bigquery.LoadJobConfig for batch upload, and I run it manually. The goal is to have streaming data (or a load every 15 minutes) in a simple way.
I explored different solutions:
Using airflow to run the python script (high complexity and maintenance)
Dataflow (I am not familiar with it but if it does the job I will use it)
Scheduling pipeline to run the script through GitLab CI (cron syntax: */15 * * * * )
Could you please suggest the best way to push CSV files into BigQuery tables in real time or every 15 minutes?
Good news, you have many options! Perhaps the easiest would be to automate the Python script that you currently have, since it does what you need. Assuming you are running it manually on a local machine, you could upload it to a lightweight VM on Google Cloud, then use cron on the VM to automate running it; I have used this approach in the past and it worked well.
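For reference, a batch-load script of the kind mentioned in the question might look roughly like the sketch below (the dataset, table and file path are made up for the example); a crontab entry such as */15 * * * * python3 /path/to/load_csv.py would then run it every 15 minutes:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # assumes the CSV has a header row
    autodetect=True,      # let BigQuery infer the schema
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

with open("/home/pi/data/readings.csv", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file,
        "my_project.my_dataset.sensor_readings",  # hypothetical destination table
        job_config=job_config,
    )

load_job.result()  # block until the load job finishes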
Another option would be to deploy your Python code to a Google Cloud Function, a way to let GCP run the code without you having to worry about maintaining the backend resource.
Find out more about Cloud Functions here: https://cloud.google.com/functions
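As a rough illustration, a background Cloud Function triggered by new objects in a bucket (deployed with --trigger-bucket) could load each CSV as it lands; the dataset and table names below are placeholders:

from google.cloud import bigquery

def load_csv_to_bq(event, context):
    # Triggered when a file is finalized in the bucket the function watches
    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_uri(uri, "my_dataset.sensor_readings", job_config=job_config).result()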
A third option: depending on where your .csv files are being generated, you could perhaps use the BigQuery Data Transfer Service to handle the imports into BigQuery.
More on that here: https://cloud.google.com/bigquery/docs/dts-introduction
Good luck!
Adding to @Ben's answer, you can also use Cloud Composer to orchestrate this workflow. It is built on Apache Airflow, so you can use Airflow-native tools such as the Airflow web interface, command-line tools and scheduler without worrying about your infrastructure and maintenance.
You can implement a DAG to
upload the CSV from local storage to GCS, then
load it from GCS to BigQuery using GCSToBigQueryOperator (see the sketch below)
More on Cloud Composer
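A minimal DAG along those lines might look like the sketch below (the bucket, dataset, file paths and the 15-minute schedule are assumptions for illustration; the operators come from the apache-airflow-providers-google package):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="csv_to_bigquery",
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(minutes=15),
    catchup=False,
) as dag:
    # Step 1: push the CSV from the worker's local filesystem to a GCS bucket
    upload_csv = LocalFilesystemToGCSOperator(
        task_id="upload_csv",
        src="/home/airflow/gcs/data/readings.csv",
        dst="incoming/readings.csv",
        bucket="my-landing-bucket",
    )
    # Step 2: load the object from GCS into a BigQuery table
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-landing-bucket",
        source_objects=["incoming/readings.csv"],
        destination_project_dataset_table="my_dataset.sensor_readings",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )
    upload_csv >> load_to_bq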

Google Cloud Bucket mounted on Compute Engine Instance using gcsfuse does not create files

I have been able to mount Google Cloud Bucket using
gcsfuse --implicit-dirs production-xxx-appspot /mount
or equally
sudo mount -t gcsfuse -o implicit_dirs,allow_other,uid=1000,gid=1000,key_file=service-account.json production-xxx-appspot /mount
Mounting works fine.
What happens is that when I execute the following commands after mounting, they also work fine :
mkdir /mount/files/
cp -rf /home/files/* /mount/files/
However, when I use :
mcedit /mount/files/a.txt
or
vi /mount/files/a.txt
The output says that there is no file available, which makes sense.
Is there any other way to cover this situation, so that applications can directly create files on the mounted Google Cloud bucket rather than creating files locally and copying them afterwards?
If you do not want to create files locally and upload later, you should consider using a file storage system like Google Drive
Google Cloud Storage is an object storage system, which means objects cannot be modified in place; you have to write each object completely at once. Object storage also does not work well with traditional databases, because writing objects is a slow process and writing an app to use an object storage API is not as simple as using file storage.
In a file storage system, Data is stored as a single piece of information inside a folder, just like you would organize pieces of paper inside a manila folder. When you need to access that piece of data, your computer needs to know the path to find it. (Beware—It can be a long, winding path.)
If you want to use Google Cloud Storage, you need to create your file locally and then push it to your bucket.
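For instance, here is a short Python sketch with the google-cloud-storage client (the object path and local file are placeholders; the bucket name is taken from the mount command in the question) that writes a locally created file straight into the bucket:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("production-xxx-appspot")
blob = bucket.blob("files/a.txt")                # object path inside the bucket
blob.upload_from_filename("/home/files/a.txt")   # single, complete object write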
Here is an example of how to configure Google Cloud Storage with Node.js: File Upload example
Here is a tutorial on How to mount Object Storage on Cloud Server using s3fs-fuse
If you want to know more about storage formats please follow this link
More information about reading and writing to Cloud Storage in this link

Download checkpoint from AWS

How can I download the checkpoints and logged statistics after I run my deep learning algorithm on AWS using SageMaker?
When you created the training job (notebook or GUI, it doesn't matter), you defined an output directory, which is usually an S3 bucket. At the end of the training job, SageMaker automatically uploads everything (by everything I mean everything contained in /opt/ml/model; if you use a predefined container this happens automatically, otherwise it's up to you to write your artifacts there) as a compressed archive to that output directory. So simply download the archive from S3.
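For example, if the output location was s3://my-sagemaker-bucket/ and the training job was named my-training-job (both placeholders), the archive can be fetched with boto3 (the console or aws s3 cp work just as well):

import boto3

s3 = boto3.client("s3")
# SageMaker writes the archive under <output-prefix>/<training-job-name>/output/model.tar.gz
s3.download_file(
    "my-sagemaker-bucket",
    "my-training-job/output/model.tar.gz",
    "model.tar.gz",
)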

Airflow run dataproc job with code that sits in git repository

I'm looking at the documentation of the DataProcPySparkOperator to understand where to place the code file for the PySpark job and the dependency files (pyfiles). As I understand it, I should use the "main" and "pyfiles" arguments.
But it's not clear where these files should live. Can I give a link to Git so they will be taken from there, or should I use Google Cloud Storage (in my case I'm on Google Cloud)?
Or should I handle copying the files myself and then provide a link to the master's storage?
You need to pass it in main. It can be a local Python file or a file on GCS; both are supported. In the case where the file is local, Airflow uploads it to GCS and passes that path to the Dataproc API.
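For illustration, a task using that operator might look like the sketch below (Airflow 1.x contrib import path; the cluster name, region and GCS paths are made up):

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

with DAG(dag_id="pyspark_on_dataproc", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    submit_job = DataProcPySparkOperator(
        task_id="submit_pyspark_job",
        main="gs://my-code-bucket/jobs/etl_job.py",     # local path or gs:// URI to the driver file
        pyfiles=["gs://my-code-bucket/jobs/deps.zip"],  # extra modules/packages the job imports
        cluster_name="my-dataproc-cluster",
        region="europe-west1",
    )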