Airflow run dataproc job with code that sits in git repository - google-cloud-platform

I'm looking at the documentation of the DataProcPySparkOperator to understand where to send the code file for the pyspark job and the dependencies files (pyfiles). As I understand I should use the "main" and "pyfiles" arguments.
But it's not clear where these files should exist. Can I give a link to git and they will be taken from there, or should I use Google cloud storage (in my case I'm on Google cloud)?
Or should I handle the copy of the files by myself and then provide a link to the master storage?

You need to pass it in main. It can be a local python file or a file on GCS, both are supported. In the case where the file is local, Airflow uploads it to GCS and passed that path to the Dataproc API.

Related

Is there a way to handle files within google cloud storage?

There are many compressed files inside google cloud storage.
I want to unzip the zipped file, rename it and save it back to another bucket.
I've seen a lot of posts, but I couldn't find a way other than how to download it with gsutil and handle it.
Do you have any other way?
To modify a file, such as unzipping, you must read, modify and then write. This means download, unzip and upload the extracted files.
Use gsutil or another tool, one of the SDKs, or the REST APIs. To unzip a file, use a zip tool or one of the libraries that support zip operations.
You can start using cloudFunctions for it. CloudFunction will get triggered when the object is created/finalized and it will do the job automatically.
You can also use the same function to iterate over all the zip files, do the tasks and move same files to another bucket. One thing to make sure it may not be able to move all files in a single go, but for the files which already exists on the bucket, you can use local machine to run the program.
Here is the list of libraries to connect to CloudStorage

Google Cloud Bucket mounted on Compute Engine Instance using gcsfuse does not create files

I have been able to mount Google Cloud Bucket using
gcsfuse --implicit-dirs " production-xxx-appspot /mount
or equally
sudo mount -t gcsfuse -o implicit_dirs,allow_other,uid=1000,gid=1000,key_file=service-account.json production-xxx-appspot /mount
Mounting works fine.
What happens is that when I execute the following commands after mounting, they also work fine :
mkdir /mount/files/
cp -rf /home/files/* /mount/files/
However, when I use :
mcedit /mount/files/a.txt
or
vi /mount/files/a.txt
The output says that there is no file available which makes sense.
Is there any other way to cover this situation, and use applications in a way that they can directly create files on the mounted google cloud bucket rather than creating files locally and copying afterwards.
If you do not want to create files locally and upload later, you should consider using a file storage system like Google Drive
Google Cloud storage is an object Storage system that means objects cannot be modified, you have to write the object completely at once. Object storage also does not work well with traditional databases, because writing objects is a slow process and writing an app to use an object storage API is not as simple as using file storage.
In a file storage system, Data is stored as a single piece of information inside a folder, just like you would organize pieces of paper inside a manila folder. When you need to access that piece of data, your computer needs to know the path to find it. (Beware—It can be a long, winding path.)
If you want to use Google Cloud Storage, you need to create your file locally and then push it to your bucket.
Here are an example of how to configure Google Cloud Storage with Node.js: File Upload example
Here is a tutorial on How to mount Object Storage on Cloud Server using s3fs-fuse
If you want to know more about storage formats please follow this link
More information about reading and writing to Cloud Storage in this link

Download checkpoint from AWS

How can I download the checkpoints and logged statistics after I run my deep learning algorithm on AWS using SageMaker?
When you created the training job (notebook or GUI it doesn't matter) you have defined an output directory, that usually is a S3 bucket. At the end of the training job, sagemaker automatically upload everything (with everything I mean everything that is contained in /opt/ml/model, if you use a predefined container this is automatic, otherwise it's up to you to write you artifact there) in a compressed archive to that output directory. So simply download the archive from S3.

Automate uploading files stored locally to Cloud Storage using gsutil

I'm new to GCP, I'm trying to build an ETL stream that will upload data from files to BigQuery. It seems to me that the best solution would be to use gsutil. The steps I see today are:
(done) Downloading the .zip file from the SFTP server to the virtual machine
(done) Unpacking the file
Uploading files from VM to Cloud Storage
(done) Automatically upload files from Cloud Storage to BigQuery
Steps 1 and 2 would be performed according to the schedule, but I would like step 3 to be event driven. So when files are copied to a specific folder, gsutil will send them to the specified bucket in Cloud Storage. Any ideas how can this be done?
Assuming you're running on a Linux VM, you might want to check out inotifywait, as mentioned in this question -- you can run this as a background process to try it out, e.g. bash /path/to/my/inotify/script.sh &, and then set it up as a daemon once you've tested it out and got something working to your liking.

Creating DCOS service with artifacts from hdfs

I'm trying to create DCOS services that download artifacts(custom config files etc.) from hdfs. I was using simple ftp server before for it but I wanted to use hdfs. It is allowed to use "hdfs://" in artifact uri but it doesn't work correctly.
Artifact fetch ends with error because there's no "hadoop" command. Weird. I read that I need to provide own hadoop for it.
So I downloaded hadoop, set up necessary variables in /etc/profile. I can run "hadoop" without any problem when ssh'ing to node but service still ends with the same error.
It seems that environment variables configured in service are used after the artifact fetch because they don't work at all. Also, it looks like services completely ignore /etc/profile file.
So my question is: how do I set up everything so my service can fetch artifacts stored on hdfs?
The Mesos fetcher supports local Hadoop clients, please check your agent configuration and in particular your --hadoop_home setting.