Uploading Local Directory to GCS using Airflow - google-cloud-platform

I am trying to use Airflow to upload a directory (with parquet files) to GCS.
I tried the FileToGoogleCloudStorageOperator for this purpose.
I tried the following options:
Option 1
src=<Path>/*.parquet
It errors out with: No such file found
Option 2
src=<Path>, where path is the directory path
It errors out with: Is a directory
Questions
Is there any way FileToGoogleCloudStorageOperator can scale up to the directory level?
Is there any alternate way of doing the same?

Short Answer: Currently it is not possible. But I will take it as a feature request and try to add this in the upcoming release.
Until then, you can just use a BashOperator and use gsutil to copy multiple files at the same time.
Another option is to use a PythonOperator: list the files using the os package, loop over them, and use GoogleCloudStorageHook.upload to upload each file.
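A minimal sketch of both workarounds (the paths, bucket name, and connection ID below are placeholders, and the imports assume the older contrib packages this question refers to):
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
import os

# Option A: shell out to gsutil and let it copy the whole directory in parallel
upload_dir = BashOperator(
    task_id='upload_dir_with_gsutil',
    bash_command='gsutil -m cp -r /path/to/local/dir gs://my-bucket/my-prefix/',
    dag=dag,  # assumes a DAG object named `dag` is already defined
)

# Option B: list the parquet files in Python and upload them one by one via the hook
def upload_parquet_files(**_):
    hook = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default')
    local_dir = '/path/to/local/dir'
    for name in os.listdir(local_dir):
        if name.endswith('.parquet'):
            hook.upload(
                bucket='my-bucket',
                object='my-prefix/' + name,
                filename=os.path.join(local_dir, name),
            )

upload_files = PythonOperator(
    task_id='upload_parquet_files',
    python_callable=upload_parquet_files,
    dag=dag,
)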

Related

Where is a sensible place to put kube_config.yaml files on MWAA?

The example code in the MWAA docs for connecting MWAA to EKS has the following:
#use a kube_config stored in s3 dags folder for now
kube_config_path = '/usr/local/airflow/dags/kube_config.yaml'
This doesn't make me think that putting the kube_config.yaml file in the dags/ directory is a sensible long-term solution.
But I can't find any mention in the docs about where would be a sensible place to store this file.
Can anyone link me to a reliable source on this? Or make a sensible suggestion?
From KubernetesPodOperator Airflow documentation:
Users can specify a kubeconfig file using the config_file parameter, otherwise the operator will default to ~/.kube/config.
In a local environment, the kube_config.yaml file can be stored in a specific directory reserved for Kubernetes (e.g. .kube, kubeconfig). Reference: KubernetesPodOperator (Airflow).
In the MWAA environment, where DAG files are stored in S3, the kube_config.yaml file can be stored anywhere in the root DAG folder (including any subdirectory of the root DAG folder, e.g. /dags/kube). The location of the file is less important than explicitly excluding it from DAG parsing via the .airflowignore file (a usage sketch follows the examples below). Reference: .airflowignore (Airflow).
Example S3 directory layout:
s3://<bucket>/dags/dag_1.py
s3://<bucket>/dags/dag_2.py
s3://<bucket>/dags/kube/kube_config.yaml
s3://<bucket>/dags/operators/operator_1.py
s3://<bucket>/dags/operators/operator_2.py
s3://<bucket>/dags/.airflowignore
Example .airflowignore file:
kube/
operators/
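For completeness, a minimal sketch of how a task might then reference the relocated file via config_file. The import path, namespace, and image below are assumptions and vary with your Airflow/provider version and cluster:
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

kube_config_path = '/usr/local/airflow/dags/kube/kube_config.yaml'

run_pod = KubernetesPodOperator(
    task_id='run_pod',
    namespace='mwaa',                 # assumed namespace
    image='ubuntu:20.04',
    cmds=['bash', '-c', 'echo hello'],
    config_file=kube_config_path,     # the file stored under dags/kube/, excluded from DAG parsing
    in_cluster=False,
    get_logs=True,
    dag=dag,                          # assumes a DAG object named `dag` is already defined
)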

Copying objects from a directory in one bucket to another bucket using Transfer

I want to use Google Transfer to copy all folders/files in a specific directory in Bucket-1 to the root directory of Bucket-2.
I have tried to use Transfer with the filter option, but it doesn't copy anything across.
Any pointers on getting this to work within Transfer, or a step-by-step for functions, would be really appreciated.
I reproduced your issue and it worked for me using gsutil.
For example:
gsutil cp -r gs://SourceBucketName/example.txt gs://DestinationBucketName
Furthermore, I tried to copy using the Transfer option and it also worked. The steps I followed with the Transfer option are these:
1 - Create new Transfer Job
Panel: “Select Source”:
2 - Select your source for example Google Cloud Storage bucket
3 - Select your bucket with the data which you want to copy.
4 - In the field “Transfer files with these prefixes” add your data (I used “example.txt”)
Panel “Select destination”:
5 - Select your destination Bucket
Panel “Configure transfer”:
6 - Run now if you want to complete the transfer now.
7 - Press “Create”.
For more information about copying from one bucket to another, you can check the official documentation.
So, a few things to consider here:
You have to keep in mind that Google Cloud Storage buckets don’t treat subdirectories the way you would expect. To the bucket it is basically all part of the file name. You can find more information about that in the How Subdirectories Work documentation.
This is also the reason why you cannot transfer a file that is inside a “directory” and expect to see only the file's name appear in the root of your target bucket. To give you an example:
If you have a file at gs://my-bucket/my-bucket-subdirectory/myfile.txt, once you transfer it to your second bucket it will still have the subdirectory in its name, so the result will be: gs://my-second-bucket/my-bucket-subdirectory/myfile.txt
This is why, if you are interested in automating this process, you should definitely give the Google Cloud Storage Client Libraries a try.
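For example, a minimal sketch with the Python client library (bucket and prefix names are placeholders) that copies everything under a “directory” of Bucket-1 into the root of Bucket-2, dropping the prefix from the destination object names:
from google.cloud import storage

client = storage.Client()
src_bucket = client.bucket('bucket-1')
dst_bucket = client.bucket('bucket-2')
prefix = 'my-bucket-subdirectory/'

for blob in client.list_blobs(src_bucket, prefix=prefix):
    new_name = blob.name[len(prefix):]   # strip the prefix so the object lands in the root
    if new_name:                         # skip the zero-byte "folder" placeholder, if present
        src_bucket.copy_blob(blob, dst_bucket, new_name)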
Additionally, you could also use the GCS Client with Google Cloud Functions. However, I would just suggest this if you really need the Event Triggers offered by GCF. If you just want the transfer to run regularly, for example on a cron job, you could still use the GCS Client somewhere other than a Cloud Function.
The Cloud Storage Tutorial might give you a good example of how to handle Storage events.
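As an illustration only, a sketch of such a function (a 1st-gen background function triggered on object finalize; the destination bucket name is a placeholder):
from google.cloud import storage

def copy_new_object(event, context):
    """Triggered by a google.storage.object.finalize event on the source bucket."""
    client = storage.Client()
    src_bucket = client.bucket(event['bucket'])
    dst_bucket = client.bucket('my-second-bucket')
    blob = src_bucket.blob(event['name'])
    # copy the newly finalized object into the destination bucket, keeping only the base name
    src_bucket.copy_blob(blob, dst_bucket, event['name'].split('/')[-1])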
Also, in your future posts, try to provide as much relevant information as possible. For this post, as an example, it would've been nice to know what file structure you have in your buckets and what output you have been getting. And if you can state your use case straight away, it will also prevent other users from suggesting solutions that don't apply to your needs.
Try this in Cloud Shell in the project:
gsutil cp -r gs://bucket1/foldername gs://bucket2

Is there a way to see files stored in localstack's mocked S3 environment

I've set up a localstack install based off the article How to fake AWS locally with LocalStack. I've tested copying a file up to the mocked S3 service and it works great.
I started looking for the test file I uploaded. I see there's an encoded version of the file I uploaded inside .localstack/data/s3_api_calls.json, but I can't find it anywhere else.
Given DATA_DIR=/tmp/localstack/data, I was expecting to find it there, but it's not.
It's not critical that I have access to it directly on the file system, but it would be nice.
My question is: Is there anywhere/way to see files that are uploaded to the localstack's mock S3 service?
After the latest update, we now have only one port, which is 4566.
Yes, you can see your file.
Open http://localhost:4566/your-funny-bucket-name/your-weird-file-name in Chrome.
You should be able to see the content of your file now.
I went back and re-read the original article which states:
"Once we start uploading, we won't see new files appear in this directory. Instead, our uploads will be recorded in this file (s3_api_calls.json) as raw data."
So, it appears there isn't a direct way to see the files.
However, the Commandeer app provides a view into localstack that includes a directory listing of the mocked S3 buckets. There isn't currently a way to see the contents of the files, but the directory structure is enough for what I'm doing. UPDATE: According to @WallMobile it's now possible to see the contents of files too.
You could use the following command:
aws --endpoint-url=http://localhost:4572 s3 ls s3://<your-bucket-name>
In order to list a specific folder in the S3 bucket, you could use this command:
aws --endpoint-url=http://localhost:4566 s3 ls s3://<bucket-name>/<folder-in-bucket>/
The image is saved as a base64-encoded string in the file recorded_api_calls.json.
I have passed DATA_DIR=/tmp/localstack/data
and the file is saved at /tmp/localstack/data/recorded_api_calls.json
Open the file and copy the data (d) from any API call that looks like this:
"a": "s3", "m": "PUT", "p": "/bucket-name/foo.png"
Extract this data using base64 decoding.
You can use this script to extract data from localstack s3
https://github.com/nkalra0123/extract-data-from-localstack-s3/blob/main/script.sh
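A rough sketch of the same idea in Python. It assumes each record in recorded_api_calls.json is a JSON object on its own line with the "a", "m", "p" and base64-encoded "d" fields described above:
import base64
import json
import os

with open('/tmp/localstack/data/recorded_api_calls.json') as f:
    for line in f:
        record = json.loads(line)
        if record.get('a') == 's3' and record.get('m') == 'PUT' and 'd' in record:
            # "p" looks like "/bucket-name/foo.png"; recreate that path under ./extracted
            target = os.path.join('extracted', record['p'].lstrip('/'))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, 'wb') as out:
                out.write(base64.b64decode(record['d']))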
From my understanding, localstack saves the data in memory by default. This is what happens unless you specify a data directory. Obviously, if it's in memory, you won't see any files anywhere.
To create a data directory you can run a command such as:
mkdir ~/.localstack
Then you have to instruct localstack to save the data at that location. You can do so by adding the DATA_DIR=... path and a volume in your docker-compose.yml file like so:
localstack:
  image: localstack/localstack:latest
  ports:
    - 4566:4566
    - 8055:8080
  environment:
    - SERVICES=s3
    - DATA_DIR=/tmp/localstack/data
    - DOCKER_HOST=unix:///var/run/docker.sock
  volumes:
    - /home/alexis/.localstack:/tmp/localstack
Then rebuild and start the Docker container.
Once the localstack process has started, you'll see a JSON database under ~/.localstack/data/....
WARNING: if you are dealing with very large files (GB), then it is going to be DEAD SLOW. The issue is that all the data is going to be saved inside that one JSON file in base64. In other words, it's going to generate a file much bigger than what you sent, and re-reading it is also going to be very slow. It may be possible to fix this issue by setting the legacy storage mechanism to false:
LEGACY_PERSISTENCE=false
I have not tested that flag (yet).

Hive / S3 error: "No FileSystem for scheme: s3"

I am running Hive from a container (this image: https://hub.docker.com/r/bde2020/hive/) on my local computer.
I am trying to create a Hive table stored as a CSV in S3 with the following command:
CREATE EXTERNAL TABLE local_test (name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 's3://mybucket/local_test/';
However, I am getting the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException No FileSystem for scheme: s3)
What is causing it?
Do I need to set up something else?
Note:
I am able to run aws s3 ls mybucket and also to create Hive tables in another directory, like /tmp/.
Problem discussed here.
https://github.com/ramhiser/spark-kubernetes/issues/3
You need to add a reference to the AWS SDK jars to the Hive library path. That way it can recognize the file schemes
s3, s3n, and s3a.
Hope it helps.
EDIT1:
hadoop-aws-2.7.4 has the implementations for interacting with those file systems. Verifying the jar shows it has all the implementations to handle those schemes.
org.apache.hadoop.fs tells Hadoop which file system implementation it needs to look for.
The classes below are implemented in that jar:
org.apache.hadoop.fs.[s3|s3a|s3native]
The only thing still missing is that the library is not getting added to the Hive library path. Is there any way you can verify that the path is added to the Hive library path?
EDIT2:
Reference for library path setting:
How can I access S3/S3n from a local Hadoop 2.6 installation?

How to configure Apache Flume not to rename ingested files with .COMPLETED

We have an AWS S3 bucket in which we get new CSV files at a 10-minute interval. The goal is to ingest these files into Hive.
So the obvious way for me is to use Apache Flume for this and use the Spooling Directory source, which will keep looking for new files in the landing directory and ingest them into Hive.
We have read-only permissions for the S3 bucket and for the landing directory into which the files will be copied, and Flume marks ingested files with a .COMPLETED suffix. So in our case Flume won't be able to mark completed files because of the permission issue.
Now the questions are:
What will happen if Flume is not able to add the suffix to completed files? Will it give an error or will it fail silently? (I am actually testing this, but if anyone has already tried it then I don't have to reinvent the wheel.)
Will Flume be able to ingest files without marking them with .COMPLETED?
Is there any other Big Data tool/technology better suited for this use case?
The Flume Spooling Directory Source needs write permission either to rename or to delete the processed/read log file.
Check the 'fileSuffix' and 'deletePolicy' settings.
If it doesn't rename/delete the completed files, it can't figure out which files have already been processed.
You might want to write a 'script' that copies from the read-only S3 bucket to a 'staging' folder with write permissions, and provide this staging folder as the source to Flume (a sketch follows below).
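A minimal sketch of such a staging script using boto3 (bucket, prefix, and paths are placeholders); run it periodically, e.g. from cron, and point the Flume Spooling Directory Source at the staging directory:
import os
import boto3

BUCKET = 'my-read-only-bucket'
PREFIX = 'incoming/'
STAGING_DIR = '/data/flume/staging'

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if not key.endswith('.csv'):
            continue
        target = os.path.join(STAGING_DIR, os.path.basename(key))
        if not os.path.exists(target):   # only stage files we have not copied yet
            s3.download_file(BUCKET, key, target)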