Copy file from windows remote server to GCS bucket using Airflow - google-cloud-platform

file_path = "\\wfs8-XXXXX\XXXX"
The files are on a remote server path. I am using Cloud Composer to automate my data pipeline, so how will I be able to copy the files from the remote Windows server to a GCS bucket using Composer?
I tried to use LocalFilesystemToGCSOperator, but I am not able to provide any connection option to connect to the remote Windows server. Please advise.
upload_file = LocalFilesystemToGCSOperator(
    task_id="upload_file",
    src=PATH_TO_UPLOAD_FILE,
    dst=DESTINATION_FILE_LOCATION,
    bucket=BUCKET_NAME,
)

In this case you can use the SFTPToGCSOperator, for example:
from airflow.providers.google.cloud.transfers.sftp_to_gcs import SFTPToGCSOperator

copy_file_from_sftp_to_gcs = SFTPToGCSOperator(
    task_id="file-copy-sftp-to-gcs",
    sftp_conn_id="<your-connection>",
    source_path=f"{FILE_LOCAL_PATH}/{OBJECT_SRC_1}",
    destination_bucket=BUCKET_NAME,
)
You have to configure the SFTP connection in Airflow; you can check this topic for an example.
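As a rough illustration (not the exact setup for your environment), the SFTP connection fields can be sketched with airflow.models.Connection; the connection id, user, and password below are hypothetical, and this assumes an SFTP server is reachable on the Windows host. The printed URI can be stored as an AIRFLOW_CONN_* environment variable or entered through the Airflow UI/CLI.
from airflow.models.connection import Connection

# Hypothetical SFTP connection for the remote Windows server.
sftp_conn = Connection(
    conn_id="sftp_windows_server",   # hypothetical connection id
    conn_type="sftp",
    host="wfs8-XXXXX",               # the remote Windows host
    login="my_user",                 # placeholder credentials
    password="my_password",
    port=22,
)
print(sftp_conn.get_uri())  # e.g. sftp://my_user:my_password@wfs8-XXXXX:22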

Related

WinSCP error while performing directory Sync

I've developed a .Net console application to run as a webjob under Azure App Service.
This console app is using WinSCP to transfer files from App Service Filesystem to an on-prem FTP Server.
The job is failing with below error:
Upload of "D:\ ...\log.txt" failed: WinSCP.SessionRemoteException: Error deleting file 'log.txt'. After resumable file upload the existing destination file must be deleted. If you do not have permissions to delete file destination file, you need to disable resumable file transfers.
Here is the code snippet I use to perform the directory sync (I've disabled deletion):
var syncResult = session.SynchronizeDirectories(SynchronizationMode.Remote, localFolder, remoteFolder, false, false);
Any clues on how to disable resumable file transfers?
Use TransferOptions.ResumeSupport:
var transferOptions = new TransferOptions();
transferOptions.ResumeSupport.State = TransferResumeSupportState.Off;
var syncResult =
    session.SynchronizeDirectories(
        SynchronizationMode.Remote, localFolder, remoteFolder, false, false,
        transferOptions);

AWS Airflow v2.0.2 doesn't show Google Cloud connection type

I want to load data from Google Cloud Storage to S3.
To do this I want to use GoogleCloudStorageToS3Operator, which requires gcp_conn_id.
So, I need to set up the Google Cloud connection type.
To do this, I added
apache-airflow[google]==2.0.2
to requirements.txt,
but the Google Cloud connection type is still not in the dropdown list of connections in MWAA.
The same approach works well with the MWAA local runner:
https://github.com/aws/aws-mwaa-local-runner
I guess it does not work in MWAA because of the security reasons discussed here:
https://lists.apache.org/thread.html/r67dca5845c48cec4c0b3c34c3584f7c759a0b010172b94d75b3188a3%40%3Cdev.airflow.apache.org%3E
But still, is there any workaround to add the Google Cloud connection type in MWAA?
Connections can be created and managed using either the UI or environment variables.
To my understanding, the limitation that MWAA has on the installation of some provider packages applies only to the web server machine, which is why the connection type is not listed in the UI. This doesn't mean you can't create the connection at all; it just means you can't do it from the UI.
You can define it from the CLI:
airflow connections add [-h] [--conn-description CONN_DESCRIPTION]
                        [--conn-extra CONN_EXTRA] [--conn-host CONN_HOST]
                        [--conn-login CONN_LOGIN]
                        [--conn-password CONN_PASSWORD]
                        [--conn-port CONN_PORT] [--conn-schema CONN_SCHEMA]
                        [--conn-type CONN_TYPE] [--conn-uri CONN_URI]
                        conn_id
You can also generate a connection URI to make it easier to set.
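For example, here is a rough sketch (placeholder key path and project) of building such a URI with airflow.models.Connection instead of writing it by hand:
import json

from airflow.models.connection import Connection

# Placeholder extras; adjust to your key location and project.
conn = Connection(
    conn_id="google_cloud_default",
    conn_type="google_cloud_platform",
    extra=json.dumps(
        {
            "extra__google_cloud_platform__key_path": "/keys/key.json",
            "extra__google_cloud_platform__project": "my-project",
            "extra__google_cloud_platform__num_retries": "5",
        }
    ),
)

# Prints a URI like the one in the export example below.
print(conn.get_uri())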
Connections can also be set as environment variables. Example:
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT='google-cloud-platform://?extra__google_cloud_platform__key_path=%2Fkeys%2Fkey.json&extra__google_cloud_platform__scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&extra__google_cloud_platform__project=airflow&extra__google_cloud_platform__num_retries=5'
If needed you can check the google provider package docs to review the configuration options of the connection.
For MWAA there are two options to set the connection:
1. Setting an environment variable, using the pattern AIRFLOW_CONN_YOUR_CONNECTION_NAME, where e.g. YOUR_CONNECTION_NAME = GOOGLE_CLOUD_DEFAULT. That can be done using a custom plugin (see the sketch after this list):
https://docs.aws.amazon.com/mwaa/latest/userguide/samples-env-variables.html
2. Using Secrets Manager:
https://docs.aws.amazon.com/mwaa/latest/userguide/connections-secrets-manager.html
Tested for the Google Cloud connection; both are working.
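For option 1, a minimal sketch of such a plugin, modeled on the AWS sample linked above (the connection URI value here is a placeholder):
# env_var_plugin.py — ship this in the MWAA plugins.zip; the module-level code
# runs when the plugin is imported by the scheduler and workers, exposing the
# connection as an environment variable.
import os

from airflow.plugins_manager import AirflowPlugin

os.environ["AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT"] = (
    "google-cloud-platform://?extra__google_cloud_platform__key_path=%2Fkeys%2Fkey.json"
    "&extra__google_cloud_platform__project=my-project"
)

class EnvVarPlugin(AirflowPlugin):
    name = "env_var_plugin"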
I asked AWS support about this issue. Looks like they are working on it.
They told me a way to configure the Google Cloud Platform connection by passing a JSON object in the extras with Conn Type set to HTTP, and it works.
I have validated it by editing google_cloud_default (Airflow > Admin > Connections):
Conn Type: HTTP
Extra:
{
  "extra__google_cloud_platform__project": "<YOUR_VALUE>",
  "extra__google_cloud_platform__key_path": "",
  "extra__google_cloud_platform__keyfile_dict": "{\"type\": \"service_account\", \"project_id\": \"<YOUR_VALUE>\", \"private_key_id\": \"<YOUR_VALUE>\", \"private_key\": \"-----BEGIN PRIVATE KEY-----\\n<YOUR_VALUE>\\n-----END PRIVATE KEY-----\\n\", \"client_email\": \"<YOUR_VALUE>\", \"client_id\": \"<YOUR_VALUE>\", \"auth_uri\": \"https://<YOUR_VALUE>\", \"token_uri\": \"https://<YOUR_VALUE>\", \"auth_provider_x509_cert_url\": \"https://<YOUR_VALUE>\", \"client_x509_cert_url\": \"https://<YOUR_VALUE>\"}",
  "extra__google_cloud_platform__scope": "",
  "extra__google_cloud_platform__num_retries": "5"
}
(screenshot of the Airflow connection configuration)
!! You must escape the " and \n characters in extra__google_cloud_platform__keyfile_dict, as shown above !!
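If it helps, here is a small Python snippet (placeholder values) that produces a correctly escaped Extra JSON; json.dumps() takes care of the quotes and newlines:
import json

# Placeholder service-account keyfile contents.
keyfile = {
    "type": "service_account",
    "project_id": "<YOUR_VALUE>",
    "private_key": "-----BEGIN PRIVATE KEY-----\n<YOUR_VALUE>\n-----END PRIVATE KEY-----\n",
    "client_email": "<YOUR_VALUE>",
}

extra = {
    "extra__google_cloud_platform__project": "<YOUR_VALUE>",
    "extra__google_cloud_platform__key_path": "",
    # json.dumps() escapes the embedded quotes and newlines.
    "extra__google_cloud_platform__keyfile_dict": json.dumps(keyfile),
    "extra__google_cloud_platform__scope": "",
    "extra__google_cloud_platform__num_retries": "5",
}

# Paste this output into the connection's Extra field.
print(json.dumps(extra, indent=2))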
In requirements.txt I used:
apache-airflow[gcp]==2.0.2
(I believe apache-airflow[google]==2.0.2 should work as well)

Moving files directly from S3 to FTP

I have a media-based web application running on AWS (EC2 Windows), and I'm trying to achieve scalability by adding the app and web servers to an auto scaling group.
My problem is that I need to move the media storage to S3 so that I can share it with the different app server clusters. But I also have to move these media files from S3 to different FTP servers. For that I currently have to download the files from S3 to the app server and then do the FTP upload, which takes too much time. Note that I am using ColdFusion as the application server.
Now I have two options to solve this:
Mount the S3 bucket on the EC2 instances (I know that is not recommended, and I'm not sure it would help improve the speed of the FTP upload).
Use the Lambda service to upload files directly from S3 to the FTP servers.
I cannot use a separate EBS volume for each of the EC2 instances because:
The storage volume is huge and it would result in high cost.
I would need to sync the media storage across the different EBS volumes attached to the EC2 instances.
EFS is not an option as I'm using Windows storage.
Can anyone suggest a better solution?
That is pretty easy with Python:
from ftplib import FTP
from socket import _GLOBAL_DEFAULT_TIMEOUT
import urllib.request

class FtpCopier(FTP):
    source_address = None
    timeout = _GLOBAL_DEFAULT_TIMEOUT

    # host → ftp host name / ip
    # user → ftp login user
    # password → ftp password
    # port → ftp port
    # encoding → ftp server encoding
    def __init__(self, host, user, password, port=21, encoding='utf-8'):
        self.host = host
        self.user = user
        self.password = password
        self.port = port
        self.connect(self.host, self.port)
        self.login(self.user, self.password, '')
        self.encoding = encoding

    # url → any web URL (for example S3)
    # to_path → ftp server full path (check that the ftp destination folders exist)
    # chunk_size_mb → data read chunk size
    def transfer(self, url, to_path, chunk_size_mb=10):
        chunk_size_mb = chunk_size_mb * 1048576  # 1024*1024
        file_handle = urllib.request.urlopen(url)
        self.storbinary("STOR %s" % to_path, file_handle, chunk_size_mb)
Usage example:
ftp = FtpCopier("some_host.com", "user", "p#ssw0rd")
ftp.transfer("https://bucket.s3.ap-northeast-2.amazonaws.com/path/file.jpg", "/path/new_file.jpg")
But remember that Lambda execution time is limited to 15 minutes, so a timeout may occur before the file transfer completes. I recommend using ECS Fargate instead of Lambda; it allows you to keep the process running as long as you need.
If the S3 file is not public, use presigned URLs to access it via urllib:
aws s3 presign s3://bucket/path/file.jpg --expires-in 604800
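If you prefer to generate the URL from Python instead of the CLI, boto3 can produce an equivalent presigned URL; the bucket and key below are placeholders, reusing the FtpCopier example above:
import boto3

# Credentials and region come from the Lambda/Fargate task environment.
s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "bucket", "Key": "path/file.jpg"},
    ExpiresIn=604800,  # 7 days, the maximum allowed
)

ftp = FtpCopier("some_host.com", "user", "p#ssw0rd")
ftp.transfer(url, "/path/new_file.jpg")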

Does AWS CPP S3 SDK support "Transfer acceleration"

I enabled "Transfer acceleration" on my bucket. But I dont see any improvement in speed of Upload in my C++ application. I have waited for more than 20 minutes that is mentioned in AWS Documentation.
Does the SDK support "Transfer acceleration" by default or is there a run time flag or compiler flag? I did not spot anything in the SDK code.
thanks
Currently, there isn't a configuration option that simply turns on transfer acceleration. You can, however, use the endpoint override in the client configuration to set the accelerated endpoint.
What I did to enable (working) transfer acceleration:
Set "Transfer Acceleration" to enabled in the bucket configuration in the AWS console.
Add the s3:PutAccelerateConfiguration permission to the IAM user that I use inside my C++ application.
Add the following code to the S3 transfer configuration (bucket_ is your bucket name; the final URL must match the one shown in the AWS console under "Transfer Acceleration"):
Aws::Client::ClientConfiguration config;
/* other configuration options */
config.endpointOverride = bucket_ + ".s3-accelerate.amazonaws.com";
Ask for acceleration on the bucket before the transfer (docs here):
auto s3Client = Aws::MakeShared<Aws::S3::S3Client>("Uploader",
    Aws::Auth::AWSCredentials(id_, key_), config);

Aws::S3::Model::PutBucketAccelerateConfigurationRequest bucket_accel;
bucket_accel.SetAccelerateConfiguration(
    Aws::S3::Model::AccelerateConfiguration().WithStatus(
        Aws::S3::Model::BucketAccelerateStatus::Enabled));
bucket_accel.SetBucket(bucket_);
s3Client->PutBucketAccelerateConfiguration(bucket_accel);
You can check in the detailed logs of the AWS SDK that your code is using the accelerated endpoint, and you can also check that before the transfer starts there is a call to /?accelerate (info).
What worked for me:
Enabling S3 Transfer Acceleration within the AWS console.
When configuring the client, use only the accelerated endpoint:
clientConfig->endpointOverride = "s3-accelerate.amazonaws.com";
@gabry, your solution was extremely close. I think the reason it wasn't working for me was perhaps SDK changes since it was originally posted, as the difference is relatively small, or maybe because I am constructing put object templates for requests used with the transfer manager.
Looking through the logs (Debug level), I saw that the SDK automatically concatenates the bucket passed to TransferManager::UploadFile() with the overridden endpoint. I was getting unresolved-host errors because the requested host looked like:
[DEBUG] host: myBucket.myBucket.s3-accelerate.amazonaws.com
This way I could still keep the same S3_BUCKET macro name while only selectively applying the override when instantiating a new configuration for upload.
e.g.
...
auto putTemplate = new Aws::S3::Model::PutObjectRequest();
putTemplate->SetStorageClass(STORAGE_CLASS);
transferConfig->putObjectTemplate = *putTemplate;

auto multiTemplate = new Aws::S3::Model::CreateMultipartUploadRequest();
multiTemplate->SetStorageClass(STORAGE_CLASS);
transferConfig->createMultipartUploadTemplate = *multiTemplate;

transferMgr = Aws::Transfer::TransferManager::Create(*transferConfig);
auto transferHandle = transferMgr->UploadFile(localFile, S3_BUCKET, s3File);
...

Python Requests Post request fails when connecting to a Kerberized Hadoop cluster with Livy

I'm trying to connect to a Kerberized Hadoop cluster via Livy to execute Spark code. The requests call I'm making is as below:
kerberos_auth = HTTPKerberosAuth(mutual_authentication=REQUIRED, force_preemptive=True)
r = requests.post(host + '/sessions', data=json.dumps(data), headers=headers, auth=kerberos_auth)
This call fails with the following error
GSSException: No valid credentials provided (Mechanism level: Failed
to find any Kerberos credentails)
Any help here would be appreciated.
When running Hadoop service daemons in secure mode, Kerberos tickets are decrypted with a keytab, and the service uses the keytab to determine the credentials of the user coming into the cluster. Without a keytab in place with the right service principal inside of it, you will get this error message. Please refer to Hadoop in Secure Mode for further details on setting up the keytab.
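On the client side, this error typically means the process calling Livy has no valid Kerberos ticket in its credential cache. A minimal sketch, assuming that is the cause; the principal, keytab path, and Livy host below are placeholders:
import json
import subprocess

import requests
from requests_kerberos import HTTPKerberosAuth, REQUIRED

# kinit with a keytab populates the ticket cache that GSSAPI/requests-kerberos uses.
subprocess.run(
    ["kinit", "-kt", "/etc/security/keytabs/myuser.keytab", "myuser@EXAMPLE.COM"],
    check=True,
)

kerberos_auth = HTTPKerberosAuth(mutual_authentication=REQUIRED, force_preemptive=True)
data = {"kind": "spark"}
r = requests.post(
    "http://livy-host:8998/sessions",   # placeholder Livy endpoint
    data=json.dumps(data),
    headers={"Content-Type": "application/json"},
    auth=kerberos_auth,
)
print(r.status_code, r.text)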