I am trying to establish a connection from AWS Glue to a remote server via SFTP using Python 3.7. I tried using the pysftp library for this task.
But pysftp uses a library named bcrypt that has python and c code. As of this moment, AWS Glue only supports pure python libraries as mentioned in the documentation (below link).
https://docs.aws.amazon.com/glue/latest/dg/console-custom-created.html
The error I am getting is as below.
ImportError: cannot import name '_bcrypt'
I am stuck here due to a compilation error.
Hence, I tried the JSch java library using Scala. There the compilation is successful, but I get the below exception.
com.jcraft.jsch.JSchException: java.net.UnknownHostException: [Remote Server Hostname]
How can we connect to a remote server via SFTP from AWS Glue? Is it possible?
How can we configure outbound rules (if required) for a Glue job?
I am answering my own question here for anyone whom this might help.
The straight answer is no.
I found the below resources which indicate that AWS Glue is an ETL tool for AWS resources.
AWS Glue uses other AWS services to orchestrate your ETL (extract,
transform, and load) jobs to build a data warehouse.
Source - https://docs.aws.amazon.com/glue/latest/dg/how-it-works.html
Glue works well only with ETL from JDBC and S3 (CSV) data sources. In
case you are looking to load data from other cloud applications, File
Storage Base, etc. Glue would not be able to support.
Source - https://hevodata.com/blog/aws-glue-etl/
Hence to implement what I was working on, I used an AWS Lambda function to connect to the remote server via SFTP, pick the required files and drop them in an S3 bucket. The AWS Glue job can now pick the files from S3.
i know that there is some time since this question was post, so i like to share some tools that could help you to get data from a sftp more easily and quickly. so for get a layer in a easy way use this tool https://github.com/aws-samples/aws-lambda-layer-builder, you can make a layer of pysftp faster and free of those annoying errors (cffi, bycrypt).
The lambda has a limit of 500 MB,so if you are trying to extract heavy files, the lambda will crash for this reason. to fix this you have to attach EFS (Elastic File System) to your lamdba https://docs.aws.amazon.com/lambda/latest/dg/services-efs.html
Related
Problem:
We need to perform a task under which we have to transfer all files ( CSV format) stored in AWS S3 bucket to a on-premise LAN folder using the Lambda functions. This will be a scheduled tasks which will be carried out after every 1 hour, and the file will again be transferred from S3 to on-premise LAN folder while replacing the existing ones. Size of these files is not large (preferably under few MBs).
I am not able to find out any AWS managed service to accomplish this task.
I am a newbie to AWS, any solution to this problem is most welcome.
Thanks,
Actually, I am looking for a solution by which I can push S3 files to on-premise folder automatically
For that you need to make the on-premise network visible to the logic (lambda, whatever..) "pushing" the content. The default solution is using the AWS site-to-site VPN.
There are multiple options for setting up the VPN, you could choose based on the needs.
Then the on-premise network will look just like another subnet.
However - VPN has its complexity and cost. In most of the cases it is much easier to "pull" data from the on-premise environment.
To sync data there are multiple options. For a managed service, I could point out the S3 Gateway which based on your description sounds like an insane overkill.
Maybe you could start with a simple cron job (or a task timer if working with windows) and run a CLI command to sync the S3 content or just copy specified files.
Check out S3 Sync, I think it will help you accomplish this task: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html#examples
To run any AWS CLI in your computer, you will need to setup credentials, and the setup account/roles should have permissions to do the task (e.g. access S3)
Check out AWS CLI setup here: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
I would like to avoid AWS dev endpoint. Is there a way where I can test and debug my PySpark code without using AWS dev endpoint with the help of testing my code in local notebook/IDE?
As others have said, it depends on which part of the Glue are you going to use. If your code is based on pure Spark, without the Dynamic Frames etc. Then local version of Spark may suffice, if however you are intending on using Glue extensions, there is not really an option of not using the Dev End point at this stage.
I hope that this helps.
If you are going to deploy your pyspark code on AWS Glue service, you may have to use GlueContext & other AWS Glue APIs. So if you would like to test against AWS Glue service, using these AWS Glue APIs then you have to have an AWS Dev Endpoint.
However having a AWS Glue notebook is optional, since you can setup zeppelin, etc. establish an ssh tunnel connection with AWS Glue DEP for dev / testing from local env. Make sure you delete the DEPoint once your development/testing is done for the day.
Alternately, if you are not keen on using AWS Glue APIs other than GlueContext, then yes, you can setup zeppelin in local environment, test the code locally and then upload your code to S3, create a Glue job for testing in AWS Glue Service
We have a setup here, where we have pyspark install locally and we use VSCode to develop our pyspark codes, unit test, and debug. We run the codes against the local pyspark installation during development, then we deploy those codes to EMR to run with real dataset.
I'm not sure how much of this apply to what you're trying to do with Glue, as it's a level higher in abstraction.
We use pytest to test pyspark code. We keep pyspark code in another file and calls those functions inglue code file. With this separation, we can unit test pyspark code using pytest
I was able to test without dev endpoints
Please follow the instructions here
https://support.wharton.upenn.edu/help/glue-debugging
I have a glue script (test.py) written say in a editor. I connected to glue dev endpoint and copied the script to endpoint or I can store in S3 bucket. Basically glue endpoint is an EMR cluster, now how can I run the script from the dev endpoint terminal? Can I use spark-submit and run it ?
I know we can run it from glue console,but more interested to know if I can run it from glue end point terminal.
You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python).
e.g.
radix#localhost:~$ DEV_ENDPOINT=glue#ec2-w-x-y-z.compute-1.amazonaws.com
radix#localhost:~$ scp myscript.py $DEV_ENDPOINT:/home/glue/myscript.py
radix#localhost:~$ ssh -i {private-key} $DEV_ENDPOINT
...
[glue#ip-w-x-y-z ~]$ gluepython myscript.py
You can also run the script directly without getting an interactive shell with ssh (of course, after uploading the script with scp or whatever):
radix#localhost:~$ ssh -i {private-key} $DEV_ENDPOINT gluepython myscript.py
If this is a script that uses the Job class (as the auto-generated Python scripts do), you may need to pass --JOB_NAME and --TempDir parameters.
For development / testing purpose, you can setup a zeppelin notebook locally, have an SSH connection established using the AWS Glue endpoint URL, so you can have access to the data catalog/crawlers,etc. and also the s3 bucket where your data resides.
After all the testing is completed, you can bundle your code, upload to an S3 bucket. Then create a Job pointing to the ETL script in S3 bucket, so that the job can be run, and scheduled as well.
Please refer here and setting up zeppelin on windows, for any help on setting up local environment. You can use dev instance provided by Glue, but you may incur additional costs for the same(EC2 instance charges).
Once you set up the zeppelin notebook, you can copy the script(test.py) to the zeppelin notebook, and run from the zeppelin.
According to AWS Glue FAQ:
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a
scale-out execution environment for your data transformation jobs. AWS
Glue infers, evolves, and monitors your ETL jobs to greatly simplify
the process of creating and maintaining jobs. Amazon EMR provides you
with direct access to your Hadoop environment, affording you
lower-level access and greater flexibility in using tools beyond
Spark.
Do you have any specific requirement to run Glue script in an EMR instance? Since in my opinion, EMR gives more flexibility and you can use any 3rd party python libraries and run directly in a EMR Spark cluster.
Regards
I am new in this forum and technology and looking for your advice. I am working on POC and below are my requirement. Could you please guide me the way to achieve the result.
Copy data from NAS to S3.
Use S3 as a source in EMR Job with target to S3/Redshift.
Any link, pdf will also helpful.
Thanks,
Pardeep
There's a lot here that you're asking and there's not a lot of info on your use case to go by so I'm going to be very general in my answer and hopefully it at least points you in the right direction.
You can use Lambda to copy data from your NAS to S3. Assuming your NAS is on-premise and assuming you have a VPN into your VPC or even Direct Connect configured, then you can use a VPC enabled Lambda function to read from the NAS on-premise and write to S3.
If your NAS is running on EC2 the above will remain the same except there's no need for VPN or Direct Connect.
Are you looking to kick off the EMR job from Lambda? You can use S3 as a source for EMR to then output to S3 either from within Lambda or via other means as well.
If you can provide more info on your use case we could probably give you a better quality answer.
Copy data from NAS to S3.
Really depends on the amount of data and the frequency on which you run the copy job. If the data in GBs, then you can install AWS CLI on a machine where NFS is attached. AWS CLI command like CP can be multithreaded and can easily copy your datasets to S3. You might also enable S3 transfer acceleration to speed things up. Having AWS Direct connect to your company network can also speed up any transfers from on-premis to AWS.
http://docs.aws.amazon.com/cli/latest/topic/s3-config.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html
https://aws.amazon.com/directconnect/
If the data is in TBs (which is probably distributed across multiple volumes), then you might have to consider using physical transfer utilities like AWS Snowball,AWSImportExport or AWS Snowmobile based on the use-case.
https://aws.amazon.com/cloud-data-migration/
Use S3 as a source in EMR Job with target to S3/Redshift.
Again, as there are lot of applications on EMR, there are lot of choices. Redshift supports COPY/UNLOAD commands to S3 which any application can make use of. If you want to use SPARK on EMR , then installing databricks spark-redshift driver is a viable option for you.
https://github.com/databricks/spark-redshift
https://databricks.com/blog/2015/10/19/introducing-redshift-data-source-for-spark.html
https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/
I am trying to understand how to deploy an Amazon Kinesis Client application that was built using the Kinesis client library (KCL).
I found this but it only states
You can follow your own best practices for deploying code to an Amazon EC2 instance when you deploy a Amazon Kinesis application. For example, you can add your Amazon Kinesis application to one of your Amazon EC2 AMIs.
which is not giving a broader picture to me.
These examples use an Ant script to run Java program. Is this the best practice to follow?
Also, I understand even before running the EC2 instances I need to make sure
The developed code JAR/WAR or any other format needs to be on the EC2 instance
The EC2 instance needs to have all the required environment like Ant setup in place already to execute the program.
Could someone please add some more detail on this?
Amazon Kinesis will be responsible for ingesting data, not running your application. You can run your application anywhere, but it is a good idea to run it in EC2, as you are probably going to use other AWS Services, such as S3 or DynamoDB (Kinesis Client Library uses DynamoDB for sharding, for example).
To understand Kinesis better, I'd recommend that you launch the Kinesis Data Visualization Sample. When you launch this app, use the provided CloudFormation template. It will create a stack with the Kinesis stream and an EC2 instance with the application, that uses Kinesis Client Library and is a fully working example to start from.
The best way I have found to host a consumer program is using EMR, but not as a hadoop cluster. Package your program as a jar, and place it in s3. Launch an emr cluster and have it run your jar. Using the data pipeline you can schedule this job flow to run at regular intervals. You can also scale an emr cluster, or use a actual EMR job to process the stream if you choose to get the high tech.
You can also use Beanstalk. I believe this article is highly useful.