I'm wondering if there is a way to load data sets I have in my AWS S3 account into Stata directly. I found that in R there is the AWS.tools package, but I have not found anything similar for Stata.
Is there an .ado or something I can use?
You can install the AWS command line interface and shell out from Stata to use it to transfer files to and from Amazon S3. Not very convenient, but workable.
The alternative is to run R from within Stata, which seems clunkier.
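A third option, sketched below, is Python with boto3: download the file locally and then import delimited it in Stata; on Stata 16 or newer this could even run from within Stata via its Python integration. The bucket and key names below are placeholders, and credentials come from the usual AWS config files or environment variables.

# Sketch: pull a CSV down from S3 so Stata can import it locally.
import boto3

s3 = boto3.client("s3")
s3.download_file("my-bucket", "data/mydata.csv", "mydata.csv")  # placeholder bucket/key

In Stata you would then load the file with import delimited mydata.csv.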
I have a relatively simple task to do but I'm struggling to find the best mix of AWS services to accomplish it:
I have a simple Java program (provided by a third party; I can't modify it, just use it) that I can run anywhere with java -jar --target-location "path on local disc". When executed, the program creates a CSV file on the local disk at the path defined in --target-location.
Once the file is created, I need to upload it to S3.
The way I'm doing it currently is with a dedicated EC2 instance with Java installed: the first step is covered by java -jar ... and the second by the aws s3 cp ... command.
I'm looking for a better way of doing this (preferably serverless). I'm wondering whether the steps above can be accomplished with an AWS Glue job of type Python Shell. The second step (copying a local file to S3) I can most likely cover with boto3, but the first (the java -jar execution) I'm not sure about (see the sketch after these questions).
Am I forced to use an EC2 instance, or do you see a smarter way with AWS Glue?
Or would it be most effective to build a Docker image (containing these two steps), register it in ECR, and run it with AWS Batch?
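For reference, here is roughly what I have in mind in Python — a minimal sketch assuming a Java runtime is available in whatever environment runs it (which is exactly what I'm unsure about for Glue Python Shell); the same code would work on EC2 or inside a container for AWS Batch. The paths, bucket, and prefix are placeholders.

import os
import subprocess

import boto3

JAR_PATH = "/tmp/vendor-tool.jar"      # placeholder location of the 3rd-party jar
TARGET_DIR = "/tmp/output"             # the "path on local disc" passed to the jar
BUCKET = "my-bucket"                   # placeholder destination bucket
PREFIX = "exports"

os.makedirs(TARGET_DIR, exist_ok=True)

# Step 1: run the 3rd-party program; raises if java is missing or the jar fails
subprocess.run(
    ["java", "-jar", JAR_PATH, "--target-location", TARGET_DIR],
    check=True,
)

# Step 2: upload whatever CSV files the program produced
s3 = boto3.client("s3")
for name in os.listdir(TARGET_DIR):
    if name.endswith(".csv"):
        s3.upload_file(os.path.join(TARGET_DIR, name), BUCKET, f"{PREFIX}/{name}")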
I'm looking for a better way of doing this (preferably serverless).
I cannot tell whether a serverless option is better; however, an EC2 instance will do the job just fine. Assuming you have CentOS on your instance, you can do it via:
aaPanel GUI
Some useful web panels offer cron-scheduled tasks, such as backing up files from a local directory to an S3 directory. I will use aaPanel as an example.
Install aaPanel
Install AWS S3 plugin
Configure the credentials in the plugin.
Cron
Add a scheduled task to back up files from "path on local disc" to AWS S3.
Rclone
A web panel goes beyond the scope of this question. Rclone is another useful tool I use to back up files from local disk to OneDrive, S3, etc.
Installation
curl https://rclone.org/install.sh | sudo bash
Sync
Sync a directory to the remote bucket, deleting any excess files in the bucket.
rclone sync -i /home/local/directory remote:bucket
I am trying to establish a connection from AWS Glue to a remote server via SFTP using Python 3.7. I tried using the pysftp library for this task.
But pysftp depends on a library named bcrypt that contains both Python and C code, and as of this moment AWS Glue only supports pure Python libraries, as mentioned in the documentation (link below).
https://docs.aws.amazon.com/glue/latest/dg/console-custom-created.html
The error I am getting is as below.
ImportError: cannot import name '_bcrypt'
I am stuck here due to a compilation error.
Hence, I tried the JSch Java library using Scala. There the compilation succeeds, but I get the exception below.
com.jcraft.jsch.JSchException: java.net.UnknownHostException: [Remote Server Hostname]
How can we connect to a remote server via SFTP from AWS Glue? Is it possible?
How can we configure outbound rules (if required) for a Glue job?
I am answering my own question here for anyone whom this might help.
The straight answer is no.
I found the resources below, which indicate that AWS Glue is an ETL tool for AWS resources.
AWS Glue uses other AWS services to orchestrate your ETL (extract, transform, and load) jobs to build a data warehouse.
Source - https://docs.aws.amazon.com/glue/latest/dg/how-it-works.html
Glue works well only with ETL from JDBC and S3 (CSV) data sources. In case you are looking to load data from other cloud applications, File Storage Base, etc. Glue would not be able to support.
Source - https://hevodata.com/blog/aws-glue-etl/
Hence, to implement what I was working on, I used an AWS Lambda function to connect to the remote server via SFTP, pick up the required files, and drop them in an S3 bucket. The AWS Glue job can then pick the files up from S3.
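For illustration, a minimal sketch of such a Lambda in Python, assuming paramiko (or pysftp, which wraps it) is packaged as a layer built for the Lambda runtime. The hostname, credentials, paths, and bucket are placeholders.

import boto3
import paramiko  # must be provided via a Lambda layer built for the Lambda runtime

SFTP_HOST = "sftp.example.com"   # placeholder
SFTP_USER = "user"               # placeholder
SFTP_PASS = "secret"             # better fetched from Secrets Manager in practice
REMOTE_DIR = "/outbound"         # placeholder remote directory
BUCKET = "my-bucket"             # placeholder target bucket

def handler(event, context):
    transport = paramiko.Transport((SFTP_HOST, 22))
    transport.connect(username=SFTP_USER, password=SFTP_PASS)
    sftp = paramiko.SFTPClient.from_transport(transport)
    s3 = boto3.client("s3")
    try:
        for name in sftp.listdir(REMOTE_DIR):
            if name.endswith(".csv"):
                # Stream the remote file straight into S3 without writing to /tmp
                with sftp.open(f"{REMOTE_DIR}/{name}", "rb") as remote_file:
                    s3.upload_fileobj(remote_file, BUCKET, f"incoming/{name}")
    finally:
        sftp.close()
        transport.close()
    return {"status": "done"}

Note that the function needs network access to the SFTP host; if the host is only reachable privately, the function has to run inside a VPC with the appropriate routes and outbound security group rules, which also covers the outbound-rules part of the question.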
I know some time has passed since this question was posted, so I'd like to share some tools that could help you get data from an SFTP server more easily and quickly. To build a layer the easy way, use this tool: https://github.com/aws-samples/aws-lambda-layer-builder. It lets you make a pysftp layer quickly and free of those annoying errors (cffi, bcrypt).
Lambda's local /tmp storage is limited (512 MB by default), so if you are trying to extract large files the function will fail. To work around this, you can attach EFS (Elastic File System) to your Lambda: https://docs.aws.amazon.com/lambda/latest/dg/services-efs.html
I am managing a group of users on Amazon WorkSpaces; they have terabytes of data that they want to start playing around with on their WorkSpaces.
I am wondering what the best way is to handle the data upload process. Can everything just be downloaded from Google Drive or Dropbox? Or should I use something like AWS Snowball, which is specifically built for migration?
While something like AWS Snowball is probably the safest, best bet, I'm kind of hesitant to add another AWS product to the mix, which is why I might just have everything be uploaded and then downloaded from Google Drive / Dropbox. Then again, I am building an AWS environment that will be used long term, and long term using Google Drive / Dropbox won't be a solution.
Any thoughts on how to architect this (short term and long term)?
Why would you be hesitant to include more AWS products in the mix? Generally speaking, if you aren't combining multiple AWS products to build your solutions then you aren't making very good use of AWS.
For the specific task at hand I would look into AWS WorkDocs, which is integrated very well with AWS Workspaces. If that doesn't suit your needs I would suggest placing the data files on Amazon S3.
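If you go the S3 route, the transfers can be scripted with boto3 rather than done by hand. A minimal sketch (bucket, key, and file names are placeholders): upload from wherever the data currently lives, then pull it down inside the WorkSpace.

import boto3

s3 = boto3.client("s3")

# From the machine that currently holds the data
s3.upload_file("datasets/telemetry.csv", "team-data-bucket", "telemetry/telemetry.csv")

# Later, inside the WorkSpace
s3.download_file("team-data-bucket", "telemetry/telemetry.csv", "telemetry.csv")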
You can use FileZilla Pro to upload your data to an AWS S3 bucket, and then use FileZilla Pro within the WorkSpaces instance to download the files.
I have been assigned the task of automating the fetching of some files (CSV/Excel) from certain websites and then loading them into S3. I would like to know whether a script can be written so that, given a URL and an S3 bucket path, running it loads the files into S3; and if so, how would I go about it?
Open to alternative solutions as well.
Thanks!
You can write a simple bash script, or use any other language of your choice (a Python equivalent is sketched at the end of this answer). The steps for a bash script are:
Download the CSV file using curl or wget.
Upload it with the AWS CLI: aws s3 cp.
Make sure you have the AWS CLI installed on the machine where you will be running this script.
You may want to read more about AWS CLI and AWS CLI S3 commands.
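Since any language works, here is an equivalent minimal sketch in Python using requests and boto3 (the URL, bucket, and key are placeholders):

import boto3
import requests

URL = "https://example.com/report.csv"    # placeholder source URL
BUCKET = "my-bucket"                       # placeholder destination bucket
KEY = "reports/report.csv"

# Fetch the file from the website
resp = requests.get(URL, timeout=60)
resp.raise_for_status()

# Upload the bytes straight to S3 without writing to disk
s3 = boto3.client("s3")
s3.put_object(Bucket=BUCKET, Key=KEY, Body=resp.content)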
I am new to this forum and to the technology, and I am looking for your advice. I am working on a POC and my requirements are below. Could you please guide me on how to achieve the result?
Copy data from NAS to S3.
Use S3 as a source in an EMR job with S3/Redshift as the target.
Any link or PDF would also be helpful.
Thanks,
Pardeep
You're asking a lot here and there's not much information on your use case to go by, so I'm going to be very general in my answer and hopefully it at least points you in the right direction.
You can use Lambda to copy data from your NAS to S3. Assuming your NAS is on-premises and you have a VPN into your VPC or even Direct Connect configured, you can use a VPC-enabled Lambda function to read from the on-premises NAS and write to S3.
If your NAS is running on EC2, the above remains the same except there's no need for a VPN or Direct Connect.
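As a rough illustration only: a Lambda handler that walks a share and pushes its files to S3 with boto3. This assumes the share is exposed to the function as a local path (for example via an EFS mount point that is kept in sync with the NAS); the mount path and bucket are placeholders.

import os

import boto3

MOUNT_PATH = "/mnt/share"   # placeholder path where the share is visible to the function
BUCKET = "my-bucket"        # placeholder destination bucket

s3 = boto3.client("s3")

def handler(event, context):
    uploaded = 0
    for root, _dirs, files in os.walk(MOUNT_PATH):
        for name in files:
            local_path = os.path.join(root, name)
            # Reuse the directory structure as the S3 key
            key = os.path.relpath(local_path, MOUNT_PATH)
            s3.upload_file(local_path, BUCKET, key)
            uploaded += 1
    return {"uploaded": uploaded}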
Are you looking to kick off the EMR job from Lambda? You can use S3 as a source for EMR and then output to S3, triggering the job either from within Lambda or via other means.
If you can provide more info on your use case we could probably give you a better quality answer.
Copy data from NAS to S3.
This really depends on the amount of data and the frequency with which you run the copy job. If the data is in the GBs, you can install the AWS CLI on a machine where the NFS share is attached. AWS CLI commands like cp are multithreaded and can easily copy your datasets to S3 (a boto3 equivalent is sketched after the links below). You might also enable S3 Transfer Acceleration to speed things up. Having AWS Direct Connect to your company network can also speed up transfers from on-premises to AWS.
http://docs.aws.amazon.com/cli/latest/topic/s3-config.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html
https://aws.amazon.com/directconnect/
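If you prefer to script the copy rather than use the CLI, boto3 exposes the same multithreaded, multipart transfer machinery through TransferConfig. A minimal sketch (the file path and bucket are placeholders; tune the thresholds and concurrency to your link):

import boto3
from boto3.s3.transfer import TransferConfig

# Multipart, multithreaded upload settings, analogous to the CLI's
# multipart_threshold / max_concurrent_requests s3 config options.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
)

s3 = boto3.client("s3")
s3.upload_file(
    "/data/export/large_dataset.csv",   # placeholder local file on the NFS-attached machine
    "my-bucket",                        # placeholder destination bucket
    "export/large_dataset.csv",
    Config=config,
)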
If the data is in the TBs (and probably distributed across multiple volumes), then you might have to consider physical transfer services like AWS Snowball, AWS Import/Export, or AWS Snowmobile, depending on the use case.
https://aws.amazon.com/cloud-data-migration/
Use S3 as a source in an EMR job with S3/Redshift as the target.
Again, as there are a lot of applications available on EMR, there are a lot of choices. Redshift supports COPY/UNLOAD commands to and from S3, which any application can make use of. If you want to use Spark on EMR, then the Databricks spark-redshift driver is a viable option (a sketch follows the links below).
https://github.com/databricks/spark-redshift
https://databricks.com/blog/2015/10/19/introducing-redshift-data-source-for-spark.html
https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/
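For illustration, a minimal PySpark sketch of the spark-redshift route: read CSV from S3 and write it to Redshift, staging through an S3 temp directory. It assumes the spark-redshift package and a Redshift JDBC driver are on the EMR cluster's classpath; the JDBC URL, table, bucket, and IAM role below are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-redshift").getOrCreate()

# Read the source data from S3
df = spark.read.csv("s3://my-bucket/input/", header=True, inferSchema=True)

# Write to Redshift via the Databricks spark-redshift data source, which stages
# the rows in the S3 tempdir and issues a Redshift COPY behind the scenes.
(df.write
   .format("com.databricks.spark.redshift")
   .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev?user=admin&password=secret")
   .option("dbtable", "public.my_table")
   .option("tempdir", "s3://my-bucket/tmp/")
   .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyRole")
   .mode("append")
   .save())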