How to load data from an existing S3 bucket in Dataiku? - amazon-web-services

I want to load data from my Amazon S3 bucket into Dataiku to process it. Dataiku seems to have a connector for S3 buckets, but I don't know how to add my own S3 connection.
It seems that I can also use the usual APIs to have DSS read the files: create an S3 connection in the administration settings. But I don't know where those settings are.

With a custom installation, you can add an S3 connection yourself.
Since you are using Dataiku Online, you need to log in to the Launchpad, add a feature in your space, and define the S3 connection from there.

Related

Accessing Amazon S3 via FTP?

I have done a number of searches and can't work out whether this is doable at all.
I have a data logger that has an FTP-push function. The FTP-push function has the following settings:
FTP server
Port
Upload directory
User name
Password
In general, I understand that a FileZilla client (I have the Pro edition) can drop files into my AWS S3 bucket, and I have done this successfully on my local PC.
Is it possible to remove the FileZilla client requirement and enter my S3 information directly into my data logger? Something like the diagram below:
Data logger ----FTP----> S3 bucket
If not, what would be the most sensible way to get my data logger's JSON files into AWS S3 via FTP?
Frankly, you'd be better off with:
Logging to local files
Using a schedule to copy the log files to Amazon S3 using the aws s3 sync command
The schedule could be triggered by cron (Linux) or a Scheduled Task (Windows).
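The schedule-and-sync approach above could look roughly like the following. This is only a minimal sketch: the log directory, the bucket name, and the script path in the crontab comment are hypothetical placeholders, and the script just prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: push local data-logger files to S3 on a schedule.
# LOG_DIR and BUCKET_URI are hypothetical placeholders.
LOG_DIR="${LOG_DIR:-/var/log/datalogger}"
BUCKET_URI="${BUCKET_URI:-s3://my-example-bucket/logs}"

# Print the command when DRY_RUN is 1 (the default here); otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Upload only the JSON files; sync skips files that are already in the
# bucket and unchanged.
run aws s3 sync "$LOG_DIR" "$BUCKET_URI" --exclude '*' --include '*.json'

# Example crontab entry, every 15 minutes (the script path is hypothetical):
# */15 * * * * DRY_RUN=0 /usr/local/bin/sync-logs.sh
```

Defaulting to a dry run makes it easy to check the exact command before letting cron or a Scheduled Task execute it.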
Amazon recently added FTP support to AWS Transfer Family. This provides an integration with Amazon S3 via FTP without setting up any additional infrastructure, but you should review the pricing first.
Alternatively, you could set up an intermediary server that receives the FTP uploads and syncs them to S3 with the CLI command aws s3 sync.

How to transfer a file from S3 to someone's SFTP server

I have a workflow need. I have a customer who does not want to deal with the S3 folders where we drop their files. They want us to send the files directly to their SFTP account. When I unload files from my backend, AWS services automatically unload them to S3. As this is a one-time request per customer, I don't wish to set up an automated transfer in a Lambda function or bash script, nor do I wish to go through the hassle of copying the file to my local server only to post it to the SFTP site. I would prefer to just right-click on the file and choose to transfer it to an SFTP location. Does anyone know if AWS has any plans to add file transfer protocol support (SFTP, FTP, etc.) to the S3 console UI?
Even better would be if S3 allowed all files dropped in a bucket location to be transferred automatically to a defined SFTP location, for the scenario where the customer never wishes to deal with S3 but we need to use it.
Given the current capabilities of Amazon S3, automating a send of files from Amazon S3 to an SFTP target would require the use of an AWS Lambda function.
There are a few ways to do this. Since you are looking for the easiest way, I would suggest installing s3fuse on a Linux server, which lets you mount S3 as a file system. You can mount it directly on the SFTP server and copy the files locally. Below is the URL for s3fuse.
https://cloud.netapp.com/blog/amazon-s3-as-a-file-system
The other method is to use the AWS CLI to copy the files (add --recursive to copy everything under a prefix). This involves installing the AWS CLI and generating API keys. Below is an example of the command.
aws s3 cp s3://mybucket/test.txt test2.txt
You can revoke the API keys once you are done with the transfer!
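If the goal is to avoid the intermediate local copy, one option is to stream the object from S3 straight into an SFTP upload. The sketch below is an assumption-heavy illustration, not a tested recipe: it assumes a curl build with SFTP support, the host, remote path, and user name are hypothetical, and it only prints the pipeline unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: stream one S3 object to an SFTP server without saving it locally.
# Assumes curl was built with SFTP support. All names below are hypothetical.

s3_to_sftp() {
    src="$1"    # e.g. s3://mybucket/test.txt
    dest="$2"   # e.g. sftp://sftp.example.com/upload/test.txt
    user="$3"   # SFTP user name; curl asks for the password when none is given
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'aws s3 cp %s - | curl --user %s -T - %s\n' "$src" "$user" "$dest"
    else
        # "-" makes aws write the object to stdout and curl read it from stdin.
        aws s3 cp "$src" - | curl --user "$user" -T - "$dest"
    fi
}

s3_to_sftp s3://mybucket/test.txt sftp://sftp.example.com/upload/test.txt customer
```

Because this is a one-off command, it fits the one-time-per-customer case without setting up a Lambda function.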

Simplest way to fetch the file from FTP server (on-prem) & put into S3 bucket

As per my project requirement, I want to fetch some files from an on-prem FTP server and put them into an S3 bucket. The files are 1-2 GB in size. Once a file is placed in the FTP server folder, I want it to be uploaded to the S3 bucket.
Please suggest the easiest way to achieve this.
Note: files will mostly be placed on the FTP server only once a day, so I don't want to scan the FTP server continuously. Once the files have been uploaded to S3 from the FTP server, I want to terminate any resources (like EC2) created in AWS.
These are my ideas:
You could create an agent on your FTP server that uploads the files every N seconds, minutes, or hours using the AWS CLI. This way you avoid exposing your FTP server to external access.
Another approach is a Lambda function that pulls the files, but as you said, the FTP server doesn't allow external access.
Create a VPN between your on-prem network and the cloud infrastructure, create a CloudWatch event, and have it trigger a Lambda function that pulls the files. Keep in mind that the Lambda function's timeout is configurable.
Create a VPN between your on-prem network and the cloud infrastructure, and upload the files from your FTP server using the AWS CLI (pay attention to the sync option). Take a look at this link: https://aws.amazon.com/answers/networking/accessing-vpc-endpoints-from-remote-networks/
With Jenkins, create a task that runs a process to upload the files.
You can use AWS Storage Gateway; see its site here: https://aws.amazon.com/es/storagegateway/
Here is how we solved it.
Enable S3 Transfer Acceleration on your S3 bucket. This is very much needed, since you are pushing large files.
If you have access to the server, install the AWS CLI and run a sync from the folder to the S3 bucket. Each run of aws s3 sync uploads only new or changed files, so if you change any existing files, the next run brings the bucket back in sync. This is the ideal and simplest way if you have access to the server and are able to install the AWS CLI.
https://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration-examples.html#transfer-acceleration-examples-aws-cli
aws s3api put-bucket-accelerate-configuration --bucket bucketname --accelerate-configuration Status=Enabled
To make the CLI use the accelerate endpoint for the default profile (or a specific one):
aws configure set default.s3.use_accelerate_endpoint true
If you don't have access to the FTP server on your premises, you need an external server to perform this process. In that case you need to poll the server or use a shared file system, copy the file locally, and move it to the S3 bucket. This process has a lot of failure points.
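For the case where the AWS CLI can be installed on the FTP server, the steps above can be sketched as a small script. The drop folder and the crontab schedule are hypothetical, and bucketname is the same placeholder used in the commands above; the script only prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: once-a-day upload from the FTP server's drop folder to S3.
# BUCKET and DROP_DIR are hypothetical placeholders.
BUCKET="${BUCKET:-bucketname}"
DROP_DIR="${DROP_DIR:-/srv/ftp/incoming}"

# Print the command when DRY_RUN is 1 (the default here); otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# One-time setup: enable Transfer Acceleration and make the CLI use it.
run aws s3api put-bucket-accelerate-configuration --bucket "$BUCKET" \
    --accelerate-configuration Status=Enabled
run aws configure set default.s3.use_accelerate_endpoint true

# Daily job: upload new or changed files only.
run aws s3 sync "$DROP_DIR" "s3://$BUCKET/"

# Example crontab entry, once a day at 02:00 (the script path is hypothetical):
# 0 2 * * * DRY_RUN=0 /usr/local/bin/ftp-to-s3.sh
```

Since the upload is pushed from the FTP server itself, no EC2 instance or other AWS compute resource has to be created or terminated afterwards.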
Hope it helps.

Using Amazon EBS like S3

Is it possible to use EBS like S3? By that I mean can you allow users to download files from a link like you can in S3?
The reason is that my videos NEED to be on the same domain/server to work correctly. I am creating a virtual reality video website; however, iOS does not support cross-origin resource sharing through WebGL (which is used to create VR).
Because of this, my S3 bucket will not work, as it will be classed as cross-origin. But from looking into EBS briefly, it seems that it attaches to your instances as local storage, which would get past the cross-origin problem I am facing.
Would it be simply like a folder on my web server, that could be reached by 'www.domain.com/ebs-file-system/videos/video.mp4'?
Thanks in advance for any comments.
Amazon S3 CORS
You can configure your Amazon S3 bucket to support Cross-Origin Resource Sharing (CORS):
Cross-origin resource sharing (CORS) defines a way for client web applications that are loaded in one domain to interact with resources in a different domain. With CORS support in Amazon S3, you can build rich client-side web applications with Amazon S3 and selectively allow cross-origin access to your Amazon S3 resources.
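As a rough illustration of what such a CORS setup might look like with the AWS CLI, the sketch below writes a rule file and applies it with aws s3api put-bucket-cors. The origin, the bucket name my-video-bucket, and the file name cors.json are hypothetical; the put-bucket-cors command is only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: allow one site to read objects from a bucket via CORS.
# The origin, bucket name, and file name are hypothetical placeholders.
CORS_FILE="${CORS_FILE:-cors.json}"

# Write the CORS rules: read-only access from a single origin.
cat > "$CORS_FILE" <<'EOF'
{
  "CORSRules": [
    {
      "AllowedOrigins": ["https://www.domain.com"],
      "AllowedMethods": ["GET", "HEAD"],
      "AllowedHeaders": ["*"],
      "MaxAgeSeconds": 3000
    }
  ]
}
EOF

# Print the command when DRY_RUN is 1 (the default here); otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

run aws s3api put-bucket-cors --bucket my-video-bucket \
    --cors-configuration "file://$CORS_FILE"
```

Restricting AllowedOrigins to the site's own origin, rather than using a wildcard, keeps the cross-origin access selective.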
CloudFront Behaviours
Another option is to use Amazon CloudFront, which can present multiple systems as a single URL. For example, example.com/video could point to an S3 bucket, while example.com/stream could point to a web server. This should circumvent CORS problems.
See:
Format of URLs for CloudFront Objects
Values that You Specify When You Create or Update a Web Distribution
Worst Case
Worst case, you could serve everything via your EC2 instance. You could copy your S3 content to the instance (eg using the AWS Command-Line Interface (CLI) aws s3 sync command) and serve it to your users. However, this negates all the benefits that Amazon S3 provides.

Dynamic usage of AWS S3

I am trying to explore AWS S3. I found out that we can store data and have a URL for a file that can be used on a website, but my intention is to store files on S3 and have users of my website post and retrieve files to/from S3 without my intervention. I plan to host my server and JSP/Servlet pages on EC2, running Tomcat (and a MySQL server).
Is this possible, and if yes, how can I achieve this?
Thanks,
SD
Yes, it's possible. A full answer to this question is tantamount to a consulting gig, but some resources that should get you started:
The S3 API
Elastic Beanstalk for your webtier
Amazon RDS for MySQL