what is the efficient way of pulling data from s3 among boto3, athena and aws command line utils - amazon-web-services

Can someone please let me know what is the efficient way of pulling data from s3. Basically I want to pull out data between for a given time range and apply some filters over the data ( JSON ) and store it in a DB. I am new to AWS and after little research found that I can do it via boto3 api, athena queries and aws CLI. But I need some advise on which one to go with.

If you are looking for the simplest and most straight-forward solution, I would recommend the aws cli. It's perfect for running commands to download a file, list a bucket, etc. from the command line or a shell script.
If you are looking for a solution that is a little more robust and integrates with your application, then any of the various AWS SDKs will do fine. The SDKs are a little more feature rich IMO and much cleaner than running shell commands in your application.
If your application that is pulling the data is written in python, then I definitely recommend boto3. Make sure to read the difference between a boto3 client vs resource.

Some options:
Download and process: Launch a temporary EC2 instance, have a script download the files of interest (eg one day's files?), use a Python program to process the data. This gives you full control over what is happening.
Amazon S3 Select: This is a simple way to extract data from CSV files, but it only operates on a single file at a time.
Amazon Athena: Provides an SQL interface to query across multiple files using Presto. Serverless, fast. Charged based on the amount of data read from disk (so it is cheaper on compressed data).
Amazon EMR: Hadoop service that provides very efficient processing of large quantities of data. Highly configurable, but quite complex for new users.
Based on your description (10 files, 300MB, 200k records) I would recommend starting with Amazon Athena since it provides a friendly SQL interface across many data files. Start by running queries across one file (this makes it faster for testing) and once you have the desired results, run it across all the data files.

Related

Automating CSV analysis?

My e-commerce company generates lots of CSV data. To track order status, the team must download a number of trackers. Creating a relationship and subsequently analyse,its a time-consuming process. Which AWS low-code solution can be used to automate the workflow?
Depending on what 'workflow' you require, a few options are:
Amazon Honeycode, which is a low-code application builder
You can Filter and retrieving data using Amazon S3 Select, which works on individual CSV files. This can be scripted via the AWS CLI or an AWS SDK
If you want to run SQL and create JOINs between multiple files, then Amazon Athena is fantastic. This, too, can be scripted.
While not exactly low code, AWS Athena uses SQL-like queries to analyze CSV files, among many other formats

What is the best way to transfer BigQuery data to SFTP?

The data is a few hundred Mb up to a few Gb. It could be running some BQ procedures and in the end a select. The values of this need to be transferred as a valid CSV to an SFTP.
Cloud functions could be problematic because of the 9 minute timeout limit and the 2Gb RAM limit.
Is there a serverless solution or do I have to run manual instances?
There are two scenarios I would consider:
Export table with standard BQ options (here) into GCP bucket storage. Then you can pick it up and upload it to SFTP by Cloud Run. There are containers, which are built for this, e.g. this one
Run a pipelining project. Considering you want to use simple export, I would suggest Dataflow. You can write a small Python or Java code to pick up a file and upload it to SFTP. If you would like more complex logic in processing - have a look at Dataproc

Mirror Marketo Data in S3 Bucket for Visualization

I am looking to get all of the Activity and Lead data in Marketo to be mirrored in an AWS S3 bucket so that I can build dashboards on it in Quicksight, so preferably I'd like to stream the data from Marketo into S3 in real-time, and then use Glue and Athena to connect the data to Quicksight. However, the only way to get large volumes of data out of Marketo appears to be their Bulk Extract tool (one for Leads, one for Activity data).
The problem is that these API interfaces make any attempt at near real-time streaming really clunky. Currently, I have Lambda functions being triggered every hour to pull the most recent hour of Lead/Activity data and saving it as a gzipped CSV in S3. But Marketo's Bulk Extract tool has a request queue and requests often take longer than 15 minutes to process (15 minutes being Lambda's max timeout length). So at least once a day my requests are getting dropped.
The solution seems to be to instead run this on an EC2 instance that can juggle multiple requests and patiently wait for Marketo's queue. But I'd rather not get into all the async and error-handling issues that that approach may entail if there is an easier way to accomplish this.
As an alternative solution, Amazon Appflow integrates with Marketo. But last I checked, it only works with Lead data, not Activity data. And there are restrictions on the filters you have to apply to the Lead data that make it clunky to work with anyway.
On Google I have found several companies that claim to offer seamless, reliable Marketo-to-S3 ETL, but I haven't yet researched their pricing or quality.
If anyone knows of a good approach to set up reliable and cost-efficient ETL between Marketo and S3 in a short period of time, I would very much appreciate it.
In a case like this, I would be tempted to recommend using an EC2 instance to run Singer with a Marketo input and CSV output, then set up something to move the CSV over to S3 as needed. That would be the absolute cheapest ETL solution, but this does suppose you have some comfort and familiarity with Python.
Also worth noting is that Stitch, Singers's paid product equivalent, supports native S3 export--you could always first test with a non-Marketo data source and see if that performs the way you'd like if you prefer money over time.

Handling Very Large volume(500TB) data using spark

I have large volume of data nearly 500TB , I have to do some ETL on that data.
This data is there in the AWS S3, so I planning to use AWS EMR setup to process this data but I am not sure what should be the config I should select .
What kind of cluster I need(master and how many slaves)?
Do I need to process chunk by chunk(10GB) or can I process all data at once?
What should be Master and slave(executor) memory both Ram and storage?
What kind of processor (speed) I need?
Based on this I want to calculate the cost of AWS EMR and start process the data
Based upon your question, you have little or no experience with Hadoop. Get some training first so that you understand how the Hadoop ecosystem works. Plan on spending three months to get to a starter level.
You have a lot of choices to make, some are fundamental to a project's success. For example, what language (Scala, Java or Python)? Which tools (Spark, Hive, Pig, etc.). What format is your data in (CSV, XML, JSON, Parquet, etc.). Do you only need batch processing or do you require near real-time analysis, etc. etc. etc.
You may find other AWS services more applicable such as Athena or Redshift depending on what format your data is in and what information you are trying to extract / process.
With 500 TB in AWS, open a ticket with support. Explain what you have, what you want and your time frame. An SA will be available to direct you on a path.

How can I implement Amazon EMR to read data from my API calls?

All the examples i've seen are with Java programs?
I want to be able to track the a user's behaviour while navigating my website by looking at all the API calls made by that user. All the API calls are based on data stored in a SQL database.
I also for example want to check all the keywords passed to my search API to have a list of most search terms.
I thought about using Oozie but does anyone have any other suggestions ?
There are several option for analyzing the data in your database.
Normal SQL experimentation
I'd suggest starting with normal SQL statements against your database to experiment with finding what data is of interest. This might be a little slow if you have millions of records, but gives you full flexibility to play around with the data.
Amazon EMR
Once you have identified the types of analysis you'd like to run on a regular basis (eg daily or weekly), you could launch an EMR cluster to perform analysis. Please note that this is a powerful but rather complex toolset and the time required to fully utilize it might not be worthwhile.
You can launch a transient cluster, which means that the cluster terminates once it has finished the jobs it has been given. Thus, the cluster can be triggered via a scheduled API call and will automatically terminate.
Amazon Athena
Amazon Athena provides an SQL interface to data stored in Amazon S3. The common use-case is to analyze log files that are in S3 without having to load them into a database. Athena is powerful and processes data in parallel to give results back very quickly.
Bottom line: Start simple. Play with the existing data to figure out what you'd like to discover. Then optimize.