How to process large dataset with AWS Batch Transform

How to process large dataset with AWS Batch Transform - amazon-web-services

I have gigabytes of data for which I want to make predictions using AWS SageMaker Endpoint. I have two main issues:
data comes as Excel files and AWS Batch Transform needs it to be in JSON format to be able to process it. Reading Excel just to save it as JSON is redundant and it's a big IO slowdown
Endpoint can only be invoked over HTTP which means a few MB payload limit - chunking into such small pieces slows things down as well
How can I tackle these issues?
Pipe Mode could be a potential solution but from I read it is used for training only. Is it possible to use Pipe Mode for inference to speed things up?

I would recommend to perform some preprocessing and ETL operations either using Glue/EMR and then use mini-batches to send data to Batch transform.
A blog regarding the same can be found here
Thanks,
Raghu

Related

Mirror Marketo Data in S3 Bucket for Visualization

I am looking to get all of the Activity and Lead data in Marketo to be mirrored in an AWS S3 bucket so that I can build dashboards on it in Quicksight, so preferably I'd like to stream the data from Marketo into S3 in real-time, and then use Glue and Athena to connect the data to Quicksight. However, the only way to get large volumes of data out of Marketo appears to be their Bulk Extract tool (one for Leads, one for Activity data).
The problem is that these API interfaces make any attempt at near real-time streaming really clunky. Currently, I have Lambda functions being triggered every hour to pull the most recent hour of Lead/Activity data and saving it as a gzipped CSV in S3. But Marketo's Bulk Extract tool has a request queue and requests often take longer than 15 minutes to process (15 minutes being Lambda's max timeout length). So at least once a day my requests are getting dropped.
The solution seems to be to instead run this on an EC2 instance that can juggle multiple requests and patiently wait for Marketo's queue. But I'd rather not get into all the async and error-handling issues that that approach may entail if there is an easier way to accomplish this.
As an alternative solution, Amazon Appflow integrates with Marketo. But last I checked, it only works with Lead data, not Activity data. And there are restrictions on the filters you have to apply to the Lead data that make it clunky to work with anyway.
On Google I have found several companies that claim to offer seamless, reliable Marketo-to-S3 ETL, but I haven't yet researched their pricing or quality.
If anyone knows of a good approach to set up reliable and cost-efficient ETL between Marketo and S3 in a short period of time, I would very much appreciate it.

In a case like this, I would be tempted to recommend using an EC2 instance to run Singer with a Marketo input and CSV output, then set up something to move the CSV over to S3 as needed. That would be the absolute cheapest ETL solution, but this does suppose you have some comfort and familiarity with Python.
Also worth noting is that Stitch, Singers's paid product equivalent, supports native S3 export--you could always first test with a non-Marketo data source and see if that performs the way you'd like if you prefer money over time.

Sanity check on AWS Big Data Architecture

We're currently looking to move our AWS architecture over to something that supports large amounts of data and can scale as we gain more customers. When this project started we stuck with what we knew, a Ruby app on an EC2 making RESTful API calls, storing the results in S3, and also storing everything in an RDS. We have a SPA front end written in VueJS to support the stored data.
As our client list has grown, the outbound API calls and subsequence data we are storing is also growing. I'm currently tasked with looking for a better solution and I wanted to get a sense of feedback on what I was thinking so far. Currently we have around 5 millions rows of relational data which will only increase as our client list does. I could see in a year or two we would be in the low billions or rows.
The Ruby app does a great job of handling queuing the outbound API calls, retries, and everything else in-between. For this reason we thought about keeping the app and rather than inserting directly into the RDS, it would simply dump the results into S3 as a CSV.
A trigger in S3 could now convert the raw CSV data into parquet format using a Lambda function (I was looking at something like PyArrow). From here we could move over from the traditional RDS to something like Athena which supports parquet and would allow us to reuse most of our existing SQL queries.
To further optimize the performance for the user we thought about caching commonly used queries in a Dynamo table. Because the data is based on the scheduled external API calls, we could control when to bust the cache of the queries.
Big Data backends aren't really my thing, so any feedback is greatly appreciated. I know I have a lot more research to do into parquet as it's new to me. Eventually we'd like to do some ML on this data, so I believe parquet will also support thanks.

what is the efficient way of pulling data from s3 among boto3, athena and aws command line utils

Can someone please let me know what is the efficient way of pulling data from s3. Basically I want to pull out data between for a given time range and apply some filters over the data ( JSON ) and store it in a DB. I am new to AWS and after little research found that I can do it via boto3 api, athena queries and aws CLI. But I need some advise on which one to go with.

If you are looking for the simplest and most straight-forward solution, I would recommend the aws cli. It's perfect for running commands to download a file, list a bucket, etc. from the command line or a shell script.
If you are looking for a solution that is a little more robust and integrates with your application, then any of the various AWS SDKs will do fine. The SDKs are a little more feature rich IMO and much cleaner than running shell commands in your application.
If your application that is pulling the data is written in python, then I definitely recommend boto3. Make sure to read the difference between a boto3 client vs resource.

Some options:
Download and process: Launch a temporary EC2 instance, have a script download the files of interest (eg one day's files?), use a Python program to process the data. This gives you full control over what is happening.
Amazon S3 Select: This is a simple way to extract data from CSV files, but it only operates on a single file at a time.
Amazon Athena: Provides an SQL interface to query across multiple files using Presto. Serverless, fast. Charged based on the amount of data read from disk (so it is cheaper on compressed data).
Amazon EMR: Hadoop service that provides very efficient processing of large quantities of data. Highly configurable, but quite complex for new users.
Based on your description (10 files, 300MB, 200k records) I would recommend starting with Amazon Athena since it provides a friendly SQL interface across many data files. Start by running queries across one file (this makes it faster for testing) and once you have the desired results, run it across all the data files.

Handling Very Large volume(500TB) data using spark

I have large volume of data nearly 500TB , I have to do some ETL on that data.
This data is there in the AWS S3, so I planning to use AWS EMR setup to process this data but I am not sure what should be the config I should select .
What kind of cluster I need(master and how many slaves)?
Do I need to process chunk by chunk(10GB) or can I process all data at once?
What should be Master and slave(executor) memory both Ram and storage?
What kind of processor (speed) I need?
Based on this I want to calculate the cost of AWS EMR and start process the data

Based upon your question, you have little or no experience with Hadoop. Get some training first so that you understand how the Hadoop ecosystem works. Plan on spending three months to get to a starter level.
You have a lot of choices to make, some are fundamental to a project's success. For example, what language (Scala, Java or Python)? Which tools (Spark, Hive, Pig, etc.). What format is your data in (CSV, XML, JSON, Parquet, etc.). Do you only need batch processing or do you require near real-time analysis, etc. etc. etc.
You may find other AWS services more applicable such as Athena or Redshift depending on what format your data is in and what information you are trying to extract / process.
With 500 TB in AWS, open a ticket with support. Explain what you have, what you want and your time frame. An SA will be available to direct you on a path.

Using Apache Spark to serve real time web services queries

We have a use case, where we are downloading large volumes (order of 100 gigabytes per day) of data from hundreds of data sources, massaging and processing this data and then exposing this data to our customers via RESTful API. Today the base data size is ca. 20TB and expected to grow heavily in the future.
For the massaging/processing part, we believe spark can be a very good choice for us. Now for exposing processed/massaged data through an API, one option is to store processed data to a read only database like ElephantDB and make web services to talk to ElephantDB (at least this is how Nathan has proposed in his Big Data book). I was just wondering what would be the implication of we make web services implementation to use SparkSQL to access processed data from Spark. What could be the architecture/design dangers in this case?
Every body is talking about Spark is fast and what not and using SparkSQL for interactive queries. But is it already in a stage to serve large volume of web services queries via SparkSQL where we have very strict SLA for latency serve hundreds and thousands of web services requests per second? If Apache Spark could handle this, we could avoid maintaining yet another system like ElephantDB or Cassandra or what not.
Would like to hear from the experts on this board.

If the results are stored in files, you have no indexes, and SparkSQL also doesn't create indexes. The only thing that can be somewhat fast is reading columns from Parquet files and caching tables.
But in general it's not a good use case to use SparkSQL to serve web requests simply because Spark wasn't made for that.

So you're batch processing the raw data, yes?
The ideal way would be to store the outcome on a key-value format, as you mention with ElephandDB, and also project Voldemort has been shown to be a good fit as read-only storage.
I recommend you to read this article (combining batch and realtime layers) by Nathan Marz: How to beat the CAP theorem
It has however been questioned by Jay Kreps in his article Questioning the Lambda Architecture. The main concern (with the lambda architecture) is that there is problematic to maintain the "same" system logic in different distributed systems to produce the same result.
But since you are using Spark, you can use the same logic with Spark Streaming. Which was not "in the market" when Nathan Marz and Jay Kreps wrote their articles.
You can still use SparkSQL to query the raw data interactively, but since Spark was first implemented as scheduled batch jobs, this will not be the perfect use case. But as you've probably noticed, is that it takes some time to submit spark jobs, this is an overhead that "kills" the idea of fast queries.
Please look into github.com/spark-jobserver/spark-jobserver, the job-server supports sub-second low-latency jobs via long-running job contexts. And can share Spark RDDs between different jobs, which can be proved to be very optimized for different interactive logic on the same dataset. Combine machine learning result and ad-hoc (SparkSQL) queries via HTTP requests. Read more about spark job-server, there are some talks about it online on different Spark Summits.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js