AWS: Setting up a Kinesis stream from PostgreSQL to Redshift

In reference to my previous question, I got my boss to let me set up a DMS task from our existing Postgres database to our new Redshift database for our analytics team.
My next issue is one that three days of searching has not helped me solve. My boss wants to use Kinesis to pull real-time data from our PG db into our RS db so that our analytics team can query it in real time. I'm trying to get this configured and I'm running into nothing but headaches.
I have a Stream set up and Firehose set up to grab from an S3 bucket that I created called "postgres-stream-bucket", but I'm not sure how to get data from PG into it, or how to make sure that RS picks everything up and uses it in real time.
If there are better options I would love to hear them, but it is imperative that we have real-time (or as close as possible) translated data.

Amazon Kinesis Firehose is ideal if you have streaming data coming into your systems. It will collect the records, batch them and load them into Redshift. However, it is not an ideal solution for what you have described, where your source is a database rather than random streams of data.
Since you already have the Database Migration Service set up, you can continue to use it for continuous data replication between PostgreSQL and Redshift. This would be the simplest and most effective solution.
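As a rough illustration of what "continuous replication" means in DMS terms, here is a minimal sketch that builds the arguments for a replication task that does an initial full load and then keeps applying ongoing changes. The task identifier, the schema name, and the ARNs passed in are placeholders, not values from the question; only the request is built here, nothing is sent to AWS.

```python
import json


def build_replication_task_request(source_arn, target_arn, instance_arn, schema="public"):
    """Build keyword arguments for the DMS CreateReplicationTask API call.

    All identifiers are placeholders; substitute the ARNs of your own
    endpoints and replication instance.
    """
    # Selection rule: include every table in the given schema.
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-all-tables",
                "object-locator": {"schema-name": schema, "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }
    return {
        "ReplicationTaskIdentifier": "postgres-to-redshift-cdc",  # hypothetical name
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        # Full load first, then change data capture for ongoing changes.
        "MigrationType": "full-load-and-cdc",
        # The API expects the mappings as a JSON string, not a dict.
        "TableMappings": json.dumps(table_mappings),
    }
```

With boto3 installed and credentials configured, the result could be passed as boto3.client("dms").create_replication_task(**request). How close to real time the target stays depends on the task and instance settings, so treat this as a starting point rather than a tuned configuration.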

Related

What happens to the data when uploading it to GCP BigQuery when there is no internet?

I am using GCP BigQuery to store some data. I have created a Pub/Sub job that feeds events into Dataflow. Currently, I am facing an issue with data loss. Sometimes, when there is no internet connection, the data is not uploaded to BigQuery and the data for that period is lost. How can I overcome this?
Or what kind of database should I use to store data offline and then upload it whenever there is connectivity?
Thank You in advance!
What you need to have is either a retry mechanism or a persistent storage. There can be several ways to implement this.
You can use a message queue to store the data and process it from there. The queue can be either cloud based, like AWS SQS or Cloud Pub/Sub (GCP), or self-hosted, like Kafka or RabbitMQ.
Another, somewhat less optimized, way is to persist data locally until it has been successfully uploaded to the cloud. The local storage can be a buffer, a database, and so on. If an upload fails, you retry it from that storage. This is similar to the producer-consumer problem.
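A minimal sketch of the local-persistence idea, assuming SQLite as the local store and a caller-supplied upload function (both are illustrative choices, not something the question specifies). Records are deleted only after the upload call returns without error, so a dropped connection leaves them queued for the next attempt.

```python
import sqlite3


class LocalBuffer:
    """Durable local queue: records stay on disk until an upload succeeds."""

    def __init__(self, path=":memory:"):
        # Use a file path in practice so records survive a restart.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pending ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self.conn.commit()

    def add(self, payload):
        """Persist one record before any upload is attempted."""
        self.conn.execute("INSERT INTO pending (payload) VALUES (?)", (payload,))
        self.conn.commit()

    def flush(self, upload):
        """Upload pending records oldest first; stop at the first network error.

        `upload` is a hypothetical callable that raises OSError (for example
        ConnectionError) when the network is unavailable. Returns the number
        of records sent.
        """
        rows = self.conn.execute("SELECT id, payload FROM pending ORDER BY id").fetchall()
        sent = 0
        for row_id, payload in rows:
            try:
                upload(payload)
            except OSError:
                break  # still offline; keep this and later records for next time
            self.conn.execute("DELETE FROM pending WHERE id = ?", (row_id,))
            self.conn.commit()
            sent += 1
        return sent

    def pending_count(self):
        return self.conn.execute("SELECT COUNT(*) FROM pending").fetchone()[0]
```

Calling flush on a timer, or whenever connectivity returns, drains the backlog in order. Note that this gives at-least-once delivery: if the process dies after an upload but before the delete, that record is sent again, so the receiving side should tolerate duplicates.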
You can use a Google Compute Engine instance to store your data and always run your data loading job from there. In that case, if your own internet connection is lost, data that has already reached the instance will still continue to load into BigQuery.
As I understand it, you are publishing data to Pub/Sub and Dataflow does the rest to get the data into BigQuery. Is that right?
The options I suggest to you:
If your connection loss happens occasionally and for a short amount of time, a retry mechanism could be enough to solve this problem.
If you have frequent connection loss, or connection loss for long periods of time, I suggest that you combine a retry mechanism with some process redundancy. You could, for example, have two processes running on different machines to avoid this kind of situation. It's important to mention that in this case you could also rely on a retry mechanism alone, but it would be more complex because you would need to detect that the process failed, save the data somewhere (if it's not already saved) and trigger the process again later.
I suggest that you take a look at Apache NiFi. It's a very powerful data flow automation tool that might help you solve this kind of issue. Apache NiFi has specific processors to push data directly to Pub/Sub.
As a last suggestion, you could create an automated process that runs a data quality analysis after the data ingestion. With this process in place you could determine more easily whether your ingestion failed.
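For the short, occasional outages mentioned in the first option, the retry mechanism could be as small as the sketch below. It assumes the publish call raises OSError (which includes ConnectionError) when the network is down; the function name and default values are illustrative.

```python
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
    and re-raises the last error if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide what to do
            sleep(base_delay * (2 ** attempt))
```

The sleep parameter is injectable only so the function can be exercised without real waiting. For outages longer than the total backoff time, the data still has to be kept somewhere durable in the meantime.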

Mirror Marketo Data in S3 Bucket for Visualization

I am looking to mirror all of the Activity and Lead data in Marketo in an AWS S3 bucket so that I can build dashboards on it in QuickSight. Preferably I'd like to stream the data from Marketo into S3 in real time, and then use Glue and Athena to connect the data to QuickSight. However, the only way to get large volumes of data out of Marketo appears to be their Bulk Extract tool (one for Leads, one for Activity data).
The problem is that these API interfaces make any attempt at near real-time streaming really clunky. Currently, I have Lambda functions being triggered every hour to pull the most recent hour of Lead/Activity data and saving it as a gzipped CSV in S3. But Marketo's Bulk Extract tool has a request queue and requests often take longer than 15 minutes to process (15 minutes being Lambda's max timeout length). So at least once a day my requests are getting dropped.
The solution seems to be to instead run this on an EC2 instance that can juggle multiple requests and patiently wait for Marketo's queue. But I'd rather not get into all the async and error-handling issues that that approach may entail if there is an easier way to accomplish this.
As an alternative solution, Amazon AppFlow integrates with Marketo. But last I checked, it only works with Lead data, not Activity data. And there are restrictions on the filters you have to apply to the Lead data that make it clunky to work with anyway.
On Google I have found several companies that claim to offer seamless, reliable Marketo-to-S3 ETL, but I haven't yet researched their pricing or quality.
If anyone knows of a good approach to set up reliable and cost-efficient ETL between Marketo and S3 in a short period of time, I would very much appreciate it.
In a case like this, I would be tempted to recommend using an EC2 instance to run Singer with a Marketo input and CSV output, then set up something to move the CSV over to S3 as needed. That would be the absolute cheapest ETL solution, but this does suppose you have some comfort and familiarity with Python.
Also worth noting is that Stitch, Singer's paid product equivalent, supports native S3 export. If you would rather spend money than time, you could first test it with a non-Marketo data source and see whether it performs the way you'd like.
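Whichever host ends up running the extract, the "patiently wait for Marketo's queue" part described in the question can be a plain polling loop with a deadline longer than Lambda's 15 minutes. The sketch below is generic: get_status is a hypothetical callable that you would implement against the export API, and the status strings it checks are assumptions to be matched to what that API actually returns.

```python
import time


def wait_for_export(get_status, poll_seconds=60, timeout_seconds=3600,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll a long-running export job until it finishes or the deadline passes.

    Returns True when the job completes and False when it fails or is
    cancelled. Raises TimeoutError if it is still pending at the deadline.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status == "Completed":
            return True
        if status in ("Failed", "Cancelled"):
            return False
        if clock() >= deadline:
            raise TimeoutError(f"export still '{status}' after {timeout_seconds}s")
        sleep(poll_seconds)  # job is queued or processing; check again later
```

Because each hourly window is an independent job, a failed or timed-out window can simply be requested again later without affecting the others.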

Sanity check on AWS Big Data Architecture

We're currently looking to move our AWS architecture over to something that supports large amounts of data and can scale as we gain more customers. When this project started we stuck with what we knew: a Ruby app on an EC2 instance making RESTful API calls, storing the results in S3, and also storing everything in an RDS database. We have a SPA front end written in VueJS to present the stored data.
As our client list has grown, the number of outbound API calls and the amount of data we store as a result have grown too. I'm currently tasked with looking for a better solution and I wanted to get some feedback on what I am thinking so far. We currently have around 5 million rows of relational data, which will only increase as our client list does. I could see us being in the low billions of rows in a year or two.
The Ruby app does a great job of handling queuing the outbound API calls, retries, and everything else in-between. For this reason we thought about keeping the app and rather than inserting directly into the RDS, it would simply dump the results into S3 as a CSV.
A trigger in S3 could then convert the raw CSV data into Parquet format using a Lambda function (I was looking at something like PyArrow). From there we could move from the traditional RDS to something like Athena, which supports Parquet and would allow us to reuse most of our existing SQL queries.
To further optimize performance for the user, we thought about caching the results of commonly used queries in a DynamoDB table. Because the data is based on the scheduled external API calls, we could control when to bust the query cache.
Big Data backends aren't really my thing, so any feedback is greatly appreciated. I know I have a lot more research to do into Parquet as it's new to me. Eventually we'd like to do some ML on this data, and I believe Parquet will support that as well. Thanks.
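To make the caching idea concrete, here is a minimal sketch in which an in-memory dict stands in for the DynamoDB table and run_query stands in for whatever executes the SQL (for example an Athena call). Both are placeholders; the point is the keying and the explicit bust after each scheduled import.

```python
import hashlib


class QueryCache:
    """Cache query results by SQL text, with an explicit cache bust."""

    def __init__(self):
        self._store = {}   # stand-in for a DynamoDB table keyed by hash
        self._version = 0  # bumped whenever the underlying data changes

    def _key(self, sql):
        # Include the data version so results cached before a bust can
        # never be returned afterwards.
        raw = f"{self._version}:{sql.strip()}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get_or_run(self, sql, run_query):
        """Return the cached result for sql, running the query on a miss."""
        key = self._key(sql)
        if key not in self._store:
            self._store[key] = run_query(sql)
        return self._store[key]

    def bust(self):
        """Invalidate everything, e.g. after a scheduled import finishes."""
        self._version += 1
        self._store.clear()
```

Keying on a version number means that, with a real table, old entries would not need to be deleted synchronously; they simply stop being reachable and could be expired later.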

Handling a very large volume (500 TB) of data using Spark

I have a large volume of data, nearly 500 TB, and I have to do some ETL on it.
The data is in AWS S3, so I am planning to use an AWS EMR setup to process it, but I am not sure what configuration I should select.
What kind of cluster do I need (master and how many slaves)?
Do I need to process the data chunk by chunk (10 GB) or can I process all of it at once?
How much memory (RAM) and storage should the master and the slaves (executors) have?
What kind of processor (speed) do I need?
Based on this I want to calculate the cost of AWS EMR and start processing the data.
Based upon your question, you have little or no experience with Hadoop. Get some training first so that you understand how the Hadoop ecosystem works. Plan on spending three months to get to a starter level.
You have a lot of choices to make, and some are fundamental to a project's success. For example, which language (Scala, Java or Python)? Which tools (Spark, Hive, Pig, etc.)? What format is your data in (CSV, XML, JSON, Parquet, etc.)? Do you only need batch processing, or do you require near real-time analysis? And so on.
You may find other AWS services more applicable such as Athena or Redshift depending on what format your data is in and what information you are trying to extract / process.
With 500 TB in AWS, open a ticket with support. Explain what you have, what you want and your time frame. An SA will be available to direct you on a path.
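Before that conversation, a back-of-envelope model can at least frame the chunking and cost questions. The sketch below assumes throughput scales linearly with node count, which real jobs rarely achieve exactly; the per-node throughput and hourly price are inputs you must measure or look up, and the figures in the usage note are made-up placeholders, not AWS prices or benchmarks.

```python
import math


def estimate_job(total_tb, chunk_gb, nodes, gb_per_node_hour, usd_per_node_hour):
    """Rough estimate of chunk count, wall-clock hours, and cost.

    Assumes decimal units (1 TB = 1000 GB) and perfectly linear scaling.
    """
    total_gb = total_tb * 1000
    chunks = math.ceil(total_gb / chunk_gb)
    hours = total_gb / (nodes * gb_per_node_hour)
    cost_usd = hours * nodes * usd_per_node_hour
    return {"chunks": chunks, "hours": hours, "cost_usd": cost_usd}
```

For example, estimate_job(500, 10, nodes=50, gb_per_node_hour=200, usd_per_node_hour=0.5) gives 50,000 chunks, 50 hours and 1,250 USD under those placeholder inputs. In this simple model the cost does not depend on the number of nodes, only the wall-clock time does, so the useful measurement is the per-node throughput on a small sample of the real data.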

How can I implement Amazon EMR to read data from my API calls?

All the examples I've seen use Java programs.
I want to be able to track a user's behaviour while they navigate my website by looking at all the API calls made by that user. All the API calls are based on data stored in a SQL database.
I also want, for example, to check all the keywords passed to my search API so that I can build a list of the most-searched terms.
I thought about using Oozie, but does anyone have any other suggestions?
There are several options for analyzing the data in your database.
Normal SQL experimentation
I'd suggest starting with normal SQL statements against your database to experiment with finding what data is of interest. This might be a little slow if you have millions of records, but gives you full flexibility to play around with the data.
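As an example of the kind of experiment meant here, the most-searched terms can be found with a single GROUP BY. The sketch uses an in-memory SQLite database with a made-up search_log table and sample rows so that it runs on its own; the table and column names are assumptions, and the same query shape should carry over to your real database.

```python
import sqlite3

# Hypothetical table and sample rows, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_log (user_id INTEGER, keyword TEXT)")
conn.executemany(
    "INSERT INTO search_log (user_id, keyword) VALUES (?, ?)",
    [(1, "shoes"), (2, "shoes"), (1, "hats"), (3, "Shoes"), (2, "bags"), (3, "hats")],
)


def top_search_terms(conn, limit=10):
    """Return (term, count) pairs, most frequent first, ignoring case."""
    return conn.execute(
        "SELECT LOWER(keyword) AS term, COUNT(*) AS searches "
        "FROM search_log "
        "GROUP BY LOWER(keyword) "
        "ORDER BY searches DESC, term ASC "
        "LIMIT ?",
        (limit,),
    ).fetchall()
```

Adding a WHERE user_id = ? clause to the same query narrows it to a single user's behaviour.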
Amazon EMR
Once you have identified the types of analysis you'd like to run on a regular basis (e.g. daily or weekly), you could launch an EMR cluster to perform the analysis. Please note that this is a powerful but rather complex toolset, and the time required to fully utilize it might not be worthwhile.
You can launch a transient cluster, which means that the cluster terminates once it has finished the jobs it has been given. Thus, the cluster can be triggered via a scheduled API call and will automatically terminate.
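EMR jobs do not have to be Java programs. As a sketch, the function below builds the request for a transient cluster that runs one Spark step, which can be a Python script, and then shuts itself down. The cluster name, release label, instance types and counts, and the S3 locations are placeholder assumptions; the role names are the EMR defaults and must exist in your account. Only the request is built here.

```python
def build_transient_cluster_request(script_s3_uri, log_s3_uri):
    """Build keyword arguments for the EMR RunJobFlow API call."""
    return {
        "Name": "weekly-analysis",        # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",     # example release; pick a current one
        "LogUri": log_s3_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False makes the cluster transient: it terminates once
            # all submitted steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "run-analysis",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

With boto3 installed, the result could be passed as boto3.client("emr").run_job_flow(**request) from a scheduled job, which matches the "triggered via a scheduled API call" pattern described above.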
Amazon Athena
Amazon Athena provides an SQL interface to data stored in Amazon S3. The common use-case is to analyze log files that are in S3 without having to load them into a database. Athena is powerful and processes data in parallel to give results back very quickly.
Bottom line: Start simple. Play with the existing data to figure out what you'd like to discover. Then optimize.