Microservice architecture for ETL - web-services

I am redesigning a small monolith ETL software written in Python. I find a microservice architecture suitable as it will give us the flexibility to use different technologies if needed (Python is not the nicest language for enterprise software in my opinion). So if we had three microservices (call them Extract, Transform, Load), we could use Java for Transform microservice in the future.
The problem is, it is not feasible here to pass the result of a service call in an API response (say HTTP). The output from Extract is going to be gigabytes of data.
One idea is to call Extract and have it store the results in a database (which is really what that module is doing in the monolith, so easy to implement). In this case, the service will return only a yes/no response (was the process successful or not).
I was wondering if there were a better way to approach this. What would be a better architecture? Is what I'm proposing reasonable?

If your ETL process works on individual records (some parallelize-able units of computation), then there are a lot of options you could go with, here are a few:
Messaging System-based
You could base your processing around a messaging system, like Apache Kafka. It requires a careful setup and configuration (depending on durability, availability and scalability requirements of your specific use-cases), but may give you a better fit than a relational db.
In this case, the ETL steps would work completely independently, and just consume some topics, produce into some other topics. Those other topics are then picked up by the next step, etc. There would be no direct communication (calls) between the E/T/L steps.
It's a clean and easy to understand solution, with independent components.
Off-the-shelf processing solutions
There are a couple of OTS solutions for data processing/computation and transformation: Apache Flink, Apache Storm, Apache Spark.
Although these solutions would obviously confine you to one particular technology, they may be better than building a similar system from scratch.
Non-persistent
If the actual data is streaming/record-based, and it is not required to persist the results between steps, you could just get away with long-polling the HTTP output of the previous step.
You say it is just too much data, but that data doesn't have to go to the database (if it's not required), and could just go to the next step instead. If the data is produced continuously (not everything in one batch), on the same local network, I don't think this would be a problem.
This would be technically very easy to do, very simple to validate and monitor.

I would suggest you to have a look into the Apache flink, It is very similar to what big sized enterprise apps like informatica, talend and data stage mappings but it process in a smaller scale but repetitively. It actually helps you to compute and transform the stuff on the fly/as they arrive and then store/load into a file/db.
The current infra we have with flink process close 28.5GB per every 4 hours and it just works. In the initial days, we had to run our daily batch and the flink stream to ensure both of them are producing consistent results and eventually most of the streams were left active and the daily batches were retired gradually.
Hope it helps someone.

There's none preventing you to have an SFTP server containing CSV or database storing the results. You can do whatever make senses. Using messaging to pass gigabytes of data, or streaming through HTTP may or may not make senses for your case.

This is an interesting problem. The best solution for this could be Reactive Spring Boot. You can have your Extract service to be as a Reactive Spring Boot app and instead of sending GBs of data, stream the data to the required service.
Now you might be wondering that while streaming, it might hold on the working thread. The answer is NO. IT works at the OS level. It doesn't hold up any request thread to stream the results. That's the beauty of the Reactive Spring Boot.
Go through this and explore
https://spring.io/blog/2016/07/28/reactive-programming-with-spring-5-0-m1

Related

Approach for syncing data from different CRM/MAP tools for different clients

We are bulding a highly efficient marketing automation tool.
What we want is to periodically sync data/modified data from our customer's CRM/MAP tools for salesforce, marketo, hubspot, etc.
While we have the REST APIs and all to do that, the biggest challenge we are facing is on architectural side. We are completely on AWS.
We have used sqs-lambda approach - but maintaing parallel syncing is a challenge. Also, it is difficult to modify changes with this approach as and when the number of customers and their integrations with CRM/MAP tools increase.
We are trying to use Airflow (MWAA) with FARGATE now - But we are facing some issues there as well.
Note:
Some syncs are being processed and its time taking (Hence, Lambda is
not a feasible approach)
Please suggest some good approach to do this efficiently on a daily basis.

How to do large-scale, batch reverse geo-coding?

I have a very large list of lat/lon coordinate pairs (>50 million). I want to attach address information to each one. Most geo/revgeo services have strict call limits. Assuming computing power isn't the issue, how can I accomplish this? Also note that time/speed are not the primary concern.
One place to start might be the
You can get one of the dedicated AWS geocoders for unlimited volumme processing: https://aws.amazon.com/marketplace/search/results?x=0&y=0&searchTerms=geocoder
Intro
I have experience working with SmartyStreets's batch processing tool. They don't have call limits (paid version). But, they also don't have a Reverse Geocode API (yet!). Their batch processing is strictly for flexibility and ease-of-use in addition to normal calls. But, I am aware of a couple services that do Reverse Geocoding, and they mention batch processing on their website.
How they work
Batch processing services generally allow you to upload your data, even arbitrarily large files. You probably want to put your data in a CSV file (type of spreadsheet) as latitude and longitude pairs. Then, their servers will process the data and alert you when you can download. It's common practice to charge money for this download, but maybe TAMU's is free?
Suggestions on who to use
Texas A&M Geoservices
MapLarge
Both of these services have demos and developer portals to guide you along if there is something you want to research before using them.
(Full disclosure: I have worked for SmartyStreets.)

Using Apache Spark to serve real time web services queries

We have a use case, where we are downloading large volumes (order of 100 gigabytes per day) of data from hundreds of data sources, massaging and processing this data and then exposing this data to our customers via RESTful API. Today the base data size is ca. 20TB and expected to grow heavily in the future.
For the massaging/processing part, we believe spark can be a very good choice for us. Now for exposing processed/massaged data through an API, one option is to store processed data to a read only database like ElephantDB and make web services to talk to ElephantDB (at least this is how Nathan has proposed in his Big Data book). I was just wondering what would be the implication of we make web services implementation to use SparkSQL to access processed data from Spark. What could be the architecture/design dangers in this case?
Every body is talking about Spark is fast and what not and using SparkSQL for interactive queries. But is it already in a stage to serve large volume of web services queries via SparkSQL where we have very strict SLA for latency serve hundreds and thousands of web services requests per second? If Apache Spark could handle this, we could avoid maintaining yet another system like ElephantDB or Cassandra or what not.
Would like to hear from the experts on this board.
If the results are stored in files, you have no indexes, and SparkSQL also doesn't create indexes. The only thing that can be somewhat fast is reading columns from Parquet files and caching tables.
But in general it's not a good use case to use SparkSQL to serve web requests simply because Spark wasn't made for that.
So you're batch processing the raw data, yes?
The ideal way would be to store the outcome on a key-value format, as you mention with ElephandDB, and also project Voldemort has been shown to be a good fit as read-only storage.
I recommend you to read this article (combining batch and realtime layers) by Nathan Marz: How to beat the CAP theorem
It has however been questioned by Jay Kreps in his article Questioning the Lambda Architecture. The main concern (with the lambda architecture) is that there is problematic to maintain the "same" system logic in different distributed systems to produce the same result.
But since you are using Spark, you can use the same logic with Spark Streaming. Which was not "in the market" when Nathan Marz and Jay Kreps wrote their articles.
You can still use SparkSQL to query the raw data interactively, but since Spark was first implemented as scheduled batch jobs, this will not be the perfect use case. But as you've probably noticed, is that it takes some time to submit spark jobs, this is an overhead that "kills" the idea of fast queries.
Please look into github.com/spark-jobserver/spark-jobserver, the job-server supports sub-second low-latency jobs via long-running job contexts. And can share Spark RDDs between different jobs, which can be proved to be very optimized for different interactive logic on the same dataset. Combine machine learning result and ad-hoc (SparkSQL) queries via HTTP requests. Read more about spark job-server, there are some talks about it online on different Spark Summits.

Webmachine on different machines

I have a webmachine REST API server running on one machine. in anticipation of having more traffic that this machine cannot handle, i would need to expand to other nodes on other cpu`s. is there a way of configuring this?
if not what is the right way of distribution here, would i need to do it manually through OTP, concurrent workers and supervisors? where a worker is spawned and ships the request to the neighboring machine.
It kind of depends on your use case. Best way would be observing where you experience problems, and react accordingly.
You could look at your application as three separate parts. First one would be REST interface, second could be businesses logic (little more later), and third would be the data itself (resources, lets call it data store, but it could be even just another service).
Data
This one is simplest. I assume you are using separate service for this (like Riak cluster), where you could do your scaling separately. One thing you could look into is just making sure connection between Webmachine and your data store can scale enough for your needs.
Interface
If your server just can not handle enough requests, just put another one next to it. You can router to dispatch request to both of therm, ans since they will use same data store, they will stay in "sync".
REST being based on http assumes stateless communication. Meaning, any two requests (form same user or two different ones) don't share any resources and can be handled by different applications (you also don't have to share anything between your Webmachine instances).
Domain logic
In theory you should not have any of this in your REST API server, but still lets discuss this a little bit.
Some of your requests might require little more work than just serving content. You might be doing some computation (like serving statistics that need to be generated). You might be updating some resource, that need to change more that one data in one place (one could think of it as of transaction). It could call for more computational power, or state synchronization, which would make scaling harder.
Way around this would be separating REST from such logic. Especially introducing micro-services, which you could scale up or down independently from Webmachine itself.
In Erlang you could actually introduce separate applications inside Erlang VM. Those again could be scaled up with use of Distributed Erlang (and little more in this topic and pull of workers (like poolboy. I would recommend this approach for start, since it is easiest to implement, and due to async nature of Erlang it could always easily be ported to external micro-service.
PS. System resources
You also should check if your box can handle such traffic. One of most common mistakes is not increasing maximal number of file descriptors in your production. But again, first you should observe such problems, and then react to them. Premature optimization in most cases doesn't pay off.
PS2 What and when
You can monitor our applications and system resources with tools like Exometer or more out-of-the-box WombatOAM.
And you can (should) stress test your application with tools like tsung or basho-bench

Service in front of Hadoop

I would like to expose a web service in front of Hadoop, that is used to forward data to Hadoop ecosystem. I have two branches in Hadoop, slower, that works on whole data periodically, and fast, that does some computation on every input, and stores the data for periodical job. But the user does not see the slower branch, and has a feeling that only the fast job is done, not knowing for the slower job that runs on data aggregated during time.
How to organize my architecture best? I am new to Hadoop architecture, I read about Oozie, and have a feeling that it can help me to some point. But I don't know how to connect the service with Hadoop, how to pass the data through service, since Hadoop works primarily on files, and is distributed system.
Data should get into system in a streaming fashion. There should be "real time" branch, that works with individual values that get into system, and they would also be accumulated for periodic batch processing.
Any help would be great, thanks.
You might want to look into hue . This provides a set of web front-ends: there's one for HDFS (the filesystem) where you can upload files; there are means to track jobs too.
If you aim more regular and automated putting of files into HDFS, please elaborate your question further: where and what is the data initially (logs? db? bunch of gzipped csv-s?), what should trigger retrieval/
One can as well use API-s to deal with the filesystem and to track jobs.
As for what oozie concerns, this is more of an orchestrating tool, use it to organize related jobs into workflows.