Fault-tolerance in Apache Sqoop - hdfs

I want to run an incremental nightly job that extracts hundreds of GBs of data from an Oracle data warehouse into HDFS. After processing, the results (a few GBs) need to be exported back to Oracle.
We are running Hadoop in Amazon AWS, and our data warehouse is on premises. The link between AWS and our premises is 100 Mbps and not reliable.
If I use Sqoop-import to bring the data from Oracle and the network experiences intermittent outages, how does Sqoop handle this?
Also, what happens if I have imported (or exported) 70% of my data and the network goes down during the remaining 30%?
Since Sqoop uses JDBC by default, how does the data transfer happen at the network level? Can we compress the data in transit?
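Here is roughly how I plan to drive the nightly import (a Python sketch only; the connection string, table, columns, and paths are placeholders, and I'm assuming Sqoop 1.x with the Oracle JDBC driver available on the edge node):

    # Sketch of the nightly invocation, driven from our scheduler. The connection
    # string, table, column, and path values below are placeholders.
    import subprocess

    sqoop_cmd = [
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@dw-host:1521/DWDB",  # placeholder Oracle DSN
        "--username", "etl_user",
        "--password-file", "/user/etl/.oracle_pwd",          # keeps the password off the command line
        "--table", "SALES_FACT",                             # placeholder table
        "--incremental", "append",                           # pull only new rows each night
        "--check-column", "SALE_ID",
        "--last-value", "123456789",                         # normally taken from the previous run
        "--split-by", "SALE_ID",
        "--num-mappers", "8",                                # 8 parallel JDBC connections
        "--compress",                                        # compresses the files written to HDFS,
        "--compression-codec",                               # not the JDBC stream itself
        "org.apache.hadoop.io.compress.SnappyCodec",
        "--target-dir", "/data/raw/sales_fact/2024-01-01",
    ]
    subprocess.run(sqoop_cmd, check=True)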

Related

Using AWS Redshift for building a Multi-tenant SaaS application

We're building a multi-tenant SaaS application hosted on AWS that exposes and visualizes data in the front end via a REST API.
Now, for storage we're considering using AWS Redshift (Cluster or Serverless?) and then exposing the data using API Gateway and Lambda with the Redshift Data API.
The reason I'm inclined toward Redshift as opposed to e.g. RDS is that it also seems like a nice option for conducting internal data experiments while building our product.
My question is, would this be considered a good strategy?
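To make the intended read path concrete, the Lambda behind API Gateway would look roughly like this (just a sketch; the cluster identifier, database, user, and query are placeholders, and I'm assuming the boto3 Redshift Data API client):

    # Sketch of the Lambda handler we have in mind. Cluster, database, user, and
    # SQL are placeholders; uses the boto3 Redshift Data API client.
    import json
    import time
    import boto3

    client = boto3.client("redshift-data")

    def handler(event, context):
        # Each tenant only sees its own rows; tenant_id comes from the API Gateway authorizer.
        tenant_id = event["requestContext"]["authorizer"]["tenant_id"]

        resp = client.execute_statement(
            ClusterIdentifier="analytics-cluster",   # placeholder (WorkgroupName for Serverless)
            Database="saas",
            DbUser="api_readonly",
            Sql="SELECT metric, value FROM tenant_metrics WHERE tenant_id = :tenant_id",
            Parameters=[{"name": "tenant_id", "value": tenant_id}],
        )

        # The Data API is asynchronous: poll until the statement finishes.
        statement_id = resp["Id"]
        while True:
            desc = client.describe_statement(Id=statement_id)
            if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
                break
            time.sleep(0.25)

        if desc["Status"] != "FINISHED":
            raise RuntimeError(desc.get("Error", "query failed"))

        result = client.get_statement_result(Id=statement_id)
        return {"statusCode": 200, "body": json.dumps(result["Records"])}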
Redshift is sized for very large data and tables. For example, the minimum storage size is 1 MB; that's 1 MB for every column and across all the slices (minimum of 2). A table with 5 columns and just a few rows will take 26 MB on the smallest Redshift cluster size (default distribution style). Redshift shines when your tables have tens of millions of rows at a minimum. It isn't clear from your case that you will have data sizes that will run efficiently on Redshift.
The next concern would be your workload. Redshift is a powerful analytics engine, but it is not designed for OLTP workloads. High volumes of small writes will not perform well; it wants batch writes. High concurrency of light reads will not work as well as on a database designed for that workload.
At low levels of work Redshift can do these things; it is a database. But if you use it in a way it isn't optimized for, it likely isn't the most cost-effective option and won't scale well. If job A is the SaaS workload and analytics is job B, then choose the right database for job A. If that choice cannot do job B at the performance level you need, then add an analytics engine to the mix.
My $.02, and I'm the Redshift guy. If my assumptions about your workload are wrong, please update with specific info.

Fastest way to Import Data from AWS Redshift to BI Tool

I have a table in AWS Redshift (ra3.xlplus with 2 nodes) which has 15 million rows. I am retrieving the data on premises at the office and trying to load it into memory in a BI tool. It takes a long time (12 minutes) to import that data over a JDBC connection. I also tried an ODBC connection and got the same result. I tried spinning up an EC2 instance with a 25-gigabit connection on AWS, but got the same results.
For comparison, loading that data from a CSV file takes about 90 seconds.
Are there any solutions to speed up the data transfer?
There are ways to improve this, but the true limiter needs to be identified. The likely bottleneck is the network bandwidth between AWS and your on-prem system. As you are bringing a large amount of data down from the cloud, you will want an efficient process for this transport.
JDBC and ODBC are not network efficient, as you are seeing. The first thing that will help move the data is compression. The second is parallel transfer, since there is a fair amount of handshaking in the TCP protocol and more usable bandwidth than one connection can consume. The way I have done this in the past is to UNLOAD the data compressed to S3, then copy the files from S3 to the local machine in parallel, piping them through a decompressor and saving them. Lastly, these files are loaded into your BI tool.
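A rough sketch of that flow in Python, assuming the redshift_connector and boto3 packages (the bucket, IAM role, table, and connection details are placeholders):

    # Rough sketch: UNLOAD compressed files to S3, pull them down in parallel,
    # and decompress locally for the BI tool. Bucket, IAM role, table, and
    # connection details are placeholders.
    import gzip
    import pathlib
    from concurrent.futures import ThreadPoolExecutor

    import boto3
    import redshift_connector

    BUCKET, PREFIX = "my-export-bucket", "bi-extract/run1/"

    # 1) Have Redshift write gzip-compressed parts to S3 (one or more per slice with PARALLEL ON).
    conn = redshift_connector.connect(
        host="mycluster.abc123.us-east-1.redshift.amazonaws.com",
        database="dev", user="bi_user", password="...",
    )
    conn.autocommit = True
    cursor = conn.cursor()
    cursor.execute(f"""
        UNLOAD ('SELECT * FROM big_table')
        TO 's3://{BUCKET}/{PREFIX}'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        GZIP PARALLEL ON ALLOWOVERWRITE;
    """)
    conn.close()

    # 2) Download the parts in parallel and decompress them (default UNLOAD output
    #    is pipe-delimited text; adjust the format options to suit the BI tool).
    s3 = boto3.client("s3")
    out_dir = pathlib.Path("extract")
    out_dir.mkdir(exist_ok=True)

    def fetch(key: str) -> None:
        gz_path = out_dir / pathlib.Path(key).name
        s3.download_file(BUCKET, key, str(gz_path))
        with gzip.open(gz_path, "rb") as src, open(gz_path.with_suffix(""), "wb") as dst:
            dst.write(src.read())

    keys = [o["Key"] for o in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])]
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(fetch, keys))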
Clearly setting this up takes some time, so you want to be sure that the process will be used enough to justify the effort. Another way to go is to bring your BI tool closer to Redshift by locating it on an EC2 instance. The shorter network distance and higher bandwidth should bring down the transfer time significantly. A downside of locating your BI tool in the cloud is that it is in the cloud and not on-prem.

Stream many tables in realtime to Redshift: parallel concurrent load bottleneck

We recently released an open source project to stream data to Redshift in near realtime.
Github: https://github.com/practo/tipoca-stream
The realtime data pipeline streams data to Redshift from RDS.
Debezium writes the RDS events to Kafka.
We wrote Redshiftsink to sink data from Kafka to Redshift.
We have 1000s of tables streaming to Redshift, and we use the COPY command. We wish to load every ~10 minutes to keep the data as near realtime as possible.
Problem
Parallel loads become a bottleneck. Redshift is not good at ingesting data at such short intervals. We do understand Redshift is not a realtime database. What is the best that can be done? Does Redshift plan to solve this in the future?
Workaround that works for us!
We have 1000+ tables in Redshift, but we use no more than 400 in a day. For this reason, we now throttle loads for the unused tables when needed. This feature makes sure the tables that are in use are always near realtime and keeps Redshift less burdened. This has been very useful.
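A simplified sketch of the idea (not our actual sink code; the table names, thresholds, activity check, and the use of the boto3 Redshift Data API here are all illustrative):

    # Simplified sketch of the throttling idea: only issue a COPY for tables that
    # have been queried recently; cold tables load on a much slower cadence.
    # Table names, thresholds, and the activity source are made up for illustration.
    import time
    import boto3

    client = boto3.client("redshift-data")

    HOT_INTERVAL_S = 10 * 60        # tables in use: load every ~10 minutes
    COLD_INTERVAL_S = 6 * 60 * 60   # unused tables: load a few times a day

    last_load = {}                   # table -> epoch seconds of last COPY

    def recently_queried(table: str) -> bool:
        # Placeholder: in practice this could come from query history / usage stats.
        return table in {"orders", "payments"}

    def maybe_copy(table: str, manifest_key: str) -> None:
        interval = HOT_INTERVAL_S if recently_queried(table) else COLD_INTERVAL_S
        if time.time() - last_load.get(table, 0) < interval:
            return  # throttled: skip this cycle
        # The Data API call is asynchronous; the COPY is submitted and runs on the cluster.
        client.execute_statement(
            ClusterIdentifier="analytics-cluster",   # placeholder
            Database="warehouse",
            DbUser="loader",
            Sql=f"""
                COPY {table}
                FROM 's3://my-stream-bucket/{manifest_key}'
                IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
                MANIFEST FORMAT AS JSON 'auto';
            """,
        )
        last_load[table] = time.time()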
Looking for suggestions from the Redshift community!

Does Amazon Redshift have its own storage backend

I'm new to Redshift and need some clarification on how Redshift operates:
Does Amazon Redshift have its own backend storage platform, or does it depend on S3 to store the data as objects, with Redshift used only for querying, processing, and transforming, keeping temporary storage to pick up a specific slice from S3 and process it?
In other words, does Redshift have its own backend storage in the cloud, the way Oracle or Microsoft SQL Server have their own physical servers on which data is stored?
I ask because, if I'm migrating from a conventional RDBMS to Redshift due to increased volume, would Redshift alone do, or should I opt for a combination of Redshift and S3?
This question may seem basic, but I'm unable to find an answer on Amazon's websites or any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas normal databases start to lose performance at a million or more rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and is stored in a columnar format, making it suitable for handling "wide" tables (which are typical in data warehouses). It is this dedicated storage, and the way the data is stored, that gives Redshift its speed.
The trade-off, however, is that while Redshift is amazing for querying large quantities of data, it is not designed for frequently updating data. Thus, it should not be substituted for a normal database that is being used by an application for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc.) and then run complex queries that combine all that data.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but it can be improved by using columnar storage formats (e.g. ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.
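For example, a minimal Spectrum setup might look like this (a sketch only; the Glue database, S3 location, IAM role, and table definition are placeholders, submitted here via the boto3 Redshift Data API):

    # Sketch: expose Parquet files in S3 to Redshift through Spectrum and query
    # them alongside local tables. The Glue database, S3 path, IAM role, and
    # table layout are placeholders.
    import time
    import boto3

    client = boto3.client("redshift-data")

    def run(sql: str) -> None:
        resp = client.execute_statement(
            ClusterIdentifier="my-cluster", Database="dev", DbUser="admin", Sql=sql
        )
        # The Data API is asynchronous; wait so the statements run in order.
        while client.describe_statement(Id=resp["Id"])["Status"] not in (
            "FINISHED", "FAILED", "ABORTED"
        ):
            time.sleep(0.5)

    # External schema backed by the AWS Glue Data Catalog.
    run("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS clickstream_ext
        FROM DATA CATALOG DATABASE 'clickstream'
        IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """)

    # Partitioned, columnar (Parquet) external table; partitions are registered
    # separately (ALTER TABLE ... ADD PARTITION or a Glue crawler).
    run("""
        CREATE EXTERNAL TABLE clickstream_ext.page_views (
            user_id BIGINT,
            url     VARCHAR(2048)
        )
        PARTITIONED BY (event_date DATE)
        STORED AS PARQUET
        LOCATION 's3://my-datalake/page_views/';
    """)

    # S3-backed data can be joined with tables stored in Redshift itself.
    run("""
        SELECT u.plan, COUNT(*)
        FROM clickstream_ext.page_views v
        JOIN users u ON u.user_id = v.user_id
        WHERE v.event_date = '2024-01-01'
        GROUP BY u.plan;
    """)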
The newer Amazon Redshift RA3 nodes also have the ability to offload less-used data to Amazon S3 and use caching to run fast queries. The benefit is that they separate storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
Looking at your question, you may benefit from professional help with your architecture.
However, to get you started, Redshift:
has its own data storage, with no link to S3.
can also query data held in S3 via Amazon Redshift Spectrum (similar to AWS Athena).
is not a good alternative as a back-end database to replace a traditional RDBMS, as transactions are very slow.
is a great data warehouse tool; just use it for that!

Comparing AWS [Athena, S3, Lambda ...] vs Hortonworks [HDFS, Hive, Oozie ...]

What are the advantages/disadvantages of using a 'plain' Hortonworks Hadoop cluster with components like HDFS, Hive, and Oozie... vs some services on AWS like S3/Athena/Lambda?
My scenario's data flow:
Source data comes from IoT sensors for analytics, and sometimes I need to query by deviceid & datetime with Hive/Athena ... (all of the query conditions are covered by partitions).
The disadvantages of installing Hadoop yourself in any cloud provider are obviously cost and a bit of maintenance.
For example, when an HDFS disk gets full, you add more volumes. You need to upgrade and patch software yourself. You're charged for every machine hour, for every machine, and turning off just the namenode of the cluster will render it unusable for a period of time; if you do not have any business use case for running the cluster overnight, you're wasting money.
Therefore, the advantages of storing data in the cloud are:
While slower than HDFS, object storage in S3 is significantly cheaper and more scalable.
Triggering actions via Lambda or another scheduler can actually happen faster than Oozie launching a YARN job. Your code isn't tied to Hadoop, either, so your functions should be able to be smaller, although you may be limited in language options. If you combine Lambda or other filesystem triggers with container schedulers like Kubernetes, you open up lots of options.
Querying your data any time you want with tools like AWS Glue and Athena decouples you from maintaining a Hive metastore and a compatible query engine, whether that's Hive, Presto, Impala, Drill, etc. Anyone with AWS access can run an Athena query without needing to know the address of your HiveServer or how to connect to it appropriately (for example, you should secure it and make it highly available).
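For instance, an ad-hoc Athena query over the partitioned sensor data needs nothing more than AWS credentials (a sketch; the database, table, partition column, and results bucket are placeholders):

    # Sketch: run an Athena query over partitioned S3 data with nothing but AWS
    # credentials -- no HiveServer endpoint to know about. Database, table, and
    # result bucket names are placeholders.
    import time
    import boto3

    athena = boto3.client("athena")

    resp = athena.start_query_execution(
        QueryString="""
            SELECT *
            FROM sensor_readings
            WHERE deviceid = 'dev-42'
              AND dt = '2024-01-01'   -- partition column keeps the scan small
        """,
        QueryExecutionContext={"Database": "iot"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )

    query_id = resp["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        print(f"fetched {len(rows) - 1} rows")   # first row is the header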