Is Impala a columnar clustered database? - hdfs

I am new to big data and the related tools/technologies. I was going through the documentation of Impala.
Is it true to say that Impala is a clustered columnar database?
And does Impala need heavy memory to compute/transform the data?

Impala is not a database.
Impala is an MPP (Massively Parallel Processing) SQL query engine. It is a SQL interface on top of data stored in HDFS. You typically build tables over Parquet files, which are columnar files that allow fast reads of the data.
According to the Impala documentation:
Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS, HBase, or the Amazon Simple Storage Service (S3). In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Impala query UI in Hue) as Apache Hive. This provides a familiar and unified platform for real-time or batch-oriented queries.
Impala uses the Hive Metastore to store the file structure and schema of each table. Impala lets you run SQL queries against your files, and it is responsible for parallelizing the work across your cluster.
About memory usage, you are partially right. Impala is memory-bound during execution, while Hive is disk-based, using classic MapReduce or Tez execution. Newer versions of Impala support spilling to disk, which helps with data that doesn't fit in memory.
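To illustrate the "SQL interface over files" idea, here is a minimal sketch of defining and querying an external Parquet-backed table through Impala. It assumes the impyla Python client and placeholder host, port, HDFS path, and column names:

    from impala.dbapi import connect  # impyla client; assumes an impalad reachable on port 21050

    conn = connect(host='impalad.example.com', port=21050)
    cur = conn.cursor()

    # Define a table over Parquet files that already sit in HDFS;
    # Impala only records the schema and location in the Hive Metastore.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales (
            id BIGINT,
            amount DOUBLE,
            sale_date STRING
        )
        STORED AS PARQUET
        LOCATION '/data/warehouse/sales'
    """)

    # The query is parallelized across the Impala daemons in the cluster.
    cur.execute("SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date")
    for row in cur.fetchall():
        print(row)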

Impala integrates with the Apache Hive metastore database to share databases and tables between both components. The high level of integration with Hive, and compatibility with the HiveQL syntax, lets you use either Impala or Hive to create tables, issue queries, load data, and so on.
Impala is not a database.
Impala is not based on MapReduce algorithms. It implements a distributed architecture based on daemon processes that run on the same machines as the data and are responsible for all aspects of query execution.

Related

Ingest RDBMS data to BigQuery

If we have on-prem sources like SQL Server and Oracle, and data from them has to be ingested periodically in batch mode into BigQuery, what should the architecture be? Which GCP-native services can be used for this? Can Dataflow or Dataproc be used?
PS: Our organization hasn't licensed any third-party ETL tool so far. The preference is for Google-native services. Data Fusion is very expensive.
There are two approaches you can take with Apache Beam.
Periodically run a Beam/Dataflow batch job on your database. You could use Beam's JdbcIO connector to read data. After that you can transform your data using Beam transforms (PTransforms) and write to the destination using a Beam sink. In this approach, you are responsible for handling duplicate data (for example, by providing different SQL queries across executions).
Use a Beam/Dataflow pipeline that can read change streams from a database. The simplest approach here might be using one of the available Dataflow templates. For example, see here. You can also develop your own pipeline using Beam's DebeziumIO connector.
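A minimal sketch of the first approach, assuming the Beam Python SDK with the cross-language JdbcIO connector (which needs a Java expansion service available to the runner); the JDBC URL, credentials, column names, dataset, and schema are all placeholders:

    import apache_beam as beam
    from apache_beam.io.jdbc import ReadFromJdbc
    from apache_beam.options.pipeline_options import PipelineOptions

    # Dataflow also needs project, region, temp_location, etc.
    options = PipelineOptions(runner='DataflowRunner')

    with beam.Pipeline(options=options) as p:
        (p
         # Read rows from the on-prem database over JDBC (placeholder connection details).
         | 'ReadOrders' >> ReadFromJdbc(
               table_name='orders',
               driver_class_name='oracle.jdbc.OracleDriver',
               jdbc_url='jdbc:oracle:thin:@//db-host:1521/ORCL',
               username='etl_user',
               password='secret')
         # Map each row to a dict matching the BigQuery schema;
         # field names depend on the source columns and are placeholders here.
         | 'ToDict' >> beam.Map(lambda r: {'id': r.id, 'amount': r.amount})
         # Append into BigQuery; de-duplication across runs must be handled separately.
         | 'WriteToBQ' >> beam.io.WriteToBigQuery(
               'my_project:staging.orders',
               schema='id:INTEGER,amount:FLOAT',
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))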

Amazon Redshift Framework (Oracle Data Warehouse Migration)

We are currently planning to migrate a 50 TB Oracle data warehouse to Amazon Redshift.
Data from different OLTP data sources is currently staged first in an Oracle staging database and then loaded into the data warehouse. The data is transformed using tons of PL/SQL stored procedures, both within the staging database and while loading into the data warehouse.
OLTP Data Source 1 --> JMS (MQ) Real-time --> Oracle STG Database --> Oracle DW
Note: the JMS MQ consumer writes data into the staging database.
OLTP Data Source 2 --> CDC Incremental Data (once in 10 mins) --> Oracle STG Database --> Oracle DW
Note: data captured by Change Data Capture on the source side gets loaded into the staging database every 10 minutes.
What would be the best framework to migrate this stack (shown above) entirely to Amazon Redshift? What are the different components within AWS that we can migrate to?
Wow, sounds like a big piece of work. There are quite a few things going on here that all need to be considered.
Your best starting point is probably AWS Database Migration Service (https://aws.amazon.com/dms/). This can do a lot of work for you in regards to converting your schemas and highlighting areas that you will have to migrate manually.
You should consider S3 to be your primary staging area. You need to land all (or almost all) the data in S3 before loading to Redshift. Give very careful consideration to how the data is laid out. In particular, I recommend that you use partitioning prefixes (s3://my_bucket/YYYYMMDDHHMI/files or s3://my_bucket/year=YYYY/month=MM/day=DD/hour=HH/minute=MI/files).
Your PL/SQL logic will not be portable to Redshift. You'll need to convert the non-SQL parts to either bash or Python and use an external tool to run the SQL parts in Redshift. I'd suggest that you start with Apache Airflow (Python) or Azkaban (bash). If you want to stay pure AWS then you can try Data Pipeline (not recommended) or wait for AWS Glue to be released (looks promising - untested).
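As a rough sketch of that orchestration pattern, here is what one converted step might look like in Apache Airflow, assuming the Postgres provider (Redshift speaks the PostgreSQL wire protocol), a pre-configured `redshift_default` connection, and placeholder bucket, table, and IAM role names:

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.postgres.operators.postgres import PostgresOperator

    with DAG(dag_id='stg_to_redshift',
             start_date=datetime(2023, 1, 1),
             schedule_interval='@hourly',
             catchup=False) as dag:

        # Load one partitioned S3 prefix (landed by the upstream extract job)
        # into a Redshift staging table via COPY.
        copy_stage = PostgresOperator(
            task_id='copy_stage',
            postgres_conn_id='redshift_default',
            sql="""
                COPY stg.orders
                FROM 's3://my_bucket/year={{ ds_nodash[:4] }}/month={{ ds_nodash[4:6] }}/day={{ ds_nodash[6:8] }}/'
                IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
                FORMAT AS PARQUET;
            """)

        # The PL/SQL transformation logic becomes plain SQL run against Redshift.
        transform = PostgresOperator(
            task_id='transform',
            postgres_conn_id='redshift_default',
            sql="INSERT INTO dw.orders SELECT * FROM stg.orders;")

        copy_stage >> transform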
You may be able to use Amazon Kinesis Firehose for the work that's currently done by JMS but the ideal use of Kinesis is quite different from the typical use of JMS (AFAICT).
Good luck

When should we go for Apache Spark

Would it be wise to replace MR completely with Spark? Here are the areas where we still use MR and need your input on going ahead with the Apache Spark option:
ETL : Data validation and transformation. Sqoop and custom MR programs using MR API.
Machine Learning : Mahout algorithms to arrive at recommendations, classification and clustering
NoSQL Integration : Interfacing with NoSQL Databases using MR API
Stream Processing : We are using Apache Storm for doing stream processing in batches.
Hive Query : We are already using the Tez engine to speed up Hive queries and see a 10X performance improvement compared with the MR engine.
ETL - Spark needs much less boilerplate code than MR. Plus you can code in Scala, Java, and Python (not to mention R, but probably not for ETL). Scala especially makes ETL easy to implement; there is less code to write.
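For a sense of the reduced boilerplate, here is a minimal PySpark sketch of a validate-and-transform job; the input path, column names, and validation rule are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-example").getOrCreate()

    # Read raw records, drop invalid rows, derive a column, write out as Parquet.
    raw = spark.read.option("header", "true").csv("hdfs:///data/raw/orders")
    clean = (raw
             .filter(F.col("amount").cast("double").isNotNull())   # simple validation rule
             .withColumn("amount", F.col("amount").cast("double"))
             .withColumn("order_date", F.to_date("order_ts")))
    clean.write.mode("overwrite").parquet("hdfs:///data/curated/orders")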
Machine Learning - ML is one of the reasons Spark came about. With MapReduce, the HDFS interaction makes many ML programs very slow (unless you have some HDFS caching, but I don't know much about that). Spark can keep data in memory, so your programs can build ML models with different parameters iteratively against a dataset held in memory, with no file system interaction (except for the initial load).
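A brief sketch of that pattern with Spark MLlib: the dataset is cached once and then several models are fit against it with different parameters; the file path and column names are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("ml-example").getOrCreate()

    df = spark.read.parquet("hdfs:///data/curated/features")
    train = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features") \
                .transform(df).select("features", "label")
    train.cache()  # loaded from HDFS once, reused in memory across fits

    # Try several regularization strengths without touching the file system again.
    models = {reg: LogisticRegression(regParam=reg).fit(train) for reg in (0.01, 0.1, 1.0)}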
NoSQL - There are many NoSQL data sources which can easily be plugged into Spark using Spark SQL. Just google the one you are interested in; it's probably very easy to connect.
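For example, here is a sketch of reading a Cassandra table as a DataFrame, assuming the DataStax spark-cassandra-connector package is supplied to Spark (e.g. via --packages) and placeholder keyspace/table names:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cassandra-example")
             # Assumes the spark-cassandra-connector jar is on the classpath.
             .config("spark.cassandra.connection.host", "cassandra-host")
             .getOrCreate())

    events = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(keyspace="analytics", table="events")
              .load())
    events.createOrReplaceTempView("events")
    spark.sql("SELECT event_type, COUNT(*) FROM events GROUP BY event_type").show()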
Stream Processing - Spark Streaming works in micro-batches, and one of the main selling points of Storm over Spark Streaming is that Storm is true streaming rather than micro-batches. As you are already processing in batches, Spark Streaming should be a good fit.
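A minimal sketch of the micro-batch model with the classic Spark Streaming (DStream) API, assuming a placeholder socket source and a 10-second batch interval:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-example")
    ssc = StreamingContext(sc, batchDuration=10)  # each micro-batch covers 10 seconds

    # Placeholder source; in practice this would be Kafka, Kinesis, Flume, etc.
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()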
Hive Query - There is a Hive on Spark project going on. Check the status here. It will allow Hive to execute queries via your Spark cluster and should be comparable to Hive on Tez.
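Once Hive on Spark is available in your distribution, switching engines is a session-level setting. A hedged sketch using the PyHive client against HiveServer2 (host, port, user, and table are placeholders):

    from pyhive import hive

    conn = hive.connect(host='hiveserver2.example.com', port=10000, username='etl')
    cur = conn.cursor()

    # Ask Hive to run this session's queries on Spark instead of MR/Tez.
    cur.execute("SET hive.execution.engine=spark")
    cur.execute("SELECT category, COUNT(*) FROM sales GROUP BY category")
    print(cur.fetchall())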

Is HDFS required to use with MapReduce?

We're exploring the use of MR to parallelize long-running processes. All of our data currently resides in an RDBMS. We understand that HDFS is the underlying file-based data storage for MR, but we're not sure of the following:
Do we have to move all RDBMS data to HDFS to use MR?
Is such a move permanent, or temporary only for the life of the MR process?
Can we use MR for its parallel features while jobs still access data from traditional sources (not HDFS)?
I don't think you have to move all RDBMS data to HDFS to use MR. Let's take a look at how Sqoop loads data from an RDBMS to HBase/HDFS.
Sqoop loads data via MapReduce with the help of DBInputFormat (a connector that allows Hadoop MapReduce programs to read rows from SQL databases).
If performance and scalability are your first priority, then yes, you have to move all your data from the RDBMS to HDFS for efficient processing.
MR jobs process data into and out of HDFS. After the data is processed, you can export it from HDFS to other sources, either with MR or just using the HDFS APIs.
No, you cannot use MR for its parallel features while jobs still access data from traditional sources. MR jobs split the input data and pass it on to the various map tasks; with traditional sources that won't be possible.

Which has better performance with MapReduce - HBase or Cassandra?

I have a choice of using HBase or Cassandra. I will be writing MapReduce tasks to process data.
So which will be the better choice, HBase or Cassandra? And which will be better to use with Hive and Pig?
I have used both. I am not sure what @Tariq means by "modified without a cluster restart", as I don't restart the cluster when I modify Cassandra schemas. I have not used Pig or Hive, but from what I understand they just sit on top of map/reduce, and I have used the map/reduce Cassandra adapter, which works great. We also know people who have used PlayOrm with map/reduce a bit as well; PlayOrm does not yet have the HBase provider written. They have Cassandra and MongoDB right now, so you can write your one client and it works on either database. Of course, for specific features of each NoSQL store you can get the driver and talk directly to the NoSQL store instead of going through PlayOrm, but many features are very similar between NoSQL stores.
I would suggest HBase, as it has native MR support and runs seamlessly on top of your existing Hadoop cluster. Also, a simpler schema that can be modified without a cluster restart is a big plus. It also provides easy integration with Pig and Hive.