Amazon Redshift Spectrum under the hood?

I am just curious to know what happens under the hood when you run a query in Redshift Spectrum.
Is it running a Spark job? A MapReduce job? Presto? Or something totally different?

I'm not positive, but I believe Spectrum was based on Apache Impala. At this point I'd imagine the code is hardly recognizable as Impala, though; Spectrum has been under active development for quite a while now.
I'm guessing Impala based on Spectrum's capabilities and behavior with respect to timestamps in Parquet files.

Related

Why are we seeing spikes in our Presto query run times?

We're trying to debug why our Presto query run times vary significantly over the day. We see several significant spikes, some during working hours and some outside of working hours. We're using EMR version 5.14 and Presto version 0.194. Our data is stored in S3 as Parquet files created by Hive. The graph below shows the run times for the same query over time using the Presto CLI. Any ideas or suggestions on what we should focus on, or what could potentially cause these spikes, would be much appreciated. Thanks!
Posting this in case anyone else has this issue. We ended up disabling Hive statistics in hive.properties and that improved performance.
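For reference, the change goes in the Hive connector's catalog file. The exact property key depends on the Presto version (check your release's Hive connector documentation); the metastore host below is a hypothetical placeholder:

```properties
# etc/catalog/hive.properties — hedged sketch; verify the property name
# against your Presto version's Hive connector docs before applying.
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083

# Stop retrieving table statistics from the metastore if those lookups
# are contributing to query-time latency spikes (key name varies by version):
hive.table-statistics-enabled=false
```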

Is Impala a columnar clustered database?

I am new to big data and the related tools/technologies, and I was going through the Impala documentation.
Is it accurate to say that Impala is a clustered columnar database?
And does Impala need a lot of memory to compute/transform data?
Impala is not a database.
Impala is an MPP (Massively Parallel Processing) SQL query engine. It is a SQL interface on top of data in HDFS. You can define a table structure over Parquet files, which are columnar files that allow fast reads of data.
According to the Impala documentation:
Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS, HBase, or the Amazon Simple Storage Service (S3). In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Impala query UI in Hue) as Apache Hive. This provides a familiar and unified platform for real-time or batch-oriented queries.
Impala uses the Hive Metastore to store the file structure and schema of each table. Impala lets you run SQL queries over your files and is responsible for parallelizing the work across your cluster.
About memory use, you are partially right. Impala executes queries in memory, while Hive uses disk-based execution via classical MapReduce or Tez. Newer versions of Impala support disk spill, which helps with data that doesn't fit in memory.
Impala integrates with the Apache Hive meta store database, to share databases and tables between both components. The high level of integration with Hive, and compatibility with the HiveQL syntax, lets you use either Impala or Hive to create tables, issue queries, load data, and so on.
Impala is not a database.
Impala is not based on MapReduce algorithms. It implements a distributed architecture based on daemon processes that run on the same machines and are responsible for all aspects of query execution.
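To make "SQL on top of HDFS" concrete, here is a hedged sketch of defining and querying an external table over Parquet files; the table name, columns, and path are hypothetical, and the DDL follows HiveQL syntax as used by Impala:

```sql
-- Hypothetical external table over Parquet files already sitting in HDFS.
-- Impala records only the schema and location in the Hive Metastore;
-- the data files themselves stay where they are.
CREATE EXTERNAL TABLE sales (
  sale_id BIGINT,
  amount  DECIMAL(12,2),
  sold_at TIMESTAMP
)
STORED AS PARQUET
LOCATION '/data/warehouse/sales';

-- Queries then run through Impala's daemon processes, not MapReduce:
SELECT COUNT(*) FROM sales WHERE amount > 100;
```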

Migrating data from Netezza to Redshift

I am not sure if this is the correct forum, but any help on this would be really appreciated.
Can someone please share some links/references which could help me in analyzing the feasibility of database migration from IBM Netezza to Amazon Redshift?
Kamlesh,
There are a lot of similarities between the two technologies, IBM PureData/Netezza and AWS Redshift.
Some developers who worked on the first version of Netezza also worked on the first version of ParAccel DB, and AWS Redshift uses the same core engine as ParAccel DB. ParAccel has since been sold and the product rebranded as Actian Matrix, but the core engine is the same.
Both databases are MPP implementations with a shared-nothing architecture, and both share a PostgreSQL "heritage". AWS Redshift is a truly "columnar" database, while Netezza is not.
There are a few differences in SQL syntax and also some differences in functionality, and there are several features/capabilities that AWS Redshift does not yet have. Some of the most noteworthy differences are that Redshift does not support stored procedures, user-defined functions, or sequences.
Amazon lists the differences between AWS Redshift and PostgreSQL in this document. While this is not a comparison between Netezza and Redshift, it will give you a good idea of what to expect in terms of differences, since both Netezza and Redshift were originally based on PostgreSQL.
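As one concrete example of the sequence gap: where Netezza code relies on a sequence for surrogate keys, a Redshift port typically switches to an IDENTITY column. A hedged sketch with a hypothetical table:

```sql
-- Netezza-style: CREATE SEQUENCE order_seq; then NEXT VALUE FOR order_seq.
-- Redshift has no sequences; an IDENTITY(seed, step) column is the usual
-- substitute for generating surrogate keys.
CREATE TABLE orders (
  order_id    BIGINT IDENTITY(1, 1),
  customer_id BIGINT,
  order_date  DATE
)
DISTKEY (customer_id)
SORTKEY (order_date);
```

Note that Redshift IDENTITY values are unique but not guaranteed to be gap-free or strictly sequential, which matters if the Netezza code depends on sequence ordering.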

When should we go for Apache Spark

Would it be wise to replace MR completely with Spark? Here are the areas where we still use MR; we would appreciate your input on moving them to Apache Spark:
ETL: data validation and transformation, using Sqoop and custom MR programs written against the MR API.
Machine Learning: Mahout algorithms for recommendations, classification, and clustering.
NoSQL Integration: interfacing with NoSQL databases using the MR API.
Stream Processing: we are using Apache Storm to do stream processing in batches.
Hive Query: we are already using the Tez engine to speed up Hive queries, and see a 10x performance improvement compared with the MR engine.
ETL - Spark needs much less boilerplate code than MR. Plus you can code in Scala, Java, or Python (not to mention R, though probably not for ETL). Scala especially makes ETL easy to implement; there is less code to write.
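To illustrate the shape of that low-boilerplate style, here is a plain-Python analogue of a validate-and-transform pipeline written in the map/filter style Spark's RDD API encourages; the field names and validation rules are made up, and in real Spark the same chain would run distributed over an RDD or DataFrame:

```python
# Plain-Python sketch of an ETL validate-and-transform pipeline in the
# map/filter style that Spark makes concise. Input rows and rules are
# hypothetical: "name,age" strings, where malformed or negative-age
# rows are dropped.
raw = ["alice,34", "bob,-1", "carol,29", "broken_row"]

def parse(line):
    parts = line.split(",")
    return (parts[0], int(parts[1])) if len(parts) == 2 else None

records = [r for r in map(parse, raw) if r is not None]     # parse, drop bad rows
valid = [(name, age) for name, age in records if age >= 0]  # validate
```

With Spark the same logic would be roughly `raw.map(parse).filter(...)` against a distributed dataset, versus a full Mapper/Reducer class pair in the MR API.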
Machine Learning - ML is one of the reasons Spark came about. With MapReduce, the HDFS interaction makes many ML programs very slow (unless you have some HDFS caching, but I don't know much about that). Spark can hold data in memory, so a program can iteratively build ML models with different parameters against a dataset that stays in memory, with no file-system interaction (except for the initial load).
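The access pattern that favors in-memory data looks like this hedged, plain-Python sketch: the dataset is loaded once and then scanned many times (here, gradient descent fitting y ≈ w·x on made-up points), whereas with MapReduce each pass would re-read from HDFS:

```python
# Sketch of the iterative ML access pattern: load once, scan many times.
# The data points and learning rate are made up for illustration.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # (x, y) pairs, held in memory

w = 0.0
for _ in range(200):  # many passes over the same in-memory dataset
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= 0.05 * grad
# w converges close to 2.0, the slope that fits these points
```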
NoSQL - Many NoSQL data sources can easily be plugged into Spark via Spark SQL. Just search for the one you are interested in; it's probably very easy to connect.
Stream Processing - Spark Streaming works in micro-batches, and one of the main selling points of Storm over Spark Streaming is that Storm does true streaming rather than micro-batches. Since you are already processing in batches, Spark Streaming should be a good fit.
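The micro-batch model can be sketched in plain Python: incoming events are grouped into fixed-length time windows and each window is processed as one small batch, which is how Spark Streaming operates (versus Storm's one-event-at-a-time processing). The event shape and interval below are hypothetical:

```python
# Plain-Python sketch of micro-batching: bucket a stream of timestamped
# events into fixed windows, then process each window as a small batch.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]  # (seconds, payload)
batch_interval = 1.0  # one micro-batch per second

batches = {}
for ts, payload in events:
    window = int(ts // batch_interval)  # which micro-batch this event joins
    batches.setdefault(window, []).append(payload)

# Each micro-batch is then handed to ordinary batch-style processing,
# e.g. counting events per window:
results = {w: len(items) for w, items in sorted(batches.items())}
```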
Hive Query - There is an ongoing Hive on Spark project; check its status here. It will allow Hive to execute queries on your Spark cluster and should be comparable to Hive on Tez.

Which has better performance with MapReduce: HBase or Cassandra?

I have a choice between HBase and Cassandra, and I will be writing map-reduce tasks to process data.
Which will be the better choice, HBase or Cassandra? And which is better to use with Hive and Pig?
I have used both. I am not sure what @Tariq means by "modified without a cluster restart", as I don't restart the cluster when I modify Cassandra schemas. I have not used Pig or Hive, but from what I understand those just sit on top of map/reduce, and I have used the Cassandra map/reduce adapter, which works great. We also know people who have used PlayOrm with map/reduce a bit; PlayOrm does not yet have an HBase provider written. It has Cassandra and MongoDB providers right now, so you can write one client and it works on either database. Of course, for features specific to each NoSQL store, you can get the driver and talk directly to the store instead of going through PlayOrm, but many features are very similar across NoSQL stores.
I would suggest HBase, as it has native MR support and runs seamlessly on top of your existing Hadoop cluster. Also, its simpler schema can be modified without a cluster restart, which is a big plus. It also provides easy integration with Pig and Hive.