Which has better performance with map reduce- Hbase or Cassandra? - mapreduce

I have choice of using Hbase or cassandra. I will be writing map-reduce tasks to process data.
So which will be better choice Hbase or cassandra. And which will be better to use with hive and pig?

I have used both. I am not sure what #Tariq means by modified without cluster restart as I don't restart the cluster when I modify cassandra schemas. I have not used Pig and Hive but from what I understand those just sit on map/reduce and I have used the map/reduce cassandra adapter which works great. We also know people who have used PlayOrm with map/reduce a bit as well and PlayOrm as of yet does not have the hbase provider written. They have cassandra and mongodb right now so you can write your one client and it works on either database. Of course for specific features of each nosql store, you can get the driver and talk directly to the nosql store instead of going through playOrm but many features are very similar between nosql stores.

I would suggest HBase as it has got native MR support and runs on top of your existing Hadoop cluster seamlessly. Also, simpler schema that can be modified without a cluster restart is a big plus. It also provides easy integration with Pig and Hive as well.

Related

Sharding existing postgresql database with PostgresXL

We want to shard our PostgreSQL DB, due to high disk load. Firstly, we looked at django-sharding library, but:
Very much rewriting in our backend
Migrating all tables to 64-bit primary keys is hard work on 300-400gb tables
Generating ids with Postgres Specific algorithm makes it impossible to move data from shard to shard. More than that, we have a large database with old ids. Updating all of them is a big problem too.
Generating ids with special tables makes us do a special SELECT query to main database every time we insert data. We have high write load, so it's not good.
Considring all these, we decided too look on Postgres database sharding solutions. We found 2 opportunities - Citus and PostgresXL. Citus makes us change data format too much and rewrite a big bunch of backend at the same time, so we are about to try PostgresXL as more transparent solution. But reading the docs, I can't understand some things and will be greatfull for recomendations:
Are there any other sharding workarounds except for Citus and PostgresXL? It would be good not to change much in our database on migrating.
Some questions about PostgresXL:
Do I understand correctly, that it's not Postgres extension, it's a standalone fork? So I should build all its parts from sources and than move data in some way?
How are Postgres and PostgresXL versions compatible? We have PostgreSQL 9.4. I don't see such a version in PostgresXL (9.2 or 9.5 no middle?). So can I use, for example, streaming replication for migration?
If yes/no, what is the best solution to migrate data? If I have 2Tb database with heavy write, can I migrate it somehow without stopping for a long period of time?
Thanks.
First off to save your self a LOT of headache have you looked at options Like Amazon's Auora, Dynomo, Red Shift, etc services? They are VERY cost effective at scale, as well as optimized and managed for you.
Actually Amazon's straight Postgress databases can handle MASSIVE amounts of reads or writes. We can go into 2,000- 6,000 IOPS on reads and another 2,000 to 6,000 IOPS in writes without issue. I would really look into this as the option. Azure, Oracle, and Google also have competing services.
Also be aware that Postgres-XL beyond all reason has no HA support. If you lose a single node you lose everything. The nodes can not fail over.
it's a standalone fork?
Yes, They are very different apps and developed separate from each other.
How are Postgres and PostgresXL versions compatible?
They arn't compatible. You can not just migration Postgres to Postgresl-XL. They work VERY differently.
Generating ids with Postgres Specific algorithm makes it impossible to >move data from shard to shard
Not following this, but with sharing you are not supposed to move data from one shard to another. The key being used generally needs to be something specific and unique to split/segregate your data on. Like a date, or a "type" field, or some other (hopefully ordered) field(s)/column(s). This breaks things up but has obvious pain in the a$$ limitations.
Are there any other sharding workarounds except for Citus and
PostgresXL? It would be good not to change much in our database on >>migrating.
Tons of options, but right off the bat going from a standard RDS, to a NoSql, or MPP database is going to be a major migration, a lot of effort, and have a LOT of limitations no matter what you do.
Next Postress-XL and Citus are MPP (massive parallel processing) clustering apps, not sharing specifically. That is part of what they can do, but it is not their focus.
Other options for MPP
pgPool -- (not great for heavy writes )
haProxy -- ( have not done it but read about it. Lost of work to setup and maintain. )
MySql Cluster -- (Huge pain to use the OSS version and major $$$ for the commercial version)
Green Plumb
Teradata
Vertica
what is the best solution to migrate data?
Very unlikely to find a simple migration for this kind of switch. You can expect to likely need to export the data your self from the existing RDS and import it to the new DB and will likely have to write something your self to get it the way you want it.

Web services over HBase

I am new to the Hadoop environment, sorry if the question is obvious...
I need to develop a web service to record and read large volumes of data. Because of this requirement I thought of using a Hadoop cluster and HBase as my database.
I have designed my hbase schema to satisfy my requirements, so far so good.
The thing is that since it is a service I am developing, I would like the users of the service not to know the internal representation of the data.
I do not want the users to have to invoke a Put to a certain table, for example, to the Clients table, but instead invoke a high-abstraction method, for example, createClient().
How do I add this abstraction layer on top of HBase while maintaining the characteristics of reliable and distributed and the capacity to service lots of users simultaneously offered by HBase itself?
Thanks a lot
Consider Hbase Stargate to enable a REST server. If you want to obscure the table name in the URI, perhaps proxy Stargate with a web server.

When should we go for Apache Spark

Would it be wise to replace MR completely with Spark. Here are the areas where we still use MR and need your input to go ahead with Apache Spark option-
ETL : Data validation and transformation. Sqoop and custom MR programs using MR API.
Machine Learning : Mahout algorithms to arrive at recommendations, classification and clustering
NoSQL Integration : Interfacing with NoSQL Databases using MR API
Stream Processing : We are using Apache Storm for doing stream processing in batches.
Hive Query : We are already using Tez engine for speeding up Hive queries and see 10X performance improvement when compared with MR engine
ETL - Spark has much less boiler-plate code needed than MR. Plus you can code in Scala, Java and Python (not to mention R, but probably not for ETL). Scala especially, makes ETL easy to implement - there is less code to write.
Machine Learning - ML is one of the reasons Spark came about. With MapReduce, the HDFS interaction makes many ML programs very slow (unless you have some HDFS caching, but I don't know much about that). Spark can run in-memory so you can have programs build ML models with different parameters to run recursively against a dataset which is in-memory, so no file system interaction (except for the initial load).
NoSQL - There are many NoSQL datasources which can easily be plugged into Spark using SparkSQL. Just google which one you are interested in, it's probably very easy to connect.
Stream Processing - Spark Streaming works in micro-batches and one of the main selling points of Storm over Spark Streaming is that it is true streaming rather than micro batches. As you are already using batches Spark Streaming should be a good fit.
Hive Query - There is a Hive on Spark project which is going on. Check the status here. It will allow Hive to execute queries via your Spark Cluster and should be comparable to Hive on Tez.

Spark : How to design streaming with spark?

Current architecture
MySQL database with a REST API abstraction.
The problem
MySQL not scaling for various reasons that including data model design which is hard to fix.
Proposed architecture
Using Cassandra as the NoSQL backend and using Spark as the in memory computation engine along with Spark streaming.
Questions
How good is cassandra's consistency ? The decision is to directly use kafka streams that carries real time information into Cassandra and then use Spark SQL to query that data.
If the consistency above is good, then how are RDD's designed around this since they are immutable.Do they create new RDD's ?
An alternative design is to migrate all the data from MySQL to Cassandra and then use kafka to directly send the messages to spark which handles it in real time and use downstream systems to finally hand over the data to Cassandra over time.
In points 1&2 the consistency is dependant upon Cassandra and in point 3 it is tied with Spark.
Which design is better ? Can someone throw some light on this.

Hadoop and Django, is it possible?

From what I understood, Hadoop is a distributed storage system thingy. However what I don't really get is, can we replace normal RDBMS(MySQL, Postgresql, Oracle) with Hadoop? Or is Hadoop is just another type of filesystem and we CAN run RDBMS on it?
Also, can Django integrated with Hadoop? Usually, how web frameworks (ASP.NET, PHP, Java(JSP,JSF, etc) ) integrate themselves with Hadoop?
I am a bit confused with the Hadoop vs RDBMS and I would appreciate any explanation.
(Sorry, I read the documentation many times, but maybe due to my lack of knowledge in English, I find the documentation is a bit confusing most of the time)
What is Hadoop?
Imagine the following challange: you have a lot of data, and with a lot I mean at least Terabytes. You want to transform this data or extract some informations and process it into a format which is indexed, compressed or "digested" in a way so you can work with it.
Hadoop is able to parallelize such a processing job and, here comes the best part, takes care of things like redundant storage of the files, distribution of the task over different machines on the cluster etc (Yes, you need a cluster, otherwise Hadoop is not able to compensate the performance loss of the framework).
If you take a first look at the Hadoop ecosystem you will find 3 big terms: HDFS(Hadoop Filesystem), Hadoop itself(with MapReduce) and HBase(the "database" sometimes column store, which does not fits exactly)
HDFS is the Filesystem used by both Hadoop and HBase. It is a extra layer on top of the regular filesystem on your hosts. HDFS slices the uploaded Files in chunks (usually 64MB) and keeps them available in the cluster and takes care of their replication.
When Hadoop gets a task to execute, it gets the path of the input files on the HDFS, the desired output path, a Mapper and a Reducer Class. The Mapper and Reducer is usually a Java class passed in a JAR file.(But with Hadoop Streaming you can use any comandline tool you want). The mapper is called to process every entry (usually by line, e.g.: "return 1 if the line contains a bad F* word") of the input files, the output gets passed to the reducer, which merges the single outputs into a desired other format (e.g: addition of numbers). This is a easy way to get a "bad word" counter.
The cool thing: the computation of the mapping is done on the node: you process the chunks linearly and you move just the semi-digested (usually smaller) data over the network to the reducers.
And if one of the nodes dies: there is another one with the same data.
HBase takes advantage of the distributed storage of the files and stores its tables, splitted up in chunks on the cluster. HBase gives, contrary to Hadoop, random access to the data.
As you see HBase and Hadoop are quite different to RDMBS. Also HBase is lacking of a lot of concepts of RDBMS. Modeling data with triggers, preparedstatements, foreign keys etc. is not the thing HBase was thought to do (I'm not 100% sure about this, so correct me ;-) )
Can Django integrated with Hadoop?
For Java it's easy: Hadoop is written in Java and all the API's are there, ready to use.
For Python/Django I don't know (yet), but I'm sure you can do something with Hadoop streaming/Jython as a last resort.
I've found the following: Hadoopy and Python in Mappers and Reducers.
Hue, The Web UI for Hadoop is based on Django!
Django can connect with most RDMS, so you can use it with a Hadoop based solution.
Keep in mind, Hadoop is many things, so specifically, you want something with low latency such as HBase, don't try to use it with Hive or Impala.
Python has a thrift based binding, happybase, that let you query Hbase.
Basic (!) example of Django integration with Hadoop
[REMOVED LINK]
I use Oozie REST api for job execution, and 'hadoop cat' for grabbing job results (due to HDFS' distributed nature). The better appoach is to use something like Hoop for getting HDFS data. Anyway, this is not a simple solution.
P.S. I've refactored this code and placed it into https://github.com/Obie-Wan/django_hadoop.
Now it's a separate django app.