I was wondering if hbase-0.90.0 has any known problems related to multiple clients writing to the same row at the same time. In my setup, there are >10 nodes writing to the same HBase table, and sometimes (very rarely) I see data not being written to the table; I log exceptions, etc., and I see none. One possibility is that multiple nodes are writing to the same row at once, and I was wondering if that could be causing this behavior. Thanks!
What version of Hadoop are you using? Some old versions did not have durable sync and could lose data:
HBase will lose data unless it is running on an HDFS that has a durable sync implementation. Hadoop 0.20.2, Hadoop 0.20.203.0, and Hadoop 0.20.204.0 DO NOT have this attribute. Currently only Hadoop versions 0.20.205.x or any release in excess of this version -- this includes Hadoop 1.0.0 -- have a working, durable sync. Sync has to be explicitly enabled by setting dfs.support.append equal to true on both the client side -- in hbase-site.xml -- and on the server side in hdfs-site.xml (the sync facility HBase needs is a subset of the append code path).
See here for all the details.
In Dask distributed documentation, they have the following information:
For example Dask developers use this ability to build in data locality when we communicate to data-local storage systems like the Hadoop File System. When users use high-level functions like dask.dataframe.read_csv('hdfs:///path/to/files.*.csv') Dask talks to the HDFS name node, finds the locations of all of the blocks of data, and sends that information to the scheduler so that it can make smarter decisions and improve load times for users.
However, it seems that get_block_locations() was removed from the HDFS fs backend, so my question is: what is the current state of Dask with regard to HDFS? Is it sending computation to nodes where the data is local? Is the scheduler being optimized to take data locality on HDFS into account?
Quite right: with the appearance of arrow's HDFS interface, which is now preferred over hdfs3, block locations are no longer considered by workloads accessing HDFS, since arrow's implementation doesn't include the get_block_locations() method.

However, we already wanted to remove the somewhat convoluted code which made this work, because we found that the inter-node bandwidth on test HDFS deployments was good enough that it made little practical difference in most workloads. The extra constraints on the size of the blocks versus the size of the partitions you would like in memory created an additional layer of complexity.

By removing the specialised code, we could avoid the very special case that was being made for HDFS as opposed to external cloud storage (S3, GCS, Azure), where it doesn't matter which worker accesses which part of the data.
In short, yes the docs should be updated.
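If you just want to see what HDFS access looks like today, here is a minimal sketch; the namenode host, port and path are placeholders, and pyarrow (with libhdfs available) is assumed as the backend:

```python
import dask.dataframe as dd

# Placeholders: replace "namenode", the port and the glob with your cluster's
# values. Workers read the blocks over the network; task scheduling is not
# biased toward the datanodes that physically hold them.
df = dd.read_csv("hdfs://namenode:8020/path/to/files.*.csv")

print(df.npartitions)
print(df.head())
```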
I would like to expose a web service in front of Hadoop that is used to forward data to the Hadoop ecosystem. I have two branches in Hadoop: a slow one that works on the whole data set periodically, and a fast one that does some computation on every input and stores the data for the periodic job. But the user does not see the slow branch and has the impression that only the fast job is done, not knowing about the slow job that runs on the data aggregated over time.

How should I best organize my architecture? I am new to Hadoop architecture; I have read about Oozie and have a feeling that it can help me to some point. But I don't know how to connect the service with Hadoop, or how to pass the data through the service, since Hadoop works primarily on files and is a distributed system.

Data should get into the system in a streaming fashion. There should be a "real-time" branch that works with the individual values as they come in, and these values would also be accumulated for the periodic batch processing.
Any help would be great, thanks.
You might want to look into Hue. This provides a set of web front-ends: there's one for HDFS (the filesystem) where you can upload files; there are means to track jobs too.
If you are aiming for more regular and automated loading of files into HDFS, please elaborate your question further: where and what is the data initially (logs? a database? a bunch of gzipped CSVs?), and what should trigger the retrieval/upload?
One can also use APIs to deal with the filesystem and to track jobs.
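For example, a hedged sketch using the third-party "hdfs" package (a WebHDFS client); the namenode address, user and paths below are placeholders, not part of the original answer:

```python
from hdfs import InsecureClient

# Placeholder namenode address and user.
client = InsecureClient("http://namenode:9870", user="hadoop")

# Push an incoming file into HDFS so the periodic batch job can pick it up.
client.upload("/data/incoming/events-2024-01-01.csv", "events.csv")

# See what has accumulated so far for the slow, periodic branch.
print(client.list("/data/incoming"))
```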
As far as Oozie is concerned, it is more of an orchestration tool; use it to organize related jobs into workflows.
I am a new learner of Hadoop.
While reading about Apache HDFS I learned that HDFS is a write-once file system. Some other distributions (Cloudera) provide an append feature. It would be good to know the rationale behind this design decision. In my humble opinion, this design places a lot of limitations on Hadoop and makes it suitable only for a limited set of problems (problems similar to log analytics).

Expert comments will help me understand HDFS better.
HDFS follows the write-once, read-many approach for its files and applications. It assumes that a file in HDFS, once written, will not be modified, though it can be accessed any number of times (future versions of Hadoop may support modification too). At present, HDFS strictly has one writer at any time. This assumption enables high-throughput data access and also simplifies data coherency issues. A web crawler or a MapReduce application is best suited for HDFS.

Because HDFS works on the principle of 'write once, read many', streaming data access is extremely important in HDFS. HDFS is designed for batch processing rather than interactive use by users, so the emphasis is on high throughput of data access rather than low latency. HDFS focuses not so much on storing the data as on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data set is more important than the time taken to fetch a single record from it. HDFS overlooks a few POSIX requirements in order to implement streaming data access.
http://www.edureka.co/blog/introduction-to-apache-hadoop-hdfs/
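To make the write-once idea concrete, here is a hedged sketch using pyarrow's HDFS filesystem; the namenode host and paths are placeholders, and libhdfs must be available on the machine:

```python
import pyarrow.fs as pafs

# Placeholder namenode host/port.
hdfs = pafs.HadoopFileSystem(host="namenode", port=8020)

# A file is created once...
with hdfs.open_output_stream("/data/logs/part-0001.txt") as out:
    out.write(b"first batch of records\n")

# ...and may be appended to, but there is no random, in-place update of
# existing bytes the way a POSIX file or an RDBMS page would allow.
with hdfs.open_append_stream("/data/logs/part-0001.txt") as out:
    out.write(b"second batch of records\n")

# Reads stream the whole file back, which is the access pattern HDFS
# is optimized for.
with hdfs.open_input_stream("/data/logs/part-0001.txt") as f:
    print(f.read().decode())
```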
There are three major reasons that HDFS has the design it has:

HDFS was designed by slavishly copying the design of Google's GFS, which was intended to support batch computations only.

HDFS was not originally intended for anything but batch computation.

Designing a real distributed file system that can support high-performance batch operations as well as real-time file modifications is difficult, and it was beyond the budget and experience level of the original implementors of HDFS.
There is no inherent reason that Hadoop couldn't have been built as a fully read/write file system. MapR FS is proof of that. But implementing such a thing was far outside of the scope and capabilities of the original Hadoop project and the architectural decisions in the original design of HDFS essentially preclude changing this limitation. A key factor is the presence of the NameNode since HDFS requires that all meta-data operations such as file creation, deletion or file length extensions round-trip through the NameNode. MapR FS avoids this by completely eliminating the NameNode and distributing meta-data throughout the cluster.
Over time, not having a real mutable file system has become more and more annoying as the workloads for Hadoop-related systems such as Spark and Flink have moved more and more toward operational, near-real-time or real-time operation. The responses to this problem have included:

MapR FS. As mentioned, MapR implemented a fully functional, high-performance re-implementation of HDFS that includes POSIX functionality as well as NoSQL table and streaming APIs. This system has been in production for years at some of the largest big data systems around.
Kudu. Cloudera essentially gave up on implementing viable mutation on top of HDFS and has announced Kudu with no timeline for general availability. Kudu implements table-like structures rather than fully general mutable files.
Apache NiFi and the commercial version HDF. Hortonworks has also largely given up on HDFS and announced their strategy as forking applications into batch (supported by HDFS) and streaming (supported by HDF) silos.
Isilon. EMC implemented the HDFS wire protocol as part of their Isilon product line. This allows Hadoop clusters to have two storage silos, one for large-scale, high-performance, cost-effective batch based on HDFS and one for medium-scale mutable file access via Isilon.
Other. There are a number of essentially defunct efforts to remedy the write-once nature of HDFS. These include KFS (the Kosmos File System) and others. None of these has significant adoption.
An advantage of this technique is that you don't have to bother with synchronization. Since you write once, your readers are guaranteed that the data will not be manipulated while they read it.
Though this design decision does impose restrictions, HDFS was built keeping in mind efficient streaming data access.
Quoting from Hadoop - The Definitive Guide:
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
From what I understood, Hadoop is a distributed storage system of sorts. However, what I don't really get is: can we replace a normal RDBMS (MySQL, PostgreSQL, Oracle) with Hadoop? Or is Hadoop just another type of filesystem on which we CAN run an RDBMS?

Also, can Django be integrated with Hadoop? Usually, how do web frameworks (ASP.NET, PHP, Java (JSP, JSF, etc.)) integrate themselves with Hadoop?

I am a bit confused about Hadoop vs RDBMS and would appreciate any explanation.
(Sorry, I read the documentation many times, but maybe due to my lack of knowledge in English, I find the documentation is a bit confusing most of the time)
What is Hadoop?
Imagine the following challenge: you have a lot of data, and by a lot I mean at least terabytes. You want to transform this data, or extract some information and process it into a format that is indexed, compressed or "digested" in a way that lets you work with it.

Hadoop is able to parallelize such a processing job and, here comes the best part, takes care of things like redundant storage of the files, distribution of the tasks over the different machines of the cluster, etc. (Yes, you need a cluster; otherwise Hadoop is not able to compensate for the performance loss of the framework.)

If you take a first look at the Hadoop ecosystem you will find three big terms: HDFS (the Hadoop filesystem), Hadoop itself (with MapReduce) and HBase (the "database", sometimes a column store, which does not fit exactly).

HDFS is the filesystem used by both Hadoop and HBase. It is an extra layer on top of the regular filesystem on your hosts. HDFS slices uploaded files into chunks (usually 64 MB), keeps them available in the cluster and takes care of their replication.

When Hadoop gets a task to execute, it gets the path of the input files on HDFS, the desired output path, a Mapper and a Reducer class. The Mapper and Reducer are usually Java classes passed in a JAR file. (But with Hadoop Streaming you can use any command-line tool you want.) The mapper is called to process every entry of the input files (usually line by line, e.g.: "return 1 if the line contains a bad F* word"); the output is passed to the reducer, which merges the single outputs into another desired format (e.g.: the sum of the numbers). This is an easy way to get a "bad word" counter.
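To make the Streaming variant concrete, here is a hedged sketch of that "bad word" counter as two small Python scripts shown in one listing (the word list and file names are made up); they read from stdin and emit tab-separated key/value pairs, which is what Hadoop Streaming expects:

```python
# mapper.py -- receives chunks of the input on stdin and emits "bad<TAB>1"
# for every line that contains a word from the (made-up) bad-word list.
import sys

BAD_WORDS = {"darn", "heck"}  # placeholder list

for line in sys.stdin:
    if BAD_WORDS & set(line.lower().split()):
        print("bad\t1")

# reducer.py -- receives the mappers' output on stdin and sums the counts.
import sys

total = 0
for line in sys.stdin:
    _key, count = line.rstrip("\n").split("\t")
    total += int(count)
print("bad\t%d" % total)
```

You would pass the two scripts to the streaming jar with -mapper and -reducer; Hadoop sorts the mapper output by key before it reaches the reducer.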
The cool thing: the map computation is done on the node that holds the data; you process the chunks linearly and move just the semi-digested (usually smaller) data over the network to the reducers.

And if one of the nodes dies, there is another one with the same data.
HBase takes advantage of the distributed storage of the files and stores its tables, split up into chunks, on the cluster. Contrary to Hadoop, HBase gives random access to the data.

As you can see, HBase and Hadoop are quite different from an RDBMS. HBase also lacks a lot of RDBMS concepts: modeling data with triggers, prepared statements, foreign keys etc. is not what HBase was designed for (I'm not 100% sure about this, so correct me ;-) ).
Can Django be integrated with Hadoop?
For Java it's easy: Hadoop is written in Java and all the APIs are there, ready to use.
For Python/Django I don't know (yet), but I'm sure you can do something with Hadoop streaming/Jython as a last resort.
I've found the following: Hadoopy and Python in Mappers and Reducers.
Hue, the Web UI for Hadoop, is based on Django!
Django can connect to most RDBMSs, so you can use it with a Hadoop-based solution.

Keep in mind that Hadoop is many things; so, specifically, if you want something with low latency, use HBase and don't try to get it from Hive or Impala.

Python has a Thrift-based binding, happybase, that lets you query HBase.
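A hedged sketch of what that looks like; the Thrift gateway host, table and column names are placeholders, and the HBase Thrift server must be running:

```python
import happybase

# Placeholder Thrift gateway host.
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("users")

# Random-access writes and reads by row key -- the low-latency pattern
# HBase provides, in contrast to batch-oriented MapReduce jobs.
table.put(b"user-42", {b"profile:name": b"Ada", b"profile:city": b"Paris"})
print(table.row(b"user-42"))
```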
Basic (!) example of Django integration with Hadoop
[REMOVED LINK]
I use the Oozie REST API for job execution, and 'hadoop cat' for grabbing job results (due to HDFS's distributed nature). A better approach is to use something like Hoop for getting HDFS data. Anyway, this is not a simple solution.
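A hedged sketch of the 'hadoop cat' part of that approach; the output path is hypothetical and the hadoop binary must be on the web server's PATH:

```python
import subprocess

from django.http import HttpResponse


def job_results(request):
    # Concatenate the job's output files straight out of HDFS and return
    # them to the caller as plain text.
    result = subprocess.run(
        ["hadoop", "fs", "-cat", "/user/hadoop/job-output/part-*"],
        capture_output=True,
        check=True,
    )
    return HttpResponse(result.stdout, content_type="text/plain")
```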
P.S. I've refactored this code and placed it into https://github.com/Obie-Wan/django_hadoop.
Now it's a separate Django app.
Is there any way to check for changes in the database before running synchronization with MS Sync Framework?

I have a database with about 100 tables; 80% of these tables are not changed very often. I divided the database into multiple scopes to handle the sync priority. Even though there's no change in the database, it takes a long time to finish synchronization.
I suggest you trace the sync process to find out what's going on: How to: Trace the Synchronization Process.

There is no specific API call in the Sync Framework SDK for simply checking whether a table has changed. Most of the API calls will do an actual change enumeration (read: query the base and tracking tables).

If you have a large number of rows in your tables, you might want to set a retention period on the Sync Framework metadata to keep it small; see How to: Clean Up Metadata for Collaborative Synchronization (SQL Server).
Yes. Check out the Sync Framework Team Blog on Synchronization Services for ADO .NET for Devices: Improving performance by skipping tables that don’t need synchronization