Which one is faster, HBase or Hypertable? [closed] - c++

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I have a requirement to store millions of records, all of them unique, with multiple columns.
For example:
eventcode  description  count
526        blocked      100
5230       xxx          20
....
When fetching, I need to sort on the count column and filter on columns.
I was planning to use HBase, but after some searching I read that Hypertable is faster.
So I am a bit confused about which to choose.
Please help me with this.
Note: I want to use C++ for transactions (reading and writing).

BIG disclaimer: I work for Hypertable.
We created a benchmark a while ago, which you can read here: http://hypertable.com//why_hypertable/hypertable_vs_hbase_2/
Conclusion of that benchmark: Hypertable is faster, usually twice as fast.
Performance was actually the reason Hypertable was founded. Back then, some people were sitting together discussing an open source implementation of Google's Bigtable architecture. They did not agree on the programming language (Java vs. C++; the disagreement was about performance). As a result, one group founded Hypertable (a C++ implementation) and the other group started working on HBase (in Java).
If you do not trust benchmarks, you will have to run your own; both systems are open source and free to use. If you have questions about Hypertable or run into problems while evaluating it, feel free to drop me a mail (or use the mailing list; all questions get answered).
By the way, Hypertable does not (yet) support sorting. You will have to implement this in your client application.
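A minimal sketch of what that client-side step could look like in C++, assuming the rows have already been fetched into memory. The EventRecord struct and both function names are hypothetical and only mirror the columns from the question; nothing here is part of the Hypertable client API.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical record type mirroring the columns in the question.
struct EventRecord {
    int eventcode;
    std::string description;
    long count;
};

// Filtering step: keep only rows whose count is at least min_count.
std::vector<EventRecord> filter_by_min_count(const std::vector<EventRecord>& rows,
                                             long min_count) {
    std::vector<EventRecord> out;
    std::copy_if(rows.begin(), rows.end(), std::back_inserter(out),
                 [min_count](const EventRecord& r) { return r.count >= min_count; });
    return out;
}

// Sorting step: order rows by count, largest first.
// stable_sort keeps rows with equal counts in their fetched order.
void sort_by_count_desc(std::vector<EventRecord>& rows) {
    std::stable_sort(rows.begin(), rows.end(),
                     [](const EventRecord& a, const EventRecord& b) {
                         return a.count > b.count;
                     });
}
```

With millions of rows it is usually worth filtering first, so that the sort only touches the rows that survive the filter.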

Related

Logging framework for C++ [closed]

Closed 10 years ago.
I apologize for raising a topic that has been widely discussed before, but none of the discussions I found clearly says which one to use in the end. My requirements for a logging framework in my C++ project are:
Thread safe.
Should support multiple targets.
Log rotation possible.
A way to identify modules implicitly.
I have been using Boost.Log for some time in a small C++ project and it worked well. But when I moved to a large C++ project, I found that supporting multiple targets (I mean multiple files for the same project) is a nightmare, there is no way to implicitly indicate which module is logging, and above all the compile time has increased by at least 40%.
Now I am looking at alternative frameworks, and log4cplus and logog seem to meet all my requirements. I would like an expert opinion on which would best suit the above criteria, rather than ending up in a soup again after using the library for some time.

Developing a data parser, not sure where to begin? [closed]

Closed 10 years ago.
I've been asked at work to develop some in-house software for our support team. I haven't done much programming in the 18 months since I left university, and I never dedicated much time to it then, but now I want to dedicate some serious time to it and learn programming properly. I still remember all the basics from what I was taught. The language I used most was Java, although I've spent some time on iOS development in Objective-C.
The program in question appears to be relatively simple. We have .dat files containing information that we often compare against reports. I've been asked to create a data-parser type of program where the file is loaded and output in a human-readable format, preferably in columns, since currently, in its raw form, it's a long string of numbers and letters.
I've been told that C++ would be a good way forward, although after I borrowed a few books from the developers, they told me that doing it in a web-based language might be better.
I can't really decide which avenue to go down; any input would be appreciated.

C++ tutorial or example of Redis for fastest insert [closed]

Closed 10 years ago.
Does anyone have any reference resources that show the fastest way to insert market data into a Redis Server? I am looking at data sets in the millions so I am trying to find some good coding examples to achieve this in C++ using a library like credis or hiredis. Does anyone have an end to end tutorial or set of source code examples of this? I seem to only find examples of simple connection testing or simple insertions.
Thanks
The fastest way to insert data into Redis is probably to use the pipe mode of the redis-cli client. More information here: http://redis.io/topics/mass-insert
Now, if you are interested in a C API to efficiently send many queries to Redis, look no further than Hiredis. Hiredis is up to date and maintained by the Redis authors. I suggest using it over the other options, even in a C++ context.
Here is a simple pipelining example:
https://gist.github.com/1893378
A Redis client written in C++ is redis-cplusplus-client, available on GitHub; it depends on the Boost C++ libraries.

Importance of concurrency issues in web applications [closed]

Closed 10 years ago.
With respect to this question, is it always necessary to have a perfectly concurrency-safe web application, or can a developer afford to leave some possible concurrency issues untreated (e.g. when money is not involved) because the probability of them happening is very low anyway? What is the best practice?
NOTE: By concurrency issues I mean issues arising from overlapping executions of scripts. I do not mean multi-user issues like the classic lost update with the timestamp solution, because the probability of those is, in my opinion, significant, and I am pretty sure the best practice there is to always treat them.
If you're going to run your code on a web server, you should always write your code in such a way that multiple copies of it can run at the same time.
It's really not that difficult. In most cases, database transactions around state-modifying operations are all that is needed (or a lock file, or any of the other well-known solutions to this problem).
Note: if all you're doing is reading data then there are no concurrency issues.

Mallet vs Weka for text classification [closed]

Closed 10 years ago.
Which product (Mallet or Weka) is better for a text classification task, in terms of:
Simpler to train
Better results
Documentation
I'm new to this problem, so any comments would be great.
MALLET is much easier to use and does most of its job invisibly. You don't have to convert the format of anything either, you just give it text files and it gives you back results.
Weka requires converting the text into a particular format (the Weka script for doing so is so slow and inefficient that I would recommend you write your own).
The problem with MALLET is that training uses gigabytes of memory and can take hours if you have large training sets.
Weka has more documentation, but most of it makes no sense. MALLET has very little documentation but is very simple to use.
To be honest, after testing both of them, I opted for writing my own classifier.
I'm really enjoying Weka compared with Mallet. Maybe I don't know enough yet, but doing machine learning with a GUI is awesome. You can tweak parameters and run different experiments (keeping the results of past experiments in front of you, too) very easily. I'm new to Weka, so this is FWIW.
As far as which one is simpler to train, I find Weka simpler. I don't know what kind of control you can have over your feature space by just pointing Mallet at some text (maybe it's good enough), but my experience with Mallet was comparable to Weka... writing scripts to get the input in the proper format, with the caveat that I had to do multiple steps to utilize some kind of serialized version of the data in Mallet.
Regarding your other questions, I can't really answer them right now, but am hoping this answer doesn't get downvoted 'cause it's good information to be out there, anyway.