Mallet vs Weka for text classification [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Which product (Mallet or Weka) is better for a text classification task, in terms of:
Simpler to train
Better results
Documentation
I'm new to this problem, so any comments would be great.

MALLET is much easier to use and does most of its work invisibly. You don't have to convert anything to another format either: you just give it text files and it gives you back results.
Weka requires converting the text into a particular format (the Weka script for doing so is so slow and inefficient that I would recommend you write your own).
The problem with MALLET is that training uses gigabytes of memory and can take hours if you have large training sets.
Weka has more documentation, but most of it makes no sense. MALLET has very little documentation but is very simple to use.
To be honest, after testing both of them, I opted to write my own classifier.
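For anyone who takes the "write your own converter" route: Weka's native input format is ARFF, a plain-text file with a header of @attribute declarations followed by an @data section. The following is only a minimal sketch of such a converter, assuming one string attribute for the raw text and one nominal attribute for the class. The names LabeledText, arff_quote, write_arff and to_arff are made up for this illustration, and turning the string attribute into word features is still left to Weka.

```cpp
#include <cassert>
#include <ostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical input record: one document and its class label.
struct LabeledText {
    std::string text;
    std::string label;
};

// Wraps a value in single quotes, backslash-escaping quotes and backslashes.
// Line breaks and tabs are flattened to spaces so each instance stays on one line.
std::string arff_quote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'' || c == '\\') {
            out += '\\';
            out += c;
        } else if (c == '\n' || c == '\r' || c == '\t') {
            out += ' ';
        } else {
            out += c;
        }
    }
    out += "'";
    return out;
}

// Writes the ARFF header (relation, text attribute, nominal class) and the data rows.
void write_arff(std::ostream& os, const std::string& relation,
                const std::vector<LabeledText>& docs) {
    assert(!relation.empty() && "ARFF needs a relation name");

    // The nominal class attribute must list every label that occurs.
    std::set<std::string> labels;
    for (const auto& d : docs) labels.insert(d.label);

    os << "@relation " << arff_quote(relation) << "\n\n";
    os << "@attribute text string\n";
    os << "@attribute class {";
    bool first = true;
    for (const auto& l : labels) {
        if (!first) os << ',';
        os << arff_quote(l);
        first = false;
    }
    os << "}\n\n@data\n";
    for (const auto& d : docs) {
        os << arff_quote(d.text) << ',' << arff_quote(d.label) << '\n';
    }
}

// Convenience wrapper that returns the ARFF document as a string.
std::string to_arff(const std::string& relation,
                    const std::vector<LabeledText>& docs) {
    std::ostringstream os;
    write_arff(os, relation, docs);
    return os.str();
}
```

Passing a std::ofstream to write_arff streams the rows straight to disk, so the whole corpus never has to sit in one string.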

I'm really enjoying Weka compared to Mallet. Maybe I don't know enough yet, but doing machine learning with a GUI is awesome. You can tweak parameters and run different experiments (keeping the results of past experiments in front of you, too) very easily. I'm new to Weka, so this is FWIW.
As far as which one is simpler to train, I find Weka simpler. I don't know what kind of control you can have over your feature space by just pointing Mallet at some text (maybe it's good enough), but my experience with Mallet was comparable to Weka... writing scripts to get the input in the proper format, with the caveat that I had to do multiple steps to utilize some kind of serialized version of the data in Mallet.
Regarding your other questions, I can't really answer them right now, but am hoping this answer doesn't get downvoted 'cause it's good information to be out there, anyway.

Related

The best design for a command-line tool [closed]

Closed 9 years ago.
It's a simple task; let me briefly describe it.
I'm supposed to write a command-line tool that takes a file name as an argument. The file consists of lines, and each line is a command to execute, followed by the arguments to apply it to. To make it clear:
FILE
sum; 1, 2, 3, 4
Output
10
The command-line tool should satisfy these requirements:
1- Easy to maintain and extend (more commands might be added in the future), and user-friendly.
2- Command-line arguments might be modified, and new ones could be added.
3- Can live as an open-source project with an organised source tree.
I expect other developers to work with the source code and understand it reasonably well.
I'm new to this and fairly new to design patterns, so I don't know much. I want to follow best practices in developing this program and use design patterns where applicable, so please advise me on how to write this tool well. I don't want to write dirty code; I want high-quality code that does what it's intended to do and can easily be developed further.
Please advise, and feel free to criticize what I've just said.
One last thing: I'll be using C++!
Thanks!

1- Boost.Program_options is your friend when it comes to parsing command line options.
2- Take a look at the command pattern. Although it is easier to implement in a language that has reflection facilities, it is still possible to have a map of "command strings" to functions. Please use C++11 facilities for this, i.e. std::function.
3- There is no standard structure for C++ projects. Personally, I use Boost's recommended structure.
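The map of command strings to std::function objects from point 2 could look roughly like this. It is only a sketch under stated assumptions: the line format "name; a, b, c" is taken from the question, arguments are assumed to be integers, and the names Command, registry, trim and execute_line (as well as the extra "product" command) are invented for the illustration.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <numeric>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// A command receives its parsed integer arguments and returns a result.
using Command = std::function<long(const std::vector<long>&)>;

// Registry of known commands; supporting a new command means adding one entry.
inline const std::map<std::string, Command>& registry() {
    static const std::map<std::string, Command> commands = {
        {"sum", [](const std::vector<long>& a) {
             return std::accumulate(a.begin(), a.end(), 0L);
         }},
        {"product", [](const std::vector<long>& a) {
             return std::accumulate(a.begin(), a.end(), 1L,
                                    std::multiplies<long>());
         }},
    };
    return commands;
}

// Strips leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const auto first = s.find_first_not_of(" \t\r\n");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t\r\n");
    return s.substr(first, last - first + 1);
}

// Parses one line of the form "name; a, b, c" and runs the matching command.
long execute_line(const std::string& line) {
    const auto sep = line.find(';');
    if (sep == std::string::npos) {
        throw std::invalid_argument("missing ';' in: " + line);
    }
    const std::string name = trim(line.substr(0, sep));

    // Split the remainder on commas and convert each piece to a number.
    std::vector<long> args;
    std::istringstream rest(line.substr(sep + 1));
    std::string token;
    while (std::getline(rest, token, ',')) {
        token = trim(token);
        if (!token.empty()) args.push_back(std::stol(token));
    }

    const auto it = registry().find(name);
    if (it == registry().end()) {
        throw std::invalid_argument("unknown command: " + name);
    }
    assert(it->second && "registered handlers must be callable");
    return it->second(args);
}
```

With this layout, execute_line("sum; 1, 2, 3, 4") yields 10 as in the question's example, and the file-reading loop only has to call execute_line once per line and print the result.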

Easy to use PNG lib? [closed]

Closed 10 years ago.
Could someone recommend a simple, easy-to-use PNG library for either C++ or .NET? All it needs to do is load big PNG images (say 20000x20000) and tell me what color each pixel has.
The Bitmap class in .NET can't load big images; it throws an OutOfMemory exception.
I spent a reasonable amount of time on Google looking through C++ libs, but all of them do much more than I need, and their usage is too complicated for me.

The de facto standard library for PNG files is libpng. It's not the best-designed API in the world, but if you just work through the steps in one of their tutorials, it's pretty hard to mess up.
You'll probably find it easiest to wrap their API in a few simple functions (or class) of your own. Once you have that done, you should be good to go.

Try this:
http://nothings.org/stb_image.c
You can use it instead of zlib as well.

If C# is an option, try PNGCS. I wrote it (originally in Java) for this scenario; it lets you read and write line by line, so there is no need to hold all the data in memory.
I have tested that it can read and write huge files (30000 x 30000 pixels, more than 2 GB on disk), at least in Java.

For C++:
Depending on which licenses you are able to use, you may have a look at:
DevIL: http://openil.sourceforge.net/ (a bit outdated, but still a good choice) (Linux, Win)
ImageMagick: http://www.imagemagick.org/script/index.php (well maintained, all platforms)
Both support a variety of input and output formats.
EDIT: DevIL is now also on GitHub: https://github.com/DentonW/DevIL

Which one is faster, HBase or Hypertable? [closed]

Closed 10 years ago.
I have a requirement to store millions of records, all of them unique, with multiple columns.
For example:
eventcode description count
526 blocked 100
5230 xxx 20
....
When fetching, I need to sort on the count column and filter on columns.
I thought of using HBase, but after googling I learned that Hypertable is faster.
So I am a bit confused about which one to choose.
Please help me with this.
Note: I want to use C++ for transactions (reading, writing).

BIG disclaimer: I work for Hypertable.
We created a benchmark a while ago, which you can read here: http://hypertable.com//why_hypertable/hypertable_vs_hbase_2/
Conclusion: Hypertable is faster, usually twice as fast.
Performance was actually the reason why Hypertable was founded. Back then some guys were sitting together and discussing an open-source implementation of Google's Bigtable architecture. They did not agree on the programming language (Java vs. C++; the disagreement was about performance). As a result, one group founded Hypertable (a C++ implementation) and the other group started working on HBase (in Java).
If you do not trust benchmarks, then you will have to run your own; both systems are open source and free to use. If you have questions about Hypertable or run into problems while evaluating it, feel free to drop me a mail (or use the mailing list, where all questions get answered).
Btw, Hypertable does not (yet) support sorting. You will have to implement this in your client application.
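Client-side sorting of fetched rows is straightforward in C++. The sketch below assumes the rows have already been read from the store into memory; it does not use any Hypertable API. The EventRecord struct (mirroring the eventcode, description and count columns from the question) and the function filter_and_sort_by_count are hypothetical names for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical in-memory row with the columns from the question.
struct EventRecord {
    int event_code;
    std::string description;
    long count;
};

// Keeps the rows accepted by `keep`, then orders them by count, highest first.
template <typename Pred>
std::vector<EventRecord> filter_and_sort_by_count(
        const std::vector<EventRecord>& rows, Pred keep) {
    std::vector<EventRecord> out;
    std::copy_if(rows.begin(), rows.end(), std::back_inserter(out), keep);

    const auto by_count_desc = [](const EventRecord& a, const EventRecord& b) {
        return a.count > b.count;
    };
    // stable_sort preserves the fetch order of rows with equal counts.
    std::stable_sort(out.begin(), out.end(), by_count_desc);
    assert(std::is_sorted(out.begin(), out.end(), by_count_desc));
    return out;
}
```

With millions of records, filtering before sorting keeps the sorted set small; if the filtered result still does not fit in memory, this in-memory approach no longer applies and the sort has to happen elsewhere.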

How to create a tool using C++ to parse XML files [closed]

Closed 10 years ago.
I have some XML files that contain information about different functions. I am trying to create a tool that can extract that information from these files once they are created (function name, number of arguments, types, return values, etc.).
Later I will be manipulating this extracted information to create a new XML file. I have limited programming experience, and it's all in C++. Any hint to get started would be appreciated.

If you just want to be able to read and write XML files, it's probably best to use an XML library rather than reinventing the wheel.
Since it's not completely clear what you're trying to do, a good place to start would be this thread: What is the best open XML parser for C++? [Stack Overflow]
It's currently closed but it has a couple good answers that will help you figure out the best library to use for your situation.
If you need more help using it, feel free to edit your question, comment, or post a more specific question on the topic.
Happy coding and good luck!

I would suggest TinyXML. It's small, lightweight, and suitable in many cases, and it has a non-viral license.
I have used it a lot and found it very useful. Just in case, there's also TinyXPath.

Unless you are doing this as a project to improve your C++ ability and understanding of XML, I would advise against trying to write the parsing code yourself. You will get much faster results using something out there that is already written and well established and there are a number of choices that are open to you. I personally like RapidXML. It is very simple to add to your code (you just need to #include one or two .hpp files - no libraries are needed) and it does everything that I have needed so far which is mainly parsing data from SOAP responses. The site also provides a comprehensive tutorial which enables you to get up and running very quickly.

Useful features to learn in boost for immediate use [closed]

Closed 11 years ago.
I know this could be seen as subjective off-the-cuff (thus a poor question), but bear with me.
Boost has recently become available on the project on which I'm working, and I don't have much experience with it. Boost has so many parts and features that it's hard to know where to get started in learning it - especially since I'll be trying to learn it while making production code.
So, I would greatly appreciate it if someone could list around 3 to 5 features which are very useful in general, every-day programming and state why they're useful. I'm not asking you which is best, or trying to get a debate - I just want to know some good features to start learning and using immediately. I don't need code samples either, I'll be more than happy to research how to use the features myself after I know which ones are sensible to start learning now.
I'll accept any answer with a concise list of features that are sensible :)

format and lexical_cast are great for string manipulation. I find them invaluable and use them every day.
bind is great for ad hoc functors. You'll find it is reused throughout many of the Boost libraries.
multi_index fills the gap when you need the same data in two search structures at once. It is very handy at times. Keep it out of your headers, though.
type_traits defines useful traits for template specializations.
signals is a signal/slot mechanism implementation, great for event driven designs.

shared_ptr is critical. It lets you manage memory automatically.
http://www.boost.org/doc/libs/1_47_0/libs/smart_ptr/shared_ptr.htm
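To make the "automatic" part concrete, here is a small sketch of shared ownership. It uses std::shared_ptr and std::make_shared from C++11, which were standardized from the Boost design; with Boost itself the equivalent names live in namespace boost (headers boost/shared_ptr.hpp and boost/make_shared.hpp). The Connection type and shared_ownership_demo are invented for the illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical resource that tracks how many instances are currently alive.
struct Connection {
    Connection(std::string n, int& live)
        : name(std::move(n)), live_count(live) { ++live_count; }
    ~Connection() { --live_count; }
    std::string name;
    int& live_count;
};

// Returns the number of live Connection objects after all owners are gone.
int shared_ownership_demo() {
    int live = 0;
    {
        std::shared_ptr<Connection> a = std::make_shared<Connection>("db", live);
        assert(a.use_count() == 1);
        {
            std::shared_ptr<Connection> b = a;  // second owner of the same object
            assert(a.use_count() == 2);
        }                                       // b is gone, the object survives
        assert(live == 1);
    }                                           // last owner gone, destructor runs
    return live;
}
```

No delete appears anywhere: the object is destroyed when the last shared_ptr that owns it goes out of scope.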
