Open source concept mining tools? - data-mining

Are there any open source concept mining tools available today? I have only come across Leximancer, which, although it seems to fit the role, is not open source and is quite expensive for an undergraduate student. I have been unsuccessful so far because the word 'concept' on both Google and Google Scholar keeps matching things other than what I want.

It seems to me you need a text mining tool for clustering. RapidMiner has an open-source, Java-based Community Edition with several extensions (Text Mining, R, etc.). In addition, you can develop and integrate your own algorithms.
Moreover, Rexer Analytics runs a comprehensive data mining survey annually; you can request the reports for free.
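If you want to see the underlying idea in code rather than through RapidMiner's GUI, here is a minimal sketch of text clustering using Weka, another open-source (Java) option that comes up later in this thread; the input file name and the cluster count are placeholder assumptions:

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TextClusteringSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder: an ARFF file with one string attribute per document.
        Instances raw = DataSource.read("documents.arff");

        // Turn each document into a bag-of-words vector.
        StringToWordVector toVectors = new StringToWordVector();
        toVectors.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, toVectors);

        // Group the documents; 5 clusters is an arbitrary choice.
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(5);
        kmeans.buildClusterer(vectors);

        for (int i = 0; i < vectors.numInstances(); i++) {
            System.out.println("doc " + i + " -> cluster "
                    + kmeans.clusterInstance(vectors.instance(i)));
        }
    }
}
```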

Related

Existing API for NLP in C++?

Are there any existing C++ NLP APIs out there? The closest thing I have found is CLucene, a port of Lucene. However, it seems a bit obsolete and the documentation is far from complete.
Ideally, this API would support tokenization, stemming, and PoS tagging.
FreeLing is written in C++ too, although most people just use its binaries to run the tools: http://devel.cpl.upc.edu/freeling/downloads?order=time&desc=1
Try something like DyNet; it's a generic neural net framework, but much of its development focuses on NLP because its maintainers are prominent members of the NLP community.
Or perhaps Marian-NMT: it was designed for sequence-to-sequence machine translation, but many NLP tasks can be cast as sequence-to-sequence problems.
Outdated
Maybe you can try Ellogon (http://www.ellogon.org/); it has GUI support and also a C/C++ API for NLP.
If you remove the restriction on C++, you get the excellent NLTK (Python);
the remaining effort is then interfacing between Python and C++.
Apache Lucy would get you part of the way there. It is under active development.
Maybe you can use Weka-C++. It's the very popular Weka library for machine learning and data mining (including NLP), ported from Java to C++.
Weka supports tokenization and stemming; you'll probably need to train a classifier for PoS tagging.
I only used Weka with Java, though, so I'm afraid I can't give you more details on this version.
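For what it's worth, here is roughly what the tokenization and stemming steps look like on the Java side; a minimal sketch, with the caveat that, depending on your Weka version, SnowballStemmer may need the separate Snowball jar on the classpath:

```java
import weka.core.stemmers.SnowballStemmer;
import weka.core.tokenizers.WordTokenizer;

public class TokenizeAndStemSketch {
    public static void main(String[] args) {
        // Splits on whitespace and common punctuation by default.
        WordTokenizer tokenizer = new WordTokenizer();
        tokenizer.tokenize("The cats were running quickly through the gardens");

        // Defaults to the English Snowball stemmer.
        SnowballStemmer stemmer = new SnowballStemmer();

        // In older Weka versions nextElement() returns Object and needs a cast.
        while (tokenizer.hasMoreElements()) {
            String token = tokenizer.nextElement();
            System.out.println(token + " -> " + stemmer.stem(token));
        }
    }
}
```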
There is TurboParser by André Martins at CMU, which also has a Python wrapper. There is an online demo for it.
This project provides free (even for commercial use) state-of-the-art information extraction tools. The current release includes tools for performing named entity extraction and binary relation detection, as well as tools for training custom extractors and relation detectors.
MITIE is built on top of dlib, a high-performance machine-learning library, and makes use of several state-of-the-art techniques, including distributional word embeddings and structural support vector machines. MITIE offers several pre-trained models providing varying levels of support for both English and Spanish, trained using a variety of linguistic resources (e.g., CoNLL 2003, ACE, Wikipedia, Freebase, and Gigaword). The core MITIE software is written in C++, but bindings for several other programming languages, including Python, R, Java, C, and MATLAB, allow a user to quickly integrate MITIE into their own applications.
https://github.com/mit-nlp/MITIE

What is the Open Cloud Computing Interface?

I am going to develop a cloud application, and while researching state-of-the-art tools in cloud computing I saw some references to OCCI (Open Cloud Computing Interface).
I was not able to find answers to the following questions:
1) Is it easy to use this interface?
2) What programming languages does this interface support?
3) Is this interface mature enough?
Any information is appreciated!
This question has been asked quite some time ago but, hopefully, the answer is still relevant.
Is it easy to use?
Depends on what you want. If you want to make your own implementation, then probably not. If you use one of the existing implementations (see below), then yes.
What programming languages does this interface support?
We know about two implementations (libraries, CLI), which are for Ruby and Java. See:
https://wiki.egi.eu/wiki/rOCCI:ROCCI
https://github.com/EGI-FCTF/jOCCI-api
rOCCI (the first one) also has a server side (the rOCCI-server) that translates OCCI to proprietary cloud management platforms such as OpenNebula.
Is this Interface mature enough?
Yes, given that it is being used by real-world infrastructures, among them, e.g., the EGI Federated Cloud. That said, the current OCCI specification (1.1) has a few shortcomings that will be addressed in version 1.2 (due in Autumn 2015), so if you are just starting a project, it is worth implementing against 1.2 already.
Many of your questions can be answered (positively, by the way!) by visiting the OCCI-WG home site at http://occi-wg.org and/or searching on "occi implementation".
Another recent and useful resource is the set of tutorials and workshop talks given at the recent Cloud Interoperability Week, held simultaneously in Madrid and Santa Clara as part of the Cloud Plugfest hands-on developer training series:
Or generally at http://www.cloudplugfest.org/
The basic specs are published by the Open Grid Forum.
The Open Cloud Computing Interface (OCCI) is a set of specifications delivered through the Open Grid Forum, for cloud computing service providers. OCCI has a set of implementations that act as proofs of concept. It builds upon World Wide Web fundamentals by using the Representational State Transfer (REST) approach for interacting with services.
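To give a feel for that REST interaction, here is a minimal Java sketch that queries an OCCI server's discovery interface (the /-/ path defined by the OCCI HTTP rendering). The host and port are placeholders, and real deployments typically also require X.509 or token authentication, which is omitted here:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class OcciDiscoverySketch {
    public static void main(String[] args) throws Exception {
        // Every OCCI-compliant server exposes its capabilities at /-/ .
        URL url = new URL("https://occi.example.org:11443/-/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Ask for the plain-text OCCI rendering.
        conn.setRequestProperty("Accept", "text/plain");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            // Each line describes a kind, mixin, or action the server
            // supports, e.g.:
            // Category: compute; scheme="http://schemas.ogf.org/occi/infrastructure#"; class="kind"
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```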

Starting with Data Mining

I have started learning data mining and wish to create a small project in C++/Java that allows me to use a database, say from Twitter, and then publish a particular set of results (e.g., all the news items on a feed). How should I go about it? Where should I start?
This is a really broad question, so it's hard to answer. Here are some things to consider:
Where are you going to get the data? You mention Twitter, but you'll still need to collect the data in some way. There are probably libraries out there for listening to Twitter streams, or you could probably buy the data if someone is selling it.
Where are you going to store the data? Depending on how much you'll have and what you plan to do with it, a traditional relational database may or may not be the best fit. You may be better off with something that supports running MapReduce jobs out of the box.
Based on the answers to those questions, the choice of programming languages and libraries will be easier to make.
If you're really set on Java, then I think a Hadoop cluster is probably what you want to start out with. It supports writing mapreduce jobs in Java, and works as an effective platform for other systems such as HBase, a column-oriented datastore.
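As a concrete illustration, here is a minimal sketch of such a MapReduce job, written against the Hadoop 2.x API, that counts hashtag occurrences in a text file of tweets (one tweet per line); the input and output paths are command-line placeholders:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HashtagCount {

    // Emits (hashtag, 1) for every hashtag found in a tweet.
    public static class HashtagMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text hashtag = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.startsWith("#")) {
                    hashtag.set(token.toLowerCase());
                    context.write(hashtag, ONE);
                }
            }
        }
    }

    // Sums the counts emitted for each hashtag.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hashtag count");
        job.setJarByClass(HashtagCount.class);
        job.setMapperClass(HashtagMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```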
If your data are going to be fairly regular (that is, not much variation in structure from one record to the next), maybe Hive would be a better fit. With Hive, you can write SQL-like queries, given only data files as input. I've never used Mahout, but I understand that its machine learning capabilities are suited for data mining tasks.
These are just some ideas that come to mind. There are lots of options out there and choosing between them has as much to do with the particular problem you're trying to solve and your own personal tastes as anything else.
If you just want to start learning about data mining, there are two books that I particularly enjoy:
Pattern Recognition and Machine Learning. Christopher M. Bishop. Springer.
And this one, which is free:
http://infolab.stanford.edu/~ullman/mmds.html
Good references for you are:
1) an AI course taught by people who actually know the subject
2) the Weka website
3) machine learning datasets
4) even more datasets
5) a framework for supporting the mining of larger datasets
The first link is a good introduction to AI, taught by Peter Norvig and Sebastian Thrun: Google's Research Director and the creator of Stanley (the autonomous car), respectively.
The second link gets you to the Weka website. Download the software - which is pretty intuitive - and get the book. Make sure you understand all the concepts: what data mining is, what machine learning is, what the most common tasks are, and the rationale behind them. Play a lot with the examples - the software package bundles some datasets - until you understand what generated the results.
Next, move on to real datasets and play with them. When tackling massive datasets, you may face several performance issues with Weka, which is more of a learning tool as far as my experience can tell. Thus I recommend you take a look at the fifth link, which will get you to the Apache Mahout website.
It's far from a simple topic; however, it's quite interesting.
I can tell you how I did it.
1) I got the data using twitter4j.
2) I analyzed the data using JUNG.
You have to define a class representing edges and a class representing vertices.
These classes will contain the attributes of the edges and vertices.
3) Then there is a simple method to add an edge, g.addEdge(edgeFromV1ToV2, v1, v2), or to add a vertex, g.addVertex(v).
The class that defines edges or vertices is easy to create. As an example:
```java
public class MyEdge {
    int id;
}
```
The same is done for vertices; a fuller sketch putting these pieces together follows below.
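Here is a minimal, hedged sketch of steps 2 and 3 using JUNG 2.x (the twitter4j collection from step 1 is omitted, and the vertex/edge attributes are just example assumptions):

```java
import edu.uci.ics.jung.graph.DirectedSparseGraph;
import edu.uci.ics.jung.graph.Graph;

public class TwitterGraphSketch {

    // Example vertex type: one Twitter user.
    static class MyVertex {
        final String screenName;
        MyVertex(String screenName) { this.screenName = screenName; }
    }

    // Example edge type: one interaction between two users.
    static class MyEdge {
        final int id;
        MyEdge(int id) { this.id = id; }
    }

    public static void main(String[] args) {
        Graph<MyVertex, MyEdge> g = new DirectedSparseGraph<>();

        MyVertex alice = new MyVertex("alice");
        MyVertex bob = new MyVertex("bob");
        g.addVertex(alice);
        g.addVertex(bob);

        // In JUNG 2.x the edge object comes first: addEdge(edge, from, to).
        g.addEdge(new MyEdge(1), alice, bob);

        System.out.println(g.getVertexCount() + " vertices, "
                + g.getEdgeCount() + " edges");
    }
}
```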
Today I would do it with R, but if you don't want to learn a new programming language, just import JUNG, which is a Java library.
Data mining is a broad field with many different techniques: classification, clustering, association and pattern mining, outlier detection, etc.
You should first decide what you want to do and then decide which algorithm you need.
If you are new to data mining, I would recommend reading a book like Introduction to Data Mining by Tan, Steinbach, and Kumar.
I would suggest using Python or R for the data mining process. Doing this work in Java or C is a bit difficult, in the sense that you need to do a lot of coding.

Enterprise-grade template printing system

I'm looking for an enterprise-grade template printing system. I'm interested in every piece of software I can get my hands on to evaluate, commercial or not.
What I need is a separate system ready to receive tags in order to print (digitally or on paper) a template (like a contract, invoice, etc.). Templates should be managed by the same software. It should operate via web services or via an enterprise bus (preferably JMS or MQSeries connectors).
Can I ask for some names and possibly some URLs? Anything will be helpful, even if it does not fit the requirements exactly.
Thanks.
This is an old question, but for the Googlers out there: we use a couple of products to render documents written in XSL-FO (a W3C standard paper specification that we generate using XSL) to PDF, PostScript, etc. We use them to show documents online as well as to bulk print a few hundred thousand of them monthly.
RenderX (.NET, Java, whatever) provides a very powerful solution for our bulk printing needs.
IBEX PDF Creator (.NET only) for online rendering to PDF.
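The commercial renderers above each have their own APIs, but the embedding pattern is similar across XSL-FO processors. As an open-source illustration, here is a minimal, hedged sketch using Apache FOP 2.x (FOP is a stand-in here, not one of the products we use; the file names are placeholders):

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FopFactory;
import org.apache.fop.apps.MimeConstants;

public class InvoiceRendererSketch {
    public static void main(String[] args) throws Exception {
        FopFactory fopFactory = FopFactory.newInstance(new File(".").toURI());
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream(new File("invoice.pdf")))) {
            // Target PDF; other formats such as PostScript work the same way.
            Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
            // Apply the XSL template to the data XML, streaming the
            // resulting XSL-FO straight into the renderer.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new File("invoice-template.xsl")));
            Source data = new StreamSource(new File("invoice-data.xml"));
            Result res = new SAXResult(fop.getDefaultHandler());
            transformer.transform(data, res);
        }
    }
}
```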
Calligo is a commercial package from InSystems. Can't reach the web site right now; could be a bad sign.
Then there are these open source possibilities.

Data mining and Business Intelligence Technologies

I've noticed an increasing number of jobs asking for experience with data mining and business intelligence technologies. This sounds like an incredibly broad topic, but where would one go to develop at least a partial understanding of this stuff in case it comes up in an interview?
A very good book with practical examples is Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran.
Go read Data Mining: Practical Machine Learning Tools and Techniques (Second Edition).
Then use Weka on a pet project. Despite the name, this is a good book, and the Weka package has several levels of entry into the data-m... er machine-learning world.
Consider reading Ralph Kimball's books for an introduction to Business Intelligence.
Also, try not to stick to one technology vendor; every company has its own biased vision of BI, and you'll need a 360-degree overview.
Maybe you can also try working with a real BI product; it is otherwise almost impossible to get access to a data-filled, running SAS, MS, or Oracle installation. I work on a team that integrates the BI platform BellaDati for enterprises. For try-out and personal purposes it is free, with some datastore limitations ( http://www.trgiman.eu/en/belladati/product/personal ).
BellaDati is also used as a learning tool at technical universities focused on the practical application of data mining and analysis. Example manager-level dashboards built with BellaDati can be seen at http://mercato.belladati.com/bi/mercato/show/worldexchanges
You can work with SQL data sources, flat files, and web services, and just play. From my own experience, showing real samples of market analysis practice (like a case study) is good for an interview.
I wish you luck,
Peter