Compare two strings and find how closely they are related by meaning - data-mining

Problem:
I have two strings, say, "Billie Jean" and "Thriller". I need to programmatically compare them and find how closely they are related. Those are both songs of the same artist, hence, they should give a higher score (probability, percentage etc) than say, "Brad Pitt" and "Jamaican Farewell".
One way of doing this is an open source Java tool named WikipediaMiner which compares using the Wikipedia data dump, checking links, descriptions etc.
Question:
Please suggest a better alternative, that uses any or all of Wikipepdia, DBpedia, Freebase and their cousins, or combines a different approach. I would really prefer open source software that can be downloaded and set up on a server (eg. Apache Mahout), rather than a paid web service.

It's not so much a matter of programming, but of data.
So it's not really a question for StackOverflow.
What you really want is to use WordNet I guess. That is really meant as a database for reasoning about the meaning of words. So for example, the data explicitely states that data mining is a form of data processing. And which is a physical entity...
You see, the reasoning will be only as good as your data is.
DBPedia may also include a mapping from WordNet to Wikipedia maybe?

You can't tell that "Thriller" is a song, not a music video or film genre or Lambchop album without additional context.
After you've identified what your items are, it's "simply" a matter of traversing the graph of connections in Freebase, MusicBrainz, or whatever other information sources you are using.
You'll need to decide how you're going to weight things for scoring though. Are two Michael Jackson songs more closely related because they share the same type or are they more closely related to the artist Michael Jackson because they're directly connect to him?

Related

NER model to recognize Indian names

I am planning to use Named Entity Recognition (NER) technique to identify person names (most of which are Indian names) from a given text. I have already explored the CRF-based NER model from Stanford NLP, however it is not quite accurate in recognizing Indian names. Hence I decided to create my own custom NER model via supervised training. I have a fair idea of how to create own NER model using the Stanford NER CRF, but creating a large training corpus with manual annotation is something I would like to avoid, as it is a humongous effort for an individual and secondly obtaining diverse people names from different states of India is also a challenge. Could anybody suggest any automation/programmatic way to prepare a labelled training corpus with at least 100k Indian names?
I have already looked into Facebook and LinkedIn API, but did not find a way to extract 100k number of user's full name from a given location (e.g. India).
I ended up doing the following to create NER model to identify Indian names. This may be useful for anybody looking for creating a custom NER model to recognize non-English person names, since most of the publicly available NER models such as the ones from Stanford NLP were trained with English names and hence are more accurate in identifying English (British/American) names.
Find an Indian celebrity with Twitter account and having a huge number of followers in Twitter (for my case, I chose Sachin Tendulkar).
Create a program in the language of your choice to call the Twitter REST API (GET followers/list) to get the names of all the followers of the celebrity and save to a file. We can safely assume most of the followers would be Indians. Note that there is an API Rate Limit in place (30 requests per 15 minute window), so the program should be built in to handle that. For our case, we developed the program as a Windows Service which runs every 15 minutes.
Since some Twitter users' names may not be valid person names, it is advisable to add some rule-based logic (like RegEx) to filter seemingly real names and add only those to the file.
Once the file with real names is generated, create another program to create the training data file containing these names labelled/annotated as PERSON as well as non-entity names annotated as OTHER. If you are using Stanford NER CRF Classifier, the program should generate a training (TSV) file having two columns - one containing the word (token) and the second column mentioning the label.
Once the training corpus is generated programmatically, you can follow the below link to create your custom NER model to recognize Indian names:
http://nlp.stanford.edu/software/crf-faq.shtml#a
This website has done this for us!It provides with the solution for these problems:
Challenges in Indian Language NER
Indian languages belong to several language families, the major ones being the Indo-European languages, Indo-Aryan and the Dravidian languages.
The challenges in NER arise due to several factors. Some of the main factors are listed below
Morphologically rich - identification of root is difficult, require use of morphological analysers
No Capitalization feature - In English, capitalization is one of the main features, whereas that is not there in Indian languages
Ambiguity - ambiguity between common and proper nouns. Eg: common words such as "Roja" meaning Rose flower is a name of a person
Spell variations - In the web data is that we find different people spell the same entity differently - for example : In Tamil person name -Roja is spelt as "rosa", "roja".
The whole corpus is provided.
Named Entity Recognition for Indian Languages and English
Best of luck for getting passwords for the zip files!
cheers!
A proposition: you could try to exploite the India version of Wikipedia for training or to create automatically gazetteer.
I don't know if it is the efficient/quick solution but a lot of research exploits Wikipedia and his semi-structured content (for example, each page is annotated with several categories).
You can have a look at these articles to find an interesting idea for you:
https://scholar.google.fr/scholar?q=named+entity+recognition+using+wikipedia&btnG=&hl=fr&as_sdt=0%2C5

What's the difference between core data, essential data and sample data in hybris?

In the hybris wiki trails, there is mention of core data vs. essential data vs. sample data. What is the difference between these three types of data?
Ordinarily, I would assume that sample data is illustrative gobbledygook data created to populate the example apparel and electronics storefronts. However, the wiki trails suggest that core data is for non-store specific data and the sample data is for store specific data.
On the same page, the wiki states that core data contains cockpit and catalog definitions, email templates, CMS layout, and site definitions (countries and user groups impex are included in this as well). This seems rather store specific to me. Does anyone have an explanation for this?
Yes, I have an explanation. Actually a lot of this is down to arbitrary decisions I made on separating data between acceleratorcore and acceleratorsampledata extensions as part of the Accelerator in 4.5 (later these had y- prefix added).
Essential and Project Data are two sets of data that are used within hybris' init/update process. These steps are controlled for each extension via particular Annotations on classes and methods.
Core vs Sample data is more about if I thought the impex file, or lines, were specific to the sample store or were more general. You will notice your CoreSystemSetup has both essential and projectdata steps.
Lots of work has happened in various continents since then, so, like much of hybris now, its a bit of a mess.
There are a few fun bugs related to hybris making certain things part of essentialdata. But these are in the platform not something I can fix without complaining to various people etc.
To confuse matters further, there is the yacceleratorinitialdata extension. This extension was a way I hoped to make projects easier, by giving some impex skeletons for new sites and stores. This would be generated for you during modulegen. It has rotted heavily though since release, now very out of date.
For a better explanation, take a look at this answer from answers.sap.com.
Hybris imports two types of data on initialization and update processes; first is essentialdata and other one is projectdata.
Essentialdata is the coredata setup which is mandatory and will import when you run initialization or update.
sampledata is your projectdata and it is not mandatory it will import when you select project while updating the system.

how to realize a network-based book query system

Please notice that if i do not want to use database.
i am now learning Unix networking programming. And i have my university book library all book list in seperated txt files. for example, the 'b' begin books is stored in b.txt. all a-z book count is about 1 million record. a line for a book' name and detailed other info.
Now i want to do a program to provide the query service of book list, for example, giving a book name, it can return the detailed info of this bool is it exists.
So i need to first build a module to take the function of query.
Then write the server side to call the query module and get the result and sending the result to the client module.
My question is , if i do not using database. How to realize the query module using c/c++, just first locating the first letter, for example, H begin book name should find in H.txt or H1.txt and H2.txt, using fopen open the file, then read line by line, then compare with queried book name using strFind, strCmp similar function, if have then return the result. i just think this is a time consuming thing and is not realize for using. And if have any such query system could for reference not using database but is bearable in time?
There are several options. The cheapest option (=low development time, low maintenance, low hardware requirements), IMO, is to create a html page on a separate site that links to all the data files. Then you set up another page that uses google.com to search that site. Then you just tell the google web spider to index your site. That way you get excellent performance with minimal work. But... you don't get to program any C.
Simple solution using C:
Do as you yourself suggest. If you have lots of memory available for file caching the performance won't be so bad unless the load gets high.
There will still be some work to do with the rest of the solution since you should delegate the search to worker threads.
Intermediate solution using C:
Find a 3rd party search engine and integrate it with your network code.
Advanced solution using C:
Implement your own search engine.
The problem is WHY DON'T U WANT TO USE DATABASE ?
1. make it easier to deploy?
sqlite may a good choice .
2. trying another method ?
lucene is a good choice of information retrieval, which is written by java.
clucene is someone rewrite the lucene to c .
You may also need stemmer tool(get the root of words),ictclas(chinese words' term extract) etc .
3. would like to do anything by yourself ?
It is easy to manage text file in system , while , as for a "query system", store is not enough , the main problem is IR(information retrieval ).
You may learn something about index building, store and query the index

Best approach for doing full-text search with list-of-integers documents

I'm working on a C++/Qt image retrieval system based on similarity that works as follows (I'll try to avoid irrelevant or off-topic details):
I take a collection of images and build an index from them using OpenCV functions. After that, for each image, I get a list of integer values representing important "classes" that each image belongs to. The more integers two images have in common, the more similar they are believed to be.
So, when I want to query the system, I just have to compute the list of integers representing the query image, perform a full-text search (or similar) and retrieve the X most similar images.
My question is, what's the best approach to permorm such a search?
I've heard about Lucene, Lemur and other indexing methods, but I don't know if this kind of full-text searchs are the best way, given the domain is reduced (only integers instead of words).
I'd like to know about the alternatives in terms of efficiency, accuracy or C++ friendliness.
Thanks!
It sounds to me like you have a vectorspace model, so Lucene or a similar product may work well for you. In general, an inverted-index model will be good if:
You don't know the number of classes in advance
There are a lot of classes relative to the number of images
If your problem doesn't fit these criteria, a normal relational DB might work better, as Thomas suggested. If it meets #1 but not #2, you could investigate one of the "column oriented" non-relational databases. I'm not familiar enough with these to tell you how well they would work, but my intuition is that you'll need to replicate a lot of the functionality in an IR toolkit yourself.
Lucene is written in Java and I don't know of any C++ ports. Solr exposes Lucene as a web service, so it's easy enough to access it that way from whatever language you choose.
I don't know much about Lemur, but it looks like it has a similar vectorspace model, and it's written in C++, so that might be easier for you to use.
You can take a look at Lucene for image retrieval (LIRE) here: http://www.semanticmetadata.net/2006/05/19/lire-lucene-image-retrieval-04-released/
If I'm mistaken, you are trying to implement a typical bag of words image retrieval am I correct? If so you are probably trying to build an inverted file index. Lucene on its own is not suitable as you probably have already realized as it index text instead of numbers. Using its classes for querying the index would also be a problem as it is not designed to "parse" (i.e. detect keypoints, extract descriptors then vector-quantize them) image into the query vector.
LIRE on the other hand have been modified to index feature vectors. However, it does not appear to work out of the box for bag of words model. Also, I think I've read on the author's website that it currently uses brute force matching rather than the inverted file index to retrieve the images but I would expect it to be easier to extend than Lucene itself for your purposes.
Hope this helps.

Looking for Ideas: How would you start to write a geo-coder?

Because the open source geo-coders cannot begin to compare to Google's or even Yahoo's, I would like to start a project to create a good open source geo-coder. Just to clarify, a geo-coder takes some text (usually with some constraints) and returns one or more lat/lon pairs.
I realize that this is a difficult and garguntuan task, so I am wondering how you might get started. What would you read? What algorithms would you familiarize yourself with? What code would you review?
And also, assuming you were going to develop this very agilely, what would you want the first prototype to be able to do?
EDIT: Let's set aside the data question for now. I am going to use OpenStreetMap data, along with a database of waypoints that I have. I would later plan to include other data sets as well, and I realize the geo-coder would be inherently limited by the quality of the original data.
The first (and probably blocking) problem would be: where do you get your data from? (unless you are willing to pay thousands of dollars for proprietary sets).
You could build a geocoding-api on top of OpenStreetMap (they publish their data in dumps on a regular basis) I guess, but that one was still very incomplete last time I checked.
Algorithms are easy. Good mapping data, however, is expensive. Very expensive.
Google drove their cars all over the world, collecting this data among other things.
From a .NET point of view these articles might be interesting for you:
Writing Your Own GPS Applications: Part I
Writing Your Own GPS Applications: Part 2
Writing GIS and Mapping Software for .NET
I've only glanced at the articles but they've been on CodeProject's 'Most Popular' list for a long time.
And maybe this CodePlex project which the author of the articles above made available.
I would start at the absolute beginning by figuring out how you're going to get the data that matches a street address with a geocode. Either Google had people going around with GPS units, OR they got the information from some existing source. That existing source may have been... (all guesses)
The Postal Service
Some existing maps(printed)
A bunch of enthusiastic users that were early adopters of GPS technology who ere more than willing to enter in street addresses and GPS coordinates
Some government entity (or entities)
Their own satellites
etc
I guess what I'm getting at is the information was either imported from somewhere or was input by someone via some interface. As my starting point I would look at how to get that information. In an open source situation, you may be able to get a bunch of enthusiastic people to enter information.
So for my first prototype, boring as it would be, I would create a form for entering information.
Then you need to know the math for figuring out the closest distance (as the crow flies). From there, try to figure out how to include roads. (My guess is you would have to have data point for each and every curve, where you hold the geocode location of the curve, and the angle of the road on a north/south and east/west vector. You'd probably need to take incline into account, too to get accurate road measurements.)
That's just where I'd start.
But in all honesty, I wouldn't even start on this. Other programmers have done it already, I'm more interested in what hasn't already been done.
get my free raw data from somewhere like http://ipinfodb.com/ip_database.php
load it into a database, denormalizing for fast lookups
design my API
build it out as a RESTful web service
return results in varying formats: JSON, XML, CSV, raw text
The first prototype should accept a ZIP code and return lat/lon in raw text.