Testing if a string contains one of several thousand substrings - regex

I'm going to be running through live twitter data and attempting to pull out tweets that mention, for example, movie titles. Assuming I have a list of ~7000 hard-coded movie titles I'd like to look against, what's the best way to select the relevant tweets? This project is in it's infancy so I'm open to any looking into any solution (i.e. language agnostic.) Any help would be greatly appreciated.
Update: I'd be curious if anyone had any insight to how the Yahoo! Placemaker API, solves this problem. It can take a text string and return a geocoded JSON result of all the locations mentioned in it.

You could try Wu and Manber's A Fast Algorithm For Multi-Pattern Searching.
The multi-pattern matching problem lies at the heart of virus scanning, so you might look to scanner implementations for inspiration. ClamAV, for example, is open source and some papers have been published describing its algorithms:
Lin, Lin and Lai: A Hybrid Algorithm of Backward Hashing and Automaton Tracking for Virus Scanning (a variant of Wu-Manber; the paper is behind the IEEE paywall).
Cha, Moraru, et al: SplitScreen: Enabling Efficient, Distributed Malware Detection

If you use compiled regular expressions, it should be pretty fast. Maybe especially if you put lots of titles in one expression.

Efficiently searching for many terms in a long character sequence would require a specialized algorithm to avoid testing for every term at every position.
But since it sounds like you have short strings with a known pattern, you should be able to use something fairly simple. Store the set of titles you care about in a hash table or tree. Parse out "string1" and "string2" from each tweet using a regex, and test whether they are contained in the set.

Working off what erickson suggested, the most feasible search is for the ("is better than" in your example), then checking for one of the 7,000 terms. You could instead narrow the set by creating 7,000 searches for "[movie] is better than" and then filtering manually on the second movie, but you'll probably hit the search rate limit pretty quickly.
You could speed up the searching by using a dedicated search service like Solr instead of using text parsing. You might be able to pull out titles quickly using some natural language processing service (OpenCalais?), but that would be better suited to batch processing.

For simultaneously searching for a large number of possible targets, the Rabin-Karp algorithm can often be useful.

Related

Google Natural Language Sentiment Analysis incorrect result

We have Google Natural AI integrated into our product for Sentiment Analysis (https://cloud.google.com/natural-language). One of the customers complained that when they write "BAD" then it shows a positive sentiment.
On further investigation, we found that when google Sentiment Analysis Natural Language API is called with input as BAD or Bad (pls see its in all caps or first letter caps ), it identifies text as an entity (a location or consumer good) & sends back the result as Positive while when we write "bad" in all small case, it sends negative.
Has anyone faced a similar problem? How did you solve it?
One obvious way looks like converting text into a small case but that may break some other use cases (maybe where entities do not get analyzed due to a small case text). Another way which we are building is to use our own dictionary of words with sentiments before calling google APIs but that doesn't answer the said problem, which may occur with any other text.
Inputs will help us. Thank you!
The NLP API uses an underlying model that is neural in nature. The knowledge comes from training on real world text. It is normal to get different results for different capitalizations as they can relate to different uses of the same trigram, e.g. Mike (person), mike (microphone, slang), MIKE (military alphabet entry).
The second key aspect is that the model is tuned and meant to be used on larger pieces of text and not on single words, hence good results can not be expected in this case.

How generate random word from real languages

How I can generate random word from real language?
Anybody know any API from internet with this functional?
For example I send http-request to 'ht_tp://www.any...api.com/getword?lang=en' and I get responce 'Town'. Or 'Fast'. Or 'Received'... For example I send http-request to 'ht_tp://www.any...api.com/getword?lang=ru' and I get responce 'Ходить'. Or 'Шапка'. Or 'Отправлено'... Any form (noun, adjective, verb etc...) of the words of the any language.
I find resource 'http://www.randomlists.com/random-words'. But this is not JSON format, only English, and don't any warranty work in long time.
Please any ideas.
See this answer : https://stackoverflow.com/questions/824422/can-i-get-an-english-dictionary-word-list-somewhere Download a word dictionary, stick in the databse and fetch a random record or read a random line from the file each time. This way you don't depend on 3rd party API and you can extend it in all the languages you can find words for.
You can download the OpenOffice dictionaries. They come as extension (oxt), which is nothing different than a ZIP file. You may open them with 7zip or alike. Within you will find lots of files, interesting for you are the *.dic files. They will also contain resolutions or number words.
When you encounter something like abandon/LdS get rid of the /LdS this is used for hunspell.
Take these *.dic files use their name as key, put them into a database and pick a random word from there for a given language code.
Update
Older, but easier to access, the archived hunspell dictionaries from OpenOffice.
This question can be viewed in two ways and therefore I give two answers:
To collect words, I would run a spider on websites with known language (Wikipedia is a good starting point) and strip HTML tags.
To generate words from a real language is trickier. Using statistics from the collected words, it is possible to use Markow chains that produces statistically real words. I have tried letter by letter generation, and that works poorly. It is probably a better approach to use syllable construction instead.

Search many strings over a very large text

I have like 2 million strings and I need to search each of them over a 1 TB text data. Searching all of them is not a best solution to do, so I was thinking about a better way to create a data structure like trie for all of the strings. In other words, a trie in which each node in that is a word. I wanted to ask, is there any good algorithm, data structure or library (in C++) for this purpose?
Let me be more descriptive in this question fellows,
For instance, I have these strings:
s1- "I love you"
s2- "How are you"
s3- "What's up dude"
And I have many text data like:
t1- "Hi, my name is Omid and I love computers. How are you guys?"
t2- "Your every wish will be done, they tell me..."
t3
t4
.
.
.
t10000
Then I want to consider each of texts and search for each of strings on them. At last for this example I would just say: t1 contains s1 and nothing else.
I am looking for an efficient way to search for strings but not foolishly for each of them each time.
I'm sorry to post a link only answer, but if you don't mind reading research paper, the definitive reference on string matching algorithms seems to me to be http://www-igm.univ-mlv.fr/~lecroq/string/ and the following research paper by Simone Faro and Thierry Lecroq where they compared the relative performance of no less that 85 different string matching algorithms. I'm pretty sure there is one fitting your need among them.
I would strongly suggest that you use CLucene (http://clucene.sourceforge.net/) which is a port from the Apache Lucene project. This will build you an inverted index and make text searching very fast. If changing languages is an option consider doing this in Java as the CLucene version is a bit out of date. It will be slower but has more features.

Best approach for doing full-text search with list-of-integers documents

I'm working on a C++/Qt image retrieval system based on similarity that works as follows (I'll try to avoid irrelevant or off-topic details):
I take a collection of images and build an index from them using OpenCV functions. After that, for each image, I get a list of integer values representing important "classes" that each image belongs to. The more integers two images have in common, the more similar they are believed to be.
So, when I want to query the system, I just have to compute the list of integers representing the query image, perform a full-text search (or similar) and retrieve the X most similar images.
My question is, what's the best approach to permorm such a search?
I've heard about Lucene, Lemur and other indexing methods, but I don't know if this kind of full-text searchs are the best way, given the domain is reduced (only integers instead of words).
I'd like to know about the alternatives in terms of efficiency, accuracy or C++ friendliness.
Thanks!
It sounds to me like you have a vectorspace model, so Lucene or a similar product may work well for you. In general, an inverted-index model will be good if:
You don't know the number of classes in advance
There are a lot of classes relative to the number of images
If your problem doesn't fit these criteria, a normal relational DB might work better, as Thomas suggested. If it meets #1 but not #2, you could investigate one of the "column oriented" non-relational databases. I'm not familiar enough with these to tell you how well they would work, but my intuition is that you'll need to replicate a lot of the functionality in an IR toolkit yourself.
Lucene is written in Java and I don't know of any C++ ports. Solr exposes Lucene as a web service, so it's easy enough to access it that way from whatever language you choose.
I don't know much about Lemur, but it looks like it has a similar vectorspace model, and it's written in C++, so that might be easier for you to use.
You can take a look at Lucene for image retrieval (LIRE) here: http://www.semanticmetadata.net/2006/05/19/lire-lucene-image-retrieval-04-released/
If I'm mistaken, you are trying to implement a typical bag of words image retrieval am I correct? If so you are probably trying to build an inverted file index. Lucene on its own is not suitable as you probably have already realized as it index text instead of numbers. Using its classes for querying the index would also be a problem as it is not designed to "parse" (i.e. detect keypoints, extract descriptors then vector-quantize them) image into the query vector.
LIRE on the other hand have been modified to index feature vectors. However, it does not appear to work out of the box for bag of words model. Also, I think I've read on the author's website that it currently uses brute force matching rather than the inverted file index to retrieve the images but I would expect it to be easier to extend than Lucene itself for your purposes.
Hope this helps.

Looking for Ideas: How would you start to write a geo-coder?

Because the open source geo-coders cannot begin to compare to Google's or even Yahoo's, I would like to start a project to create a good open source geo-coder. Just to clarify, a geo-coder takes some text (usually with some constraints) and returns one or more lat/lon pairs.
I realize that this is a difficult and garguntuan task, so I am wondering how you might get started. What would you read? What algorithms would you familiarize yourself with? What code would you review?
And also, assuming you were going to develop this very agilely, what would you want the first prototype to be able to do?
EDIT: Let's set aside the data question for now. I am going to use OpenStreetMap data, along with a database of waypoints that I have. I would later plan to include other data sets as well, and I realize the geo-coder would be inherently limited by the quality of the original data.
The first (and probably blocking) problem would be: where do you get your data from? (unless you are willing to pay thousands of dollars for proprietary sets).
You could build a geocoding-api on top of OpenStreetMap (they publish their data in dumps on a regular basis) I guess, but that one was still very incomplete last time I checked.
Algorithms are easy. Good mapping data, however, is expensive. Very expensive.
Google drove their cars all over the world, collecting this data among other things.
From a .NET point of view these articles might be interesting for you:
Writing Your Own GPS Applications: Part I
Writing Your Own GPS Applications: Part 2
Writing GIS and Mapping Software for .NET
I've only glanced at the articles but they've been on CodeProject's 'Most Popular' list for a long time.
And maybe this CodePlex project which the author of the articles above made available.
I would start at the absolute beginning by figuring out how you're going to get the data that matches a street address with a geocode. Either Google had people going around with GPS units, OR they got the information from some existing source. That existing source may have been... (all guesses)
The Postal Service
Some existing maps(printed)
A bunch of enthusiastic users that were early adopters of GPS technology who ere more than willing to enter in street addresses and GPS coordinates
Some government entity (or entities)
Their own satellites
etc
I guess what I'm getting at is the information was either imported from somewhere or was input by someone via some interface. As my starting point I would look at how to get that information. In an open source situation, you may be able to get a bunch of enthusiastic people to enter information.
So for my first prototype, boring as it would be, I would create a form for entering information.
Then you need to know the math for figuring out the closest distance (as the crow flies). From there, try to figure out how to include roads. (My guess is you would have to have data point for each and every curve, where you hold the geocode location of the curve, and the angle of the road on a north/south and east/west vector. You'd probably need to take incline into account, too to get accurate road measurements.)
That's just where I'd start.
But in all honesty, I wouldn't even start on this. Other programmers have done it already, I'm more interested in what hasn't already been done.
get my free raw data from somewhere like http://ipinfodb.com/ip_database.php
load it into a database, denormalizing for fast lookups
design my API
build it out as a RESTful web service
return results in varying formats: JSON, XML, CSV, raw text
The first prototype should accept a ZIP code and return lat/lon in raw text.