Search many strings over a very large text - C++

I have about 2 million strings and I need to search for each of them over roughly 1 TB of text data. Searching for them one at a time is not the best approach, so I was thinking of building a data structure such as a trie over all of the strings; in other words, a trie in which each node is a word. Is there a good algorithm, data structure, or library (in C++) for this purpose?
Let me make the question more concrete.
For instance, I have these strings:
s1- "I love you"
s2- "How are you"
s3- "What's up dude"
And I have many text data like:
t1- "Hi, my name is Omid and I love computers. How are you guys?"
t2- "Your every wish will be done, they tell me..."
t3
t4
.
.
.
t10000
Then I want to take each text and search it for every one of the strings. For this example the result would simply be: t1 contains s2 and nothing else.
I am looking for an efficient way to search for the strings, not a naive scan of every text for each string in turn.

I'm sorry to post a link-only answer, but if you don't mind reading research papers, the definitive reference on string matching algorithms seems to me to be http://www-igm.univ-mlv.fr/~lecroq/string/ and the associated research paper by Simone Faro and Thierry Lecroq, where they compared the relative performance of no fewer than 85 different string matching algorithms. I'm pretty sure there is one fitting your need among them.
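For the specific shape of this problem (millions of fixed patterns, one huge text), the family to look for in that survey is multi-pattern matching, and the classic member is Aho-Corasick: it builds exactly the trie the asker describes and adds failure links, so the whole corpus is scanned once regardless of the number of patterns. A minimal sketch of the idea in Python (function names are mine; the same structure translates directly to C++):

from collections import deque

def build_automaton(patterns):
    # goto: trie child links, fail: fallback links, out: patterns ending at a node
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    queue = deque(goto[0].values())
    while queue:                                  # BFS to compute failure links
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]            # inherit matches ending in a suffix
    return goto, fail, out

def search(text, goto, fail, out):
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:      # follow failure links on mismatch
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))  # (start offset, pattern)
    return hits

automaton = build_automaton(["I love you", "How are you", "What's up dude"])
print(search("Hi, my name is Omid and I love computers. How are you guys?", *automaton))
# [(42, 'How are you')]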

I would strongly suggest that you use CLucene (http://clucene.sourceforge.net/), a port of the Apache Lucene project. This will build you an inverted index and make text searching very fast. If changing languages is an option, consider doing this in Java, as the CLucene port is a bit out of date; the Java version will be slower but has more features.
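To illustrate why an inverted index pays off here: each text is tokenized once, and each word maps to the set of texts containing it, so a phrase lookup only touches the texts in the intersection of its words' sets instead of scanning everything. A toy sketch of the idea (names are made up; real Lucene also stores word positions so it can verify the words are adjacent):

from collections import defaultdict

def build_index(texts):
    index = defaultdict(set)          # word -> ids of texts containing it
    for tid, text in texts.items():
        for word in text.lower().split():
            index[word.strip(".,?!")].add(tid)
    return index

def candidate_texts(index, phrase):
    # Texts containing every word of the phrase; a positional check
    # (as Lucene does) would then confirm the words appear in order.
    sets = [index.get(w, set()) for w in phrase.lower().split()]
    return set.intersection(*sets) if sets else set()

texts = {"t1": "Hi, my name is Omid and I love computers. How are you guys?",
         "t2": "Your every wish will be done, they tell me..."}
index = build_index(texts)
print(candidate_texts(index, "How are you"))   # {'t1'}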

Related

Scratch project: To check if any word in a list is contained in an answer

I have the following Scratch project which has a "kind list" of words like: "good", "kind", "love", "come" etc.
A user should be able to enter any sentence containing any of these words, and the happy face should show.
Currently, if the user types "kind" the happy face shows, and if they type anything else, like "you are kind", the sad face shows.
How do I change this, in Scratch, such that if the user types in:
"you are kind" or
"how kind you are" or
"come here"
(any sentence containing any word in the "kindlist") the face is happy, else not.
I can only find a block that allows me to select the LIST and then the ANSWER and no other alternatives. What I want is the Python equivalent of:
answer = input("Say something")
if any word in answer is in the list:
    then do ...
For teaching purposes, I am trying to simplify what is on https://machinelearningforkids.co.uk/#!/newproject (the creation of the training set). Can this be done directly in Scratch or not? Or is this why the site has you generate blocks on their site first and import them?
Surely Scratch should have the capability to enter data into lists and then test them directly.
I've also tried using a loop (which doesn't quite work correctly either) but was hoping there was a far simpler way.
I guess Scratch deliberately offers a minimal set of functions: on the one hand, not to overwhelm beginners; on the other, to encourage students to piece together simple blocks into more complex systems.
Yes, a simple (sentence) contains (word) is all you get out of the box; you do need a loop to match a multi-word sentence against a multi-word whitelist.
Seems to me like you would be better off with some development environment that will at least give you some mature text parsing capabilities.
I'm not saying it's impossible to teach students about machine learning using Scratch, but I doubt it's the best tool for the job.
It feels like somebody wants to give music lessons, but the students first have to go through the process of building a piano.
As for your code, it looks like a good start.
Some suggestions:
Replace the 'forever' loop with a loop bounded by the length of list 'kindthings'.
Include a leading and a trailing space in the 'contains' check, to make sure only whole words match. Wouldn't want 'unhappy' in a sentence to match 'happy' in the whitelist.
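For comparison, here is the same logic in Python, since the question asked for the Python equivalent; the space padding is what keeps 'unkind' from matching 'kind':

kind_list = ["good", "kind", "love", "come"]

def is_kind(answer):
    # Pad with spaces so only whole words match (the leading/trailing
    # space trick from the suggestion above).
    padded = " " + answer.lower() + " "
    return any(" " + word + " " in padded for word in kind_list)

print(is_kind("how kind you are"))   # True
print(is_kind("that was unkind"))    # False: 'unkind' does not match 'kind'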

Identify keywords and commands in natural text

I am trying to build a system which identifies various commands and inputs based on a written human-entered text. I'll start with an example, to make things cleaner. Suppose the user inputs the following text:
My name is John Doe, my age is 28 years old, my address is Barkley Street no. 7 Havana. I like chocolate cake with strawberries and vanilla.
Based on a set of predefined markers (e.g. "name is", "age is", "address is", "I like"), I would like to detect their corresponding value (e.g. "John Doe", "28", "Barkley Street... Havana", "chocolate cake ... vanilla").
My current attempt was to tackle this via some regex patterns: for each marker I built a regex saying something along the lines of "if you find marker X, take all the text between it and any of the X, Y, Z markers you could find". That extracts text between markers, but building everything on regexes is going to be very cumbersome, especially once I start taking inflected forms and small variations into account.
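For reference, a compact version of that marker-to-marker regex idea, using the markers from the example above (the exact lookahead is just one possible variant, and already hints at why this gets cumbersome):

import re

MARKERS = ["name is", "age is", "address is", "I like"]

def extract(text):
    # Capture everything after a marker, up to the next marker or end of text.
    alt = "|".join(re.escape(m) for m in MARKERS)
    pattern = rf"({alt})\s+(.*?)(?=[.,]?\s*(?:my\s+)?(?:{alt})|$)"
    return {m.group(1): m.group(2).strip(" .,")
            for m in re.finditer(pattern, text, re.S)}

text = ("My name is John Doe, my age is 28 years old, my address is "
        "Barkley Street no. 7 Havana. I like chocolate cake with "
        "strawberries and vanilla.")
print(extract(text))
# {'name is': 'John Doe', 'age is': '28 years old',
#  'address is': 'Barkley Street no. 7 Havana',
#  'I like': 'chocolate cake with strawberries and vanilla'}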
I don't have much experience with NLP, so I'm not really sure where I should start for a proper solution. What are some appropriate approaches/solutions/libraries for tackling this problem?
What you are actually trying to do is "information extraction", particularly named entity recognition (NER) to detect the mentions of interest. For an overview, see:
https://en.wikipedia.org/wiki/Information_extraction
To actually start to solve your problem with something approaching the state of the art, I would suggest looking into the Stanford NLP Toolkit (http://nlp.stanford.edu/software/) for your basic NLP tasks (tokenization, POS tagging), but their NER toolkit won't take you very far with your specific requirements. You could try their SPIED system, but I haven't used it and can't vouch for it. Ultimately, if you are serious about this task (which on the face of it sounds quite hard) you will have to write your own NER system for all the entities you want to extract. You may want to incorporate some of your regular expressions as machine learning features (start with a simple ML library like LibSVM or Mallet), but regardless it will be a lot of work.
Good luck!
If the requirement is to identify named entities such as person, place, or organisation, then one could use the StanfordNER library in Python. Additionally, it is possible to train one's own custom entity recognition model using a CRF algorithm in Python; there are articles explaining how.

Simple phrase recognition

I am looking to recognize simple phrases like the ones Google Calendar recognizes,
but rather than parsing calendar entries I have to parse sentences related to finance, accounting, and to-do lists. So, for example, I have to parse sentences like
I spent 50 dollars on food yesterday
I need to mark and separate the info as Reason: 'food', Cost: 50, and Time: <yesterday's date>.
My question is: do I go for full-fledged natural language processing, as discussed in these questions, and use something like GATE:
Machine Learning and Natural Language Processing
Natural Language Processing in Ruby
Ideas for Natural Language Processing project?
https://stackoverflow.com/a/3058063/492561
Or is it better to write simple grammars using something like ANTLR and try to recognize them?
Or should I go really low-level and just define a syntax and use regular expressions?
Time is a constraint: I have about 45-50 days, and I don't know how to use ANTLR or NLP libraries like GATE.
Preferred languages: Python, Java, Ruby (not in any particular order).
PS: This is not homework, so please don't tag it as such.
PPS: Please try to give an answer with facts on why using a particular method is better;
even if a method may not fit inside the time constraint, please feel free to share it, because it might benefit someone else.
You could look at named entity recognition indeed. From your question I understand your domain is pretty well defined, so you can identify the (few?) entities (dates, currencies, money amount, time expressions, etc.) that are relevant for you. If the phrases are very simple, you could go with a rule-based approach, otherwise it's likely to get too complex too soon.
Just to get yourself up and running in a few seconds, http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code is an extremely nice example of what you could do. Of course I would not expect high accuracy from just 6 lines of Python, but it should give you an idea of how it works:
1 >>> import nltk
2 >>> def extract_entities(text):
3 ...     for sent in nltk.sent_tokenize(text):
4 ...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
5 ...             if hasattr(chunk, 'label'):
6 ...                 print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))
The core idea is on lines 3 and 4: line 3 splits the text into sentences and iterates over them.
Line 4 splits each sentence into tokens, runs part-of-speech tagging on it, and then feeds the POS-tagged sentence to the named entity recognition algorithm. That's the very basic pipeline.
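For instance (assuming the relevant NLTK models, such as the punkt tokenizer and the NE chunker, have been downloaded via nltk.download), running it on the sample sentence from the earlier question:

extract_entities("My name is John Doe, my age is 28 years old, "
                 "my address is Barkley Street no. 7 Havana.")
# With NLTK's default models this typically prints something like:
#   PERSON John Doe
#   GPE Havana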
In general, nltk is an extremely beautiful piece of software, and very well documented: I would look at it. Other answers contain very useful links.
Your task is a type of information extraction, specifically relation/fact extraction, preceded by named entity recognition.
Take a look at the following frameworks for Java/Python:
GExp
GATE
NLTK. Python. Book chapter on Information Extraction.
UIMA (used for IBM's Watson).

Testing if a string contains one of several thousand substrings

I'm going to be running through live twitter data and attempting to pull out tweets that mention, for example, movie titles. Assuming I have a list of ~7,000 hard-coded movie titles I'd like to check against, what's the best way to select the relevant tweets? This project is in its infancy, so I'm open to looking into any solution (i.e., it's language agnostic). Any help would be greatly appreciated.
Update: I'd be curious if anyone has any insight into how the Yahoo! Placemaker API solves this problem. It can take a text string and return a geocoded JSON result of all the locations mentioned in it.
You could try Wu and Manber's A Fast Algorithm For Multi-Pattern Searching.
The multi-pattern matching problem lies at the heart of virus scanning, so you might look to scanner implementations for inspiration. ClamAV, for example, is open source and some papers have been published describing its algorithms:
Lin, Lin and Lai: A Hybrid Algorithm of Backward Hashing and Automaton Tracking for Virus Scanning (a variant of Wu-Manber; the paper is behind the IEEE paywall).
Cha, Moraru, et al: SplitScreen: Enabling Efficient, Distributed Malware Detection
If you use compiled regular expressions, it should be pretty fast, perhaps especially if you put lots of titles into one expression.
Efficiently searching for many terms in a long character sequence would require a specialized algorithm to avoid testing for every term at every position.
But since it sounds like you have short strings with a known pattern, you should be able to use something fairly simple. Store the set of titles you care about in a hash table or tree. Parse out the candidate titles ("string1" and "string2") from each tweet using a regex, and test whether they are contained in the set.
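A sketch of that set-membership idea; this variant skips the tweet-structure regex and instead tests every short word n-gram of the tweet against the title set, which works when titles are at most a few words long (the titles here are made up):

import re

titles = {"the matrix", "inception", "toy story"}   # ~7,000 in practice
max_words = max(len(t.split()) for t in titles)

def mentioned_titles(tweet):
    # Slide windows of 1..max_words words over the tweet; each window
    # costs one O(1) expected-time hash-set lookup.
    words = re.sub(r"[^\w\s]", " ", tweet.lower()).split()
    found = set()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            if gram in titles:
                found.add(gram)
    return found

print(mentioned_titles("Inception is better than The Matrix, honestly"))
# {'inception', 'the matrix'}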
Working off what erickson suggested, the most feasible approach is to search for the fixed phrase ("is better than" in your example), then check for one of the 7,000 terms. You could instead narrow the set by creating 7,000 searches for "[movie] is better than" and then filtering manually on the second movie, but you'll probably hit the search rate limit pretty quickly.
You could speed up the searching by using a dedicated search service like Solr instead of using text parsing. You might be able to pull out titles quickly using some natural language processing service (OpenCalais?), but that would be better suited to batch processing.
For simultaneously searching for a large number of possible targets, the Rabin-Karp algorithm can often be useful.
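A brief sketch of the multi-pattern Rabin-Karp idea: hash every pattern once, slide a rolling hash over the text, and only do a real comparison when the window's hash is in the pattern-hash set. For simplicity this version assumes all patterns share one length:

def rabin_karp_multi(text, patterns):
    # All patterns must share one length m so a single rolling hash works.
    m = len(next(iter(patterns)))
    assert all(len(p) == m for p in patterns)
    B, MOD = 256, (1 << 61) - 1
    top = pow(B, m - 1, MOD)                 # weight of the outgoing character

    def h(s):
        v = 0
        for c in s:
            v = (v * B + ord(c)) % MOD
        return v

    targets = {h(p) for p in patterns}
    hits, win = [], h(text[:m])
    for i in range(len(text) - m + 1):
        # Verify with a real comparison only on a hash hit.
        if win in targets and text[i:i + m] in patterns:
            hits.append((i, text[i:i + m]))
        if i + m < len(text):                # roll: drop text[i], add text[i+m]
            win = ((win - ord(text[i]) * top) * B + ord(text[i + m])) % MOD
    return hits

print(rabin_karp_multi("the cat sat on the mat", {"cat", "sat", "mat"}))
# [(4, 'cat'), (8, 'sat'), (19, 'mat')]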

Generate words that fit in Guids (just for fun)

I have some tests that use guids. The guids used don't need to be enormously unique, they just need to be guids. Random guids are boring - so I'm trying to find fun guid words. Right now, I don't have anything better than "00000000-feed-dada-iced-c0ffee000000". Ideally I'd generate a list of verbs, nouns, prepositions.
Having only spent a few minutes on this problem, here's where I am:
I have a word list (somewhat large) from puzzlers.org.
Apply this regex to identify words that could be used in a Guid (o=0, i=1): ^[ABCDEFOI]{1,8}$
Squint.
Why doesn't someone have a funny guid generator available for my immediate gratification? How would you approach this? Any suggestions on how to improve this special guid generation process are welcome.
The solution you started is exactly how I would approach it. And it looks like someone already did the work for you:
http://nedbatchelder.com/text/hexwords.html
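In case you want to roll your own rather than use that list, the filtering step is only a few lines. A sketch, assuming a plain one-word-per-line list in words.txt (the filename is an assumption):

import re

HEX_OK = re.compile(r"^[abcdefoi]{1,8}$")

def hexwords(path="words.txt"):
    # Keep words spellable in hex once o -> 0 and i -> 1 are applied.
    with open(path) as f:
        for word in (line.strip().lower() for line in f):
            if HEX_OK.match(word):
                yield word.replace("o", "0").replace("i", "1")

# e.g. 'decade' -> 'decade', 'office' -> '0ff1ce', 'facade' -> 'facade'

Assembling the results into the 8-4-4-4-12 GUID layout is then just string formatting and zero-padding.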
This isn't a technical answer but:
The Daily WTF had a post a while back describing a guy who wrote the exact type of thing that you are trying to create; the reason it was Daily WTF material is that the generator ended up spitting out things that sounded like curse words.
From The Daily WTF - The Automated Curse Generator
"Markov chains!" he blurted. "We can use statistical textual analysis to generate random words built up from natural phonemic combinations. They won't be real words, but they will match expected English patterns, and people will be able to pronounce and read them completely naturally."
I bet if you read that post you will get ideas about how to improve upon what you already have working.
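For what it's worth, the Markov idea from that quote fits in a few lines: record which character tends to follow which in a seed list of hex-friendly words, then take a weighted random walk through those transitions (the seed words here are made up):

import random
from collections import defaultdict

def train(words):
    # chains[prev] lists every character seen after prev; duplicates act as weights.
    chains = defaultdict(list)
    for w in words:
        for a, b in zip("^" + w, w + "$"):   # '^' marks start, '$' marks end
            chains[a].append(b)
    return chains

def generate(chains, max_len=8):
    out, ch = [], "^"
    while len(out) < max_len:
        ch = random.choice(chains[ch])
        if ch == "$":                        # natural word ending
            break
        out.append(ch)
    return "".join(out)

seed = ["facade", "office", "decade", "beaded", "coffee", "iced", "dada"]
chains = train(seed)
print(generate(chains))   # e.g. 'cedadade' -- pronounceable-ish and hex-friendly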