Generate words that fit in Guids (just for fun) - regex

I have some tests that use guids. The guids used don't need to be enormously unique, they just need to be guids. Random guids are boring - so I'm trying to find fun guid words. Right now, I don't have anything better than "00000000-feed-dada-iced-c0ffee000000". Ideally I'd generate a list of verbs, nouns, prepositions.
Having only spent a few minutes on this problem, here's where I am:
1. I have a word list (somewhat large) from puzzlers.org.
2. Apply this regex to identify words that could be used in a Guid (o=0, i=1): ^[ABCDEFOI]{1,8}$
3. Squint.
Why doesn't someone have a funny guid generator available for my immediate gratification? How would you approach this? Any suggestions on how to improve this special guid generation process are welcome.

The solution you started is exactly how I would approach it. And it looks like someone already did the work for you:
http://nedbatchelder.com/text/hexwords.html
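If you want to roll your own anyway, a minimal sketch of that filtering step in Python might look like the following; the word-list path is a placeholder and the o=0 / i=1 substitution follows the question:
import re

# letters that can stand in for hex digits, with o=0 and i=1 as in the question
hexable = re.compile(r"^[abcdefoi]{1,8}$")

with open("wordlist.txt") as f:                      # word-list path is a placeholder
    words = [line.strip().lower() for line in f]

guid_words = sorted(w.translate(str.maketrans("oi", "01"))
                    for w in words if hexable.match(w))
print(guid_words[:20])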

This isn't a technical answer but:
The Daily WTF had a post a while back describing a guy who wrote the exact type of thing that you are trying to create. The reason it was Daily WTF material is that the generator ended up spitting out things that sounded like curse words.
From The Daily WTF - The Automated Curse Generator
Markov chains!" he blurted. "We can use statistical textual analysis to generate random words built up from natural phonemic combinations. They won't be real words, but they will match expected English patterns, and people will be able to pronounce and read them completely naturally."
I bet if you read that post you will get ideas about how to improve upon what you already have working.
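If you want to experiment with that idea, here is a rough character-level Markov-chain sketch in Python; the word-list path and the order-2 context length are just assumptions:
import random
from collections import defaultdict

def build_model(words, order=2):
    # map each 'order'-character context to the letters that can follow it
    model = defaultdict(list)
    for w in words:
        padded = "^" * order + w.lower() + "$"
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def generate(model, order=2):
    context, out = "^" * order, []
    while True:
        nxt = random.choice(model[context])
        if nxt == "$":                      # end-of-word marker
            return "".join(out)
        out.append(nxt)
        context = context[1:] + nxt

words = [line.strip() for line in open("wordlist.txt") if line.strip()]
model = build_model(words)
print([generate(model) for _ in range(10)])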

Related

How to generate random words from real languages

How can I generate a random word from a real language?
Does anybody know of an Internet API with this functionality?
For example, I send an HTTP request to 'ht_tp://www.any...api.com/getword?lang=en' and get the response 'Town'. Or 'Fast'. Or 'Received'... Or I send an HTTP request to 'ht_tp://www.any...api.com/getword?lang=ru' and get the response 'Ходить'. Or 'Шапка'. Or 'Отправлено'... Any form (noun, adjective, verb etc.) of a word, in any language.
I found the resource 'http://www.randomlists.com/random-words', but it is not in JSON format, it is English only, and there is no guarantee it will keep working long term.
Any ideas are welcome.
See this answer: https://stackoverflow.com/questions/824422/can-i-get-an-english-dictionary-word-list-somewhere. Download a word dictionary, put it in a database and fetch a random record, or read a random line from the file each time. This way you don't depend on a 3rd-party API, and you can extend it to all the languages you can find words for.
You can download the OpenOffice dictionaries. They come as an extension (.oxt), which is really just a ZIP file; you can open them with 7-Zip or similar. Inside you will find lots of files; the interesting ones for you are the *.dic files. They will also contain resolutions or number words.
When you encounter something like abandon/LdS, get rid of the /LdS part; it is used by hunspell.
Take these *.dic files, use their names as keys, put them into a database, and pick a random word from there for a given language code.
Update
Older, but easier to access: the archived hunspell dictionaries from OpenOffice.
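A minimal sketch of that parsing step in Python; 'en_US.dic' and the UTF-8 encoding are assumptions (the actual encoding for a given dictionary is declared in its matching .aff file):
import random

def load_dic(path, encoding="utf-8"):
    words = []
    with open(path, encoding=encoding) as f:
        next(f)                                # the first line of a .dic file is a word count
        for line in f:
            word = line.split("/", 1)[0].strip()   # drop hunspell flags such as /LdS
            if word:
                words.append(word)
    return words

words = load_dic("en_US.dic")                  # hypothetical file extracted from the .oxt
print(random.choice(words))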
This question can be viewed in two ways and therefore I give two answers:
To collect words, I would run a spider on websites with a known language (Wikipedia is a good starting point) and strip HTML tags.
To generate words from a real language is trickier. Using statistics from the collected words, it is possible to use Markov chains that produce statistically plausible words. I have tried letter-by-letter generation, and that works poorly. It is probably better to use syllable construction instead.

solr PatternReplaceCharFilterFactory working unexpectedly

I am relatively new to Solr, so please forgive me if I'm missing something obvious. I have an application that allows users to search for musical artists. The indexing comes from a read-only database with correct spellings, so on the index side I have it figured out.
On the query side, however, I need to anticipate various spelling errors/differences and want to help Solr find those instances. From our old home-grown search solution, I have a list of regexes and the artists they apply to. When I tried to translate those to Solr using the PatternReplaceCharFilterFactory, I noticed that some worked perfectly while others didn't work at all, with seemingly no rhyme or reason between them.
For example:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="em[ei]n[ei]m" replacement="Eminem"/>
accurately captures the common misspellings of Eminem. But for the band 311:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[Tt]hree [Ee]leven" replacement="311"/>
Does not work. Another example is Nine Inch Nails:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="((nine|9).*inch.*nails\b)|(n\.? ?i\.? ?n\.?\b)" replacement="Nine Inch Nails"/>
works perfectly for finding the most common patterns for the band's name. But for Eve 6:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[Ee]ve.{0,4}([Ss]ix|6)" replacement="Eve 6"/>
This one does not work either. Is there something fundamental I'm missing in the usage of this filter? I've tried a number of variations on the regexes I've mentioned above (even going so far as using literals like 'three eleven'), but still with no success. I've tried making the filter in question the only PatternReplaceCharFilterFactory in the analyzer. I also know for sure that these items are in the index correctly, because when I search for the correct spelling it returns the proper results.
Any suggestions?
Snowdall
I suspect the problem is not with your char filter, but with what comes after it, specifically the tokenizer. If you use the standard tokenizer, it will get rid of the numbers you have just put into your stream. If you don't need the text to be split into tokens, you could look at KeywordTokenizerFactory instead.
In general, the best way to troubleshoot this in Solr 4+ is the Analysis screen in the Admin WebUI. It allows you to enter your text against particular field type and see what happens with it after each component in the analysis chain.
I would recommend using the SynonymFilter for the kind of application you describe. It allows you to provide an external file where you list words and their synonyms, like:
eminem <=> emenem
nine <=> 9
If you precede this with a LowerCaseFilter, you won't have to fuss about case normalization in your synonyms. You should be able to handle the 311 case too, as long as you don't tokenize (i.e. use a KeywordTokenizer, as Alexander Rafalovitch suggested).

Simple Phrases recognition

I am looking to recognize simple phrases, like the ones Google Calendar recognizes, but rather than parsing calendar entries I have to parse sentences related to finance, accounting and to-do lists. So, for example, I have to parse sentences like
I spent 50 dollars on food yesterday
I need to mark and separate the info as Reason: 'food', Cost: 50 and Time: <yesterday's date>.
My question is: do I go in for full-fledged Natural Language Processing, like what is suggested in these questions, and use something like GATE:
Machine Learning and Natural Language Processing
Natural Language Processing in Ruby
Ideas for Natural Language Processing project?
https://stackoverflow.com/a/3058063/492561
Or is it better to write simple grammars using something like ANTLR and try to recognize them?
Or should I go really low-level and just define a syntax and use regular expressions?
Time is a constraint; I have about 45-50 days, and I don't know how to use ANTLR or NLP libraries like GATE.
Preferred languages: Python, Java, Ruby (not in any particular order).
PS: This is not homework, so please don't tag it as such.
PPS: Please try to give an answer with facts on why using a particular method is better.
Even if a particular method may not fit inside the time constraint, please feel free to share it, because it might benefit someone else.
You could indeed look at named entity recognition. From your question I understand your domain is pretty well defined, so you can identify the (few?) entities (dates, currencies, money amounts, time expressions, etc.) that are relevant for you. If the phrases are very simple, you could go with a rule-based approach; otherwise it's likely to get too complex too soon.
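For the rule-based route, here is a minimal sketch in Python; the regex pattern and the field names are assumptions, and it only covers the shape of the example sentence:
import re
from datetime import date, timedelta

# hypothetical pattern covering only sentences shaped like the example
PATTERN = re.compile(
    r"I spent (?P<cost>\d+) dollars on (?P<reason>\w+) (?P<time>yesterday|today)",
    re.IGNORECASE)

def parse(sentence):
    m = PATTERN.search(sentence)
    if not m:
        return None
    if m.group("time").lower() == "yesterday":
        when = date.today() - timedelta(days=1)
    else:
        when = date.today()
    return {"Reason": m.group("reason"), "Cost": int(m.group("cost")), "Time": when}

print(parse("I spent 50 dollars on food yesterday"))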
Just to get yourself up and running in a few seconds, http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code is an extremely nice example of what you could do. Of course I would not expect high accuracy from just 6 lines of Python, but it should give you an idea of how it works:
1>>> import nltk
2>>> def extract_entities(text):
3...     for sent in nltk.sent_tokenize(text):
4...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
5...             if hasattr(chunk, 'node'):
6...                 print chunk.node, ' '.join(c[0] for c in chunk.leaves())
The core idea is on lines 3 and 4: on line 3 it splits the text into sentences and iterates over them.
On line 4, it splits each sentence into tokens, runs part-of-speech tagging on it, and then feeds the POS-tagged sentence to the named entity recognition algorithm. That's the very basic pipeline.
In general, nltk is an extremely beautiful piece of software, and very well documented: I would look at it. Other answers contain very useful links.
Your task is a type of Information Extraction task, specifically relation/fact extraction, preceded by Named Entity Recognition.
Take a look at the following frameworks for Java/Python:
GExp
GATE
NLTK. Python. Book chapter on Information Extraction.
UIMA (used for IBM's Watson).

Testing if a string contains one of several thousand substrings

I'm going to be running through live Twitter data and attempting to pull out tweets that mention, for example, movie titles. Assuming I have a list of ~7000 hard-coded movie titles I'd like to check against, what's the best way to select the relevant tweets? This project is in its infancy, so I'm open to looking into any solution (i.e. it's language agnostic). Any help would be greatly appreciated.
Update: I'd be curious if anyone has any insight into how the Yahoo! Placemaker API solves this problem. It can take a text string and return a geocoded JSON result of all the locations mentioned in it.
You could try Wu and Manber's A Fast Algorithm For Multi-Pattern Searching.
The multi-pattern matching problem lies at the heart of virus scanning, so you might look to scanner implementations for inspiration. ClamAV, for example, is open source and some papers have been published describing its algorithms:
Lin, Lin and Lai: A Hybrid Algorithm of Backward Hashing and Automaton Tracking for Virus Scanning (a variant of Wu-Manber; the paper is behind the IEEE paywall).
Cha, Moraru, et al: SplitScreen: Enabling Efficient, Distributed Malware Detection
If you use compiled regular expressions, it should be pretty fast, perhaps especially if you put lots of titles into one expression.
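A minimal sketch of that, with a small placeholder title list (longer titles are put first because the regex alternation takes the first branch that matches):
import re

titles = ["The Matrix", "Blade Runner", "Eve 6"]     # ~7000 in practice
pattern = re.compile("|".join(re.escape(t)
                              for t in sorted(titles, key=len, reverse=True)),
                     re.IGNORECASE)

print(pattern.findall("blade runner is better than the matrix"))  # ['blade runner', 'the matrix']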
Efficiently searching for many terms in a long character sequence would require a specialized algorithm to avoid testing for every term at every position.
But since it sounds like you have short strings with a known pattern, you should be able to use something fairly simple. Store the set of titles you care about in a hash table or tree. Parse out "string1" and "string2" from each tweet using a regex, and test whether they are contained in the set.
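A rough sketch of the set-based check, with placeholder titles and a simple word n-gram scan (punctuation handling is ignored for brevity):
# placeholder titles; in practice this would be the ~7000-title list, lower-cased
titles = {"the matrix", "blade runner", "alien"}
max_len = max(len(t.split()) for t in titles)

def mentioned_titles(tweet):
    words = tweet.lower().split()
    found = set()
    for n in range(1, max_len + 1):                  # try every word n-gram length
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            if candidate in titles:
                found.add(candidate)
    return found

print(mentioned_titles("Blade Runner is better than Alien"))    # {'blade runner', 'alien'}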
Working off what erickson suggested, the most feasible search is for the fixed phrase ("is better than" in your example), then checking the results for one of the 7,000 terms. You could instead narrow the set by creating 7,000 searches for "[movie] is better than" and then filtering manually on the second movie, but you'll probably hit the search rate limit pretty quickly.
You could speed up the searching by using a dedicated search service like Solr instead of using text parsing. You might be able to pull out titles quickly using some natural language processing service (OpenCalais?), but that would be better suited to batch processing.
For simultaneously searching for a large number of possible targets, the Rabin-Karp algorithm can often be useful.
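For illustration, a rough Rabin-Karp sketch for patterns that all share the same length; the hash base and modulus are arbitrary choices:
def rabin_karp_multi(text, patterns):
    # all patterns must share the same length m for this simple variant
    m = len(next(iter(patterns)))
    assert all(len(p) == m for p in patterns)
    base, mod = 256, 1_000_003
    def h(s):
        v = 0
        for ch in s:
            v = (v * base + ord(ch)) % mod
        return v
    pattern_hashes = {h(p) for p in patterns}
    pattern_set = set(patterns)
    if len(text) < m:
        return []
    high = pow(base, m - 1, mod)            # weight of the leading character
    window = h(text[:m])
    hits = []
    for i in range(len(text) - m + 1):
        if window in pattern_hashes and text[i:i + m] in pattern_set:
            hits.append((i, text[i:i + m]))
        if i + m < len(text):
            # roll the hash: drop text[i], append text[i + m]
            window = ((window - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits

print(rabin_karp_multi("eve 6 vs 311 vs 311", {"311", "911"}))   # [(9, '311'), (16, '311')]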

How to run a dictionary search against a large text file?

We're in the final stages of shipping our console game. On the Wii we're having the most problems with memory of course, so we're busy hunting down sloppy coding, packing bits, and so on.
I've done a dump of memory and used strings.exe (from sysinternals) to analyze it, but it's coming up with a lot of gunk like this:
''''$$$$ %%%%
''''$$$$%%%%####&&&&
''''$$$$((((!!!!$$$$''''((((####%%%%$$$$####((((
''))++.-$$%&''))
'')*>BZf8<S]^kgu[faniwkzgukzkzkz
'',,..EDCCEEONNL
I'm more interested in strings like this:
wood_wide_end.bmp
restroom_stonewall.bmp
...which mean we're still embedding some kinds of strings that need to be converted to ID's.
So my question is: what are some good ways of finding the stuff that's likely our debug data that we can eliminate?
I can write some regexes to hack off symbols or just search for certain kinds of strings. But what I'd really like to do is get hold of a standard dictionary file and search my strings file against it. It seems slow if I were to build a big regex with aardvaark|alimony|archetype etc. Or will that work well enough if I do a .NET compiled regex assembly for it?
Looking for other ideas about how to find stuff we want to eliminate as well. Quick and dirty solutions, don't need elegant. Thanks!
First, I'd get a good word list. This NPL page has a good list of word lists of varying sizes and sources. What I would do is build a hash table of all the words in the word list, and then test each word that is output by strings against the word list. This is pretty easy to do in Python:
import sys

dictfile = open('your-word-list')
wordlist = frozenset(word.strip() for word in dictfile)
dictfile.close()

for line in sys.stdin:
    # if any word in the line is in our list, print out the whole line
    for word in line.split():
        if word in wordlist:
            print line
            break
Then use it like this:
strings myexecutable.elf | python myscript.py
However, I think you're focusing your attention in the wrong place. Eliminating debug strings has rapidly diminishing returns. Although eliminating debugging data is one of Nintendo's Technical Certification Requirements, I don't think they'll bounce you for having a couple of extra strings in your ELF.
Use a profiler and try to identify where you're using the most memory. Chances are, there will be a way to save huge amounts of memory with little effort if you focus your energy in the right place.
This sounds like an ideal task for a quick-and-dirty script in something that supports regexes. I'd probably whip something up in Python real quick if it were me.
Here's how I would proceed:
- Every time you encounter a string (from the strings.exe output), prompt the user as to whether they'd like to remember it in the dictionary or permanently ignore it.
- If the user chooses to permanently ignore a string, don't prompt about it when it's encountered in the future; just throw it away. You can optionally keep an anti-dictionary file around to remember this across runs of your script.
- Build up the dictionary file, and for each string keep a count or any other info you'd like about it.
- Optionally sort by the number of times each string occurs, so you can focus on the most egregious offenders.
This sounds like an ideal task for learning a scripting language. I wouldn't bother messing with C#/C++ or anything real fancy to implement this.
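A quick-and-dirty sketch of the prompt-and-remember procedure above in Python; the file names ('keep.txt', 'ignore.txt') and the prompt wording are assumptions, and the strings output is read from a file given on the command line:
import sys
from collections import Counter

def load(path):
    try:
        with open(path) as f:
            return set(line.strip() for line in f)
    except FileNotFoundError:
        return set()

keep, ignore = load('keep.txt'), load('ignore.txt')
counts = Counter()

with open(sys.argv[1]) as dump:              # strings.exe output saved to a file
    for raw in dump:
        s = raw.strip()
        if not s or s in ignore:
            continue
        if s not in keep:
            if input("keep '%s'? [y/N] " % s).lower().startswith('y'):
                keep.add(s)
            else:
                ignore.add(s)
                continue
        counts[s] += 1

for s, n in counts.most_common():            # most frequent offenders first
    print("%6d  %s" % (n, s))

with open('keep.txt', 'w') as f:
    f.write('\n'.join(sorted(keep)))
with open('ignore.txt', 'w') as f:
    f.write('\n'.join(sorted(ignore)))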