Lingua::EN::FindNumber numify adding found English numerics - regex

I was looking for a way to convert English numerics to integers and found a great post here: Scalable Regex for English Numerals, which uses Perl. My issue with numify is that it "adds" the numbers it finds together rather than just outputting them. For example:
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::FindNumber;
print numify("some text and stuff house bill forty three twenty");
produces 63 rather than the 43 20 I expected.
Being a Perl newbie, I am at a loss as to how to get around this. Is there an override that would tell the method not to do addition? My only guess is that it's simply concatenating the strings and, since they're integers, it adds them?? Even knowing that still sadly doesn't help me. Thanks to anyone in the know.

I think your problem here has to do with an ambiguous definition of how a number should be interpreted.
If numify merely checks for words that represent numbers in a sequence and adds them, then there's no way you can overcome this. You can try to implement your own grammar, but I don't think it's completely trivial.
You'd have to catch the first word representing a number, then check the following words and try to match them against your rules. For instance, after "forty" you can have a number from 1 to 9 (one, two, etc.), or "thousand", or "million"; I think you get the idea. In this case you get "three", so add them up. The next word is "twenty", which doesn't match any rule above, so start over with a new number.
Sorry if this seems like I'm just thinking out loud. I don't know if there's a library that can do this for you; it's an ambiguous problem, as usual when you're parsing natural language.
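To make this concrete, here is a minimal Python sketch of such a grammar (my own illustration, not Lingua::EN::FindNumber; the word tables are truncated and the rules only cover tens and units):

UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def split_numbers(words):
    # Greedy scan: extend the current number while a rule matches,
    # otherwise close it and start a new one.
    numbers, current, last = [], None, None
    for w in words:
        value = UNITS.get(w) or TENS.get(w)
        if value is None:                       # not a number word
            if current is not None:
                numbers.append(current)
            current = last = None
        elif last in TENS.values() and value < 10:
            current += value                    # "forty" + "three" -> 43
            last = value
        else:                                   # no rule applies: new number
            if current is not None:
                numbers.append(current)
            current = last = value
    if current is not None:
        numbers.append(current)
    return numbers

print(split_numbers("some text and stuff house bill forty three twenty".split()))
# -> [43, 20]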
Hope it helps!

I think the parser in Lingua::EN::FindNumber is kind of loose about what it considers a number, so that e.g. "three and twenty", "three twenty", or even "forty three twenty" count as valid numbers. For that matter, looking at the source, it also seems to accept "baker's dozen", "eleventy-one", and "billiard" as numbers...

Related

Regex that matches a list of comma-separated items in any order

I have three "Clue texts" that say:
SomeClue=someText
AnotherClue=somethingElse
YetAnotherClue=moreText
I need to parse a string and see if it contains exactly these 3 texts, separated by a comma. No Clue Text contains any comma.
The problem is, they can be in any order and they must be the only clues in the string.
Matches:
SomeClue=someText,AnotherClue=somethingElse,YetAnotherClue=moreText
SomeClue=someText,YetAnotherClue=moreText,AnotherClue=somethingElse
AnotherClue=somethingElse,SomeClue=someText,YetAnotherClue=moreText
YetAnotherClue=moreText,SomeClue=someText,AnotherClue=somethingElse
Non-Matches:
SomeClue=someText,AnotherClue=somethingElse,YetAnotherClue=moreText,
SomeClue=someText,YetAnotherClue=moreText,,AnotherClue=somethingElse
,AnotherClue=somethingElse,SomeClue=someText,YetAnotherClue=moreText
YetAnotherClue=moreText,SomeClue=someText,AnotherClue=somethingElse,UselessText
YetAnotherClue=moreText,SomeClue=someText,AnotherClue=somethingElse,AClueThatIDontWant=wrongwrongwrong
Putting together what I found in other posts, I have:
(?=.*SomeClue=someText($|,))(?=.*AnotherClue=somethingElse($|,))(?=.*YetAnotherClue=moreText($|,))
This works as far as Clues and their order are concerned.
Unfortunately, I can't find a way to avoid adding a comma and then some stupid text at the end.
My real case has somewhat more complicated Clue Texts, because each of them is a small regex, but I am pretty sure once I know how to handle commas, the rest will be easy.
I think you'd be better off with a stronger tool than regexes (and I genuinely love regular expressions). Regexes aren't good at tasks that need supplementary memory, which is what you have here: you need exactly these 3, but they can come in any order.
In principle, you could write a regex for each of the 6 permutations. But that would never scale. You ought to use something with parsing power.
I suggest writing a verification function in your favorite scripting language, made up of underlying string functions.
In basic Python, you could do (for instance):
ref = set(['SomeClue=someText', 'AnotherClue=somethingElse', 'YetAnotherClue=moreText'])

def ismatch(myline):
    # Split on commas and compare against the expected set of clues;
    # order doesn't matter, and any extra, missing, or empty item fails.
    splt = myline.split(',')
    return ref == set(splt)
You can tweak that as necessary, of course. Note that this nearly-complete solution is hardly longer than, and much more readable than, any regex would be.
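For instance (one caveat: since sets ignore duplicates, a line that repeats one clue while dropping another would still pass; add a length check on splt if that matters):

print(ismatch("AnotherClue=somethingElse,SomeClue=someText,YetAnotherClue=moreText"))   # True: order doesn't matter
print(ismatch("SomeClue=someText,YetAnotherClue=moreText,AnotherClue=somethingElse,"))  # False: trailing comma adds an empty item
print(ismatch("SomeClue=someText,AnotherClue=somethingElse"))                           # False: a clue is missing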

Regex to check if input (alphanumeric) contains a number smaller than X

This will probably be easy for regex magicians, but I can't seem to figure out a way with my limited knowledge.
I need a regex that would check whether an alphanumeric string contains a number smaller than a given number (16539065 in my case).
For example the following should be matched:
alpha16000000beta
foo300bar
And the following should not be matched:
foo16539066bar
Help please.
EDIT: I'm aware that it's inefficient, however I'm doing it in a cPanel Account Level filter, which only accepts regex. Unless I figure out a way for it to trigger a script instead, this would definitely need to be done with regex. :(
Your best option for this kind of operation is to use a capture group to get the number and then use whatever language you are using to do the comparison. If you absolutely have to use a regex to do this, it will be extremely inefficient. To do so, you will need to combine a lot of similar expressions:
\d{1,7} will find any numbers with 1 to 7 digits, which will always be less than 16539065
1653906[0-4] will catch the topmost range of acceptable values (16539060-16539064)
165390[0-5]\d will catch the next range of acceptable values (16539000-16539059)
1653[0-8]\d{3} will continue down the acceptable range (16530000-16538999)
Repeat the idea (165[0-2]\d{4}, 16[0-4]\d{5}, ...) until you reach 1[0-5]\d{6}
Once you have all of those expressions, they can be combined using the 'or' operator (with boundaries such as (?<!\d) and (?!\d) around the whole alternation, so a match can't sit inside a longer run of digits). Keep in mind that using regular expressions in this manner is considered bad practice and creates hard-to-read code.
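Putting the pieces together in Python, a sketch (assuming numbers are written without leading zeros) could look like this:

import re

# Matches a run of digits whose value is strictly less than 16539065.
LESS_THAN_16539065 = re.compile(r"""
    (?<!\d)               # not preceded by another digit
    (?: \d{1,7}           # 1-7 digits: always < 10000000
      | 1[0-5]\d{6}       # 10000000-15999999
      | 16[0-4]\d{5}      # 16000000-16499999
      | 165[0-2]\d{4}     # 16500000-16529999
      | 1653[0-8]\d{3}    # 16530000-16538999
      | 165390[0-5]\d     # 16539000-16539059
      | 1653906[0-4]      # 16539060-16539064
    )
    (?!\d)                # not followed by another digit
""", re.VERBOSE)

print(bool(LESS_THAN_16539065.search("alpha16000000beta")))  # True
print(bool(LESS_THAN_16539065.search("foo300bar")))          # True
print(bool(LESS_THAN_16539065.search("foo16539066bar")))     # False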
Bad Karma might kill me, but here is a working solution for your cases (letters then numbers then letters). It will not work for e.g. ab12cd34de.
There is not really a way to shorten anything; it has to be done the long way. I'm using a negative lookahead to check that the number is not greater than or equal to 16539065.
^\D*(?!0*(?:\d{9}|[2-9]\d{7}|1[7-9]\d{6}|16[6-9]\d{5}|165[4-9]\d{4}|16539[1-9]\d{2}|165390[7-9]\d|1653906[5-9]))\d+\D*$
It checks for the general format ^\D*\d+\D*$ and then rolls 16539065 down into its parts.
Here's a little demo to play around: https://regex101.com/r/aV6yQ9/1
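A quick Python sanity check of the pattern against the question's examples:

import re

pattern = re.compile(
    r"^\D*(?!0*(?:\d{9}|[2-9]\d{7}|1[7-9]\d{6}|16[6-9]\d{5}|165[4-9]\d{4}"
    r"|16539[1-9]\d{2}|165390[7-9]\d|1653906[5-9]))\d+\D*$")

print(bool(pattern.match("alpha16000000beta")))  # True
print(bool(pattern.match("foo300bar")))          # True
print(bool(pattern.match("foo16539066bar")))     # False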

Is it possible to check if a word is really an English word via regex?

When I say English I really mean English vs. gobbledy gook. I'm not trying to filter out maitre d' or espanol or whatever.
So basically I'm trying to test whether a word is made up entirely of pronounceable syllables.
So here would be a regex:
import re

if re.match(r'^([^aeiouy]{1,3}[aeiouy]{1,3}[^aeiouy]{1,3}|[aeiouy]{1,3}[^aeiouy]{1,3})+$', the_word) is None:
    print("gobbledy gook!!!")

What it's checking for: C=consonant, V=vowel.
CVC or VC groups of characters; groups are 1-3 characters in length.
Does this make sense?
Yes and no. It is, in a certain sense, possible; the comments give the trivial (and horribly verbose and sluggish) way to do so. But is it in any sense useful to abuse regexen for this task? No. There's simply far too much variation between valid words, and even the weakened verification you're doing, which makes no attempt to handle plausible-but-wrong words like 'rong', would require impractical amounts of customization to do the job.
This sort of problem is why JWZ (Jamie Zawinski) said:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Using the Levenshtein distance in a spell checker

I am working on a spell checker in C++ and I'm stuck at a certain step in the implementation.
Let's say we have a text file with correctly spelled words and an inputted string we would like to check for spelling mistakes. If that string is a misspelled word, I can easily find its correct form by checking all words in the text file and choosing the one that differs from it by the smallest number of letters. For that kind of input, I've implemented a function that calculates the Levenshtein edit distance between 2 strings. So far so good.
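For reference, a minimal dynamic-programming version of such a function (sketched in Python for brevity; the asker's C++ implementation would follow the same recurrence) looks like this:

def levenshtein(a, b):
    # Row-by-row DP: prev[j] is the edit distance between the prefix of
    # `a` processed so far and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3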
Now, the tough part: what if the inputted string is a combination of misspelled words? For example, "iloevcokies". Taking into account that "i", "love" and "cookies" are words that can be found in the text file, how can I use the already-implemented Levenshtein function to determine which words from the file are suitable for a correction? Also, how would I insert blanks in the correct positions?
Any idea is welcome :)
Spelling correction for phrases can be done in a few ways. One way requires having an index of word bi-grams and tri-grams. These, of course, could be immense. Another option would be to try permutations of the word with spaces inserted, then doing a lookup on each word in the resulting phrase. Take a look at a simple implementation of a spell checker by Peter Norvig from Google. Either way, consider using an n-gram index for better performance; there are libraries available in C++ for reference.
Google and other search engines are able to do spelling correction on phrases because they have a large index of queries and associated result sets, which allows them to calculate a statistically good guess. Overall, the spelling correction problem can become very complex, with methods like context-sensitive correction and phonetic correction. Given that using permutations of possible sub-terms can become expensive, you can utilize certain types of heuristics, however this can get out of scope quickly.
You may also consider using an existing spelling library, such as aspell.
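As a rough illustration of the "insert spaces and look each word up" idea (Python for brevity; WORDS stands in for the text file, and a real version would replace the exact lookups with Levenshtein-based fuzzy lookups so that "iloevcokies" can still be segmented):

WORDS = {"i", "love", "cookies"}   # stand-in for the dictionary file

def segmentations(s, memo=None):
    # Return every way to split s into known words, memoized on suffixes.
    if memo is None:
        memo = {}
    if s in memo:
        return memo[s]
    results = [[s]] if s in WORDS else []
    for i in range(1, len(s)):
        if s[:i] in WORDS:
            results += [[s[:i]] + rest for rest in segmentations(s[i:], memo)]
    memo[s] = results
    return results

print(segmentations("ilovecookies"))   # [['i', 'love', 'cookies']]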
A starting point for an idea: one of the top hits of your L-distance for "iloevcokies" should be "cookies". If you can change your L-distance function to also track and return a min-index and max-index (i.e., this match is best starting from character 5 and going to character 10) then you could remove that substring and re-check L-distance for the string before it and after it, then concatenate those for a suggestion....
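A sketch of that modification (in Python): make the first DP row free so a match may start anywhere in the long string, then read the best distance and end position off the last row (recovering the start index would additionally need a traceback):

def best_substring_match(word, text):
    # Edit distance of `word` against the best-matching substring of `text`.
    prev = [0] * (len(text) + 1)          # free start: row 0 is all zeros
    for ca in word:
        curr = [prev[0] + 1]
        for j, cb in enumerate(text, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end                 # (distance, where the match ends)

print(best_substring_match("cookies", "iloevcokies"))  # (1, 11): "cokies" fits with one edit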
Just a thought, good luck....
I will suppose that you have an existing index on which you run your Levenshtein distance (for example a trie, but any sorted index generally works well).
You can consider the addition of whitespace as a regular edit operation; the twist is that you then need to get back to the root of your index for the next word.
This way you get the same index, almost the same route, approximately the same traversal, and it should not even impact your running time that much.

Matching unmatched strings based on an unknown pattern

Alright guys, I really hurt my brain over this one and I'm curious if you guys can give me any pointers towards the right direction I should be taking.
The situation is this:
Let's say I have a collection of strings (let it be clear that the pattern of these strings is unknown; for a fact, I can only say that the strings contain characters from the ASCII table, so I don't have to worry about weird Chinese characters).
For this example, I'll take the following collection of strings (note that the strings don't have to make any human sense, so don't try to figure them out :)):
"[001].[FOO].[TEST] - 'foofoo.test'",
"[002].[FOO].[TEST] - 'foofoo.test'",
"[003].[FOO].[TEST] - 'foofoo.test'",
"[001].[FOO].[TEST] - 'foofoo.test.sample'",
"[002].[FOO].[TEST] - 'foofoo.test.sample'",
"-001- BAR.[TEST] - 'bartest.xx1",
"-002- BAR.[TEST] - 'bartest.xx1"
Now, what I need is a way of finding logical groups (and subgroups) within this set of strings. In the above example, just by rational thinking, you can combine the first 3, the 2 after that, and the last 2. The resulting groups from the first 5 can also be combined into one main group with 2 subgroups, which should give you something like this:
{
{
"[001].[FOO].[TEST] - 'foofoo.test'",
"[002].[FOO].[TEST] - 'foofoo.test'",
"[003].[FOO].[TEST] - 'foofoo.test'",
}
{
"[001].[FOO].[TEST] - 'foofoo.test.sample'",
"[002].[FOO].[TEST] - 'foofoo.test.sample'",
}
}
{
{
"-001- BAR.[TEST] - 'bartest.xx1",
"-002- BAR.[TEST] - 'bartest.xx1"
}
}
Sorry for the layout above but indenting with 4 spaces doesn't seem to work correctly (or I'm frakk'n it up).
Anyway, I'm not sure how to approach this problem (how to get the desired result as indicated above).
First off, I thought of creating a huge set of regexes that would parse most known patterns, but the number of different patterns is so huge that this isn't realistic.
Another thing I thought of was parsing each individual word within a string (strip all non-alphanumeric characters and split on what remains), and if X% of the words match, assume the strings belong to the same group (where X will probably be around 80/90). However, I find the area of speculation kinda big. For example, when matching strings of 20 words each, the chance of hitting above 80% is kinda big (that means 4 words can differ), whereas when matching only 8 words, at most 2 words can differ.
My question to you is, what would be a logical approach in the above situation?
As for a real-life example:
Thanks in advance!
Basically I would consider each string as a bag of characters. I would define a kind of distance between two strings, something like "the number of characters belonging to both strings" divided by "the total number of characters in string 1 plus the total number of characters in string 2" (well, it's not a distance mathematically speaking...), and then I would try to apply some clustering algorithms to your set of strings.
Well, this is just a basic idea but I think it would be a good start to try some experiments...
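In Python, that measure (implemented exactly as described, so it tops out at 0.5 for identical strings rather than 1.0) could be prototyped like this:

from collections import Counter

def char_overlap(s1, s2):
    # Characters common to both strings, counting multiplicity,
    # divided by the combined length of the two strings.
    shared = sum((Counter(s1) & Counter(s2)).values())
    return shared / (len(s1) + len(s2))

a = "[001].[FOO].[TEST] - 'foofoo.test'"
b = "[002].[FOO].[TEST] - 'foofoo.test'"
c = "-001- BAR.[TEST] - 'bartest.xx1"
print(char_overlap(a, b))  # close to 0.5: nearly every character is shared
print(char_overlap(a, c))  # noticeably lower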
Building on PierrOz's answer, you might want to experiment with multiple measures, and do a statistical cluster analysis on those measures.
For example, you could use four measures:
How many letters (upper/lowercase)
How many digits
How many of the characters [, ], and .
How many other characters (probably) not included above
You then have, in this example, four measures for each string, and you could, if you wished, apply a different weight to each measure.
R has a number of functions for cluster analysis. This might be a good starting point.
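If you'd rather prototype the measures outside R, a hypothetical Python version (the function name and exact character classes are my own choices) might be:

strings = [
    "[001].[FOO].[TEST] - 'foofoo.test'",
    "-001- BAR.[TEST] - 'bartest.xx1",
]

def features(s):
    # One row of the four measures per string.
    letters = sum(c.isalpha() for c in s)
    digits = sum(c.isdigit() for c in s)
    brackets = sum(c in "[]." for c in s)
    other = len(s) - letters - digits - brackets
    return [letters, digits, brackets, other]

rows = [features(s) for s in strings]   # feed these rows to a clustering routine
print(rows)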
Afterthought: the measures can be almost anything you invent. Some more examples:
Binary: does the string contain a given character (0 or 1)?
Binary: does the string contain a given substring?
Count: how many times does the given substring appear?
Binary: does the string include all these characters?
Enough for at least a weekend's tinkering...
I would recommend using the Hamming distance: http://en.wikipedia.org/wiki/Hamming_distance
Also, for files, a good heuristic would be to remove the checksum at the end of the filename before calculating the distance:
[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_[35218661].mkv
->
[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_.mkv
Spotting the checksum is simple - it's always 10 characters, the first being [, the last being ], and the rest alphanumeric :)
With this heuristic and a maximum distance of 4, your approach will work in the vast majority of cases.
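A Python sketch of both pieces (the checksum pattern follows the description above; the helper names are my own):

import re

def hamming(a, b):
    # Positions at which the strings differ; only meaningful for equal lengths.
    return sum(x != y for x, y in zip(a, b))

def strip_checksum(name):
    # Drop a [ + 8 alphanumerics + ] tag sitting just before the extension.
    return re.sub(r"\[[0-9A-Za-z]{8}\](?=\.[^.]+$)", "", name)

print(strip_checksum(
    "[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_[35218661].mkv"))
# -> [BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_.mkv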
Good luck!
Your question is not easy to understand, but I think what you ask is impossible to do in a satisfying way for an arbitrary group of strings. Take these strings, for instance:
[1].[2].[3].[4].[5]
[a].[2].[3].[4].[5]
[a].[b].[3].[4].[5]
[a].[b].[c].[4].[5]
[a].[b].[c].[d].[5]
[a].[b].[c].[d].[e]
Each is close to those listed next to it, so they should all group with their neighbours, but the first and the last are completely different, so it would not make sense to group those together. Given a more "grouping" dataset you might get pretty good results with a method like the one PierrOz describes, but there is no guarantee of meaningful results.
May I enquire what the purpose is? It would allow us all to better understand what errors might be tolerated, or perhaps even come up with a different approach to solving the problem.
Edit: I wonder, would it be OK if one string ends up in multiple different groups? That could make the problem a lot simpler, and more reliably give you useful information, but you would end up with a bigger grouping tree with the same node copied to different branches.
I'd be tempted to tackle this with cluster analysis techniques. Hit Wikipedia for an introduction. And the other answers probably fall within the domain of cluster analysis, but you might find some other useful approaches by reading a bit more widely.