Regex to identify all sorts of candidate legal numbers - regex

[This is a heavily re-edited version. Please ignore past versions of this question.]
A small python script using a sophisticated regex was provided by eyquem to identify numbers in a string and sanitize them. The test results cover over 50 samples, which I won't repeat here.
The question is, can someone adjust that regexp or provide a new one so that commas are treated more sanely?
In particular, I would like to see the following 4 test inputs produce the associated outputs.
' 4,8.3,5 ' -> '4' '8.3' '5'
' 44,22,333,888 ' -> '44' '22,333,888' #### Note that 44,22 is never a single number.
' 11,333e22,444 ' -> '11,333e22' '444' #### 11,333 is accepted in front of e22, but 22,444 is not accepted after it.
' 1,999 people found the code "i+=1999;" to be crystal clear in meaning and to likely lead to less than 1999 kilobytes extra memory consumption; however, the gains in 1, 999, and 1999 KB disk space are anything but ideal, especially this being 1999 and us having over $1,999 to work with! ' -> '1,999' '1999' '1999' '1' '999' '1999' '1999' '1,999'
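One possible direction, sketched in Python and not eyquem's original pattern: accept a comma-grouped integer only when every group after the first has exactly three digits, otherwise fall back to a plain run of digits, then allow an optional decimal part and exponent. The names NUMBER and find_numbers are made up for this illustration, and the sketch ignores signs, leading-dot decimals, and the other cases the original script covers.

```python
import re

# Hypothetical pattern for illustration only; not the regex from the original script.
NUMBER = re.compile(r"""
    (?<!\d)                          # do not start in the middle of a digit run
    (?: \d{1,3} (?:,\d{3})+ (?!\d)   # comma-grouped integer such as 22,333,888
      | \d+                          # or a plain run of digits
    )
    (?: \.\d+ )?                     # optional decimal part
    (?: [eE][+-]?\d+ )?              # optional exponent (no commas allowed here)
""", re.VERBOSE)


def find_numbers(text):
    """Return every candidate number found in text, left to right."""
    return NUMBER.findall(text)


print(find_numbers(" 44,22,333,888 "))
```

Because the comma-grouped alternative is tried first and must end where the digits end, 44,22 can never be glued together, while 22,333,888 is kept whole.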

Despite all the information, your post is actually vague. For starters, you didn't ask any questions. What is it you want?
Are you asking how to find all possible matches? In Perl, you can use
local our @matches;
/(...)(?{ push @matches, $1 })(?!)/
The (?!) never matches, so it causes the regex engine to backtrack to find another match, but the code block saves what it did find before doing that.
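For comparison, a rough Python analogue: Python's re module has no embedded code blocks, so this sketch brute-forces every start and end offset and keeps the substrings the pattern fully matches. The helper name all_matches is hypothetical, and the approach is quadratic in the input length, so it only suits short strings.

```python
import re


def all_matches(pattern, text):
    """Return every substring of text that the pattern matches in full.

    Brute force: try each (start, end) pair with Pattern.fullmatch.
    """
    rx = re.compile(pattern)
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if rx.fullmatch(text, start, end):
                found.append(text[start:end])
    return found


print(all_matches(r"\d+", "a12b3"))
```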
If you're asking to find any match, then it's quite easy to solve: Don't bother looking for option 2, because option 1 will always match what option 2 matches.

Regex - How can you identify strings which are not words?

Got an interesting one, and can't come up with any solid ideas, so thought maybe someone else may have done something similar.
I want to be able to identify strings of letters in a longer sentence that are not words and remove them. Essentially things like kuashdixbkjshakd
Everything annoyingly is in lowercase which makes it more difficult, but since I only care about English, I'm essentially looking for the opposite of consonant clusters, groups of them that don't make phonetically pronounceable sounds.
Has anyone heard of/done something like this before?
EDIT: this is what ChatGpt tells me
It is difficult to provide a comprehensive list of combinations of consonants that have never appeared in a word in the English language. The English language is a dynamic and evolving language, and new words are being created all the time. Additionally, there are many regional and dialectal variations of the language, which can result in different sets of words being used in different parts of the world.
It is also worth noting that the frequency of use of a particular combination of consonants in the English language is difficult to quantify, as the existing literature on the subject is limited. The best way to determine the frequency of use of a particular combination of consonants would be to analyze a large corpus of written or spoken English.
In general, most combinations of consonants are used in some words in the English language, but some combinations of consonants may be relatively rare. Some examples of relatively rare combinations of consonants in English include "xh", "xw", "ckq", and "cqu". However, it is still possible that some words with these combinations of consonants exist.
You could try passing every word in the sentence to a function that checks whether the word is listed in a dictionary. There are a good number of dictionary text files on GitHub. To speed up the process, use a hash map :)
You could also use an auto-correction API or a library.
Algorithm to combine both methods:
Run sentence through auto correction
Run every word through dictionary
Delete words that aren't listed in the dictionary
This could remove typos and words that are non-existent.
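The dictionary step might look roughly like this in Python. The function name and the tiny word set are placeholders; in practice the set would be loaded from one of the dictionary files mentioned above, and the auto-correction pass is left out because it depends on whichever API or library you pick.

```python
def remove_non_words(sentence, dictionary):
    """Keep only the words that appear in the dictionary (a set, so lookups are O(1))."""
    return " ".join(word for word in sentence.split() if word in dictionary)


# Placeholder dictionary for illustration; a real one would come from a word-list file.
known_words = {"the", "cat", "sat", "on", "mat"}
print(remove_non_words("the cat kuashdixbkjshakd sat on the mat", known_words))
```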
You could train a simple model on sequences of characters which are permitted in the language(s) you want to support, and then flag any which contain sequences which are not in the training data.
The LangId language detector in SpamAssassin implements the Cavnar & Trenkle language-identification algorithm which basically uses a sliding window over the text and examines the adjacent 1 to 5 characters at each position. So from the training data "abracadabra" you would get
a 5
ab 2
abr 2
abra 2
abrac 1
b 2
br 2
bra 2
brac 1
braca 1
:
With enough data, you could build a model which identifies unusual patterns (my suggestion would be to try a window size of 3 or smaller for a start, and train it on several human languages from, say, Wikipedia) but it's hard to predict how precise exactly this will be.
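As a minimal Python sketch of the idea (not the SpamAssassin or libtextcat code, and without the padding characters Cavnar & Trenkle add around words): count all 1- to 5-grams in the training text, then flag a word if any of its 3-grams never occurred. The function names are made up for this example.

```python
from collections import Counter


def ngram_counts(text, max_n=5):
    """Count every substring of length 1..max_n using a sliding window."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def has_unseen_ngram(word, counts, n=3):
    """True if the word contains an n-gram that never occurred in the training data."""
    return any(word[i:i + n] not in counts for i in range(len(word) - n + 1))


counts = ngram_counts("abracadabra")
print(counts["a"], counts["ab"], counts["abrac"])
print(has_unseen_ngram("xqz", counts))
```

A hard "never seen" test like this is the bluntest possible threshold; with real training data you would more likely score words by how rare their n-grams are.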
SpamAssassin is written in Perl and it should not be hard to extract the language identification module.
As an alternative, there is a library called libtextcat which you can run standalone from C code if you like. The language identification in LibreOffice uses a fork which they adapted to use Unicode specifically, I believe (though it's been a while since I last looked at that).
Following Cavnar & Trenkle, all of these truncate the collected data to a few hundred patterns; you would probably want to extend this to cover up to all the 3-grams you find in your training data at least.
Perhaps see also Gertjan van Noord's link collection: https://www.let.rug.nl/vannoord/TextCat/
Depending on your test data, you could still get false positives e.g. on peculiar Internet domain names and long abbreviations. Tweak the limits for what you want to flag - I would think that GmbH should be okay even if you didn't train on German, but something like 7 or more letters long should probably be flagged and manually inspected.
This will match words with more than 5 consonants (you probably want "y" to not be considered a consonant, but it's up to you):
\b[a-z]*[b-z&&[^aeiouy]]{6}[a-z]*\b
5 was chosen because I believe witchcraft has the longest chain of consonants of any English word. You could dial back "6" in the regex to say 5 or even 4 if you don't mind matching some outliers.
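Note that the && character-class intersection above is Java-style syntax and, as far as I know, is not supported by Python's re module. A rough Python equivalent spells the consonants out as ranges instead (again treating "y" as a non-consonant); the names below are just for illustration.

```python
import re

# b-d, f-h, j-n, p-t, v-x, z: every lowercase letter except a, e, i, o, u, y.
GIBBERISH = re.compile(r"\b[a-z]*[b-df-hj-np-tv-xz]{6}[a-z]*\b")


def find_gibberish(text):
    """Return the words containing a run of six or more consonants."""
    return GIBBERISH.findall(text)


print(find_gibberish("the witchcraft of kuashdixbkjshakd"))
```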

Matching variable words in a string

This will sound extremely nerdy, but I play this online game that writes its in-game events to a log file. There's a program I'm using that is capable of reading this log file, and it's also capable of interpreting regex. My goal is to write a regex command that analyzes a certain string from this log file and then spits out certain parts of the string onto my screen.
The string that gets written to the log file has the following syntax (variables in bold):
NAME hits/bashes/crushes/claws/whatever NEWNAME for NUMBER points of damage.
If it matters, NUMBER will never contain commas or spaces, and the action verb (hits, bashes, whatever) will only ever be a single word without any special characters, spaces, numbers, etc.
What I'd like this program to do is interpret the regex code that I enter and spit out a result that says: NAME attacks NEWNAME
The catch is, NAME and NEWNAME can have the following range of possibilities (names and examples picked at random):
Kevin
Kevin's pet
Kevin from Oregon
Kevin from Oregon's pet
Kevin from Oregon`s pet (note the grave accent there instead of the apostrophe)
It's pretty simple if it's just something like Kevin hits Josh for 10728 points of damage. In this case, my regex is the following code block (please note that the program interprets the {N} wildcard on its own as any number without the need for regex):
(?<char1>\w+) \w+ (?<char2>\w+) for {N} points of damage.
...and my output reads...
${char1} attacks ${char2}
Whenever the game outputs that string of Kevin hits Josh for 10728 points of damage. to the log file, the program I'm using picks up on it and correctly outputs Kevin attacks Josh to my screen.
However, using that regex line results in a failure when spaces, apostrophes, grave accents, and/or any combination of the three are present in either NAME or NEWNAME.
I tried to alter the regex line to read...
(?<char1>[a-zA-Z0-9_ ]+) \w+ (?<char2>[a-zA-Z0-9_ ]+) for {N} points of damage.
...but when I encounter the string Kevin bashes Josh of Texas for 2132344 points of damage., for example, the output to my screen winds up being:
Kevin bashes Josh attacks Texas.
I'm trying different things but ultimately not coming up with something that's spitting out the proper format of NAME attacks NEWNAME when those two variables contain spaces, apostrophes, grave accents, and/or any combination of the three.
Any help or tips on what I'm doing wrong or how I can further alter that regex line would be extremely appreciated!
This is going to sound even nerdier, but I think the question isn't the regex, it's what tool you use the regex in.
Your biggest problem thus far has been the names. I suggest ignoring the names, and focusing only on the elements you know are there. The names are what's left.
I tried this myself using GNU sed:
sed -e 's/for [[:digit:]]\+ points of damage//' -e 's/hits\|bashes\|crushes/attacks/'
You see, first we can eliminate the end of the sentence, which is wholly superfluous. Then, we simply switch the verb to "attacks".
If the program uses a synonym for "attacks" that you don't have yet, you'll still have reasonable output; you can then fix your regex to include the new synonym.
You are guaranteed trouble if somebody's name includes "bashes" (or whatever) in it.
The second sed expression should be improved to be relevant only at a word boundary, but I'll leave that as an exercise for the reader. :)
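The same idea can be sketched in Python for anyone who wants to test it outside the log reader: anchor on the parts that are known (a fixed list of verbs and the "for N points of damage." tail) and let the names be whatever is left. The verb list, LINE, and summarize are assumptions for this example, and the {N} wildcard of the original program is replaced by \d+.

```python
import re

# Assumed verb list; extend it as new verbs show up in the log.
VERBS = ("hits", "bashes", "crushes", "claws")

LINE = re.compile(
    r"^(?P<char1>.+?) (?:%s) (?P<char2>.+) for \d+ points of damage\.$"
    % "|".join(VERBS)
)


def summarize(line):
    """Return 'NAME attacks NEWNAME', or None if the line does not match."""
    m = LINE.match(line)
    return f"{m['char1']} attacks {m['char2']}" if m else None


print(summarize("Kevin bashes Josh of Texas for 2132344 points of damage."))
```

As with the sed version, a name that itself contains one of the verbs as a separate word will still confuse it.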

Vim: Placing (,) in between CERTAIN high numbers Issue

source txt file:
34|Gurla Mandhata|7694|25243|2788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1985|6 (4)|China
command input:
:%s/\(\d\+\)\(\d\d\d\)/\1,\2/g
command output:
34|Gurla Mandhata|7,694|25,243|2,788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1,985|6 (4)|China
Desired output:
34|Gurla Mandhata|7,694|25,243|2,788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1985|6 (4)|China
Basically 1985 is supposed to be 1985 and not 1,985. I tried to put a \? so every time the pattern matches it stops and a °+ after so it has to detect a ° to match the pattern, but no success. It just replaces the ° and everything before that, complete mess.
My knowledge of regular expressions however combined with the substitute is weak and I'm stuck here.
EDIT
the first 3 numbers represent heights of mountains, those 3 need to change with a (,) and the last number ( 1985 ) represents a year, which must not be changed.
Mathematical solutions are not going to work as a loophole, since there are mountains with a height of less than 1900.
You haven't told us what distinguishes 1985 from the other numbers, so I assumed that your "small" numbers are less than 2000.
You almost got it:
:%s/\(\d*[2-90]\)\(\d\d\d\)/\1,\2/g
Alternatively if that isn't what you want, you can use c flag (:h s_flags):
:%s/\(\d\+\)\(\d\d\d\)/\1,\2/gc
this line will leave the last 3 columns untouched, just do substitution on the content before it:
%s/\v(.*)((\|[^|]*){3}$)/\=substitute(submatch(1),'\v(\d+)(\d{3})','\1,\2','g').submatch(2)/g
Note that the above line will change 1000000 into 1000,000 instead of 1,000,000. Vim's printf() doesn't support %'d, which is a pity. If you do have numbers above one million, we can find other solutions.
update
I solved it myself by using 3 separate commands, one for every number string in the file:
%s/^\(\d*|[^|]*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
:%s/^\(\d*|[^|]*|\d\+,*\d*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
:%s/^\(\d*|[^|]*|\d\+,*\d*|\d\+,*\d*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
In case you want to use perl:
:%!perl -F'\|' -lane 'for(@F[2..4]) { s/(\d+)(\d{3})/\1,\2/;} print join "|", @F'
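A comparable sketch in Python, for anyone filtering the file outside Vim. It assumes each record sits on a single line (the sample above looks wrapped) and that the heights are fields 2 to 4, counting from 0; add_commas is a made-up name. Using the "," format specifier also groups numbers above one million correctly.

```python
def add_commas(line, columns=(2, 3, 4)):
    """Insert thousands separators into the given pipe-delimited columns only."""
    fields = line.split("|")
    for i in columns:
        # Skip missing columns and anything that is not a plain digit string.
        if i < len(fields) and fields[i].isdigit():
            fields[i] = f"{int(fields[i]):,}"
    return "|".join(fields)


print(add_commas("34|Gurla Mandhata|7694|25243|2788|Nalakankar Himalaya|Dhaulagiri|1985|6 (4)|China"))
```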

Regex lookbehind - excluding words from searches

I need to search my corpus for words such as game or shame, but I would like the search to exclude three patterns: a game/a shame, A game/A shame, and a/an/A/An WORD game or a/an/A/An WORD shame, where WORD is a modifier, e.g., a great game or a great shame.
If someone could help me out, that would be great, thanks!
In my corpus, the optional WORD between the indefinite article a/an and game or a/an and shame is most commonly great and real. So even excluding these two, would already help me a lot.
The lookbehind below works perfectly to exclude a/A
(?<!a\s|A\s)\bshame\b
To exclude the modifying WORD, I tried using ?\w in the lookbehind, but it just wouldn't work. The pattern below, without the ?, runs and still excludes examples such as a shame, but it also returns undesired examples such as a great shame or a crying shame (see concordance lines (3) and (4) in the sample text below):
(?<!a\s|A\s|a\b\w\b|A\b\w\b)\bshame\b
The tool I'm using to implement regex is AntConc, which supports Perl regular expressions.
Sample text with two irrelevant examples (3 & 4) after using the search string below
(?<!a\s|A\s)\bshame\b
1 (match shame)
, people ogling from the sidelines. If you want a closer look, you have to ring for entry and wait to be admitted. I guess me and Saul just have no shame (or just know the benefits of our bank accounts being in hard currencies), because we wandered into plenty. Lots and lots of little boutiques and edgily designed fashion stores with music blaring.& abbutterflie.txt 47 1
2 (match shame)
last twenty years and I've experienced all sorts of biggotry but I seriously thought that anti black nazism in football wass a thing of the past. You should all hang your heads in shame, bunch of [badword]s. adamdphillips.txt 57 1
3 (don't match shame)
me monetarily as I wasn't that close to her, but she was really good friends with the other girl and it's messed that up for them a bit, which is a great shame. Anyway, Holly and I have since found somewhere to move in just the two of us. It's going to cost an absolute fortune and I'm going to be eating basics beans on aderyn.txt 60 1
4 (don't match shame)
are loads of amazingly good bands out there, gigging up and down the country who will never get signed because no-one can figure out how to market them, and this is a crying shame. There are artists out there like Thea Gilmore and <a href="http://blog.amandapalmer.net/" rel="nofollow"> Amanda Palmer& aderyn.txt 60 2
5 (match shame)
/><br />"There is no better time to show these terrorists that we have no fear of them. Instead we are forced, through the cowardly acts of our superiors, to hide in shame."<br /><br />But Herb Wiseman, high school consultant for Lee County, Florida, pointed to the July 7 London bombings.<br /><br />"What happens if kids get on aggy91.txt 64 1
Because variable length negative lookbehinds are not allowed, the approach in your previous question's answer won't transfer to this one.
I've gone with a (*SKIP)(*FAIL) pattern. This will match and discard the disqualified matches, and only retain qualifying matches:
/[Aa]n?( \w+)? shame(*SKIP)(*FAIL)|shame/ 3844 steps (Demo)
Or if you wish to include word boundary metacharacters:
/\b[Aa]n?( \w+)? shame\b(*SKIP)(*FAIL)|\bshame\b/ 4762 steps (Demo)
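The (*SKIP)(*FAIL) verbs are PCRE/Perl features and, to my knowledge, are not available in Python's built-in re module. If you ever need the same effect there, a common workaround is to match the unwanted contexts in the first alternative and capture only the wanted word in the second, then keep the matches where the capture group is set. A sketch, with hypothetical names:

```python
import re

# First alternative swallows "a/an (WORD) shame"; only the second one captures.
SHAME = re.compile(r"\b[Aa]n?(?: \w+)? shame\b|\b(shame)\b")


def wanted_shames(text):
    """Return the start offsets of 'shame' not preceded by a/an plus an optional modifier."""
    return [m.start(1) for m in SHAME.finditer(text) if m.group(1)]


sample = "I have no shame. It is a great shame. What a shame! Hide in shame."
print(wanted_shames(sample))
```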

How to programmatically learn regexes?

My question is a continuation of this one. Basically, I have a table of words like so:
HAT18178_890909.098070313.1
HAT18178_890909.098070313.2
HAT18178_890909.143412462.1
HAT18178_890909.143412462.2
For my purposes, I do not need the terminal .1 or .2 for this set of names. I can manually write the following regex (using Python syntax):
r = re.compile('(.*\.\d+)\.\d+')
However, I cannot guarantee that my next set of names will have a similar structure where the final 2 characters will be discardable - it could be 3 characters (i.e. .12) and the separator could change as well (i.e. . to _).
What is the appropriate way to either explicitly learn a regex or to determine which characters are unnecessary?
It's an interesting problem.
X y
HAT18178_890909.098070313.1 HAT18178_890909.098070313
HAT18178_890909.098070313.2 HAT18178_890909.098070313
HAT18178_890909.143412462.1 HAT18178_890909.143412462
HAT18178_890909.143412462.2 HAT18178_890909.143412462
The problem is that there is not a single solution but many.
Even for a human it is not clear what the regex should be that you want.
Based on this data, I would think the possibilities to learn are:
Just match a fixed width of 25: .{25}
Fixed first part: HAT18178_890909.
Then:
There's only 2 varying numbers on each single spot (as you show 2 cases).
So e.g. [01] (either 0 or 1), [94] the next spot and so on would be a good solution.
The obvious one would be \d+
But it could also be \d{9}
You see, there are multiple correct answers.
These regexes would still work if the second dot were an underscore instead.
My conclusion:
The problem is that it is much more work to prepare the data for machine learning than it is to create a regex. If you want to be sure you cover everything, you need to have complete data, so then a regex is probably less effort.
You could split on non-alphanumeric characters;
[^a-zA-Z0-9']+
In this case, that would get you a few strings like this:
HAT18178
890909
098070313
1
From there on you can simply discard the last one if that's never necessary, and continue on processing the first sequences
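A small Python sketch of that idea, under the assumption that the part to discard is always the last alphanumeric token together with the separator in front of it (whatever that separator is). Rather than splitting and rejoining, which would lose the original separators, it removes the trailing token in place; strip_last_token is a made-up name.

```python
import re

# One or more separator characters followed by a final alphanumeric token.
TRAILING = re.compile(r"[^a-zA-Z0-9']+[a-zA-Z0-9']+$")


def strip_last_token(name):
    """Drop the final token and its separator; names without a separator are unchanged."""
    return TRAILING.sub("", name, count=1)


print(strip_last_token("HAT18178_890909.098070313.1"))
```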