regex matching multiple values when they might not exist - regex

I am trying to write a preg_match_all to match horse race distances.
My source lists races as:
xmxfxy
I want to match the m value, the f value, the y value. However different races will maybe only have m, or f, or y, or two of them or even all three.
// e.g. $raw = '5f213y';
preg_match_all('/(\d{1,})m|(\d{1,})f|(\d{1,})y/', $raw, $distance);
The above sort of works, but the matches appear in unpredictable positions in the returned array. I guess it is because it runs the match three times, once for each alternative. How do I match all three (which may or may not exist) in a single run?
EDIT
A full sample string is:
Hardings Catering Services Handicap (Div I) Cl6 5f213y

If I understand you correctly, you're processing listings (like the one in your question) one at a time. If that's the case, you should be using preg_match, not preg_match_all, and the regex should match the whole "distance" code, not individual components of it. Try this:
preg_match('#\b(?:(?<M>\d+)m|(?<F>\d+)f|(?<Y>\d+)y){1,3}\b#',
$raw, $distance);
The results are now stored in a one-dimensional array, but you don't need to worry about the group numbers anyway; you can access them by name instead (e.g., $distance['M'], $distance['F'], $distance['Y']).
Note that, while this regex matches codes with one, two, or three components, it doesn't require the letters to be unique. There's nothing to stop it from matching something like 1m2m3m (a weakness shared by your own approach, by the way).

You can use "?" to make each group optional:
preg_match_all('/((\d{1,})m)?|((\d{1,})f)?|((\d{1,})y)?/', $raw, $distance);

If I understand what you're asking correctly, you would like to get each number from these values separately? This works for me:
$input = "Hardings Catering Services Handicap (Div I) Cl6 5f213y";
preg_match_all('/((\d+)(m|f|y))/', $input, $matches);
After the preg_match_all() executes, $matches[2] holds an array of the numbers that matched (in this case, $matches[2][0] is 5 and $matches[2][1] is 213).
If all three values exist, m will be in $matches[2][0], f in $matches[2][1], and y in $matches[2][2]. If any values are missing, the next value gets bumped up a spot. It may also come in handy that $matches[3] will hold an array of the corresponding letter matched on, so if you need to check whether it was an m, f, or y, you can.
If this isn't what you're after, please provide an example of the output you would like to see for this or another sample input.

Related

Matlab: What's the most efficient approach to parse a large table or cell array with regexp when sometimes there is no match?

I am working with a messy manually maintained "database" that has a column containing a string with name,value pairs. I am trying to parse the entire column with regexp to pull out the values. The column is huge (>100,000 entries). As a proxy for my actual data, let's use this code:
line1={'''thing1'': ''-583'', ''thing2'': ''245'', ''thing3'': ''246'', ''morestuff'':, '''''};
line2={'''thing1'': ''617'', ''thing2'': ''239'', ''morestuff'':, '''''};
line3={'''thing1'': ''unexpected_string(with)parens5'', ''thing2'': 245, ''thing3'':''246'', ''morestuff'':, '''''};
mycell=vertcat(line1,line2,line3);
This captures the general issues encountered in the database. I want to extract what thing1, thing2, and thing3 are in each line using cellfun to output a scalar cell array. They should normally be 3 digit numbers, but sometimes they have an unexpected form. Sometimes thing3 is completely missing, without the name even showing up in the line. Sometimes there are minor formatting inconsistencies, like single quotes missing around the value, spaces missing, or dashes showing up in front of the three digit value. I have managed to handle all of these, except for the case where thing3 is completely missing.
My general approach has been to use expressions like this:
expr1='(?<=thing1''):\s?''?-?([\w\d().]*?)''?,';
expr2='(?<=thing2''):\s?''?-?([\w\d().]*?)''?,';
expr3='(?<=thing3''):\s?''?-?([\w\d().]*?)''?,';
This looks behind for thingX' and then tries to match : followed by zero or one spaces, followed by 0 or 1 single quote, followed by zero or one dash, followed by any combination of letters, numbers, parentheses, or periods (this is defined as the token), using a lazy match, until zero or one single quote is encountered, followed by a comma. I call regexp as regexp(___,'tokens','once') to return the matching token.
The problem is that when there is no match, regexp returns an empty array. This prevents me from using, say,
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),mycell);
unless I call it with 'UniformOutput',false. The problem with that is twofold. First, I need to then manually find the rows where there was no match. For example, I can do this:
emptyout=cellfun(@(x) isempty(x),out);
emptyID=find(emptyout);
backfill=cell(length(emptyID),1);
[backfill{:}]=deal('Unknown');
out(emptyID)=backfill;
In this example, emptyID has a length of 1 so this code is overkill. But I believe this is the correct way to generalize for when it is longer. This code will change every empty cell array in out with the string Unknown. But this leads to the second problem. I've now got a 'messy' cell array of non-scalar values. I cannot, for example, check unique(out) as a result.
Pardon the long-windedness but I wanted to give a clear example of the problem. Now my actual question is in a few parts:
Is there a way to accomplish what I'm trying to do without using 'UniformOutput',false? For example, is there a way to have regexp pass a custom string if there is no match (e.g. pass 'Unknown' if there is no match)? I can think of one 'cheat', which would be to use the | operator in the expression, and if the first token is not matched, look for something that is ALWAYS found. I would then still need to double back through the output and change every instance of that result to 'Unknown'.
If I take the 'UniformOutput',false approach, how can I recover a scalar cell array at the end to easily manipulate it (e.g. pass it through unique)? I will admit I'm not 100% clear on scalar vs nonscalar cell arrays.
If there is some overall different approach that I'm not thinking of, I'm also open to it.
Tangential to the main question, I also tried using a single expression to run regexp using 3 tokens to pull out the values of thing1, thing2, and thing3 in one pass. This seems to require 'UniformOutput',false even when there are no empty results from regexp. I'm not sure how to get a scalar cell array using this approach (e.g. an Nx1 cell array where each cell is a 3x1 cell).
At the end of the day, I want to build a table using these results:
mytable=table(out1,out2,out3);
Edit: Using celldisp sheds some light on the problem:
celldisp(out)
out{1}{1} =
246
out{2} =
Unknown
out{3}{1} =
246
I assume that I need to change the structure of out so that the contents of out{1}{1} and out{3}{1} are instead just out{1} and out{3}. But I'm not sure how to accomplish this if I need 'UniformOutput',false.
Note: I've not used MATLAB and this doesn't answer the "efficient" aspect, but...
How about forcing there to always be a match?
Since what you really want is for every entry to produce a match, how about allowing an empty match?
Looking at the MATLAB help page, I can see an 'emptymatch' option; perhaps this is something to try.
E.g.
the_thing_i_want_to_find|
Match "the_thing_i_want_to_find" or an empty match, note the | character.
In capture group it might look like this:
(the_thing_i_want_to_find|)
As a workaround, I have found that regexprep can be used to find entries where thing3 is missing. For example:
replace='$1 ''thing3'': ''Unknown'', ''morestuff''';
missingexpr='(?<=thing2'':\s?)(''?-?[\w\d().]*?''?,) ''morestuff''';
regexprep(mycell{2},missingexpr,replace)
ans =
''thing1': '617', 'thing2': '239', 'thing3': 'Unknown', 'morestuff':, '''
Applying it to the entire array:
fixedcell=cellfun(@(x) regexprep(x,missingexpr,replace),mycell);
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),fixedcell,'UniformOutput',false);
This feels a little roundabout, but it works.
cellfun can be replaced with a plain old for loop. Your code will be equally fast, or maybe even faster; cellfun is implemented with a loop anyway, so there is no advantage to using it other than fewer lines of code. In your explicit loop, you can then check the output of regexp and build your output array any way you like.

How do I find strings that only differ by their diacritics?

I'm comparing three lexical resources. I use entries from one of them to create queries — see first column — and see if the other two lexicons return the right answers. All wrong answers are written to a text file. Here's a sample out of 3000 lines:
réincarcérer<IND><FUT><REL><SG><1> réincarcèrerais réincarcérerais réincarcérerais
réinsérer<IND><FUT><ABS><PL><1> réinsèrerons réinsérerons réinsérerons
macérer<IND><FUT><ABS><PL><3> macèreront macéreront macéreront
répéter<IND><FUT><ABS><PL><1> répèterons répéterons répéterons
The first column is the query, the second is the reference. The third and fourth columns are the results returned by the lexicons. The values are tab-separated.
I'm trying to identify answers that only differ from the reference by their diacritics. That is, répèterons répéterons should match because the only difference between the two is that the second part has an acute accent on the e rather than a grave accent.
I'd like to match the entire line. I'd be grateful for a regex that would also identify answers that differ by their gemination — the following two lines should match because martellerait has two ls while martèlerait only has one.
modeler<IND><FUT><ABS><SG><2> modelleras modèleras modèleras
marteler<IND><FUT><REL><SG><3> martellerait martèlerait martèlerait
The last two values will always be identical. You can focus on values #2 and 3.
The first part can be achieved by doing a lossy conversion to ASCII and then doing a direct string comparison. Note that converting to ASCII effectively removes the diacritics.
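For instance, in Python (the question doesn't name a language, so this is just an illustrative sketch), the lossy conversion can be done with Unicode decomposition:
import unicodedata

def strip_accents(s):
    # NFKD decomposition separates base letters from combining accents,
    # which the ASCII encoding then silently drops.
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')

def differs_only_by_diacritics(reference, answer):
    return reference != answer and strip_accents(reference) == strip_accents(answer)

print(differs_only_by_diacritics('répéterons', 'répèterons'))  # True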
To do the second part is not possible (as far as I know) with a regex pattern. You will need to do some research into things like the Levenshtein distance.
EDIT:
This regex will match duplicate consonants. It might be helpful for your gemination problem.
([b-df-hj-np-tv-xz])\\1+
Which means:
([b-df-hj-np-tv-xz]) # Match only consonants
\\1+ # Match what was captured in the first capture group, one or more times
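As a rough sketch (my own illustration, in Python), collapsing doubled consonants with that pattern and combining it with the diacritic stripping shown above lets you compare the two columns loosely:
import re
import unicodedata

def normalize(s):
    # Strip diacritics, then collapse runs of doubled consonants.
    s = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
    return re.sub(r'([b-df-hj-np-tv-xz])\1+', r'\1', s)

print(normalize('martellerait') == normalize('martèlerait'))  # True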

How to programmatically learn regexes?

My question is a continuation of this one. Basically, I have a table of words like so:
HAT18178_890909.098070313.1
HAT18178_890909.098070313.2
HAT18178_890909.143412462.1
HAT18178_890909.143412462.2
For my purposes, I do not need the terminal .1 or .2 for this set of names. I can manually write the following regex (using Python syntax):
r = re.compile('(.*\.\d+)\.\d+')
However, I cannot guarantee that my next set of names will have a similar structure where the final 2 characters are discardable - it could be 3 characters (e.g. .12), and the separator could change as well (e.g. . to _).
What is the appropriate way to either explicitly learn a regex or to determine which characters are unnecessary?
It's an interesting problem.
X y
HAT18178_890909.098070313.1 HAT18178_890909.098070313
HAT18178_890909.098070313.2 HAT18178_890909.098070313
HAT18178_890909.143412462.1 HAT18178_890909.143412462
HAT18178_890909.143412462.2 HAT18178_890909.143412462
The problem is that there is not a single solution but many.
Even for a human, it is not clear what the regex you want should be.
Based on this data, I would think the possibilities to learn are:
Just match a fixed width of 25: .{25}
Fixed first part: HAT18178_890909.
Then:
There are only 2 varying digits in each position (since you show 2 cases).
So, e.g., [01] (either 0 or 1) for the first spot, [94] for the next spot, and so on would be a good solution.
The obvious one would be \d+
But it could also be \d{9}
You see, there are multiple correct answers.
These regexes would still work if the second dot were an underscore instead.
My conclusion:
The problem is that it is much more work to prepare the data for machine learning than it is to create a regex. If you want to be sure you cover everything, you need to have complete data, so then a regex is probably less effort.
You could split on non-alphanumeric characters:
[^a-zA-Z0-9']+
That would get you, in this case, a few strings like this:
HAT18178
890909
098070313
1
From there on you can simply discard the last one if it's never necessary, and continue processing the first sequences.
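In Python (matching the syntax used in the question), the split could look like this, using the pattern above:
import re

name = "HAT18178_890909.098070313.1"
parts = re.split(r"[^a-zA-Z0-9']+", name)
print(parts)        # ['HAT18178', '890909', '098070313', '1']
print(parts[:-1])   # drop the trailing counter if it is never needed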

Find rows that contain three digit number

I need to subset rows that contain <three digit number>
I wrote
foo <- grepl("<^[0-9]{3}$>", log1[,2])
others <- log1[!foo,]
but I'm not really sure how to use regex...just been using cheat sheets and Google. I think the < and > characters are throwing it off.
You almost had it. Try
^<[0-9]{3}>$
It might behoove you to read about anchors (^ and $).
The ^ and $ signs refer to the beginning and end of the string, respectively. You shouldn't be matching anything before or after them.
If you want rows that contain that pattern, you shouldn't use the anchors at all. You should just use this: <[0-9]{3}> (or shorten it to <\\d{3}>)
Just for posterity, I thought I would contribute what I think is the implied answer to the OP's stated question.
It seems the OP wants to exclude rows of a data frame where the second column contains a 3-digit integer. This can be done quite easily using the 'nchar' function to count the number of characters in each number, like so:
others <- log1[nchar(log1[,2])!=3,]
We are simply creating an array with the number of characters contained in each row of column 2 and selecting that row if the number does not equal 3.

Checking if a string contains an English sentence

As of right now, I decided to take a dictionary and iterate through the entire thing. Every time I see a newline, I make a string containing the text from that newline to the next newline, then I do string.find() to see if that English word is somewhere in there. This takes a VERY long time, each word taking about 1/4 to 1/2 of a second to verify.
It is working perfectly, but I need to check thousands of words a second. I can run several windows, which doesn't affect the speed (Multithreading), but it still only checks like 10 a second. (I need thousands)
I'm currently writing code to pre-compile a large array containing every word in the English language, which should speed it up a lot, but still not get the speed I want. There has to be a better way to do this.
The strings I'm checking will look like this:
"hithisisastringthatmustbechecked"
but most of them contain complete garbage, just random letters.
I can't check for impossible combinations of letters, because that string would be thrown out due to the 'tm' in 'thatmust'.
You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.
Go through every dictionary word and build a search table for it. You need to do this only once. Now your searches for individual words will proceed at a faster pace, because the "false starts" will be eliminated.
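Here is a minimal, illustrative KMP sketch in Python (not tuned for your setup; dictionary and s stand in for your word list and input string). The table for each word is built once and reused for every string you check.
def build_table(pattern):
    # table[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it (the classic KMP failure function).
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table

def kmp_find(text, pattern, table):
    # True if pattern occurs in text; text characters are never re-examined.
    k = 0
    for ch in text:
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return True
    return False

# tables = {w: build_table(w) for w in dictionary}  # build once
# hits = [w for w in dictionary if kmp_find(s, w, tables[w])]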
There are a lot of strategies for doing this quickly.
Idea 1
Take the string you are searching and make a copy of each possible substring beginning at some column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter is used twice, store the longer substring.)
So the array looks like this:
a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth
Then, for each word in the dictionary, search in the array element indicated by its first letter. This limits the amount of stuff that has to be searched. Plus, you can't ever find a word beginning with, say, 'r' anywhere before the first 'r' in the string. And some words won't even need a search if their first letter isn't in the string at all.
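A small Python sketch of how that index could look (my own illustration, not the poster's code). Since the stored substring for a letter starts at that letter's first occurrence, checking whether the word occurs in index[word[0]] is enough:
def build_suffix_index(s):
    # Map each letter to the longest substring of s that starts with it,
    # i.e. the suffix beginning at that letter's first occurrence.
    index = {}
    for i in range(len(s) - 1, -1, -1):
        index[s[i]] = s[i:]  # earlier occurrences overwrite later ones
    return index

def contains_word(index, word):
    suffix = index.get(word[0])  # letters absent from s skip the search entirely
    return suffix is not None and word in suffix

index = build_suffix_index("hithisisastringthatmustbechecked")
print(contains_word(index, "string"))  # True
print(contains_word(index, "zebra"))   # False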
Idea 2
Expand upon that idea by noting the length of the longest word in the dictionary, and drop letters from the stored strings that are further away than that distance.
So you have this in the array:
a - substr[0] = "astringthatmustbechecked"
But if the longest word in the list is 5 letters, there is no need to keep any more than:
a - substr[0] = "astri"
If the letter is present several times you have to keep more letters. So this one has to keep the whole string because the "e" keeps showing up less than 5 letters apart.
e - substr[4] = "echecked"
You can expand upon this by using the longest words starting with any particular letter when condensing the strings.
Idea 3
This has nothing to do with ideas 1 and 2. It's an idea that you could use instead.
You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is possible to write the regular expression too and then apply it.
Assume these are the words in the dictionary:
arun
bob
bill
billy
body
jose
Build this sort of linked structure. (It's a binary tree, really, represented in such a way that I can explain how to use it.)
a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
|    |         |
|    |         y -> *
|    |
|    o -> b -> *
|         |
|         d -> y -> *
|
j -> o -> s -> e -> *
The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.
The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".
The regular expression looks sort of like: /(arun)|(b(ill(y?)|o(b|dy)))|(jose)/. This gives the gist of creating it as a regex.
Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking the alternatives, and if one matches, move forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move to the next column.
This is a lot of work but can, sometimes, be handy.
Side note: I built one of these some time back by writing a program that wrote the code that ran the algorithm directly, instead of having code looking at the binary tree data structure.
Think of each set of vertical bar options being a switch statement against a particular character column and each arrow turning into a nesting. If there is only one option, you don't need a full switch statement, just an if.
That was some fast character matching and really handy for some reason that eludes me today.
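For what it's worth, here is a rough Python sketch of the same idea using a nested-dict trie instead of generating the regex (my own illustration; '*' marks the end of a dictionary word):
def build_trie(words):
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node['*'] = True  # end-of-word marker
    return trie

def contains_any(s, trie):
    # Try to match a dictionary word starting at every column of s.
    for start in range(len(s)):
        node = trie
        for ch in s[start:]:
            if ch not in node:
                break
            node = node[ch]
            if '*' in node:
                return True
    return False

words = ['arun', 'bob', 'bill', 'billy', 'body', 'jose']
print(contains_any('xxbodyxx', build_trie(words)))  # True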
How about a Bloom Filter?
A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives.
The approach could work as follows: you create the set of words that you want to check against (this is done only once), and then you can quickly run the "in/not-in" check for every sub-string. If the outcome is "not-in", you are safe to continue (Bloom filters do not give false negatives). If the outcome is "in", you then run your more sophisticated check to confirm (Bloom filters can give false positives).
It is my understanding that some spell-checkers rely on bloom filters to quickly test whether your latest word belongs to the dictionary of known words.
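A toy Bloom filter sketch in Python (purely illustrative and not tuned; a real implementation would size the bit array and hash count from the expected number of words and the target false-positive rate):
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a big integer used as the bit array

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

bloom = BloomFilter()
for word in ["hi", "this", "is", "a", "string"]:  # build once from the dictionary
    bloom.add(word)
print("this" in bloom)  # True
print("zzzq" in bloom)  # False (no false negatives; rare false positives possible)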
This code was modified from How to split text without spaces into list of words?:
from math import log

words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    costsum = 0
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        costsum += c
        i -= k
    return costsum
Using the same dictionary of that answer and testing your string outputs
>>> infer_spaces("hithisisastringthatmustbechecked")
294.99768817854056
The trick here is finding out what threshold you can use, keeping in mind that using smaller words makes the cost higher (if the algorithm can't find any usable word, it returns inf, since it would split everything to single-letter words).
In theory, I think you should be able to train a Markov model and use that to decide if a string is probably a sentence or probably garbage. There's another question about doing this to recognize words, not sentences: How do I determine if a random string sounds like English?
The only difference for training on sentences is that your probability tables will be a bit larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary; even 5 or so books by different authors should be enough for very accurate classification).
Since your sentences are mashed together without clear word boundaries, it's a bit tricky, but the good news is that the Markov model doesn't care about words, just about what follows what. So, you can make it ignore spaces, by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as your training text, the first paragraph would, perhaps, look like so:
alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation
It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.
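To make the idea concrete, here is a rough character-bigram sketch in Python (my own illustration, not a full Markov toolkit): train a log-probability table on space-stripped English text, and score new strings by their average per-bigram log-likelihood.
import math
from collections import defaultdict

def train(text, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # Keep only letters (spaces stripped, as described above) and count bigrams.
    text = "".join(ch for ch in text.lower() if ch in alphabet)
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    # Convert counts into log-probabilities conditioned on the preceding letter.
    model = {}
    for a, row in counts.items():
        total = sum(row.values())
        model[a] = {b: math.log(n / total) for b, n in row.items()}
    return model

def avg_log_likelihood(s, model, floor=math.log(1e-6)):
    # Average log-probability per bigram; unseen bigrams get a small floor value.
    pairs = list(zip(s, s[1:]))
    if not pairs:
        return floor
    return sum(model.get(a, {}).get(b, floor) for a, b in pairs) / len(pairs)

# english = train(open("alice.txt").read())  # hypothetical training file
# score = avg_log_likelihood("hithisisastringthatmustbechecked", english)
# Strings scoring above a threshold tuned on held-out examples are treated as
# English-like; the rest as garbage.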
I see that you are concerned about time: Training may take a few minutes (assuming you have already compiled gold standard "sentences" and "random scrambled strings" texts). You only need to train once, you can easily save the "trained" model to disk and reuse it for subsequent runs by loading from disk, which may take a few seconds. Making a call on a string would take a trivially small number of floating point multiplications to get a probability, so after you finish training it, it should be very fast.