Exclude elements from vector based on regular expression pattern - regex

I have some data which I want to clean up using a regular expression in R.
It is easy to find how to get elements that contain certain patterns, or do not contain certain words (strings), but I can't find out how to exclude elements containing a pattern.
How could I use a general function to only keep those elements from a vector which do not contain PATTERN?
I prefer not to give an example, as this might lead people to answer using other (though usually nice) ways than the intended one: excluding based on a regular expression. Here goes anyway:
How to exclude all the elements that contain any of the following characters:
'pyfgcrl
vector <- c("Cecilia", "Cecily", "Cecily's", "Cedric", "Cedric's", "Celebes",
"Celebes's", "Celeste", "Celeste's", "Celia", "Celia's", "Celina")
The result would be an empty vector in this case.

Edit: From the comments, and with a little testing, one would find that my suggestion wasn't correct.
Here are two correct solutions:
vector[!grepl("['pyfgcrl]", vector)] ## kohske
grep("['pyfgcrl]", vector, value = TRUE, invert = TRUE) ## flodel
If either of them wants to re-post and accept credit for their answer, I'm more than happy to delete mine here.
Explanation
The general function that you are looking for is grepl. From the help file for grepl:
grepl returns a logical vector (match or not for each element of x).
Additionally, you should read the help page for regex which describes what character classes are. In this case, you create a character class ['pyfgcrl], which says to look for any character in the square brackets. You can then negate this with !.
So, up to this point, we have something that looks like:
!grepl("['pyfgcrl]", vector)
To get what you are looking for, you subset as usual.
vector[!grepl("['pyfgcrl]", vector)]
For the second solution, offered by @flodel, grep by default returns the positions where a match is made, and the value = TRUE argument lets you return the actual string values instead. invert = TRUE means to return the values that were not matched.

Related

Negating Regular Expression for Price

I have a regular expression for matching price where decimals are optional like so,
/[0-9]+(\.[0-9]{1,2})?/
Now what I would like to do is get the inverse of the expression, but having trouble doing so. I came up with something simple like,
/[^0-9.]/g
But this allows for multiple '.' characters and more than 2 numbers after the decimal. I am using jQuery replace function on blur to correct an input price field. So if a user types in something like,
"S$4sd3.24151 . x45 blah blah text blah" or "!#%!$43.24.234asdf blah blah text blah"
it will return
43.24
Can anyone offer any suggestions for doing this?
I would do it in two steps. First replace every character that is neither a digit nor a dot with nothing.
/[^0-9.]//g
This will yield 43.24151.45 and 43.24.234 for the first and second example respectively.
Then you can use your first regex to match the first occurrence of a valid price.
/\d+(\.\d{1,2})?/
Doing this will give you 43.24 for both examples.
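A minimal sketch of the two-step approach in JavaScript, assuming the integer part may have several digits (so the second pattern uses \d+, like the [0-9]+ in the question). The function name cleanPrice is hypothetical and only used for illustration.

```javascript
// Hypothetical helper: strip everything except digits and dots,
// then keep the first substring that looks like a price.
function cleanPrice(input) {
  // Step 1: remove every character that is not a digit or a dot.
  const stripped = input.replace(/[^0-9.]/g, "");
  // Step 2: take the first run of digits with an optional 1-2 digit fraction.
  const match = stripped.match(/\d+(\.\d{1,2})?/);
  return match ? match[0] : "";
}

const firstExample = cleanPrice("S$4sd3.24151 . x45 blah blah text blah");
const secondExample = cleanPrice("!#%!$43.24.234asdf blah blah text blah");
// Both calls return "43.24".
```

An input with no digits at all returns an empty string here; whether that is the right fallback for a price field is a design choice left open.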
I suppose in programming, it is not always clear what "inverse" means.
To suggest a solution exclusively based on the example that you presented, I will present one that is very similar to what Vince presented. I am having difficulty composing a Regular Expression that both matches the pattern that you need and captures a potentially arbitrary number of digits, through repeating capture groups. And I am not sure whether this would be doable in some reasonable way (perhaps someone else does). But a two step approach should be straightforward.
To note, I suspect that you are referring to JavaScript's replace function, which is a member of the String Object, and not jQuery replaceWith and replaceAll functions, in referring to 'jQuery replace function.' The latter are 'Dom manipulation' functions. But, correct me if I misunderstood.
As an example, based on some hypothetical input, you could use
var numeric_raw = jQuery('input.textbox').attr ('value').replace (/[^0-9.]/g, "")
to remove all characters from a value entered in a text field that are not digits or periods;
then you could use
var numeric_str = numeric_raw.replace (/^[0]*(\d+\.\d{1,2}).*$/, "$1")
The difference between the pattern here and the one in Vince's answer is that mine also filters out leading 0s.
To note, in Vince's first reg ex, there might be an extra '/' -- but perhaps it has a purpose that I didn't catch.
With respect to "inverse," one way to understand your initial inquiry is that you are looking for an expression that does the opposite of the one that you provided.
To note, while the expression that you provided (/[0-9]+(\.[0-9]{1,2})?/) does match both whole numbers and decimal numbers with up to two fractional digits, it also matches any single digit -- so, it may identify a match where one might not be envisioned, for a given input string. The expression does not have anchors ('^', '$'), and so might allow multiple possible matches. For example, in the string "1.111", both "1.11" and "1" match the pattern that you provided.
It appears to me that the following pattern matches any string that does not match your pattern, or at least does so in most cases:
/^(?:(?!.*[0-9]+(\.[0-9]{1,2})?).*)*$/
-- if someone could identify a precisely 'inverse' pattern, please feel free -- I am having some trouble understanding how lookaheads are interpreted at least for some nuances.
This relies on "negative lookahead" functionality, which JavaScript these days supports. You could refer to several stackoverflow postings for more information (eg. Regular Expressions and negating a whole character group), and there are multiple resources that could be found on the Internet that discuss "lookahead" and "lookbehind."
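As a small illustration of negative lookahead, the sketch below compares a simplified, anchored lookahead pattern (my own simplification, not the exact expression given above) against simply negating test() on the original price pattern. The sample strings are made up.

```javascript
// The price pattern from the question (unanchored).
const pricePattern = /[0-9]+(\.[0-9]{1,2})?/;
// Simplified "inverse": the whole string must contain no price-like substring.
const noPricePattern = /^(?!.*[0-9]+(\.[0-9]{1,2})?).*$/;

// Made-up single-line samples.
const lookaheadSamples = ["blah blah text", "price 43.24", "7", ""];
const lookaheadResults = lookaheadSamples.map((s) => ({
  input: s,
  hasPrice: pricePattern.test(s),
  noPrice: noPricePattern.test(s),
}));
// For each sample, noPrice is the logical negation of hasPrice.
```

For single-line input, !pricePattern.test(s) gives the same answer without a lookahead, which may be the simpler option when the check happens in code rather than inside a larger pattern.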
I suppose this answer carries some redundancy with respect to the one already given -- I might have commented on the Original Poster's post or on Vince's answer (instead of writing at least parts of my answer), but I am not yet able to make comments!

Extracting static strings from a regular expression

I'm trying to efficiently extract static strings (strings that MUST be matched for a given regular expression to match). I've been able to do it in the simplest cases but I'm trying to discover a more robust solution.
Given a regex such as the one below
"fox jump(ed|ing|s)"
would give us
"fox,jumped,jumping,jumps"
Another example is
"fox jump(ed|ing|s)?"
which would give us
"fox,jump"
because of the optional operator
The algorithm I have is overly simple for now. It starts from the end of the regex and removes groups or single characters followed by the operators "*" or "?", and it "explodes" grouped OR operators "(|)". This has worked quite well but doesn't take into consideration the full syntax of a regex. You can think of it as kind of a minimal set generating process for a regex (the minimal set of strings that the regex can "generate/must match").
WHY?
I'm trying to match a bunch of text against a large set of regexes. If I can get a list of "required" keywords for these regexes, I can do a quick text search for those keywords to filter down to the regexes I care about (ignoring the ones that are guaranteed not to match, or even skipping the text entirely when no regex in the set can match). I can organize this set of keywords in an efficient data structure (binary search, trie, Aho-Corasick) to filter the set of regexes before I even try to run the text through the finite automata. There are extremely fast string matching algorithms that I can run as a filtering stage before I attempt to run a regular expression. I've been able to increase throughput many-fold with this simple process.
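The filtering stage described above could be sketched roughly as follows. The rule set, the keyword field, and the function name matchingRules are all hypothetical; a plain substring check stands in for a trie or Aho-Corasick matcher.

```javascript
// Hypothetical rule set: each regex is paired with a literal string
// that any match of that regex must contain.
const keywordRules = [
  { keyword: "fox jump", regex: /fox jump(ed|ing|s)?/ },
  { keyword: "lazy dog", regex: /lazy dogs?/ },
];

// Run a regex only if its required keyword occurs in the text.
function matchingRules(text, rules) {
  return rules.filter(
    (rule) => text.includes(rule.keyword) && rule.regex.test(text)
  );
}

const hits = matchingRules("the quick fox jumps over it", keywordRules);
// hits contains only the "fox jump" rule; the "lazy dog" regex never runs.
```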
See the library Xeger which given a regular expression will give you all the possible strings that match.
You seem to only want to keep the common prefix of these strings (the part where you said to ignore optional operators) but if you do that you might capture strings that have that common prefix yet do not have the ending you want (such as "jumpy" in your example). If this is not a problem then just find the shortest string given by Xeger, assuming that optional operators occur only at the end of the regex.
If I understand your problem correctly, you are looking for a set of words such that all these words are (disjoint) substrings of any word accepted by the (given) regular expression.
My guess is that such a set will very often be empty, but nevertheless it can be found.
To find such a set, I propose the following algorithm:
Find the FA corresponding to your input regex.
Identify bridges ( https://en.wikipedia.org/wiki/Bridge_(graph_theory) ) between the starting state S and the accepting state F. This can, for example, be done by removing an edge E and asking whether a path still exists from S to F in the FA with E removed; if not, E is a bridge. Repeat this for all edges.
All edges that are bridges must be hit during an accepting run, and each edge corresponds to a letter of input.
You may now generate the required words by connecting subsequent bridge edges end-to-end.
I think it follows from the algorithm construction that an FA (and not a DFA) suffices for this to work. Again, a proof would be nice but I think it should work:)
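A rough sketch of the bridge idea, under several assumptions of mine: the automaton is hand-built for "fox jump(ed|ing|s)", has a single accepting state, and is stored as a plain edge list. The names findPath and requiredWords are hypothetical. An edge counts as required if removing it leaves no path from start to accept.

```javascript
// Breadth-first search; returns the edge indices of one path, or null.
function findPath(edges, start, goal) {
  const reachedBy = new Map(); // state -> index of the edge used to reach it
  const seen = new Set([start]);
  const queue = [start];
  while (queue.length > 0) {
    const state = queue.shift();
    if (state === goal) {
      const path = [];
      let cur = goal;
      while (cur !== start) {
        const i = reachedBy.get(cur);
        path.unshift(i);
        cur = edges[i].from;
      }
      return path;
    }
    edges.forEach((e, i) => {
      if (e.from === state && !seen.has(e.to)) {
        seen.add(e.to);
        reachedBy.set(e.to, i);
        queue.push(e.to);
      }
    });
  }
  return null;
}

// Walk one accepting path and join consecutive required edges into words.
function requiredWords(edges, start, goal) {
  const path = findPath(edges, start, goal);
  if (path === null) return [];
  const words = [];
  let current = "";
  for (const i of path) {
    const withoutEdge = edges.filter((_, j) => j !== i);
    if (findPath(withoutEdge, start, goal) === null) {
      current += edges[i].label; // required edge: extend the current word
    } else {
      if (current) words.push(current);
      current = "";
    }
  }
  if (current) words.push(current);
  return words;
}

// Hand-built automaton for "fox jump(ed|ing|s)"; state 99 is accepting.
const foxEdges = [];
"fox jump".split("").forEach((ch, i) => foxEdges.push({ from: i, to: i + 1, label: ch }));
foxEdges.push(
  { from: 8, to: 9, label: "e" }, { from: 9, to: 99, label: "d" },
  { from: 8, to: 10, label: "i" }, { from: 10, to: 11, label: "n" },
  { from: 11, to: 99, label: "g" }, { from: 8, to: 99, label: "s" }
);
const foxWords = requiredWords(foxEdges, 0, 99); // ["fox jump"]
```

Re-running the search once per edge is quadratic in the number of edges, which is probably fine for small automata but would need a proper bridge-finding algorithm for large ones.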

Can one find out which input characters matched which part of a regex?

I'm trying to build a tool that uses something like regexes to find patterns in a string (not a text string, but that is not important right now). I'm familiar with automata theory, i.e. I know how to implement basic regex matching, and output true or false if the string matches my regex, by simulating an automaton in the textbook way.
Say I'm interested in every a that comes before a b, with no more a's in between, so, this regex: a[^a]*b. But I don't just want to find out if my string contains such a part, I want to get the a as output, so that I can inspect it (remember, I'm not actually dealing with text).
In summary: Let's say I mark the a with parentheses, like so: (a)[^a]*b and run it on the input string bcadacb then I want the second a as output.
Or, more generally, can one find out which characters in the input string match which part of the regex? How is it done in text editors? They at least know where the match started, because they can highlight the matches. Do I have to use a backtracking approach, or is there a smarter, less computationally expensive, way?
EDIT: Proper back references, i.e. capturing with parens and referencing with \1, etc. may not be necessary. I do know that back references do introduce the need for backtracking (or something similar) and make the problem (IIRC) NP-hard. My question, in essence, is: Is the capturing part, without the back referencing, less computationally expensive than proper back references?
Most text editors do this by using a backtracking algorithm, in which case recording the match locations is trivial to add.
It is possible to do with a direct NFA simulation too, by augmenting the state lists with parenthesis location information. This can be done in a way that preserves the linear time guarantee. See http://swtch.com/~rsc/regexp/regexp2.html#submatch.
Timos's answer is on the right track, but you cannot tag DFA states, because a DFA state corresponds to a collection of possible NFA states, and so one DFA state might represent the possibility of having passed a paren (but maybe something else too) and if that turns out not to be the case, it would be incorrect to record it as fact. You really need to work on the NFA simulation instead.
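A minimal sketch of the NFA-simulation idea for the question's example (a)[^a]*b. The NFA is hand-built rather than compiled from the regex, and the names nfaTransitions and searchCapture are hypothetical. Each live thread carries the index it saved when it passed the parenthesis, and at most one thread is kept per state, so the work per input character is bounded by the size of the NFA.

```javascript
// Hand-built NFA for (a)[^a]*b: 0 --a--> 1, 1 --[^a]--> 1, 1 --b--> 2 (accept).
// "save" marks the transition that consumes the parenthesized character.
const nfaTransitions = [
  { from: 0, to: 1, test: (ch) => ch === "a", save: true },
  { from: 1, to: 1, test: (ch) => ch !== "a", save: false },
  { from: 1, to: 2, test: (ch) => ch === "b", save: false },
];
const ACCEPT_STATE = 2;

// Returns the input index captured by the group, or null if nothing matches.
function searchCapture(input) {
  let threads = new Map(); // state -> captured index carried by that thread
  for (let i = 0; i < input.length; i++) {
    // Unanchored search: a fresh attempt may start at every position.
    if (!threads.has(0)) threads.set(0, null);
    const next = new Map();
    for (const [state, captured] of threads) {
      for (const t of nfaTransitions) {
        if (t.from !== state || !t.test(input[i])) continue;
        const newCaptured = t.save ? i : captured;
        if (t.to === ACCEPT_STATE) return newCaptured;
        // Keep only the first (highest-priority) thread per state.
        if (!next.has(t.to)) next.set(t.to, newCaptured);
      }
    }
    threads = next;
  }
  return null;
}

const capturedIndex = searchCapture("bcadacb"); // 4, the second "a"
```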
After you have constructed your DFA for the matching, mark all states which correspond to the first state after an opening parenthesis in the regex. When you visit such a state, save the index of the current input character; when you visit a state which corresponds to a closing parenthesis, also save the index.
When you reach an accepting state, output the two indices. I am not sure if this is the algorithm used in text editors, but that's how I would do it.

regex - At most two pair of consecutives

I'm taking a computation course which also teaches about regular expressions. There is a difficult question that I cannot answer.
Find a regular expression for the language that accepts words that contain at most two pairs of consecutive 0's. The alphabet consists of 0 and 1.
First, I made an NFA of the language but cannot convert it to a GNFA (that can later be converted to a regex). How can I find this regular expression? With or without converting it to a GNFA?
(Since this is a homework problem, I'm assuming that you just want enough help to get started, and not a full worked solution?)
Your mileage may vary, but I don't really recommend trying to convert an NFA into a regular expression. The two are theoretically equivalent, and either can be converted into the other algorithmically, but in my opinion, it's not the most intuitive way to construct either one.
Instead, one approach is to start by enumerating various possibilities:
No pairs of consecutive zeroes at all; that is, every zero, except at the end of the string, must be followed by a one. So, the string consists of a mixed sequence of 1 and 01, optionally followed by 0:
(1|01)*(0|ε)
Exactly one pair of consecutive zeroes, at the end of the string. This is very similar to the previous:
(1|01)*00
Exactly one pair of consecutive zeroes, not at the end of the string — and, therefore, necessarily followed by a one. This is also very similar to the first one:
(1|01)*001(1|01)*(0|ε)
To continue that approach, you would then extend the above to support two pair of consecutive zeroes; and lastly, you would merge all of these into a single regular expression.
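One way to sanity-check expressions like the three above is a brute-force comparison against a direct pair counter. The sketch below is my own addition, not part of the original answer: it translates the three expressions to JavaScript syntax (writing (0|ε) as 0?), counts overlapping pairs so that 000 counts as two, and checks every binary string up to length 10.

```javascript
// Count positions i where s[i] and s[i+1] are both "0" (overlaps count).
function countZeroPairs(s) {
  let pairs = 0;
  for (let i = 0; i + 1 < s.length; i++) {
    if (s[i] === "0" && s[i + 1] === "0") pairs++;
  }
  return pairs;
}

// Every binary string of length 0..maxLen.
function allBinaryStrings(maxLen) {
  const out = [""];
  let layer = [""];
  for (let len = 1; len <= maxLen; len++) {
    layer = layer.flatMap((s) => [s + "0", s + "1"]);
    out.push(...layer);
  }
  return out;
}

// The three cases so far: no pair, one pair at the end, one pair elsewhere.
const atMostOnePair = [
  /^(1|01)*0?$/,
  /^(1|01)*00$/,
  /^(1|01)*001(1|01)*0?$/,
];

// Strings where the union of the three expressions disagrees with the counter.
const pairMismatches = allBinaryStrings(10).filter(
  (s) => atMostOnePair.some((re) => re.test(s)) !== (countZeroPairs(s) <= 1)
);
// pairMismatches is empty: together the three cases cover "at most one pair".
```

The same harness can be pointed at any candidate for the full language by swapping in the candidate expression and changing the bound to countZeroPairs(s) <= 2.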
(0+1)*00(0+1)*00(0+1)* + (0+1)*000(0+1)*
contains at most two pair of consecutive 0's
(1|01)*(00|ε)(1|10)*(00|ε)(1|10)*

Deduplicating an array of keywords (but not based on EXACT match)

I have a list of a few thousand terms. There is significant overlap in those terms, but in different forms. For example (ruby, a_ruby), (triathlon, triathlete, triathletes), (nonprofit, non_profit, non_profits).
Most of these share a significant number of characters, but not in exactly the same form. For example, (nonprofit and non_profit).
What regex sequence will be the best for this? I know that I can use stemming as well, but I'm wondering how I can combine that with the regex.
For a single list of a few thousand items, I'd consider an alternate approach.
Sort the list alphabetically then manually remove the duplicates. Whatever regex and subsequent processing you end up with will probably take as much time if not more than going through the list manually.
Of course, I'm assuming this is a one-time proposition. I defer to regex experts for a programmatic solution.
I agree with Bob Kaufman that you should do a first pass to eliminate actual duplicates. After that, you have a problem that regex cannot solve for you; you will need to look into measurements of edit distance to get anywhere with it.
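For reference, a standard Levenshtein edit-distance computation might look like the sketch below. The function name editDistance is hypothetical, and which threshold counts as "the same term" is left open.

```javascript
// Levenshtein distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b.
function editDistance(a, b) {
  // prev[j] = distance between the first i-1 chars of a and first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(
        prev[j] + 1,       // delete a[i-1]
        cur[j - 1] + 1,    // insert b[j-1]
        prev[j - 1] + cost // substitute or keep
      );
    }
    prev = cur;
  }
  return prev[b.length];
}

const nonprofitDistance = editDistance("nonprofit", "non_profit"); // 1
```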
My usual strategy in this situation, which is not perfectly reliable, is as follows:
1) Remove all nonalphanumeric characters.
2) Make all strings lowercase.
3) Put all of the strings in a HashSet (this will remove duplicates).
4) Check for any cases where word and word+"s" are both in the set, and remove the plural one.
5) Output the strings in alphabetical order, and do a quick manual search for duplicates. If any are found, define new rules accordingly.
Other rules you may need:
Replace & with and.
Remove all instances of "inc"
Replace all instances of television with TV.
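Steps 1 to 5 of the numbered strategy above could be sketched as follows, with a JavaScript Set standing in for the HashSet. The function name dedupeTerms is hypothetical, and the character class assumes ASCII terms (non-ASCII letters would be stripped). The extra replacement rules are not included.

```javascript
// Normalize, deduplicate, drop simple plurals, and sort.
function dedupeTerms(terms) {
  // Steps 1 and 2: lowercase and strip non-alphanumeric characters.
  const normalized = terms
    .map((t) => t.toLowerCase().replace(/[^a-z0-9]/g, ""))
    .filter((t) => t.length > 0);
  // Step 3: a Set removes exact duplicates.
  const unique = new Set(normalized);
  // Step 4: if both word and word + "s" are present, drop the plural.
  for (const term of [...unique]) {
    if (unique.has(term + "s")) unique.delete(term + "s");
  }
  // Step 5: alphabetical order for the manual review pass.
  return [...unique].sort();
}

const dedupedTerms = dedupeTerms([
  "ruby", "a_ruby", "triathlon", "triathlete", "triathletes",
  "nonprofit", "non_profit", "non_profits",
]);
// ["aruby", "nonprofit", "ruby", "triathlete", "triathlon"]
```

Note that "aruby" and "ruby" both survive, which is the kind of leftover the manual pass in step 5 is meant to catch.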