Elixir punctuation replacement regex - regex

I'm trying to remove all punctuation from a string using
String.replace(sentence, ~r[\p{P}\p{S}], "")
However it's not removing all punctuation! As an illustrative example:
iex(1)> String.replace("foo!&^%$?", ~r[\p{P}\p{S}], "")
"foo!?"
What should I be using?

Possibly need /.../ as pattern delimiters:
String.replace("foo!&^%$?", ~r/[\p{P}\p{S}]/, "")
This explains the result: in your sample, the [ ] characters act as the sigil delimiters rather than as a character class, so the pattern is the sequence \p{P}\p{S} (a punctuation character followed by a symbol). That sequence matches &^ and %$ and leaves foo!? (see regex101 example).
I would additionally add a + quantifier: ~r/[\p{P}\p{S}]+/
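For comparison, here is a rough sketch of the same character-class idea in Python. Python's built-in re module does not support \p{...} classes, so this version filters on Unicode general categories through the standard unicodedata module instead. The function name is made up for illustration.

```python
import unicodedata


def strip_punct_and_symbols(text: str) -> str:
    """Drop every character whose Unicode category is P* (punctuation) or S* (symbol)."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ("P", "S")
    )


print(strip_punct_and_symbols("foo!&^%$?"))  # foo
```

Like the [\p{P}\p{S}] class, this treats each character on its own, so it does not depend on punctuation and symbols appearing in pairs.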

If you're only working with strings in English, it's easiest and clearest to just use POSIX character classes:
String.replace("foo!&^%$?", ~r/[[:punct:]]/, "")

I'm late to the game, but you have to adjust the regex and customize it, especially if you're trying to preserve certain items, like a hyphen (which is considered punctuation in some language aspects).
My replace is a bit more verbose, but lets me control what I want to replace:
String.replace(str, ~r/[!#$%&()*+,.:;<=>?@\^_`{|}~]/, "")
This lets me keep the hyphen in a word like co-operate while removing : or other characters. Note that the hyphen must be left out of the character class for this to work.

Regex: match string unless it contains a word [duplicate]

I know that I can negate group of chars as in [^bar] but I need a regular expression where negation applies to the specific word - so in my example how do I negate an actual bar, and not "any chars in bar"?
A great way to do this is to use negative lookahead:
^(?!.*bar).*$
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
Unless performance is of utmost concern, it's often easier just to run your results through a second pass, skipping those that match the words you want to negate.
Regular expressions usually mean you're doing scripting or some sort of low-performance task anyway, so find a solution that is easy to read, easy to understand and easy to maintain.
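As a minimal sketch of the two options in Python (the sample lines and the first-pass pattern are invented for illustration), the one-pass lookahead and the two-pass filter give the same result:

```python
import re

lines = ["foo", "rebar", "bar none", "baz"]

# One pass: the negative lookahead rejects any line containing "bar".
no_bar = re.compile(r"^(?!.*bar).*$")
one_pass = [s for s in lines if no_bar.match(s)]

# Two passes: select candidates first, then drop the unwanted ones.
candidates = [s for s in lines if re.search(r"^[a-z ]+$", s)]
two_pass = [s for s in candidates if "bar" not in s]

print(one_pass)  # ['foo', 'baz']
print(two_pass)  # ['foo', 'baz']
```

The second form is longer but keeps each step trivial to read, which is the trade-off described above.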
Solution:
^(?!.*STRING1|.*STRING2|.*STRING3).*$
xxxxxx OK
xxxSTRING1xxx KO (rejected, as desired)
xxxSTRING2xxx KO (rejected, as desired)
xxxSTRING3xxx KO (rejected, as desired)
You could either use a negative look-ahead or look-behind:
^(?!.*?bar).*
^(.(?<!bar))*?$
Or use just basics:
^(?:[^b]+|b(?:$|[^a]|a(?:$|[^r])))*$
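The two lookaround patterns match anything that does not contain bar. The lookaround-free pattern is only an approximation: it also accepts strings such as bbar or babar, because the character it consumes right after a b can itself be the b that starts bar.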
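A quick way to compare the three patterns is to run them over a handful of inputs. The sketch below does that in Python; the sample strings are chosen for illustration. The two lookaround patterns agree with a plain substring check on every sample, while the lookaround-free pattern also accepts bbar and babar.

```python
import re

lookahead = re.compile(r"^(?!.*?bar).*")
lookbehind = re.compile(r"^(.(?<!bar))*?$")
basics = re.compile(r"^(?:[^b]+|b(?:$|[^a]|a(?:$|[^r])))*$")

samples = ["", "foo", "ba", "bar", "barrier", "disbar", "bbar", "babar"]
for s in samples:
    print(
        repr(s),
        bool(lookahead.search(s)),
        bool(lookbehind.search(s)),
        bool(basics.search(s)),
    )
```

The nested quantifiers in the third pattern can also backtrack heavily on long inputs, so the lookahead form is usually the safer choice where it is supported.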
The following regex will do what you want (as long as negative lookbehinds and lookaheads are supported), matching things properly; the only problem is that it matches individual characters (i.e. each match is a single character rather than all characters between two consecutive "bar"s), possibly resulting in a potential for high overhead if you're working with very long strings.
b(?!ar)|(?<!b)a|a(?!r)|(?<!ba)r|[^bar]
I came across this forum thread while trying to identify a regex for the following English statement:
Given an input string, match everything unless this input string is exactly 'bar'; for example I want to match 'barrier' and 'disbar' as well as 'foo'.
Here's the regex I came up with
^(bar.+|(?!bar).*)$
My English translation of the regex is "match the string if it starts with 'bar' and it has at least one other character, or if the string does not start with 'bar'".
The accepted answer is nice but is really a work-around for the lack of a simple sub-expression negation operator in regexes. This is why grep --invert-match exists. So in *nixes, you can accomplish the desired result using pipes and a second regex.
grep 'something I want' | grep --invert-match 'but not these ones'
Still a workaround, but maybe easier to remember.
If it's truly the word bar that you don't want to match, then:
^(?!.*\bbar\b).*$
The above will match any string that does not contain bar as a whole word, that is to say, bar delimited on both sides by non-word characters or by the start or end of the string. However, the period/dot (.) used in the above pattern will not match newline characters unless the correct regex flag is used:
^(?s)(?!.*\bbar\b).*$
Alternatively:
^(?![\s\S]*\bbar\b)[\s\S]*$
Instead of using any special flag, we are looking for any character that is either white space or non-white space. That should cover every character.
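Both variants can be tried in Python, as sketched below with invented sample text. Recent Python versions only accept a global inline flag such as (?s) at the very start of the pattern, so this sketch passes re.DOTALL as an argument instead. In the flag-free variant the dot inside the lookahead is replaced by [\s\S] as well, since that dot has the same newline limitation.

```python
import re

text_with_bar = "foo\nbar\n"
text_without = "foo\nbarn\n"

# Variant 1: re.DOTALL lets every dot cross line breaks.
dotall = re.compile(r"^(?!.*\bbar\b).*$", re.DOTALL)

# Variant 2: no flag; [\s\S] replaces the dot in the lookahead as well.
explicit = re.compile(r"^(?![\s\S]*\bbar\b)[\s\S]*$")

for pattern in (dotall, explicit):
    print(pattern.search(text_with_bar) is None)     # True: contains the word bar
    print(pattern.search(text_without) is not None)  # True: "barn" is a different word
```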
But what if we would like to match words that might contain bar, but just not the specific word bar?
(?!\bbar\b)\b[A-Za-z-]*bar[a-z-]*\b
(?!\bbar\b) Assert that the next input is not bar on a word boundary.
\b[A-Za-z-]*bar[a-z-]*\b Matches any word on a word boundary that contains bar.
See Regex Demo
Extracted from this comment by bkDJ:
^(?!bar$).*
The nice property of this solution is that it's possible to clearly negate (exclude) multiple words:
^(?!bar$|foo$|banana$).*
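When the list of excluded words comes from data rather than being typed by hand, the pattern can be assembled programmatically. A minimal Python sketch, assuming a hypothetical helper named not_exactly and escaping each word so that regex metacharacters in it are taken literally:

```python
import re


def not_exactly(words):
    """Compile a pattern matching any line that is not exactly one of `words`."""
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(rf"^(?!(?:{alternatives})$).*")


pattern = not_exactly(["bar", "foo", "banana"])
for candidate in ["bar", "barrier", "disbar", "foo", "food"]:
    print(candidate, bool(pattern.match(candidate)))
```

Factoring the shared $ out of the alternation is equivalent to writing bar$|foo$|banana$ and keeps the generated pattern shorter.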
I wish to complement the accepted answer and contribute to the discussion with my late answer.
@ChrisVanOpstal shared this regex tutorial which is a great resource for learning regex.
However, it was really time consuming to read through.
I made a cheatsheet for mnemonic convenience.
This reference is based on the braces [], (), and {} leading each class, and I find it easy to recall.
Regex = {
    'single_character': ['[]', '.', {'negate':'^'}],
    'capturing_group' : ['()', '|', '\\', 'backreferences and named group'],
    'repetition' : ['{}', '*', '+', '?', 'greedy v.s. lazy'],
    'anchor' : ['^', '\b', '$'],
    'non_printable' : ['\n', '\t', '\r', '\f', '\v'],
    'shorthand' : ['\d', '\w', '\s'],
}
Just thought of something else that could be done. It's very different from my first answer, as it doesn't use regular expressions, so I decided to make a second answer post.
Use your language of choice's split() method equivalent on the string with the word to negate as the argument for what to split on. An example using Python:
>>> text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf'
>>> text.split('bar')
['', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf']
The nice thing about doing it this way, in Python at least (I don't remember if the functionality would be the same in, say, Visual Basic or Java), is that it lets you know indirectly when "bar" was repeated in the string due to the fact that the empty strings between "bar"s are included in the list of results (though the empty string at the beginning is due to there being a "bar" at the beginning of the string). If you don't want that, you can simply remove the empty strings from the list.
I had a list of file names, and I wanted to exclude certain ones, with this sort of behavior (Ruby):
files = [
'mydir/states.rb', # don't match these
'countries.rb',
'mydir/states_bkp.rb', # match these
'mydir/city_states.rb'
]
excluded = ['states', 'countries']
# set my_rgx here
result = WankyAPI.filter(files, my_rgx) # I didn't write WankyAPI...
assert result == ['mydir/city_states.rb', 'mydir/states_bkp.rb']
Here's my solution:
excluded_rgx = excluded.map{|e| e+'\.'}.join('|')
my_rgx = /(^|\/)((?!#{excluded_rgx})[^\.\/]*)\.rb$/
My assumptions for this application:
The string to be excluded is at the beginning of the input, or immediately following a slash.
The permitted strings end with .rb.
Permitted filenames don't have a . character before the .rb.
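The same filter can be sketched in Python under the assumptions listed above. This is a port for illustration only; it replaces the WankyAPI call with a plain list comprehension and escapes the excluded names with re.escape.

```python
import re

files = [
    "mydir/states.rb",
    "countries.rb",
    "mydir/states_bkp.rb",
    "mydir/city_states.rb",
]
excluded = ["states", "countries"]

# Each excluded base name must be followed by a literal dot to count.
excluded_rgx = "|".join(re.escape(name) + r"\." for name in excluded)
my_rgx = re.compile(rf"(^|/)((?!{excluded_rgx})[^./]*)\.rb$")

result = [f for f in files if my_rgx.search(f)]
print(result)  # ['mydir/states_bkp.rb', 'mydir/city_states.rb']
```

Requiring the dot after each excluded name is what lets states_bkp.rb through while states.rb is rejected.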

RegEx to match string between delimiters or at the beginning or end

I am processing a CSV file and want to search and replace strings as long as it is an exact match in the column. For example:
xxx,Apple,Green Apple,xxx,xxx
Apple,xxx,xxx,Apple,xxx
xxx,xxx,Fruit/Apple,xxx,Apple
I want to replace 'Apple' if it is the EXACT value in the column (if it is contained in text within another column, I do not want to replace). I cannot see how to do this with a single expression (maybe not possible?).
The desired output is:
xxx,GRAPE,Green Apple,xxx,xxx
GRAPE,xxx,xxx,GRAPE,xxx
xxx,xxx,Fruit/Apple,xxx,GRAPE
So the expression I want is: match the beginning of input OR a comma, followed by desired string, followed by a comma OR the end of input.
You cannot put ^ or $ in character classes, so I tried \A and \Z but that didn't work.
([\A,])Apple([\Z,])
This didn't work, sadly. Can I do this with one regular expression? Seems like this would be a common enough problem.
It will depend on your language, but if the one you use supports lookarounds, then you would use something like this:
(?<=,|^)Apple(?=,|$)
Replace with GRAPE.
Otherwise, you will have to put back the commas:
(^|,)Apple(,|$)
Or
(\A|,)Apple(,|\Z)
And replace with:
\1GRAPE\2
Or
$1GRAPE$2
Depending on what's supported.
The above are raw regex (and replacement) strings. Escape as necessary.
Note: The disadvantage of the latter solution is that it will not work on strings like:
xxx,Apple,Apple,xxx,xxx
Since the comma after the first Apple got consumed. You'd have to call the regex replacement at most twice if you have such cases.
Oh, and I forgot to mention, you can have some 'hybrids', since some languages have different levels of support for lookbehinds (in all of the below, ^ and \A, $ and \Z, \1 and $1 are interchangeable; I'm showing only one of each so this doesn't get longer than it already is):
(?:(?<=,)|(?<=^))Apple(?=,|$)
The above is for flavors where lookbehinds cannot be of variable width. Replace with GRAPE.
(^|,)Apple(?=,|$)
And the above one is for where lookaheads are supported but not lookbehinds. Replace with \1GRAPE.
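As one more illustration, here is a sketch for Python, processing one row at a time. Python's re module only accepts fixed-width lookbehinds, so (?<=,|^), whose branches differ in width, is not usable there. This version swaps in a double negation instead: "not preceded by a non-comma" and "not followed by a non-comma". It is a variation on the lookaround idea, not one of the patterns above, and it assumes fields never contain quoted commas.

```python
import re

rows = [
    "xxx,Apple,Green Apple,xxx,xxx",
    "Apple,xxx,xxx,Apple,xxx",
    "xxx,xxx,Fruit/Apple,xxx,Apple",
    "xxx,Apple,Apple,xxx,xxx",
]

# "Not preceded by a non-comma" holds after a comma and at the start of the row;
# "not followed by a non-comma" holds before a comma and at the end of the row.
whole_field = re.compile(r"(?<![^,])Apple(?![^,])")

for row in rows:
    print(whole_field.sub("GRAPE", row))
```

Because lookarounds consume nothing, the consecutive Apple,Apple case in the last row is handled in a single pass.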
This does as you wish:
Find what: (^|,)(?:Apple)(,|$)
Replace with: $1GRAPE$2
This works on regex101, in all flavors.
http://regex101.com/r/iP6dZ8
I wanted to share my original work-around (before the other answers), though it feels like more of a hack.
I simply prepend and append a comma on the string before doing the simpler:
s/,Apple,/,GRAPE,/g
then cut off the first and last character.
PHP looks like:
$line = substr(preg_replace($search, $replace, ','.$line.','), 1, -1);
This still suffers from the problem of consecutive columns (e.g. ",Apple,Apple,").

How to ignore whitespace in a regular expression subject string?

Is there a simple way to ignore the white space in a target string when searching for matches using a regular expression pattern? For example, if my search is for "cats", I would want "c ats" or "ca ts" to match. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
You can stick optional whitespace characters \s* in between every other character in your regex. Although granted, it will get a bit lengthy.
/cats/ -> /c\s*a\s*t\s*s/
While the accepted answer is technically correct, a more practical approach, if possible, is to just strip whitespace out of both the regular expression and the search string.
If you want to search for "my cats", instead of:
myString.match(/m\s*y\s*c\s*a\s*t\s*s\s*/g)
Just do:
myString.replace(/\s*/g,"").match(/mycats/g)
Warning: You can't automate this on the regular expression by just replacing all spaces with empty strings because they may occur in a negation or otherwise make your regular expression invalid.
Addressing Steven's comment to Sam Dufel's answer
Thanks, sounds like that's the way to go. But I just realized that I only want the optional whitespace characters if they follow a newline. So for example, "c\n ats" or "ca\n ts" should match. But wouldn't want "c ats" to match if there is no newline. Any ideas on how that might be done?
This should do the trick:
/c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s/
See this page for all the different variations of 'cats' that this matches.
You can also solve this using conditionals, but they are not supported in the javascript flavor of regex.
You could put \s* in between every character in your search string, so if you were looking for cats you would use c\s*a\s*t\s*s
It's long but you could build the string dynamically of course.
You can see it working here: http://www.rubular.com/r/zzWwvppSpE
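Building the pattern dynamically might look like the following Python sketch. The helper name is hypothetical; each character is escaped so that a search word containing regex metacharacters is still treated literally.

```python
import re


def find_ignoring_whitespace(word, text):
    """Search for `word` in `text`, allowing whitespace between its characters."""
    pattern = r"\s*".join(re.escape(ch) for ch in word)
    return re.search(pattern, text)


match = find_ignoring_whitespace("cats", "my c ats are asleep")
print(match.span(), repr(match.group()))  # (3, 8) 'c ats'
```

The returned span refers to the original string, whitespace included, which is what highlighting the match requires.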
If you only want to allow spaces, then
\bc *a *t *s\b
should do it. To also allow tabs, use
\bc[ \t]*a[ \t]*t[ \t]*s\b
Remove the \b anchors if you also want to find cats within words like bobcats or catsup.
This approach can be automated (the following example is in Python, although it can be ported to any language):
You can strip the whitespace beforehand AND save the positions of the non-whitespace characters, so that you can later map the match boundaries back to positions in the original string, like this:
import re

def regex_search_ignore_space(regex, string):
    no_spaces = ''
    char_positions = []
    for pos, char in enumerate(string):
        if re.match(r'\S', char):  # upper \S matches non-whitespace chars
            no_spaces += char
            char_positions.append(pos)
    match = re.search(regex, no_spaces)
    if not match:
        return match
    if match.start() == match.end():
        return ''  # empty match: nothing to map back
    # match.start() and match.end() are indices into the spaceless string
    # (as we have searched in it); match.end() is exclusive.
    start = char_positions[match.start()]  # in the original string
    # Map the last matched character back and step one past it, so trailing
    # whitespace is excluded and a match at the very end does not overflow.
    end = char_positions[match.end() - 1] + 1  # in the original string
    matched_string = string[start:end]
    # the match WITH spaces is returned.
    return matched_string

with_spaces = 'a li on and a cat'
print(regex_search_ignore_space('lion', with_spaces))
# prints 'li on'
If you want to go further you can construct the match object and return it instead, so the use of this helper will be more handy.
And the performance of this function can of course also be optimized, this example is just to show the path to a solution.
The accepted answer will not work if and when you are passing a dynamic value (such as "current value" in an array loop) as the regex test value. You would not be able to input the optional white spaces without getting some really ugly regex.
Konrad Hoffner's solution is therefore better in such cases as it will strip both the regex and test string of whitespace. The test will be conducted as though both have no whitespace.

Regex - Multiline Problem

I think I'm burnt out, and that's why I can't see an obvious mistake. Anyway, I want the following regex:
#BIZ[.\s]*#ENDBIZ
to grab me the #BIZ tag, #ENDBIZ tag and all the text in between the tags. For example, if given some text, I want the expression to match:
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
At the moment, the regex matches nothing. What did I do wrong?
ADDITIONAL DETAILS
I'm doing the following in PHP
preg_replace('/#BIZ[.\s]*#ENDBIZ/', 'my new text', $strMultiplelines);
The dot loses its special meaning inside a character class — in other words, [.\s] means "match period or whitespace". I believe what you want is [\s\S], "match whitespace or non-whitespace".
preg_replace('/#BIZ[\s\S]*#ENDBIZ/', 'my new text', $strMultiplelines);
Edit: A bit about the dot and character classes:
By default, the dot does not match newlines. Most (all?) regex implementations have a way to specify that it match newlines as well, but it differs by implementation. The only way to match (really) any character in a compatible way is to pair a shorthand class with its negation — [\s\S], [\w\W], or [\d\D]. In my personal experience, the first seems to be most common, probably because this is used when you need to match newlines, and including \s makes it clear that you're doing so.
Also, the dot isn't the only special character which loses its meaning in character classes. In fact, the only characters which are special in character classes are ^, -, \, and ]. Check out the "Metacharacters Inside Character Classes" section of the character classes page on Regular-Expressions.info.
// Replaces all of your code with "my new text", but I do not think
// this is actually what you want based on your description.
preg_replace('/#BIZ(.+?)#ENDBIZ/s', 'my new text', $contents);
// Actually "gets" the text, which is what I think you might be looking for.
preg_match('/(#BIZ)(.+?)(#ENDBIZ)/s', $contents, $matches);
list($dummy, $startTag, $data, $endTag) = $matches;
This should work
#BIZ[\s\S]*#ENDBIZ
You can try this online Regular Expression Testing Tool
The mistake is the character group [.\s] that will match a dot (not any character) or white space. You probably tried to get .* with . matching newline characters, too. You achieve this by enabling the single line option ((?s:) does this in .NET regex).
(?s:#BIZ.*?#ENDBIZ)
Depending on the environment you're using your regex in, it may need special care to properly parse multiline text, eg re.DOTALL in Python. So what environment is that?
You can use
preg_replace('/#BIZ.*?#ENDBIZ/s', 'my new text', $strMultiplelines);
The 's' modifier says "match the dot with anything, even the newline character". The '?' says don't be greedy, which matters in a case such as:
foo
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
bar
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
hello world
Because the match is non-greedy, each #BIZ to #ENDBIZ block is replaced on its own, and the "bar" in the middle is kept.
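The greedy versus non-greedy difference can be reproduced in Python, where re.DOTALL plays the role of the /s modifier. This is a sketch with shortened sample text rather than the PHP call itself:

```python
import re

text = (
    "foo\n#BIZ\nsome text\n#ENDBIZ\n"
    "bar\n#BIZ\nmore text\n#ENDBIZ\n"
    "hello world"
)

# re.DOTALL makes the dot match newlines as well.
lazy = re.sub(r"#BIZ.*?#ENDBIZ", "my new text", text, flags=re.DOTALL)
greedy = re.sub(r"#BIZ.*#ENDBIZ", "my new text", text, flags=re.DOTALL)

print(lazy)    # both blocks replaced separately, "bar" kept
print(greedy)  # one replacement from the first #BIZ to the last #ENDBIZ
```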
Unless I am missing something, you handle this the same way that you would in Perl, with either the /m or /s modifier at the end? Oddly enough the other answers that rather correctly pointed this out got down voted?!
It looks like you're using a JavaScript regex; you'll need to enable multiline mode by specifying the m flag at the end of the expression:
var re = /^deal$/mg