Python 3 regex string matching, ignoring whitespace and string.punctuation

I am new to regex and would like to know how to match a phrase against a string. The use case is finding a certain phrase in some text. I'm using Python 3.7, if that makes a difference.
phrase = "some phrase" #the phrase I'm searching for
Possible matches:
text = "some##$#phrase" #the run of non-alphanumerics "##$#" can be treated like a single space
text = "some phrase"
text = "!!!some!!! phrase!!!"
These are not matches:
text = "some phrases" #the 's' on the end makes it false
text = "ssome phrase"
text = "some other phrase"
I have tried using something like:
re.search(r'\b'+phrase+'\b', text)
I would very much appreciate an explanation of why the regex works if you provide a valid solution.

You should use something like this:
re.search(r'\bsome\W+phrase\b', text)
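'\b' matches a word boundary, which is why 'ssome' and 'phrases' are rejected.
'\W' matches any non-word character (anything other than a letter, digit, or underscore).
'+' means one or more times, so a run of punctuation and spaces counts as a single separator.
If the phrase is held in a variable, you can build the pattern from it first: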
some_phrase = some_phrase.replace(r' ', r'\W+')
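A minimal sketch of the approach above, assuming the phrase arrives in a variable and its words are separated by whitespace. Each word goes through re.escape so that any punctuation inside the phrase is taken literally, and the words are joined with \W+. The helper name contains_phrase is hypothetical. Note also that the attempt in the question writes its second '\b' without the r prefix, which Python reads as a backspace character rather than a word boundary.

```python
import re

def contains_phrase(phrase, text):
    """Return True if `phrase` occurs in `text` as whole words,
    with any run of non-word characters allowed between the words."""
    # Escape each word so regex metacharacters in the phrase are literal.
    words = [re.escape(word) for word in phrase.split()]
    # \b anchors both ends at word boundaries; \W+ bridges the words.
    pattern = r'\b' + r'\W+'.join(words) + r'\b'
    return re.search(pattern, text) is not None

samples = ["some##$#phrase", "some phrase", "!!!some!!! phrase!!!",
           "some phrases", "ssome phrase", "some other phrase"]
for sample in samples:
    print(repr(sample), contains_phrase("some phrase", sample))
```

The first three samples from the question should print True and the last three False.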

Related

Regex Match whole word string in coldfusion

I'm trying this example.
First example:
keyword = "star";
myString = "The dog sniffed at the star fish and growled";
regEx = "\b"& keyword &"\b";
if (reFindNoCase(regEx, myString)) {
writeOutput("found it");
} else {
writeOutput("did not find it");
}
Example output -> found it
Second example:
keyword = "star";
myString = "The dog sniffed at the .star fish and growled";
regEx = "\b"& keyword &"\b";
if (reFindNoCase(regEx, myString)) {
writeOutput("found it");
} else {
writeOutput("did not find it");
}
Example output -> found it
But I want to match only whole words, and the punctuation is a problem for me. How can I write the regex so that the second example outputs: did not find it
ColdFusion does not support lookbehind, so you cannot use a real "zero-width boundary" check on the left side. Instead, you can use a group there (and, fortunately, a lookahead on the right):
regEx = "(^|\W)"& keyword &"(?=\W|$)";
Here, (^|\W) matches either the start of the string or a non-word character, and (?=\W|$) makes sure the keyword is followed by either a non-word character (\W) or the end of the string ($).
However, make sure you escape your keyword before passing it to the regex. ColdFusion 10 provides reEscape() to prepare string literals for the native RE methods.
Another way is to match spaces or start/end of string:
<cfset regEx = "(^|\s)" & TABLE_NAME & "($|\s)">
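For comparison, here is a sketch of the "delimited by spaces or string ends" idea in Python rather than ColdFusion. Python's re module supports lookbehind, so the grouping workaround is not needed there. The function name has_whole_word is hypothetical, and the whitespace-only delimiter is an assumption taken from the last pattern above.

```python
import re

def has_whole_word(keyword, text):
    """True if `keyword` appears delimited by whitespace or string ends."""
    # (?<!\S): not preceded by a non-space character (start of string is fine).
    # (?!\S):  not followed by a non-space character (end of string is fine).
    pattern = r'(?<!\S)' + re.escape(keyword) + r'(?!\S)'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(has_whole_word("star", "The dog sniffed at the star fish and growled"))
print(has_whole_word("star", "The dog sniffed at the .star fish and growled"))
```

With these inputs the first call finds the keyword and the second does not, because ".star" is preceded by a non-space character.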

Whole word replacements using Regular Expression

I have a list of original words and replacement words, and I want to replace each occurrence of an original word in some sentences with its replacement word.
For example my list:
theabove the above
myaddress my address
So the sentence "This is theabove." will become "This is the above."
I am using Regular Expression in VB like this:
Dim strPattern As String
Dim regex As New RegExp
regex.Global = True
If Not IsEmpty(myReplacementList) Then
For intRow = 0 To UBound(myReplacementList, 2)
strReplaceWith = IIf(IsNull(myReplacementList(COL_REPLACEMENTWORD, intRow)), " ", myReplacementList(COL_REPLACEMENTWORD, intRow))
strPattern = "\b" & myReplacementList(COL_ORIGINALWORD, intRow) & "\b"
regex.Pattern = strPattern
TextToCleanUp = regex.Replace(TextToReplace, strReplaceWith)
Next
End If
I loop over all entries in my list myReplacementList and apply them to the text TextToReplace that I want to process. Each replacement has to match a whole word, so I put the "\b" token around the original word.
It works well, but I have a problem when the original word contains special characters, for example:
overla) overlay
I tried to escape the ) in the pattern, but it does not work:
\boverla\)\\b
I can't turn the sentence "This word is overla) with that word." into "This word is overlay with that word."
What am I missing? Is a regular expression the right tool for this scenario?
I'd use string.replace().
That way you don't have to escape special characters, only double quotes (written as "").
See here for examples: http://www.dotnetperls.com/replace-vbnet
Regex is good if you're looking for patterns. Or renaming your mp3 collection ;-) and much, much more. But in your case, I'd use string.replace().
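If whole-word matching still matters, a regex can work once the cause of the failure is clear: \b only matches between a word character and a non-word character, and in "overla) with" the ")" is followed by a space, so there is no boundary after it. The sketch below is in Python rather than VB and swaps \b for lookarounds that only check that no word character is adjacent. The function name replace_whole_words and the sample list are hypothetical, and whether the same lookarounds are available depends on the regex engine in use.

```python
import re

# Hypothetical replacement list: (original, replacement) pairs.
REPLACEMENTS = [("theabove", "the above"),
                ("myaddress", "my address"),
                ("overla)", "overlay")]

def replace_whole_words(text, pairs):
    for original, replacement in pairs:
        # (?<!\w) / (?!\w): no word character directly before / after.
        # re.escape makes characters such as ")" literal.
        pattern = r'(?<!\w)' + re.escape(original) + r'(?!\w)'
        # A function replacement keeps backslashes in `replacement` literal.
        text = re.sub(pattern, lambda m, r=replacement: r, text)
    return text

print(replace_whole_words("This word is overla) with that word.", REPLACEMENTS))
print(replace_whole_words("This is theabove.", REPLACEMENTS))
```

Each pass feeds the result of the previous replacement into the next one, so all pairs are applied to the same text.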

Using Regex is there a way to match outside characters in a string and exclude the inside characters?

I know I can exclude outside characters in a string using look-ahead and look-behind, but I'm not sure about characters in the center.
What I want is to get a match of ABCDEF from the string ABC 123 DEF.
Is this possible with a Regex string? If not, can it be accomplished another way?
EDIT
For more clarification, in the example above I can use the regex string /ABC.*?DEF/ to sort of get what I want, but this includes everything matched by .*?. What I want is to match with something like ABC(match whatever, but then throw it out)DEF resulting in one single match of ABCDEF.
As another example, I can do the following (in pseudo-code and regex):
string myStr = "ABC 123 DEF";
string tempMatch = RegexMatch(myStr, "(?<=ABC).*?(?=DEF)"); //Returns " 123 "
string FinalString = myStr.Replace(tempMatch, ""); //Returns "ABCDEF". This is what I want
Again, is there a way to do this with a single regex string?
Since the regex replace feature in most languages does not change the string it operates on (but produces a new one), you can do it as a one-liner in most languages. Firstly, you match everything, capturing the desired parts:
^.*(ABC).*(DEF).*$
(Make sure to use the single-line/"dotall" option if your input contains line breaks!)
And then you replace this with:
$1$2
That will give you ABCDEF in one assignment.
Still, as outlined in the comments and in Mark's answer, the engine does match the stuff in between ABC and DEF. It's only the replacement convenience function that throws it out. But that is supported in pretty much every language, I would say.
Important: this approach will of course only work if your input string contains the desired pattern only once (assuming ABC and DEF are actually variable).
Example implementation in PHP:
$output = preg_replace('/^.*(ABC).*(DEF).*$/s', '$1$2', $input);
Or JavaScript (which had no single-line mode until ES2018 added the s flag, hence [\s\S]):
var output = input.replace(/^[\s\S]*(ABC)[\s\S]*(DEF)[\s\S]*$/, '$1$2');
Or C#:
string output = Regex.Replace(input, @"^.*(ABC).*(DEF).*$", "$1$2", RegexOptions.Singleline);
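The same one-liner can be sketched in Python, where the "dotall" option is the re.DOTALL flag and backreferences in the replacement are written \1 and \2. The function name keep_outer is hypothetical.

```python
import re

def keep_outer(text):
    """Replace the whole input with the two captured outer parts."""
    # re.DOTALL lets "." match line breaks as well.
    return re.sub(r'^.*(ABC).*(DEF).*$', r'\1\2', text, flags=re.DOTALL)

print(keep_outer("ABC 123 DEF"))
```

If the input does not contain the pattern, re.sub returns it unchanged.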
A regular expression can contain multiple capturing groups. Each group must consist of consecutive characters so it's not possible to have a single group that captures what you want, but the groups themselves do not have to be contiguous so you can combine multiple groups to get your desired result.
Regular expression
(ABC).*(DEF)
Captures
ABC
DEF
Example C# code
string myStr = "ABC 123 DEF";
Match m = Regex.Match(myStr, "(ABC).*(DEF)");
if (m.Success)
{
string result = m.Groups[1].Value + m.Groups[2].Value; // Gives "ABCDEF"
// ...
}
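A Python sketch of the same group-concatenation idea, assuming the single sample string from the question. The function name join_groups is hypothetical.

```python
import re

def join_groups(text):
    """Concatenate the two captured groups, or return None if no match."""
    m = re.search(r'(ABC).*(DEF)', text)
    if m is None:
        return None
    # The text matched by ".*" is part of the match but not of either group.
    return m.group(1) + m.group(2)

print(join_groups("ABC 123 DEF"))
```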

How to unpunctuate, lowercase, de-space and hyphenate a string with regex?

If I have a string like this
Newsflash: The Big(!) Brown Dog's Brother (T.J.) Ate The Small Blue Egg
how would I convert that into the following using regex:
newsflash-the-big-brown-dogs-brother-tj-ate-the-small-blue-egg
In other words, punctuation is discarded and spaces are replaced with hyphens.
It sounds like you want to create a "URL slug": a URL-friendly version of an article's title, for example. That means you'll want to make sure you remove all possible non-URL-friendly characters, not just a few. You might do it this way (in order):
1. Remove all characters that are not letters, digits, or spaces:
   replace the regex [^A-Za-z0-9 ] with the empty string "".
2. Replace each run of whitespace with a dash:
   replace the regex \s+ with the string "-".
3. Lower-case the string:
   Java        s = s.toLowerCase();
   JavaScript  s = s.toLowerCase();
   C#          s = s.ToLower();
   Perl        $s = lc($s);
   Python      s = s.lower()
   PHP         $s = strtolower($s);
   Ruby        s = s.downcase
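The three steps above can be sketched in Python as follows. The function name slugify is hypothetical, and the character class is the ASCII-only one from step 1, so accented letters would be dropped.

```python
import re

def slugify(title):
    # Step 1: drop everything except ASCII letters, digits, and spaces.
    cleaned = re.sub(r'[^A-Za-z0-9 ]', '', title)
    # Step 2: turn each run of whitespace into a single dash.
    dashed = re.sub(r'\s+', '-', cleaned)
    # Step 3: lower-case the result.
    return dashed.lower()

title = "Newsflash: The Big(!) Brown Dog's Brother (T.J.) Ate The Small Blue Egg"
print(slugify(title))
# newsflash-the-big-brown-dogs-brother-tj-ate-the-small-blue-egg
```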
Replace the regex [\s-]+ with "-", then replace [^\w-] with "".
Then, call ToLowerCase or equivalent.
In JavaScript:
var s = "Newsflash: The Big(!) Brown Dog's Brother (T.J.) Ate The Small Blue Egg";
alert(s.replace(/[\s-]+/g, '-').replace(/[^\w-]/g, '').toLowerCase());
Replace /\W+/ with '-', that will replace all non-word characters with a dash.
Then, collapse dashes by replacing /-+/ with '-'.
Then, lowercase the string - pure regex solutions cannot do that. You didn't say which language you are using, so I cannot give you an example, but your language might have String.toLowercase() or a tr/// call (tr/A-Z/a-z/, for example, in Perl).
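A Python sketch of this last variant, mainly to show how its output differs from the one requested in the question: because every run of non-word characters becomes a dash, the apostrophe and the periods also produce dashes. The function name slug_by_nonword_runs is hypothetical, and stripping leading and trailing dashes is an added assumption, not part of the answer.

```python
import re

def slug_by_nonword_runs(title):
    # Replace each run of non-word characters with a dash.
    slug = re.sub(r'\W+', '-', title)
    # Collapse any repeated dashes.
    slug = re.sub(r'-+', '-', slug)
    # Assumption: trim dashes left at either end, then lower-case.
    return slug.strip('-').lower()

title = "Newsflash: The Big(!) Brown Dog's Brother (T.J.) Ate The Small Blue Egg"
print(slug_by_nonword_runs(title))
# newsflash-the-big-brown-dog-s-brother-t-j-ate-the-small-blue-egg
```

If "dogs" and "tj" are wanted, punctuation has to be deleted before spaces are converted, as in the answers above.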

I've got problem with fine tuning of regex

I have a regex that was working fine, but it turns out that it doesn't work well in some situations.
Keep an eye on the message preview, because the message editor does some tricky things with "\".
[\[]?[\^%#\$\*#\-;].*?[\^%#\$\*#\-;][\]]
Its task is to find a pattern that in general looks like this:
[ABA]
A - char from set ^,%,#,$,*,#,-,;
B - some text
[ and ] are included in pattern
It is expected to find all occurrences of this pattern in the test string:
Black fox [#sample1#] [%sample2%] - [#sample3#] eats blocks.
but instead of expected list of matches
"[#sample1#]"
"[%sample2%]"
"[#sample3#]"
I get this
"[#sample1#]"
"[%sample2%]"
"- [#sample3#]"
It seems that this problem will also occur with the other characters in set "A". Could somebody suggest changes to my regex to make it work as I need?
A less important question: how can I make my regex exclude patterns that look like this:
[ABC]
A - char from set ^,%,#,$,*,#,-,;
B - some text
C - char from set ^,%,#,$,*,#,-,; other than A
[ and ] are included in pattern
for example
[$sample1#] [%sample2#] [%sample3;]
thanks in advance
MTH
\[([%#$*#;^-]).+?\1\]
applied to text:
Black fox [#sample1#] [%sample2%] - [#sample3#] [%sample4;] eats blocks.
matches
[#sample1#]
[%sample2%]
[#sample3#]
but not [%sample4;]
EDIT
This works for me (Output as expected, regex accepted by C# as expected):
Regex re = new Regex(@"\[([%#$*#;^-]).+?\1\]");
string s = "Black fox [#sample1#] [%sample2%] - [#sample3#] [%sample4;] eats blocks.";
MatchCollection mc = re.Matches(s);
foreach (Match m in mc)
{
Console.WriteLine(m.Value);
}
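The same backreference idea can be sketched in Python. The delimiter set is copied from the regex above with the duplicated # written once; the names DELIMITED and find_delimited are hypothetical.

```python
import re

# \[ opening bracket, (delimiter) captured as group 1, lazy body,
# \1 the same delimiter again, \] closing bracket.
DELIMITED = re.compile(r'\[([%#$*;^-]).+?\1\]')

def find_delimited(text):
    return [m.group(0) for m in DELIMITED.finditer(text)]

s = "Black fox [#sample1#] [%sample2%] - [#sample3#] [%sample4;] eats blocks."
for hit in find_delimited(s):
    print(hit)
```

Because the opening bracket is mandatory here, the stray "-" before "[#sample3#]" can no longer start a match, and "[%sample4;]" is skipped since its closing delimiter differs from the opening one.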
Why is there a "?" after the first "[\[]"? It makes the opening bracket optional. Without it,
\[[\^%#\$\*#\-;].*?[\^%#\$\*#\-;]\]
would detect your different strings just fine.
To be more precise:
\[([\^%#\$\*#\-;])([^\]]*?)(?=\1)([\^%#\$\*#\-;])\]
would detect [ABA]
\[([\^%#\$\*#\-;])([^\]]*?)(?!\1)([\^%#\$\*#\-;])\]
would detect [ABC]
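A Python sketch of the two lookahead variants, to show which bracketed items each one picks up. The delimiter set is the one from the patterns above with the duplicated # written once; the names DELIMS, SAME, and DIFF and the fourth sample item are hypothetical.

```python
import re

DELIMS = r'[\^%#$*;-]'
# (?=\1): the closing delimiter must equal the opening one, i.e. [ABA].
SAME = re.compile(r'\[(' + DELIMS + r')([^\]]*?)(?=\1)' + DELIMS + r'\]')
# (?!\1): the closing delimiter must differ from the opening one, i.e. [ABC].
DIFF = re.compile(r'\[(' + DELIMS + r')([^\]]*?)(?!\1)' + DELIMS + r'\]')

text = "[$sample1#] [%sample2#] [%sample3;] [#sample4#]"
print([m.group(0) for m in SAME.finditer(text)])
print([m.group(0) for m in DIFF.finditer(text)])
```

The body is matched with [^\]]*? so that a match can never run past a closing bracket into the next item.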
In your regex, the opening square bracket is matched optionally:
[\[]?
For the second part of your question (and perhaps to simplify), try this:
\[\%[^\%]+\%\]|\[\#[^\#]+\#\]|\[\$[^\$]+\$\]
In this case there is a subpattern for each possible delimiter. The | character means "OR", so the regex matches if any of the three subexpressions matches.
Each subexpression matches, in order:
Opening bracket
Special char
Everything that is not that special char (1)
Special char
Closing bracket
(1) You may need to add extra exclusions such as ']' or '[' so that it doesn't accidentally match across a large body of text like:
[%MyVar#] blah blah [$OtherVar%]
Rob