Regex to match and replace a character in a pattern - regex

I would like to replace a character "?" with "fi" in a string.
I could write a generic string replace for this, but I want to replace the "?" only if it appears between two A-Za-z characters and leave the rest alone.
E.g., "Okay?" should stay "Okay?" and not become "Okayfi"
but
"Modi?es" should become "Modifies" since it has the "?" in the middle.
What have I tried?
sentence = re.sub(r"(\?)\b", "fi", sentence)
Please see here.
https://regexr.com/3nvk3
It seems to work fine in regexr, but doesn't work well in code. Am I doing something wrong?

The best approach here is to find the original text with the fi ligature and read it in with proper encoding.
Otherwise, you will have to use some workarounds.
You may use (?<=[a-zA-Z]) / (?=[A-Za-z]) lookarounds:
sentence = re.sub(r"(?<=[a-zA-Z])\?(?=[a-zA-Z])", "fi", sentence)
See the regex demo. The (?<=[a-zA-Z]) positive lookbehind matches a position immediately after an ASCII letter, and the (?=[A-Za-z]) positive lookahead matches a position immediately before an ASCII letter.
Or, you may also use a capturing group with backreferences:
sentence = re.sub(r"([a-zA-Z])\?([a-zA-Z])", r"\1fi\2", sentence)
See another regex demo. Note that \1 references the value captured with the first ([a-zA-Z]) group and \2 references the value captured into Group 2 (([a-zA-Z])).
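As a quick check of the two suggestions above, here is a minimal Python sketch. The function names and the sample strings are made up for illustration; only the two patterns come from the answer.

```python
import re

# Lookaround version: the neighbouring letters are only asserted, not consumed.
LOOKAROUND = re.compile(r"(?<=[a-zA-Z])\?(?=[a-zA-Z])")
# Capturing-group version: the neighbouring letters are consumed and put back.
GROUPS = re.compile(r"([a-zA-Z])\?([a-zA-Z])")

def restore_fi_lookaround(sentence):
    return LOOKAROUND.sub("fi", sentence)

def restore_fi_groups(sentence):
    return GROUPS.sub(r"\1fi\2", sentence)

for text in ["Modi?es", "Okay?", "Is the ?le modi?ed?"]:
    print(restore_fi_lookaround(text), "|", restore_fi_groups(text))
```

The two versions differ when "?" marks are separated by a single letter: in "a?b?c" the group version consumes "b" with the first match, so the second "?" is left alone, while the lookaround version replaces both. Neither version touches a "?" at the start of a word (as in "?le"), since no letter precedes it.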

Related

How can I get the second part of a hyphenated word using regex?

For example, I have the word: sh0rt-t3rm.
How can I get the t3rm part using perl regex?
I could get sh0rt by using [(a-zA-Z0-9)+]\[-\], but \[-\][(a-zA-Z0-9)+] doesn't work to get t3rm.
The syntax used for the regex is not correct to get either sh0rt or t3rm.
You flipped the square brackets and the parentheses, and the hyphen does not have to be between square brackets.
To get sh0rt in sh0rt-t3rm you might use, for example, one of:
Regex | Demo | Explanation
\b([a-zA-Z0-9]+)- | Demo 1 | \b is a word boundary to prevent a partial word match; the value is in capture group 1.
\b[a-zA-Z0-9]+(?=-) | Demo 2 | Match the allowed chars in the character class, and assert a - to the right using a positive lookahead (?=-).
To get t3rm in sh0rt-t3rm you might use for example one of:
Regex | Demo | Explanation
-([a-zA-Z0-9]+)\b | Demo 3 | The other way around, with a leading -; get the value from capture group 1.
-\K[a-zA-Z0-9]+\b | Demo 4 | Match - and use \K to keep out what is matched so far. Then match 1 or more times the allowed chars in the character class.
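The patterns above can also be tried outside Perl. The following is a Python sketch, not part of the original answer; the variable names are illustrative. Python's built-in re module does not support \K, so a lookbehind (?<=-) is swapped in for the last pattern.

```python
import re

word = "sh0rt-t3rm"

# First part: capture group before the hyphen, or a lookahead for the hyphen.
first_by_group = re.search(r"\b([a-zA-Z0-9]+)-", word).group(1)
first_by_lookahead = re.search(r"\b[a-zA-Z0-9]+(?=-)", word).group(0)

# Second part: capture group after the hyphen, or a lookbehind in place of \K.
second_by_group = re.search(r"-([a-zA-Z0-9]+)\b", word).group(1)
second_by_lookbehind = re.search(r"(?<=-)[a-zA-Z0-9]+\b", word).group(0)

print(first_by_group, first_by_lookahead, second_by_group, second_by_lookbehind)
```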
If your whole target string is literally just sh0rt-t3rm then you want all that comes after the -.
So the barest and minimal version, cut precisely for this description, is
my ($capture) = $string =~ /-(.+)/;
We need parentheses on the left-hand side so that the regex runs in list context, because that's when it returns the matches (otherwise it returns true/false, normally 1 or '').
But what if the preceding text may have - itself? Then make sure to match all up to that last -
my ($capture) = $string =~ /.*-(.+)/;
Here the "greedy" nature of the * quantifier makes the previous . match all it possibly can so that the whole pattern still matches; thus it goes up until the very last -.
There are of course many other variations on what the data may look like, other than just being one hyphenated word. In particular, if it's a part of a text, you may want to include word boundaries
my ($capture) = $string =~ /\b.*?-(.+?)\b/;
Here we also need to adjust our "wild-card"-like pattern .+ by limiting it using ? so that it is not greedy. This matches the first such hyphenated word in the $string. But if indeed only "word" characters fly then we can just use \w (instead of . and word-boundary anchors)
my ($capture) = $string =~ /\w*?-(\w+)/;
Note that \w matches [a-zA-Z0-9_] only, which excludes some characters that may appear in normal text (English, not to mention all other writing systems).
But this is clearly getting pickier and cookier and would need careful close inspection and testing, and more complete knowledge of what the data may look like.
Perl offers its own tutorial, perlretut, and the main full reference is perlre
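To see how greediness changes the result, here is a Python sketch of three of the patterns from this answer. It is only an illustration: the helper names and the sample strings are assumptions, and the patterns are used with re.search rather than Perl's list-context match.

```python
import re

def after_first_hyphen(s):
    # -(.+) : everything after the first hyphen
    m = re.search(r"-(.+)", s)
    return m.group(1) if m else None

def after_last_hyphen(s):
    # .*-(.+) : the greedy .* runs up to the last hyphen
    m = re.search(r".*-(.+)", s)
    return m.group(1) if m else None

def second_word_part(s):
    # \w*?-(\w+) : word characters only, first hyphenated word in the text
    m = re.search(r"\w*?-(\w+)", s)
    return m.group(1) if m else None

print(after_first_hyphen("very-sh0rt-t3rm"))
print(after_last_hyphen("very-sh0rt-t3rm"))
print(second_word_part("a sh0rt-t3rm plan"))
```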
-([a-zA-Z0-9]+) will match a - followed by a word, with just the word being captured.
Demo

Regex recursion captured string

I have a problem with a regex that has to capture a substring that has already been captured...
I have this regex:
(?<domain>\w+\.\w+)($|\/|\.)
And I want to capture every subdomain recursively. For example, in this string:
test1.test2.abc.def
This expression captures test1.test2 and abc.def but I need to capture:
test1.test2
test2.abc
abc.def
Do you know if there is any option to do this recursively?
Thanks!
Maybe the following:
(\.|^)(?=(\w+\.\w+))
Go with capturing group 2
You can use a positive look ahead to capture the next group.
/(\w+)\.(?=(\w+))/g
Demonstration.
Edit: JvdV's regex is more correct.
Note that \w+ will fail to match domains like regex-tester.com and will match the invalid regex_tester.com. [a-zA-Z0-9-]+ is closer to correct. See this answer for a complete regex.
It's simpler and more robust to do this by splitting on . and iterating through the pieces in pairs. For example, in Ruby...
"test1.test2.abc.def".split(".").each_cons(2) { |a|
puts a.join(".")
}
test1.test2
test2.abc
abc.def
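The same split-and-pair idea can be sketched in Python, where zip over the list and its one-step shift plays the role of each_cons(2). The function name is made up for this example.

```python
def adjacent_label_pairs(host):
    # Split on dots, then join each label with its right-hand neighbour.
    labels = host.split(".")
    return [f"{left}.{right}" for left, right in zip(labels, labels[1:])]

print(adjacent_label_pairs("test1.test2.abc.def"))
```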
You may use a well-known technique to extract overlapping matches, but you can't rely on \b boundaries as they can match between a non-word / word char and word / non-word char. You need unambiguous word boundaries for left and right hand contexts.
Use
(?=(?<!\w)(?<domain>\w+\.\w+)(?!\w))
See the regex demo. Details:
(?= - a positive lookahead that enables testing each location in the string and capturing the part of the string to the right of it
(?<!\w) - a left-hand side word boundary
(?<domain>\w+\.\w+) - Group "domain": 1+ word chars, . and 1+ word chars
(?!\w) - a right-hand side word boundary
) - end of the outer lookahead.
Another approach is to use dots as word delimiters. Then use
(?=(?<![^.])(?<domain>[^.]+\.[^.]+)(?![^.]))
See this regex demo. Adjust as you see fit.
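Both lookahead patterns can be tried in Python with a small change: Python spells named groups (?P<domain>...) rather than (?<domain>...). The sketch below assumes that adaptation; the helper name is illustrative.

```python
import re

# Word-boundary variant, with the named group in Python syntax.
WORD_BOUNDED = re.compile(r"(?=(?<!\w)(?P<domain>\w+\.\w+)(?!\w))")
# Dot-delimited variant.
DOT_DELIMITED = re.compile(r"(?=(?<![^.])(?P<domain>[^.]+\.[^.]+)(?![^.]))")

def overlapping_domains(pattern, text):
    # Each match is zero-width, so finditer retries at every position
    # and the captured pairs are allowed to overlap.
    return [m.group("domain") for m in pattern.finditer(text)]

print(overlapping_domains(WORD_BOUNDED, "test1.test2.abc.def"))
print(overlapping_domains(DOT_DELIMITED, "test1.test2.abc.def"))
```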

Matching comma after certain phrase

I'm using Atom's regex search and replace feature and not JavaScript code.
I thought this JavaScript-compatible regex would work (I want to match the commas that have Or rather directly before them):
(?!\b(Or rather)\b),
?! = Negative lookahead
\b = word boundary
(...) = search the words as a whole not character by character
\b = word boundary
, = the actual character.
However, if I remove characters from "Or rather" the regex still matches. I'm confused.
https://regexr.com/4keju
You probably meant to use a positive lookbehind instead of a negative lookahead
(?<=\b(Or rather)\b),
Regex Demo
You can activate lookbehind in Atom using flags; read this thread.
The (?!\b(Or rather)\b), pattern is equal to , as the negative lookahead always returns true since , is not equal to O.
To remove commas after Or rather in Atom, use
Find What: \b(Or rather),
Replace With: $1
Make sure you select the .* option to enable regular expressions (the Aa option toggles case sensitivity).
\b(Or rather), matches
\b - a word boundary
(Or rather) - Capturing group 1 that matches and saves the Or rather text in a memory buffer that can be accessed using $1 in the replacement pattern
, - a comma.
JS regex demo:
var s = "Or rather, an image.\nor rather, an image.\nor rather, friends.\nor rather, an image---\nOr rather, another time they.";
console.log(s.replace(/\b(Or rather),/g, '$1'));
// Case insensitive:
console.log(s.replace(/\b(Or rather),/gi, '$1'));
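The difference between the three patterns discussed in this thread can be checked with a short Python sketch. The sample text is made up, and Python's \1 stands in for $1 in the replacement.

```python
import re

text = "Or rather, an image.\nNot really, no.\nOr rather, friends."

# The negative lookahead can never see "O" where a comma is, so every comma matches.
lookahead_hits = re.findall(r"(?!\b(?:Or rather)\b),", text)

# Positive lookbehind: only commas directly preceded by "Or rather".
lookbehind_hits = [m.start() for m in re.finditer(r"(?<=\bOr rather),", text)]

# Capture and restore, as in the Find/Replace recipe above.
cleaned = re.sub(r"\b(Or rather),", r"\1", text)

print(len(lookahead_hits), len(lookbehind_hits))
print(cleaned)
```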
To match any comma after "Or rather" you can simply use
(or rather)(,) and access the second group using match[2]
Or an alternative would be to use or rather as a non capturing group
(?:or rather)(,) so the first group would be commas after "Or rather"

Regex pattern for localization

I am trying to find a regex pattern to fix a localize issue.
The usual delimiters are "." "," or "_", which I have stored in an array of delimiters.
I'm trying to find a pattern that matches any of these delimiters when it is followed by a number part ending with one or more 0.
For example 3,000 or 3,0 3.0 3.00
You could try positive lookahead
If indeed your data always has one or more 0 after any delimiter, using a positive lookahead ( (?=0+) in this case) might be what you are looking for...
More precisely, for the numbers you gave:
/([_.,](?=0+))/g
should do the trick!
You could try it out and experiment with regex here!
We could likely start with an expression similar to:
\d+[.,](\d+)?[0]
and add additional boundaries to it, if we like.
For instance, if we wish to capture the delimiters, we would be adding a capturing group:
\d+([.,])(\d+)?[0]
Demo
Or if we wish to remove delimiters, we would expand it to:
(\d+)([.,])(\d+)?([0])
and replace it with:
$1$3$4
Demo
Test
const regex = /(\d+)([.,])(\d+)?([0])/gm;
const str = `3,000
3,0
3.0
3.00`;
const subst = `$1$3$4`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log(result);
You could add the delimiters to a character class [_.,] and use word boundaries \b to prevent the number from being part of a larger word.
If they are the only value, you might also use anchors to assert the start ^ and the end $ of the string.
\b\d+[_.,]\d*0\b
That will match:
\b Word boundary
\d+ Match 1+ digits
[_.,] Match any of the listed in the character class
\d*0 Match 0+ digits followed by a zero
\b Word boundary
Regex demo
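Since the question mentions that the delimiters live in an array, here is a Python sketch that builds the character class from such a list and then applies the word-boundary pattern above. The list contents, the helper name and the extra test values are assumptions for illustration.

```python
import re

delimiters = [".", ",", "_"]  # assumed contents of the asker's delimiter array

# Build the character class from the list; re.escape guards any special characters.
delimiter_class = "[" + "".join(re.escape(d) for d in delimiters) + "]"
pattern = re.compile(r"\b\d+" + delimiter_class + r"\d*0\b")

def ends_in_zero_after_delimiter(value):
    # fullmatch requires the whole string to be one such number.
    return pattern.fullmatch(value) is not None

for value in ["3,000", "3,0", "3.0", "3.00", "3.5", "3.05"]:
    print(value, ends_in_zero_after_delimiter(value))
```

Use pattern.search or pattern.findall instead of fullmatch when the numbers are embedded in longer text.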

Regexp, that ignores only first capture group

We have a tab-separated list of "key=value" pairs.
How can we split it using a regexp?
The case key=value must be transformed into value. The case key=value=value2 must be transformed into value=value2.
https://regex101.com/r/dR5dT0/1 - I've started solution like this, but can't find beautiful way to remove only "key=" part from text.
UPD BTW, do you know cool crash courses on regular expressions?
You can just use
=(\S*)
See regex demo
Since the list is already formatted, the = in the pattern will always be the name/value delimiter.
The \S matches any non-whitespace character.
The * is a quantifier meaning that the \S should occur zero or more times (\S* matches zero or more non-whitespace characters).
You can use this regex for matching:
/\w+=(\S+)/
and grab captured group #1
RegEx Demo
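A small Python sketch of the second pattern, next to a plain string split for comparison. The function names and the sample line are made up; the split version assumes every tab-separated field contains at least one "=".

```python
import re

def values_by_regex(line):
    # \w+=(\S+) : skip the key and the first "=", capture the rest of the field.
    return re.findall(r"\w+=(\S+)", line)

def values_by_split(line):
    # Split fields on tabs, then split each field on its first "=" only.
    return [pair.split("=", 1)[1] for pair in line.split("\t")]

line = "a=1\tb=2=3\tname=x"
print(values_by_regex(line))
print(values_by_split(line))
```

Note that (\S+) skips pairs with an empty value, while =(\S*) would return an empty string for them.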