Regex: match the start of a string and select up to a char - regex

I am trying to create a regular expression that matches the beginning of a string and then replaces everything after a specific character. For example:
String = 123456789:0:0 => Output = 123456789:2:4
I need a regex that matches "123" at the beginning and then replaces only "0:0" with another string.
Matching "123" is easy: ^123, but I cannot find a way to then go up to the : and replace only the rest of the string.
I would appreciate any help.

You can use a negated character class to match up till the first occurrence of a colon.
In the replacement use capture group 1, followed by the replacement.
^(123[^:]*:)0:0$
^ Start of string
(123[^:]*:) Capture group 1: match 123, then any char except : zero or more times (using a negated character class), then the : itself
0:0 Match literally
$ End of string
Regex demo
If you want to replace all after matching the first colon, you could use .+
^(123[^:]*:).+
See another regex demo
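As a minimal sketch of the capture-group approach above in Python (the function name replace_tail and its default replacement value are assumptions made for illustration, not part of the original answer):

```python
import re

# Group 1 keeps "123", any non-colon characters, and the first colon.
# The literal tail "0:0" is what gets replaced.
PATTERN = re.compile(r"^(123[^:]*:)0:0$")


def replace_tail(value: str, new_tail: str = "2:4") -> str:
    """Return value with a trailing 0:0 swapped for new_tail, if it matches."""
    # A function replacement avoids any escaping issues in new_tail.
    return PATTERN.sub(lambda m: m.group(1) + new_tail, value)


print(replace_tail("123456789:0:0"))
print(replace_tail("923456789:0:0"))  # does not start with 123, left unchanged
```

Strings that do not start with 123 or do not end in 0:0 are returned unchanged, because the pattern is anchored at both ends.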

Without knowing your exact programming context it seems like you're using a version of a regex_replace function. With suitable grouping this is easy to do.
Don't think of what you want to replace. Think about what you want to keep.
^(123.*?:)(0:0)(.*?)$
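And as your replacement string use
${1}2:4$3
(written as ${1} rather than $1 so that the following 2 is not read as part of the group number; the exact syntax depends on your regex flavor)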

For replacing the "0:0", you can use:
Numbers between 123 and 0:0 => /^(123[0-9]+:)0:0/ replace ${1}2:4
OR
Anything between 123 and 0:0 => /^(123.+:)0:0/ replace ${1}2:4
This RegExp creates a group that starts with 123 and runs up to the colon before 0:0; we then reuse that group in the replacement string. The replacement syntax, however, depends on the programming language you use:
Example in PHP
$result = preg_replace("/^(123[0-9]+:)0:0/", "\${1}2:4", "123456789:0:0");
Example in JavaScript
let result = "123456789:0:0".replace(/^(123[0-9]+:)0:0/, "$12:4");
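The group-reference syntax matters because a digit follows the reference. The sketch below shows the same replacement in Python, where a bare \1 followed by 2 is read as group 12; the variable names are illustrative only.

```python
import re

text = "123456789:0:0"
pattern = r"^(123[0-9]+:)0:0"

# r"\12:4" is parsed as a reference to group 12, which this pattern
# does not have, so re.sub raises re.error.
try:
    re.sub(pattern, r"\12:4", text)
    ambiguous_reference_worked = True
except re.error:
    ambiguous_reference_worked = False

# \g<1> delimits the group number, so the following "2" stays literal.
result = re.sub(pattern, r"\g<1>2:4", text)

print(ambiguous_reference_worked)
print(result)
```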

With JavaScript:
const str = `123456789:0:0`;
console.log(str.replace(/(?<=^123[^:]*:).*/, `2:4`));
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
123 '123'
--------------------------------------------------------------------------------
[^:]* any character except: ':' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))

Related

RegEx expression to keep first appearance of word grouping

I have the following RegEx (\[[^]]+])(?=.*\1)
which identifies the first set of appearances of a duplicate word group inside a string (each word group is enclosed between [ ] brackets). However, I am trying to come up with a RegEx that identifies the last set of appearances of a duplicate word group. Reason being, I need to remove duplicate word groups while retaining the order in which each group appears in the overall string.
Using the following string as an example whereby only [John Smith] and [Jane Doe] are duplicate word groups:
[John Smith][John Smith][Mr. Smith][Jane Doe][Mrs. Doe][John Smith][Jane Doe][Doe][John][Smith John][John Smith Sr]
After using my RegEx in a RegEx Replace formula, I get the below:
[Mr. Smith][Mrs. Doe][John Smith][Jane Doe][Doe][John][Smith John][John Smith Sr]
However, I need my RegEx Replace formula to give me:
[John Smith][Mr. Smith][Jane Doe][Mrs. Doe][Doe][John][Smith John][John Smith Sr]
I have tried many ways to achieve the latter with no luck. Thanks in advance.
Considering infinite-width lookbehinds:
(\[[^\][]+])(?<=\1.*\1)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\[ '['
--------------------------------------------------------------------------------
[^\][]+ any character except: '\]', '[' (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
] ']'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\1 what was matched by capture \1
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\1 what was matched by capture \1
--------------------------------------------------------------------------------
) end of look-behind
Many regex engines do not support variable-length lookbehinds, but most support variable-length lookaheads. When working with an engine that supports variable-length lookaheads, but not variable-length lookbehinds, one approach that often works is to reverse the string, modify the (reversed) string with a regex and then reverse the resulting string. That approach could be used here.
Suppose, for example, the string were
"[John Smith][John Smith][Mr. Smith][Jane Doe][John Smith][Jane Doe][Doe]"
Reversing the string produces
"]eoD[]eoD enaJ[]htimS nhoJ[]eoD enaJ[]htimS .rM[]htimS nhoJ[]htimS nhoJ["
We now convert matches of the following expression to empty strings:
(\][^[]*\[)(?=.*\1)
which produces
"]eoD[]eoD enaJ[]htimS .rM[]htimS nhoJ["
Demo
Lastly, we reverse that string to obtain
"[John Smith][Mr. Smith][Jane Doe][Doe]"
The regular expression can be written in free-spacing mode to make it self-documenting:
( # begin capture group 1
\] # match ']'
[^[]* # match zero or more (as many as possible) chars other than '['
\[ # match '['
) # end capture group 1
(?= # begin a positive lookahead
.* # match zero or more (as many as possible) chars
\1 # match the content of capture group 1
) # end the positive lookahead
At first glance this may seem a kludge, but since reversing a string is so easy in any language it does provide a useful work-around in some cases. Mind you, doing what you want to do here in code is pretty easy in most languages. In Ruby, for example, you could write (str being a variable holding the string)
str.scan(/\[[^\]]*\]/).uniq.join
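For comparison, here is a rough Python sketch of both ideas from this answer: the reverse, substitute, reverse trick, and the plain "scan and deduplicate" route. The function names are made up for this example.

```python
import re


def keep_first_by_reversal(s: str) -> str:
    """Reverse, drop groups that recur later, then reverse back."""
    reversed_s = s[::-1]
    # In the reversed string each group reads "]...[". A group is removed
    # when the same text occurs again further right, i.e. earlier in s.
    cleaned = re.sub(r"(\][^\[]*\[)(?=.*\1)", "", reversed_s)
    return cleaned[::-1]


def keep_first_by_scan(s: str) -> str:
    """Collect the bracketed groups and keep the first copy of each."""
    groups = re.findall(r"\[[^\]]*\]", s)
    # dict.fromkeys preserves insertion order, so first occurrences win.
    return "".join(dict.fromkeys(groups))


sample = "[John Smith][John Smith][Mr. Smith][Jane Doe][John Smith][Jane Doe][Doe]"
print(keep_first_by_reversal(sample))
print(keep_first_by_scan(sample))
```

The scan version assumes the string consists only of bracketed groups; any text outside the brackets would be dropped by it, whereas the reversal version leaves such text in place.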

Get first match in closing part of regex

I need a regex that extracts a substring delimited by "[%" and "%]", with any text (or nothing) inside. For example:
Input: dsafsdfadsaffsdadsaffadsaf[%sadsad[%]%%]fdfsadfsad%]fsasdf
Output: [%sadsad[%]
I already wrote the expression \[%(.\n*)*%\], but it matches up to the last %]:
Output: [%sadsad[%]%%]fdfsadfsad%]
Does anyone know how to stop at the first closing match?
Put . and \n inside a capturing or non-capturing group, separated by the alternation operator |, and make the quantifier non-greedy.
\[%(.|\n)*?%\]
OR
You could also do the following.
\[%[\S\s]*?%\]
[\S\s]*? Matches any space or non-space character non-greedily.
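The difference between the greedy and non-greedy quantifier can be checked with a short Python sketch on the input from the question (variable names are illustrative):

```python
import re

text = "dsafsdfadsaffsdadsaffadsaf[%sadsad[%]%%]fdfsadfsad%]fsasdf"

# Greedy: [\s\S]* runs to the end and backtracks to the LAST "%]".
greedy = re.search(r"\[%[\s\S]*%\]", text).group(0)

# Non-greedy: [\s\S]*? grows one character at a time and stops at the
# FIRST "%]" it can reach.
lazy = re.search(r"\[%[\s\S]*?%\]", text).group(0)

print(greedy)
print(lazy)
```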
\[%[^\]]*%\]
You can try this to get the string up to the first closing %]. See demo.
https://regex101.com/r/gX5qF3/5
NODE EXPLANATION
--------------------------------------------------------------------------------
\[% '['%
--------------------------------------------------------------------------------
[^\]]* any character except: '\]' (0 or more
times (matching the most amount possible))
--------------------------------------------------------------------------------
%\] %']'

cryptic perl expression

I find the following statement in a perl (actually PDL) program:
/\/([\w]+)$/i;
Can someone decode this for me, an apprentice in perl programming?
Sure, I'll explain it from the inside out:
\w - matches a single character that can be used in a word (alphanumeric, plus '_')
[...] - matches a single character from within the brackets
[\w] - matches a single character that can be used in a word (kinda redundant here)
+ - matches the previous character, repeating as many times as possible, but must appear at least once.
[\w]+ - matches a group of word characters, many times over. This will find a word.
(...) - grouping. remember this set of characters for later.
([\w]+) - match a word, and remember it for later
$ - end-of-line. match something at the end of a line
([\w]+)$ - match the last word on a line, and remember it for later
\/ - a single slash character '/'. it must be escaped by backslash, because slash is special.
\/([\w]+)$ - match the last word on a line, after a slash '/', and remember the word for later. This is probably grabbing the directory/file name from a path.
/.../ - match syntax
/.../i - i means case-insensitive.
All together now:
/\/([\w]+)$/i; - match the last word on a line and remember it for later; the word must come after a slash. Basically, grab the filename from an absolute path. The case insensitive part is irrelevant, \w will already match both cases.
More details about Perl regex here: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
And as JRFerguson pointed out, YAPE::Regex::Explain is useful for tokenizing regex, and explaining the pieces.
You will find the YAPE::Regex::Explain module worth installing.
#!/usr/bin/env perl
use YAPE::Regex::Explain;
#...may need to single quote $ARGV[0] for the shell...
print YAPE::Regex::Explain->new( $ARGV[0] )->explain;
Assuming this script is named 'rexplain' do:
$ ./rexplain '/\/([\w]+)$/i'
...to obtain:
The regular expression:
(?-imsx:/\/([\w]+)$/i)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
/ '/'
----------------------------------------------------------------------
\/ '/'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[\w]+ any character of: word characters (a-z,
A-Z, 0-9, _) (1 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
/i '/i'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
UPDATE:
See also: https://stackoverflow.com/a/12359682/1015385 . As noted there and in the module's documentation:
There is no support for regular expression syntax added after Perl version 5.6, particularly any
constructs added in 5.10.
/\/([\w]+)$/i;
It is a regex, and if it is a complete statement, it is applied to the $_ variable, like so:
$_ =~ /\/([\w]+)$/i;
It looks for a slash \/, followed by an alphanumeric string \w+, followed by end of line $. It also captures () the alphanumeric string, which ends up in the variable $1. The /i on the end makes it case-insensitive, which has no effect in this case.
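To see the same pattern at work outside Perl, here is a small Python sketch. The helper name last_word_after_slash is hypothetical, and the match object's group(1) plays the role of Perl's $1.

```python
import re

# Same pattern as the Perl statement, minus the redundant brackets and /i.
PATTERN = re.compile(r"/(\w+)$")


def last_word_after_slash(s: str):
    """Return the word characters after the final slash, or None if no match."""
    m = PATTERN.search(s)
    return m.group(1) if m else None


for candidate in ["/abcdes", "foo/bar2", "foobar", "/usr/local/bin"]:
    print(candidate, "->", last_word_after_slash(candidate))
```

A name containing a dot, such as /tmp/file.txt, does not match, because . is not a word character and the pattern requires word characters all the way from the slash to the end.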
While it doesn't help "explain" a regex, once you have a test case, Damian's new Regexp::Debugger is a cool utility to watch what actually occurs during the matching. Install it and then do rxrx at the command line to start the debugger, then type in /\/([\w]+)$/ and '/r' (for example), and finally m to start the matching. You can then step through the debugger by hitting enter repeatedly. Really cool!
This compares $_ to a slash followed by one or more word characters at the end of the string (case-insensitively) and stores those characters in $1
$_ value then $1 value
------------------------------
"/abcdes" | "abcdes"
"foo/bar2" | "bar2"
"foobar" | undef # no slash so doesn't match
The Online Regex Analyzer deserves a mention. Here's a link to explain what your regex means, and pasted here for the record.
Sequence: match all of the followings in order
/ (slash)
--+
Repeat | (in GroupNumber:1)
AnyCharIn[ WordCharacter] one or more times |
--+
EndOfLine

Why does "Year 2010" =~ /([0-4]*)/ result in an empty string in $1?

If I run
"Year 2010" =~ /([0-4]*)/;
print $1;
I get empty string.
But
"Year 2010" =~ /([0-4]+)/;
print $1;
outputs "2010". Why?
You get an empty match right at the start of the string "Year 2010" for the first form because the * will immediately match 0 digits. The + form will have to wait until it sees at least one digit before it matches.
Presumably if you can go through all the matches of the first form, you'll eventually find 2010... but probably only after it finds another empty match before the 'e', then before the 'a' etc.
The first regular expression successfully matches zero digits at the start of the string, which results in capturing the empty string.
The second regular expression fails to match at the start of the string, but it does match when it reaches 2010.
The first matches the zero-length string at the beginning (before Y) and returns it. The second searches for one-or-more digits and waits until it finds 2010.
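The same behaviour shows up in other backtracking engines. As a sketch, Python's re module gives the equivalent result (variable names are illustrative):

```python
import re

text = "Year 2010"

# * is satisfied by zero digits, so the search succeeds immediately
# at position 0 with an empty capture.
star = re.search(r"([0-4]*)", text)

# + needs at least one digit, so the search moves on until it reaches "2".
plus = re.search(r"([0-4]+)", text)

# Collecting every match of the * form shows the empty matches that
# come before the digits are finally reached.
all_star = re.findall(r"[0-4]*", text)

print(repr(star.group(1)), star.span())
print(repr(plus.group(1)), plus.span())
print(all_star)
```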
You can also use YAPE::Regex::Explain to get an explanation of a regular expression, like this:
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new('([0-4]*)')->explain();
print YAPE::Regex::Explain->new('([0-4]+)')->explain();
output:
The regular expression:
(?-imsx:([0-4]*))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-4]* any character of: '0' to '4' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
The regular expression:
(?-imsx:([0-4]+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-4]+ any character of: '0' to '4' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
The star symbol tries to basically match 0 or more symbols in given set (in theory, the set {x,y}* consists of empty string and all possible finite sequences made of x and y), and therefore, it will match exactly zero characters (empty string) at the beginning of the string, zero characters after first character, zero characters after the second character, etc. Then finally it will find 2 and match whole 2010.
The plus symbol matches one or more characters from the given set ({x,y}+ consists of all possible finite sequences made of x and y, without the empty string, as opposed to {x,y}*). So the first met matching character is 2, then next - 0 is checked, then 1, then another 0, and then the sentence ends, so found group looks like '2010'.
It is standard behavior for regular expressions, defined in formal language theory. I strongly suggest learning a bit of theory about regular expressions; it can't hurt, and it can help :)
We have this as a trick question in Learning Perl. Any regex that can match zero characters will match zero characters at the beginning of the string if it cannot match anything longer there.
The Perl regex engine returns the leftmost match: it tries each starting position from left to right and stops at the first position where the pattern succeeds, with greedy quantifiers taking as much as they can at that position. Not all regex engines work like that, though. If you want all of the technical details, read Mastering Regular Expressions, which explains how regex engines work and how they find matches.
To make your first regex capture 2010, anchor it at the end of the string with '$':
"Year 2010" =~ /([0-4]*)$/;
print $1;

Trying to match what is before /../ but after / with regular expressions

I am trying to match what is before /../ but after / with a regular expression, but I want it to look back and stop at the first /
I feel like I am close, but it just finds the first slash and then takes everything after it. The input is this:
this/is/a/./path/that/../includes/face/./stuff/../hat
and my regular expression is:
#\/(.*)\.\.\/#
matching /is/a/./path/that/../includes/face/./stuff/../ instead of just that/../ and stuff/../
How should I change my regex to make it work?
.* means "match any number of any character at all[1]". This is not what you want. You want to match any number of non-/ characters, which is written [^/]*.
Any time you are tempted to use .* or .+ in a regex, be very suspicious. Stop and ask yourself whether you really mean "any character at all[1]" or not - most of the time you don't. (And, yes, non-greedy quantifiers can help with this, but character classes are both more efficient for the regex engine to match against and more clear in their communication of your intent to human readers.)
[1] OK, OK... . isn't exactly "any character at all" - it doesn't match newline (\n) by default in most regex flavors - but close enough.
Change your pattern so that only characters other than / ([^/]) get matched:
#([^/]*)/\.\./#
Alternatively, you can use a lookahead.
#(\w+)(?=/\.\./)#
Explanation
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
) end of look-ahead
I think you're essentially right, you just need to make the match non-greedy, or change the (.*) to not allow slashes: #/([^/]*)/\.\./#
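A quick way to compare the original pattern with the character-class version is to run both on the path from the question. This is only a sketch in Python, with illustrative variable names:

```python
import re

path = "this/is/a/./path/that/../includes/face/./stuff/../hat"

# Greedy .* starts at the first slash and backtracks only as far as the
# LAST "../", so a single long capture comes back.
greedy = re.findall(r"/(.*)\.\./", path)

# [^/]* cannot cross a slash, so each capture is one path segment that
# is immediately followed by "/../".
segments = re.findall(r"/([^/]*)/\.\./", path)

print(greedy)
print(segments)
```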
In your favourite language, do a few splits and some string manipulation, e.g. in Python:
>>> s="this/is/a/./path/that/../includes/face/./stuff/../hat"
>>> a=s.split("/../")[:-1] # the last item is not required.
>>> for item in a:
...     print(item.split("/")[-1])
...
that
stuff
In Python:
>>> import re
>>> test = 'this/is/a/./path/that/../includes/face/./stuff/../hat'
>>> regex = re.compile(r'/\w+?/\.\./')
>>> regex.findall(test)
['/that/../', '/stuff/../']
Or if you just want the text without the slashes:
>>> regex = re.compile(r'/(\w+?)/\.\./')
>>> regex.findall(test)
['that', 'stuff']
([^/]+) will capture all the text between slashes.
([^/]+)*/\.\. matches that/.. and stuff/.. in your string this/is/a/./path/that/../includes/face/./stuff/../hat. It captures that or stuff, and you can change that, obviously, by changing the placement of the capturing parens and your program logic.
You didn't state whether you want to capture or just match. The regex here will only capture the last occurrence of the match (stuff), but it is easily changed to return that and then stuff if used in a global match.
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1 (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
[^/]+ any character except: '/' (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
)* end of \1 (NOTE: because you're using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\. '.'