PHP, Incosistent results with preg_replace (RegEx) - regex

I'm looking for a nudge in the right direction because the results I get from preg_replace don't make sense to me.
I've the following RegEx:
/([a-zA-Z0-9]{1,})/([a-zA-Z0-9]{1,})/([a-zA-Z0-9_]{1,})/([a-zA-Z0-9_]{1,})/([a-zA-Z0-9_]{1,})\b
I've a file which consists of lines like this:
1:
*/tdn/quota/plot_3/boot_tdd_8/Homes_Homes1/boot/bplsed/ruc001/No Files/pl1/Cookies/MMTException/container.rig,11/12/2017,29/11/2017,29/11/2017*
2:
*/vdm/quota/plot_1/boot_tdd_1/Homes_Homes2/.etc/nonrig_tile_edit.vids,07/08/2014,07/08/2014,07/08/2014*
3:
*/vdm/quota/plot_5/boot_tdd_3/Homes_Homes1/boot/int/rlt111/pl1/Cookies/container.rig,19/11/2019,13/11/2017,13/11/2017*
My goal is to only keep everything after the /Homes_Homes/ part.
I get the correct replacement for the first file path with my Regex:
*/boot/bplsed/ruc001/No Files/pl1/Cookies/MMTException/container.rig,11/12/2017,29/11/2017,29/11/2017*
The second file path is also correct:
*/.etc/nonrig_tile_edit.vids,07/08/2014,07/08/2014,07/08/2014*
However, for the last file path I get:
*/container.rig*
instead of
*/boot/int/rlt111/pl1/Cookies/container.rig,19/11/2019,13/11/2017,13/11/2017*
Why does preg_replace fail with the third file path?

The reason you get that last result is because preg_replace will replace all the matches and in the last example string the pattern matches twice.
What you might do is set the 4th parameter $limit to 1 to do a single replacement.
Not all the character classes in your pattern match an underscore, but if it would be ok to do so, you might shorten the pattern using a quantifier {5}, and anchor ^ to assert the start of the srting and make use of \K to match and then forget the * at the start of the string.
^\*\K(?:/\w+){5}
Regex demo | Php demo
For example
$re = '~^\*\K(?:/\w+){5}~';
$strings = [
"*/tdn/quota/plot_3/boot_tdd_8/Homes_Homes1/boot/bplsed/ruc001/No Files/pl1/Cookies/MMTException/container.rig,11/12/2017,29/11/2017,29/11/2017*",
"*/vdm/quota/plot_1/boot_tdd_1/Homes_Homes2/.etc/nonrig_tile_edit.vids,07/08/2014,07/08/2014,07/08/2014*",
"*/vdm/quota/plot_5/boot_tdd_3/Homes_Homes1/boot/int/rlt111/pl1/Cookies/container.rig,19/11/2019,13/11/2017,13/11/2017*"
];
foreach ($strings as $s) {
echo preg_replace($re, '', $s) . PHP_EOL;
}
Output
*/boot/bplsed/ruc001/No Files/pl1/Cookies/MMTException/container.rig,11/12/2017,29/11/2017,29/11/2017*
*/.etc/nonrig_tile_edit.vids,07/08/2014,07/08/2014,07/08/2014*
*/boot/int/rlt111/pl1/Cookies/container.rig,19/11/2019,13/11/2017,13/11/2017*

Related

Regex - Exclude some string after matches [duplicate]

I need to extract from a string a set of characters which are included between two delimiters, without returning the delimiters themselves.
A simple example should be helpful:
Target: extract the substring between square brackets, without returning the brackets themselves.
Base string: This is a test string [more or less]
If I use the following reg. ex.
\[.*?\]
The match is [more or less]. I need to get only more or less (without the brackets).
Is it possible to do it?
Easy done:
(?<=\[)(.*?)(?=\])
Technically that's using lookaheads and lookbehinds. See Lookahead and Lookbehind Zero-Width Assertions. The pattern consists of:
is preceded by a [ that is not captured (lookbehind);
a non-greedy captured group. It's non-greedy to stop at the first ]; and
is followed by a ] that is not captured (lookahead).
Alternatively you can just capture what's between the square brackets:
\[(.*?)\]
and return the first captured group instead of the entire match.
If you are using JavaScript, the solution provided by cletus, (?<=\[)(.*?)(?=\]) won't work because JavaScript doesn't support the lookbehind operator.
Edit: actually, now (ES2018) it's possible to use the lookbehind operator. Just add / to define the regex string, like this:
var regex = /(?<=\[)(.*?)(?=\])/;
Old answer:
Solution:
var regex = /\[(.*?)\]/;
var strToMatch = "This is a test string [more or less]";
var matched = regex.exec(strToMatch);
It will return:
["[more or less]", "more or less"]
So, what you need is the second value. Use:
var matched = regex.exec(strToMatch)[1];
To return:
"more or less"
Here's a general example with obvious delimiters (X and Y):
(?<=X)(.*?)(?=Y)
Here it's used to find the string between X and Y. Rubular example here, or see image:
You just need to 'capture' the bit between the brackets.
\[(.*?)\]
To capture you put it inside parentheses. You do not say which language this is using. In Perl for example, you would access this using the $1 variable.
my $string ='This is the match [more or less]';
$string =~ /\[(.*?)\]/;
print "match:$1\n";
Other languages will have different mechanisms. C#, for example, uses the Match collection class, I believe.
[^\[] Match any character that is not [.
+ Match 1 or more of the anything that is not [. Creates groups of these matches.
(?=\]) Positive lookahead ]. Matches a group ending with ] without including it in the result.
Done.
[^\[]+(?=\])
Proof.
http://regexr.com/3gobr
Similar to the solution proposed by null. But the additional \] is not required. As an additional note, it appears \ is not required to escape the [ after the ^. For readability, I would leave it in.
Does not work in the situation in which the delimiters are identical. "more or less" for example.
Most updated solution
If you are using Javascript, the best solution that I came up with is using match instead of exec method.
Then, iterate matches and remove the delimiters with the result of the first group using $1
const text = "This is a test string [more or less], [more] and [less]";
const regex = /\[(.*?)\]/gi;
const resultMatchGroup = text.match(regex); // [ '[more or less]', '[more]', '[less]' ]
const desiredRes = resultMatchGroup.map(match => match.replace(regex, "$1"))
console.log("desiredRes", desiredRes); // [ 'more or less', 'more', 'less' ]
As you can see, this is useful for multiple delimiters in the text as well
PHP:
$string ='This is the match [more or less]';
preg_match('#\[(.*)\]#', $string, $match);
var_dump($match[1]);
This one specifically works for javascript's regular expression parser /[^[\]]+(?=])/g
just run this in the console
var regex = /[^[\]]+(?=])/g;
var str = "This is a test string [more or less]";
var match = regex.exec(str);
match;
To remove also the [] use:
\[.+\]
I had the same problem using regex with bash scripting.
I used a 2-step solution using pipes with grep -o applying
'\[(.*?)\]'
first, then
'\b.*\b'
Obviously not as efficient at the other answers, but an alternative.
I wanted to find a string between / and #, but # is sometimes optional. Here is the regex I use:
(?<=\/)([^#]+)(?=#*)
Here is how I got without '[' and ']' in C#:
var text = "This is a test string [more or less]";
// Getting only string between '[' and ']'
Regex regex = new Regex(#"\[(.+?)\]");
var matchGroups = regex.Matches(text);
for (int i = 0; i < matchGroups.Count; i++)
{
Console.WriteLine(matchGroups[i].Groups[1]);
}
The output is:
more or less
If you need extract the text without the brackets, you can use bash awk
echo " [hola mundo] " | awk -F'[][]' '{print $2}'
result:
hola mundo

Regex Replace anything that does not match capturing group

I have a String like following: [Monster:Test]Maps=1,5,2,3[Monster:Test2]Maps=2-5
I need to replace the string of unnecessary text.
The only text I want to keep is the brackets including the text between the brackets. So only [Monster:Test] and [Monster:Test2] should be kept.
So my regex to find it is: \\[(.*)\\]
I don't understand how to replace anything that does not match my group.
How about using preg_match_all
$s = "[Monster:Test]Maps=1,5,2,3[Monster:Test2]Maps=2-5~";
preg_match_all("/\[[^\]]+\]/", $s, $m);
echo implode($m[0]);
Results into:
[Monster:Test][Monster:Test2]
Does this work as required?
Just join all matches and you should end up with what you want.
/\[[^\]]+\]/g;
matches the first [
matches anything that is not a ]
matches a ]
g flag for all matches
Implementation Example:
var string = "[Monster:Test]Maps=1,5,2,3[Monster:Test2]Maps=2-5";
var result = string.match(/\[[^\]]+\]/g).join("");
console.log(result);
* Although the example is javascript, you should be able to do this in any other language.

Tcl regexp does not escape asterisk (*)

In my script I get a string that looks like this:
Reading thisfile.txt
"lib" maps to directory somedir/work.
"superlib" maps to directory somedir/work.
"anotherlib" maps to directory somedir/anotherlib.
** Error: (errorcode) Cannot access file "somedir/anotherlib". <--
No such file or directory. (errno = ENOENT) <--
Reading anotherfile.txt
.....
But the two marked lines with the error code only appear from time to time.
I'm trying to use a regexpression to get the lines from after Reading thisfile.txt to the line before either Reading anotherfile.txt or, if it is there, before **.
So result should in every case look like this:
"lib" maps to directory somedir/work.
"superlib" maps to directory somedir/work.
"anotherlib" maps to directory somedir/anotherlib.
I have tried it with this regexp:
set pattern ".*Reading thisfile.txt\n(.*)\n.*Reading .*$"
Then I do
regexp -all $pattern $data -> result
But that only works if there is no error message.
So I'm trying to look for the *.
set pattern ".*Reading thisfile.txt\n(.*)\n.*\[\*|Reading\].*$"
But that also does not work. The part with ** Error is still there.
I wonder why. This one doesn't even compile:
set pattern ".*Reading thisfile.txt\n(.*)\n.*\*?.*Reading .*$"
any idea how I can find the and not match the *?
From the way you wrote your regex, you will have to use braces:
set pattern {.*Reading thisfile\.txt\n(.*)\n.*\*?.*Reading .*$}
If you used quotes, you would have had to use:
set pattern ".*Reading thisfile\\.txt\n(.*)\n.*\\*?.*Reading .*$"
i.e. basically put a second backslash to escape the first ones.
The above will be able to grab something; albeit everything between the first and the last Reading.
If you want to match from Reading thisfile.txt to the next line beginning with asterisk, then you could use:
set pattern {^Reading thisfile\.txt\n(.*?)\n(?=^Reading|^\*)}
regexp -all -lineanchor -- $pattern $data -> result
(?=^Reading|^\*) is a positive lookahead and I changed your (.*) to (.*?) so that you really get all the occurrences and not from the first to the last Reading.
The positive lookahead will match if either Reading or * is ahead and are both starting on a new line.
-lineanchor makes ^ match at every beginning of line instead of at the start of the string.
codepad demo
I forgot to mention that if you have more than one match, you will have to set the results of the regexp and use the -inline modifier instead of using the above construct (else you'll get only the last submatch)...
set results [regexp -all -inline -lineanchor -- $pattern $data]
foreach {main sub} $results {
puts $sub
}
I'm unfamiliar with tcl but the following regex will give you matches of which the 1st capture-group contains the filename and the 2nd capture-group contains all the lines you want:
^Reading ([^\n]*)\n((?:[^\n]|\n(?!Reading|\*\*))*)
Debuggex Demo
Basically the (?:[^\n]|\n(?!Reading|\*\*))* is saying "Match anything that isn't a new-line character or a new-line character not followed by either Reading or **".
What I'm getting from Jerry's answer is you'd define that in tcl like so:
set pattern {^Reading ([^\n]*)\n((?:[^\n]|\n(?!Reading|\*\*))*)}

Remove characters and numbers from a string in perl

I'm trying to rename a bunch of files in my directory and I'm stuck at the regex part of it.
I want to remove certain characters from a filename which appear at the beginning.
Example1: _00-author--book_revision_
Expected: Author - Book (Revision)
So far, I am able to use regex to remove underscores & captialize the first letter
$newfile =~ s/_/ /g;
$newfile =~ s/^[0-9]//g;
$newfile =~ s/^[0-9]//g;
$newfile =~ s/^-//g;
$newfile = ucfirst($newfile);
This is not a good method. I need help in removing all characters until you hit the first letter, and when you hit the first '-' I want to add a space before and after '-'.
Also when I hit the second '-' I want to replace it with '('.
Any guidance, tips or even suggestions on taking the right approach is much appreciated.
So do you want to capitalize all the components of the new filename, or just the first one? Your question is inconsistent on that point.
Note that if you are on Linux, you probably have the rename command, which will take a perl expression and use it to rename files for you, something like this:
rename 'my ($a,$b,$r);$_ = "$a - $b ($r)"
if ($a, $b, $r) = map { ucfirst $_ } /^_\d+-(.*?)--(.*?)_(.*?)_$/' _*
Your instructions and your example don't match.
According to your instructions,
s/^[^\pL]+//; # Remove everything until first letter.
s/-/ - /; # Replace first "-" with " - "
s/-[^-]*\K-/(/; # Replace second "-" with "("
According to your example,
s/^[^\pL]+//;
s/--/ - /;
s/_/ (/;
s/_/)/;
s/(?<!\pL)(\pL)/\U$1/g;
$filename =~ s,^_\d+-(.*?)--(.*?)_(.*?)_$,\u\1 - \u\2 (\u\3),;
My Perl interpreter (using strict and warnings) says that this is better written as:
$filename =~ s,^_\d+-(.*?)--(.*?)_(.*?)_$,\u$1 - \u$2 (\u$3),;
The first one probably is more sedish for its taste! (Of course both version works just the same.)
Explanation (as requested by stema):
$filename =~ s/
^ # matches the start of the line
_\d+- # matches an underscore, one or more digits and a hypen minus
(.*?)-- # matches (non-greedyly) anything before two consecutive hypen-minus
# and captures the entire match (as the first capture group)
(.*?)_ # matches (non-greedyly) anything before a single underscore and
# captures the entire match (as the second capture group)
(.*?)_ # does the same as the one before (but captures the match as the
# third capture group obviously)
$ # matches the end of the line
/\u$1 - \u$2 (\u$3)/x;
The \u${1..3} in replacement specification simply tells Perl to insert the capture groups from 1 to 3 with their first character made upper-case. If you'd wanted to make the entire match (in a captured group) upper-case you'd had to use \U instead.
The x flags turns on verbose mode, which tells the Perl interpreter that we want to use # comments, so it will ignore these (and any white space in the regular expression - so if you want to match a space you have to use either \s or \). Unfortunately I couldn't figure out how to tell Perl to ignore white space in the * replacement* specification - this is why I've written that on a single line.
(Also note that I've changed my s terminator from , to / - Perl barked at me if I used the , with verbose mode turned on ... not exactly sure why.)
If they all follow that format then try:
my ($author, $book, $revision) = $newfiles =~ /-(.*?)--(.*?)_(.*?)_/;
print ucfirst($author ) . " - $book ($revision)\n";

Regular Expression to find a string included between two characters while EXCLUDING the delimiters

I need to extract from a string a set of characters which are included between two delimiters, without returning the delimiters themselves.
A simple example should be helpful:
Target: extract the substring between square brackets, without returning the brackets themselves.
Base string: This is a test string [more or less]
If I use the following reg. ex.
\[.*?\]
The match is [more or less]. I need to get only more or less (without the brackets).
Is it possible to do it?
Easy done:
(?<=\[)(.*?)(?=\])
Technically that's using lookaheads and lookbehinds. See Lookahead and Lookbehind Zero-Width Assertions. The pattern consists of:
is preceded by a [ that is not captured (lookbehind);
a non-greedy captured group. It's non-greedy to stop at the first ]; and
is followed by a ] that is not captured (lookahead).
Alternatively you can just capture what's between the square brackets:
\[(.*?)\]
and return the first captured group instead of the entire match.
If you are using JavaScript, the solution provided by cletus, (?<=\[)(.*?)(?=\]) won't work because JavaScript doesn't support the lookbehind operator.
Edit: actually, now (ES2018) it's possible to use the lookbehind operator. Just add / to define the regex string, like this:
var regex = /(?<=\[)(.*?)(?=\])/;
Old answer:
Solution:
var regex = /\[(.*?)\]/;
var strToMatch = "This is a test string [more or less]";
var matched = regex.exec(strToMatch);
It will return:
["[more or less]", "more or less"]
So, what you need is the second value. Use:
var matched = regex.exec(strToMatch)[1];
To return:
"more or less"
Here's a general example with obvious delimiters (X and Y):
(?<=X)(.*?)(?=Y)
Here it's used to find the string between X and Y. Rubular example here, or see image:
You just need to 'capture' the bit between the brackets.
\[(.*?)\]
To capture you put it inside parentheses. You do not say which language this is using. In Perl for example, you would access this using the $1 variable.
my $string ='This is the match [more or less]';
$string =~ /\[(.*?)\]/;
print "match:$1\n";
Other languages will have different mechanisms. C#, for example, uses the Match collection class, I believe.
[^\[] Match any character that is not [.
+ Match 1 or more of the anything that is not [. Creates groups of these matches.
(?=\]) Positive lookahead ]. Matches a group ending with ] without including it in the result.
Done.
[^\[]+(?=\])
Proof.
http://regexr.com/3gobr
Similar to the solution proposed by null. But the additional \] is not required. As an additional note, it appears \ is not required to escape the [ after the ^. For readability, I would leave it in.
Does not work in the situation in which the delimiters are identical. "more or less" for example.
Most updated solution
If you are using Javascript, the best solution that I came up with is using match instead of exec method.
Then, iterate matches and remove the delimiters with the result of the first group using $1
const text = "This is a test string [more or less], [more] and [less]";
const regex = /\[(.*?)\]/gi;
const resultMatchGroup = text.match(regex); // [ '[more or less]', '[more]', '[less]' ]
const desiredRes = resultMatchGroup.map(match => match.replace(regex, "$1"))
console.log("desiredRes", desiredRes); // [ 'more or less', 'more', 'less' ]
As you can see, this is useful for multiple delimiters in the text as well
PHP:
$string ='This is the match [more or less]';
preg_match('#\[(.*)\]#', $string, $match);
var_dump($match[1]);
This one specifically works for javascript's regular expression parser /[^[\]]+(?=])/g
just run this in the console
var regex = /[^[\]]+(?=])/g;
var str = "This is a test string [more or less]";
var match = regex.exec(str);
match;
To remove also the [] use:
\[.+\]
I had the same problem using regex with bash scripting.
I used a 2-step solution using pipes with grep -o applying
'\[(.*?)\]'
first, then
'\b.*\b'
Obviously not as efficient at the other answers, but an alternative.
I wanted to find a string between / and #, but # is sometimes optional. Here is the regex I use:
(?<=\/)([^#]+)(?=#*)
Here is how I got without '[' and ']' in C#:
var text = "This is a test string [more or less]";
// Getting only string between '[' and ']'
Regex regex = new Regex(#"\[(.+?)\]");
var matchGroups = regex.Matches(text);
for (int i = 0; i < matchGroups.Count; i++)
{
Console.WriteLine(matchGroups[i].Groups[1]);
}
The output is:
more or less
If you need extract the text without the brackets, you can use bash awk
echo " [hola mundo] " | awk -F'[][]' '{print $2}'
result:
hola mundo