Wildcard beginning of a line in perl - regex

How to use wildcard for beginning of a line?
Example, I want to replace abc with def.
This is what my file looks like
abc
abc
abc
hg abc
Now I want that abc should be replaced in only first 3 lines. How to do it?
$_ =~ s/['\s'] * abc ['\s'] * /def/g;
What condition to be put before beginning of first space?
Thanks

What about:
s/(^ *)abc/$1def/g
(^ *) -> zero or more spaces at start of line
This will strictly replace abc with def.
Also note I've used a real space and not \s because you said "beginning of first space". \s matches more characters than only space.

You are making a couple of mistakes in your regex
$_ =~ s/['\s'] * abc ['\s'] * /def/g;
You don't need /g (global, match as many times as possible) if you only want to replace from the beginning of the string (since that can only match once).
Inside a character class bracket, most characters are literal; the exceptions are ], \, - and ^. So ['\s'] means "match whitespace or an apostrophe '".
Spaces inside the regex are interpreted literally, unless the /x modifier is used (which it is not)
Quantifiers apply to whatever they immediately precede, so \s* means "zero or more whitespace", but \s * means "exactly one whitespace, followed by zero or more space". Again, unless /x is used.
You do not need to supply $_ =~, since that is the variable any regex uses unless otherwise specified.
If you want to replace abc, and only abc when it is the first non-whitespace in a line, you can do this:
s/^\s*\Kabc/def/
An alternate for the \K (keep) escape is to capture and put back
s/^(\s*)abc/$1def/
If you want to keep the whitespace following the target string abc, you do not need to do anything. If you want it removed, just add \s* at the end
s/^\s*\Kabc\s*/def/
Also note that this is simply a way to condense logic into one statement. You can also achieve the same by using very simple building blocks:
if (/^\s*abc/) { # if abc is the first non-whitespace
s/abc/def/; # ...substitute it
}
Since the substitution only happens once (if the /g modifier is not used), and only the first match is affected, this will flawlessly replace abc with def.
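A minimal sketch of the \K version run against lines like those in the question. The leading spaces and tab on two of the lines are my own assumption, added only to show that indentation survives; \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Sample lines modelled on the question; the indentation is assumed.
my @lines = ("abc", "  abc", "\tabc", "hg abc");

my @result;
for my $line (@lines) {
    my $copy = $line;
    # ^\s* anchors to the start and skips leading whitespace;
    # \K drops that whitespace from the part that gets replaced.
    $copy =~ s/^\s*\Kabc/def/;
    push @result, $copy;
}

printf "<%s>\n", $_ for @result;
```

The last line is left alone because abc is not the first non-whitespace text on it.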

Try this:
$_ =~ s/^['\s'] * abc ['\s'] * /def/g;
If you need to check from start of a line then use ^.
Also, I am not sure why you have ' and spaces in your regex. This should also work for you:
$_ =~ s/^[\s]*abc[\s]*/def/g;

Use ^ character, and remove unnecessary apostrophes, spaces and [ ] :
$_ =~ s/^\s*abc/def/g
If you want to keep those spaces that were before the "abc":
$_ =~ s/^(\s*)abc/$1def/g

Related

perl regex to remove initial all-whitespace lines from a string: why does it work?

The regex s/\A\s*\n// removes every all-whitespace line from the beginning of a string.
It leaves everything else alone, including any whitespace that might begin the first visible line.
By "visible line," I mean a line that satisfies /\S/.
The code below demonstrates this.
But how does it work?
\A anchors the start of the string
\s* greedily grabs all whitespace. But without the (?s) modifier, it should stop at the end of the first line, should it not?
See
https://perldoc.perl.org/perlre.
Suppose that without the (?s) modifier it nevertheless "treats the string as a single line".
Then I would expect the greedy \s* to grab every whitespace character it sees,
including linefeeds. So it would pass the linefeed that precedes the "dogs" string, keep grabbing whitespace, run into the "d", and we would never get a match.
Nevertheless, the code does exactly what I want. Since I can't explain it, it's like a kludge, something that happens to work, discovered through trial and error. What is the reason it works?
#!/usr/bin/env perl
use strict; use warnings;
print $^V; print "\n";
my @strs=(
join('',"\n", "\t", ' ', "\n", "\t", ' dogs',),
join('',
"\n",
"\n\t\t\x20",
"\n\t\t\x20",
'......so what?',
"\n\t\t\x20",
),
);
my $count=0;
for my $onestring(@strs)
{
$count++;
print "\n$count ------------------------------------------\n";
print "|$onestring|\n";
(my $try1=$onestring)=~s/\A\s*\n//;
print "|$try1|\n";
}
But how does it work?
...
I would expect the greedy \s* to grab every whitespace character it sees, including linefeeds. So it would pass the linefeed that precedes the "dogs" string, keep grabbing whitespace, run into the "d", and we would never get a match.
Correct -- the \s* at first grabs everything up to the d (in dogs) and with that the match would fail ... so it backs up, a character at a time, shortening that greedy grab so to give a chance to the following pattern, here \n, to match.
And that works! So \s* matches up to (the last!) \n, that one is matched by the following \n in the pattern, and all is well. That's removed and we stay with "\tdogs" which is printed.
This is called backtracking. See about it also in perlretut. Backtracking can be suppressed, most notably by possessive forms (like \w++ etc), or rather by the extended construct (?>...).
But without the (?s) modifier, it should stop at the end of the first line, should it not?
Here you may be confusing \s with ., which indeed does not match \n (without /s)
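To see the backtracking at work, here is a small sketch that compares the greedy \s* with its possessive form \s*+ on the first test string from the question. The variable names are mine; possessive quantifiers need Perl 5.10 or later.

```perl
use strict;
use warnings;

# First test string from the question: "\n\t \n\t dogs"
my $str = join '', "\n", "\t", ' ', "\n", "\t", ' dogs';

# Greedy \s* first takes all leading whitespace, then gives characters
# back until the following \n can match the last newline.
my $greedy = $str;
$greedy =~ s/\A\s*\n//;

# Possessive \s*+ never gives anything back, so the \n after it has
# nothing left to match and the whole pattern fails.
my $possessive_matched = ($str =~ /\A\s*+\n/) ? 1 : 0;

printf "greedy result: <%s>\n", $greedy;
print  "possessive matched: $possessive_matched\n";
```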
There are two questions here.
The first is about the interaction of \s and (lack of) (?s). Quite simply, there is no interaction.
\s matches whitespace characters, which include Line Feed (LF). It's not affected by (?s) whatsoever.
(?s) exclusively affects the . metacharacter.
(?-s) causes . to match all characters except LF. [Default]
(?s) causes . to match all characters.
If one wanted to match whitespace on the current line, one could use \h instead of \s. It only matches horizontal whitespace, thus excluding CR and LF (among others).
Alternatively, (?[ \s - \n ])[1], [^\S\n][2] and \s(?<!\n)[3] all match whitespace characters other than LF.
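A quick way to check the difference is to measure how much each class consumes at the start of a string that has a line feed in the middle of its whitespace. This is only a sketch; the sample string is made up.

```perl
use strict;
use warnings;

# Two spaces, a tab, a line feed, two spaces, then text (assumed sample).
my $text = "  \t\n  next";

my ($any)        = $text =~ /\A(\s*)/;       # \s crosses the line feed
my ($horizontal) = $text =~ /\A(\h*)/;       # \h stops before it
my ($not_lf)     = $text =~ /\A([^\S\n]*)/;  # whitespace other than LF

printf "%d %d %d\n", length($any), length($horizontal), length($not_lf);
```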
The second is about a misconception of what greediness means.
Greediness or lack thereof doesn't affect if a pattern can match, just what it matches. For example, for a given input, /a+/ and /a+?/ will both match, or neither will match. It's impossible for one to match and not the other.
"aaaa" =~ /a+/ # Matches 4 characters at position 0.
"aaaa" =~ /a+?/ # Matches 1 character at position 0.
"bbbb" =~ /a+/ # Doesn't match.
"bbbb" =~ /a+?/ # Doesn't match.
When something is greedy, it means it will match the most possible at the current position that allows the entire pattern to match. Take the following for example:
"ccccd" =~ /.*d/
This pattern can match by having .* match only cccc instead of ccccd, and thus does so. This is achieved through backtracking. .* initially matches ccccd, then it discovers that d doesn't match, so .* tries matching only cccc. This allows the d and thus the entire pattern to match.
You'll find backtracking used outside of greediness too. "efg" =~ /^(e|.f)g/ matches because it tries the second alternative when it's unable to match g when using the first alternative.
In the same way as .* avoids matching the d in the earlier example, the \s* avoids matching the LF and tab before dogs in your example.
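The same points can be checked by capturing what each quantifier ends up holding. A small sketch; the second input string is my own addition to make greedy and non-greedy differ visibly.

```perl
use strict;
use warnings;

# .* starts with "ccccd", then backs off one character so that d can match.
my ($greedy) = "ccccd" =~ /(.*)d/;

# With two d's, greedy backs off only to the last d, lazy stops at the first.
my ($greedy_last) = "ccdccd" =~ /(.*)d/;
my ($lazy_first)  = "ccdccd" =~ /(.*?)d/;

# Backtracking in an alternation: "e" is tried first, g fails to match,
# so the second alternative ".f" is used instead.
my ($alt) = "efg" =~ /^(e|.f)g/;

print "$greedy $greedy_last $lazy_first $alt\n";
```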
[1] Requires use experimental qw( regex_sets ); before 5.36, but it has been safe to use since 5.18, as it was accepted without change from its introduction as an experimental feature.
[2] Less clear because it uses double negatives.
[^\S\n]
= A char that's ( not( not(\s) or LF ) )
= A char that's ( not(not(\s)) and not(LF) )
= A char that's ( \s and not LF )
[3] Less efficient, and far from as pretty as the regex set.

Regex to find(/replace) multiple instances of character in string

I have a (probably very basic) question about how to construct a (perl) regex, perl -pe 's///g;', that would find/replace multiple instances of a given character/set of characters in a specified string. Initially, I thought the g "global" flag would do this, but I'm clearly misunderstanding something very central here. :/
For example, I want to eliminate any non-alphanumeric characters in a specific string (within a larger text corpus). Just by way of example, the string is identified by starting with [ followed by #, possibly with some characters in between.
[abc#def"ghi"jkl'123]
The following regex
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g;
will find the first " and if I run it three times I have all three.
Similarly, what if I want to replace the non-alphanumeric characters with something else, let's say an X.
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1X$2/g;
does the trick for one instance. But how can I find all of them in one go?
The reason your code doesn't work is that /g doesn't rescan the string after a substitution. It finds all non-overlapping matches of the given regex and then substitutes the replacement part in.
In [abc#def"ghi"jkl'123], there is only a single match (which is the [abc#def" part of the string, with $1 = '[abc#def' and $2 = ''), so only the first " is removed.
After the first match, Perl scans the remaining string (ghi"jkl'123]) for another match, but it doesn't find another [ (or #).
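Here is a short sketch that confirms this by counting the substitutions made by the regex from the question, and by repeating it three times as the asker did. The variable names are mine.

```perl
use strict;
use warnings;

my $s  = q([abc#def"ghi"jkl'123]);
my $re = qr/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/;

# One pass with /g: only one match exists, so only one character goes.
my $once  = $s;
my $count = ($once =~ s/$re/$1$2/g);

# Running the same substitution three times removes all three.
my $thrice = $s;
$thrice =~ s/$re/$1$2/g for 1 .. 3;

print "$count substitution(s): $once\n";
print "after three runs: $thrice\n";
```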
I think the most straightforward solution is to use a nested search/replace operation. The outer match identifies the string within which to substitute, and the inner match does the actual replacement.
In code:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;
Or to replace each match by X:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;
We match a prefix of [, followed by 0 or more characters that are not [ or ] or #, followed by #.
\K is used to mark the virtual beginning of the match (i.e. everything matched so far is not included in the matched string, which simplifies the substitution).
We match and capture 0 or more characters that are not [ or ].
Finally we match a suffix of ] in a look-ahead (so it's not part of the matched string either).
The replacement part is executed as a piece of code, not a string (as indicated by the /e flag). Here we could have used $1 =~ s/[^a-zA-Z0-9]//gr or $1 =~ s/[^a-zA-Z0-9]/X/gr, respectively, but since each inner match is just a single character, it's also possible to use a transliteration.
We return the modified string (as indicated by the /r flag) and use it as the replacement in the outer s operation.
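For reference, a small runnable sketch of both variants on a sample string. The surrounding "ah ... oh" text is an assumption, added to show that text outside the brackets is untouched. The /r flag on tr needs Perl 5.14 or later.

```perl
use strict;
use warnings;

my $s = q(ah [abc#def"ghi"jkl'123] oh);

# Delete every non-alphanumeric character after the # inside [...].
my $deleted = $s;
$deleted =~ s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;

# Same outer match, but replace each such character with X.
my $xed = $s;
$xed =~ s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;

print "$deleted\n$xed\n";
```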
So...I'm going to suggest a marvelously computationally inefficient approach to this. Marvelously inefficient, but possibly still faster than a variable-length lookbehind would be...and also easy (for you):
The \K causes everything before it to be dropped....so only the character after it is actually replaced.
perl -pe 'while (s/\[[^]]*#[^]]*\K[^]a-zA-Z0-9]//){}' file
Basically we just have an empty loop that executes until the search and replace replaces nothing.
Slightly improved version:
perl -pe 'while (s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//){}' file
The (?=) verifies that its content exists after the match without being part of the match. This is a variable-length lookahead (what we're missing going the other direction). I also made the *s lazy with the ? so we get the shortest match possible.
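The loop is easier to follow as a script than as a one-liner. This sketch runs the improved pattern on an in-memory string (my own sample) and counts the passes, one per removed character.

```perl
use strict;
use warnings;

my $s = q(ah [abc#def"ghi"jkl'123] oh);

# Each pass removes one offending character; the loop stops when the
# substitution no longer finds anything to replace.
my $passes = 0;
$passes++ while $s =~ s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//;

print "$passes passes: $s\n";
```

Each pass rescans from the opening bracket, which is where the inefficiency mentioned above comes from.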
Here is another approach. Capture precisely the substring that needs work, and in the replacement part run a regex on it that cleans it of non-alphanumeric characters
use warnings;
use strict;
use feature 'say';
my $var = q(ah [abc#def"ghi"jkl'123] oh); #'
say $var;
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
(my $v = $1) =~ s{[^0-9a-zA-Z]}{}g;
$v
}ex;
say $var;
where the lone $v is needed to return the modified string rather than the number of substitutions, which is what the s/// operator itself returns. This can be improved by using the /r modifier, which returns the changed string and doesn't change the original (so it doesn't attempt to change $1, which isn't allowed)
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
$1 =~ s/[^0-9a-zA-Z]//gr;
}ex;
The \K is there so that all matches before it are "dropped" -- they are not consumed so we don't need to capture them in order to put them back. The /e modifier makes the replacement part be evaluated as code.
The code in the question doesn't work because everything matched is consumed, and (under /g) the search continues from the position after the last match, attempting to find that whole pattern again further down the string. That fails and only that first occurrence is replaced.
The problem with matches that we want to leave in the string can often be remedied by \K (used in all current answers), which makes it so that all matches before it are not consumed.

$cmd =~ s#-fp [^ ]+##; What does it mean in Perl?

$cmd =~ s#-fp [^ ]+##;
Is there anyone who let me know what this regex means in Perl?
I couldn't find any regex like above through googling...
This removes the -fp optional parameter and its value from the command.
This takes the string stored by variable $cmd and replaces a section matching -fp [^ ]+ with nothing.
This command relies on the fact that Perl's substitution (like the other regex operators) can use almost any delimiter character. What is normally written as s/.../.../ is s#...#...# here. That may be the source of confusion.
=~ is the binary binding operator, which takes the left argument as the string on which to perform the right argument, in this case a substitution.
-fp [^ ]+
-fp matches literally.
[^ ]+ matches one or more characters which are not space.
Let's get the easy bit out of the way first. The $cmd =~ simply means "do the substitution on the variable $cmd".
Not all of this expression is a regex. It's actually the substitution operator - s/REGEX/STRING/. It matches the REGEX and replaces it with the STRING.
Like many similar operators in Perl, the substitution operator allows you to choose the delimiter character that you use. In this case, the programmer has made the slightly bizarre choice to use #.
So, we have this:
$cmd =~ s/-fp [^ ]+//;
And we now know what it means: "Match the variable $cmd against the regex -fp [^ ]+ and replace it with an empty string". Why an empty string? Because the replacement string bit (between the second and third /) is an empty string.
All we need to do now is to understand the actual regex - -fp [^ ]+. And it's not very complicated.
-fp - the first four characters (up to and including the space) match themselves. So this matches the literal string "-fp ".
[^ ] - this is a "character class". Normally, it means "match any of the characters inside [...]". But the ^ at the start inverts that meaning to "match any character except the ones between [^...]". So this matches anything that isn't a space.
+ - this is a quantifier that means "match one or more of the previous expression".
So, put together, this is "match the string '-fp ' followed by one or more non-space characters".
And, adding in the rest of the expression, we get:
Look at the string in $cmd, if you find the string '-fp ' followed by one or more non-space characters, then replace the matched portion with an empty string.
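A tiny sketch to try it out. The command line is entirely hypothetical; only the -fp option and its value matter here.

```perl
use strict;
use warnings;

# Hypothetical command line, invented for illustration.
my $cmd = "tool -v -fp /tmp/out.txt -q input.v";

# Same substitution as in the question, with # as the delimiter.
$cmd =~ s#-fp [^ ]+##;

print "<$cmd>\n";
```

Note that the space before -fp is not part of the match, so two spaces are left between -v and -q.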

How does Perl match annotation "//" for verilog files?

I have found one method, but I don't understand the principle:
#remove lines starting with //
$file =~ s/(?<=\n)[ \t]*?\/\/.*?\n//sg;
How does (?<=\n)[ \t]*? work?
The critical piece is the lookbehind (?<=...). It is a zero-width assertion, which means that it does not consume its match -- it only asserts that the pattern given inside is indeed in the string, right before the pattern that follows it.
So (?<=\n)[ \t] matches either a space or a tab, [ \t], that has a newline before it. With the quantifier, [ \t]*, it matches a space-or-tab any number of times (possibly zero). Then we have the // (each escaped by \). Then it matches any character any number of times up to the first newline, .*?\n.
Here ? makes .* non-greedy so that it stops at the first match of the following pattern.
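Here is a sketch of the original substitution on a made-up Verilog fragment. It also shows one consequence of the lookbehind: a comment on the very first line has no newline before it, so it is not removed.

```perl
use strict;
use warnings;

# Made-up input: two comment-only lines and one trailing comment.
my $file = "module m;\n  // comment\nwire a; // trailing\n//top\nendmodule\n";
my $clean = $file;
$clean =~ s/(?<=\n)[ \t]*?\/\/.*?\n//sg;

# A comment on the first line is not preceded by \n, so it survives.
my $first = "// first\nx\n";
my $first_clean = $first;
$first_clean =~ s/(?<=\n)[ \t]*?\/\/.*?\n//sg;

print $clean;
print $first_clean;
```

The trailing comment after "wire a;" stays too, because only whitespace may sit between the newline and the //.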
This can be done in other ways, too.
$file =~ s{ ^ \s* // .*? \n }{}gmx
The modifier m makes anchors ^ and $ (unused here) match the beginning and end of each line. I use {}{} as delimiters so that I don't have to escape /. The modifier x allows use of spaces (and comments and newlines) inside for readability.
You can also do it by split-ing the string by newline and passing lines through grep
my $new_file = join "\n", grep { not m|^\s*//.*| } split /\n/, $file;
The split returns a list of lines and this is input for grep, which passes those for which the code in the block evaluates to true. The list that it returns is then joined back, if you wish to again have a multiline string.
If you want a list of lines instead, remove the join "\n" and assign to an array.
The regex in the grep block is now far simpler, but the whole thing may be an eyeful in comparison with the previous regex. However, this approach can turn hard jobs into easy ones: instead of going for a monster master regex, break the string and process the pieces easily.
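A minimal sketch of the split/grep approach on a made-up input. The join uses a double-quoted "\n" so that a real newline is inserted; in single quotes it would be a literal backslash followed by n.

```perl
use strict;
use warnings;

# Made-up input with two comment-only lines.
my $file = "a\n// c\n  // d\nb\n";

# Keep only the lines that do not start with optional whitespace and //.
my @kept     = grep { not m|^\s*//| } split /\n/, $file;
my $new_file = join "\n", @kept;

print "$new_file\n";
```

split drops the empty field after the final newline, so $new_file has no trailing newline unless you add one.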

perl regex - quantifier * not greedy enough to pickup the newline at end of string

Isn't the quantifier * greedy? Shouldn't \s* match zero or more occurrences of whitespace, and so match everything up to the end of the given input string?
#!/usr/bin/perl
use strict;
use warnings;
my $input="Name : www.devserver.com\n";
$input=~s/\w+.:\s*//; # shouldn't \s* match everything up to the \n at the end?
print $input;
Please help me understand this behaviour.
\s* will match only a string consisting entirely of characters of the same class (namely, whitespace).
In your case, there is www.devserver.com between the leading and trailing spaces.
You might try the . metacharacter instead of \s:
$input=~s/\w+.:.*//;
This also wouldn't touch the trailing newline! According to perlre:
To simplify multi-line substitutions, the "." character never matches a newline unless you use the /s modifier, which in effect tells Perl to pretend the string is a single line--even if it isn't.
So, wrapping it up: the behavior you are expecting can be reproduced with the following substitution:
$input=~s/\w+.:.*//s;
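Putting the three substitutions side by side makes the difference visible. A small sketch using the input from the question; the variable names are mine.

```perl
use strict;
use warnings;

my $input = "Name : www.devserver.com\n";

# \s* stops at the first non-whitespace character (the first w).
my $with_ws = $input;
$with_ws =~ s/\w+.:\s*//;

# .* runs to the end of the line but does not match the newline.
my $with_dot = $input;
$with_dot =~ s/\w+.:.*//;

# With /s the dot also matches the newline, so nothing is left.
my $with_dot_s = $input;
$with_dot_s =~ s/\w+.:.*//s;

printf "%d %d %d\n", length($with_ws), length($with_dot), length($with_dot_s);
```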