Perl regex partial word match

I am trying to remove all words that contain either of two keys (in Perl).
For example, the string
garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2
Should become
garble variable10 variable1
To just remove the normal, unappended/prepended keys is easy:
$var =~ s/(vssx|vddx)/ /g;
However I cannot figure out how to get it to remove the entire xi_21_vssx part. I tried:
$var =~ s/\s.*(vssx|vddx).*\s/ /g
Which does not work correctly. I do not understand why... it seems like \s should match the space, then .* matches anything up to one of the patterns, then the pattern, then .* matches anything following the pattern until the next space.
I also tried replacing \s (whitespace) with \b (word boundary) but it also did not work. Another attempt:
$var =~ s/ .*(vssx|vddx).* / /g
$var =~ s/(\s.*vssx.*\s|\s.*vddx.*\s)/ /g
As well as a few other mungings.
Any pointers/help would be greatly appreciated.
-John

I think the regex will just be
$var =~ s/\S*(vssx|vddx)\S*/ /g;
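A minimal sketch of this substitution on the sample string from the question. The whitespace cleanup afterwards is an addition, not part of the answer above; it assumes you want single spaces between the remaining words.

```perl
use strict;
use warnings;

my $var = 'garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2';

# Replace every whitespace-delimited word containing either key with a space.
(my $cleaned = $var) =~ s/\S*(?:vssx|vddx)\S*/ /g;

# Assumption: collapse the leftover runs of spaces and trim both ends.
$cleaned =~ s/\s+/ /g;
$cleaned =~ s/^\s+|\s+$//g;

print "$cleaned\n";    # garble variable10 variable1
```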

You can use
\s*\S*(?:vssx|vddx)\S*\s*
The problems with your regex were:
The .* should have been non-greedy.
The .* in front of (vssx|vddx) mustn't match whitespace characters, so you have to use \S*.
Note that there's no way to properly preserve the space between words - i.e. a vssx b will become ab.
regex101 demo.
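To see the greedy behaviour described above, here is a small sketch that runs the original attempt from the question on the sample string. Because .* also matches spaces and takes as much as it can, a single match swallows several words at once.

```perl
use strict;
use warnings;

my $var = 'garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2';

# The original attempt: .* is greedy and is allowed to cross spaces.
# The match runs from the first space up to the last space that still
# has a key somewhere before it.
(my $greedy = $var) =~ s/\s.*(vssx|vddx).*\s/ /g;

print "$greedy\n";    # garble xi_blahvssx_grbl_2
```

The last word survives because no whitespace follows it, so the trailing \s has nothing to match there.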

I am trying to remove all words that [...]
This type of problem lends itself well to grep, which can be used to find the elements in a list that match a condition. You can use split to convert your string to a list of words and then filter it like this:
use strict;
use warnings;
use 5.010;
my $string = 'garble variable10 variable1 vssx vddx xi_21_vssx vddx_garble_21 xi_blahvssx_grbl_2';
my @words = split ' ', $string;
my @filtered = grep { $_ !~ /(?:vssx|vddx)/ } @words;
say "@filtered";
Output:
garble variable10 variable1

Try this as the regex:
\b[\w]*(vssx|vddx)[\w]*\b


I would like to use regex to insert specific characters in a regex expression?

I'd like to be able to use regex in Perl to insert characters into words.
So that the word "TABLE" would become "T%A%B%L%E%"
Can I ask for the syntax for such a feat?
Many thanks
Break the string into characters, then join them with the separator you want in between, and append it once more at the end
my $res = ( join '%', split //, $string ) . '%';
A simple-minded way with regex
$string =~ s/(.)/$1%/g;
where the /r modifier (Perl 5.14+) lets you preserve $string and return the changed string instead
my $res = $string =~ s/(.)/$1%/gr;
You can use this command,
echo TABLE|perl -pe 's/\w/$&%/g'
This outputs T%A%B%L%E%
OR (in case your data is contained in a file)
perl -pe 's/\w/$&%/g' test.pl
You may replace \w with [a-zA-Z] if you only want to match letters, as \w matches letters, digits, and underscore.
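A short sketch of the difference between \w and [a-zA-Z]. The input string is a made-up sample, and the /r modifier is used here only to keep the original string intact (it needs Perl 5.14 or later).

```perl
use strict;
use warnings;
use 5.014;    # for the /r modifier and say

my $input = 'AB_1';    # hypothetical sample: letters, an underscore, a digit

my $with_w     = $input =~ s/(\w)/$1%/gr;          # \w: letters, digits, underscore
my $with_alpha = $input =~ s/([a-zA-Z])/$1%/gr;    # ASCII letters only

say $with_w;        # A%B%_%1%
say $with_alpha;    # A%B%_1
```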
You can use look-behind also
my $s = "table";
$s=~s/(?<=.)/%/g;
print $s;
If your Perl version is 5.10 or later, you can use \K
$s=~s/.\K/%/g;
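For comparison, a sketch that runs the approaches from the answers above side by side on the word from the question. The variable names are illustrative only; /r needs Perl 5.14 or later.

```perl
use strict;
use warnings;
use 5.014;    # for the /r modifier and say

my $string = 'TABLE';

my $joined     = join('%', split //, $string) . '%';    # split and join
my $captured   = $string =~ s/(.)/$1%/gr;               # capture and re-emit
my $lookbehind = $string =~ s/(?<=.)/%/gr;              # zero-width lookbehind
my $keep       = $string =~ s/.\K/%/gr;                 # \K keeps the matched character

say for $joined, $captured, $lookbehind, $keep;         # each line: T%A%B%L%E%
```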

How to group string of characters by 4?

I have the string 1234567890 and I want to format it as 1234 5678 90
I wrote this regex:
$str =~ s/(.{4})/$1 /g;
But for the input 12345678 this does not work; I get excess whitespace at the end:
>>1234 5678 <<
I tried to rewrite the regex with a lookahead:
s/((?:.{4})?=.)/$1 /g;
How can I rewrite the regex to fix that case?
Just use unpack
use strict;
use warnings 'all';
for ( qw/ 12345678 1234567890 / ) {
printf ">>%s<<\n", join ' ', unpack '(A4)*';
}
output
>>1234 5678<<
>>1234 5678 90<<
Context is your friend:
join(' ', $str =~ /(.{1,4})/g)
In list context, the match will return all four-character chunks (and anything shorter than that at the end of the string, thanks to greediness). join will ensure the chunks are separated by spaces and there are no trailing spaces at the end.
If $str is huge and the temporary list increases the memory footprint too much, then you might just want to do the s///g and strip the trailing space.
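The list-context approach can be wrapped up as a small sketch like this. The helper name group_by_four is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper; the name is not from the answers above.
sub group_by_four {
    my ($str) = @_;
    # In list context, m//g returns every captured chunk of 1 to 4 characters.
    return join ' ', $str =~ /(.{1,4})/g;
}

printf ">>%s<<\n", group_by_four($_) for qw(12345678 1234567890 123);
# >>1234 5678<<
# >>1234 5678 90<<
# >>123<<
```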
My preference is for using the simplest possible patterns in regexes. Also, I haven't measured but with long strings, just a single chop might be cheaper than a conditional pattern in the s///g:
$ echo $'12345678\n123456789' | perl -lnE 's/(.{1,4})/$1 /g; chop; say ">>$_<<"'
>>1234 5678<<
>>1234 5678 9<<
You had the syntax almost right. Instead of just ?=., you need (?=.) (parens are part of the lookahead syntax). So:
s/((?:.{4})(?=.))/$1 /g
But you don't need the non-capturing grouping:
s/(.{4}(?=.))/$1 /g
And I think it is more clear if the capture doesn't include the lookahead:
s/(.{4})(?=.)/$1 /g
And given your example data, a non-word-boundary assertion works too:
s/(.{4})\B/$1 /g
Or using \K to automatically Keep the matched part:
s/.{4}\B\K/ /g
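The \B variants depend on the data consisting of word characters only. Here is a sketch comparing the lookahead form with the \B form; the second input, containing a hyphen, is a made-up example, and the helper names are hypothetical.

```perl
use strict;
use warnings;
use 5.014;    # for the /r modifier and say

# Hypothetical helper names, used only for this comparison.
sub space_by_lookahead    { my ($s) = @_; return $s =~ s/(.{4})(?=.)/$1 /gr }
sub space_by_non_boundary { my ($s) = @_; return $s =~ s/(.{4})\B/$1 /gr }

for my $input ('12345678', '123-5678') {
    say "$input: lookahead >>", space_by_lookahead($input),
        "<< \\B >>", space_by_non_boundary($input), "<<";
}
# 12345678: lookahead >>1234 5678<< \B >>1234 5678<<
# 123-5678: lookahead >>123- 5678<< \B >>123-5 678<<
```

With the hyphen, the position after "123-" is a word boundary, so \B refuses it and the match shifts one character to the right.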
To fix the regex I should write:
$str =~ s/(.{4}(?=.))/$1 /g;
I should just add parentheses around ?=. so that it becomes a lookahead. Without them, the ? is treated as a zero-or-one quantifier on the preceding group, followed by a literal = and then any character.
So we match four characters and append a space after them, but only if the lookahead confirms that at least one more character follows. For example, the regex will not match the string 1234.
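A small sketch of how the attempt from the question is actually parsed. The second input, which contains an = sign, is a made-up example chosen to show what that pattern really matches.

```perl
use strict;
use warnings;
use 5.014;    # for the /r modifier and say

# Without lookahead parentheses: an optional group of four characters,
# then a literal "=", then any one character.
my $broken = qr/((?:.{4})?=.)/;

my $digits  = '12345678';
my $with_eq = 'abcd=e';    # hypothetical input containing "="

my $digits_out  = $digits  =~ s/$broken/$1 /gr;    # no "=" present, nothing matches
my $with_eq_out = $with_eq =~ s/$broken/$1 /gr;    # matches the whole "abcd=e"

say ">>$digits_out<<";     # >>12345678<<
say ">>$with_eq_out<<";    # >>abcd=e <<
```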
Just use a look ahead to see that you have at least one character remaining:
$ echo $'12345678\n123456789' | perl -lnE 's/.{4}\K(?=.{1})/ /g; say ">>$_<<"'
>>1234 5678<<
>>1234 5678 9<<

Perl regex multiline match without dot

There are numerous questions on how to do a multiline regex in Perl. Most of them mention the s switch that makes a dot match a newline. However, I want to match an exact phrase (so, not a pattern) and I don't know where the newlines will be. So the question is: can you ignore newlines, instead of matching them with .?
MWE:
$pattern = "Match this exact phrase across newlines";
$text1 = "Match\nthis exact\nphrase across newlines";
$text2 = "Match this\nexact phra\nse across\nnewlines";
$text3 = "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match";
$text1 =~ s/$pattern/replacement text/s;
$text2 =~ s/$pattern/replacement text/s;
$text3 =~ s/$pattern/replacement text/s;
print "$text1\n---\n$text2\n---\n$text3\n";
I can put dots in the pattern instead of spaces ("Match.this.exact.phrase") but that does not work for the second example. I can delete all newlines as preprocessing but I would like to keep newlines that are not part of the match (as in the third example).
Desired output:
replacement text
---
replacement text
---
Keep any newlines
replacement text
outside
of the match
Just replace the literal spaces with a character class that matches a space or a newline:
$pattern = "Match[ \n]this[ \n]exact[ \n]phrase[ \n]across[ \n]newlines";
Or, if you want to be more lenient, use \s or \s+ instead, since \s also matches newlines.
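A sketch of the more lenient \s+ variant, building the pattern from the phrase rather than typing it by hand. Note that this only treats newlines as word separators, so it does not help when a newline falls in the middle of a word, as in the second example from the question.

```perl
use strict;
use warnings;

my $phrase = 'Match this exact phrase across newlines';

# Quote each word literally and allow any run of whitespace between words.
my $pattern = join '\s+', map { quotemeta($_) } split ' ', $phrase;

my $text1 = "Match\nthis exact\nphrase across newlines";
my $text2 = "Match this\nexact phra\nse across\nnewlines";
my $text3 = "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match";

(my $out1 = $text1) =~ s/$pattern/replacement text/;
(my $out2 = $text2) =~ s/$pattern/replacement text/;    # unchanged: "phra\nse" splits a word
(my $out3 = $text3) =~ s/$pattern/replacement text/;

print "$out1\n---\n$out2\n---\n$out3\n";
```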
Most of the time, you are treating newlines as spaces. If that's all you wanted to do, all you'd need is
$text =~ s/\n/ /g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Then there's the one time you want to ignore it. If that's all you wanted to do, all you'd need is
$text =~ s/\n//g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Doing both is next to impossible if you have a regex pattern to match. But you seem to want to match literal text, so that opens up some possibilities.
( my $pattern = $text_to_find )
=~ s/(.)/ $1 eq " " ? "[ \\n]" : "\\n?" . quotemeta($1) /seg;
$pattern =~ s/^\\n\?//;
$text =~ /$pattern/
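Here is a runnable sketch of the substitution above, applied to the three sample texts from the question. The helper name phrase_to_pattern and the @results array are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern-building substitution shown above.
sub phrase_to_pattern {
    my ($phrase) = @_;
    # Each space becomes "[ \n]"; every other character is quoted
    # and preceded by an optional newline.
    (my $pattern = $phrase)
        =~ s/(.)/ $1 eq " " ? "[ \\n]" : "\\n?" . quotemeta($1) /seg;
    $pattern =~ s/^\\n\?//;    # no optional newline before the first character
    return $pattern;
}

my $pattern = phrase_to_pattern('Match this exact phrase across newlines');

my @texts = (
    "Match\nthis exact\nphrase across newlines",
    "Match this\nexact phra\nse across\nnewlines",
    "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match",
);

my @results;
for my $text (@texts) {
    (my $out = $text) =~ s/$pattern/replacement text/;
    push @results, $out;
}

print join("\n---\n", @results), "\n";
```

This should reproduce the desired output from the question, including the second text where a newline falls inside a word.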
It sounds like you want to change your "exact" pattern to match newlines anywhere, and also to allow newlines instead of spaces. So change your pattern to do so:
$pattern = "Match this exact phrase across newlines";
$pattern =~ s/\S\K\B/\n?/g;
$pattern =~ s/ /[ \n]/g;
It certainly is ugly, but it works:
M\n?a\n?t\n?c\n?h\st\n?h\n?i\n?s\se\n?x\n?a\n?c\n?t\sp\n?h\n?r\n?a\n?s\n?e\sa\n?c\n?r\n?o\n?s\n?s\sn\n?e\n?w\n?l\n?i\n?n\n?e\n?s
For every pair of letters inside a word, allow a newline between them with \n?. And replace each space in your regex with \s.
May not be usable, but it gets the job done ;)
Check it out at regex101.

How do I search and replace with a regex and preserve blank spaces in Perl?

I'm using this code:
$text =~ s/\s(\w)/\u$1/g;
But "This is an example"
becomes "ThisIsAnExample"
instead of "This Is An Example".
How to preserve blank spaces?
Use lookbehind.
$text =~ s/(?<!\S)(\w)/\u$1/g;
Or use the more efficient \K (Perl 5.10+).
$text =~ s/(?:^|\s)\K(\w)/\u$1/g;
Both of the solutions will make sure the first word is capitalized too. If that's not an issue, the second solution can be simplified to the following:
$text =~ s/\s\K(\w)/\u$1/g;
The matching contains the whitespace, the replacement doesn't.
$text =~ s/(\s)(\w)/$1\u$2/g;
Since \s contains different types of whitespace characters, if you want to keep it in your replacement, you need to capture it and put it back.
An alternative is to use word boundaries and full "words".
$text =~ s/\b(\w+)\b/\u$1/g;
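A short sketch comparing the lookbehind form with the capture-and-put-back form. The input is a made-up sample with a tab and a double space, to check that the original whitespace survives; /r needs Perl 5.14 or later.

```perl
use strict;
use warnings;
use 5.014;    # for the /r modifier and say

my $text = "this is\tan  example";    # hypothetical input: tab and double space

# Lookbehind: the whitespace is never part of the match, so it is untouched.
my $lookbehind = $text =~ s/(?<!\S)(\w)/\u$1/gr;

# Capture the whitespace and put it back; the first word is not preceded
# by whitespace, so it stays lowercase.
my $captured = $text =~ s/(\s)(\w)/$1\u$2/gr;

say $lookbehind;    # This Is<TAB>An  Example
say $captured;      # this Is<TAB>An  Example
```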

How can I extract a substring up to the first digit?

How can I find the first substring until I find the first digit?
Example:
my $string = 'AAAA_BBBB_12_13_14' ;
Result expected: 'AAAA_BBBB_'
Judging from the tags you want to use a regular expression. So let's build this up.
We want to match from the beginning of the string so we anchor with a ^ metacharacter at the beginning
We want to match anything but digits so we look at the character classes and find out this is \D
We want 1 or more of these so we use the + quantifier which means 1 or more of the previous part of the pattern.
This gives us the following regular expression:
^\D+
Which we can use in code like so:
my $string = 'AAAA_BBBB_12_13_14';
$string =~ /^\D+/;
my $result = $&;
Most people got half of the answer right, but they missed several key points.
You can only trust the match variables after a successful match. Don't use them unless you know you had a successful match.
The $&, $`, and $' variables have well-known performance penalties across all regexes in your program.
You need to anchor the match to the beginning of the string. Since Perl now has user-settable default match flags, you want to stay away from the ^ beginning of line anchor. The \A beginning of string anchor won't change what it does even with default flags.
This would work:
my $substring = $string =~ m/\A(\D+)/ ? $1 : undef;
If you really wanted to use something like $&, use Perl 5.10's per-match version instead. The /p switch provides non-global-performance-sucking versions:
my $substring = $string =~ m/\A\D+/p ? ${^MATCH} : undef;
If you're worried about what might be in \D, you can specify the character class yourself instead of using the shortcut:
my $substring = $string =~ m/\A[^0-9]+/p ? ${^MATCH} : undef;
I don't particularly like the conditional operator here, so I would probably use the match in list context:
my( $substring ) = $string =~ m/\A([^0-9]+)/;
If there must be a number in the string (so you don't match an entire string that has no digits), you can throw in a lookahead, which won't be part of the capture:
my( $substring ) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
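A sketch of the list-context form with the lookahead, showing what happens when the match fails. The helper name prefix_before_digit and the extra inputs are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper; returns undef when there is no non-digit prefix
# followed by a digit.
sub prefix_before_digit {
    my ($string) = @_;
    my ($prefix) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
    return $prefix;
}

for my $s ('AAAA_BBBB_12_13_14', '123abc', 'no digits') {
    my $prefix = prefix_before_digit($s);
    printf "%-20s => %s\n", $s, defined $prefix ? "'$prefix'" : 'undef';
}
# AAAA_BBBB_12_13_14   => 'AAAA_BBBB_'
# 123abc               => undef
# no digits            => undef
```

Because the capture is only assigned on a successful match, there is no risk of picking up a stale $1 from an earlier regex.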
$str =~ /(\d)/; print $`;
This code prints the part of the string that precedes the match.
perl -le '$string=q(AAAA_BBBB_12_13_14);$string=~m{(\D+)} and print $1'
AAAA_BBBB_