Evaluation Search and Replace in Perl - regex

I have a file formatted like this:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
9558 9629 gene
locus_tag CeraR_t011
gene trnR-UCU
11296 9773 CDS
locus_tag CeraR_p012
gene atpA
product ATP synthase CF1 alpha subunit
transl_except (pos:complement(10268..10270), aa:Q)
transl_except (pos:complement(11192..11194), aa:Q)
transl_except (pos:complement(13267..13269), aa:M)
11296 9773 gene
locus_tag CeraR_p012
gene atpA
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I need to add 809 to both of the values following pos:complement in each instance. I have been attempting with the search and replace modifier as so:
$line =~ s!complement((\d+)..(\d+)!complement(($1+809)..($2+809)!eg
however, the ( after complement is always interpreted as part of an evaluation rather than simply a character. I have tried every combination of backslashes, apostrophes, and quotes to make it just a character but nothing seems to work.
Any advice would be appreciated

Since the replacement string is evaluated, you must use a quoted string and concatenations:
$line =~ s/complement\(\K(\d+)..(\d+)/($1+809) . '..' . ($2+809)/eg;
Note: since \K removes all on the left from the match result, you don't need to rewrite all the begining of the match in the replacement string.

Related

Replace a sequence of characters with a sequence of different characters of same length using regular expressions

I have a string which starts with spaces. I want to replace the leading spaces with equal number of dashes -. I don't want to replace any other spaces which may occur elsewhere in the string.
If I use /^\s*/-/, it only replaces with a single dash. If I use /^\s/-/, it only replaces the first space with a dash. If I remove the anchor /\s/-/, it replaces every occurences of space in the string which is not acceptable.
My string looks like this in general:
<n-leading-spaces><a-non-space-character><remaining-characters>
Example (pipes added to show the boundary):
| ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn |
After substitution (pipes added to show the boundary):
|---ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn |
NOTE: I cannot use any code snippet. I just want to know whether this can be done using just regex patterns. (Forgive my formatting as I'm new to markdown. I welcome formatting corrections)
You can use the following solution to replace a sequence of characters with a sequence of different characters of same length using regular expressions:
my $string = ' ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn ';
$string =~ s/^(\s+)/"-" x length($1)/eg;
print $string;
Returns '----ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn '

PowerShell Regex - word with wildcards and commas

Trying to do a replace on what I understand to be a simple operation but hitting a wall.
I can replace a word with a comma on the end:
$firstval = 'ssonp,RDPNP,LanmanWorkstation,webclient,MfeEpePcNP,PRNetworkProvider'
($firstval) -replace 'webclient+,',''
ssonp,RDPNP,LanmanWorkstation,MfeEpePcNP,PRNetworkProvider
But haven't been able to work out how to add a wildcard in the word, or how I'd have multiple words with wildcards proceeded by a comma, e.g.:
w* client+,* fee*, etc
(spaces added to stop being interpreted as formatting within the question)
Played with a few permeations and attempted to use examples from other questions without any luck.
The -replace operator takes a regular expression as its first parameter. You seem to be confusing wildcards and regular expressions. Your pattern w*client+,*fee*,, though a valid regular expression, seems to be intended to use wildcards.
The regular expression equivalent of the * wildcard is .*, where . means "any character" and * means "0 or more occurrences". Thus, the regular expression equivalent of w*client, would be w.*client,, and, similarly the regular expression equivalent of *fee*, would be .*fee.*,. Since the string to be searched has comma-separated values, however, we don't want our patterns to include "any character" (.*) but rather "any character but comma" ([^,]*). Therefore, the patterns to use become w[^,]*client, and [^,]*fee[^,]*,, respectively.
To search for both words in a string, separate the two patterns with |. The following builds such a pattern and tests it against strings with a match in various locations:
# Match w*client or *fee*
$wordPattern = 'w[^,]*client|[^,]*fee[^,]*';
# Match $wordPattern and at most one comma before or after
$wordWithAdjacentCommaPattern = '({0}),?|,({0})$' -f $wordPattern;
"`$wordWithAdjacentCommaPattern: $wordWithAdjacentCommaPattern";
# Replace single value
'webclient', `
# Replace first value
'webclient,middle,last', `
# Replace middle value
'first,webclient,last', `
# Replace last value
'first,middle,webclient' `
| ForEach-Object -Process { '"{0}" => "{1}"' -f $_, ($_ -replace $wordWithAdjacentCommaPattern); };
This outputs the following:
$wordWithAdjacentCommaPattern: (w[^,]*client|[^,]*fee[^,]*),?|,(w[^,]*client|[^,]*fee[^,]*)$
"webclient" => ""
"webclient,middle,last" => "middle,last"
"first,webclient,last" => "first,last"
"first,middle,webclient" => "first,middle"
A non-regex alternative you might consider would be to split your input string into individual values, filter out values that match certain wildcards, and reassemble what's left into comma-separated values:
(
'ssonp,RDPNP,LanmanWorkstation,webclient,MfeEpePcNP,PRNetworkProvider' -split ',', -1, 'SimpleMatch' `
| Where-Object { $_ -notlike 'w*client' -and $_ -notlike '*fee*'; } `
) -join ',';
By the way, you used the regular expression webclient+, to match and remove the text webclient, from your string (looks like the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\NetworkProvider\Order\ProviderOrder registry value). Just a note that, with the +, that will search for the literal text webclien followed by 1 or more occurrences of t followed by the literal text ,. Thus, that will match webclientt,, webclienttt,, webclientttttttttt,, etc. as well webclient,. If you are only interested in matching webclient, then you can just use the pattern webclient, (no +).

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
^ ^ ^^^
+--------+ |
Note 1 Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to write parentheses around left side expression of = operator to force array context for regexp evaluation. See m// and // in perlop documentation.[1] You can write
parentheses also around =~ binding operator to improve readability but it is not necessary because =~ has pretty high priority.
Use POSIX Character Classes word
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note I'm using x so whitespaces in regexp are ignored. Also use \b word boundary to anchor regexp correctly.
[1]: I write my ($w_after) just for convenience because you can write my ($a, $b, $c, #rest) as equivalent of (my $a, my $b, my $c, my #rest) but you can also control scope of your variables like (my $a, our $UGLY_GLOBAL, local $_, #_).
This Regex to be matched:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
^\s+ the word between two spaces
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]|\s]+(\S+)/);

Perl regex wierd behavior : works with smaller strings fails with longer repetitions of the smaller ones

here is a REGEX in perl that I use to identify strings that match this pattern : include any number of occurrences of any character but single quote ' or backslash , allow only escaped occurrences of ' or , respectively : \' and \ and finally it has to end with a (non-escaped) single quote '
foo.pl
#!/usr/bin/perl
my $line;
my $matchString;
Main();
sub Main() {
foreach $line( <STDIN> ) {
$line =~ m/(^(([^\\\']*?(\\\')*?(\\\\)*?)*?\'))/g;
$matchString = $1;
print "matchString:$matchString\n"
}
}
It seems to work fine for strings like :
./foo.pl
asasas'
sdsdsdsdsdsd'
\\\'sdsdsdsdsd\\\'sdsdsdsd\\'
\'sddsd\\sdsdsds\\\\\\sdsdsdsd\\\\\\'
matchString:asasas'
matchString:sdsdsdsdsdsd'
matchString:\\\'sdsdsdsdsd\\\'sdsdsdsd\\'
matchString:\'sddsd\\sdsdsds\\\\\\sdsdsdsd\\\\\\'
Then I create a file with the following recurring pattern :
AAAAAAAAAAAAAAAAAAAAAAAAAAAAA\\BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\'CCCCCCCCCCCCCCCCCCCCCC\\sdsdsd\\\\\' ZZZZ\'GGGGGG
By creating a string by repeating this pattern one or more times and adding a single quote ' at the end should match the reg exp. I created a file called zz3 with 16 repetitions of the above pattern. I created then a file called ZZ6 with 18 repetitions of zz3 and another one called ZZ7 with the contents of ZZ6 + one additional instance of zz3, hence 19 repetitions of zz3.
By adding a single quote at the end of zz3 it results in a match. By adding a single quote at the end of ZZ6 it also results in a match as expected.
Now here is the tough part, by adding a single quote at the end of ZZ7 does not result in a match!
here is a link to the 3 files :
https://drive.google.com/file/d/0BzIKyGguqkWvOWdKaElGRjhGdjg/view?usp=sharing
The perl version I am using is v5.16.3 on FreeBSD bit i tried with various versions on either FreeBSD or linux with identical results. It seems to me that either perl has a problem with the size from 34274 bytes (ZZ6) to 36178 bytes (ZZ7), or I am missing something badly.
Your regular expression leads to catastrophic backtracking because you have nested quantifiers.
If you change it to
(^(([^\\\']*+(\\')*+(\\\\)*+)*?'))
(using possessive quantifiers to avoid backtracking), it should work.
I just would like to note that the whole problem appeared in an effort to re-engineer an old in-house program to parse escaped PostgreSQL bytea values.
Following this discussion it is clear that perl cannot match any repetition of non dot (.) patterns for more than 32766(=32K-2) times.
The solution is to masquerade the \\ and \' sequences with some chars that are certain to not appear in the input, such as Device Ctrl1 (\x11) and Device Ctrl2 (\x12), (presented as ^Q, ^R in vi respectively) :
$dataField =~ s/\\\\/\x11/g;
$dataField =~ s/\\\'/\x12/g;
then try to match non greedily any input till the first single quote.
$dataField =~ m/(^.*?\')/s;
$matchString = $1;
and finally substitute the above Ctrl chars back to their initial values
$matchString =~ s/\x11/\\\\/g;
$matchString =~ s/\x12/\\\'/g;
This is very fast. Another solution would be to parse till the first single quote and count the number of \'s. If it is even then we have found our last non escaped single quote in the text so we have found our desired match, otherwise the single quote is an escape one and thus considered part of the text, so we keep this value and iterate to the next single quote and repeat the same logic, by concatenating the value to the previous value. This tends to be very slow for big files with many intermediate escaped single quotes.
Perl regex's seem to be much faster than Perl code.

Regular expression help in Perl

I have following text pattern
(2222) First Last (ab-cd/ABC1), <first.last#site.domain.com> 1224: efadsfadsfdsf
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
I want the number 1224 or 1234, 4657 from the above text after the text >.
I have this
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain.com>\s\d+:
which will take the text before : But i want the one after email till :
Is there any easy regular expression to do this? or should I use split and do this
Thanks
Edit: The whole text is returned by a command line tool.
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
(3333) - Unique ID
First Last - First and last names
<first.last#site.domain.com> - Email address in format FirstName.LastName#sub.domain.com
1234, 4567 - database primary Keys
: xxxx - Headline
What I have to do is process the above and get hte database ID (in ex: 1234, 4567 2 separate ID's) and query the tables
The above is the output (like this I will get many entries) from the tool which I am calling via my Perl script.
My idea was to use a regular expression to get the database id's. Guess I could use regular expression for this
you can fudge the stuff you don't care about to make the expression easier, say just 'glob' the parts between the parentheticals (and the email delimiters) using non-greedy quantifiers:
/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/ (not tested!)
there's only two captured groups, the (1234), and the (1234, 4657), the second one which I can only assume from your pattern to mean: "a digit string, followed by zero or more comma separated digit strings".
Well, a simple fix is to just allow all the possible characters in a character class. Which is to say change \d to [\d, ] to allow digits, commas and space.
Your regex as it is, though, does not match the first sample line, because it has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*\). Also, it is not a good idea to rely too heavily on the * quantifier, because it does match the empty string (it matches zero or more times), and should only be used for things which are truly optional. Use + otherwise, which matches (1 or more times).
You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:
sub extract_nums {
my $string = shift;
if ($string =~ /<[^>]*> *([\d, ]+):/) {
return $1 =~ /\d+/g; # return the extracted digits in a list
# return $1; # just return the string as-is
} else { return undef }
}
This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?
(1234) First Last (ab-cd/ABC1), <foo.bar#domain.com> 1224: efadsfadsfdsf
(1234) First Last (abcd/ABC12), <foo.bar#domain.com> 1234, 4657: efadsfadsfdsf
If so, this should work:
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain\.com>\s\d+(?:,\s(\d+))?:
$string =~ /.*>\s*(.+):.+/;
$numbers = $1;
That's it.
Tested.
With number catching:
$string =~ /.*>\s*(?([0-9]|,)+):.+/;
$numbers = $1;
Not tested but you get the idea.