Using the length of the matched group inside regex


Assume this
char=l
string="Hello, World!"
Now I want to replace every run of consecutive occurrences of char in string with the length of that run (as in run-length encoding), reading both values from STDIN.
I tried this:
$c=<>;$_=<>;print s/($c)\1*/length($&)/grse;
When the input is given as
l
Hello, World!
It returns Hello, World!. But when I ran this
$c=<>;$_=<>;print s/(l)\1*/length($&)/grse;
it returned He2o, Wor1d!.
So, since the input is given in separate lines, $c contained \n (checked with $c=~/\n/)
So, I tried
$c=<>.chomp;$_=<>;print s/($c)\1*/length($&)/grse;
and
$c=<>;$_=<>;print s/($c.chomp)\1*/length($&)/grse;
Neither worked. Could anyone please say why?

In Perl, . is used to concatenate strings, not to call methods (unlike in some other languages, Ruby for instance). Have a look at the documentation of chomp to see how it should be used. You should be doing
chomp($c=<>)
Rather than
$c=<>.chomp
Your full code should thus simply be:
chomp($c=<>);$_=<>;print s/($c)\1*/length($&)/grse;
If $c is always a single character, then the regex can be simplified to s/$c+/length($&)/grse. Also, if $c can be a regex metacharacter (e.g. +, *, (, [, etc.), then you should escape it (and it makes sense to escape it just in case). To do so, you can use \Q..\E (or quotemeta, although it is more verbose and thus maybe less suited to a one-liner):
s/\Q$c\E+/length($&)/grse
If you don't escape $c one way or another, and your one-liner is run with ( as first input for instance, you'll get the following error:
Quantifier follows nothing in regex; marked by <-- HERE in m/(+ <-- HERE / at -e line 1, <> line 2
Regarding what $c=<>.chomp actually means in Perl (since this is a valid Perl code that can make sense in some contexts):
$c=<>.chomp means <> concatenated to chomp, where chomp without arguments is understood as chomp($_). chomp returns the total number of characters removed, and since $_ is undefined at that point, no characters are removed, so this chomp returns 0. You are thus effectively writing $c=<>.0, which means that if your input is l\n, you end up with l\n0 instead of l.
Two ways to debug this kind of thing yourself:
Enable warnings with the -w flag. In that case, it would have printed
Use of uninitialized value $_ in scalar chomp at -e line 1, <> line 1.
This is arguably not the most helpful warning ever, but it would have helped you get an idea of where your mistake was.
Print variables to be sure that they contain what you expect. For instance, you could run perl -wE '$c=<>.chomp;print"|$c|"', which would print:
|l
0|
which should help give you an idea of what was wrong.
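Putting the pieces of this answer together, a minimal sketch might look like the following. The sub name replace_runs is hypothetical (it is not in the question), and it takes the character and the string as arguments rather than reading STDIN, purely so that it is easy to try out. It assumes $c is a single character once the newline is removed.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each run of the character $c in $str
# with the length of that run. Assumes $c is one character plus an
# optional trailing newline, as it would be after $c = <>.
sub replace_runs {
    my ($c, $str) = @_;
    chomp $c;    # chomp takes the variable as an argument
    # \Q..\E escapes $c in case it is a regex metacharacter;
    # /r returns the modified copy, /e evaluates length($&) as code.
    return $str =~ s/\Q$c\E+/length($&)/gre;
}

print replace_runs("l\n", "Hello, World!"), "\n";    # He2o, Wor1d!
print replace_runs("(\n", "f((x))"), "\n";           # f2x))
```

The second call shows why the escaping matters: without \Q..\E the ( would be taken as the start of a group and the pattern would fail to compile.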

Related

Matching Uppercase words

How can I match the following words easily using a regular expression in Perl?
Example
AFSAS245F gdsgasdg (agadsg,asdgasdg, .ASFH(gasdgsadg) )
ASG23XLG hasdg (dagad, SgAdsga, .FG(haha))
Expected output :-
[Match First uppercase words only]
AFSAS245F
ASG23XLG

print "$1\n" if /^([A-Z0-9]+)\s+.*\(/;
This prints only the first word (followed by a newline character) if the line starts with that word, followed by one or more whitespace characters and a ( somewhere after.
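As a rough sketch, the same pattern can be wrapped in a small sub and run against the two example lines. The name first_upper_word and the third test line (which has no parenthesis) are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the leading word of uppercase letters and
# digits if the line also contains a "(" later on, otherwise undef.
sub first_upper_word {
    my ($line) = @_;
    return $line =~ /^([A-Z0-9]+)\s+.*\(/ ? $1 : undef;
}

for my $line (
    'AFSAS245F gdsgasdg (agadsg,asdgasdg, .ASFH(gasdgsadg) )',
    'ASG23XLG hasdg (dagad, SgAdsga, .FG(haha))',
    'NOPAREN here',    # made-up line without "(", so it is skipped
) {
    my $word = first_upper_word($line);
    print "$word\n" if defined $word;
}
# prints:
# AFSAS245F
# ASG23XLG
```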

This isn't an answer to your question (so I'm fine with it being deleted if people think that's appropriate), but I thought it would be useful to show you how this question should have been asked.
I'm trying to filter data out of the input below. I'm trying to extract the first whitespace-delimited word that consists solely of uppercase letters and digits. In addition, I need to ignore lines that don't contain (.
Here's a test program.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
# This is where I need help. This regex obviously doesn't work
print if /[A-Z]\s+/;
}
__DATA__
AFSAS245F gdsgasdg (agadsg,asdgasdg, .ASFH(gasdgsadg) )
ASG23XLG hasdg (dagad, SgAdsga, .FG(haha))
The output I'm expecting from this is:
AFSAS245F
ASG23XLG
(It's also worth pointing out that this isn't particularly good test data. You should include a line that doesn't have ( in it - as that tests an important part of the requirements.)

Perl In place edit: Find and replace in X12850 formatted file

I am new to Perl and cannot figure this out. I have a file called Test:
ISA^00^ ^00^ ^01^SupplyScan ^01^NOVA ^180815^0719^U^00204^000000255^0^P^^
GS^PO^SupplyScan^NOVA^20180815^0719^00000255^X^002004
ST^850^00000255
BEG^00^SA^0000000059^^20180815
DTM^097^20180815^0719
N1^BY^^92^
N1^SE^^92^1
N1^ST^^92^
PO1^1^4^BX^40.000^^^^^^^^IN^131470^^^1^
PID^F^^^^CATH 6FR .070 MPA 1 100CM
REF^
PO1^2^4^BX^40.000^^^^^^^^IN^131295^^^1^
PID^F^^^^CATHETER 6FR XB 3.5
REF^
PO1^3^2^EA^48.000^^^^^^^^IN^132288^^^1^
PID^F^^^^CATH 6FR AL-1 SH
REF^
PO1^4^2^BX^48.000^^^^^^^^IN^131297^^^1^
PID^F^^^^CATHETER 6FR .070 JL4SH 100CM
REF^
CTT^4^12
SE^20^00000255
GE^1^00000255
IEA^1^00000255
What I am trying to do is an in-place edit, dropping any value in the N1^SE segment after the 92^. I tried this but I can't seem to make it work:
perl -i -pe 's/^N1\^SE\^\^92\^\d+$/N1^SE^^92^/g' Test
The final result should include the N1^SE segment looking like this:
N1^SE^^92^
It worked when I just had the one line in the file: N1^SE^^92^1. But when I try to substitute globally in the entire file, it doesn't work.
Thanks.

The text you pasted here may be missing some hidden character(s) or spaces that are in the real file. Those may well be at the end of the line, so try
perl -i -pe 's/^N1\^SE\^\^92\^\K.*//' Test
The \K is a special form of the "positive lookbehind" which drops all previous matches so only .* after it (the rest) are removed by the substitution. †
This takes seriously the requirement "dropping any value ... after", as it matches lines with things other than the sole \d from the question's example.
Or use \Q...\E sequence to escape special characters (see quotemeta)
perl -i -pe 's/^\QN1^SE^^92^\E\K.*//' Test
per Borodin's comment.
Another take is to specifically match \d as in the question
s/^N1\^SE\^\^92\^\K\d+//
per ikegami's comment. This stays true to your patterns and it also doesn't remove whatever may be hiding at the end of the line.
† The term "lookbehind" for \K is from documentation but, while \K clearly "looks behind," it has marked differences from how the normal lookbehind assertions behave.
Here is a striking example from ikegami. Compare
perl -le'print for "abcde" =~ /(?<=\w)\w/g' # prints lines: b c d e
and
perl -le'print for "abcde" =~ /\w\K\w/g' # prints lines: b d
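To try the \K substitution without touching a file, here is a small sketch that applies it to strings in memory. The sub name strip_n1_se is hypothetical, and the trailing space and carriage return in the second call are only a guess at what the hidden characters might be; the question does not confirm it.

```perl
use strict;
use warnings;

# Hypothetical helper: remove whatever follows "N1^SE^^92^" on a line.
sub strip_n1_se {
    my ($line) = @_;
    # \Q..\E quotes the carets; \K keeps the prefix out of the part
    # that gets replaced; /r returns the modified copy.
    return $line =~ s/^\QN1^SE^^92^\E\K.*//r;
}

print strip_n1_se('N1^SE^^92^1'), "\n";       # N1^SE^^92^
print strip_n1_se("N1^SE^^92^1 \r"), "\n";    # N1^SE^^92^
print strip_n1_se('N1^BY^^92^'), "\n";        # N1^BY^^92^ (unchanged)
```

The original pattern ending in \d+$ would not match the second input, because the space and carriage return sit between the digit and the end of the line.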

Using Perl split function to keep (capture) some delimiters and discard others

Let's say I am using Perl's split function to split up the contents of a file.
For example:
This foo file has+ a bunch of; (random) things all over "the" place
So let's say I want to use whitespace and semicolons as delimiters.
So I would use something like:
split(/([\s+\;])/, $fooString)
I'm having trouble figuring out a syntax (or even if it exists) to capture the semicolon and discard the whitespace.

You seem to ask for something like
my @fields_and_delim = split /\s+|(;)/, $string; # not quite right
but this isn't quite what it may seem. It also returns empty elements (with warnings) since when \s+ matches then the () captures nothing but $1 is still returned as asked, and it's undef. There are yet more spurious matches when your delimiters come together in the string.
So filter
my @fields_and_delim = grep { defined and /\S/ } split /(\s+|;)/, $string;
in which case you can normally capture the delimiter.
This can also be done with a regex
my @fields_and_delim = $string =~ /([^\s;]+|;+)/g;
which in this case allows more control over what and how you pick from the string.
If repeated ; need to be captured separately, change ;+ to ;.
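For comparison, here is a small sketch that runs both approaches on the example string from the question. The sub names fields_via_split and fields_via_match are made up for illustration.

```perl
use strict;
use warnings;

my $string = 'This foo file has+ a bunch of; (random) things all over "the" place';

# Split on whitespace or ";" while capturing the delimiter, then drop the
# pieces that are undefined, empty, or whitespace only.
sub fields_via_split {
    my ($s) = @_;
    return grep { defined and /\S/ } split /(\s+|;)/, $s;
}

# Same idea, but matching the wanted tokens directly.
sub fields_via_match {
    my ($s) = @_;
    return $s =~ /([^\s;]+|;+)/g;
}

print join('|', fields_via_split($string)), "\n";
print join('|', fields_via_match($string)), "\n";
# both print:
# This|foo|file|has+|a|bunch|of|;|(random)|things|all|over|"the"|place
```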

I think that what you want is as simple as:
split /\s*;\s*/, $fooString;
That will separate around the ; character that may or may not have any whitespace before or after.
In your example:
>This foo file has+ a bunch of; (random) things all over "the" place<
It would split into:
>This foo file has+ a bunch of<
and:
>(random) things all over "the" place<
By the way, you need to put the result of split into an array; for instance:
my @parts = split /\s*;\s*/, $fooString;
Then $parts[0] and $parts[1] would have the two bits.

I think grep is what you're looking for really, to filter the list for values that aren't all whitespace:
my @all_exc_ws = grep {!/^\s+$/} split(/([\s\;])/, $fooString);
Also I removed the + from your regex since it was inside the [], which changes its meaning.

Save Matched Perl Regex as Variable

I have a simple Perl regex that I need to save as a variable.
If I print it:
print($html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/g);
It prints what I want to save, but when trying to save it with:
$link = $html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/g;
I get back a '1' as the value of $link. I assume this is because it found '1' match. But how do I save the content of the match instead?

Note the /g to get all matches. Those can't possibly be put into a scalar. You need an array.
my @links = $html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/g;
If you just want the first match:
my ($link) = $html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/;
Note the parens (and the lack of now-useless /g). You need them to call m// in list context.
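A short sketch of the difference, using a made-up HTML snippet (the path /docs/report.pdf is purely illustrative) and a hypothetical helper named first_link:

```perl
use strict;
use warnings;

# Made-up sample input; only the iframe/src shape matters here.
my $html_data = '<iframe id="pdfDocument" src="/docs/report.pdf"></iframe>';

# Hypothetical helper: return the first captured link, or undef.
sub first_link {
    my ($html) = @_;
    # The parentheses around $link put the match in list context,
    # so it returns the captured text instead of a success flag.
    my ($link) = $html =~ m/<iframe id="pdfDocument" src=.(.*)pdf/;
    return $link;
}

# Scalar context: the match only reports success (1).
my $flag = $html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/;

print "scalar context: $flag\n";                     # scalar context: 1
print "list context: ", first_link($html_data), "\n"; # list context: /docs/report.
```

Note that the capture stops just before pdf, because pdf itself sits outside the parentheses in the pattern.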

The matched subexpressions of a pattern are saved in variables $1, $2, etc. You can also get the entire matched pattern ($&) but this is expensive and should be avoided.
The distinction in behavior here, by the way, is the result of scalar vs. list context; you should get to know them, how they differ, and how they affect the behavior of various Perl expressions.

From the perlfunc docs:
print LIST
Prints a string or a list of strings.
So in print m//, the match is evaluated in list context and m// returns a list. (It appears that m// without capture groups returns 1 on success and nothing on failure, whereas m//g returns a list of matches.)
In contrast, $link = m// can only be scalar (as opposed to list) context, so m// returns the match result: 1 (true) on success or a false value on failure.

I just wrote code like this. It may help. It's basically like yours except mine has a few more parentheses.
my $path = `ls -l -d ~/`;
#print "\n path is $path";
($user) = ($path=~/\.*\s+(\w+)\susers/);
So yours might look something like this, if you're trying to store the whole thing. I'm not sure, but you can use mine as an example. I am storing whatever is in (\w+):
($link) = ($html_data =~ (m/<iframe id="pdfDocument" src=.(.*)pdf/g));

I have a Wordpad file from which I extract two strings and compare them. In this case they are both equal, but I cannot use the =~ expression to evaluate them.
if($pin_list =~ /$lvl_list/){ do something}
What I have tried in debug mode:
Both strings are equal as evaluated by eq
Both strings are equal as evaluated by ==
Manually set another variable to same string and then perform if statement with new variable; if($pin_list =~ /$x/){do something}. This attempt was successful.
Performed chomp(var) on both string vars several times and then ran code. FAILED
Removed carriage return via $tst_pins =~ s/\n//g on both vars. FAILED
Length of both vars is the same.
Manual printed both vars and visually verified both strings are the same.
Anyone got any ideas? I suspect it is something that has to do with WordPad and perhaps a hidden char, but don't know how to track it down.
tchrist -> Good question. In this case the strings are equal, but that will not always be the case. Under normal conditions, I am simply looking for the one string to be a subset of another.
For those who may be interested. Problem solved.
I had a string that I 'joined' with '+'. So the string looked like the following:
"1+2+3+4+a+b+etc"
The '+' ended up being the problem. At the suggestion of a colleague I performed a substr and whittled away one of the strings down to the offending point. It occurred just after it captured the '+'. I then joined using a blank space instead of the '+', and everything works.
Using characters other than letters evidently has an effect, though I am still at a loss to explain why, when every other check said the strings were equal.
Bret

The match operator (m// aka //) checks if the provided string is matched by the provided regex pattern, not if it is character for character equal to the provided regex pattern. If you want to build a regex pattern that will match a string exactly, use quotemeta.
This checks if $str1 is equal to $str2:
my $pat = quotemeta($str1);
$str2 =~ /^$pat\z/
quotemeta can also be called via \Q..\E.
$str1 =~ /^\Q$str2\E\z/
Of course, you could just use eq.
$str1 eq $str2

+ and other characters have special meanings inside regular expressions, so just using $expression =~ /$some_arbitrary_string/ can get you into trouble.
If the question is whether one string is literally contained in another string, you can use index and not worry about all the rules for specifying regular expressions:
if (index($pin_list, $lvl_list) >= 0) {
do_something;
}
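To see the effect on a string joined with +, here is a small sketch comparing the three approaches. The sample values and the sub names contains_raw, contains_quoted and contains_index are made up for illustration.

```perl
use strict;
use warnings;

# Sample values modelled on the "joined with +" string from the question.
my $pin_list = '1+2+3+4+a+b';
my $lvl_list = '2+3+4';

# Hypothetical helpers: three ways to ask whether $h contains $n.
# Raw interpolation treats + as a quantifier, so /2+3+4/ looks for
# runs of 2s and 3s followed by 4, not the literal text "2+3+4".
sub contains_raw    { my ($h, $n) = @_; return $h =~ /$n/     ? 1 : 0 }
# \Q..\E quotes the metacharacters, giving a literal match.
sub contains_quoted { my ($h, $n) = @_; return $h =~ /\Q$n\E/ ? 1 : 0 }
# index does a plain substring search with no regex rules at all.
sub contains_index  { my ($h, $n) = @_; return index($h, $n) >= 0 ? 1 : 0 }

print join(' ',
    contains_raw($pin_list, $lvl_list),
    contains_quoted($pin_list, $lvl_list),
    contains_index($pin_list, $lvl_list),
), "\n";    # 0 1 1
```

The same thing happens when the two strings are identical: '1+2+3' =~ /1+2+3/ fails even though eq reports them as equal.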
