Finding words surrounded by quotations in Perl

I am reading another Perl file line by line and need to find any word or set of words surrounded by single or double quotation marks. This is an example of the code I am reading in:
#!/usr/bin/env perl
use strict;
use warnings;
my $string = 'Hello World!';
print "$string\n";
Basically, I need to find and print out 'Hello World!' and "$string\n".
I've read the file in fine and stored its contents in an array. From there I loop over each line and look for the quoted words using a regex, like this:
for (@contents) {
if(/\"|\'[^\"|\']*\"|\'/) {
print $_."\n";
}
}
which gives me the following output:
my $string = 'Hello World!';
print "$string\n";
I tried splitting the contents by whitespace and then trying to find a match, but that gives me this:
'Hello
World!'
"$string\n";
I've tried numerous solutions others suggested on here, but to no avail. I have also tried Text::ParseWords and its parse_line function, but that gives me completely wrong output.
Any ideas that could help me?

You just need to add capturing parentheses to your regex and print the captured group instead of the whole line:
use strict;
use warnings;
while (<DATA>) {
if(/(["'][^"']*["'])/) {
print "$1\n";
}
}
__DATA__
#!/usr/bin/env perl
use strict;
use warnings;
my $string = 'Hello World!';
print "$string\n";
Note that this regex still has plenty of flaws. For example, '\'' won't match properly, and neither will "He said 'boo'". To get closer you'll have to do some kind of balanced-delimiter checking, but there isn't going to be any perfect solution.
For a solution that is a little closer, you could use the following, which only lets a string end on the same kind of quote that opened it and skips over backslash-escaped characters:
if(/('(?:(?>[^'\\]+)|\\.)*'|"(?:(?>[^"\\]+)|\\.)*")/) {
That would take care of the exceptions above and also of strings like print "how about ' this \" and ' more \n";, but there are still edge cases such as the use of qq{} or q{}, not to mention strings that span more than one line.
In other words, if your goal is perfection, this project may be outside the scope of most people's skills, but hopefully the above will be of some help.
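As a minimal sketch of how the escape-aware pattern above could be applied to a whole chunk of text rather than one match per line, the same regex can be run in a loop with the g modifier. The helper name quoted_strings is made up for illustration, and the limitations already mentioned (q{}, qq{}, heredocs, quote characters inside comments) still apply.

```perl
use strict;
use warnings;

# Hypothetical helper: return every single- or double-quoted string in $text.
# A string must close with the same quote character that opened it, and
# backslash-escaped characters inside the quotes are skipped over.
sub quoted_strings {
    my ($text) = @_;
    my @found;
    # /g resumes after the previous match; /s lets "\\." consume an escaped newline.
    while ($text =~ /('(?:(?>[^'\\]+)|\\.)*'|"(?:(?>[^"\\]+)|\\.)*")/gs) {
        push @found, $1;
    }
    return @found;
}

# q{...} keeps the backslashes literal, so this is the source text as a file would contain it.
print "$_\n" for quoted_strings(q{print "how about ' this \" and ' more \n";});
```

Because the text is scanned as one string, this also picks up several strings on one line and strings that span lines, provided the whole file has been read into $text first.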

You may have more than one string to capture per line. One solution could be:
while(my $line=<STDIN>) {
while( $line =~ /[\'\"](.*?)[\'\"]/g ) {
print "matched: '$1'\n";
}
}
For example, given this input:
#!/usr/bin/env perl
use strict;
use warnings;
my $string = 'Hello World!' . 'asdsad';
print "$string\n";
and executing the code will give you:
matched: 'Hello World!'
matched: 'asdsad'
matched: '$string\n'

Related

Matching multiline string in file using perl regex

I am reading in another Perl file and trying to find all strings surrounded by quotations within the file, whether single-line or multiline. I've matched all the single-line strings fine, but I can't match the multiline ones without printing the entire line out, when I just want the string itself. For example, here's a snippet of what I'm reading in:
#!/usr/bin/env perl
use warnings;
use strict;
# assign variable
my $string = 'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
so the output I'd like is
'Hello World!';
"chmod";
"This is a fun multiple line string, please match";
but I am getting:
'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
This is the code I am using to find the strings. All file content is stored in @contents:
my @strings_found = ();
my $line;
for(@contents) {
$line .= $_;
}
if($line =~ /(['"](.?)*["'])/s) {
push @strings_found,$1;
}
print @strings_found;
I am guessing I am only getting 'Hello World!'; correctly because I am using $1, but I am not sure how else to find the others without looping line by line, which I would think would make it hard to find the multiline string, as it doesn't know what the next line is.
I know my regex is reasonably basic and doesn't account for some caveats, but I just wanted to get a basic catch-most regex working before moving on to more complex situations.
Any pointers as to where I am going wrong?

A couple of big things: you need to search in a while loop with the g modifier on your regex, and you also need to turn off greedy matching for what's inside the quotes by using .*?.
use strict;
use warnings;
my $contents = do {local $/; <DATA>};
my @strings_found = ();
while ($contents =~ /(['"](.*?)["'])/sg) {
push @strings_found, $1;
}
print "$_\n" for @strings_found;
__DATA__
#!/usr/bin/env perl
use warnings;
use strict;
# assign variable
my $string = 'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
Outputs
'Hello World!'
"chmod"
"This is a fun
multiple line string, please match"
You aren't the first person to search for help with this homework problem. Here's a more detailed answer I gave to ... well ... you ;) in "Finding words surrounded by quotations in Perl".

Regexp matching (in Perl and generally) is greedy by default, so your regexp will match from the first ' or " to the last one. Print the length of your @strings_found array; I think it will always be just 1 with the code you have.
Change it to be non-greedy by following the * with a ?:
/(['"].*?["'])/s
I think.
It will work in a basic way. Regexps are kind of the wrong way to do this if you want a robust solution; you would want to write parsing code for that instead. If you have different quotes inside a string, greedy matching will give you the one biggest string, while non-greedy matching will give you the smallest strings without caring whether the start and end quotes are different.
Read about greedy and non-greedy matching.
Also note the /m multiline modifier, which makes ^ and $ match at line boundaries, and the /s modifier, which lets . match newlines.
http://perldoc.perl.org/perlre.html#Regular-Expressions
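To make the greedy versus non-greedy difference described above concrete, here is a small sketch. The sample text and variable names are invented for illustration; the patterns are the simple "any quote to any quote" form discussed in this thread, not a robust parser.

```perl
use strict;
use warnings;

# Sample text; q{...} keeps $x and $y literal instead of interpolating them.
my $text = q{my $x = 'one'; my $y = "two";};

# Greedy: .* runs on to the LAST quote character in the text.
my ($greedy) = $text =~ /(['"].*["'])/s;

# Non-greedy: .*? stops at the FIRST quote character that follows,
# and /g in list context returns every capture.
my @lazy = $text =~ /(['"].*?["'])/sg;

print "greedy: $greedy\n";
print "lazy:   $_\n" for @lazy;
```

The greedy pattern yields a single match spanning from the first quote to the last, while the non-greedy one with the g modifier yields one match per quoted string.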

Perl - Match string between two colons

My string looks like this
important stuff: some text 2: some text 3.
I want to only print "important stuff". So basically I want to print everything up to the first colon. I'm sure this is simple, but my regex-fu is not so good.
Edit: Sorry I was doing something stupid and gave you a bad example line. It has been corrected.

Just restrict what you're matching to non-colons, [^:]*. Note, the ^ and : boundaries aren't actually needed, but they help document the intent behind the regex.
my $text = "important stuff: some text 2: some text 3.";
if ($text =~ /^([^:]*):/) {
print "$1";
}

Consider just splitting on the colon:
use strict;
use warnings;
my $string = 'important stuff: some text 2: some text 3.';
my $important = ( split /:/, $string )[0];
print $important;
Output:
important stuff

Well, assume it's a string:
$test = "sass sg22gssg 22222 2222: important important :";
Assume you want all the characters between the two colons.
Wrong answer (the greedy .+ can run past an intermediate colon):
$test =~ /:(.+):/; # thank you for the change from .{1,}
Corrected:
$test =~ /:([^:]*):/;
print $1; # $1 holds the captured text; you can assign it to a variable
$found = $1;
A Perl regex cheat sheet is a useful reference for this.
I did test it.

How to remove duplicate substrings from an undelimited string in perl?

I have an odd situation where I want to remove all but the first match of a substring inside of a very long undelimited string. I have found some similar topics here, but none quite like mine.
For simplicity's sake, here are some pseudo before and after strings.
I have an undelimited file where "c" could be thousands of random characters but "bbb" is a unique string:
aaabbbbbbccccccbbbccccccbbbccccccaaa
I want to remove all but the first bbb:
aaabbbccccccccccccccccccaaa
Also, I would like to be able to use this as a perl script I can pipe through:
cat file.in | something | perl -pe 's/bbb//g' | somethingelse > file.out
But, unlike my example above, I want to leave the first occurrence of "bbb" intact.
This seems like it should be fairly easy, but it is stumping me.
Any ideas?
Thanks in advance!

Perhaps the following will be helpful:
use strict;
use warnings;
my $string = 'aaabbbbbbccccccbbbccccccbbbccccccaaa';
$string =~ s/(?<=bbb).*?\Kbbb//g;
print $string;
Output:
aaabbbccccccccccccccccccaaa

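use strict;
use warnings;
use feature 'say';
my $string = 'aaabbbbbbccccccbbbccccccbbbccccccaaa';
my $seen = 0;
# Return the match the first time it is seen, and an empty string afterwards.
sub first {
$seen++;
return $_[0] if $seen == 1;
return '';
}
$string =~ s/(bbb)/first($1)/ge;
say $string;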
Outputs:
aaabbbccccccccccccccccccaaa
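The counter idea from the answer above can also be folded directly into the substitution. This is only a sketch: the helper name keep_first is hypothetical, and it treats the substring as literal text rather than as a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every occurrence of $needle from $text except the first.
sub keep_first {
    my ($text, $needle) = @_;
    my $seen = 0;
    # \Q...\E quotes regex metacharacters in $needle.
    # With /e the replacement is code: keep the first hit, drop all later ones.
    $text =~ s/\Q$needle\E/$seen++ ? '' : $needle/ge;
    return $text;
}

print keep_first('aaabbbbbbccccccbbbccccccbbbccccccaaa', 'bbb'), "\n";
```

For the pipeline in the question, the same idea should work as a one-liner such as perl -pe 's/bbb/$n++ ? "" : "bbb"/ge', since the counter $n persists across input lines when running under -p.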

Perl regex to substitute a pattern excluding another pattern

I have a string as below.
$line = 'this is my string "hello world"';
I want a regex that deletes all space characters in the string except those inside the region "hello world".
I use the following to delete space characters, but it deletes all of them.
$line=~s/ +//g;
How can I exclude the region "hello world" so that I get the string below?
thisismystring"hello world"
Thanks

Since you probably want to handle quoted strings properly, you should have a look at the Text::Balanced module.
Use that to split your text into quoted parts and non-quoted parts, then do the replacement on the non-quoted parts only, and finally join the string together again.

Well, here's one way to do it:
use warnings;
use strict;
my $l = 'this is my string "hello world some" one two three "some hello word"';
$l =~ s/ +(?=[^"]*(?:"[^"]*"[^"]*)+$)//g;
print $l;
# thisismystring"hello world some"onetwothree"some hello word"
But I really wonder whether it shouldn't be done the other way (by tokenizing the string, for example), especially if the quotes may be unbalanced.
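One way the tokenizing approach could look, as a rough sketch: split the line into quoted and unquoted parts, strip whitespace from the unquoted parts only, and join everything back together. The helper name strip_unquoted_spaces is made up for illustration, and the sketch assumes double quotes only, with no escaped quotes inside them.

```perl
use strict;
use warnings;

# Hypothetical helper: delete whitespace everywhere except inside "..." regions.
sub strip_unquoted_spaces {
    my ($line) = @_;
    # A capturing group in the split pattern keeps the separators, so the list
    # alternates: unquoted text (even index), quoted string (odd index), ...
    my @parts = split /("[^"]*")/, $line;
    for my $i (0 .. $#parts) {
        $parts[$i] =~ s/\s+//g if $i % 2 == 0;   # only touch unquoted parts
    }
    return join '', @parts;
}

print strip_unquoted_spaces('this is my string "hello world"'), "\n";
```

An unterminated quote is never captured by the split pattern, so the text after it is treated as unquoted and has its whitespace removed; whether that is the desired behaviour depends on the input.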

Another regex to do it:
s/(\s+(".*?")?)/$2/g

#!/usr/bin/perl
use warnings;
use strict;
sub main {
my $line = 'this is my string "hello world"';
while ($line =~ /(\w*|(?:"[^"]*"))\s*/g) { print $1;}
print "\n";
}
main;

s/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)//g

Why doesn't the first replacement have any effect?

Most probably I'm missing something obvious here, but why do I need to call the search/replace regex twice to have any effect in the following code? If I call it only once, the replacement doesn't take place :-(
use strict;
use warnings;
use LWP::Simple;
my $youtubeCN = get(shift @ARGV);
die("Script tag not found!\n")
unless $youtubeCN =~ /<script src="(.*?)">/;
my $youtubeScr = $1;
# WHY ???
$youtubeScr =~ s/&amp;/&/g;
$youtubeScr =~ s/&amp;/&/g;
my $gmodScr = get($youtubeScr);
$gmodScr =~ s/http:\/\/\?container/http:\/\/www.gmodules.com\/ig\/ifr\?/;
print "<script type=\"text/javascript\">$gmodScr</script>\n";
Update: I call this script like this:
perl bork_youtube_channel.pl 'http://www.youtube.com/user/pennsays'
If &amp; isn't properly transformed into &, I will get back an HTML page (probably an error page) rather than JavaScript at step 2.
Update: It turns out that the URL was double encoded after all. Thank you all for your help!

I suspect that if you look at the input data, it is doing the right thing - my guess is that in the middle of encoding and decoding, you're not seeing the real input and output. For example, try this:
use strict;
use warnings;
my $youtubeScr = "a&amp;b";
$youtubeScr =~ s/&amp;/&/g;
print $youtubeScr;
print "\n";
$youtubeScr =~ s/&amp;/&/g;
print $youtubeScr;
print "\n";
This prints
a&b
a&b
In other words, it's already worked to start with.
Are you sure your original text isn't foo&amp;amp;bar? That would give output of
foo&amp;bar
foo&bar
with the above code.
PS My perl-fu sucks. Apologies for any language abuses in the above code, but I think it should still be helpful :)
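Since the URL turned out to be double encoded, one possible way to avoid repeating the substitution by hand is to loop until nothing is left to replace. This is only a sketch with a hypothetical helper name, decode_amp_fully, and it assumes the input never contains a literal &amp; that is meant to stay encoded.

```perl
use strict;
use warnings;

# Hypothetical helper: undo any number of layers of "&amp;" encoding.
sub decode_amp_fully {
    my ($s) = @_;
    # s///g returns the number of replacements made, so the loop
    # stops on the first pass that changes nothing.
    1 while $s =~ s/&amp;/&/g;
    return $s;
}

my $once  = 'a&amp;b';       # encoded once
my $twice = 'a&amp;amp;b';   # encoded twice, as in the question
print decode_amp_fully($once), "\n";
print decode_amp_fully($twice), "\n";
```

A single s/&amp;/&/g pass only removes one layer, which is why the original script appeared to need the substitution twice.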
