Perl - Match string between two colons - regex

My string looks like this
important stuff: some text 2: some text 3.
I want to print only "important stuff". So basically I want to print everything up to the first colon. I'm sure this is simple, but my regex-fu is not so good.
Edit: Sorry I was doing something stupid and gave you a bad example line. It has been corrected.

Just restrict what you're matching to non-colons, [^:]*. Note, the ^ and : boundaries aren't actually needed, but they help document the intent behind the regex.
my $text = "important stuff: some text 2: some text 3.";
if ($text =~ /^([^:]*):/) {
print "$1";
}
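As a minimal sketch, the same pattern can be wrapped in a helper to show what happens when the string has no colon at all. The sub name `before_first_colon` is hypothetical and not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text before the first colon,
# or undef when the string contains no colon.
sub before_first_colon {
    my ($text) = @_;
    return $text =~ /^([^:]*):/ ? $1 : undef;
}

my $with_colon = before_first_colon('important stuff: some text 2: some text 3.');
my $no_colon   = before_first_colon('no colon here');

print defined $with_colon ? "$with_colon\n" : "(no match)\n";   # important stuff
print defined $no_colon   ? "$no_colon\n"   : "(no match)\n";   # (no match)
```

Checking the return value with `defined` avoids printing a stale `$1` left over from an earlier match.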

Consider just splitting on the colon:
use strict;
use warnings;
my $string = 'important stuff: some text 2: some text 3.';
my $important = ( split /:/, $string )[0];
print $important;
Output:
important stuff

Well, assume it's a string:
$test = "sass sg22gssg 22222 2222: important important :";
Assume you want all the characters between the colons.
Wrong answer: $test =~ /:(.+):/; # thank you for the change from .{1,}
Corrected:
$test =~ /:([^:]*):/;
print $1; # $1 holds the captured text; you can also assign it to a variable
$found = $1;
See also a cheat sheet of regex in Perl.
I did test it.

Related

Matching multiline string in file using perl regex

I am reading in another perl file and trying to find all strings surrounded by quotation marks within the file, single or multiline. I've matched all the single lines fine, but I can't match the multiline strings without printing the entire line out, when I just want the string itself. For example, here's a snippet of what I'm reading in:
#!/usr/bin/env perl
use warnings;
use strict;
# assign variable
my $string = 'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
so the output I'd like is
'Hello World!';
"chmod";
"This is a fun multiple line string, please match";
but I am getting:
'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
This is the code I am using to find the strings - all file content is stored in @contents:
my @strings_found = ();
my $line;
for(@contents) {
$line .= $_;
}
if($line =~ /(['"](.?)*["'])/s) {
push @strings_found,$1;
}
print @strings_found;
I am guessing I am only getting 'Hello World!'; correctly because I am using the $1 but I am not sure how else to find the others without looping line by line, which I would think would make it hard to find the multi line string as it doesn't know what the next line is.
I know my regex is reasonably basic and doesn't account for some caveats but I just wanted to get the basic catch most regex working before moving on to more complex situations.
Any pointers as to where I am going wrong?
A couple of big things: you need to search in a while loop with the g modifier on your regex, and you also need to turn off greedy matching for what's inside the quotes by using .*?.
use strict;
use warnings;
my $contents = do {local $/; <DATA>};
my @strings_found = ();
while ($contents =~ /(['"](.*?)["'])/sg) {
push @strings_found, $1;
}
print "$_\n" for @strings_found;
__DATA__
#!/usr/bin/env perl
use warnings;
use strict;
# assign variable
my $string = 'Hello World!';
my $string4 = "chmod";
my $string3 = "This is a fun
multiple line string, please match";
Outputs
'Hello World!'
"chmod"
"This is a fun
multiple line string, please match"
You aren't the first person to search for help with this homework problem. Here's a more detailed answer I gave to ... well ... you ;) finding words surround by quotations perl
regexp matching (in Perl and generally) is greedy by default. So your regexp will match from the 1st ' or " to the last. Print the length of your @strings_found array. I think it will always be just 1 with the code you have.
Change it to be non-greedy by following * with a ?
/(['"](.?)*?["'])/s
I think.
It will work in a basic way. Regexps are kind of the wrong way to do this if you want a robust solution. You would want to write parsing code instead for that. If you have different quotes inside a string then greedy will give you the 1 biggest string. Non-greedy will give you the smallest strings, not caring whether the start and end quotes are different.
Read about greedy and non greedy.
Also note the /m multiline modifier.
http://perldoc.perl.org/perlre.html#Regular-Expressions
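To illustrate the point about mismatched quotes, here is a minimal sketch that uses a backreference so the closing quote has to be the same character as the opening one. It still ignores escaped quotes and quote characters inside comments, so it is an illustration rather than a robust parser. The sample text is made up.

```perl
use strict;
use warnings;

# Made-up sample input: note the apostrophe inside the double-quoted string.
my $contents = <<'END';
my $a = 'Hello World!';
my $b = "it's a
two line string";
END

my @strings_found;
# (["']) captures the opening quote as group 2; \2 requires the same
# character to close the string. /s lets . match newlines.
while ($contents =~ /((["']).*?\2)/sg) {
    push @strings_found, $1;
}
print "$_\n" for @strings_found;
```

With a plain `["']` at both ends, the second string would be cut short at the apostrophe in `it's`.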

How to remove duplicate substrings from an undelimited string in perl?

I have an odd situation where I want to remove all but the first match of a substring inside of a very long undelimited string. I have found some similar topics here, but none quite like mine.
For simplicity's sake, here are some pseudo before and after strings.
I have an undelimited file where "c" could be thousands of random characters but "bbb" is a unique string:
aaabbbbbbccccccbbbccccccbbbccccccaaa
I want to remove all but the first bbb:
aaabbbccccccccccccccccccaaa
Also, I would like to be able to use this as a perl script I can pipe through:
cat file.in | something | perl -pe 's/bbb//g' | somethingelse > file.out
But, unlike my example above, I want to leave the first occurrence of "bbb" intact.
This seems like it should be fairly easy, but it is stumping me.
Any ideas?
Thanks in advance!
Perhaps the following will be helpful:
use strict;
use warnings;
my $string = 'aaabbbbbbccccccbbbccccccbbbccccccaaa';
$string =~ s/(?<=bbb).*?\Kbbb//g;
print $string;
Output:
aaabbbccccccccccccccccccaaa
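use feature 'say'; # needed for say
my $string = 'aaabbbbbbccccccbbbccccccbbbccccccaaa';
my $seen = 0;
# Return the match the first time it is seen, and an empty string afterwards.
sub first {
$seen++;
return $_[0] if $seen == 1;
return '';
}
$string =~ s/(bbb)/first($1)/ge;
say $string;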
Outputs:
aaabbbccccccccccccccccccaaa
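The counter idea can also be sketched as a reusable helper that takes the substring as a parameter. The name `keep_first` is hypothetical, and `\Q...\E` is used on the assumption that the substring should be matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every occurrence of $needle except the first.
sub keep_first {
    my ($string, $needle) = @_;
    my $seen = 0;
    # \Q...\E quotes regex metacharacters in $needle; the replacement
    # expression keeps the first match and drops all later ones.
    $string =~ s/(\Q$needle\E)/$seen++ ? '' : $1/ge;
    return $string;
}

print keep_first('aaabbbbbbccccccbbbccccccbbbccccccaaa', 'bbb'), "\n";
```

For the pipeline in the question, the same expression should work as a one-liner, for example `perl -pe 's/bbb/$seen++ ? "" : "bbb"/ge'`, because `$seen` persists across input lines.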

How to substitute whitespaces and tabs in a string with _ in perl?

$string = I am a boy
How to substitute whitespaces between words with underscore ?
You need a regular expression and the substitution operator to do that.
my $string = 'I am a boy';
$string =~ s/\s/_/g;
You can learn more about regex in perlre and perlretut. A nice tool to play around with is Rubular.
Also, your code will not compile. You need to quote your string, and you need to put a semicolon at the end.
$string = 'I am a boy';
$string =~ s/ /_/g;
$string =~ tr( \t)(_); # Double underscore not necessary as per Dave's comment
This is just to show another option in perl. I think Miguel Prz and imbabque showed smarter ways; personally I follow the way imbabque showed.
my $str = "This is a test string";
$str =~ s/\p{Space}/_/g;
print $str."\n";
and the output is
This_is_a_test_string
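The variants in these answers differ when the input has tabs or repeated whitespace. A small sketch comparing them on a made-up input string:

```perl
use strict;
use warnings;

my $input = "I  am\ta boy";    # two spaces, then a tab

(my $each_char = $input) =~ s/\s/_/g;     # every whitespace character
(my $each_run  = $input) =~ s/\s+/_/g;    # every run of whitespace
(my $with_tr   = $input) =~ tr/ \t/_/;    # spaces and tabs only

print "$each_char\n";   # I__am_a_boy
print "$each_run\n";    # I_am_a_boy
print "$with_tr\n";     # I__am_a_boy
```

Use `\s+` when consecutive whitespace should collapse into a single underscore.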

Perl regex to substitute a pattern excluding another pattern

I have a string as below.
$line = 'this is my string "hello world"';
I want to have a regex to delete all space characters inside the string except the region "Hello world".
I use below to delete space chars but it deletes all of them.
$line=~s/ +//g;
How can I exclude the region between "Hello world" and i get the string as below?
thisismystring"hello world"
Thanks
Since you probably want to handle quoted strings properly, you should have a look at the Text::Balanced module.
Use that to split your text into quoted parts and non-quoted parts, then do the replacement on the non-quoted parts only, and finally join the string together again.
Well, here's one way to do it:
use warnings;
use strict;
my $l = 'this is my string "hello world some" one two three "some hello word"';
$l =~ s/ +(?=[^"]*(?:"[^"]*"[^"]*)+$)//g;
print $l;
# thisismystring"hello world some"onetwothree"some hello word"
But I really wonder whether it shouldn't be done the other way (by tokenizing the string, for example), especially if the quotes may be unbalanced.
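The tokenizing idea can be sketched with a single substitution that treats a complete quoted chunk as one token and puts it back untouched. This is only a sketch: it assumes double quotes without escapes, and the sample line is made up.

```perl
use strict;
use warnings;

my $line = 'this is my string "hello world" and more "a b" text';

# Alternation: a complete double-quoted chunk is captured and put back
# unchanged; any other run of whitespace is replaced with nothing.
$line =~ s/("[^"]*")|\s+/defined($1) ? $1 : ''/ge;

print "$line\n";   # thisismystring"hello world"andmore"a b"text
```

A lone quote with no closing partner never matches the first alternative, so whitespace after it is removed like any other unquoted whitespace.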
Another regex to do it:
s/(\s+(".*?")?)/$2/g
#!/usr/bin/perl
use warnings;
use strict;
sub main {
my $line = 'this is my string "hello world"';
while ($line =~ /(\w*|(?:"[^"]*"))\s*/g) { print $1;}
print "\n";
}
main;
s/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)//g

How do I print what was replaced in a substitution operation?

I understand that's a very simple question but I really failed googling it. =(
I've got something like this:
$a =~ s/(\w*)/--word was here--/g;
And I want to put into a log file which words were replaced.
aa 123 bb 234 cc → --word was here-- 123 --word was here-- 234 --word was here--
And that's okay, but I want to remember aa, bb and cc and write into a log file. What should I do?
In fact I have a link remover script but I need to remember which links were removed. I tried to simplify my task for you but made it much harder to understand - sorry.
You can use the e modifier which evaluates the right side as an expression:
$a =~ s/(\w*)/log_it($1), ""/ge;
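A minimal sketch of what such a `log_it` could look like. Here it is a hypothetical sub that records the word in an array standing in for the log file, and it returns the replacement text itself instead of using the `, ""` form above.

```perl
use strict;
use warnings;

my @removed;    # stands in for the log file in this sketch

# Hypothetical logger: records the word and returns the replacement text.
sub log_it {
    my ($word) = @_;
    push @removed, $word;
    return '--word was here--';
}

my $text = 'aa 123 bb 234 cc';
# [[:alpha:]]+ instead of \w* so digits are skipped and empty matches
# are never replaced.
$text =~ s/([[:alpha:]]+)/log_it($1)/ge;

print "$text\n";                   # --word was here-- 123 --word was here-- 234 --word was here--
print join(',', @removed), "\n";   # aa,bb,cc
```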
You can do it in a loop instead of using /g:
my $string = "xxx ; yyy ; zzz";
my @replaced;
while ($string =~ s/(\w+)//) { push @replaced, $1 };
print join(",",@replaced);
# OUTPUT: xxx,yyy,zzz
Please note that \w is a WORD character, not an alphabet one, so it will match digits 0-9 as well.
If you only want to match letters, use the [[:alpha:]] class.
The following stores all captured words into an array:
use strict;
use warnings;
use Data::Dumper;
my $s = 'cat dog';
my @words;
while ($s =~ s/(\w+)//) {
push @words, $1;
}
print Dumper(\@words);
__END__
$VAR1 = [
'cat',
'dog'
];
Update: Now that you have added input and output data, it looks like you want to exclude numbers. In that case, you could use [a-zA-Z] instead of [\w].