Regex Remove From x To y - regex

I'm new to regex; I know the basics, but only the basics. I need to parse a string and remove everything from one string through to another, for every occurrence. For example,
Here is some random text
This wants to stay
foo
This wants to be removed
bar
And this wants to stay
So the desired output would be
Here is some random text
This wants to stay
And this wants to stay
And removed would be
foo
This wants to be removed
bar
It will always follow the pattern of match 'this string' to 'that string' and remove everything in between, including 'this string' and 'that string'.
The file is a text file. For the sake of this question, the pattern will always start with foo and end with bar, removing foo, bar and everything in between.
Foo and Bar ARE part of the file and need removing.

Regexes are probably the wrong tool here. I'd probably use string equality along with the flip-flop operator.
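while (<$input_fh>) {
    # $_ must be passed explicitly: with a lexical handle, "print $output_fh" alone does not print $_ to it
    print {$output_fh} $_ unless ($_ eq "foo\n" .. $_ eq "bar\n");
}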
You could do it with a regex and a match operator.
while (<$input_fh>) {
    # as above, $_ is passed explicitly to the lexical filehandle
    print {$output_fh} $_ unless /foo/ .. /bar/;
}
That looks neater, but the regexes will match if the strings appear anywhere on an input line.
Update: Inverted the logic on the tests - so it's now correct.
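A self-contained sketch of the flip-flop approach, using an in-memory list instead of filehandles so it can be run as-is. The sub name remove_block and the sample data are made up for illustration; it assumes foo and bar each sit alone on a line.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every run of lines from "foo" through "bar".
sub remove_block {
    my @lines = @_;
    my @kept;
    for my $line (@lines) {
        # The range (flip-flop) operator turns true on the "foo" line and
        # stays true up to and including the next "bar" line.
        next if ($line eq "foo\n") .. ($line eq "bar\n");
        push @kept, $line;
    }
    return @kept;
}

my @input = map { "$_\n" } (
    'Here is some random text', 'This wants to stay', 'foo',
    'This wants to be removed', 'bar', 'And this wants to stay',
);
print remove_block(@input);
```

With the sample lines from the question, only the three lines that "want to stay" should be printed.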

Are you looking for something like this?
#!/usr/bin/perl
use strict;
use warnings;
my $start = "foo";
my $end   = "bar";
my $str   = '';
# Read all of standard input into one string.
while (<STDIN>) {
    $str .= $_;
}
# /s lets . match newlines; $1 and $2 hold the text before foo and after bar.
$str =~ s/(.*)$start\n.*$end\n(.*)/$1$2/s;
print $str;
The only part of real importance to you is the regex, I suppose. I declare the start and end markers, then read from standard input and append each successive line to $str. The substitution then says "whatever the first set of parentheses captured before foo goes first, and whatever the second set captured after bar goes last" (via $1 and $2 in the replacement). Without the /g modifier, this removes only one foo ... bar span.
My output from a file containing your lines is:
marshall@marshall-desktop:~$ cat blah | ./haha
Here is some random text
This wants to stay
And this wants to stay
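If the input can contain more than one foo ... bar block, a non-greedy, global substitution is one way to remove them all. This is a sketch rather than the answer's own code: the sub name strip_blocks is hypothetical, and it assumes foo and bar each sit alone on a line.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every block delimited by $start and $end lines.
sub strip_blocks {
    my ($text, $start, $end) = @_;
    # /m lets ^ match at each line start, /s lets . cross newlines,
    # .*? stops at the first closing line, /g repeats for every block.
    $text =~ s/^\Q$start\E\n.*?^\Q$end\E\n//msg;
    return $text;
}

my $text = "keep 1\nfoo\ndrop\nbar\nkeep 2\nfoo\ndrop too\nbar\nkeep 3\n";
print strip_blocks($text, 'foo', 'bar');
```

The \Q...\E quoting keeps any regex metacharacters in the markers literal, which matters once the markers are something other than foo and bar.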

That's not what RegEx is there for. RegEx is there to detect patterns. If you want a simple string slice, you should iterate over the big string with a plain comparison (or, in languages that include string operations, use something like indexOf("your string here")).
However, simply using the literal string as the pattern would find the matches:
This wants to be removed will match every occurrence of that specific string, so it would work for you.

Related

better way to replace text

I am trying to separate a line of text with space in between.
here is my code
$Text= "hellohello"
if($Text -match "(\w+)(o)(\w+)") {$Text = ($Matches[1] + $Matches[2] -replace "o", "o ")+$Matches[3]}
What is a better way to do it? Let's say I changed the text to "manymany"; I want PowerShell to automatically identify the first word and add a space in between.
The question you ask can be big or small. Identifying a word programmatically is not an easy topic.
For your case, the easy one, I assume you just want to find the repeated strings and insert a space between them.
A regular expression is enough to do it.
"yesyesnonolarryhellohello" -replace "(\w{2,}?)\1",'$1 $1'
Output
yes yesno nolarryhello hello
\w{2,}? matches any run of at least 2 word characters, as few as possible; the () marks it as a capture group
\1 refers back to the first captured group
so (\w{2,}?)\1 can match yesyes, okok and hellohello, and in those cases $1 is yes, ok and hello

Using Perl split function to keep (capture) some delimiters and discard others

Let's say I am using Perl's split function to split up the contents of a file.
For example:
This foo file has+ a bunch of; (random) things all over "the" place
So let's say I want to use whitespace and semicolons as delimiters.
So I would use something like:
split(/([\s+\;])/, $fooString)
I'm having trouble figuring out a syntax (or even if it exists) to capture the semicolon and discard the whitespace.
You seem to ask for something like
my @fields_and_delim = split /\s+|(;)/, $string; # not quite right
but this isn't quite what it may seem. It also returns undefined elements (with warnings): when \s+ matches, the () captures nothing, but $1 is still returned as asked, and it's undef. There are yet more spurious elements when your delimiters come together in the string.
So filter
my @fields_and_delim = grep { defined and /\S/ } split /(\s+|;)/, $string;
in which case you can normally capture the delimiter.
This can also be done with a regex
my @fields_and_delim = $string =~ /([^\s;]+|;+)/g;
which in this case allows more control over what and how you pick from the string.
If repeated ; need be captured separately change ;+ to ;
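To compare the two approaches side by side, here is a small runnable sketch using the sample line from the question. The variable names @from_split and @from_match are made up for illustration.

```perl
use strict;
use warnings;

my $string = 'This foo file has+ a bunch of; (random) things all over "the" place';

# Approach 1: capture every delimiter, then drop undefined and whitespace-only pieces.
my @from_split = grep { defined($_) && /\S/ } split /(\s+|;)/, $string;

# Approach 2: match the tokens directly: runs of non-delimiters, or runs of semicolons.
my @from_match = $string =~ /([^\s;]+|;+)/g;

print join('|', @from_split), "\n";
print join('|', @from_match), "\n";
```

Both lists should contain the same tokens for this input, with the semicolon kept as its own element and the whitespace discarded.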
I think that what you want is as simple as:
split /\s*;\s*/, $fooString;
That will separate around the ; character that may or may not have any whitespace before or after.
In your example:
>This foo file has+ a bunch of; (random) things all over "the" place<
It would split into:
>This foo file has+ a bunch of<
and:
>(random) things all over "the" place<
By the way, you need to put the result of split into an array; for instance:
my @parts = split /\s*;\s*/, $fooString;
Then $parts[0] and $parts[1] would have the two bits.
I think grep is what you're really looking for, to filter out the values that are empty or all whitespace:
my @all_exc_ws = grep { !/^\s*$/ } split(/([\s\;])/, $fooString);
Note the \s* in the filter: two adjacent delimiters produce an empty field between them, and that should be dropped as well. Also, I removed the + from your regex since it was inside the [], which changes its meaning.

How to do conditional ("if exist" logic) search & replace in Perl?

in my Perl script I want to do conditional search & replace using regular expression: Find a certain pattern, and if the pattern exists in a hash, then replace it with something else.
For example, I want to search for a combination of "pattern1" and "pattern2", and if the latter exists in a hash, then replace the combination with "pattern1" and "replacement". I tried the following, but it just doesn't do anything at all.
$_ =~ s/(pattern1)(pattern2)/$1replacement/gs if exists $my_hash{$2};
I also tried stuff like:
$_ =~ s/(pattern1)(pattern2) && exists $my_hash{$2}/$1replacement/gs;
Also does nothing at all, as if no match is found.
Can anyone help me with this regex problem? Thx~
I would do it a different way. It looks like you have a 'search this, replace that' hash.
So:
#!/usr/bin/env perl
use strict;
use warnings;
#our 'mappings'.
#note - there can be gotchas here with substrings
#so make sure you anchor patterns or sort, so
#you get the right 'substring' match occurring.
my %replace = (
"this phrase" => "that thing",
"cabbage" => "carrot"
);
#stick the keys together into an alternation regex.
#quotemeta means regex special characters will be escaped.
#you can remove that, if you want to use regex in your replace keys.
my $search = join( "|", map {quotemeta} keys %replace );
#compile it - note \b is a zero width 'word break'
#so it will only match whole words, not substrings.
$search = qr/\b($search)\b/;
#iterate the special DATA filehandle - for illustration and a runnable example.
#you probably want <> instead for 'real world' use.
while (<DATA>) {
#apply regex match and replace
s/(XX) ($search)/$1 $replace{$2}/g;
#print current line.
print;
}
##inlined data filehandle for testing.
__DATA__
XX this phrase cabbage
XX cabbage carrot cabbage this phrase XX this phrase
XX no words here
and this shouldn't cabbage match this phrase at all
By doing this, we turn your hash keys into a regex (you can print it; it looks like (?^:\b(cabbage|this\ phrase)\b)).
Which is inserted into the substitution pattern. This will only match if the key is present, so you can safely do the substitution operation.
Note - I've added quotemeta because then it escapes any special characters in the keys. And the \b is a "word boundary" match so it doesn't do substrings within words. (Obviously, if you do want that, then get rid of them)
The above gives output of:
XX that thing cabbage
XX carrot carrot cabbage this phrase XX that thing
XX no words here
and this shouldn't cabbage match this phrase at all
If you wanted to omit lines that didn't pattern match, you can stick && print; after the regex.
What is wrong (as in not working) with
if (exists $h{$patt1}) { $text =~ s/$patt1$patt2/$patt1$replacement/g; }
If $patt1 exists as a key in the hash, you go ahead and replace $patt1$patt2 with $patt1$replacement. That only has an effect if $patt1$patt2 is found in $text; otherwise nothing happens. Your first code snippet is circular, while the second one can't work like that at all.
If you want $patt1$patt2 first, and hash key as well then it seems that you'd have to go slow
if ($str =~ /$patt1$patt2/ && exists $h{$patt2}) {
$str =~ s/$patt1$patt2/$patt1$replacement/gs;
}
If this is what you want then it is really simple: you need two unrelated conditions, whichever way you turn it around. Can't combine them since it would be circular.
From the point of view of the outcome these are the same. If either condition fails nothing happens, regardless of the order in which you check them.
NOTE Or maybe you don't have to go slow, see Sobrique's post.
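Another way to combine the match and the hash lookup in one pass is the /e modifier, which evaluates the replacement as Perl code for each match. The sketch below is an assumption about what the question intends: the pattern (XX followed by a word), the sub name conditional_replace, and the %my_hash contents are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical lookup table: matched word => replacement.
my %my_hash = ( cabbage => 'carrot' );

# Hypothetical helper: rewrite "XX <word>" only when <word> is a hash key.
sub conditional_replace {
    my ($text) = @_;
    # /e runs the replacement as code, so exists() is checked per match;
    # when the key is missing, the matched text is put back unchanged.
    $text =~ s/(XX )(\w+)/ exists $my_hash{$2} ? $1 . $my_hash{$2} : $1 . $2 /ge;
    return $text;
}

print conditional_replace("XX cabbage XX potato\n");
```

This avoids the circularity in the question's first attempt, where $2 is consulted before the substitution has had a chance to set it.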

Use Perl to check if a string has only English characters

I have a file with submissions like this
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
I am stripping everything but the song name by using this regex.
$line =~ s/.*>|([([\/\_\-:"``+=*].*)|(feat.*)|[?¿!¡\.;&\$#%#\\|]//g;
I want to make sure that the only strings printed are ones that contain only English characters, so in this case it would be the first song title, Ai Wo Qing shut up, and not the next one because of the è.
I have tried this
if ( $line =~ m/[^a-zA-z0-9_]*$/ ) {
print $line;
}
else {
print "Non-english\n";
}
I thought this would match just the English characters, but it always prints Non-english. I feel this is me being rusty with regex, but I cannot find my answer.
Following from the comments, your problem would appear to be:
$line =~ m/[^a-zA-z0-9_]*$/
Specifically, the ^ is inside the brackets, which means that it's not acting as an 'anchor'. It's actually a negation operator.
See: http://perldoc.perl.org/perlrecharclass.html#Negation
It is also possible to instead list the characters you do not want to match. You can do so by using a caret (^) as the first character in the character class. For instance, [^a-z] matches any character that is not a lowercase ASCII letter, which therefore includes more than a million Unicode code points. The class is said to be "negated" or "inverted".
But the important part is - that without the 'start of line' anchor, your regular expression is zero-or-more instances (of whatever), so will match pretty much anything - because it can freely ignore the line content.
(Borodin's answer covers some of the other options for this sort of pattern match, so I shan't reproduce).
It's not clear exactly what you need, so here are a couple of observations that speak to what you have written.
It is probably best if you use split to divide each line of data on <SEP>, which I presume is a separator. Your question asks for the fourth such field, like this
use strict;
use warnings;
use 5.010;
while ( <DATA> ) {
chomp;
my @fields = split /<SEP>/;
say $fields[3];
}
__DATA__
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
output
Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
Achète-moi
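Also, for ASCII text the word character class \w matches exactly [a-zA-Z0-9_] (and \W matches the complement); the /a modifier enforces that even for Unicode strings. Spaces count as \W too, so allow whitespace explicitly when you rewrite your if statement like this
if ( $line =~ /[^\w\s]/a ) {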
print "Non-English\n";
}
else {
print $line;
}
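As a sketch of the check the question describes, the snippet below treats a title as English when it contains only ASCII letters, digits, underscores, and whitespace. That definition and the helper name is_ascii_title are assumptions for illustration, not part of either answer.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the title has only ASCII word characters and whitespace.
sub is_ascii_title {
    my ($title) = @_;
    # \A and \z anchor the whole string; the explicit class avoids any
    # locale- or Unicode-dependent meaning of \w.
    return $title =~ /\A[A-Za-z0-9_\s]+\z/ ? 1 : 0;
}

for my $title ('Ai Wo Qing shut up', "Ach\x{e8}te") {
    printf "%s\n", is_ascii_title($title) ? 'English' : 'Non-English';
}
```

Anchoring both ends is the key difference from the question's pattern, which could match the empty string at the end of any line.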

Sensethising domains

So I'm trying to put all numbered domains into one element of a hash by doing this:
### Domains ###
my $dom = $name;
$dom =~ /(\w+\.\w+)$/; #this regex gets the domain names only
my $temp = $1;
if ($temp =~ /(^d+\.\d+)/) { # this regex will take out the domains with number
my $foo = $1;
$foo = "OTHER";
$domain{$foo}++;
}
else {
$domain{$temp}++;
}
where $name will be something like:
something.something.72.154
something.something.72.155
something.something.72.173
something.something.72.175
something.something.73.194
something.something.73.205
something.something.73.214
something.something.abbnebraska.com
something.something.cableone.net
something.something.com.br
something.something.cox.net
something.something.googlebot.com
My code currently prints this:
72.175
73.194
73.205
73.214
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
lstn.net
but I want it to print like this:
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
OTHER
lstn.net
where OTHER is all the numbered domains, so any ideas how?
You really shouldn't need to split the variable into two, e.g. this regex will match the case you want to trap:
/\d{1,3}\.\d{1,3}$/ -- returns true if the string ends with two numbers of 1-3 digits each, separated by a dot
but if you only need to separate out the domains that are not numbered, you could just check whether the last character of the domain is a letter, because TLDs cannot contain numbers, so you would do something like
/[a-z]$/i -- if it returns true, it is not a numbered domain (provided you've stripped spaces and newlines; note that \w would not work here, because it also matches digits)
But I suppose it is better to be more specific in the regex, which also better illustrates the logic you are looking for in your script, so I'd use the former regex.
And actually you could do something like this:
if (my ($domain) = $name =~ /\.(\w+\.[a-z]+)$/i) # the last label must be letters, so numeric names fall through to else
{
#the domain is assigned to the variable $domain
} else {
#it is a number domain
}
Take what it currently outputs, and use the regex:
/\d+\.\d+/
If it matches this, then it's a pair of numbers, so remove it.
This way you'll be able to keep any words with numbers in them.
Please, please indent your code correctly, and use whitespace to separate out various bits and pieces. It'll make your code so much easier to read.
Interestingly, you mentioned that you're getting the wrong output, but the section of the code you post has no print, printf, or say statement. It looks like you're attempting to count up the various domain names.
If these are the value of $name, there are several issues here:
if ($temp =~ /(^d+\.\d+)/) {
Matches nothing. This is saying that your string starts with one or more letter d followed by a period followed by one or more digits. The ^ anchors your regular expression to the beginning of the string.
I think, but not 100% sure, you want this:
if ( $temp =~ /\d\.\d/ ) {
This will find all cases where there are two digits with a period in between them. Any string that matches /\d+\.\d+/ also contains a match for this shorter pattern, so both regular expressions accept the same strings.
The
$dom =~ /(\w+\.\w+)$/;
This matches at the end of $dom (because of the $ anchor): a run of one or more letters, digits, or underscores, then a period, then another such run. Is that what you want?
I also believe this may indicate an error of some sort:
my $foo = $1;
$foo = "OTHER";
$domain{$foo} ++;
This sets $foo to whatever the regex captured from $temp, but then immediately resets $foo to OTHER and increments $domain{OTHER}. The first assignment has no effect.
We need a sample of your initial data, and maybe the actual routine that prints your output.
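In the meantime, here is a minimal sketch of the counting logic the question seems to be after, assuming that "numbered domain" means the last two labels are both purely numeric. The sub name domain_key and the sample names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last two labels, or "OTHER" for numeric ones.
sub domain_key {
    my ($name) = @_;
    # Capture the last two dot-separated labels of the name.
    my ($tail) = $name =~ /(\w+\.\w+)$/;
    # No match, or both labels purely numeric: bucket under OTHER.
    return 'OTHER' if !defined $tail || $tail =~ /^\d+\.\d+$/;
    return $tail;
}

my %domain;
$domain{ domain_key($_) }++ for qw(
    something.something.72.154
    something.something.73.205
    something.something.cox.net
    something.something.com.br
);
print "$_ $domain{$_}\n" for sort keys %domain;
```

Note the \d (with a backslash) and the ^ ... $ anchors in the numeric test; the question's /(^d+\.\d+)/ looks for literal letter d characters instead.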