How can I remove a substring using regexp in Perl?

Given a file containing multiple strings (one in each line) - among them there will be the following two strings:
de12_QA_IR_OS_HV
de12_IR_OS_HV
(the only difference is "QA_" in the right position).
I need to perform some action if the current line being handled contains one of the above.
If it does, I should use the string value without the "QA_" substring.
I am using the regexp /de12_(QA_)?IR_OS_HV/ to detect the values, but is there a way to remove the "QA_", if it exists, using the same regexp?

You can capture what's before and after the QA_:
if (/(de12_)(QA_)?(IR_OS_HV)/) {
print $1 . $3, "\n";
}
Or use substitution:
if (s/de12_(QA_)?IR_OS_HV/de12_IR_OS_HV/) {
print $_, "\n";
}
But, in fact, you already know which string to use if the regex matches:
if (/de12_(QA_)?IR_OS_HV/) {
print "de12_IR_OS_HV\n";
}
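If the goal is to delete only the "QA_" part with a single regex, one further option (not taken from the answers above) is to combine \K with a lookahead so that nothing except "QA_" is part of the replaced text. This is a minimal sketch, assuming Perl 5.10 or later; the helper name strip_qa is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the line with "QA_" removed when it sits
# between "de12_" and "IR_OS_HV"; any other line comes back unchanged.
sub strip_qa {
    my ($line) = @_;
    # \K drops "de12_" from the matched text, and the lookahead checks for
    # "IR_OS_HV" without consuming it, so only "QA_" is replaced.
    $line =~ s/de12_\KQA_(?=IR_OS_HV)//;
    return $line;
}

print strip_qa($_), "\n" for qw(de12_QA_IR_OS_HV de12_IR_OS_HV de12_QA_other);
```

Because the substitution returns a true value only when it changed something, the original /de12_(QA_)?IR_OS_HV/ test is still needed if lines without "QA_" should also trigger the action.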

Related

Using regex to find data in between certain data [duplicate]

How can I extract a substring from within a string in Ruby?
Example:
String1 = "<name> <substring>"
I want to extract substring from String1 (i.e. everything within the last occurrence of < and >).
"<name> <substring>"[/.*<([^>]*)/,1]
=> "substring"
No need to use scan, if we need only one result.
No need to use Python's match, when we have Ruby's String[regexp,#].
See: http://ruby-doc.org/core/String.html#method-i-5B-5D
Note: str[regexp, capture] → new_str or nil
String1.scan(/<([^>]*)>/).last.first
scan creates an array which, for each <item> in String1 contains the text between the < and the > in a one-element array (because when used with a regex containing capturing groups, scan creates an array containing the captures for each match). last gives you the last of those arrays and first then gives you the string in it.
You can use a regular expression for that pretty easily…
Allowing spaces around the word (but not keeping them):
str.match(/< ?([^>]+) ?>\Z/)[1]
Or without the spaces allowed:
str.match(/<([^>]+)>\Z/)[1]
Here's a slightly more flexible approach using the match method. With this, you can extract more than one string:
s = "<ants> <pants>"
matchdata = s.match(/<([^>]*)> <([^>]*)>/)
# Use 'captures' to get an array of the captures
matchdata.captures # ["ants","pants"]
# Or use raw indices
matchdata[0] # whole regex match: "<ants> <pants>"
matchdata[1] # first capture: "ants"
matchdata[2] # second capture: "pants"
A simpler scan would be:
String1.scan(/<(\S+)>/).last

Search and replace sub-patterns using regex

I'm trying to use regular expressions to search/replace sub-patterns but I seem to be stuck. Note: I'm using TextWrangler on OSX to complete this.
SCENARIO:
Here is an example of a complete match:
{constant key0="variable/three" anotherkey=$variable.inside.same.match key2="" thirdkey='exists'}
Each match will always:
start with the following: {constant key0=
terminate with a single curly brace: }
contain one or more key=value pairs
the key of the first pair is constant (in this case, the key is key0)
the value of the first pair is variable (in this case, the value is "variable/three")
additional pairs, if any, are separated by whitespace
Here's an example of what a minimal (but complete) match would look like (with only one key=value pair):
{constant key0="first/variable/example"}
Here's another example of a valid match, but with trailing whitespace after the last (and only) key=value pair:
{constant key0="same/as/above/but/with/whitespace/after/quote" }
GOAL:
What I need to be able to do is extract each key and each value from each match and then rearrange them. For example, I might need the following:
{constant key0="variable/4" variable_key_1="yes" variable_key_2=0}
... to look like this after all is said and done:
$variable_key_1 = "yes"; $variable_key_2 = 0; {newword "variable/4"}
... where
a $ has been added to the extracted keys
spaces have been added between each key=value pair's =
a ; has been appended to each extracted value
the word constant has been changed to newword, and
key0= has been removed completely.
Here are some examples of what I've tried (note that the first one actually works, but only when there is exactly one key/value pair):
Search:
(\{constant\s+key0=\s*)([^\}\s]+)(\s*\})
Replace:
{newword \2}
Search:
(\{constant\s+key0=)([^\s]+)(([\s]+[^\s]+)([\s]*=\s*)([^\}]+)+)(\s*\})
Replace:
I wasn't able to come up with a good way to replace the output of this one.
Any help would be most appreciated.
Because of the nature of this match, it's actually three different regexes—one to figure out what the match is, and two others to process the matches. Now, I don't know how you intend to escape the quotes, so I'll give one for each common escapement system.
Without further ado, here's the set for the backslash escapement system:
Find:
\{constant\s+key0=([^\s"]\S*|"(\\.|[^\\"])*")(\s+[^\s=]+=([^\s"]\S*|"(\\.|[^\\"])*"))*\s*\}
Search 1:
(?<=\s)([^\s=]+)=([^\s"]\S*|"(\\.|[^\\"])*")(?=.*\})
Replace 1:
$1 = $2;
Search 2:
^\{constant\s+key0 = ([^\s"]\S*|"(\\.|[^\\"])*");\s*(?=\S)(.*)\}
Replace 2:
$2 {newword $1}
Now the URL/XML/HTML escapement system, much easier to parse:
Find:
\{constant\s+key0=([^\s"]\S*|"[^"]*")(\s+[^\s=]+=([^\s"]\S*|"[^"]*"))*\s*\}
Search 1:
(?<=\s)([^\s=]+)=([^\s"]\S*|"[^"]*")(?=.*\})
Replace 1:
$1 = $2;
Search 2:
^\{constant\s+key0 = ([^\s"]\S*|"[^"]*");\s*(?=\S)(.*)\}$
Replace 2:
$2 {newword $1}
Hope this helps.

A regular expression for one only character

I have a big file with lines like the following (tab separated):
220995146 A G 1/1:8:0:0:8:301:-5,-2.40824,0 pass
221020849 G GGAGAGGCA 1/1:8:0:0:8:229:-5,-2.40824,0 pass
I'm trying to write a conditional statement that will let me keep only the lines whose second and third columns each contain exactly one character.
For example, the second line doesn't pass.
The regex that I'm using is:
if (($ref =~ m/\w{1}/) && ($allele =~ m/\w{1}/)) {
print "$mline\n";
}
But unfortunately it doesn't work.
Any suggestions?
Thank you very much in advance.
Regex is not needed here, you can use the length function:
if (length($ref) == 1 && length($allele) == 1) {
print $mline,"\n";
}
I assume that $allele contains the third column. In your code, $allele =~ m/\w{1}/ only checks whether it contains a word character somewhere. Instead, you want to match the whole value. You can do this with the ^ (start) and $ (end) anchors:
$allele =~ m/^\w{1}$/
Or just
$allele =~ /^\w$/
If you're looking for pure regex solution then use:
my $re = qr/^[^\t]+\t+\w\t+\w\t+.*$/;
This will match lines where 2nd and 3rd columns have single word character by using \w after 1 or more tabs at 2nd and 3rd position.
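The length-based check can also be applied to a whole line by splitting it on tabs first. The following is a rough sketch rather than the asker's actual script: the helper name single_base_line is hypothetical, and the sample lines are the two rows from the question with their fields joined by tabs.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the 2nd and 3rd tab-separated fields
# are each exactly one character long.
sub single_base_line {
    my ($line) = @_;
    chomp $line;
    my (undef, $ref, $allele) = split /\t/, $line;
    return defined($ref) && defined($allele)
        && length($ref) == 1 && length($allele) == 1;
}

my @lines = (
    "220995146\tA\tG\t1/1:8:0:0:8:301:-5,-2.40824,0\tpass",
    "221020849\tG\tGGAGAGGCA\t1/1:8:0:0:8:229:-5,-2.40824,0\tpass",
);

# Only lines that pass the check are printed.
print "$_\n" for grep { single_base_line($_) } @lines;
```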

Replace text between START & END strings excluding the END string in perl

I was going through examples and questions on the web about finding and replacing text between two strings (say START and END) using Perl, and I was successful with the provided solutions. In that case, START and END are themselves part of the replaced text. The syntax I used was s/START(.*?)END/replace_text/s, which replaces multiple lines between START and END but stops at the first occurrence of END.
But I would like to know how to replace the text between START and END while excluding END, like this, in Perl only.
Before:
START
I am learning patten matching.
I am using perl.
END
After:
Successfully replaced.
END
To check for END without including it in the match, you can use a positive lookahead:
s/START.*?(?=END)/replace_text/s
One solution is to capture the END and use it in the replacement text.
s/START(.*?)(END)/replace_text$2/s
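Both substitutions can be tried on the sample from the question. This is a small sketch that assumes the whole text is already held in one string; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "START\nI am learning patten matching.\nI am using perl.\nEND\n";

# Lookahead version: END is checked but not consumed, so it stays in place.
(my $with_lookahead = $text) =~ s/START.*?(?=END)/Successfully replaced.\n/s;

# Capture version: END is consumed by the match and put back via $1.
(my $with_capture = $text) =~ s/START.*?(END)/Successfully replaced.\n$1/s;

print $with_lookahead;
```

The /s modifier is what lets . match the newlines between START and END, and the lazy .*? stops at the first END rather than the last one.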
Another option is to use the range operator .. to skip every line of input until the end marker of a block is found, then output the replacement string and the end marker:
#!/usr/bin/perl
use strict;
use warnings;
my $rep_str = 'Successfully replaced.';
while (<>) {
my $switch = m/^START/ .. /^END/;
print unless $switch;
print "$rep_str\n$_" if $switch =~ m/E0$/;
}
It is quite easy to adapt it to work for an array of string:
foreach (@strings) {
my $switch = ...
...
}
To use look-around assertions you need to redefine the input record separator ($/) (see perlvar), perhaps to slurp the whole file into memory. To avoid this, the range ("flip-flop") operator is quite useful:
while (<>) {
if (/^START/../^END/) {
next unless m{^END};
print "substituted_text\n";
print;
}
else {
print;
}
}
The above preserves any lines in the output that precede or follow the START/END block.

Matching numbers for substitution in Perl

I have this little script:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
s/(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The expected output would be
5.txt
12.txt
1.txt
But instead, I get
R3_05.txt
T3_12.txt
1.txt
The last one is fine, but I cannot fathom why the regex gives me the start of the string along with $1 in this case.
Try this pattern
foreach (@list) {
s/^.*?_?(?|0(\d)|(\d{2})).*\.txt$/$1.txt/;
print $_ . "\n";
}
Explanations:
I use the branch reset feature here (i.e. (?|...()...|...()...)), which lets capturing groups in different alternatives share a single reference ($1 here). This avoids a second substitution to trim the leading zero from the capture.
To remove everything before the number, I use:
.*? # any character, zero or more times
# (the ? makes the * quantifier lazy, so it matches as little as possible)
_? # an optional underscore
Note that you can make sure there are only 2 digits by adding a negative lookahead that checks that no digit follows:
s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
(?!\d) means not followed by a digit.
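The branch reset pattern can be checked against the list from the question. This is a minimal sketch, assuming Perl 5.10 or later (required for (?|...)); the @renamed array is only there to keep the originals untouched.

```perl
use strict;
use warnings;

my @list = ('R3_05_foo.txt', 'T3_12_foo_bar.txt', '01.txt');
my @renamed;

for my $name (@list) {
    # Both alternatives of (?|...) store their capture in $1:
    # "05" yields "5", while "12" yields "12".
    (my $new = $name) =~ s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
    push @renamed, $new;
}

print "$_\n" for @renamed;
```

Because the pattern is anchored with ^ and $, the whole file name is matched and replaced, so the R3_ and T3_ prefixes cannot survive the substitution.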
The problem here is that your substitution regex does not cover the whole string, so only part of the string is substituted. But you are using a rather complex solution for a simple problem.
It seems that what you want is to read two digits from the string, and then add .txt to the end of it. So why not just do that?
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
for (@list) {
if (/(\d{2})/) {
$_ = "$1.txt";
}
}
To overcome the leading zero effect, you can force a conversion to a number by adding zero to it:
$_ = 0+$1 . ".txt";
I would modify your regular expression. Try using this code:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
s/.*(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The problem is that the pattern in your s/// matches what you think it does, but the replacement does not replace what you think it should: s/// only replaces the text that was actually matched. Thus, to get rid of something like T3_, you have to match that too.
s/.*(\d{2}).*\.txt$/$1.txt/;