Remove string after finding specific character


My Perl program only removes the last three characters of the string. I am looking for a way to count the characters from the + onward and remove them with substr, or for a built-in Perl function that does this.
open my $hfile, $ARGV[0] or die "Can't open $ARGV[0] for reading: $!";
while( my $line = <$hfile> ){
if ($line =~ /+/){
$line = substr($line, -3);
print $line;
}
}
close $hfile;
Input file
hello_aba+32
gaww_ajnd_arhb+176
ajnbjsdsjn+416
Output file
hello_aba
gaww_ajnd_arhb
ajnbjsdsjn

This is a classic task for a perl one-liner. I would use some caution when deleting after a control character. First off, anchor it to the end of line. Second, make sure only expected characters are deleted.
Assuming the characters to delete 1) do not include plus signs +, and 2) do not include newlines, I would write:
perl -pe' s/\+[^+\n]*$// ' file.txt
[^ ... ] is a character class, and it is negated with ^ to mean "match any character that does not match what is inside".
A bare + with nothing before it is a quantifier with nothing to quantify, and Perl rejects it with "Quantifier follows nothing in regex", so it has to be escaped as \+ to match a literal plus. Requiring that the rest of the line does not contain + ensures that any extra plus signs do not cause us to lose data, e.g.
if foo = 2, then foo + bar = 4+123
# ^ first ^ second
Adding $ anchors the match to the end of the line. This prevents any extra + signs from messing up the result; without it, the substitution would delete the text between the first two plus signs found.
Since we do not delete the line endings \n, the file structure remains unchanged.
Demonstration:
$ cat plus.txt
hello_aba+32
gaww_ajnd_arhb+176
ajnbjsdsjn+416
foo + bar = 3+123
$ perl -pe' s/\+[^+\n]*$//' plus.txt
hello_aba
gaww_ajnd_arhb
ajnbjsdsjn
foo + bar = 3
If you want to change the input file, you can either use redirection:
$ perl -pe ..... > newfile.txt
Or add the -i switch to perform in-place edit:
$ perl -pi.bak -e ....
(.bak will create a backup file with extension .bak). Note that the original file is overwritten, so use caution.
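For use inside a script rather than on the command line, the same substitution can be wrapped in a small helper. This is only a sketch: the sub name strip_plus_suffix and the sample lines are made up for illustration and are not part of the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a trailing "+suffix" from a single line.
# The line ending is kept because \n is excluded from the character class.
sub strip_plus_suffix {
    my ($line) = @_;
    $line =~ s/\+[^+\n]*$//;
    return $line;
}

# Sample lines (assumed data): a normal case, a line with two plus
# signs, and a line with no plus sign at all.
for my $line ("hello_aba+32\n", "foo + bar = 3+123\n", "no_plus_here\n") {
    print strip_plus_suffix($line);
}
```

Only the text after the last plus sign is removed, and a line without a plus sign passes through unchanged.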

To remove anything after "+" in your lines, use a substitution regex:
$line =~ s/\+.*// && print $line;

Related

perl match consecutive newlines: `echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"`

This works:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n/z/gm"
aaazzzbbbz
This doesn't match anything:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"
aaa
bbb
How do I fix, so the regex matches two consecutive newlines?

A linefeed is matched by \n
echo "a\n\n\b" | perl -pe's/\n/z/'
This prints azzb, and without the following newline, so with the next prompt on the same line. Note that the program is fed one line at a time so there is no need for /g modifier. (And which is why \n\n doesn't match.) That /m modifier is then unrelated to this example.†
I don't know in what form this is used but I'd imagine not with echo feeding the input? Then better test it with input in a file, or in a multi-line string (in which case /g may be needed).
An example
use warnings;
use strict;
use feature 'say';
# Test with multiline string
my $ml_str = "a\n\nb\n";
$ml_str =~ s/\n/z/g; #--> azzbz (no newline at the end)
print $ml_str;
say ''; # to terminate the line above
# Or to replace two consecutive newlines (everywhere)
$ml_str = "a\n\nb\n"; # restore the example string
$ml_str =~ s/\n\n/z/g; #--> azb\n
print $ml_str;
# To replace the consecutive newlines in a file read it into a string
my $file = join '', <DATA>; # lines of data after __DATA__
$file =~ s/\n\n/z/g;
print $file;
__DATA__
one
two
last
This prints
azzbz
azb
one
twoz
last
As a side note, I'd like to mention that with the modifier /s the . matches a newline as well. (For example, this is handy for matching substrings that may contain newlines by .* (or .+); without /s modifier that pattern stops at a newline.)
See perlrebackslash and search for newline.
† The /m modifier makes ^ and $ also match beginning and end of lines inside a multi-line string. Then
$multiline_string =~ s/$/z/mg;
will add z at the end of every line in the string. Since $ is a zero-width anchor it does not consume the newlines, so they all stay in place.

You are applying substitution to only one line at a time, and one line will never have two newlines. Apply the substitution to the entire file instead:
perl -0777 -pe 's/\n\n/z/g'
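The difference between line-by-line and slurped processing can also be seen inside a script. The sketch below is an illustration under assumptions: the sub names are invented, and an in-memory filehandle stands in for a real input file.

```perl
use strict;
use warnings;

# Line by line: each record ends at the first newline, so a record
# never contains two newlines in a row and \n\n cannot match.
sub replace_per_line {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    my $out = '';
    while (my $line = <$fh>) {
        $line =~ s/\n\n/z/g;
        $out .= $line;
    }
    close $fh;
    return $out;
}

# Slurped: with $/ undefined the whole input is one string,
# which is what -0777 does on the command line.
sub replace_slurped {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/;
    my $all = <$fh>;
    close $fh;
    $all =~ s/\n\n/z/g;
    return $all;
}

my $input = "aaa\n\n\nbbb\n";
print replace_per_line($input);   # unchanged
print replace_slurped($input);    # first pair of newlines becomes z
```

Note that three consecutive newlines contain only one non-overlapping pair, so one newline is left after the z.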

Repeating regex pattern

I have a string such as this
word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>
where there are one or more words enclosed in tags. In those instances where there is more than one word (the words are usually separated by - or = and potentially other non-word characters), I'd like to make sure that the tags enclose each word individually so that the resulting string would be:
word <gl>aaa</gl> word <gl>aaa</gl>-<gl>bbb</gl>=<gl>ccc</gl>
So I'm trying to come up with a regex that would find any number of iterations of \W*?(\w+) and then enclose each word individually with the tags. And ideally I'd have this as a one-liner that I can execute from the command line with perl, like so:
perl -pe 's///g;' in out
This is how far I've gotten after a lot of trial and error and googling - I'm not a programmer :( ... :
/<gl>\W*?(\w+)\W*?((\w+)\W*?){0,10}<\/gl>/
It finds the first and last word (aaa and ccc). Now, how can I make it repeat the operation and find other words if present? And then how to get the replacement? Any hints on how to do this or where I can find further information would be much appreciated?
EDIT:
This is part of a workflow that does some other transformations within a shell script:
#!/bin/sh
perl -pe '#
s/replace/me/g;
s/replace/me/g;
' $1 > tmp
... some other commands ...

This needs a mini nested-parser and I'd recommend a script, as easier to maintain
use warnings;
use strict;
use feature 'say';
my $str = q(word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>);
my $tag_re = qr{(<[^>]+>) (.+?) (</[^>]+>)}x; # / (stop markup highlighter)
$str =~ s{$tag_re}{
my ($o, $t, $c) = ($1, $2, $3); # open (tag), text, close (tag)
$t =~ s/(\w+)/$o$1$c/g;
$t;
}ge;
say $str;
The regex gives us its built-in "parsing," where words that don't match the $tag_re are unchanged. Once the $tag_re is matched, it is processed as required inside the replacement side. The /e modifier makes the replacement side be evaluated as code.
One way to provide input for a script is via command-line arguments, available in the global @ARGV array in the script. For the use indicated in the question's "Edit", replace the hardcoded
my $str = q(...);
with
my $str = shift @ARGV; # first argument on the command line
and then use that script in your shell script as
#!/bin/sh
...
script.pl $1 > output_file
where $1 is the shell variable as shown in the "Edit" to the question.
In a one-liner
echo "word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>" |
perl -wpe'
s{(<[^>]+>) (.+?) (</[^>]+>)}
{($o,$t,$c)=($1,$2,$3);$t=~s/(\w+)/$o$1$c/g; $t}gex;
'
which in your shell script becomes echo $1 | perl -wpe'...' > output_file. Or you can change the code to read from @ARGV, drop the -p switch, and add a print
#!/bin/sh
...
perl -wE'$_=shift; ...; say' $1 > output_file
where ... in one-liner indicate the same code as above, and say is now needed since we don't have the -p with which the $_ is printed out once it's processed.
The shift takes an element off of an array's front and returns it. Without an argument it does that to @ARGV when outside a subroutine, as here (inside a subroutine its default target is @_).
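The two default targets of shift can be seen side by side in a short sketch. The sub name first_arg and the placeholder arguments are assumptions made for the example.

```perl
use strict;
use warnings;

# Inside a subroutine, a bare shift takes the first element of @_.
sub first_arg {
    my $first = shift;
    return $first;
}

# Outside any subroutine, a bare shift takes the first element of @ARGV.
# @ARGV is filled in by hand here only so the sketch runs without arguments.
@ARGV = ('input.txt', 'extra') unless @ARGV;
my $from_cmdline = shift;

print "from \@ARGV: $from_cmdline\n";
print "from \@_: ", first_arg('alpha', 'beta'), "\n";
```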

This will do it:
s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
The /g at the end is the repeat and stands for "global". It will pick up matching at the end of the previous match and keep matching until it doesn't match anymore, so we have to be careful about where the match ends. That's what the (?=...) is for. It's a "followed by pattern" that tells the repeat to not include it as part of "where you left off" in the previous match. That way, it picks up where it left off by re-matching the second "word".
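To see why the lookahead matters, compare it with a version that consumes the second word. This is a sketch for illustration; the sub names are invented, and the second pattern is a deliberately flawed variant, not something proposed in the answer.

```perl
use strict;
use warnings;

# With a lookahead the second word is only checked, not consumed,
# so the next /g iteration can start matching at that word.
sub split_with_lookahead {
    my ($text) = @_;
    $text =~ s/(\w+)([-=])(?=\w+)/$1<\/gl>$2<gl>/g;
    return $text;
}

# Flawed variant: the second word is consumed by the match, so the
# separator that follows it has no word left in front of it.
sub split_consuming {
    my ($text) = @_;
    $text =~ s/(\w+)([-=])(\w+)/$1<\/gl>$2<gl>$3/g;
    return $text;
}

my $inner = 'aaa-bbb=ccc';
print split_with_lookahead($inner), "\n";   # aaa</gl>-<gl>bbb</gl>=<gl>ccc
print split_consuming($inner), "\n";        # aaa</gl>-<gl>bbb=ccc
```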
The s/ at the beginning is a substitution, so the command would be something like:
perl -pe 's/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g' in > out
No explicit print is needed: the -p switch prints $_ after each line has been processed, and s///g modifies $_ in place (its return value, the number of substitutions made, is simply discarded).
This will only match one line. If your pattern spans multiple lines, you'll need some fancier code. It also assumes the XML is correct and that there are no words surrounding dashes or equals signs outside of tags. To account for this would necessitate an extra pattern match in a loop to pull out the values surrounded by gl tags so that you can do your substitution on just those portions, like:
my $e = $in;
while($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g){
my $p = $1;
my $s = $2;
$e = $'; # ' (stop markup highlighter); saved before the next match overwrites it
print($p);
$s =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
print($s);
}
print($e);
You'd have to write your own surrounding loop to read STDIN and put the lines read in into $in. (You would also need to not use -p or -n flags to the perl interpreter since you're reading the input and printing the output manually.) The while loop above however grabs everything inside the gl tags and then performs your substitution on just that content. It prints everything occurring between the last match (or the beginning of the string) and before the current match ($p) and saves everything after in $e which gets printed after the last match outside the loop.
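A complete version of that idea might look like the following sketch. It makes a few assumptions: the sub name retag_line is invented, it builds and returns a string instead of printing piece by piece, it uses pos() to find the unmatched tail, and a hardcoded sample line stands in for the lines that would normally come from STDIN.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each word inside <gl>...</gl> in its own tags.
sub retag_line {
    my ($in) = @_;
    my $out  = '';
    my $rest = $in;    # returned unchanged if the line has no <gl> tag
    while ($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g) {
        my ($prefix, $inner) = ($1, $2);
        # Everything after the current match; the closing tag is part
        # of it because the lookahead does not consume it.
        $rest = substr($in, pos($in));
        $inner =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
        $out .= $prefix . $inner;
    }
    return $out . $rest;
}

# Sample data; in real use this would be: print retag_line($_) while <STDIN>;
my @sample = ("word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>\n");
print retag_line($_) for @sample;
```

Words joined by - or = outside the tags are left alone, because the inner substitution only ever sees the text between an opening and a closing tag.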

How to remove the whitespaces in fasta file using perl?

My fasta file
>1a17_A a.118.8 TPR-like
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS
Else try this http://www.ncbi.nlm.nih.gov/nuccore/?term=keratin for fasta files.
open(fas,'d:\a4.fas');
$s=<fas>;
@fasta = <fas>;
@r1 = grep{s/\s//g} @fasta; # This does not remove the whitespace
@r2 = grep{s/(\s)$//g} @fasta; # This is not working
@r3 = grep{s/.$//g} @fasta; # This removes the last character, but not the last space
print "@r1\n@r2\n@r3\n";
This code gives the following output:
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS LAYLRT
ECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAG
DEHKRSVVDSLDIES MTIEDEYS
I expected the whitespace to be removed from line two onward. How can I do that?

Using perl one liner,
perl -i -pe 's|[ \t]||g' a4.fas
removing all white spaces, including new lines,
perl -i -pe 's|\s||g' a4.fas

use strict;
use warnings;
while(my $line = <DATA>) {
$line =~ s/\s+//g;
print $line;
}
__DATA__
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS

grep is the wrong choice to make changes to an array. It filters the elements of the input array, passing as output only those elements for which the expression in the braces { .. } is true.
A substitution s/// is true unless it made no changes to the target string, so of your grep statements,
@r1 = grep { s/\s//g } @fasta
This removes all whitespace, including newlines, from the strings in @fasta. It puts in @r1 only those elements that originally contained whitespace, which is probably all of them as they all ended in newline.
@r2 = grep { s/(\s)$//g } @fasta
Because of the anchor $, this removes the character before the newline at the end of the string if it is a whitespace character. It also removes the newline. Any whitespace before the end of the string is untouched. It puts in @r2 only those elements that end in whitespace, which is probably all of them as they all ended in newline.
@r3 = grep { s/.$//g } @fasta;
This removes the character before the newline, whether it is whitespace or not. It leaves the newline, as well as any whitespace before the end. It puts in @r3 only those elements that contain more than just a newline, which again is probably all of them.
I think you want to retain the newlines (which are normally considered as whitespace).
This example will read the whole file, apart from the header, into the variable $data, and then use tr/// to remove spaces and tabs.
use strict;
use warnings;
use 5.010;
use autodie;
my $data = do {
open my $fas, '<', 'D:\a4.fas';
<$fas>; # Drop the header
local $/;
<$fas>;
};
$data =~ tr/ \t//d;
print $data;
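The filtering behaviour of grep described above is easy to check on a couple of made-up lines. In this sketch the sample data and the sub name strip_blanks are assumptions, and the pattern is narrowed to [ \t] so that the trailing newline does not make every element match.

```perl
use strict;
use warnings;

# Returns cleaned copies of its arguments; the originals are left alone.
sub strip_blanks {
    return map { my $copy = $_; $copy =~ tr/ \t//d; $copy } @_;
}

my @fasta = ("PADG ALKR\n", "LAYLRTEC\n");   # assumed sample lines

# grep keeps only the elements for which the substitution changed
# something, and it edits the elements of @scratch in place.
my @scratch  = @fasta;
my @filtered = grep { s/[ \t]//g } @scratch;

# map over copies returns every line, cleaned.
my @cleaned = strip_blanks(@fasta);

print "grep kept ", scalar(@filtered), " of ", scalar(@fasta), " lines\n";
print @cleaned;
```

The second line contains no blank, so grep drops it from its result even though nothing is wrong with it; map returns both lines.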

Per perlrecharclass:
\h matches any character considered horizontal whitespace; this includes the platform's space and tab characters and several others listed in the table below. \H matches any character not considered horizontal whitespace. They use the platform's native character set, and do not consider any locale that may otherwise be in use.
Therefore the following will display your file with horizontal spacing removed:
perl -pe "s|\h+||g" d:\a4.fas
If you don't want to display the header, just add a condition with $.
perl -ne "s|\h+||g; print if $. > 1" d:\a4.fas
Note: I used double quotes in the above commands since your D:\ volume implies you're likely on Windows.

Regex question!

I'm not too familiar with regex but I know what I need to find-
I have a long list of data separated by newlines, and I need to delete all the lines of data that contain a string "(V)". The lines are of variable length, so I guess something to do with selecting everything between two newline characters if there's a (V) inside?

Try searching for this regular expression:
^.*\(V\).*$
Explanation:
^ start of line
.* any characters apart from new line
\( open parenthesis (escaped to avoid special behaviour)
V V
\) close parenthesis (escaped to avoid special behaviour)
.* any characters apart from new line
$ end of line (not strictly needed here, included only for clarity)
Depending on your language you may need to add delimiters such as / and/or quotes " around the regular expression and you may need to enable multiline mode.
Here's an online example showing it working: Rubular
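In Perl, the expression above could be applied to a multi-line string roughly as follows. This is a sketch under assumptions: the sub name and sample data are invented, the /m modifier supplies the multiline mode, and an optional \n is matched in place of the final $ so that the line break is deleted together with the line.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every line that contains the literal "(V)".
sub delete_marked_lines {
    my ($text) = @_;
    # /m lets ^ match at the start of every line; /g repeats the deletion.
    $text =~ s/^.*\(V\).*\n?//mg;
    return $text;
}

my $data = "alpha\nbeta (V) gamma\ndelta\n(V)\n";   # assumed sample data
print delete_marked_lines($data);
```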

If the data is indeed rather large, then running a single regex against the whole string would be a bad idea. Instead, a simple solution like this Perl script could work for you:
open my $fh, '<', 'data.txt' or die $!;
while (my $line = <$fh>) {
if ($line =~ m/\(V\)/) {
next;
}
print $line;
}
close $fh;
This script reads the data file one line at a time and prints the lines that do not contain "(V)" to stdout. (You obviously could replace the "print" with a different data processing task)

Use the UNIX command grep, if you have access to such a system.
$ grep -v '(V)' data.txt
Grep matches all lines containing "(V)" in data.txt, and shows only the lines not matching (-v).

How can I match end-of-line multiple times in a regex without interpolation?

if I have a input with new lines in it like:
[INFO]
xyz
[INFO]
How can I pull out the xyz part using $ anchors? I tried a pattern like /^\[INFO\]$(.*?)$\[INFO\]/ms, but perl gives me:
Use of uninitialized value $\ in regexp compilation at scripts\t.pl line 6.
Is there a way to shut off interpolation so the anchors work as expected?
EDIT: The key is that the end-of-line anchor is a dollar sign but at times it may be necessary to intersperse the end-of-line anchor through the pattern. If the pattern is interpolating then you might get problems such as uninitialized $\. For instance an acceptable solution here is /^\[INFO\]\s*^(.*?)\s*^\[INFO\]/ms but that does not solve the crux of the first problem. I've changed the anchors to be ^ so there is no interpolation going on, and with this input I'm free to do that. But what about when I really do want to reference EOL with $ in my pattern? How do I get the regex to compile?

The question is academic--there's no need for the $ anchors in your regex anyway. You should be using \n to match the newlines, because the $ only matches the gap between the linefeed and the character before it.
EDIT: What I'm trying to say is that you will never need to use $ that way. Any match that spans from one line to the next will have to consume the line separator somehow. Consider your example:
/^\[INFO\]$(.*?)$\[INFO\]/ms
If this did compile, the (.*?) would start out by consuming the first linefeed and keep going until it had matched \nxyz, where the second $ would succeed. But the next character is a linefeed, and the regex is looking for [, so that doesn't work. After backtracking, the (.*?) would reluctantly consume one more character--the second linefeed--but then the $ would fail.
Any time you try to match an EOL with $ and then some more stuff, the first "stuff" you'll have to match will be the linefeed, so why not match that instead? That's why the Perl regex compiler tries to interpret $\ as a variable name in your regex: it makes no sense to have an end-of-line anchor followed by a character that's not a line separator.

Based on the answer in perlfaq6 - How can I pull out lines between two patterns that are themselves on different lines? , here's what a one-liner would look like:
perl -0777 -ne 'print $1,"\n" while /\[INFO\]\s*(.*?)\s*\[INFO\]/sg' file.txt
The -0777 switch slurps in the whole file at once.
However, if you're after a subroutine that gives you the flexibility to choose what tag you want to extract, the File::Slurp module makes things a little easier:
use strict;
use warnings;
use File::Slurp qw/slurp/;
sub extract {
my ( $tag, $fileName ) = @_;
my $text = slurp $fileName;
my ($info) = $text =~ /$tag\s*(.*?)\s*$tag/sg;
return $info;
}
# Usage:
extract ( qr/\[INFO\]/, 'file.txt' );

When regexes get too tricky, they probably are the wrong tool. I might consider using the flip flop operator here. It's false until its lefthand side is true, then stays true until its righthand side is true. That way, you can choose where to start and end the extraction just by looking at individual lines:
my $string = <<'HERE';
[INFO]
xyz
[INFO]
HERE
open my $string_fh, '<', \$string;
while( <$string_fh> )
{
next if /\[INFO]/ .. /\[INFO]/;
chomp;
print "Extracted <$_>\n";
}
If you are using Perl 5.10, you can use the generalized line ending \R in a regex:
use 5.010;
my $string = <<'HERE';
[INFO]
xyz
[INFO]
HERE
my( $extracted ) = $string =~ /(?:\A|\R)\[INFO]\R(.*?)\R\[INFO]\R/;
print "Extracted <$extracted>\n";
Don't get hung up on the end-of-line anchor.

Maybe the /x modifier can help:
m/ ^\[INFO\] $ # Match INFO line
\n
^ (.*?) $ # Collect desired line
\n
^ \[INFO\] # Match another INFO line
/xms
I haven't tested that, so you'd probably have to debug it. But I think this will prevent the $ symbols from interpolating as variables.

Although I've accepted Alan Moore's answer (Ryan Thompson's answer would also have done the trick; too bad I could only accept one), I wanted to make the solution perfectly clear, as it was somewhat buried in the comments and discussion. The following Perl script demonstrates that Perl treats $ as the start of a variable to interpolate when certain characters follow the dollar sign (here the backslash, giving $\), and that turning off interpolation allows the $ to be treated as EOL.
use strict;
use warnings;
my $x = "[INFO]\nxyz\n[INFO]";
if( $x =~ /^\[INFO\]$\n(.*?)$\n\[INFO\]/m ) {
print "'$1' FOUND\n";
} else {
print "NO MATCH FOUND\n";
}
if( $x =~ m'^\[INFO\]$\n(.*?)$\n\[INFO\]'m ) {
print "'$1' FOUND\n";
} else {
print "NO MATCH FOUND\n";
}
if( $x =~ m/ ^\[INFO\] $ # Match INFO line
\n
^ (.*?) $ # Collect desired line
\n
^ \[INFO\] # Match another INFO line
/xms ) {
print "'$1' FOUND\n";
} else {
print "NO MATCH FOUND\n";
}
The script produces the following output:
Use of uninitialized value $\ in regexp compilation at t.pl line 5.
Use of uninitialized value $\ in regexp compilation at t.pl line 5.
NO MATCH FOUND
'xyz' FOUND
'xyz' FOUND
