I am trying to reverse engineer a Perl script. One of the lines contains a matching operator that reads:
$line =~ /^\s*^>/
The input is just FASTA sequences with header information. The script is looking for a particular pattern in the header, I believe.
Here is an example of the files the script is applied to:
>mm9_refGene_NM_001252200_0 range=chr1:39958075-39958131 5'pad=0 3'pad=0 strand=+
repeatMasking=none
ATGGCGAACGACTCTCCCGCGAAGAGCCTGGTGGACATTGACCTGTCGTC
CCTGCGG
>mm9_refGene_NM_001252200_1 range=chr1:39958354-39958419 5'pad=0 3'pad=0 strand=+
repeatMasking=none
GACCCTGCTGGGATTTTTGAGCTGGTGGAAGTGGTTGGAAATGGCACCTA
TGGACAAGTCTATAAG
This is a matching operator asking whether the line, from its beginning, contains white spaces of at least more than zero, but then I lose its meaning.
This is how I have parsed the regex so far:
from beginning [ (/^... ], contains white spaces [ ...\s... ] of at least more than zero [ ...*... }.
Using RegexBuddy (or, as r3mus said, regex101.com, which is free):
Assert position at the beginning of the string «^»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert position at the beginning of the string «^»
Match the character “>” literally «>»
EDIT: Birei's answer is probably more correct if the regex in question is actually wrong.
You have to get rid of the second ^ character. It is a metacharacter and means the beginning of a line (without special flags like /m), but that meaning it's already achieved with the first one.
The character > will match at the beginning of the line without the second ^ because the initial whitespace is optional (* quantifier). So, use:
$line =~ /^\s*>/
It is much easier to reverse engineer perl script with debugger.
"perl -d script.pl" or if you have Linux ddd: "ddd cript.pl &".
For multiline regex this regex match for emptyline with spaces and begin of the next FASTA.
http://www.rexfiddle.net/c6locQg
Related
I'm working through the one liner book and came across
perl -pe 's/$/\n/' file
which inserts a blank line after each line by setting the end of the line to new line thus adding a new line to the existing newline resulting in a blank line.
As this is the first example without g at the end of the pattern, I tried
perl -pe 's/$/\n/g' file
this results in 2 blank lines between lines.
I would have expected no difference since there is only one $ per line so replacing all of them should be the same as replacing just the first one.
What's going on here?
/$/ matches the “end of string”. This might be
the end of string (like /\z/),
or just before a newline before the end of string (like /(?=\n\z)/).
(Additionally, /$/m matches the “end of line”. This might be
the end of string,
or just before a newline (like /(?=\n)/).
).
With your substitution /$/\n/g, the regex matches twice: once before the newline, then again at the end of string:
The first match is before the newline:
"foo\n"
# ^ match
A newline is placed before the current match end:
"foo\n\n"
# ^ insert before
The next match is at the end of string:
"foo\n\n"
# ^ match
A newline is inserted before the current match end:
"foo\n\n\n"
# ^ insert before
No further match is found.
The solution: if $ is to DWIMmy for you, always match \z or \n explicitly, possibly together with lookaheads like (?=\n). Consider matching all Unicode line separators \R instead of just \n.
This isn't a sound understanding of the situation. $ is a badly-defined and unintuitive metacharacter
It is a zero-width match
It will match before a newline character at the end of the bound string
It will match at the end of the bound string
With the /m modifier in place, it will also match before any newline character anywhere, but not immediately after it unless it is the last character of the string
\z is much more useful: it only ever matches at the end of the string
"by setting the end of the line to new line"
Mentioning "lines" at all is misleading, and you should be careful to explain in comments what meaning you're applying. If you have
my $s = "xxx\n"
then
say pos($s) while $s =~ /$/g
will produce
3
4
i.e. both before and after the newline, because it happens to be at the end of the string
This is also why your s/$/\n/g adds two newlines: there are two zero-width matches for /$/ within this string, and a global substitution finds them and replaces them both with a newline, resulting in three newlines instead of the original one
It's unclear what you intended
Adding a newline to the end of a string, regardless of what's there already is s/\z/\n/ or just $s .= "\n"
If you want to ensure that, say, there are exactly two newlines at the end of a string, then just remove any existing linefeeds first with s/\n+\z/\b\n/
As you can see, \z is much more useful than $
And don't forget \R if you're dealing with cross-platform data. It will match any standard line terminator: any of CR, LF or CRLF
If this still leaves you with a problem then please ask again. I was going to write about zero-width matches but it's hard to know whether my answer is clear without it
I have found one method, but I don't understand the principle:
#remove lines starting with //
$file =~ s/(?<=\n)[ \t]*?\/\/.*?\n//sg;
How does (?<=\n)[ \t]*? work?
The critical piece is the lookbehind (?<=...). It is a zero-width assertion, what means that it does not consume its match -- it only asserts that the pattern given inside is indeed in the string, right before the pattern that follows it.
So (?<=\n)[ \t] matches either a space or a tab, [ \t], that has a newline before it. With the quantifier, [ \t]*, it matches a space-or-tab any number of times (possibly zero). Then we have the // (each escaped by \). Then it matches any character any number of times up to the first newline, .*?\n.
Here ? makes .* non-greedy so that it stops at the first match of the following pattern.
This can be done in other ways, too.
$file =~ s{ ^ \s* // .*? \n }{}gmx
The modifier m makes anchors ^ and $ (unused here) match the beginning and end of each line. I use {}{} as delimiters so that I don't have to escape /. The modifier x allows use of spaces (and comments and newlines) inside for readability.
You can also do it by split-ing the string by newline and passing lines through grep
my $new_file = join '\n', grep { not m|^\s*//.*| } split /\n/, $file;
The split returns a list of lines and this is input for grep, which passes those for which the code in the block evaluates to true. The list that it returns is then joined back, if you wish to again have a multiline string.
If you want lines remove join '\n' and assign to an array instead.
The regex in the grep block is now far simpler, but the whole thing may be an eye-full in comparison with the previous regex. However, this approach can turn hard jobs into easy ones: instead of going for a monster master regex, break the string and process the pieces easily.
I'm trying to use vim regex to erase everything after a colon : and a space or newline character. Below is the text that I'm working with.
ablatives ablative:ablative_A s:+PL
abounded abound:abound_V ed:+PAST
abrogate abrogate:abrogate_V
abusing ab:ab_p us:use_V ing:+PCP1
accents' accent:accent_N s:+PL ':+GEN
accorded accord:accord_V ed:+PAST
So what I want to get from this is the following:
ablatives ablative s
abounded abound ed
abrogate abrogate
abusing ab us ing
accents' accent s '
accorded accord ed
I'm pretty lost on this one but I did trying the statement below:
:s/\:. / /g
I'm trying use that to get at least one of the patterns that I need.
Simply, you can do :
:%s/:\S*//g
\S non-whitespace character;
This seems to give the results you want:
:%s/:.\{-}\([ \n]\)/\1/g
: – literal :. I'm not sure why you escaped that in your question, since you don't need to.
. – mach any character.
\{-} – repeated zero or more times as little as possible ("ungreedy", like *? in many other regexp engines).
\( – start new subgroup for the reference in the replacement pattern.
[ – start "match any of these characters.
– literal space.
\n – newline.
] – end [.
\) – end subgroup.
In the replacement pattern we use \1 to insert either a space or newline, depending on what we replaced.
You can also use \_s instead of [ \n], this will match all spaces, tabs, and newlines. I personally prefer [ \n] since that's more portable across regexp engines (whereas \_s is a Vim-specific construct).
see :help <atom> for more information on any of the above.
:%s/:[^ ]*//g
This deletes : followed by zero or more non-space characters. Newline character at end of line won't be affected
I'm trying to replace multiple spaces with a single one, but at the start of the line.
Example:
___abc___def__
___ghi___jkl__
should turn to
___abc_def__
___ghi_jkl__
Note that I've replaced space with underscore
A simple search using the following pattern:
([^\s])\s+
matches the space at the end of the first line up to the space at the beginning of the next one.
So, if I replace with \1_, I get the following:
___abc_def_ghi_jkl
And that is absolutely not what I expect and regex engines, e.g., PowerGREP or the one in Visual Studio, don't behave that way.
If you want to match only horizontal spaces, use \h:
Find what: (?<=\S)\h+(?=\S)
Replace with: (a space)
There are several possible interpretations of the question. For each of them the replacement will be a single space character.
If spaces is plural and means space characters but not tabs then use
a find string of (^ {2,})|( {2,}$).
If spaces is plural and should includes tabs then use a find string
of (^[ \t]{2,})|([ \t]{2,}$).
If any leading or trailing spaces and tabs (one or more) is to be
replaced with a space then use a find string of (^[ \t]+)|([ \t]+$).
The general form of each of these is (^...)|(...$). The | means an alternation so either the preceding or the following bracketed expression can match. Hence the find what text can match either at the beginning or the end of a line. The ... varies depending on exactly what needs to be matched. Specifying [ \t] means only the two characters space and tab, whereas \s includes the line-end characters.
Ok, so the intention was to replace this:
Hey diddle diddle, \n<br/>
The Cat and the fiddle,\n
with this:
Hey diddle diddle,\n<br/>
The Cat and the fiddle,\n
A slightly modified version of Toto's answer did the trick:
(?<=\S)\h+(?=\S)|\s+$
finding any space(s) between word-characters and trailing space at the end of the line.
I currently have a string, say $line='55.25040882, 3,,,,,,', that I want to remove all whitespace and repeated commas and periods from. Currently, I have:
$line =~ s/[.,]{2,}//;
$line =~ s/\s{1,}//;
Which works, as I get '55.25040882,3', but when I try
$line =~ s/[.,\s]{2,}//;
It pulls out the ", " and leaves the ",,,,,,". I want to retain the first comma and just get rid of the whitespace.
Is there a way to elegantly do this with one line of regex? Please let me know if I need to provide additional information.
EDIT: Since there were so many solutions, I decided to update my question with the answer below:
$line =~ s/([.,])\1{1,}| |\t//g;
This removes all repeated periods and commas, removes all spaces and tabs, while retaining the \r and \n characters. There are so many ways to do this, but this is the one I settled for. Thanks so much!
This is mostly a critique of Rohit's answer, which seems to contain several misconceptions about character class syntax, especially the negation operator (^). Specifically:
[(^\n^\r)\s] matches ( or ^ or ) or any whitespace character, including linefeed (\n) and carriage return (\r). In fact, they're each specified twice (since \s matches them too), though the class still only consumes one character at a time.
^[\n\r]|\s matches a linefeed or carriage return at the beginning of the string, or any whitespace character anywhere (which makes the first part redundant, since any whitespace character includes linefeed and carriage return, and anywhere includes the beginning of the string).
Inside a character class, the caret (^) negates the meaning of everything that follows iff it appears immediately after the opening [; anywhere else, it's just a caret. All other metacharacters except \ lose their special meanings entirely inside character classes. (But the normally non-special characters, - and ], become special.)
Outside a character class, ^ is an anchor.
Here's how I would write the regex:
$line =~ s/([.,])\1+|\h+//g;
Explanation:
Since you finally went with ([.,])\1{1,}, I assume you want to match repeated periods or repeated commas, not things like ., or ,.. Success with regexes means learning to look at text the way the regex engine does, and it's not intuitive. You'll help yourself a lot if you try to describe each problem the way the regex engine would, if it could speak.
{1,} is not incorrect, but why add all that clutter to your regex when + does the same thing?
\h matches horizontal whitespace, which includes spaces and tabs, but not linefeeds or carriage returns. (That only works in Perl, AFAIK. In Ruby/Oniguruma, \h matches a hex digit; in every other flavor I know of, it's a syntax error.)
You can try using: -
my $line='55.25040...882, 3,,,,,,';
$line =~ s/[^\S\n\r]|[.,]{2,}//g; # Negates non-whitespace char, \n and \r
print $line
OUTPUT: -
55.25040882,3
[^\S\n\r]|[.,]{2,} -> This means either [^\S\n\r] or [.,]{2,}
[.,]{2,} -> This means replace , or . if there is more than 2 in the same
line.
[^\S\n\r] -> Means negate all whitespace character, linefeed, and newline.