I want to match a regex like:
.+|(.+)
but sometimes the input is like:
.+|.+|.+|.+|.+
In other words, I don't know how many pipe characters | are in the input string, but I know I want to extract whatever is to the right of the rightmost |.
You can use the following:
[^|]+$
Regular expression:
[^|]+ any character except: '|' (1 or more times)
$ before an optional \n, and the end of the string
So for example using grep:
echo ".+|.+|.+|.+|foo" | grep -Eo '[^|]+$'
# => 'foo'
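For comparison, here is a minimal Python sketch of the same idea. Python is my choice here, not something from the question, and the helper name last_field is made up for illustration. It uses the same [^|]+$ pattern and also shows str.rpartition as a non-regex alternative.

```python
import re

def last_field(text):
    """Return whatever follows the rightmost '|', or None if nothing does."""
    # [^|]+$ : one or more non-pipe characters anchored at the end of the string
    match = re.search(r'[^|]+$', text)
    return match.group(0) if match else None

print(last_field('.+|.+|.+|.+|foo'))
# Non-regex alternative: rpartition splits on the last '|'
print('.+|.+|.+|.+|foo'.rpartition('|')[2])
```

Note that a string ending in | gives None here, because there is nothing after the last pipe for [^|]+ to match.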
You could also do this with a Perl one-liner, for example:
perl -nle 'print $_ for (split /\|/)[-1]' file
Assuming the end of your example is the end of the string/line, you can specify the end of line to get the value on the right of the rightmost pipe:
^.+\|(.+)$
Demo: http://regex101.com/r/wR3lP2
Instead of using ., use a character class matching any character other than |:
^.+\|([^|]+)$
You may want to use ^.+\|([^|]+).*$, which, in the case where the string ends with |, will capture like so:
.+|.+|.+|.+|.+|
Captures .+
The other suggestions that don't negate the | capture .+| instead.
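To see the difference between the three anchored patterns on an input that ends with a pipe, here is a small Python sketch. The input 'a|b|c|' and the dictionary labels are made up for illustration; only the three patterns come from the answers in this thread.

```python
import re

text = 'a|b|c|'  # hypothetical input that ends with a pipe

patterns = {
    'dot':        r'^.+\|(.+)$',          # capture group allows pipes
    'negated':    r'^.+\|([^|]+)$',       # capture group must reach the end
    'negated+.*': r'^.+\|([^|]+).*$',     # trailing .* absorbs the final pipe
}

results = {}
for name, pattern in patterns.items():
    m = re.match(pattern, text)
    # Store the captured text, or None when the pattern does not match at all
    results[name] = m.group(1) if m else None
    print(name, '->', results[name])
```

Under these assumptions the plain (.+) version captures the trailing pipe as well, the strictly anchored negated class does not match at all, and the version with a trailing .* captures only the last non-empty field.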
Consider the following Regex...
\|(\.\+)$
Good Luck!
or use this simple pattern ([^|]+)$
Related
My input string is
-68,-79,-72,-70,-71,-71,-71,-71,-72,-73,R2,0000feaa-0000-1000-8000-00805f9b34fb
I want output like
-68 -79 -73
and my regular expression is
[-][0-9]{2}[^0-9]
but the result looks like
-68, -79,
I want to exclude the commas from the matches.
How can I solve this problem?
Thank you for your help
Based on your regex and your results, I assume you are finding multiple matches and then putting spaces between each match. Let me break down what your regex is doing:
[-] matches the negative sign
[0-9]{2} matches two digits
[^0-9] matches any non-digit character, including a comma. So the commas are part of your match
If you want to exclude the commas from your match, but still assert that they are there, you need to use a positive lookahead. This is done like so:
[-][0-9]{2}(?=[^0-9])
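A quick way to compare the two patterns is the following Python sketch (Python is an assumption on my part, since the question does not say which language is in use; any engine that supports lookaheads should behave the same way). The variable names are made up for illustration.

```python
import re

data = ('-68,-79,-72,-70,-71,-71,-71,-71,-72,-73,R2,'
        '0000feaa-0000-1000-8000-00805f9b34fb')

# Original pattern: the non-digit (here a comma) is consumed as part of the match
with_comma = re.findall(r'[-][0-9]{2}[^0-9]', data)

# Lookahead version: the non-digit must follow, but is not part of the match
without_comma = re.findall(r'[-][0-9]{2}(?=[^0-9])', data)

print(with_comma)
print(' '.join(without_comma))
```

Neither pattern matches inside the trailing UUID, because every hyphen there is followed by more than two digits.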
Already said this in the comments but will post answer just for the sake of completion.
The solution to this isn't exactly regex. It's the replace function of whatever tool you're using. All you have to do is replace the , by a (space).
For example, in python .replace(',', ' ') is sufficient
which language are you using?
For example:
sed
echo "-34,-35,-34" | sed 's/,/ /g'
awk
echo "-34,-35,-34" | awk '{gsub(/,/, " ", $0); print $0}'
What's the regular expression for finding all instances of a comma, without a trailing space, between words? e.g. "someword,otherword"?
I'm using this pattern in Eclipse's search tool:
([^,\n\s']),([^,\n\s\)\]'])
which works perfectly, but when I use this same pattern with grep like:
grep -nHIirE -- ([^,\n\s']),([^,\n\s\)\]'])
it finds nothing. What am I doing wrong?
Use word boundaries \b:
echo "abc,def ghi, jkl" | grep '\b,\b'
(find the first comma, but not the second)
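Passing ([^,\n\s']),([^,\n\s\)\]']) to grep unquoted means the shell interprets the parentheses, quotes and backslashes before grep ever sees the pattern. You need to quote it so that it reaches grep intact as a regex, for example with double quotes: "([^,\n\s']),([^,\n\s\)\]'])"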
Using grep this pattern should work:
grep -nHIir -- "[^,'[:space:]],[^],)[:space:]']"
Use [:space:] to match any whitespace including newlines
Importantly don't escape ] inside a negated character class. Just place it at first position.
Avoid unnecessary grouping
You don't even need extended regex flavor for this
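If you want to sanity-check the character classes outside of grep, here is a rough Python equivalent. This is only a sketch: Python's re syntax differs from POSIX bracket expressions, so \s stands in for [:space:] and the literal ] is escaped rather than placed first. The sample strings and the name TIGHT_COMMA are made up.

```python
import re

# Python spelling of the pattern: \s plays the role of [:space:], and the
# literal ] is escaped instead of being placed first in the class.
TIGHT_COMMA = re.compile(r"[^,'\s],[^\],)\s']")

samples = ['someword,otherword', 'someword, otherword', 'f(x,)', 'a[0,]']

# Keep only the samples that contain a comma squeezed between two "word" characters
hits = [s for s in samples if TIGHT_COMMA.search(s)]
print(hits)
```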
echo "This is a test string" | sed 's/This/\0/'
First I match substring This using the regex This. Then I replace the entire string with the first match using \0. So the result should be just the matched string.
But it prints out the entire line. Why is this so?
You don't replace the whole string with \0, just the pattern match, which is This. In other words, you replace This with This.
To replace the whole line with This, you can do:
echo "This is a test string" | sed '/This/s/.*/This/'
It looks for a line matching This, and replaces the whole line with This. In this case (since there is only one line) you can also do:
echo "This is a test string" | sed 's/.*/This/'
If you want to reuse the match, then you can do
echo "This is a test string" | sed 's/.*\(This\).*/\1/'
\( and \) are used to remember the match inside them. It can be referenced as \1 (if you have more than one pair of \( and \), then you can also use \2, \3, ...).
In the example above this is not very helpful, since we know that inside \( and \) is the word This, but if we have a regex inside the parentheses that can match different words, this can be very helpful.
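The same behaviour can be reproduced with Python's re.sub, shown here only as an illustrative sketch (sed's \0 corresponds to \g<0> in Python, and the third input string is a made-up example of a case where the captured text is not known in advance).

```python
import re

line = 'This is a test string'

# Replacing the match with itself (sed's \0 or &) leaves the line unchanged
unchanged = re.sub(r'This', r'\g<0>', line)

# Matching the whole line and keeping only group 1 (sed's \1)
only_match = re.sub(r'.*(This).*', r'\1', line)

# Same idea where the captured text varies (hypothetical input)
number = re.sub(r'.*?(\d{4}).*', r'\1', 'build 2024 passed')

print(unchanged)
print(only_match)
print(number)
```

The leading .*? in the last pattern is lazy so that the first run of four digits is captured rather than the last one.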
sed 's/.*\(PatThis\).*/PatThat/'
or
sed '/PatThis/ s/.*/PatThat/'
In your question, "PatThis" and "PatThat" have the same content ("This"). In your comment ("I need to select a number using \d\d\d\d and then use it as replacement") you have two different values for the patterns PatThis and PatThat.
The \1 is not really needed when you know the exact content of the match (unless 'PatThis' is a regex with special characters like \ & ? .)
I wanted to trim the following jdk-1.6.0_30-fcs.x86_64 to just jdk-1.6.0_30. I tried sed 's/\([a-z][^fcs]*\).*/\1/' but I end up with jdk-1.6.0_30-. I think I am approaching it the wrong way. Is there a way to start from the end of the word and traverse backwards until I encounter -?
Not exactly, but you can anchor the pattern to the end of the string with $. Then you just need to make sure that the characters you repeat may not include hyphens:
echo jdk-1.6.0_30-fcs.x86_64 | sed 's/-[^-]*$//'
This will match from a - to the end of the string, but all characters in between must be different from - (so that it does not match for the first hyphen already).
A slightly more detailed explanation. The engine tries to match the literal - first. That will first succeed at the first - in the string (obviously). Then [^-]* matches as many non-hyphen characters as possible, so it will consume 1.6.0_30 (because the next character is in fact a hyphen). Now the engine will try to match $, but that does not work because we are not at the end of the string. Some backtracking occurs, but we can ignore that here. In the end the engine will abandon matching at the first - and continue through the string. Then the engine will match the literal - with the second -. Now [^-]* will consume fcs.x86_64. Now we are actually at the end of the string and $ will match, so the full match (which will be removed) is -fcs.x86_64.
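If it helps to see which part of the string the pattern ends up matching, here is a small Python sketch using the same regex (Python is just my choice for illustration; the variable names are made up).

```python
import re

name = 'jdk-1.6.0_30-fcs.x86_64'

# -[^-]*$ : a hyphen, then only non-hyphen characters up to the end of the string
matched = re.search(r'-[^-]*$', name).group(0)

# Removing that match leaves everything before the last hyphen
trimmed = re.sub(r'-[^-]*$', '', name)

print(matched)
print(trimmed)
```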
Use cut:
echo 'jdk-1.6.0_30-fcs.x86_64' | cut -d- -f-2
Try doing this :
echo 'jdk-1.6.0_30-fcs.x86_64' | sed 's/-fcs.*//'
If using bash, sh or ash, you can do :
var=jdk-1.6.0_30-fcs.x86_64
echo "${var%%-fcs*}"
jdk-1.6.0_30
The latter solution uses parameter expansion; tested on Linux and Minix3.
I have a file which looks like:
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
and I wish to extract strings between : and | separators, the output should be:
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
tab delimited between the two columns.
I wrote in unix a perl command:
perl -l -ne '/:([^|]*)?[^:]*:([^|]*)/ and print($1,"\t",$2)' <file>
the output that I got is:
Q9VNB0 EBI-102551 uniprotkb:A1ZBG6
P91682 EBI-142245 uniprotkb:Q24117
P92177-3 EBI-204491 uniprotkb:Q9VDK2
I wish to know what am I doing wrong and how can I fix the problem.
I don't wish to use split function.
Thanks,
Tom.
The expression you give is too greedy and thus consumes more characters than you wanted. The following expression works on your sample data set:
perl -l -ne '/:([^|]*)\|.*:([^|]*)\|/ and print($1,"\t",$2)'
It anchors the search with explicit matches for something between a ":" and "|" pair. If your data doesn't match exactly, it should ignore the input line, but I have not tested this. I.e., this regex assumes exactly two entries between ":" and "|" will exist per line.
Try m/: ( [^:|]+ ) \| .+ : ( [^:|]+ ) \| /x instead.
A fix could be to use a greedy expression between the first string and the second one. With .* the match runs to the end of the line and then backtracks, searching for the last colon that is followed by non-pipe characters and a pipe.
perl -l -ne '/:([^|]*).*:([^|]*)\|/ and print($1,"\t",$2)' <file>
Output:
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
See it in action:
:([\w\-]*?)\|
Another method:
:(\S*?)\|
The way you've specified it, it has to match that way. You want a single colon
followed by any number of non-pipe, followed by any number of non-colon.
single colon -> :
non-pipe -> Q9VNB0
non-colon -> |intact
colon -> :
non-pipe -> EBI-102551 uniprotkb:A1ZBG6
Instead I make a space the end-of-contract, and require all my patterns to begin
with a colon, end with a pipe and consist of non-space/non-pipe characters.
perl -M5.010 -lne 'say join( "\t", m/[:]([^\s|]+)[|]/g )';
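The trace of the question's pattern given in this answer can be checked with a short Python sketch (my own illustration, not part of the Perl solution). It runs the original pattern and the colon/non-space/pipe pattern against the first sample line.

```python
import re

line = 'uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768'

# The pattern from the question: the second capture greedily runs up to the next pipe
original = re.search(r':([^|]*)?[^:]*:([^|]*)', line).groups()

# Colon, then non-space/non-pipe characters, then a pipe
fixed = re.findall(r'[:]([^\s|]+)[|]', line)

print(original)
print(fixed)
```

The intact: entries are skipped by the second pattern because they are followed by a space or the end of the line rather than by a pipe.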
perl -nle'print "$1\t$2" if /:([^|]*)\S*\s[^:]*:([^|]*)/'
Or with 5.10+:
perl -nE'say "$1\t$2" if /:([^|]*)\S*\s[^:]*:([^|]*)/'
Explanation:
: Matches the start of the first "word".
([^|]*) Matches the desired part of the first "word".
\S* Matches the end of the first "word".
\s Matches the "word" separator.
[^:]*: Matches the start of the second "word".
([^|]*) Matches the desired part of the second "word".
This isn't the shortest answer (although it's close) because each part is quite independent of the others. This makes it more robust, less error-prone, and easier to maintain.
Why do you not want to use the split function? On the face of it this would be easily solved by writing
my @fields = map /:([^|]+)/, split;
I am not sure how your regex is supposed to work. Using the /x modifier to allow non-significant whitespace it looks like this
/ : ([^|]*)? [^:]* : ([^|]*) /x
which finds a colon and optionally captures as many non-pipe characters as possible. Then skips over as many non-colon characters as possible to the next colon. Then captures as many non-pipe characters as possible (possibly zero). Because all of your matches are greedy, any one of them is allowed to consume all of the rest of the string as long as the characters match the character class. Note that a ? that indicates an optional sequence will first of all match all that it can, and the option to skip the sequence will be taken only if the rest of the pattern cannot then be made to match.
It is hard to judge from your examples the precise criteria for a field, but this code should do the trick. It finds sequences of characters that are neither a colon nor a pipe that are preceded by a colon and terminated by a pipe
use strict;
use warnings;
while (<DATA>) {
my @fields = /:([^:|]+)\|/g;
print join("\t", @fields), "\n";
}
__DATA__
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
output
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
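For readers who are not using Perl, roughly the same approach can be sketched in Python. This is an illustration under the same assumption as above, namely that a field is a run of characters that are neither a colon nor a pipe, preceded by a colon and terminated by a pipe. The names DATA, FIELD and rows are made up.

```python
import re

DATA = """\
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
"""

# A colon, then one or more characters that are neither ':' nor '|', then a pipe
FIELD = re.compile(r':([^:|]+)\|')

# One list of captured fields per input line
rows = [FIELD.findall(line) for line in DATA.splitlines()]

for fields in rows:
    print('\t'.join(fields))
```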