I would like to write a regex in Perl which will remove everything after the last comma in a string. I know the substring after the last comma is a number or some other substring, so no commas there.
Example: some\string,/doesnt-really.metter,5.
I would like the regex to remove the last comma and the 5 so the output would be: some\string,/doesnt-really.metter
I am not allowed to use any additional modules, only a regex. So which regex should I use?
Another example:
string_with,,,,,_no_point,some_string => string_with,,,,,_no_point
If the comma is always followed by one or more digits, you can use: s/,\d+$//. More generally, use s/,[^,]*$// (match a comma followed by zero or more non-comma characters followed by end-of-string).
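As a quick way to check the more general pattern outside Perl, here is a minimal sketch in Python's re module. The function name strip_after_last_comma is made up for illustration; the pattern is the same ,[^,]*$ as above.

```python
import re

def strip_after_last_comma(text: str) -> str:
    # ",[^,]*$" can only match at the last comma, because [^,]* must reach
    # the end of the string without crossing another comma.
    # Strings that contain no comma are returned unchanged.
    return re.sub(r",[^,]*$", "", text)

print(strip_after_last_comma(r"some\string,/doesnt-really.metter,5"))
print(strip_after_last_comma("string_with,,,,,_no_point,some_string"))
```

Both examples from the question come back without their last comma-separated part.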
This Regex captures everything before the last ,.
(.*),[^,]*$
perl -n -e 'chomp; s/(.*),.*/$1/; print "$_\n";' inputfile.txt
Run this command directly in a terminal; the regex keeps all the text that comes before the last comma.
I have a bunch of CSV files.
I would like to replace exactly one value, the one between the third and fourth comma. I would love to do this with Notepad++'s 'Find in Files' and Replace functionality, which can use regular expressions.
Each line in the files look like this:
03/11/2016,07:44:09,327575757,1,5434543,...
The value I would like to replace in each line is always the number 1, and I want to change it to another number.
It can't be a simple regex for e.g. ,1, as this could be somewhere else in the line, so it must be the one after the third and before the fourth comma...
Could anyone help me with the RegEx?
Thanks in advance!
Two more rows as example:
01/25/2016,15:22:55,276575950,1,103116561,10.111.0.111,ngd.itemversions,0.401,0.058,W10,0.052,143783065,,...
01/25/2016,15:23:07,276581704,1,126731239,10.111.0.111,ll.browse,7.133,1.589,W272,3.191,113273232,,...
You can use
^(?:[^,\n]*,){2}[^,\n]*\K,1,
Replace with the new value, including the surrounding commas (for example ,2,), since the match itself is ,1,.
The pattern explanation:
^ - start of a line
(?:[^,\n]*,){2} - 2 sequences of
[^,\n]* - zero or more characters other than , and \n (matched with the negated character class [^,\n]) followed with
, - a literal comma
[^,\n]* - zero or more characters other than , and \n
\K - an operator that forces the regex engine to discard the whole text matched so far with the regex pattern
,1, - what we get in the match.
Note that \n inside the negated character classes will prevent overflowing to the next lines in the document.
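The same idea can be tried outside Notepad++ with a short Python sketch. Python's re module has no \K, so this version captures the first three fields in a group and writes them back instead. The names FOURTH_FIELD and replace_fourth_field are hypothetical.

```python
import re

# Start of line, two "field," groups and a third field are captured so they
# can be kept; the literal ",1," that follows is the part being replaced.
FOURTH_FIELD = re.compile(r"^((?:[^,\n]*,){2}[^,\n]*),1,", re.MULTILINE)

def replace_fourth_field(text: str, new_value: str) -> str:
    # Only lines whose fourth field is exactly "1" are rewritten.
    # A function is used as the replacement so new_value is inserted literally.
    return FOURTH_FIELD.sub(lambda m: f"{m.group(1)},{new_value},", text)

print(replace_fourth_field("03/11/2016,07:44:09,327575757,1,5434543", "2"))
```

A ,1, that appears in any other position on the line is left alone, because the match is anchored to the start of the line.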
You can replace value between third and fourth comma using following regex.
Regex: ([^,]+,[^,]+,[^,]+),([^,]+)
Replacement to do: Replace with \1,value. I used XX for demo.
I have a file which has data something like this:
34sdf, 434ssdf, 43fef,
34sdf, 434ssdf, 43fef, sdfsfs,
I have to identify the sdfsfs, and replace it and/or print the line.
The exact condition is that the tokens are comma separated, and the target expression starts with a non-numeric character and runs until a comma is met.
I started with [^0-9] to match the leading non-numeric character, but the following characters are unknown to me: they can be digits, special characters, letters or even spaces. So I wanted a (anything)* after it, but the preceding bracket expression comes into play and spoils it. I tried [^0-9]*, [^0-9].*, [^0-9]\+.*, [^0-9]{1}*, [^0-9][^,]* and [^0-9]{1}[^\,]*, and nothing has worked so far. So my question is how to write a regex for this (a non-numeric starting character, then any number of characters other than a comma, up to the next comma). I am using grep and sed (GNU). Also, does it make any difference whether the regex is POSIX or non-POSIX?
Something like that maybe?
(?:(?:^(\D.*?))|(?:,\s(\D.*?))),
This captures the string that starts with a non-numeric character.
I'm not sure if sed supports \D, but you can easily replace it with [^0-9] if not, which you already know.
EDIT: Can be trimmed to:
(?:\s|^)(\D.*?),
With sed, and slight modifications to your last regex:
sed -n 's/.*,[ ]*\([^ 0-9][^\,]*\),/\1/p' input
I think pattern (\s|^)(\D[^,]+), will catch it.
It matches white-space or start of string and group of a non-digit followed by anything but comma, which is followed by comma.
You can use [^0-9] if \D is not supported.
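To experiment with the condition from the question (a token that starts with a non-digit and runs up to the next comma), here is a rough Python sketch. It is a variation on the patterns above, not any answerer's exact regex: the closing comma is checked with a lookahead so that adjacent tokens can both match, and the name non_numeric_tokens is made up.

```python
import re

# A token begins at the start of the line or right after a comma, optionally
# preceded by whitespace. Its first character must not be a digit, whitespace
# or a comma; it then runs up to (but not including) the next comma.
TOKEN = re.compile(r"(?:^|,)\s*([^0-9\s,][^,]*)(?=,)")

def non_numeric_tokens(line: str) -> list:
    return TOKEN.findall(line)

print(non_numeric_tokens("34sdf, 434ssdf, 43fef,"))
print(non_numeric_tokens("34sdf, 434ssdf, 43fef, sdfsfs,"))
```

Only the second sample line yields a token (sdfsfs); the first contains nothing but tokens that start with a digit.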
This might work for you (GNU sed):
sed '/\b[^0-9,][^,]*/!d' file # only print lines that match
or:
sed -n 's/\b[^0-9,][^,]*/XXX/gp' file # substitute `XXX` for match
In one regex ksh line I need to:
look for the occurrence of a particular string followed by any number of characters up to the last occurrence of a particular value (in this case a comma),
copy the stuff matched to the output, and then
insert a new value after the copied text and before the last occurrence of the particular value (in this case a comma)
So, if my input string looked like this:
SEARCH_STRING anything_else(foo,bar),
What I'd like to output is this:
SEARCH_STRING anything_else(foo,bar) INSERTED_VALUE,
So far, my sed expression looks like this (which only matches and copies everything up to the first occurrence of the comma, not up to the last):
sed -e 's/SEARCH_STRING [^,]\+/& INSERTED_VALUE/'
...which results in this:
SEARCH_STRING anything_else(foo INSERTED_VALUE,bar),
...which is not quite right. I know I need to use something like a negative look ahead - but can't quite get the syntax right. Any advice you could offer would be greatly appreciated, thanks. I also need to do the same replacement incidentally at the end of the line even if the comma isn't found as well please (although I appreciate that may require a separate question and expression). Thanks in advance for any advice offered....
Use the $ special character to match the end of the line, and the . special character to match the last character before that:
sed 's/\(SEARCH_STRING .*\)\(.\)$/\1 INSERTED_VALUE\2/'
You could replace the final dot in the match expression with a comma if you know that this is always going to be the character you want to insert before. If that last character varies, then using dot will match any such character. One downside, however, is that it also matches whitespace, so if your line has a few extra spaces after the comma, this expression will insert the value before the last space, not before the comma.
To insert before the last non-whitespace character, use this expression instead (\S and \s are GNU sed extensions):
sed 's/\(SEARCH_STRING .*\)\(\S\s*\)$/\1 INSERTED_VALUE\2/'
The simplest would be to use a lookahead SEARCH_STRING .*(?=,) but sed does not support this, instead you can do something like this:
sed -e 's/\(SEARCH_STRING .*\)\(,.*\)/\1 INSERTED_VALUE\2/'
Basically we capture what comes before the last comma and what comes from that comma onwards in two groups, and then piece them back together with INSERTED_VALUE in the middle. Because .* is greedy, the first group runs up to the last comma.
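The two-group approach can be checked with a small Python sketch. The function name insert_before_last_comma is hypothetical; the regex mirrors the sed expression above with Python's unescaped group syntax.

```python
import re

def insert_before_last_comma(line: str, value: str) -> str:
    # Group 1: SEARCH_STRING and everything up to the last comma (greedy .*).
    # Group 2: the last comma and whatever follows it.
    return re.sub(
        r"(SEARCH_STRING .*)(,.*)",
        lambda m: f"{m.group(1)} {value}{m.group(2)}",
        line,
        count=1,
    )

print(insert_before_last_comma("SEARCH_STRING anything_else(foo,bar),", "INSERTED_VALUE"))
```

Lines that lack SEARCH_STRING or contain no comma after it are returned unchanged, so the end-of-line case mentioned in the question would still need its own expression.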
I have a file which looks like:
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
and I wish to extract strings between : and | separators, the output should be:
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
tab delimited between the two columns.
I wrote in unix a perl command:
perl -l -ne '/:([^|]*)?[^:]*:([^|]*)/ and print($1,"\t",$2)' <file>
the output that I got is:
Q9VNB0 EBI-102551 uniprotkb:A1ZBG6
P91682 EBI-142245 uniprotkb:Q24117
P92177-3 EBI-204491 uniprotkb:Q9VDK2
I wish to know what am I doing wrong and how can I fix the problem.
I don't wish to use split function.
Thanks,
Tom.
The expression you give is too greedy and thus consumes more characters than you wanted. The following expression works on your sample data set:
perl -l -ne '/:([^|]*)\|.*:([^|]*)\|/ and print($1,"\t",$2)'
It anchors the search with explicit matches for something between a ":" and "|" pair. If your data doesn't match exactly, it should ignore the input line, but I have not tested this. I.e., this regex assumes exactly two entries between ":" and "|" will exist per line.
Try m/: ( [^:|]+ ) \| .+ : ( [^:|]+ ) \| /x instead.
A fix could be to use a greedy expression between the first string and the second one. With .* the match runs to the end of the line and then backtracks, searching for the last colon that is followed (after non-pipe characters) by a pipe.
perl -l -ne '/:([^|]*).*:([^|]*)\|/ and print($1,"\t",$2)' <file>
Output:
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
See it in action:
:([\w\-]*?)\|
Another method:
:(\S*?)\|
The way you've specified it, it has to match that way. You want a single colon
followed by any number of non-pipe, followed by any number of non-colon.
single colon -> :
non-pipe -> Q9VNB0
non-colon -> |intact
colon -> :
non-pipe -> EBI-102551 uniprotkb:A1ZBG6
Instead I make a space the end-of-contract, and require all my patterns to begin
with a colon, end with a pipe and consist of non-space/non-pipe characters.
perl -M5.010 -lne 'say join( "\t", m/[:]([^\s|]+)[|]/g )';
perl -nle'print "$1\t$2" if /:([^|]*)\S*\s[^:]*:([^|]*)/'
Or with 5.10+:
perl -nE'say "$1\t$2" if /:([^|]*)\S*\s[^:]*:([^|]*)/'
Explanation:
: Matches the start of the first "word".
([^|]*) Matches the desired part of the first "word".
\S* Matches the end of the first "word".
\s Matches the "word" separator.
[^:]*: Matches the start of the second "word".
([^|]*) Matches the desired part of the second "word".
This isn't the shortest answer (although it's close) because each part is quite independent of the others. This makes it more robust, less error-prone, and easier to maintain.
Why do you not want to use the split function? On the face of it this would be easily solved by writing
my @fields = map /:([^|]+)/, split;
I am not sure how your regex is supposed to work. Using the /x modifier to allow non-significant whitespace it looks like this
/ : ([^|]*)? [^:]* : ([^|]*) /x
which finds a colon and optionally captures as many non-pipe characters as possible. Then it skips over as many non-colon characters as possible to the next colon. Then it captures as many non-pipe characters as possible (possibly zero). Because all of your matches are greedy, any one of them is allowed to consume all of the rest of the string as long as the characters match the character class. Note that a ? that indicates an optional sequence will first of all match all that it can, and the option to skip the sequence will be taken only if the rest of the pattern cannot then be made to match.
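To see this behaviour concretely, the question's regex can be run on the first sample line. This is a Python sketch (the syntax of this pattern is the same in Python's re as in Perl), and the variable names are only for illustration.

```python
import re

line = "uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768"

# The question's original pattern: the second group starts after the colon in
# "intact:" and keeps going until the next pipe, swallowing the space and the
# following "uniprotkb:" prefix along the way.
original = re.search(r":([^|]*)?[^:]*:([^|]*)", line)
print(original.groups())
```

The second group comes out as EBI-102551 uniprotkb:A1ZBG6, which is the unwanted output reported in the question.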
It is hard to judge from your examples the precise criteria for a field, but this code should do the trick. It finds sequences of characters that are neither a colon nor a pipe that are preceded by a colon and terminated by a pipe
use strict;
use warnings;
while (<DATA>) {
my @fields = /:([^:|]+)\|/g;
print join("\t", @fields), "\n";
}
__DATA__
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
output
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
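For comparison, the same "preceded by a colon, terminated by a pipe" rule can be sketched in Python with re.findall. The function name extract_ids is made up, and the sample lines are assumed to be separated by a single space as shown in the question.

```python
import re

def extract_ids(line: str) -> list:
    # Runs of characters that are neither ":" nor "|", preceded by a colon
    # and immediately followed by a pipe.
    return re.findall(r":([^:|]+)\|", line)

rows = [
    "uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768",
    "uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442",
    "uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444",
]
for row in rows:
    print("\t".join(extract_ids(row)))
```

The intact: values are skipped because none of them is directly followed by a pipe.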
I have a string
test:growTest:ret
With sed, I would like to delete only test: to get:
growTest:ret
I tried with
sed '0,/RE/s/^.*://'
But it only gives me
ret
Any ideas ?
Thanks
Modify your regexp ^.*: to ^[^:]*:
All you need is that the .* construction won't consume your delimiter — the colon. To do this, replace matching-any-char . with negated brackets: [^abc], that match any char except specified.
Also, don't confuse the two circumflexes ^, as they have different meanings: first one matches beginning of string, second one means negated brackets.
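The difference between the two patterns is easy to see side by side. This is a small Python sketch rather than sed, but the greedy behaviour of .* is the same; the variable names are only for illustration.

```python
import re

s = "test:growTest:ret"

# Greedy ".*" runs to the last colon, so everything up to it is removed.
greedy = re.sub(r"^.*:", "", s)

# "[^:]*" cannot cross a colon, so only the first field and its colon go.
first_only = re.sub(r"^[^:]*:", "", s)

print(greedy)
print(first_only)
```

The first print shows the unwanted result from the question (ret) and the second shows growTest:ret.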
If I understand your question, you want strings like test:growTest:ret to become growTest:ret.
You can use:
sed -i -E 's/test:(.*$)/\1/' file
-i means edit in place, and -E enables extended regular expressions so that unescaped parentheses form a group.
s/one/two/ replaces occurrences of one with two.
So this replaces "test:(.*$)" with "\1", where \1 is the contents of the first group, which is what the regex matched inside the parentheses.
"test:(.*$)" matches the first occurrence of "test:" and then puts everything else up to the end of the line into the group. The contents of the group remain after the sed command.
sed uses greedy matching, so ^.*: will match test:growTest: rather than test:.
By default, sed only replaces the first match on each line, so you do not need to do anything special to limit the substitution.