how to exclude comma in my regular expression - regex

My input string is
-68,-79,-72,-70,-71,-71,-71,-71,-72,-73,R2,0000feaa-0000-1000-8000-00805f9b34fb
I want like
-68 -79 -73
and my regular expression is
[-][0-9]{2}[^0-9]
and result like
-68, -79,
I want to exclude the comma from my matches.
How can I solve this?
Thank you for your help.

Based on your regex and your results, I assume you are finding multiple matches and then putting spaces between each match. Let me break down what your regex is doing:
[-] matches the negative sign
[0-9]{2} matches two digits
[^0-9] matches any non-digit character, including a comma. So the commas are part of your match
If you want to exclude the commas from your match, but still assert that they are there, you need to use a positive lookahead. This is done like so:
[-][0-9]{2}(?=[^0-9])
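As a minimal sketch of the difference, here are both patterns run through Python's re module. The choice of Python and the variable names are assumptions, since the question does not say which tool is in use:

```python
import re

# Input string taken from the question.
data = "-68,-79,-72,-70,-71,-71,-71,-71,-72,-73,R2,0000feaa-0000-1000-8000-00805f9b34fb"

# Original pattern: the trailing [^0-9] consumes the comma, so it is part of each match.
with_comma = re.findall(r"[-][0-9]{2}[^0-9]", data)

# Lookahead version: the non-digit is asserted but not consumed.
without_comma = re.findall(r"[-][0-9]{2}(?=[^0-9])", data)

print(with_comma[:2])           # ['-68,', '-79,']
print(" ".join(without_comma))  # space-separated values, no commas
```

The lookahead also keeps the pattern from matching inside the trailing UUID, because every "-NN" there is followed by another digit.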

Already said this in the comments but will post answer just for the sake of completion.
The solution to this isn't exactly regex. It's the replace function of whatever tool you're using. All you have to do is replace the , by a (space).
For example, in python .replace(',', ' ') is sufficient
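A short sketch of that replace call, using a shortened sample of the input (the variable names are only for illustration):

```python
# Shortened sample of the comma-separated values from the question.
values = "-68,-79,-72"

# str.replace swaps every comma for a space; no regex is involved.
spaced = values.replace(",", " ")

print(spaced)  # -68 -79 -72
```

Note that on the full input from the question this would also keep the trailing R2 and UUID fields, since nothing is filtered out.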

which language are you using?
For example:
sed
echo "-34,-35,-34" | sed 's/,/ /g'
awk
echo "-34,-35,-34" | awk '{gsub(/,/, " ", $0); print $0}'

Related

Regular expression to match string in line between single ":" field delimiters and exclude them, when the string also contains "::" field delimiters

Using a regular expression, I need to match only the IPv4 subnet mask from the given input string:
ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
For testing this input string is contained in a text file called file.txt, however the actual use case will be to parse /proc/cmdline, and I will need a solution that starts parsing, counting fields, and matching after encountering "ip=" until the next white space character.
I'm using bash 4.2.46 with GNU grep 2.20 on an EL 7.9 workstation, x86_64 to test the expression.
Based on examples I've seen looking at other questions, I've come up with the following grep command and PCRE regular expression which gives output that is very close to what I need.
[user@ws01 ~]$ grep -o -P '(?<!:)(?:\:[0-9])(.*?)(?=:)' file.txt
:255.255.254.0
My understanding of what I've done here is that, I've started with a negative lookbehind with a ":" character to try and exclude the first "::" field, followed by a non capturing group to match on an escaped ":" character, followed by a number, [0-9], then a capturing group with .*?, for the actual match of the string itself, and finally a look ahead for the next ":" character.
The problem is that this gives the desired string, but includes an extra : character at the beginning of the string.
Expected output should look like this:
255.255.254.0
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters. The reason for this is because a field can have an empty value. For example
:<null>:ip:gw:netmask:hostname:<null>:off
Null is shown here to indicate an omitted value not passed by the user, that the user does not need to provide for the intended purpose.
I've tried a few different expressions as suggested in other answers that use negative look behinds and look aheads to not start matching at a : which is neighbored by another :
For example, see this question:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
If I can start matching at the first single colon, by itself, which is not followed by or preceded by another : character, while excluding the colon character as the delimiter, and continue matching until the next single colon which is also not neighboring another : and without including the colon character, that should match the desired string.
I'm able to match the exact string by including "255" in an expression like this: (Which will work for all of our present use cases)
[user@ws01 ~]$ grep -o -P '(?:)255.*?(?=:)' file.txt
255.255.254.0
The logic problem here is that the subnet mask itself, may not always start with "255", but it should be a number, [0-9] which is why I'm attempting to use that in the expression above. For the sake of simplicity, I don't need to validate that it's not greater than 255.
Using gnu-grep you could write the pattern as:
grep -oP '(?<!:):\K\d{1,3}(?:\.\d{1,3}){3}(?=:(?!:))' file.txt
Output
255.255.254.0
Explanation
(?<!:): Negative lookbehind, assert not : to the left and then match :
\K Forget what is matched until now
\d{1,3}(?:\.\d{1,3}){3} Match 4 times 1-3 digits separated by .
(?=:(?!:)) Positive lookahead, assert : that is not followed by :
See a regex demo.
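For comparison, a rough Python sketch of the same pattern. Python's built-in re module does not support \K, so this version matches the leading : outside a capture group instead; the names NETMASK, line, and netmask are assumptions made for the example:

```python
import re

# Sample input from the question.
line = "ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off"

# Same lookarounds as the grep pattern; the capture group stands in for \K.
NETMASK = re.compile(r"(?<!:):(\d{1,3}(?:\.\d{1,3}){3})(?=:(?!:))")

match = NETMASK.search(line)
netmask = match.group(1) if match else None

print(netmask)  # 255.255.254.0
```

The gateway 10.0.20.1 is skipped because the colon in front of it is itself preceded by a colon, which the lookbehind rejects.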
Using grep
$ grep -oP '(?<!:)?:\K([0-9.]+)(?=:[[:alpha:]])' file.txt
View Demo here
or
$ grep -oP '[^:]*:\K[^:[:alpha:]]*' file.txt
Output
255.255.254.0
If these are delimiters, your value should be in a clearly predictable place.
Just treat every colon as a delimiter and select the 4th field.
$: awk -F: '{print $4}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
I'm not sure what you mean by
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters.
If your delimiters aren't predictable and parse-able, they are useless. If you mean the fields can have or not have quotes, but you need to exclude quotes, we can do that. If double colons are one delimiter and single colons are another that's horrible design, but we can probably handle that, too.
$: awk -F'::' '{ split($2,x,":"); print x[2];}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
For quotes, you need to provide an example.
Since the number of fields is always the same, simply separated by ":", you can use cut.
That solution will also work if you have empty fields.
cut -d":" -f4
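The field-counting approach can also be sketched in Python for the /proc/cmdline use case mentioned in the question. The function name and the sample command line (including BOOT_IMAGE, ro, and quiet) are hypothetical, made up for this example:

```python
def netmask_from_cmdline(cmdline):
    """Return the 4th colon-separated field of the ip= argument, or None."""
    # Kernel arguments are separated by whitespace.
    for token in cmdline.split():
        if token.startswith("ip="):
            # Every colon is a delimiter; empty fields stay as empty strings.
            fields = token.split(":")
            return fields[3] if len(fields) > 3 else None
    return None


# Hypothetical command line; only the ip= argument comes from the question.
sample = "BOOT_IMAGE=/vmlinuz ro ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off quiet"
print(netmask_from_cmdline(sample))  # 255.255.254.0
```

Because split keeps empty strings for omitted values, the double colons do not shift the field positions.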

Match everything to the right of a particular character

I want to match a regex like:
.+|(.+)
but sometimes the input is like:
.+|.+|.+|.+|.+
In other words, I don't know how many pipe characters | are in the input string, but I know I want to extract whatever is to the right of the rightmost |.
You can use the following:
[^|]+$
Regular expression:
[^|]+ any character except: '|' (1 or more times)
$ before an optional \n, and the end of the string
So for example using grep:
echo ".+|.+|.+|.+|foo" | grep -Eo '[^|]+$'
# => 'foo'
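A minimal Python sketch of the same pattern, with a non-regex alternative for comparison. The helper name after_last_pipe is an assumption for this example:

```python
import re


def after_last_pipe(text):
    """Return whatever follows the rightmost '|', or None if nothing does."""
    # [^|]+$ can only match a run of non-pipe characters that reaches the end.
    m = re.search(r"[^|]+$", text)
    return m.group(0) if m else None


print(after_last_pipe(".+|.+|.+|.+|foo"))  # foo

# Without a regex: rpartition splits on the last '|' only.
print("a|b|c".rpartition("|")[2])          # c
```

If the string ends with a pipe, the regex version finds no match and the helper returns None.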
You could also use a one-liner to do this, Example:
perl -nle 'print $_ for (split /\|/)[-1]' file
Assuming the end of your example is the end of the string/line, you can specify the end of line to get the value on the right of the rightmost pipe:
^.+\|(.+)$
Demo: http://regex101.com/r/wR3lP2
Instead of using ., use a character class matching any character other than |:
^.+\|([^|]+)$
You may want to use ^.+\|([^|]+).*$, which, in the case where the string ends with |, will capture like so:
.+|.+|.+|.+|.+|
Captures .+
The other suggestions, which don't negate the |, capture .+|
Consider the following Regex...
\|(\.\+)$
Good Luck!
or use this simple pattern ([^|]+)$

grep for words ending in 'ing' immediately after a comma

I am trying to grep files for lines with a word ending in 'ing' immediately after a comma, of the form:
... we gave the dog a bone, showing great generosity ...
... this man, having no home ...
but not:
... this is a great place, we are having a good time ...
I would like to find instances where the 'ing' word is the first word after a comma. It seems like this should be very doable in grep, but I haven't figured out how, or found a similar example.
I have tried
grep -e ", .*ing"
which matches multiple words after the comma. Commands like
grep -i -e ", [a-z]{1,}ing"
grep -i -e ", [a-z][a-z]+ing"
don't do what I expect--they don't match phrases like my first two examples. Any help with this (or pointers to a better tool) would be much appreciated.
Try ,\s*\S+ing
Matches your first two phrases, doesn't match in your third phrase.
\s means 'any whitespace', * means 0 or more of that, \S means 'any non-whitespace' (capitalizing the letter is conventional for inverting the character set in regexes - works for \b \s \w \d), + means 'one or more' and then we match ing.
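To check the pattern against the three example lines, here is a small sketch in Python (an assumption; the question uses grep, and the names below are only for illustration):

```python
import re

# Pattern from the answer: comma, optional whitespace, one word ending in "ing".
PATTERN = re.compile(r",\s*\S+ing")

lines = [
    "... we gave the dog a bone, showing great generosity ...",
    "... this man, having no home ...",
    "... this is a great place, we are having a good time ...",
]

# True where the first word after a comma contains "ing" after at least one character.
hits = [bool(PATTERN.search(line)) for line in lines]
print(hits)  # [True, True, False]
```

Nothing in the pattern anchors the end of the word, so a word such as "kingdom" right after a comma would also match; appending \b is one way to tighten it.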
You can use the \b token to match on word boundaries (see this page).
Something like the following should work:
grep -e ".*, \b\w*ing\b"
EDIT: Except now I realised that the \b is unnecessary, and .*,\s*\w*ing would work, as Patashu pointed out. My regex-fu is rusty.

REGEX Remove Space

I want to create a regex to remove some matching strings; the string is a phone number.
For example, a user inputs a phone number like this:
+jfalkjfkl saj f62 81 7876 asdadad30 asasda36
and the output should be like this:
628178763036
At the moment, my current regex ^[\+\sa-zA-Z]+ selects the part +jfalkjfkl saj f
What regex would also select the spaces between the numbers?
e.g:
62(select the space here)81, 81(select the space here)7876
I don't know what language you plan on using this in, but replacing this pattern:
[^\d]+
with an empty string should accomplish this. It'll remove everything that's not a number.
Using PCRE regexes, you should be able to simply remove anything matching \D+. Example:
echo "+jfalkjfkl saj f62 81 7876 asdadad30 asasda36" | perl -pe 's/\D+//g'
prints:
628178763036
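The same substitution as a short Python sketch (the language is an assumption, since the question does not name one):

```python
import re

# Input string from the question.
raw = "+jfalkjfkl saj f62 81 7876 asdadad30 asasda36"

# \D+ matches each run of non-digit characters; replacing with "" leaves only digits.
digits = re.sub(r"\D+", "", raw)

print(digits)  # 628178763036
```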
It would appear that you need two operations:
Remove everything that is neither a blank nor a digit:
s/[^ \d]//g;
Remove all extra blanks:
s/ +/ /g;
If you need to remove leading and trailing blanks too:
s/^ //;
s/ $//;
(after replacing multiple blanks with a single blank).
You can use \s to represent more space-like characters than just a blank.
Use a look-behind and a look-ahead to assert that digits must precede/follow the space(s):
(?<=\d) +(?=\d)
The entire regex matches the spaces, so no need to reference groups in your replacement, just replace with a blank.
If you make a replace you can reconstruct the phone number with the space between numbers:
search: \D*(\d+)\D*?(\s?)
replace: $1$2

Extract strings between two separators using regex in perl

I have a file which looks like:
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
and I wish to extract strings between : and | separators, the output should be:
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
tab delimited between the two columns.
I wrote in unix a perl command:
perl -l -ne '/:([^|]*)?[^:]*:([^|]*)/ and print($1,"\t",$2)' <file>
the output that I got is:
Q9VNB0 EBI-102551 uniprotkb:A1ZBG6
P91682 EBI-142245 uniprotkb:Q24117
P92177-3 EBI-204491 uniprotkb:Q9VDK2
I wish to know what am I doing wrong and how can I fix the problem.
I don't wish to use split function.
Thanks,
Tom.
The expression you give is too greedy and thus consumes more characters than you wanted. The following expression works on your sample data set:
perl -l -ne '/:([^|]*)\|.*:([^|]*)\|/ and print($1,"\t",$2)'
It anchors the search with explicit matches for something between a ":" and "|" pair. If your data doesn't match exactly, it should ignore the input line, but I have not tested this. I.e., this regex assumes exactly two entries between ":" and "|" will exist per line.
Try m/: ( [^:|]+ ) \| .+ : ( [^:|]+ ) \| /x instead.
A fix could be to use a greedy expression between the first string and the second one. With .* it goes to the end of the line and then backtracks, searching for the last colon followed by a pipe.
perl -l -ne '/:([^|]*).*:([^|]*)\|/ and print($1,"\t",$2)' <file>
Output:
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
See it in action:
:([\w\-]*?)\|
Another method:
:(\S*?)\|
The way you've specified it, it has to match that way. You want a single colon
followed by any number of non-pipe, followed by any number of non-colon.
single colon -> :
non-pipe -> Q9VNB0
non-colon -> |intact
colon -> :
non-pipe -> EBI-102551 uniprotkb:A1ZBG6
Instead I make a space the end-of-contract, and require all my patterns to begin
with a colon, end with a pipe and consist of non-space/non-pipe characters.
perl -M5.010 -lne 'say join( "\t", m/[:]([^\s|]+)[|]/g )';
perl -nle'print "$1\t$2" if /:([^|]*)\S*\s[^:]*:([^|]*)/'
Or with 5.10+:
perl -nE'say "$1\t$2" if /:([^|]*)\S*\s[^:]*:([^|]*)/'
Explanation:
: Matches the start of the first "word".
([^|]*) Matches the desired part of the first "word".
\S* Matches the end of the first "word".
\s Matches the "word" separator.
[^:]*: Matches the start of the second "word".
([^|]*) Matches the desired part of the second "word".
This isn't the shortest answer (although it's close) because each part is quite independent of the others. This makes it more robust, less error-prone, and easier to maintain.
Why do you not want to use the split function? On the face of it this would be easily solved by writing
my @fields = map /:([^|]+)/, split
I am not sure how your regex is supposed to work. Using the /x modifier to allow non-significant whitespace it looks like this
/ : ([^|]*)? [^:]* : ([^|]*) /x
which finds a colon and optionally captures as many non-pipe characters as possible. Then skips over as many non-colon characters as possible to the next colon. Then captures as many non-pipe characters as possible. Because all of your matches are greedy, any one of them is allowed to consume all of the rest of the string as long as the characters match the character class. Note that a ? that indicates an optional sequence will first of all match all that it can, and the option to skip the sequence will be taken only if the rest of the pattern cannot then be made to match.
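The greedy behaviour described here can be reproduced with a small sketch. It uses Python's re module rather than Perl, which is an assumption made purely for illustration; the pattern is the one from the question:

```python
import re

# First line of the sample file from the question.
line = "uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768"

# Pattern from the question: both captures are greedy [^|]* runs.
original = re.search(r":([^|]*)?[^:]*:([^|]*)", line)

first, second = original.groups()
print(repr(first))   # 'Q9VNB0'
print(repr(second))  # 'EBI-102551 uniprotkb:A1ZBG6'
```

The second capture starts at the colon after "intact" and runs to the next pipe, which is why the unwanted "EBI-102551 uniprotkb:" prefix appears in the output.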
It is hard to judge from your examples the precise criteria for a field, but this code should do the trick. It finds sequences of characters that are neither a colon nor a pipe that are preceded by a colon and terminated by a pipe
use strict;
use warnings;
while (<DATA>) {
  my @fields = /:([^:|]+)\|/g;
  print join("\t", @fields), "\n";
}
__DATA__
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
output
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
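For readers outside Perl, a rough Python equivalent of the global match is sketched below. The port and the names FIELD and rows are assumptions; the pattern and data are the ones used in the answer:

```python
import re

# A colon, then one or more characters that are neither ':' nor '|', then a pipe.
FIELD = re.compile(r":([^:|]+)\|")

rows = [
    "uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768",
    "uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442",
    "uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444",
]

# findall returns every captured group per line, like Perl's /g in list context.
extracted = [FIELD.findall(row) for row in rows]

for fields in extracted:
    print("\t".join(fields))
```

The intact: values are skipped because they are not followed by a pipe before the next colon or the end of the line.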