echo "xxabc jkl" | grep -onP '\w+(?!abc\b)'
1:xxabc
1:jkl
Why is the result not as shown below?
echo "xxabc jkl" | grep -onP '\w+(?!abc\b)'
1:jkl
The first string is xxabc, which ends with abc.
I want to extract all words that do not end with abc, so why is xxabc matched?
How can I fix it so that the output is only 1:jkl?
Why doesn't '\w+(?!abc\b)' work?
The \w+(?!abc\b) pattern matches xxabc because \w+ matches one or more word chars greedily and thus grabs xxabc at once. Then the negative lookahead (?!abc\b) checks that there is no abc followed by a word boundary immediately to the right of the current location. Since nothing of the sort follows xxabc (the next character is a space), the lookahead succeeds and so does the match.
To match all words that do not end with abc using a PCRE regex, you may use
echo "xxabc jkl" | grep -onP '\b\w+\b(?<!abc)'
Details
\b - a leading word boundary
\w+ - 1 or more word chars
\b - a trailing word boundary
(?<!abc) - a negative lookbehind that fails the match if the 3 letters immediately to the left of the current location are abc.
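As a rough check of the lookbehind pattern on a slightly larger input, here is a minimal sketch. It assumes GNU grep built with PCRE support (-P); the sample words and the variable names are made up for illustration.

```shell
# Hypothetical sample: two words end in "abc", two do not.
words='xxabc jkl abc abcde'

# -o prints each match on its own line; -P enables the PCRE lookbehind.
result=$(printf '%s\n' "$words" | grep -oP '\b\w+\b(?<!abc)')

printf '%s\n' "$result"
# jkl
# abcde
```

Because both \b anchors pin the match to a whole word, \w+ cannot backtrack to a shorter piece of xxabc to get around the lookbehind.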
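Without PCRE-specific features (grep -P), you can do it by piping through sed first. The \> (end of word, GNU sed) keeps the substitution from also removing abc in the middle of a longer word:
echo "xxabc jkl" | sed 's/[a-zA-Z]*abc\>//g' | grep -onE '[a-zA-Z]+'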
or with awk:
echo "xxabc jkl" | awk -F'[^a-zA-Z]+' '{for(i=1;i<=NF;i++){ if ($i!~/abc$/) printf "%s: %s\n",NR,$i }}'
Another approach:
echo "xxabc jkl" | awk -F'([^a-zA-Z]|[a-zA-Z]*abc\\>)+' '{OFS="\n"NR": ";if ($1) printf OFS;$1=$1}1'
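If the line-number prefix is not needed, the same filtering can be sketched with tr and grep only, without -P or awk. The sample words and variable names below are illustrative assumptions, not part of the original question.

```shell
# Hypothetical sample input.
words='xxabc jkl abc abcde'

# tr turns every run of non-word characters into a single newline,
# leaving one word per line; grep -v then drops the words ending in abc.
kept=$(printf '%s\n' "$words" | tr -cs '[:alnum:]_' '\n' | grep -v 'abc$')

printf '%s\n' "$kept"
# jkl
# abcde
```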
Consider the following file.txt:
#A00940:70:HTCYYDRXX:2:2101:1561:1063 1:N:0:ATCACG
TAGCACTGGGCTGTGAGACTGTCGTGTGTGCTTTGGATCAAGCAAGATCGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFF:FFF::FFFFFFF:FFFFFFFFFFF
#A00940:70:HTCYYDRXX:2:2101:2175:1063 1:N:0:ATCACG
CGCCCCCTCCTCCGGTCGCCGCCGCGGTGTCCGCGCGTGGGTCCTGAGGGA
+
FFFF:FFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF
#A00940:70:HTCYYDRXX:2:2101:2772:1063 1:N:0:ATCACG
TGGTGGCAGGCACCTGTAATCCCAGCTACTCGGGAGCCTGAGGCAGGAGAA
I am trying to grep, from the lines that start with #, all the characters after the # up to but not including the first colon. In this example the result would be A00940.
I have tried this:
cat file.txt | grep '[^:]*'
and this:
cat file.txt | grep '^(.*?):'
but neither command works. Why is that?
With the samples shown, you could try the following.
awk 'match($0,/#[^:]*/){print substr($0,RSTART+1,RLENGTH-1)}' Input_file
A sed solution: suppress the default printing with -n and print only the lines where the substitution succeeds (the p flag).
sed -n 's/^#\([^:]*\):.*/\1/p' Input_file
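The pattern [^:]* matches any character except :, zero or more times, and does not take the # char into account. Because the quantifier is *, it can also match an empty string, so every line matches.
The pattern ^(.*?): is meant to match from the start of the string, as few characters as possible, up to the first occurrence of :, and it also does not take the # char into account. In addition, without -P grep uses basic regular expressions, where (, ) and ? are literal characters, so the pattern matches none of these lines.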
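A small sketch of what the two attempts actually do, using the first record from file.txt as inline sample data. The variable names are illustrative; grep -c counts matching lines, and -o is assumed to be available (GNU and BSD grep both have it).

```shell
# First record of file.txt, held in a variable for the demonstration.
sample='#A00940:70:HTCYYDRXX:2:2101:1561:1063 1:N:0:ATCACG
TAGCACTGGGCTGTGAGACTGTCGTGTGTGCTTTGGATCAAGCAAGATCGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFF:FFF::FFFFFFF:FFFFFFFFFFF'

# [^:]* may match the empty string, so every one of the 4 lines counts as a match.
all_lines=$(printf '%s\n' "$sample" | grep -c '[^:]*')

# In basic regex syntax ( ) and ? are literal, so no line matches.
# grep exits non-zero when nothing matches, hence the "|| true".
bre_lines=$(printf '%s\n' "$sample" | grep -c '^(.*?):' || true)

# Anchoring on the leading # and stopping before the first colon.
id=$(printf '%s\n' "$sample" | grep -oE '^#[^:]+' | cut -c2-)

echo "$all_lines $bre_lines $id"
# 4 0 A00940
```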
One option is to use -P for a Perl compatible regex with a positive lookbehind to assert an # to the left.
grep -oP '(?<=#)[^#:]+' file.txt
The pattern matches:
(?<=#) Positive lookbehind, assert from the current position an # directly to the left
[^#:]+ Negated character class, match 1+ times any character except # and :
Output
A00940
A00940
A00940
Another option using gawk with a capture group:
gawk 'match($0, /#([^#:]+):/, a) {print a[1]}' file.txt
In addition to the other answers, another way to get the output is with the following
awk -F '#|:' '$2=="A00940" {print $2}' file.txt
That sets the field delimiter to either # or : and then prints the second field of each line where that field equals A00940.
Output:
A00940
A00940
A00940
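The command above hardcodes the expected value. A possible generalisation, sketched here under the assumption that header lines are exactly the ones starting with #, selects those lines and prints their second field. The second record ID (B11111) and the shortened sequences are invented sample data.

```shell
# Hypothetical two-record sample with different IDs.
sample='#A00940:70:HTCYYDRXX:2:2101:1561:1063 1:N:0:ATCACG
TAGCACTG
+
FFFF:FFF
#B11111:70:HTCYYDRXX:2:2101:2175:1063 1:N:0:ATCACG
CGCCCCCT
+
FFFF:FFF'

# Split on # or :, and print field 2 only for lines that begin with #.
ids=$(printf '%s\n' "$sample" | awk -F'[#:]' '/^#/ {print $2}')

printf '%s\n' "$ids"
# A00940
# B11111
```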
Can you please tell me how to get the token value correctly? At the moment I am getting: "1jdq_dnkjKJNdo829n4-xnkwe",258],["FbtResult
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' | sed -n 's/.*"token":\([^}]*\)\}/\1/p'
You need to match the full string, and to get rid of double quotes, you need to match a " before the token and use a negated bracket expression [^"] instead of [^}]:
sed -n 's/.*"token":"\([^"]*\).*/\1/p'
Details:
.* - any zero or more chars
"token":" - a literal "token":" string
\([^"]*\) - Group 1 (\1 refers to this value): any zero or more chars other than "
.* - any zero or more chars.
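To see the substitution on the kind of token quoted in the question (digits, mixed case, _ and -), here is a minimal sketch. The surrounding text is shortened and the variable names are illustrative.

```shell
# Shortened input; the token value is the one quoted in the question.
line='{"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"1jdq_dnkjKJNdo829n4-xnkwe"},258],["FbtResult'

# Capture everything between the quotes after "token": and discard the rest.
token=$(printf '%s\n' "$line" | sed -n 's/.*"token":"\([^"]*\).*/\1/p')

echo "$token"
# 1jdq_dnkjKJNdo829n4-xnkwe
```

Because the capture stops at the first " after the opening quote, nothing that follows the token can leak into the output.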
This replacement works:
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' |
sed -n 's/.*"token":"\([a-z]*\)"\}.*/\1/p'
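The value after "token" is captured between the quotes by \([a-z]*\), followed by a closing brace \} and then .* for the remaining characters (you were missing this last part, which is why the replacement kept the text after the token). Note that [a-z]* only matches lowercase letters, so a token containing digits, uppercase letters, _ or - (like the one in the question) needs a broader class such as [^"]*.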
Output:
aaaaaaa
A grep solution:
echo '{"facebookdotcom":true,"messengerdotcom":false,"workplacedotcom":false},827],["DTSGInitialData",[],{"token":"aaaaaaa"},258],["FbtResult' | grep -Po '(?<="token":")[^"]+'
yields
aaaaaaa
The -P option to grep enables the Perl-compatible regex (PCRE).
The -o option tells grep to print only the matched substring, not the entire line.
The regex (?<="token":") is a PCRE-specific feature called a zero-width positive lookbehind assertion. The expression (?<=pattern) matches a pattern without including it in the matched result.
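PCRE also offers \K, which discards everything matched so far from the reported match; it can stand in for the lookbehind here. A minimal sketch, assuming GNU grep with -P support and a shortened input line:

```shell
# Shortened version of the input from the question.
line='[],{"token":"aaaaaaa"},258],["FbtResult'

# Everything before \K must match but is dropped from the output of -o.
value=$(printf '%s\n' "$line" | grep -oP '"token":"\K[^"]+')

echo "$value"
# aaaaaaa
```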
I ran this example in my bash terminal:
echo "ab" | egrep '\a\b'
Output
ab
then I ran this one:
echo "a" | egrep '\a\b'
Output
a
I was confused. Why did I get output?
But then I input another one:
echo "a" | egrep '\ab'
and didn't get output.
What is the difference between \a\b and \ab for Extended RegEx?
P.S
Regex didn't ignore the first letter:
echo "b" | egrep '\a\b'
Output is empty
Regex acted as I expected, but what is the difference from the first case?
P.P.S
I found out that this example:
echo 'ab' | egrep '\a\b'
also does not output anything. In the first case (echo "ab" | egrep '\a\b'), the output was ab. How can quotes affect regex?
The '\a\b' argument produces the regex \a\b, where \a is an unknown escape sequence that matches a literal a, and \b is a word boundary.
echo "a" | egrep '\a\b' - correctly outputs a: there is an a that is not followed by a digit, letter or underscore.
echo "a" | egrep '\ab' - correctly does not output anything: a does not contain ab.
echo "b" | egrep '\a\b' - correctly does not output anything: b does not contain a.
As for echo "ab" | egrep '\a\b', it should not output any result: ab does not contain an a at a word boundary, because there is a b right after the a.
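The cases above can be reproduced with a short sketch. It uses grep -E (the standard spelling of egrep) and a plain a instead of \a, which matches the same literal character; \b is a GNU extension, so GNU grep is assumed. The variable names are illustrative.

```shell
# grep -c prints the number of matching lines; "|| true" keeps the script
# going when grep exits non-zero because nothing matched.
only_a=$(echo "a"   | grep -cE 'a\b' || true)   # a at end of word
a_then_b=$(echo "ab"  | grep -cE 'a\b' || true) # a followed by b: no boundary
a_space_b=$(echo "a b" | grep -cE 'a\b' || true) # a followed by a space

echo "$only_a $a_then_b $a_space_b"
# 1 0 1
```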
Other letters that mean something else when escaped in GNU grep (these are GNU extensions, not part of POSIX):
\B - a position other than a word boundary position
\s - whitespace
\S - non-whitespace
\w - letters, digits, and underscores
\W - characters different from letters, digits, and underscores.
Also, there is \< (start of word) and \> (end of word).
If you use -P, PCRE pattern, there are even more escape sequences available (see pcresyntax and pcrepattern).
I have a series of phone numbers from which I need to remove the leading zeros and any spaces between the digits.
So far I have got the following Regex expression:
^[^1-9A-Z]+
But I need to add the ability to remove any spaces between numbers.
For example I want
012345 6789
to become:
123456789
Any help would be greatly appreciated!
You replace this pattern with an empty string:
^0+|\s+
The regex consists of two parts, ^0+ and \s+. The regex will match when any of these parts matches. ^0+ matches one or more 0s at the start of the string. \s+ matches one or more whitespace characters.
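One way to apply this alternation from the command line is sketched below. It assumes a sed that supports -E (GNU and BSD sed both do) and uses [[:space:]] in place of \s for portability; the helper name strip_number is hypothetical.

```shell
# Hypothetical helper: remove leading zeros and all whitespace runs.
strip_number() {
  printf '%s\n' "$1" | sed -E 's/^0+|[[:space:]]+//g'
}

one=$(strip_number "012345 6789")
two=$(strip_number "0012345 6789 876 54")

echo "$one"   # 123456789
echo "$two"   # 12345678987654
```

The g flag is what lets both branches fire on the same line: ^0+ can only match once at the start, while the whitespace branch matches every later run.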
Just use | to select either the leading zeros/non-number/non-word characters, OR any spaces, and replace with the empty string:
/^[^1-9A-Z]+|\s/
https://regex101.com/r/j9VJSA/1
$ echo "012345 6789" | sed -e 's/^0*//' -e 's/\([0-9]*\)\s\([0-9]*\)/\1\2/g'
123456789
$ echo "012345 6789 876" | sed -e 's/^0*//' -e 's/\([0-9]*\)\s\([0-9]*\)/\1\2/g'
123456789876
$ echo "0012345 6789 876 54" | sed -e 's/^0*//' -e 's/\([0-9]*\)\s\([0-9]*\)/\1\2/g'
12345678987654
Explanation:
^0* matches all leading zeros
\s matches a whitespace character (a GNU sed extension)
[0-9] matches any single digit
[0-9]* matches zero or more digits
\( and \) delimit a capture group
\1 and \2 refer to the first and second groups
I want to extract a number from a string. This is the string:
#all/30
All I want is 30. How can I extract it?
I tried:
echo "#all/30" | sed 's/.*\/([^0-9])\..*//'
But nothing happens.
How should I write the regular expression?
Sorry for my bad English.
You may consider using grep to extract the numbers from a simple string like this.
echo "#all/30" | grep -o '[0-9]\+'
The -o option prints only the part of the line that matches the pattern.
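When the string contains more than one number, grep -o prints each match on its own line. The sketch below uses a made-up input with two numbers and the portable form [0-9][0-9]* instead of the GNU-specific \+; head -n 1 keeps only the first match.

```shell
# Hypothetical input with two numbers.
s='#all/30/45'

# Every run of digits, one per line.
all_numbers=$(printf '%s\n' "$s" | grep -o '[0-9][0-9]*')

# Only the first run of digits.
first_number=$(printf '%s\n' "$s" | grep -o '[0-9][0-9]*' | head -n 1)

printf '%s\n' "$all_numbers"
# 30
# 45
echo "$first_number"
# 30
```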
You could try the sed command below:
$ echo "#all/30" | sed 's/[^0-9]*\([0-9]\+\)[^0-9]*/\1/'
30
[^0-9]* Here [^...] is a negated character class: it matches any character except the ones listed inside it. So [^0-9]* matches zero or more non-digit characters.
\([0-9]\+\) Captures one or more digit characters.
[^0-9]* Matches zero or more non-digit characters.
Replacing the matched characters with the contents of group 1 gives you the number 30.
echo "all/30" | sed 's/[^0-9]*\/\([0-9][0-9]*\)/\1/'
Avoid putting '.*' in front of the part you want to capture: sed quantifiers are always greedy, so it consumes as much of the string as it can.
echo "all/30" | sed 's/[^0-9]*//g'
# OR
echo "all/30" | sed 's#.*/##'
# OR
echo "all/30" | sed 's#[^0-9]*\([0-9]*\).*#\1#'
Without more information about the possible input strings, we can only assume that the structure is #all/ followed by the number and nothing else.
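Under that same assumption (the number is everything after the last /), the extraction can also be sketched with plain shell parameter expansion, without sed or grep. The variable names are illustrative, and the case statement is an optional sanity check that the result contains only digits.

```shell
s='#all/30'

# ${s##*/} removes the longest prefix that ends in a slash.
num=${s##*/}

# Optional check: reject an empty result or one with a non-digit character.
case $num in
  ''|*[!0-9]*) echo "not a plain number: $num" >&2 ;;
  *)           echo "$num" ;;
esac
# 30
```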