grep a pattern until a specific character (:) - regex

Consider the following file.txt:
#A00940:70:HTCYYDRXX:2:2101:1561:1063 1:N:0:ATCACG
TAGCACTGGGCTGTGAGACTGTCGTGTGTGCTTTGGATCAAGCAAGATCGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFF:FFF::FFFFFFF:FFFFFFFFFFF
#A00940:70:HTCYYDRXX:2:2101:2175:1063 1:N:0:ATCACG
CGCCCCCTCCTCCGGTCGCCGCCGCGGTGTCCGCGCGTGGGTCCTGAGGGA
+
FFFF:FFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF
#A00940:70:HTCYYDRXX:2:2101:2772:1063 1:N:0:ATCACG
TGGTGGCAGGCACCTGTAATCCCAGCTACTCGGGAGCCTGAGGCAGGAGAA
I am trying to grep all the characters of the lines that start with # up to : but not including the colon, in this example the result would be A00940.
I have tried this:
cat file.txt | grep '[^:]*'
and this:
cat file.txt | grep '^(.*?):'
but both commands do not work, why is that?

With your shown samples, could you please try following.
awk 'match($0,/#[^:]*/){print substr($0,RSTART+1,RLENGTH-1)}' Input_file
With sed solution: Simply stop printing for all lines and print only those lines where match is found by regex.
sed -n 's/^#\([^:]*\):.*/\1/p' Input_file

This pattern [^:]* matches 0+ times any character except : which does not take an # char into account. As the quantifier is * it can also match an empty string.
This pattern ^(.*?): matches from the start of the string, as least as possible characters till the first occurrence of : and also does not take the # char into account.
One option is to use -P for a Perl compatible regex with a positive lookbehind to assert an # to the left.
grep -oP '(?<=#)[^#:]+' file.txt
The pattern matches:
(?<=#) Positive lookbehind, assert from the current position an # directly to the left
[^#:]+ Negated character class, match 1+ times any character except # and :
Output
A00940
A00940
A00940
Another option using gawk with a capture group:
gawk 'match($0, /#([^#:]+):/, a) {print a[1]}' file.txt

In addition to the other answers, another way to get the output is with the following
awk -F '#|:' '$2=="A00940" {print $2}' file.txt
That sets the delimiter as either # or : and then prints the second column where the value is A00940:
Output:
A00940
A00940
A00940

Related

Lazy Grep -P: How to show only to the 1st match from the lines

I need to print only the 1st match from each line.
My file contains text something like this:
cat t.txt
abcsuahrcb
abscuharcb
bsaucharcb
absuhcrcab
He is the command I am trying with:
cat t.txt | grep -oP 'a.*?c'
It gives:
abc
ahrc
absc
arc
auc
arc
absuhc
I need it to return:
abc
absc
auc
absuhc
These are the 1st possible matches from each line.
Any other alternatives like sed and aws will work, but not something which needs to be installed on Ubuntu.
Perl to the rescue:
perl -lne 'print $1 if /(a.*?c)/' t.txt
-n reads the input line by line, running the code for each;
-l removes newlines from input lines and adds them to output;
The code tries to match a.*?c, if matched, it stores the result in $1;
As there's no loop, only one match per line is attempted.
A sed variation on The fourth bird's answer:
$ sed -En 's/^[^a]*(a[^c]*c).*/\1/p' t.txt
abc
absc
auc
absuhc
Where:
-En - enable extended regex support, suppress automatic printing of pattern space
^[^a]* - from start of line match all follow-on characters that are not a
(a[^c]*c) - (1st capture group) match letter a plus all follow-on characters that are not c followed by a c
.* - match rest of line
\1/p - print contents of 1st capture group
One awk idea:
$ awk 'match($0,/a[^c]*c/) { print substr($0,RSTART,RLENGTH)}' t.txt
abc
absc
auc
absuhc
Where:
if we find a match then the match() call is non-zero (ie, 'true') so ...
print the substring defined by the RSTART/RLENGTH variables (which are auto-populated by a successful match() call)
Using grep you could write the pattern as matching from the first a to the first c using a negated character class.
Using -P for Perl-compatible regular expressions, you can make use of \K to forget what is matched so far.
Note that you don't have to use cat but you can add the filename at the end.
grep -oP '^[^a]*\Ka[^c]*c' t.txt
The pattern matches:
^ Start of string
[^a]* Optionally match any char except a
\K Forget what is matched so far
a Match literally
[^c]* Optionally match any char except c
c Match literally
Output
abc
absc
auc
absuhc
Another option with gnu-awk and the same pattern, only now using and printing the capture group 1 value:
awk 'match($0,/^[^a]*(a[^c]*c)/, a) { print a[1]}' t.txt

Extract string between underscores and dot

I have strings like these:
/my/directory/file1_AAA_123_k.txt
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
So basically, the number of underscores is not fixed. I would like to extract the string between the first underscore and the dot. So the output should be something like this:
AAA_123_k
CCC
KK_45
I found this solution that works:
string='/my/directory/file1_AAA_123_k.txt'
tmp="${string%.*}"
echo $tmp | sed 's/^[^_:]*[_:]//'
But I am wondering if there is a more 'elegant' solution (e.g. 1 line code).
With bash version >= 3.0 and a regex:
[[ "$string" =~ _(.+)\. ]] && echo "${BASH_REMATCH[1]}"
You can use a single sed command like
sed -n 's~^.*/[^_/]*_\([^/]*\)\.[^./]*$~\1~p' <<< "$string"
sed -nE 's~^.*/[^_/]*_([^/]*)\.[^./]*$~\1~p' <<< "$string"
See the online demo. Details:
^ - start of string
.* - any text
/ - a / char
[^_/]* - zero or more chars other than / and _
_ - a _ char
\([^/]*\) (POSIX BRE) / ([^/]*) (POSIX ERE, enabled with E option) - Group 1: any zero or more chars other than /
\. - a dot
[^./]* - zero or more chars other than . and /
$ - end of string.
With -n, default line output is suppressed and p only prints the result of successful substitution.
With your shown samples, with GNU grep you could try following code.
grep -oP '.*?_\K([^.]*)' Input_file
Explanation: Using GNU grep's -oP options here to print exact match and to enable PCRE regex respectively. In main program using regex .*?_\K([^.]*) to get value between 1st _ and first occurrence of .. Explanation of regex is as follows:
Explanation of regex:
.*?_ ##Matching from starting of line to till first occurrence of _ by using lazy match .*?
\K ##\K will forget all previous matched values by regex to make sure only needed values are printed.
([^.]*) ##Matching everything till first occurrence of dot as per need.
A simpler sed solution without any capturing group:
sed -E 's/^[^_]*_|\.[^.]*$//g' file
AAA_123_k
CCC
KK_45
If you need to process the file names one at a time (eg, within a while read loop) you can perform two parameter expansions, eg:
$ string='/my/directory/file1_AAA_123_k.txt.2'
$ tmp="${string#*_}"
$ tmp="${tmp%%.*}"
$ echo "${tmp}"
AAA_123_k
One idea to parse a list of file names at the same time:
$ cat file.list
/my/directory/file1_AAA_123_k.txt.2
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
$ sed -En 's/[^_]*_([^.]+).*/\1/p' file.list
AAA_123_k
CCC
KK_45
Using sed
$ sed 's/[^_]*_//;s/\..*//' input_file
AAA_123_k
CCC
KK_45
This is easy, except that it includes the initial underscore:
ls | grep -o "_[^.]*"

Capture word after pattern with slash

I want to extract word1 from:
something /CLIENT_LOGIN:word1 something else
I would like to extract the first word after matching pattern /CLIENT_LOGIN:.
Without the slash, something like this works:
A=something /CLIENT_LOGIN:word1 something else
B=$(echo $A | awk '$1 == "CLIENT_LOGIN" { print $2 }' FS=":")
With the slash though, I can't get it working (I tried putting / and \/ in front of CLIENT_LOGIN). I don't care getting it done with awk, grep, sed, ...
Using sed:
s='=something /CLIENT_LOGIN:word1 something else'
sed -E 's~.* /CLIENT_LOGIN:([^[:blank:]]+).*~\1~' <<< "$s"
word1
Details:
We use ~ as regex delimiter in sed
/CLIENT_LOGIN:([^[:blank:]]+) matches /CLIENT_LOGIN: followed by 1+ non-whitespace characters that is captured in group #1
.* on both sides matches text before and after our match
\1 is used in substitution to put 1st group's captured value back in output
1st solution: With your shown samples, please try following GNU grep solution.
grep -oP '^.*? /CLIENT_LOGIN:\K(\S+)' Input_file
Explanation: Simple explanation would be, using GNU grep's o and P options. Which are responsible for printing exact match and enabling PCRE regex. In main program, using regex ^.*? /CLIENT_LOGIN:\K(\S+): which means using lazy match from starting of value to till /CLIENT_LOGIN: to match very first occurrence of string. Then using \K option to forget till now matched values so tat we can print only required values, which is followed by \S+ which means match all NON-Spaces before any space comes.
2nd solution: Using awk's match function along with its split function to print the required value.
awk '
match($0,/\/CLIENT_LOGIN:[^[:space:]]+/){
split(substr($0,RSTART,RLENGTH),arr,":")
print arr[2]
}
' Input_file
3rd solution: Using GNU awk's FPAT option please try following solution. Simple explanation would be, setting FPAT to /CLIENT_LOGIN: followed by all non-spaces values. In main program of awk using sub to substitute everything till : with NULL for first field and then printing first field.
awk -v FPAT='/CLIENT_LOGIN:[^[:space:]]+' '{sub(/.*:/,"",$1);print $1}' Input_file
Performing a regex match and capturing the resulting string in BASH_REMATCH[]:
$ regex='.*/CLIENT_LOGIN:([^[:space:]]*).*'
$ A='something /CLIENT_LOGIN:word1 something else'
$ unset B
$ [[ "${A}" =~ $regex ]] && B="${BASH_REMATCH[1]}"
$ echo "${B}"
word1
Verifying B remains undefined if we don't find our match:
$ A='something without the desired string'
$ unset B
$ [[ "${A}" =~ $regex ]] && B="${BASH_REMATCH[1]}"
$ echo "${B}"
<<<=== nothing output
Fixing your awk command, you can use
A="/CLIENT_IPADDR:23.4.28.2 /CLIENT_LOGIN:xdfmb1d /MXJ_C"
B=$(echo "$A" | awk 'match($0,/\/CLIENT_LOGIN:[^[:space:]]+/){print substr($0,RSTART+14,RLENGTH-14)}')
See the online demo yielding xdfmb1d. Details:
\/CLIENT_LOGIN: - a /CLIENT_LOGIN: string
[^[:space:]]+ - one or more non-whitespace chars
The pattern above is what awk searches for, and once matched, the part of this match value after /CLIENT_LOGIN: is "extracted" using substr($0,RSTART+14,RLENGTH-14) (where 14 is the length of the /CLIENT_LOGIN: string).

How do I take only the first occurrence of a hyphen in sed?

I have a string, for example home/JOHNSMITH-4991-common-task-list, and I want to take out the uppercase part and the numbers with the hyphen between them. I echo the string and pipe it to sed like so, but I keep getting all the hyphens I don't want, e.g.:
echo home/JOHNSMITH-4991-common-task-list | sed 's/[^A-Z0-9-]//g'
gives me:
JOHNSMITH-4991---
I need:
JOHNSMITH-4991
How do I ignore all but the first hyphen?
You can use
sed 's,.*/\([^-]*-[^-]*\).*,\1,'
POSIX BRE regex details:
.* - any zero or more chars
/ - a / char
\([^-]*-[^-]*\) - Group 1: any zero or more chars other than -, a hyphen, and then again zero or more chars other than -
.* - any zero or more chars
The replacement is the Group 1 placeholder, \1, to restore just the text captured.
See the online demo:
#!/bin/bash
s="home/JOHNSMITH-4991-common-task-list"
sed 's,.*/\([^-]*-[^-]*\).*,\1,' <<< "$s"
# => JOHNSMITH-4991
1st solution: With awk it will be much easier and we could keep it simple, please try following, written and tested with your shown samples.
echo "echo home/JOHNSMITH-4991-common-task-list" | awk -F'/|-' '{print $2"-"$3}'
Explanation: Simple explanation would be, setting field separator as / OR - and printing 2nd field - and 3rd field of current line.
2nd solution: Using match function of awk program here.
echo "echo home/JOHNSMITH-4991-common-task-list" |
awk '
match($0,/\/[^-]*-[^-]*/){
print substr($0,RSTART+1,RLENGTH-1)
}'
3rd solution: Using GNU grep solution here. Using -oP option of grep here, to print matched values with o option and to enable ERE(extended regular expression) with P option. Then in main program of grep using .*/ followed by \K to ignore previous matched part and then mentioning [^-]*-[^-]* to make sure to get values just before 2nd occurrence of - in matched line.
echo "echo home/JOHNSMITH-4991-common-task-list" | grep -oP '.*/\K[^-]*-[^-]*'
Here is a simple alternative solution using cut with bash string substitution:
s='home/JOHNSMITH-4991-common-task-list'
cut -d- -f1-2 <<< "${s##*/}"
JOHNSMITH-4991
You could match until the first occurrence of the /, then clear the match buffer with \K and then repeat the character class 1+ times with a hyphen in between to select at least characters before and after the hyphen.
[^/]*/\K[A-Z0-9]+-[A-Z0-9]+
If supported, using gnu grep:
echo "echo home/JOHNSMITH-4991-common-task-list" | grep -oP '[^/]*/\K[A-Z0-9]+-[A-Z0-9]+'
Output
JOHNSMITH-4991
If gnu awk is an option, using the same pattern but with a capture group:
echo "home/JOHNSMITH-4991-common-task-list" | awk 'match($0, /[^\/]*\/([A-Z0-9]+-[A-Z0-9]+)/, a) {print a[1]}'
If the desired output is always the first match where the character class with a hyphen matches:
echo "home/JOHNSMITH-4991-common-task-list" | awk -v FPAT="[A-Z0-9]+-[A-Z0-9]+" '$0=$1'
Output
JOHNSMITH-4991
Assumptions:
could be more than one fwd slash in string
(after the last fwd slash) there are 2 or more hyphens in the string
desired output is between last fwd slash and 2nd hyphen
One idea using parameter substitutions:
$ string='home/dir/JOHNSMITH-4991-common-task-list'
$ string1="${string##*/}"
$ typeset -p string1
declare -- string1="JOHNSMITH-4991-common-task-list"
$ string1="${string1%%-*}"
$ typeset -p string1
declare -- string1="JOHNSMITH"
$ string2="${string#*-}"
$ typeset -p string2
declare -- string2="4991-common-task-list"
$ string2="${string2%%-*}"
$ typeset -p string2
declare -- string2="4991"
$ newstring="${string1}-${string2}"
$ echo "${newstring}"
JOHNSMITH-4991
NOTES:
typeset commands added solely to show progression of values
a bit of typing but if doing this a lot of times in bash the overall performance should be good compared to other solutions that require spawning a sub-process
if there's a need to parse a large number of strings best performance will come from streaming all strings at once (via a file?) to one of the other solutions (eg, a single awk call that processes all strings will be faster than running the set of strings through a bash loop and performing all of these parameter substitutions)

linux extract only a string starts with a special string and ends with the first occurrence of comma

I have a log file contains some information like below
"variable1=XXX, emotionType=sad, sentimentType=negative..."
What I want is to grep only the matched string, the string starts with emotionType and ends with the first occurrence of comma.
E.g.
emotionType=sad
emotionType=joy
...
What I have tried is
grep -e "/^emotionType.*,/" file.log -o
but I got nothing. Anyone can tell me what should I do?
You need to use
grep -o "emotionType[^,]*" file.log
Note:
Remove ^ or replace with \<, starting word boundary construct if your matches are not located at the beginning of each line
Remove the / chars on both ends of the regex since grep does not use regex delimiters (like sed)
[^,] is a negated bracket expression that matches any char other than a comma
* is a POSIX BRE quantifier that matches zero or more occurrences.
See an online demo:
#!/bin/bash
s="variable1=XXX, emotionType=sad, sentimentType=negative, emotionType=happy"
grep -o "emotionType=[^,]*" <<< "$s"
Output:
emotionType=sad
emotionType=happy
1st solution: With awk you could try following program. Simple explanation would be using awk's match function capability and using regex to match string emotionType till next occurrence of , and printing all the matches in awk program.
var="variable1=XXX, emotionType=sad, sentimentType=negative, emotionType=happy"
Where var is a shell variable.
echo "$var" |
awk '{while(match($0,/emotionType=[^,]*/)){print substr($0,RSTART,RLENGTH);$0=substr($0,RSTART+RLENGTH)}}'
2nd solution: Or in GNU awk using RS variable try following awk program.
echo "$var" | awk -v RS='emotionType=[^,]*' 'RT{sub(/\n+$/,"",RT);print RT}'