grep or sed for word containing string - regex

example file:
blahblah 123.a.site.com some-junk
yoyoyoyo 456.a.site.com more-junk
hihohiho 123.a.site.org junk-in-the-trunk
lalalala 456.a.site.org monkey-junk
I want to grep out all the domains in the middle of each line. They all share the common part a.site, which I can grep for, but I can't work out how to do it without returning the whole line.
Maybe sed or a regex is needed here, as a simple grep isn't enough?

You can do:
grep -o '[^ ]*a\.site[^ ]*' input
or
awk '{print $2}' input
or
sed -e 's/.* \([^ ]*a\.site[^ ]*\).*/\1/' input
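To check the grep variant against the sample data, here is a minimal sketch. The function name extract_domains and the inline sample variable are made up for illustration, and grep -o is a GNU/BSD extension rather than a POSIX option.

```shell
# Hypothetical helper: print every space-delimited word that contains "a.site".
extract_domains() {
    grep -o '[^ ]*a\.site[^ ]*'
}

# Sample lines copied from the question.
sample='blahblah 123.a.site.com some-junk
yoyoyoyo 456.a.site.com more-junk
hihohiho 123.a.site.org junk-in-the-trunk
lalalala 456.a.site.org monkey-junk'

printf '%s\n' "$sample" | extract_domains
# 123.a.site.com
# 456.a.site.com
# 123.a.site.org
# 456.a.site.org
```

The bracket expression [^ ]* on both sides is what grows the match out to the word boundaries, so the position of the word on the line does not matter.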

Try this to pick out a part of the second field by position. As written it prints the text between the leading number and the last dot (a.site for the sample lines):
$ sed -r "s/.* ([0-9]*)\.(.*)\.(.*)/\2/g"
[0-9]* - matches a digit zero or more times.
.* - matches any character zero or more times.
\. - matches a literal dot.
() - captures the text matched by the enclosed expression; refer to it in the replacement as \1, \2 ... \9. Only nine groups (\1 to \9) are available. In GNU sed, \0 stands for the whole matched text.
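As a small illustration of capture groups and backreferences, the sketch below pulls two pieces out of the second column and prints them side by side. The helper name split_host is hypothetical, and the pattern assumes the host always looks like NUMBER.a.site.TLD as in the sample file.

```shell
# Hypothetical helper: print the numeric prefix and the top-level domain
# of the host in column two. \1 is the first group, \2 the second.
split_host() {
    sed -E 's/^[^ ]+ ([0-9]+)\.a\.site\.([a-z]+) .*$/\1 \2/'
}

printf '%s\n' 'blahblah 123.a.site.com some-junk' | split_host
# 123 com
printf '%s\n' 'hihohiho 123.a.site.org junk-in-the-trunk' | split_host
# 123 org
```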

Related

How do I take only the first occurrence of a hyphen in sed?

I have a string, for example home/JOHNSMITH-4991-common-task-list, and I want to take out the uppercase part and the numbers with the hyphen between them. I echo the string and pipe it to sed like so, but I keep getting all the hyphens I don't want, e.g.:
echo home/JOHNSMITH-4991-common-task-list | sed 's/[^A-Z0-9-]//g'
gives me:
JOHNSMITH-4991---
I need:
JOHNSMITH-4991
How do I ignore all but the first hyphen?
You can use
sed 's,.*/\([^-]*-[^-]*\).*,\1,'
POSIX BRE regex details:
.* - any zero or more chars
/ - a / char
\([^-]*-[^-]*\) - Group 1: any zero or more chars other than -, a hyphen, and then again zero or more chars other than -
.* - any zero or more chars
The replacement is the Group 1 placeholder, \1, to restore just the text captured.
See the online demo:
#!/bin/bash
s="home/JOHNSMITH-4991-common-task-list"
sed 's,.*/\([^-]*-[^-]*\).*,\1,' <<< "$s"
# => JOHNSMITH-4991
1st solution: awk keeps this simple. Written and tested with your shown samples:
echo "home/JOHNSMITH-4991-common-task-list" | awk -F'/|-' '{print $2"-"$3}'
Explanation: set the field separator to / or -, then print the 2nd field, a hyphen, and the 3rd field of the current line.
2nd solution: using awk's match function.
echo "home/JOHNSMITH-4991-common-task-list" |
awk '
match($0,/\/[^-]*-[^-]*/){
print substr($0,RSTART+1,RLENGTH-1)
}'
3rd solution: using GNU grep. The -o option prints only the matched text, and -P enables PCRE (Perl-compatible regular expressions). The pattern matches .*/ and then uses \K to discard what has been matched so far, so only [^-]*-[^-]*, the text up to the 2nd occurrence of -, is printed.
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '.*/\K[^-]*-[^-]*'
Here is a simple alternative using cut with bash parameter expansion:
s='home/JOHNSMITH-4991-common-task-list'
cut -d- -f1-2 <<< "${s##*/}"
JOHNSMITH-4991
You could match up to the first occurrence of /, reset the start of the match with \K, and then repeat the character class one or more times on each side of a hyphen, so that at least one character is required before and after the hyphen.
[^/]*/\K[A-Z0-9]+-[A-Z0-9]+
If supported, using GNU grep:
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '[^/]*/\K[A-Z0-9]+-[A-Z0-9]+'
Output
JOHNSMITH-4991
If gnu awk is an option, using the same pattern but with a capture group:
echo "home/JOHNSMITH-4991-common-task-list" | awk 'match($0, /[^\/]*\/([A-Z0-9]+-[A-Z0-9]+)/, a) {print a[1]}'
If the desired output is always the first match where the character class with a hyphen matches:
echo "home/JOHNSMITH-4991-common-task-list" | awk -v FPAT="[A-Z0-9]+-[A-Z0-9]+" '$0=$1'
Output
JOHNSMITH-4991
Assumptions:
could be more than one fwd slash in string
(after the last fwd slash) there are 2 or more hyphens in the string
desired output is between last fwd slash and 2nd hyphen
One idea using parameter substitutions:
$ string='home/dir/JOHNSMITH-4991-common-task-list'
$ string1="${string##*/}"
$ typeset -p string1
declare -- string1="JOHNSMITH-4991-common-task-list"
$ string1="${string1%%-*}"
$ typeset -p string1
declare -- string1="JOHNSMITH"
$ string2="${string#*-}"
$ typeset -p string2
declare -- string2="4991-common-task-list"
$ string2="${string2%%-*}"
$ typeset -p string2
declare -- string2="4991"
$ newstring="${string1}-${string2}"
$ echo "${newstring}"
JOHNSMITH-4991
NOTES:
typeset commands added solely to show progression of values
a bit of typing but if doing this a lot of times in bash the overall performance should be good compared to other solutions that require spawning a sub-process
if there's a need to parse a large number of strings best performance will come from streaming all strings at once (via a file?) to one of the other solutions (eg, a single awk call that processes all strings will be faster than running the set of strings through a bash loop and performing all of these parameter substitutions)
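The same idea can be wrapped in a function, as a rough sketch. The name first_two_fields is made up; unlike the step-by-step session, it takes the second field from the already-trimmed base name, so hyphens in directory names are tolerated. It still assumes at least two hyphens after the last slash.

```shell
# Hypothetical helper: keep the text between the last "/" and the second "-".
# Uses only POSIX parameter expansion, so the function body spawns no sub-process.
# Note: plain sh has no "local", so these variables are global.
first_two_fields() {
    base=${1##*/}        # drop everything up to the last slash
    first=${base%%-*}    # text before the first hyphen
    rest=${base#*-}      # text after the first hyphen
    second=${rest%%-*}   # text before the next hyphen
    printf '%s-%s\n' "$first" "$second"
}

first_two_fields 'home/dir/JOHNSMITH-4991-common-task-list'
# JOHNSMITH-4991
```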

linux extract only a string starts with a special string and ends with the first occurrence of comma

I have a log file that contains information like this:
"variable1=XXX, emotionType=sad, sentimentType=negative..."
What I want is to grep only the matched string: the string that starts with emotionType and ends at the first occurrence of a comma.
E.g.
emotionType=sad
emotionType=joy
...
What I have tried is
grep -e "/^emotionType.*,/" file.log -o
but I got nothing. Can anyone tell me what I should do?
You need to use
grep -o "emotionType[^,]*" file.log
Note:
Remove ^, or replace it with \< (a start-of-word boundary), if your matches are not located at the beginning of each line
Remove the / chars on both ends of the regex, since grep does not use regex delimiters (unlike sed)
[^,] is a negated bracket expression that matches any char other than a comma
* is a POSIX BRE quantifier that matches zero or more occurrences.
See an online demo:
#!/bin/bash
s="variable1=XXX, emotionType=sad, sentimentType=negative, emotionType=happy"
grep -o "emotionType=[^,]*" <<< "$s"
Output:
emotionType=sad
emotionType=happy
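To see why the negated bracket expression matters, this short sketch runs a greedy .*, pattern and the [^,]* pattern on the same line. The variable name line is only for illustration.

```shell
line='variable1=XXX, emotionType=sad, sentimentType=negative, emotionType=happy'

# Greedy: ".*," runs on to the LAST comma on the line.
printf '%s\n' "$line" | grep -o 'emotionType.*,'
# emotionType=sad, sentimentType=negative,

# Negated bracket expression: each match stops before the next comma.
printf '%s\n' "$line" | grep -o 'emotionType[^,]*'
# emotionType=sad
# emotionType=happy
```

The greedy version also misses the final emotionType=happy, because no comma follows it.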
1st solution: with awk you could try the following program. It uses awk's match function with a regex that matches emotionType up to the next comma, and loops so that every match on the line is printed.
var="variable1=XXX, emotionType=sad, sentimentType=negative, emotionType=happy"
Where var is a shell variable.
echo "$var" |
awk '{while(match($0,/emotionType=[^,]*/)){print substr($0,RSTART,RLENGTH);$0=substr($0,RSTART+RLENGTH)}}'
2nd solution: or, in GNU awk, set RS to the pattern and print RT, the text that matched the record separator.
echo "$var" | awk -v RS='emotionType=[^,]*' 'RT{sub(/\n+$/,"",RT);print RT}'

Extract QueryString value using sed

I have the following lines in an apache access log
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229656&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229657&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229658&blah
and I want to extract only the MSISDN value, so the expected output would be
647930229655
647930229656
647930229657
647930229658
I'm using the following sed command, but I can't get it to stop capturing at &
sed 's/.*MSISDN=\(.*\)/\1/'
sed solution:
sed -E 's/.*&MSISDN=([^&]+).*/\1/' file
& - is key/value pair separator in URL syntax, so you should rely on it
([^&]+) - 1st captured group containing any character sequence except &
\1 - backreference to the 1st captured group
The output:
647930229655
647930229656
647930229657
647930229658
-o: prints only the matching string, not the whole line.
-P: enables PCRE regex.
\K: drops everything matched to its left from the output, although that text must still be present in the input.
\d: means digit; + means one or more of them.
grep -oP 'MSISDN=\K\d+' input
647930229655
647930229656
647930229657
647930229658
The following simple sed may help:
sed 's/.*MSISDN=//;s/&.*//' Input_file
Explanation:
s/.*MSISDN=//: substitutes everything up to and including MSISDN= with an empty string.
;: the semicolon separates this from the next sed command.
s/&.*//: substitutes everything from the first remaining & to the end of the line with an empty string.
$ grep -oP '(?<=&MSISDN=)\d+' file
647930229655
647930229656
647930229657
647930229658
-o option is meant to show only matched output
-P option is meant to enable PCRE (Perl Compatible Regex)
(?<=regex) is a positive lookbehind assertion. Lookarounds, unlike normal regex constructs, do not consume any characters while matching, so the only matched output is \d+, which is 1 or more digits.
or using sed:
$ sed -r 's/^.*MSISDN=([0-9]+).*$/\1/' file
647930229655
647930229656
647930229657
647930229658
You can also pipe cut into cut, provided MSISDN is always the third &-separated field:
cut -d '&' -f3 Input_file |cut -d '=' -f2
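If the parameter can appear anywhere in the query string, or be missing on some lines, a position-independent variant may be handier. This is only a sketch: get_param is a hypothetical helper, it assumes the parameter name contains no regex metacharacters, and it treats a space as the end of the value (as in a typical access log line).

```shell
# Hypothetical helper: print the value of query parameter $1 for each
# input line that has it. -n plus the trailing "p" suppresses lines
# where the parameter is absent.
get_param() {
    sed -n -E "s/.*[?&]$1=([^& ]*).*/\1/p"
}

printf '%s\n' \
    '/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah' \
    '/sms/receiveHLRLookup?MSISDN=647930229656&Ported=No' \
    '/sms/other?Ported=No' |
    get_param MSISDN
# 647930229655
# 647930229656
```

Requiring ? or & before the name keeps a parameter such as OLDMSISDN (made up here) from matching by accident.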

Regex for uppercase matches with exclusions

I'm trying to come up with a regex for the following case: I need to find any matching paths using grep for the following paths:
Include all uppercase matching paths.
Example:
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
Notice the capital B in Bar.
Exclude all uppercase matching paths that only contain SNAPSHOT and have no other uppercase letters.
Example:
com/foo/bar/1.2.3-SNAPSHOT/bar-1.2.3-SNAPSHOT.jar
Is this possible with grep?
Something like this might do:
grep -vE '^([^[:upper:]]*(SNAPSHOT)?)*$'
Breakdown:
-v inverts the match (shows all non-matching lines). -E enables extended regular expressions.
^ # Start of line
( )* # Capturing group repeated zero or more times
[^[:upper:]]* # Match all but uppercase zero or more times
(SNAPSHOT)? # Followed by literal SNAPSHOT zero or one time
$ # End of line
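A quick way to try the pattern, as a sketch: the helper name has_other_uppercase is made up, and the third path (no uppercase at all) is an extra test case that is not in the question.

```shell
# Hypothetical helper: keep paths that contain an uppercase letter
# outside the literal SNAPSHOT.
has_other_uppercase() {
    grep -vE '^([^[:upper:]]*(SNAPSHOT)?)*$'
}

printf '%s\n' \
    'com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar' \
    'com/foo/bar/1.2.3-SNAPSHOT/bar-1.2.3-SNAPSHOT.jar' \
    'com/foo/bar/1.2.3/bar-1.2.3.jar' |
    has_other_uppercase
# com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
```

Paths with no uppercase letters at all also match the inner pattern, so -v drops them together with the SNAPSHOT-only ones.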
Just use awk:
$ cat file
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
com/foo/bar/1.2.3-SNAPSHOT/bar-1.2.3-SNAPSHOT.jar
With GNU awk for gensub():
$ awk 'gensub(/SNAPSHOT/,"","g")~/[[:upper:]]/' file
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
With other awks:
$ awk '{r=$0; gsub(/SNAPSHOT/,"",r)} r~/[[:upper:]]/' file
com/foo/Bar/1.2.3-SNAPSHOT/Bar-1.2.3-SNAPSHOT.jar
Well, you need find to list all paths. Then you can do it with two grep runs: the first keeps paths that contain a capital letter, the second excludes those that contain no capitals except SNAPSHOT:
find . | grep '[A-Z]' | grep -v '.*\/[^A-Z]*SNAPSHOT[^A-Z]*$'
I think only the last grep needs some explanation:
grep -v excludes the matching lines
.*\/ greedily matches everything up to the last slash that still lets the rest of the pattern match. There'll always be a slash due to find .
[^A-Z]* matches any run of characters that are not capital letters. So we apply it before and after the SNAPSHOT literal, up to the end of the string.
If you only want the matching files, I would do it like this:
find . -type f -regex '.*[A-Z].*' | while read -r line; do echo "$line" | sed 's/SNAPSHOT//g' | grep -q '.*[A-Z].*' && echo "$line"; done

regular expression to extract number from string

I want to extract a number from a string. This is the string:
#all/30
All I want is 30. How can I extract it?
I tried:
echo "#all/30" | sed 's/.*\/([^0-9])\..*//'
But nothing happens.
How should I write the regular expression?
Sorry for my bad English.
You may consider using grep to extract the numbers from a simple string like this.
echo "#all/30" | grep -o '[0-9]\+'
-o option shows only the matching part that matches the pattern.
You could try the below sed command,
$ echo "#all/30" | sed 's/[^0-9]*\([0-9]\+\)[^0-9]*/\1/'
30
[^0-9]* [^...] is a negated character class: it matches any character except those listed inside it. [^0-9]* matches zero or more non-digit characters.
\([0-9]\+\) Captures one or more digit characters.
[^0-9]* Matches zero or more non-digit characters.
Replacing the matched text with the contents of group 1 gives you the number 30.
echo "all/30" | sed 's/[^0-9]*\/\([0-9][0-9]*\)/\1/'
Avoid a bare '.*' directly in front of a group that can match nothing: matching is greedy by default, so '.*' can consume the entire string and leave the group empty.
echo "all/30" | sed 's/[^0-9]*//g'
# OR
echo "all/30" | sed 's#.*/##'
# OR
echo "all/30" | sed 's#.*/\([0-9]*\)#\1#'
Without more information about the possible input strings, we can only assume that the structure is #all/ followed by the number (only).
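How much that assumption matters shows up as soon as the input holds a second number. In the sketch below, the input #all2/30 is made up for illustration; -E and -o are GNU/BSD options.

```shell
s='#all2/30'   # hypothetical input with an extra number before the slash

# Every run of digits, one per line.
printf '%s\n' "$s" | grep -oE '[0-9]+'
# 2
# 30

# Only the digits at the very end of the string.
printf '%s\n' "$s" | sed -E 's/.*[^0-9]([0-9]+)$/\1/'
# 30
```

Anchoring the group to the end of the string with $, and forcing a non-digit right before it, keeps the greedy .* from swallowing part of the number.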