Sed to match any AWS ARN regex

I am trying to find AWS ARNs in logs and change their color.
The regex I tried:
echo 'arn:aws:lambda:us-east-2:421142534505:function:list-users"' | sed -e "s/arn:\S*[^\s\"]/$(tput bold setaf 1)&$(tput setaf 9)/gi"
result I got
I want only arn:aws:lambda:us-east-2:421142534505:function:list-users to be colored (no surrounding whitespace or " should be colored).

Using sed
$ sed "s/arn:aws[^\" ]*/$(tput setaf 1)&$(tput sgr 0)/g" input_file
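As a rough sketch of the same idea without depending on tput (which needs a usable TERM), the escape codes can be written out by hand. The function name highlight_arns and the sample log line are made up for illustration, and the hard-coded codes assume an ANSI-compatible terminal.

```shell
# Hypothetical helper: wrap every ARN read from stdin in red/reset escape codes.
highlight_arns() {
  red=$(printf '\033[31m')    # ANSI "red foreground"
  reset=$(printf '\033[0m')   # ANSI "reset all attributes"
  # The match stops at the first double quote or whitespace character,
  # so neither ends up inside the colored span.
  sed "s/arn:aws[^\"[:space:]]*/${red}&${reset}/g"
}

printf '%s\n' 'invoked arn:aws:lambda:us-east-2:421142534505:function:list-users" ok' | highlight_arns
```

Using [:space:] instead of a literal space also stops the match at tabs.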

Related

Using sed for extracting substring from string

I just started using sed for doing regex. I wanted to extract XXXXXX from *****/XXXXXX> so I was following
sed -n "/^/*/(\S*\).>$/p"
If I do so I get following error
sed: 1: "/^//(\S).>$/p": invalid command code *
I am not sure what I am missing here.
Try:
$ echo '*****/XXXXXX>' | sed 's|.*/||; s|>.*||'
XXXXXX
The substitute command s|.*/|| removes everything up to the last / in the string. The substitute command s|>.*|| removes everything from the first > in the string that remains to the end of the line.
Or:
$ echo '*****/XXXXXX>' | sed -E 's|.*/(.*)>|\1|'
XXXXXX
The substitute command s|.*/(.*)>|\1| captures whatever is between the last / and the last > and saves it in group 1. That is then replaced with group 1, \1.
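To apply the capture-group approach to a whole file while skipping lines that do not have the expected shape, one option is to combine -n with the p flag. This is only a sketch; the function name extract_field and the sample lines are invented for illustration.

```shell
# Hypothetical helper: print only the text between the last "/" and a trailing ">".
extract_field() {
  # -n suppresses the default output; the p flag prints a line only
  # when the substitution actually matched.
  sed -n -E 's|.*/(.*)>$|\1|p'
}

printf '%s\n' '*****/XXXXXX>' 'no delimiters here' 'a/b/c>' | extract_field
```

With these three sample lines it prints XXXXXX and c; the middle line is skipped because it does not match.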
In my opinion awk handles this task better. With -F you can specify multiple delimiters, such as "/" and ">". The wanted text is then the second field:
echo "*****/XXXXXX>" | awk -F'/|>' '{print $2}'
Of course you could use sed, but it's harder to read. First remove the leading part (everything up to the last "/"), then remove the ">":
echo "*****/XXXXXX>" | sed -e 's/.*[/]//' -e 's/>//'
Both print the expected result: XXXXXX.
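With grep, if it supports the -P (PCRE) option:
$ echo '*****/XXXXXX>' | grep -oP '/\K[^>]+'
XXXXXX
/\K matches a / and then drops it from the reported match (similar in effect to a lookbehind), so it is not part of the output
[^>]+ matches one or more characters other than >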
echo '*****/XXXXXX>' |sed 's/^.*\/\|>$//g'
XXXXXX
This matches either everything from the start of the line through the last /, or a > at the end of the line, and replaces whichever it finds with nothing. Note that \| is a GNU extension to basic regular expressions.

Apply regex on matched substring

I have few thousands of text lines like this:
go to <CITY>rome</CITY> <COUNTRY>italy</COUNTRY>
My desired output is to replace everything from the first tagged word (rome) to the last one (italy) and put tag:
go to <ADDRESS>rome italy</ADDRESS>
I can match the portion of the text line which is tagged with:
<.*>
This will greedily select all text from first < to last >. I would like then the tags removed and put <ADDRESS> and </ADDRESS> around the matched portion.
The possible tags are: <STREETNUM>, <STREET>, <CITY>, <STATE>, <ZIP> and <COUNTRY>. Any subset of these tags can appear and in any order. The tags are never nested.
I have searched SO and googled to no avail. Perhaps I can use a named capturing group and then apply a search/replace regex to it, but I don't know how. Any help would be appreciated.
This sed line will do it:
sed 's/<CITY>\(.*\)<\/CITY>.*<COUNTRY>\(.*\)<\/COUNTRY>/<ADDRESS>\1 \2<\/ADDRESS> /g'
For example:
sed 's/<CITY>\(.*\)<\/CITY>.*<COUNTRY>\(.*\)<\/COUNTRY>/<ADDRESS>\1 \2<\/ADDRESS> /g' <<< "go to <CITY>rome</CITY> <COUNTRY>italy</COUNTRY>"
It prints:
go to <ADDRESS>rome italy</ADDRESS>
It captures what is inside the CITY tag and inside the COUNTRY tag, then replaces the whole match with the two captured values enclosed in an ADDRESS tag.
With the -E flag (extended regular expressions, available in both GNU and BSD sed) you don't need to escape the parentheses:
sed -E 's/<CITY>(.*)<\/CITY>.*<COUNTRY>(.*)<\/COUNTRY>/<ADDRESS>\1 \2<\/ADDRESS> /g'
UPDATE:
To achieve the expected result you could use several commands in the following order of operation:
Remove the go to text: sed 's/go to //g'
Remove all the tag characters: tr -d '</>'
Once all tag chars are removed, you can safely delete the words STREETNUM, STREET, CITY, STATE, ZIP and COUNTRY from the input:
sed -E 's/CITY|COUNTRY|STATE|ZIP|STREETNUM|STREET//g'
Take the output generated from the previous commands concatenation and output it inside the <ADDRESS></ADDRESS> tags:
xargs -I{} echo "go to <ADDRESS>{}</ADDRESS>"
The final command is the following, where $LINE contains the line to process:
sed 's/go to //g' <<< "$LINE" | tr -d '</>' | sed -E 's/CITY|COUNTRY|STATE|ZIP|STREETNUM|STREET//g' | xargs -I{} echo "go to <ADDRESS>{}</ADDRESS>"
An example:
Running:
sed 's/go to //g' <<< "go to <STATE>Bolivar</STATE> <COUNTRY>Venezuela</COUNTRY> <STREETNUM>5</STREETNUM> " | tr -d '</>' | sed -E 's/CITY|COUNTRY|STATE|ZIP|STREETNUM|STREET//g' | xargs -I{} echo "go to <ADDRESS>{}</ADDRESS>"
Will print:
go to <ADDRESS>Bolivar Venezuela 5 </ADDRESS>
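For the general case in the question (any subset of the six tags, in any order), a single sed invocation can also do the job without hard-coding the "go to" prefix. The following is only a sketch: the function name to_address is invented, and it assumes at most one run of address tags per line.

```shell
# Hypothetical helper: merge the address tags on each line into one <ADDRESS> element.
to_address() {
  tags='STREETNUM|STREET|CITY|STATE|ZIP|COUNTRY'
  # Commas are used as the s delimiter so the "/" in closing tags needs no escaping.
  sed -E "/<($tags)>/ {
    # the first opening tag becomes <ADDRESS>
    s,<($tags)>,<ADDRESS>,
    # the last closing tag becomes </ADDRESS> (the greedy .* skips ahead to it)
    s,(.*)</($tags)>,\1</ADDRESS>,
    # any of the six tags still left in between are dropped
    s,</?($tags)>,,g
  }"
}

printf '%s\n' 'go to <CITY>rome</CITY> <COUNTRY>italy</COUNTRY>' | to_address
```

Lines without any of the tags pass through unchanged, because the three substitutions only run on lines that match the address at the top.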

Linux SED RegEx replace, but keep wildcards

If I have a string that contains this somewhere (Foo could be anything):
<tag>Foo</tag>
How would I, using SED and RegEx, replace it with this:
[tag]Foo[/tag]
My failed attempt:
echo "<tag>Foo</tag>" | sed "s/<tag>\(.*\)<\\/tag>/[tag]\1[\\/tag]"
Your regex is missing the terminating /
$ echo "<tag>Foo</tag>" | sed "s/<tag>\(.*\)<\\/tag>/[tag]\1[\\/tag]/"
[tag]Foo[/tag]
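With the following you can replace any kind of tag, without naming a specific one. Note that the leading [^<]* also consumes any text before the first tag, so that text is dropped from the output.
$ echo "<tag>Foo</tag>" | sed "s/[^<]*<\([^>]*\)>\([^<]*\)<\([^>]*\)>/[\1]\2[\3]/"
Hope this helps.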
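If the surrounding text should be kept and a line may contain several tags, a simpler per-tag substitution with the g flag may be enough. This is a sketch under that assumption; the function name tags_to_brackets and the sample input are made up.

```shell
# Hypothetical helper: turn every <name> or </name> into [name] or [/name].
tags_to_brackets() {
  # The group captures the tag name together with an optional leading slash.
  sed -E 's|<(/?[^<>]+)>|[\1]|g'
}

echo 'see <b>Foo</b> and <i>Bar</i>' | tags_to_brackets
```

Because each tag is rewritten on its own, text before, between, and after the tags is left as it is.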

Removing cruft lines with sed

How can you write a single sed command that will remove lines that contain any of several regular expressions?
For example I want sed to remove "/./. ::", ":: ", "::foo", and "^^bar" from a document.
As of now, when I run
sed -ir '/ "//.//. ::|:: _|::foo_|^^bar" /d' text.file
the response is "unknown command '/'".
This is the case with or without the inner "s around the regex:
sed -ir '/ //.//. ::|:: _|::foo_|^^bar /d' text.file
Also if I remove the escape(/) before the '/' such as:
sed -ir '/ "/./. ::|:: _|::foo_|^^bar" /d' text.file
the return becomes "unknown command '.'"
You need to escape each / with a backslash (\), not with another slash.
There is also no need to quote the pattern or add underscores. Note as well that -ir is parsed as -i with the backup suffix "r", not as -i plus -r, so pass the two options separately:
sed -i -r '/\/.\/. ::|:: |::foo|\^\^bar/d' file.txt
You may also want to escape the . characters; an unescaped dot matches any character.
HTH
sed '{
s/::foo_//
s/\^\^bar//
s!/./. ::!!
s/:: _//
}' input_file
To totally delete lines containing patterns:
sed '{
/::foo_/d
/\^\^bar/d
\!/./. ::!d
/:: _/d
}' input_file
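Before editing a file in place, the single-command form with alternation can be tried on a few sample lines. This is a minimal sketch: the function name drop_cruft and the sample lines are made up, and the dots are escaped so they match literally.

```shell
# Hypothetical helper: drop every line that matches any of the unwanted patterns.
drop_cruft() {
  # ERE alternation; slashes, dots and carets are escaped so they match literally.
  sed -E '/\/\.\/\. ::|:: |::foo|\^\^bar/d'
}

printf '%s\n' 'keep me' 'x /./. :: y' 'a ::foo b' '^^bar' 'c :: d' 'also keep' | drop_cruft
```

Only the lines "keep me" and "also keep" survive; once the output looks right, the same expression can be used with -i on the real file.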

Match domain name from url (www.google.com=google)

So I want to match just the domain from either:
http://www.google.com/test/
http://google.com/test/
http://google.net/test/
Output should be for all 3: google
I got this code working for just .com
echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.com.*$/\1/p"
Output: 'google'
Then I thought it would be as simple as doing say (com|net) but that doesn't seem to be true:
echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.(com|net).*$/\1/p"
Output: '' (nothing)
I was going to use a similar method to get rid of the "www" but it seems I'm doing something wrong… (does it not work with regex outside the \( \) …)
This will output "google" in all cases:
sed -n "s|http://\(.*\.\)*\(.*\)\..*|\2|p"
Edit:
This version will handle URLs like "http://google.com.cn/test" and "http://www.google.co.uk/" as well as the ones in the original question:
sed -nr "s|http://(www\.)?([^.]*)\.(.*\.?)*|\2|p"
This version will handle cases that don't include "http://" (plus the others):
sed -nr "s|(http://)?(www\.)?([^.]*)\.(.*\.?)*|\3|p"
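To check this kind of expression against several inputs at once, it can be wrapped in a small function. The version below is a slightly simplified variant, not the exact command above: it anchors at the start of the line and stops the label at the first dot or slash. The function name domain_label is invented for illustration.

```shell
# Hypothetical helper: print the first host label after an optional http:// and www.
domain_label() {
  # Group 3 is the label; everything from the following dot onward is discarded.
  sed -n -E 's|^(http://)?(www\.)?([^./]+)\..*|\3|p'
}

printf '%s\n' \
  'http://www.google.com/test/' \
  'http://google.net/test/' \
  'www.google.com.cn/test' \
  'google.com/test' | domain_label
```

Each of the four sample lines prints google.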
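If you have Python, you can use the urllib.parse module (named urlparse in Python 2):
from urllib.parse import urlparse

with open("file") as f:
    for line in f:
        # netloc is the host part, e.g. "www.google.com"
        labels = urlparse(line.strip()).netloc.split(".")
        # skip a leading "www" label if present
        if labels[0] == "www":
            print(labels[1])
        else:
            print(labels[0])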
output
$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/
$ ./python.py
google
google
google
or you can use awk
awk -F"/" '{
    gsub(/http:\/\/|\/.*$/,"")
    split($0,d,".")
    if(d[1]~/www/){
        print d[2]
    }else{
        print d[1]
    }
}' file
$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/
www.google.com.cn/test
google.com/test
$ ./shell.sh
google
google
google
google
google
s|http://(www\.)?([^.]*)|$2|
It's Perl with alternate delimiters (because it makes it more legible), I'm sure you can port it to sed or whatever you need.
#! /bin/bash
urls=( \
http://www.google.com/test/ \
http://google.com/test/ \
http://google.net/test/ \
)
for url in "${urls[@]}"; do
    echo "$url" | sed -re 's,^http://(.*\.)*(.+)\.[a-z]+/.+$,\2,'
done
Have you tried the "-r" switch on your sed command? It enables extended regular expressions (egrep-compatible regexes), where ( ) and | work without backslashes.
Edit: try this. Note that sed has no non-capturing (?:...) groups (that is Perl/PCRE syntax), so use a plain group; it is captured as \2 and simply left unused.
echo "http://www.google.com/test/" | sed -nr "s/.*www\.(.*)\.(com|net).*$/\1/p"
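On the original question of why (com|net) printed nothing: without -r or -E, sed uses basic regular expressions, where ( ) and | are literal characters. The sketch below shows both spellings side by side; the function names are invented, and the \| form relies on a GNU sed extension.

```shell
# Basic regular expressions (GNU sed): groups are \( \) and alternation is \|
label_bre() {
  sed -n 's/.*www\.\(.*\)\.\(com\|net\).*$/\1/p'
}

# Extended regular expressions (-E): the same pattern without the backslashes
label_ere() {
  sed -n -E 's/.*www\.(.*)\.(com|net).*$/\1/p'
}

echo "http://www.google.net/test/" | label_bre
echo "http://www.google.com/test/" | label_ere
```

Both commands print google.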