Extracting substring in bash script - regex

I am not that good at bash scripting. I need to extract a substring between two words of a string. I have tried different approaches without success. Could someone help me, please?
This is my text: "RegionName": "eu-west-1", "LatestAmiId": "ami-0ebfeadd9ccacfbb2",
Remember that the quotes and the comma are part of the string. I need to extract the AMI ID alone, meaning the text between "LatestAmiId": " and ",
Any help, please?

Assuming you have this string stored in a variable named input_text, you can get the AMI ID using sed like this:
ami_id=$(echo "$input_text" | sed -e 's/.*LatestAmiId": "//' -e 's/",$//')
this uses two different sed scripts:
s/.*LatestAmiId": "// replaces all text up to and including LatestAmiId": " with nothing
s/",$// replaces the ", at the end of the line with nothing
As I mentioned in the comments, jq is a tool that I have found really helpful when working with JSON objects in bash scripts. Since your input string looks like a section of a JSON response from an AWS API, I highly recommend using a JSON tool rather than a regex to extract this information.
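If neither sed nor a JSON tool is available, the same extraction can be sketched with shell parameter expansion alone. This is an illustration rather than part of the original answer; the variable name prefix is an assumption introduced here, and the approach is just as fragile as the regex if the input format changes.

```shell
# Sample input copied from the question (quotes and trailing comma included).
input_text='"RegionName": "eu-west-1", "LatestAmiId": "ami-0ebfeadd9ccacfbb2",'

# Hypothetical helper variable: the literal text that precedes the AMI ID.
prefix='"LatestAmiId": "'

# Remove everything up to and including the prefix.
ami_id=${input_text##*"$prefix"}
# Remove the first remaining double quote and everything after it.
ami_id=${ami_id%%\"*}

echo "$ami_id"
```

Both expansions are POSIX, so this should behave the same in bash, dash and similar shells.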

Insert newline before/after match for TSV

I'm going grey trying to figure out how to accomplish some regex matching to insert new lines. Example input/output below...
Example TSV Data:
Name Monitoring Tags
i-RBwPyvq8wPbUhn495 enabled "some:tags:with:colons=some:value:with:colons-and-dashes/and/slashes/yay606-values-001 some:other:tag:with-colons-and-hypens=MACHINE NAME Name=NAMETAG backup=true"
i-sMEwh2MXj3q47yWWP enabled "description=RANDOM BUSINESS INT01 backup=true Name=SOMENAME"
Desired Output:
Name Monitoring Tags
i-RBwPyvq8wPbUhn495 enabled "some:tags:with:colons=some:value:with:colons-and-dashes/and/slashes/yay606-values-001
some:other:tag:with-colons-and-hypens=MACHINE NAME
Name=NAMETAG
backup=true"
i-sMEwh2MXj3q47yWWP enabled "description=RANDOM BUSINESS INT01
backup=true
Name=SOMENAME"
I can guarantee that each key=value pair within those quotes is separated by a hard/literal tab. It may not appear that way in how the StackOverflow code block is displayed in HTML, but the tabs did carry over into the code block editor. The data under the Tags column is in quotes so that, even though the pairs are tab separated, they stay within the Tags column. For whatever reason I'm not able to get the desired results.
In my measly attempts, I've basically been capturing everything between the "" as if there were no tabs inside. Because of my use of wildcards, [TAB].*=.*[TAB] is obviously not working: I lose everything between the first and last occurrence on each line. I've also attempted storing them in capture groups, without any success.
I'm looking for a unix toolset solution (sed, awk, perl and the like). Any/All help is appreciated!

This will work using any awk in any shell on any UNIX box:
$ awk 'match($0,/".*"/){str=substr($0,RSTART,RLENGTH); gsub(/\t/,"\n",str); $0=substr($0,1,RSTART-1) str substr($0,RSTART+RLENGTH)} 1' file
Name Monitoring Tags
i-RBwPyvq8wPbUhn495 enabled "some:tags:with:colons=some:value:with:colons-and-dashes/and/slashes/yay606-values-001
some:other:tag:with-colons-and-hypens=MACHINE NAME
Name=NAMETAG
backup=true"
i-sMEwh2MXj3q47yWWP enabled "description=RANDOM BUSINESS INT01
backup=true
Name=SOMENAME"
It just extracts a string between "s from the current record, replaces all tabs with newlines within that string, then puts the record back together before it's printed.
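Because the tabs are hard to see on this page, here is a small reproduction that builds its own input with printf, so the literal tabs are guaranteed. The sample rows (i-1, a=1 and so on) are made up for illustration; the awk program is the one from the answer above.

```shell
# Build a two-line sample with literal tabs (printf expands \t and \n).
tmp=$(mktemp)
printf 'Name\tMonitoring\tTags\n' > "$tmp"
printf 'i-1\tenabled\t"a=1\tb=two words\tc=3"\n' >> "$tmp"

# Same program as above: isolate the quoted part, turn its tabs into
# newlines, then reassemble the record before printing it.
result=$(awk 'match($0,/".*"/){str=substr($0,RSTART,RLENGTH); gsub(/\t/,"\n",str); $0=substr($0,1,RSTART-1) str substr($0,RSTART+RLENGTH)} 1' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

The header line has no quotes, so match() fails and the line is printed unchanged; tabs outside the quotes on the data line are also left alone.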

You can try this with GNU sed (tested with version 4.4):
sed -E ':A;s/(".*)\t(.*")/\1\n\2/;tA' TSV_Data_File
With the sed shipped with OS X, you can try this one (I think the \t is OK there):
sed -E '
:A
s/(".*)\t(.*")/\1\
\2/
tA
' TSV_Data_File
Brief explanation:
Capture the text inside the double quotes.
Replace the last \t with \n.
If a substitution occurred, jump back to label A; otherwise continue.
With awk:
awk -v RS='"' 'NR%2==0{gsub("\t","\n")}1' ORS='"' TSV_Data_File

This is basically ctac_'s sed answer converted to perl:
perl -pe'1 while s/(".*)\t(.*")/$1\n$2/s' file.tsv
Where the \t might be replaced by \t\s* if you want just one newline out of each tab-and-then-some.

This might work for you (GNU sed):
sed 's/\S\+=\S\+/\n&/2g' file
Insert a newline before the second and every subsequent key=value token (a run of non-whitespace characters containing an =) on each line.

Sed or awk Find a string in the last 100 characters of a line or delete the line

First question, so hopefully I've formed it well.
I'm looking to match a string, namely "lang":"en" in the last 100 characters of a line and if there's no match, delete the line.
I have tried using sed by doing
sed '/"lang":"en"/!d' file > output
But unfortunately many lines have that string more than once and I only care about the final occurrence of it.
I'm learning sed still, but don't know anything about awk and most of my searches have come up with "first/last instance in a file" rather than "in a line" so any help in learning the best method to do this would be great. thanks.

This should work with any Posix awk:
awk 'match(substr($0,length-99),/"lang":"en"/)' file
You can do it with a simple string find, instead of a regular expression, but the string is more annoying to type:
awk 'index(substr($0,length-99),"\"lang\":\"en\"")' file
Both simply extract the last 100 characters of each line, and if the test pattern is found in the substring, print the line (print is the default action, so the program consists only of the condition.)
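A quick way to convince yourself of the behavior is to run the index() variant on generated lines. The sketch below uses made-up test data (120 filler characters plus a small JSON-like fragment) and passes the search string in through a hypothetical awk variable named needle, which avoids the escaped quotes.

```shell
pad=$(printf '%120s' '' | tr ' ' 'x')   # 120 filler characters

tmp=$(mktemp)
{
  printf '{"lang":"en"}%s\n' "$pad"                # match only at the start: dropped
  printf '%s{"lang":"en","id":1}\n' "$pad"         # match near the end: kept
  printf '{"lang":"en"}%s{"lang":"fr"}\n' "$pad"   # last 100 chars contain only "fr": dropped
} > "$tmp"

# Keep a line only if the needle occurs within its last 100 characters.
kept=$(awk -v needle='"lang":"en"' 'index(substr($0, length($0) - 99), needle)' "$tmp")

printf '%s\n' "$kept"
rm -f "$tmp"
```

Only the second line survives, which is the "final occurrence" behavior the question asks for.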

For a simple regex-based solution,
grep -E '"lang":"en".{0,89}$' file
I subtracted the length of "lang":"en" from the maximum amount, assuming you mean the string must be found entirely within the last 100 characters.
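The 89 comes from 100 minus the 11 characters of "lang":"en". If the window size or the search string might change, the pattern can be derived instead of hard-coded. A minimal sketch, in which the names needle, window, slack and pattern are all assumptions made for illustration, and the needle is assumed to contain no regex metacharacters:

```shell
needle='"lang":"en"'
window=100

# Characters allowed between the end of the match and the end of the line.
slack=$(( window - ${#needle} ))
pattern="${needle}.{0,${slack}}\$"
echo "$pattern"

# Two generated lines: the needle followed by 95 filler characters (too far
# from the end), and the needle at the very end of the line.
pad=$(printf '%95s' '' | tr ' ' 'x')
matched=$(printf '%s\n' "${needle}${pad}" "${pad}${needle}" | grep -E "$pattern")
printf '%s\n' "$matched"
```

Only the second generated line is printed, since in the first one the match ends more than 89 characters before the end of the line.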
This looks like you are attempting to process JSON data, so perhaps you can come up with a better, structure-based rule, and use jq instead.
jq 'select(.path.to.lang == "en")' file
to find "en" in the structure { "path": { ... "to": { ..., "lang": "en" ... } } }. This will also be robust against newlines in the JSON, spacing variations in "lang": "en", etc.

sed '/"lang":"en".\{0,89\}$/!d' file > output
This allows up to 89 other characters between the match and the end of the line, so the string must lie entirely within the last 100 characters.

search and print a word from a file

I am trying to capture a word with a static search string.
Search String: customfield_12345
Here is the source file that i am trying to feed to awk script:
Input file: abc.log
{"expand":"hello,foo,boo,doo","id":"546546","self":"http://localhost/abc/rest/api/latest/issue/12345","key":"abcd-4567","fields":{"customfield_12345":"$D21.0/dfgdf/string_to_capture_from_file "}}
Query: awk '{for(i=1;i<=NF;i++){if($i~/^customfield_12345/){print $i}}}' abc.log
Expected output: string_to_capture_from_file
I thought of using a combination of grep and cat, but the "-o" option is not supported on all platforms.

awk is not the best tool for your case. Figuring out the relevant separators might be a pain, and using a JSON parser as suggested by the other answer would be easier.
However, in your specific case, you can modify your query as follows:
MYVAR=$(awk -F'":"|","|{"|"}' '{for(i=1;i<=NF;i++){if($i~/customfield_12345/){i++;print $i}}}' abc.log)
echo ${MYVAR##*/}
-F allows us to set ":", ",", {" and "} as internal fields separators. When awk encounters one of those patterns, it will split the line in several columns.
This will return $D21.0/dfgdf/string_to_capture_from_file, which you can later parse with bash using echo ${MYVAR##*/}
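To see what the separator pattern actually produces, it can help to print every field with its index. The sketch below uses a shortened version of the line from the question and writes the braces as bracket expressions ([{] and [}]); that spelling is a choice made here for portability across awk implementations, not something the answer requires.

```shell
# Shortened sample line (assumption: same shape as the one in abc.log).
line='{"id":"546546","fields":{"customfield_12345":"$D21.0/dfgdf/string_to_capture_from_file "}}'

# Print each field as index=[value] so the split points become visible.
fields=$(printf '%s\n' "$line" |
  awk -F'":"|","|[{]"|"[}]' '{for(i=1;i<=NF;i++) printf "%d=[%s]\n", i, $i}')

printf '%s\n' "$fields"
```

The key customfield_12345 lands in one field and its value in the next, which is why the answer increments i before printing.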

Your input file contains a JSON string, so I would parse it as JSON instead of using a regex:
python3 -c "import json; f = open('abc.log'); data = json.load(f); f.close(); print(data['fields']['customfield_12345'])"

Regular expression to extract text from XML-ish data using GNU sed

I have a file full of lines extracted from an XML file using "gsed regexp -i FILENAME". Each line in the file has one of two formats:
<field number='1' name='Account' type='STRING'W/>
<field number='2' name='AdvId' type='STRING'W>
I've inserted a 'W' in the end which represents optional whitespace. The order and number of properties are not necessarily the same in all lines throughout the file although "number" is always before "type".
What I'm searching for is a regular expression "regexp" that I can give to gnu sed so that this command:
gsed regexp -i FILENAME
gives me a file with lines looking like this:
1 STRING
2 STRING
I don't care about the amount of whitespace in the result as long as there is some after the number and a newline at the end of each line.
I'm sure it is possible, but I just can't figure out how in a reasonable amount of time. Can anyone help?
Thanks a lot,
jules

Using xsh, a Perl wrapper around XML::LibXML:
open file.xml ;
for //field echo @number @type ;

I'm sure this can be optimized, but it works for me and answers your question:
sed "s/^.*number='\([0-9]*\)'.*type='\(.*\)'.*$/\1 \2/" <filename>
Saying that, I think the others are right, if you have an XML-file you should use an XML-parser.
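One possible tightening, offered as a sketch rather than a definitive answer: capture with [^']* so that a value can never run past its closing quote, and use -n with the p flag so that lines without both attributes are dropped. The second sample line below is hypothetical; it was added to show that the attribute order does not matter as long as number comes before type.

```shell
out=$(printf '%s\n' \
  "<field number='1' name='Account' type='STRING' />" \
  "<field name='AdvId' number='2' type='STRING' required='Y'>" |
  sed -n "s/.*[[:space:]]number='\([^']*\)'.*[[:space:]]type='\([^']*\)'.*/\1 \2/p")

printf '%s\n' "$out"
```

This still treats the input as text, so the advice about using a real XML parser applies to anything more complicated.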

I think you're much better off using a command line XML tool such as XMLStarlet. That will integrate well with the shell and let you perform XPath searches. It's XML-aware so it'll handle character encodings, whitespace correctly etc.

Simple cut should work for you:
cut -f2,6 -d"'" --output-delimiter=" "
If you really want sed:
sed -r "s/[^']*'([^']*)'.*type='([^']*)'.*/\1 \2/"

You can use this:
sed -r "s/<field [^>]*number='([0-9]+)'[^>]*type='([^']+)'[^>]*>/\1 \2/"

You would be better off using an XML parser, but if you had to use sed:
sed -E "s/<field number='([^']*)'.*type='([^']*)'.*/\1 \2/"

sed -ni "/<field .*>/s#^.*[[:space:]]number='\\([^']\\+\\).*[[:space:]]type='\\([^']\\+\\).*#\1 \2#p" FILENAME
Or if you don't mind contents of number and type to be optional:
sed -ni "/<field .*>/s#^.*[[:space:]]number='\\([^']*\\).*[[:space:]]type='\\([^']*\\).*#\1 \2#p" FILENAME
Just change from [^']\\+ to [^']* at your preference.

Regular Expression to parse Common Name from Distinguished Name

I am attempting to parse (with sed) just First Last from the following DN(s) returned by the DSCL command in OSX terminal bash environment...
CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu
I have tried multiple regexs from this site and others with questions very close to what I wanted... mainly this question... I have tried following the advice to the best of my ability (I don't necessarily consider myself a newbie...but definitely a newbie to regex..)
DSCL returns a list of DNs, and I would like to only have First Last printed to a text file. I have attempted using sed, but I can't seem to get the correct function. I am open to other commands to parse the output. Every line begins with CN= and then there is a comma between Last and OU=.
Thank you very much for your help!

I think all of the regular expression answers provided so far are buggy, insofar as they do not properly handle quoted ',' characters in the common name. For example, consider a distinguishedName like:
CN=Doe\, John,CN=Users,DC=example,DC=local
Better to use a real library able to parse the components of a distinguishedName. If you're looking for something quick on the command line, try piping your DN to a command like this:
echo "CN=Doe\, John,CN=Users,DC=activedir,DC=local" | python3 -c 'import sys; import ldap.dn; print(ldap.dn.explode_dn(sys.stdin.read().strip(), notypes=1)[0])'
(depends on having the python-ldap library installed). You could cook up something similar with PHP's built-in ldap_explode_dn() function.

Two cut commands is probably the simplest (although not necessarily the best):
DSCL | cut -d, -f1 | cut -d= -f2
First, split the output from DSCL on commas and print the first field ("CN=First Last"); then split that on equal signs and print the second field.

Using sed:
sed 's/^CN=\([^,]*\).*/\1/' input_file
^ matches start of line
CN= literal string match
\([^,]*\) everything until a comma
.* rest
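Note that [^,]* stops at the first comma, so a name containing a backslash-escaped comma, such as CN=Doe\, John, is cut short to Doe\. If such names can occur, one option is to let the capture group also accept a backslash followed by any character. The following is a sketch under that assumption (it does not handle quoted values or other DN escaping rules):

```shell
plain='CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu'
escaped='CN=Doe\, John,CN=Users,DC=example,DC=local'

# Original pattern: everything after CN= up to the first comma.
cn_plain=$(printf '%s\n' "$plain" | sed 's/^CN=\([^,]*\).*/\1/')

# Variant: each element is either a character that is neither a comma nor a
# backslash, or a backslash followed by any single character.
cn_escaped=$(printf '%s\n' "$escaped" | sed -E 's/^CN=(([^,\\]|\\.)*).*/\1/')

printf '%s\n' "$cn_plain" "$cn_escaped"
```

The escaped comma is kept in its escaped form (Doe\, John); removing the backslash would need an additional substitution.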

http://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators
awk -v RS=',' -v FS='=' '$1=="CN"{print $2}' foo.txt

I like awk too, so I print the substring from the fourth char:
DSCL | awk -F, '{print substr($1,4)}' > filterednames.txt

This regex will parse a distinguished name, giving name and val as capture groups for each match.
When DN strings contain commas, they are meant to be quoted. This regex correctly handles both quoted and unquoted strings, and also handles escaped quotes in quoted strings:
(?:^|,\s?)(?:(?<name>[A-Z]+)=(?<val>"(?:[^"]|"")+"|[^,]+))+
Here it is nicely formatted:
(?:^|,\s?)
(?:
(?<name>[A-Z]+)=
(?<val>"(?:[^"]|"")+"|[^,]+)
)+
Here's a link so you can see it in action:
https://regex101.com/r/zfZX3f/2
If you want a regex to get only the CN, then this adapted version will do it:
(?:^|,\s?)(?:CN=(?<val>"(?:[^"]|"")+"|[^,]+))
