Regexp is not working as expected in unix - regex

I am trying the code below and it is not working as expected. I am new to regex. Please share your ideas. Thanks in advance.
test.xml
<?xml version="1.0"?>
<audit>
<interfaces>
<interface_dtls>ABCD,ABCD 123</interface_dtls>
<interface_dtls>TESTING,123 TEST</interface_dtls>
</interfaces>
</audit>
Trying with below unix commands
#!/bin/bash
for line in `cat test.xml | grep -oP "(?<=interface_dtls>)[^<]+"`; do
echo $line --Displaying line only for debugging purpose
interface_code=`echo $line | awk -F ',' '{print $1}'`
prcdr_cd=`echo $line | awk -F ',' '{print $2}'`
hive -e "select * from table \
where sub_sys_cd='$interface_code' and data_prcdr_desc='$prcdr_cd';"
done
Actual "ECHO" output:
ABCD,ABCD
TESTING,123
Expected "ECHO" output:
ABCD,ABCD 123
TESTING,123 TEST
Because of the missing info (the text after the space), my query is not working as expected.

Use xml_grep, the recommended option for parsing, since grep is not an XML-aware tool.
$ xml_grep 'interface_dtls' file --text_only
ABCD,ABCD 123
TESTING,123 TEST
One could also use grep, as pointed out by anubhava in the comments. It is probably not the best way to do it, but it can be done for a one-time debug. For proper functionality use an XML-aware command (e.g. xmllint or xml_grep).
$ grep -oP "(?<=<interface_dtls>)[^<]+" xml_file
ABCD,ABCD 123
TESTING,123 TEST
The skeleton below shows how to extract the individual fields from the command output. I will leave it to you to adapt it as needed. Do not use the outdated backtick style of command substitution; use $(...) instead wherever applicable.
#!/bin/bash
while read -r paramA paramB;
do
interface_code=$(echo $paramA | awk -F ',' '{print $1}')
prcdr_cd=$(echo $paramA | awk -F ',' '{print $2}')
echo $interface_code $prcdr_cd
done < <(xml_grep 'interface_dtls' file --text_only)
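One thing to watch in the loop above: because read is given two names, it still splits each line at the first space, so paramA only receives ABCD,ABCD and the text after the space ends up in paramB. A minimal sketch of one way around that is to let read split on the comma instead. The helper name split_fields and the here-document standing in for the xml_grep output are assumptions made for illustration only.

```shell
#!/bin/bash
# split_fields is a hypothetical helper name used only for this sketch.
# It reads "code,description" lines from standard input and splits each one
# at the first comma; spaces inside the description are kept intact.
split_fields() {
    while IFS=, read -r interface_code prcdr_cd; do
        printf '%s|%s\n' "$interface_code" "$prcdr_cd"
    done
}

# Stand-in for: xml_grep 'interface_dtls' file --text_only
result=$(split_fields <<'EOF'
ABCD,ABCD 123
TESTING,123 TEST
EOF
)
printf '%s\n' "$result"
```

Setting IFS only on the read command keeps the change local to that command, so the rest of the script still sees the default IFS.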

The xml_grep utility was mentioned in another answer. This uses XMLStarlet, which is also able to validate and modify XML files on the command line:
$ xml sel -t -v '//interface_dtls' -nl data.xml
ABCD,ABCD 123
TESTING,123 TEST

After a little bit of research I was able to resolve the issue. Thanks to https://stackoverflow.com/users/5291015/inian , https://stackoverflow.com/users/4941495/kusalananda and https://stackoverflow.com/users/548225/anubhava for the helpful insights.
test.xml
<?xml version="1.0"?>
<audit>
<interfaces>
<interface_dtls>ABCD,ABCD 123</interface_dtls>
<interface_dtls>TESTING,123 TEST</interface_dtls>
</interfaces>
</audit>
Before:
#!/bin/bash
for line in `cat test.xml | grep -oP "(?<=interface_dtls>)[^<]+"`; do
echo $line --Displaying line only for debugging purpose
interface_code=`echo $line | awk -F ',' '{print $1}'`
prcdr_cd=`echo $line | awk -F ',' '{print $2}'`
hive -e "select * from table \
where sub_sys_cd='$interface_code' and data_prcdr_desc='$prcdr_cd';"
done
After:
#!/bin/bash
IFS=$'\n'
for line in `cat test.xml | grep -oP "(?<=interface_dtls>)[^<]+" | cut -d '>' -f 2 | cut -d '<' -f 1`; do
echo $line --Displaying line only for debugging purpose
interface_code=$(echo $line | awk -F ',' '{print $1}')
prcdr_cd=$(echo $line | awk -F ',' '{print $2}')
hive -e "select * from table \
where sub_sys_cd='$interface_code' and data_prcdr_desc='$prcdr_cd';"
done
"ECHO" output:
ABCD,ABCD 123
TESTING,123 TEST
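For anyone wondering why the original loop lost the text after the space: an unquoted command substitution in a for loop is split on every character in IFS, which by default includes space, tab and newline. Here is a rough sketch that contrasts the two settings. The variable data is just a stand-in for the grep output, and the counters are hypothetical names.

```shell
#!/bin/bash
# Sample data standing in for the grep output from test.xml.
data='ABCD,ABCD 123
TESTING,123 TEST'

# Default IFS (space, tab, newline): the unquoted expansion breaks at spaces.
count_default=0
for line in $data; do
    count_default=$((count_default + 1))
done

# Newline-only IFS: one loop iteration per line. In bash this can also be
# written IFS=$'\n'; the quoted literal newline below works in any POSIX shell.
old_ifs=$IFS
IFS='
'
set -f    # also disable globbing during the unquoted expansion
count_lines=0
for line in $data; do
    count_lines=$((count_lines + 1))
    last_line=$line
done
set +f
IFS=$old_ifs

echo "default IFS: $count_default words, newline IFS: $count_lines lines"
```

Restoring IFS afterwards avoids surprising the rest of the script, which is one reason a while read loop is often preferred over changing IFS globally.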

Related

Grep first line which contain a date

I'm trying to fetch the first line in a log file that contains a date.
Here is an example of the log file :
SOME
LOG
2021-1-1 21:50:19.0|LOG|DESC1
2021-1-4 21:50:19.0|LOG|DESC2
2021-1-5 21:50:19.0|LOG|DESC3
2021-1-5 21:50:19.0|LOG|DESC4
In this context I need to get the following line:
2021-1-1 21:50:19.0|LOG|DESC1
Another log file example:
SOME
LOG
21-1-3 21:50:19.0|LOG|DESC1
21-1-3 21:50:19.0|LOG|DESC2
21-1-4 21:50:19.0|LOG|DESC3
21-1-5 21:50:19.0|LOG|DESC4
I need to fetch :
21-1-3 21:50:19.0|LOG|DESC1
So far I have tried the following commands:
cat /path/to/file | grep "$(date +"%Y-%m-%d")" | tail -1
cat /path/to/file | grep "$(date +"%-Y-%-m-%-d")" | tail -1
cat /path/to/file | grep -E "[0-9]+-[0-9]+-[0-9]" | tail -1
If you are OK with awk, you could try the following. It prints the first line that matches the regex and then exits, which is faster because it does NOT read the whole Input_file.
awk '
/^[0-9]{2}([0-9]{2})?-[0-9]{1,2}-[0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+/{
print
exit
}' Input_file
Using sed, being not too concerned about exactly how many digits are present:
sed -En '/^[0-9]+-[0-9]+-[0-9]+ [0-9]+:[0-9]+:[0-9]+[.][0-9]+[|]/ {p; q}' file
$ grep -m1 '^[0-9]' file1
2021-1-1 21:50:19.0|LOG|DESC1
$ grep -m1 '^[0-9]' file2
21-1-3 21:50:19.0|LOG|DESC1
If that's not all you need then edit your question to provide more truly representative sample input/output.
A simple grep with -m 1 (to exit after finding first match):
grep -m1 -E '^([0-9]+-){2}[0-9]+ ([0-9]{2}:){2}[0-9]+\.[0-9]+' file1
2021-1-1 21:50:19.0|LOG|DESC1
grep -m1 -E '^([0-9]+-){2}[0-9]+ ([0-9]{2}:){2}[0-9]+\.[0-9]+' file2
21-1-3 21:50:19.0|LOG|DESC1
This sed works with either GNU or POSIX sed:
sed -nE '/^[[:digit:]]{2,4}-[[:digit:]]{1,2}-[[:digit:]]{1,2}/{p;q;}' file
But awk, with the same ERE, is probably better:
awk '/^[[:digit:]]{2,4}-[[:digit:]]{1,2}-[[:digit:]]{1,2}/{print; exit}' file
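Note that the attempts in the question end in tail -1, which returns the last match rather than the first, and the first two search for today's date only. As a small self-contained sketch of the "stop at the first match" idea, the snippet below writes the second sample log to a temporary file and runs grep -m1 on it. The function name first_dated_line and the temporary file are assumptions for illustration.

```shell
#!/bin/bash
# first_dated_line is a hypothetical helper: it prints the first line of the
# given file that starts with a date such as 2021-1-1 or 21-1-3, then stops.
first_dated_line() {
    grep -m1 -E '^[0-9]{2,4}-[0-9]{1,2}-[0-9]{1,2} ' "$1"
}

# Temporary copy of the second sample log from the question.
log=$(mktemp)
cat > "$log" <<'EOF'
SOME
LOG
21-1-3 21:50:19.0|LOG|DESC1
21-1-3 21:50:19.0|LOG|DESC2
21-1-4 21:50:19.0|LOG|DESC3
EOF

first=$(first_dated_line "$log")
rm -f "$log"
printf '%s\n' "$first"
```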

Grep next word after pattern match

I'm trying to get grep/sed out the following output: "name":"test_backup_1" from the below response
{"backups":[{"name":"test_backup_1","status":"CORRUPTED","creationTime":"2019-11-08T15:03:49.460","id":"test_backup_1"}]}
I have been trying variations of the following grep -Eo 'name:"\w+\"' but no joy.
I'm not sure if it would be easier to achieve this using grep or sed?
The way I am running this is curling a response from the server and saving it to a local variable, then echo out the variable and pipe grep/sed
example of what I am running
echo ${view_backup} | grep -Eo '"name":"\w+\"'
Referencing @sundeep's answer
grep -Eo '"name":"[^"]+"'
resulted in the expected output
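Putting that together with the variable from the question, a minimal sketch might look like this. The JSON is copied from the question; the second step that strips the key and quotes is an optional extra I added, in case only the value is wanted.

```shell
#!/bin/bash
# Sample response copied from the question; in practice it would come from curl.
view_backup='{"backups":[{"name":"test_backup_1","status":"CORRUPTED","creationTime":"2019-11-08T15:03:49.460","id":"test_backup_1"}]}'

# Quote the variable so the shell does not split or glob the JSON text.
pair=$(printf '%s\n' "$view_backup" | grep -Eo '"name":"[^"]+"')

# Strip the key and the quotes to keep only the value.
value=$(printf '%s\n' "$pair" | sed 's/^"name":"//; s/"$//')

printf '%s\n%s\n' "$pair" "$value"
```

Using [^"]+ instead of \w+ means the match also works when the name contains characters such as hyphens or dots.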
Make sure to transform the file to one line before grep
and pipe from your curl
echo `curl --silent https://someurl | tr -d '\n' | grep -oP "(?<=name\":\")[^\"]+"`
will return
test_backup_1
If you want more variables you can chain alternatives in the -oP grep, as in this example where I get some data on a Danish license plate (bt41932)
curl --silent https://www.tjekbil.dk/api/v2/nummerplade/bt41932 | grep -oP -m 1 "(?<=\"RegNr\":\")[^\"]+|(?<=\"MaerkeTypeNavn\":\")[^\"]+|(?<=\"MaksimumHastighed\":)[^,]+"| tr '\n' ' '
returns
BT41932 SKODA 218

Grep in bash with regex

I am getting the following output from a bash script:
INFOPLIST_FILE = MajorDomo/MajorDomo-Info.plist
and I would like to get only the path(MajorDomo/MajorDomo-Info.plist) using grep. In other words, everything after the equals sign. Any ideas of how to do this?
This job is better suited to awk:
s='INFOPLIST_FILE = MajorDomo/MajorDomo-Info.plist'
awk -F' *= *' '{print $2}' <<< "$s"
MajorDomo/MajorDomo-Info.plist
If you really want grep then use grep -P:
grep -oP ' = \K.+' <<< "$s"
MajorDomo/MajorDomo-Info.plist
Not exactly what you were asking, but
echo "INFOPLIST_FILE = MajorDomo/MajorDomo-Info.plist" | sed 's/.*= \(.*\)$/\1/'
will do what you want.
You could use cut as well:
your_script | cut -d = -f 2-
(where your_script does something equivalent to echo INFOPLIST_FILE = MajorDomo/MajorDomo-Info.plist)
If you need to trim the space at the beginning:
your_script | cut -d = -f 2- | cut -d ' ' -f 2-
If you have multiple spaces at the beginning and you want to trim them all, you'll have to fall back to sed: your_script | cut -d = -f 2- | sed 's/^ *//' (or, simpler, your_script | sed 's/^[^=]*= *//')
Assuming your script outputs a single line, there is a shell only solution:
line="$(your_script)"
echo "${line#*= }"
Bash
IFS=' =' read -r _ x <<<"INFOPLIST_FILE = MajorDomo/MajorDomo-Info.plist"
printf "%s\n" "$x"
MajorDomo/MajorDomo-Info.plist
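If the amount of whitespace after the equals sign can vary, the shell-only approach can be extended with a second parameter expansion. This is just a sketch; value_after_equals is a made-up helper name and the extra spaces in the sample input are there only to show the trimming.

```shell
#!/bin/bash
# value_after_equals is a hypothetical helper: it prints everything after the
# first "=" in its argument, with leading spaces and tabs removed.
value_after_equals() {
    rest=${1#*=}                            # drop up to and including the first "="
    rest=${rest#"${rest%%[![:space:]]*}"}   # drop leading whitespace
    printf '%s\n' "$rest"
}

path=$(value_after_equals 'INFOPLIST_FILE =    MajorDomo/MajorDomo-Info.plist')
printf '%s\n' "$path"
```

No external process is started for the trimming itself, which can matter if this runs inside a tight loop.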

using sed to get only line number of "grep -in"

Which regexp should I use to only get line number from grep -in output?
The usual output is something like this:
241113:keyword
I need to get only "241113" from sed's output.
I suggest cut
grep -in keyword ... | cut -d: -f1
If you insist on sed:
grep -in keyword ... | sed 's/:.*$//'
You don't need to use sed. Cut is enough. Just pipe grep's output to
cut -d ':' -f 1
As an example:
grep -n blabla file.txt | cut -d ':' -f 1
Personally, I like awk
grep -in 'search' file | awk --field-separator : '{print $1}'
As said in other answers, cut is the right tool; but if you really want to use a swiss-army knife, you can also use awk:
grep -in keyword ... | awk -F: '{print $1}'
or using grep again:
grep -in keyword ... | grep -oE '^[0-9]+'
Just in case someone is wondering if all this could be done without grep, i.e. with sed alone ...
echo '
a
b
keyword
c
keyWord
x
y
keyword
Keyword
z
' |
sed -n '/[Kk][Ee][Yy][Ww][Oo][Rr][Dd]/{=;}'
#sed -n '/[Kk][Ee][Yy][Ww][Oo][Rr][Dd]/{=;q;}' # only line number of first match
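Along the same lines, awk can also do the whole job without grep, and tolower() avoids spelling out every letter in both cases. A small sketch with made-up sample input:

```shell
#!/bin/bash
# Print the line number of every line containing "keyword", ignoring case,
# without calling grep. tolower() is part of POSIX awk.
numbers=$(printf '%s\n' a b keyword c keyWord x |
    awk 'tolower($0) ~ /keyword/ { print NR }')
printf '%s\n' "$numbers"
```

Adding exit after print NR would stop at the first match, mirroring the q variant of the sed command.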

how to grep part of the content from a string in bash

For example when filtering html file,
if every line is in this kind of pattern:
<i>some text</i>
how can I get the content of href, and how can I get the text between <i> and </i>?
cat file | cut -f2 -d\"
FYI: Just about every other HTML/regexp post on Stackoverflow explains why getting values from HTML using anything other than HTML parsing is a bad idea. You may want to read some of those. This one for example.
If href is always the second token separated by a space in a line, then you can try
grep "href" file | cut -d' ' -f2 | cut -d'=' -f2
Here's how to do it using xmlstarlet (optionally with tidy):
# extract content of href and <i>...</i>
echo '<i>some text</i>' |
xmlstarlet sel -T -t -m "//a" -v @href -n -v i -n
# using tidy & xmlstarlet
echo '<i>some text</i>' |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:a" -v @href -n -v . -n
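If no XML tooling is installed and every line really does have one fixed shape, a quick sed sketch can pull out both pieces. Note the assumptions: the sample line below, including the href value, is a placeholder I made up, since the question does not show the full markup, and this will break on anything less regular, which is why a real parser is preferable.

```shell
#!/bin/bash
# Hypothetical sample line; the href value is a placeholder, not from the question.
line='<a href="/hypothetical/path"><i>some text</i></a>'

# Capture the text between href=" and the next double quote.
href=$(printf '%s\n' "$line" | sed -n 's/.*href="\([^"]*\)".*/\1/p')

# Capture the text between <i> and </i>.
text=$(printf '%s\n' "$line" | sed -n 's/.*<i>\([^<]*\)<\/i>.*/\1/p')

printf '%s\n%s\n' "$href" "$text"
```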