Search for Pattern in Text String, then Extract Matched Pattern - regex

I am trying to match and then extract a pattern from a text string. I need to extract any pattern that matches the following in the text string:
10289 20244
Text File:
KBOS 032354Z 19012KT 10SM FEW060 SCT200 BKN320 24/17 A3009 RMK AO2 SLP187 CB DSNT NW T02440172 10289 20244 53009
I am trying to achieve this using the following bash code:
Bash Code:
cat text_file | grep -Eow '\s10[0-9].*\s' | head -n 4 | awk '{print $1}'
The above code attempts to search for any group of approximately five numeric characters that begin with 10 followed by three numeric characters. After matching this pattern, the code prints out the rest of text string, capturing the second group of five numeric characters, beginning with 20.
I need a better, more reliable way to accomplish this because currently, this code fails. The numeric groups I need are separated by a space. I have attempted to account for this by inserting \s into the grep portion of the code.

grep solution:
grep -Eow '10[0-9]{3}\b.*\b20[0-9]{3}' text_file
The output:
10289 20244
[0-9]{3} - matches 3 digits
\b - word boundary

awk '{print $(NF-2),$(NF-1)}' text_file
10289 20244
Prints the third-from-last and second-to-last fields. This only works if the two groups always sit directly before the final field.

awk '$17 ~ /^10[0-9]{3}$/ && $18 ~ /^20[0-9]{3}$/ { print $17, $18 }' text_file
This will check field 17 for "10xxx" and field 18 for "20xxx", and when BOTH match, print them.
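If the position of the two groups varies between reports, the fixed field numbers no longer apply. A minimal sketch, assuming the 10xxx group is always immediately followed by the 20xxx group; the function name find_temp_groups is made up for illustration:

```shell
# Hypothetical helper: scan every field instead of hard-coding $17 and $18.
find_temp_groups() {
  awk '{
    for (i = 1; i < NF; i++) {
      # [0-9][0-9][0-9] is spelled out so awks without {n} support also work
      if ($i ~ /^10[0-9][0-9][0-9]$/ && $(i + 1) ~ /^20[0-9][0-9][0-9]$/) {
        print $i, $(i + 1)
        next   # the first pair on the line is enough
      }
    }
  }'
}

metar='KBOS 032354Z 19012KT 10SM FEW060 SCT200 BKN320 24/17 A3009 RMK AO2 SLP187 CB DSNT NW T02440172 10289 20244 53009'
printf '%s\n' "$metar" | find_temp_groups
```

On the sample line this prints 10289 20244, and lines without such a pair print nothing.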

Related

How to search for multiple words of a specific pattern and separator?

I'm trying to trim out multiple hex words from my string. I'm searching for exactly 3 words, separated by exactly 1 dash each time.
i.e. for this input:
wonder-indexing-service-0.20.0-1605296913-49b045f-19794354.jar
I'd like to get this output:
wonder-indexing-service-0.20.0.jar
I was able to remove the hex words by repeating the pattern. How can I simplify it? Also, I wasn't able to change * to +, to avoid allowing empty words. Any idea how to do that?
What I've got so far:
# Good, but how can I simplify?
% echo 'wonder-indexing-service-0.20.0-1605296913-49b045f-19794354.jar' | sed 's/\-[a-fA-F0-9]*\-[a-fA-F0-9]*\-[a-fA-F0-9]*//g'
wonder-indexing-service-0.20.0.jar
# Bad, I'm allowing empty words
% echo 'wonder-indexing-service-0.20.0-1605296913-49b045f-.jar' | sed 's/\-[a-fA-F0-9]*\-[a-fA-F0-9]*\-[a-fA-F0-9]*//g'
wonder-indexing-service-0.20.0.jar
Thank you!
EDIT: I had a typo in original output, thank you anubhava for pointing out.
You may use this sed:
s='wonder-indexing-service-0.20.0-1605296913-49b045f-19794354.jar'
sed -E 's/(-[a-fA-F0-9]{3,})+//' <<< "$s"
wonder-indexing-service-0.20.0.jar
Breakup:
(: Start a group
-: Match a hyphen
[a-fA-F0-9]{3,}: Match 3 or more hex characters
)+: End the group. Repeat this group 1+ times
If you want to use + in sed's default (basic) regex syntax, you have to escape it as \+ (a GNU extension). You can also match exactly 3 words, each preceded by a hyphen, with an interval quantifier, whose braces need escaping as well:
\(-[a-fA-F0-9]\+\)\{3\}
Example
echo 'wonder-indexing-service-0.20.0-1605296913-49b045f-19794354.jar' | sed 's/\(-[a-fA-F0-9]\+\)\{3\}//g'
Output
wonder-indexing-service-0.20.0.jar
If you don't want to allow a trailing - then you can match the .jar and put that back in the replacement.
echo 'wonder-indexing-service-0.20.0-1605296913-49b045f-19794354.jar' | sed 's/\(-[a-fA-F0-9]\+\)\{3\}\(\.jar$\)/\2/g'
printf "wonder-indexing-service-0.20.0-1605296913-49b045f-19794354.jar" | cut -d'-' -f1-4 | sed s'#$#.jar#'
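The "exactly three non-empty hex words" requirement from the question can be sketched with extended regex syntax, where + and {3} need no escaping. This assumes the hex words always come directly before a .jar suffix; the function name strip_hex_words is made up for illustration:

```shell
# Hypothetical helper: drop exactly three non-empty hex words before ".jar".
strip_hex_words() {
  printf '%s\n' "$1" | sed -E 's/(-[a-fA-F0-9]+){3}(\.jar)$/\2/'
}

strip_hex_words 'wonder-indexing-service-0.20.0-1605296913-49b045f-19794354.jar'
# An empty third word does not match, so the name is printed unchanged.
strip_hex_words 'wonder-indexing-service-0.20.0-1605296913-49b045f-.jar'
```

The second call shows the difference from the * version: the name with an empty word is left alone instead of being trimmed.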

How to check last 3 chars of a string are alphabets or not using awk?

I want to check if the last 3 letters in column 1 are alphabets and print those rows. What am I doing wrong?
My code :-
awk -F '|' ' {print str=substr( $1 , length($1) - 2) } END{if ($str ~ /^[A-Za-z]/ ) print}' file
cat file
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
.*/|982376
0NRT0|928731
expected output :
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
$ awk -F'|' '$1 ~ /[[:alpha:]]{3}$/' file
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
Regarding what's wrong with your script:
You're doing the test for alphabetic characters in the END section for the final line read instead of once per input line.
You're trying to use shell variable syntax $str instead of awk str.
You're testing for literal character ranges in the bracket expression instead of using a character class so YMMV on which characters that includes depending on your locale.
You're testing for a string that starts with a letter instead of a string that ends with 3 letters.
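Applying those four points to the original substr-based approach gives something like the following. This is a sketch rather than the answerer's own code; the function name last_three_letters and the temporary file are assumptions for the demo:

```shell
# Sample data from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
.*/|982376
0NRT0|928731
EOF

# Hypothetical helper: test every line (not only in END), use the awk
# variable str (not $str), and require exactly three letters.
last_three_letters() {
  awk -F'|' '{ str = substr($1, length($1) - 2) }
             str ~ /^[[:alpha:]][[:alpha:]][[:alpha:]]$/' "$1"
}

last_three_letters "$sample"
```

The three-letter test is written out instead of using {3} so that it does not depend on interval-expression support in the awk being used.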
Use grep:
grep -P '^[^|]*[A-Za-z]{3}[|]' in_file > out_file
Here, GNU grep uses the following option:
-P : Use Perl regexes.
The regex means this:
^ : Start of the string.
[^|]* : Any non-pipe character, repeated 0 or more times.
[A-Za-z]{3} : 3 letters.
[|] : Literal pipe.
sed -n '/^[^|]*[[:alpha:]][[:alpha:]][[:alpha:]]|/p' file
grep '^[^|]*[[:alpha:]][[:alpha:]][[:alpha:]]|' file
{m,g}awk '!+FS<NF' FS='^[^|]*[A-Za-z][A-Za-z][A-Za-z][|]'
{m,g}awk '$!_!~"[|]"' FS='[A-Za-z][A-Za-z][A-Za-z][|]'
{m,g}awk '($!_~"[|]")<NF' FS='[A-Za-z][A-Za-z][A-Za-z][|]' # to play it safe
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287

grep regex how to get only results with one preceding word?

My string is :
www.abc.texas.com
mail.texas.com
subdomain.xyz.cc.texas.com
www2.texas.com
I am trying to get results only with "one" word before texas.com. Expectation when I do a regex grep :
mail.texas.com
www2.texas.com
So mail & www2 are the "one" word that I'm talking about. I tried :
grep "*.texas.com", but I get all of them in results. Can someone please help ?
You can use
grep '^[^.]*\.texas\.com'
Details:
^ - start of string
[^.]* - zero or more chars other than a . char
\.texas\.com - .texas.com string (literal . char must be escaped in the regex pattern).
See the online demo:
#!/bin/bash
s='www.abc.texas.com
mail.texas.com
subdomain.xyz.cc.texas.com
www2.texas.com'
grep '^[^.]*\.texas\.com' <<< "$s"
Output:
mail.texas.com
www2.texas.com
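Note that this pattern also accepts an empty first label (.texas.com) and any text after texas.com. A stricter sketch, assuming one hostname per line with nothing after it; the function name one_label_hosts and the two extra test names are made up for illustration:

```shell
# Hypothetical helper: keep hosts with exactly one label before texas.com.
one_label_hosts() {
  # [^.]+ needs at least one character; $ rejects trailing text
  grep -E '^[^.]+\.texas\.com$'
}

printf '%s\n' www.abc.texas.com mail.texas.com subdomain.xyz.cc.texas.com \
  www2.texas.com .texas.com mail.texas.com.example | one_label_hosts
```

Only mail.texas.com and www2.texas.com pass the filter.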
With awk:
awk 'BEGIN{FS=OFS="."} /\.texas\.com$/ && NF==3' file
Output:
mail.texas.com
www2.texas.com
Set one dot as input and output field separator, check for texas.com at the end ($) of your line and check for three fields.
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
With your shown samples, please try following awk code.
awk -F'.' 'NF==3 && $2=="texas" && $3=="com"' Input_file
Explanation: Set the field separator to . for all lines. In the main program, check three conditions: NF==3 (the current line has exactly three fields), the 2nd field is texas, and the 3rd field is com. When all three hold, the line is printed.

How to remove the characters after the last slash and two characters after it using sed?

Below is a sample file where I would like to keep the first two characters after the last slash (/) and trim everything that follows them.
cat sample.txt
HOME_1, /u01/app/oracle/or121022
HOME_2, /u01/app/oracle/or112100881
HOME_3, /uo1/app/mysql/my588822222
I am trying something like this:
cat sample.txt | sed 's%/[^/]\.\.*$%/\.\.%'
Expected Output:
HOME_1, /u01/app/oracle/or
HOME_2, /u01/app/oracle/or
HOME_3, /uo1/app/mysql/my
Any help is greatly appreciated.
Thanks!
Just remember the two characters after the slash in a capture group:
sed 's%\(/..\)[^/]*$%\1%'
[^/]*$ matches the rest of the string up to the end of line, and it's removed when the whole match gets replaced by the remembered part only.
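The same capture-group idea can be generalized to keep a chosen number of characters. A minimal sketch using extended regex syntax; the function name keep_after_last_slash and its count argument are assumptions for illustration:

```shell
# Hypothetical helper: keep at most $1 characters after the last slash.
keep_after_last_slash() {
  # The group holds the slash plus up to $1 non-slash characters;
  # [^/]*$ consumes the rest of the line, which the replacement drops.
  sed -E "s%(/[^/]{0,$1})[^/]*\$%\\1%"
}

printf '%s\n' \
  'HOME_1, /u01/app/oracle/or121022' \
  'HOME_2, /u01/app/oracle/or112100881' \
  'HOME_3, /uo1/app/mysql/my588822222' | keep_after_last_slash 2
```

Because everything after the group must be free of slashes up to the end of the line, the match can only start at the last slash.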
Here is a non-regex based awk solution:
awk -F/ -v OFS=/ '{$NF=substr($NF, 1, 2)} 1' file
/ is used as the field separator, which puts the last part in $NF; substr then keeps its first 2 characters.
Output:
HOME_1, /u01/app/oracle/or
HOME_2, /u01/app/oracle/or
HOME_3, /uo1/app/mysql/my
awk '{sub(/[0-9]+$/,"",$2)}1' file
HOME_1, /u01/app/oracle/or
HOME_2, /u01/app/oracle/or
HOME_3, /uo1/app/mysql/my
This awk example removes the digits at the end of column $2.
The third argument ,$2 can be omitted: sub() then works on the whole line ($0), and because $2 is the last field here the result is the same.
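The two forms differ in one visible way: assigning to a field makes awk rebuild the line with the output field separator. A small sketch, using a made-up input line with extra spaces to show the difference:

```shell
line='HOME_1,   /u01/app/oracle/or121022'   # note the three spaces

# Editing $2 makes awk rebuild the record with single-space OFS.
printf '%s\n' "$line" | awk '{sub(/[0-9]+$/,"",$2)}1'

# Without a third argument sub() edits $0 and leaves the spacing alone.
printf '%s\n' "$line" | awk '{sub(/[0-9]+$/,"")}1'
```

The first command collapses the three spaces to one, while the second keeps them. With the single-space sample file both commands produce identical output.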

How to find/extract a pattern from a file?

Here are the contents of my text file named 'temp.txt'
---start of file ---
HEROKU_POSTGRESQL_AQUA_URL (DATABASE_URL) ----backup---> b687
Capturing... done
Storing... done
---end of file ----
I want to write a bash script in which I need to capture the string 'b687' in a variable. This is really a pattern (the letter 'b' followed by 'n' digits). I can do it the hard way by looping through the file and extracting the desired string (b687 in the example above). Is there an easy way to do so? Perhaps by using awk or sed?
Try using grep
v=$(grep -oE '\bb[0-9]{3}\b' file)
This will search for a word starting with b followed by exactly 3 digits. Use [0-9]+ instead of [0-9]{3} to allow any number of digits.
Using sed
v=$(sed -nr 's/.*\b(b[0-9]{3})\b.*/\1/p' file)
varname=$(awk '/HEROKU_POSTGRESQL_AQUA_URL/{print $4}' filename)
This reads the file and, on the line that matches HEROKU_POSTGRESQL_AQUA_URL, prints the 4th whitespace-separated token, in this case b687.
Your other option is to use sed:
varname=$(sed -n 's/.* \(b[0-9][0-9]*\)/\1/p' filename)
In this case we are looking for the pattern you mentioned (b####...) and printing only that. The -n tells sed not to print lines automatically. The rest of the command is a substitution: .* matches any string at the beginning, followed by a space and a \(...\) group holding the regex that matches your b#####. The replacement \1 keeps only group 1, and the p at the end prints the result for lines where the substitution succeeded (needed because -n turned off the default printing).
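Putting the capture into a reusable function might look like this. It is a sketch under the assumption that the id is the last token on its line; the function name extract_backup_id and the temporary file are made up for the demo:

```shell
# Hypothetical helper: print the first backup id (b + digits) ending a line.
extract_backup_id() {
  sed -n 's/.*[[:space:]]\(b[0-9][0-9]*\)[[:space:]]*$/\1/p' "$1" | head -n 1
}

tmp=$(mktemp)
cat > "$tmp" <<'EOF'
HEROKU_POSTGRESQL_AQUA_URL (DATABASE_URL) ----backup---> b687
Capturing... done
Storing... done
EOF

backup_id=$(extract_backup_id "$tmp")
echo "$backup_id"
```

The head -n 1 guards against a file that contains more than one matching line, so the variable always holds a single id.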