Using awk to grab only numbers from a string - regex

Background:
I have a column that should get user input in the form "Description text ref12345678". I have existing scripts that grab the reference number, but unfortunately some users enter it incorrectly, so instead of "ref12345678" it can be "ref 12345678", "RF12345678", "abcd12345678" or any other variation. Naturally the wrong formatting breaks some of the triggered scripts.
For now I can't control the user input to this field, so I want the scripts later in the pipeline to just grab the number.
At the moment I'm stripping the letters with awk '{gsub(/[[:alpha:]]/, "")}; 1', but substitution seems like an inefficient solution. (I know I can also do this with sed -n 's/.*[a-zA-Z]//p' and tr -d '[:alpha:]', but they are essentially the same, and I want awk for the additional programmability.)
The question is: is there a way to get awk either to print only the numbers from a string, or to set the field delimiters so the numeric items in a string become fields? (Or is substitution really the most efficient solution for this problem?)
So in summary: how do I use awk on $ echo "ref12345678" to print only "12345678", without substitution?

if awk is not a must:
grep -o '[0-9]\+'
example:
kent$ echo "ref12345678"|grep -o '[0-9]\+'
12345678
with awk for your example:
kent$ echo "ref12345678"|awk -F'[^0-9]*' '$0=$2'
12345678
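A note on why $2 is printed there: the leading letters form an empty first field. If the digits could also come first in some malformed inputs, a small field loop is safer (a sketch, assuming one reference number per line):
$ echo "ref12345678" | awk -F '[^0-9]+' '{for (i=1; i<=NF; i++) if ($i ~ /[0-9]/) {print $i; break}}'
12345678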

You can also try the following with awk, assuming the number appears at the start of the string:
awk '{print ($0+0)}'
This forces a numeric conversion of the whole string; awk keeps the leading numeric prefix and discards the rest. Thus for example:
echo "19 trees"|awk '{print ($0+0)}'
will produce:
19
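Note that this does not help with the question's own input, where the digits come after the letters; the numeric conversion then yields 0:
$ echo "ref12345678" | awk '{print ($0+0)}'
0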

In awk you can also specify multiple conditions, like:
($3 ~ /[[:digit:]]/ && $3 !~ /[[:alpha:]]/ && $3 !~ /[[:punct:]]/) {print $3}
This prints the third field only when it contains digits and no alphabetic or punctuation characters;
!~ means "does not match".
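A quick sketch of how this behaves, assuming the reference sits in the third whitespace-separated field as in the question's "Description text ref12345678" format:
$ printf 'Description text ref12345678\nDescription text 12345678\n' | awk '($3 ~ /[[:digit:]]/ && $3 !~ /[[:alpha:]]/ && $3 !~ /[[:punct:]]/) {print $3}'
12345678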

grep works perfectly:
$ echo "../Tin=300_maxl=9_rdx=1.1" | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'
300
9
1.1
Step by step explanation:
-E
Use extended regex.
-o
Return only the matches, not the context
[+-]?[0-9]+([.][0-9]+)?
Match numbers which are identified as:
[+-]?
An optional leading sign
[0-9]+
One or more digits
([.][0-9]+)?
An optional period followed by one or more digits.
It is convenient to put the output in an array:
arr=($(echo "../Tin=300_maxl=9_rdx=1.1" | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'))
and then use it like this:
Tin=${arr[0]}
maxl=${arr[1]}
etc..
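Applied to one of the malformed inputs from the question, the same grep does the job:
$ echo "Description text RF12345678" | grep -Eo '[0-9]+'
12345678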

Another option (this works in any POSIX awk, since a multi-character FS is treated as a regular expression) is to specify a non-numeric regular expression as the field separator:
awk -F '[^0-9]+' '{for (i=1; i<=NF; ++i) if ($i != "") print $i}'

Related

Remove hostnames from a single line that follow a pattern in bash script

I need to cat a file and edit a single line with multiple domain names, removing any domain name that contains a certain 4-letter pattern, e.g. ozar.
This will be used in a bash script, so the number of domain names can vary; I will save this to a csv later on, but right now returning a string is fine.
I tried multiple commands, loops, and if statements, but sending the output to a variable I can use further in the script proved to be another difficult task.
Example file
$ cat file.txt
ozarkzshared.com win.ad.win.edu win_fl.ozarkzsp.com ap.allk.org allk.org ozarkz.com website.com
What I attempted (that was close)
domains_1=$(cat /tmp/file.txt | sed 's/ozar*//g')
domains_2=$( cat /tmp/file.txt | printf '%s' "${string##*ozar}")
Goal
echo domain_x
win.ad.win.edu ap.allk.org allk.org website.com
If all the domains are on a single line separated by spaces, this might work:
awk '/ozar/ {next} 1' RS=" " file.txt
This sets RS, your record separator, then skips any record that matches the keyword. If you wanted to be able to skip a substring provided in a shell variable, you could do something like this:
$ s=ozar
$ awk -v re="$s" '$0 ~ re {next} 1' RS=" " file.txt
Note that the ~ operator is comparing a regular expression, not precisely a substring. You could leverage the index() function if you really want to check a substring:
$ awk -v s="$s" 'index($0,s) {next} 1' RS=" " file.txt
Note that all of the above is awk, which isn't what you asked for. If you'd like to do this with bash alone, the following might be for you:
while read -r -a a; do
    for i in "${a[@]}"; do
        [[ "$i" = *"$s"* ]] || echo "$i"
    done
done < file.txt
This assigns each line of input to the array $a[], then steps through that array testing for a substring match and printing if there is none. Text processing in bash is MUCH less efficient than in a more specialized tool like awk or sed. YMMV.
If you want to delete each matching word up to the next space delimiter:
$ sed 's/ozar[^ ]*//g' file
win.ad.win.edu win_fl. ap.allk.org allk.org website.com

How to display words as per given number of letters?

I have created this basic script:
#!/bin/bash
file="/usr/share/dict/words"
var=2
sed -n "/^$var$/p" /usr/share/dict/words
However, it's not working as required (or it still needs some more logic).
It should print only 2-letter words, but it gives different output.
Can anyone suggest ideas on how to achieve this with sed or with awk?
it should print only 2 letter words
Your sed command just searches for lines that consist of the literal text 2.
You can use awk for this:
awk 'length() == 2' file
Or using a shell variable:
awk -v n=$var 'length() == n' file
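For example, with var=2 (the output depends on your dictionary; these are the first entries shown in the empty-FS answer further down, which uses the same words file):
$ awk -v n="$var" 'length() == n' /usr/share/dict/words | head -5
aa
Ab
ad
ae
Ah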
What you are executing is:
sed -n "/^2$/p" /usr/share/dict/words
This means: all lines consisting of exactly the number 2, nothing else. Of course this does not return anything, since /usr/share/dict/words contains words and not numbers (as far as I know).
If you want to print the lines consisting of two characters, you need to use something like .. (since . matches any character):
sed -n "/^..$/p" /usr/share/dict/words
To make the number of characters variable, use a quantifier {} (note the use of \ so that sed's BRE parses it properly):
sed -n "/^.\{2\}$/p" /usr/share/dict/words
Or, with a variable:
sed -n '/^.\{'"$var"'\}$/p' /usr/share/dict/words
Note that we are putting the variable outside the quotes for safety (thanks Ed Morton in comments for the reminder).
Pure bash... :)
file="/usr/share/dict/words"
var=2
#building a regex
str=$(printf "%${var}s")
re="^${str// /.}$"
while read -r word
do
[[ "$word" =~ $re ]] && echo "$word"
done < "$file"
It builds a regex of the form ^..$ (the number of dots is variable). It does this in 2 steps:
create a string of the desired length, e.g. with printf "%2s": without arguments, printf prints only the filler spaces, so you get a string of 2 spaces
but we have a variable var, therefore %${var}s
replace all spaces in the string with .
But don't use this solution. It is too slow, and there are better utilities for this; the best is IMHO grep:
file="/usr/share/dict/words"
var=5
grep -P "^\w{$var}$" "$file"
Try awk-
awk -v var=2 '{if (length($0) == var) print $0}' /usr/share/dict/words
This can be shortened to
awk -v var=2 'length($0) == var' /usr/share/dict/words
which has the same effect.
To output only lines matching 2 alphabetic characters with grep:
grep '^[[:alpha:]]\{2\}$' /usr/share/dict/words
GNU awk and mawk at least support this (an empty FS is not specified by POSIX):
$ awk -F '' 'NF==2' /usr/share/dict/words | head -5
aa
Ab
ad
ae
Ah
An empty FS separates each character into its own field, so NF gives the record length.

Capture strings from several sets of quotes

I've been looking for a straight answer to this but haven't found anything on SO or in wider searching that answers this simple question:
I have a string of quoted values, ip addresses in this case, that I want to extract individually to use as values elsewhere. I am intending to do this with sed and regex. The string format is like this:
"10.10.10.101","10.10.10.102","10.10.10.103"
I can capture the values between all quotes using regex such as:
"([^"]*)"
Question is how do I select each group separately so I can use them?
i.e.:
value1 = 10.10.10.101
value2 = 10.10.10.102
value3 = 10.10.10.103
I assume that I need three expressions, but I cannot find how to select a specific occurrence.
Apologies if it's obvious, but I have spent a while searching and testing with no luck...
You can try this bash:
$ str="10.10.10.101","10.10.10.102","10.10.10.103"
$ IFS="," arr=($str)
$ echo ${arr[1]}
10.10.10.102
If you have GNU awk, you can use FPAT to set the pattern for each field:
awk -v FPAT='[0-9.]+' '{ print $1 }' <<<'"10.10.10.101","10.10.10.102","10.10.10.103"'
Change $1 to $2 or $3 to print whichever value you want.
Since your fields don't contain spaces, you could use a similar method to read the values into an array:
read -ra ips < <(awk -v FPAT='[0-9.]+' '{ $1 = $1 }1' <<<'"10.10.10.101","10.10.10.102","10.10.10.103"')
Here, $1 = $1 makes awk reformat each line, so that the fields are printed with spaces in between.
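For example, after the read the individual addresses are available by index:
$ read -ra ips < <(awk -v FPAT='[0-9.]+' '{ $1 = $1 }1' <<<'"10.10.10.101","10.10.10.102","10.10.10.103"')
$ echo "${ips[2]}"
10.10.10.103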
Using grep -P you can use \K to reset the start of the reported match:
s="10.10.10.101","10.10.10.102","10.10.10.103"
arr=($(grep -oP '(^|,)"\K[^"]*' <<< "$s"))
# check array content
declare -p arr
declare -a arr='([0]="10.10.10.101" [1]="10.10.10.102" [2]="10.10.10.103")'
If your grep doesn't support -P (PCRE) flag then use:
arr=($(grep -Eo '[.[:digit:]]+' <<< "$s"))
Here is an awk command that should work for BSD awk as well:
awk -F '"(,")?' '{for (i=2; i<NF; i++) print $i}' <<< "$s"

Incorporate egrep regexps with awk?

I've been trying to understand how awk can work with egrep regular expressions.
I have the following example:
John,Milanos
Anne,Silverwood
Tina,Fastman
Adrian,Thomassonn
I'm looking to use egrep regexps to process the second column (the last names in this scenario) while printing the entire line for output.
The closest I've come to my answer was using:
$ awk -F ',' '{print $2}' file | egrep '([a-z])\1.*([a-z])\2'
Thomassonn
I would then take "Thomassonn" and egrep back into my initial list of full names to get the full record. However, I've encountered plenty of errors and false positives once I used other filters.
Desired result:
Adrian,Thomassonn
awk does not support back-references within a regex. egrep, however, is sufficient to achieve your desired result:
$ egrep ',.*([a-z])\1.*([a-z])\2' file
Adrian,Thomassonn
Variations
If there are three or more columns and you want to search only the second:
egrep '^[^,]*,[^,]*([a-z])\1[^,]*([a-z])\2' file
If you want to search the third column:
egrep '^[^,]*,[^,]*,[^,]*([a-z])\1[^,]*([a-z])\2' file
If you want to search the first of any number of columns:
egrep '^[^,]*([a-z])\1[^,]*([a-z])\2' file
awk doesn't support backreferences, here's one way to do what you want instead:
$ cat tst.awk
BEGIN{ FS="," }
{
numMatches = 0
fld = $2
for (charNr=1; charNr <= length($2); charNr++) {
char = substr($2,charNr,1)
if (char ~ /[a-z]/)
numMatches += gsub(char"{2}"," ",fld)
}
}
numMatches >= 2
$
$ awk -f tst.awk file
Adrian,Thomassonn
If you want to match sequences of 3 or any other number of repeated chars, just change {2} to {3} or whatever number you like.
By the way, for portability to all locales you should use [[:lower:]] instead of [a-z] if that's what you really mean.

Remove everything after 2nd occurrence in a string in unix

I would like to remove everything after the 2nd occurrence of a particular
pattern in a string. What is the best way to do it in Unix? What is the most elegant and simple method to achieve this: sed, awk or just Unix commands like cut?
My input would be
After-u-math-how-however
Output should be
After-u
Everything after the 2nd - should be stripped out. The solution should also handle
zero occurrences of the pattern, so zero or one occurrence should be left untouched and
only from the 2nd occurrence onward should everything be removed.
So if the input is as follows
After
Output should be
After
Something like this would do it.
echo "After-u-math-how-however" | cut -f1,2 -d'-'
This will split up (cut) the string into fields, using a dash (-) as the delimiter. Once the string has been split into fields, cut will print the 1st and 2nd fields.
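Note that cut also covers the zero-or-one-occurrence requirement: by default, lines that contain no delimiter at all are printed unchanged (unless -s is given):
$ echo "After" | cut -f1,2 -d'-'
After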
This might work for you (GNU sed):
sed 's/-[^-]*//2g' file
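For example:
$ echo "After-u-math-how-however" | sed 's/-[^-]*//2g'
After-u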
You could use the following regex to select what you want:
^[^-]*-\?[^-]*
For example:
echo "After-u-math-how-however" | grep -o "^[^-]*-\?[^-]*"
Results:
After-u
@EvanPurkisher's cut -f1,2 -d'-' solution is IMHO the best one, but since you asked about sed and awk:
With GNU sed for -r
$ echo "After-u-math-how-however" | sed -r 's/([^-]+-[^-]*).*/\1/'
After-u
With GNU awk for gensub():
$ echo "After-u-math-how-however" | awk '{$0=gensub(/([^-]+-[^-]*).*/,"\\1","")}1'
After-u
Can be done with non-GNU sed using \( and *, and with non-GNU awk using match() and substr() if necessary.
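A sketch of that portable sed form (plain BRE, using \( \) and * only):
$ echo "After-u-math-how-however" | sed 's/\([^-][^-]*-[^-]*\).*/\1/'
After-u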
awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After-u-math-how-however'
Split the line into fields based on the field separator - (specified with -F -), which is accessible as the special variable FS inside the awk program.
Always print the 1st field (print $1), followed by:
If there's more than 1 field (NF>1), append FS (i.e., -) and the 2nd field ($2)
Otherwise: append "", i.e.: effectively only print the 1st field (which in itself may be empty, if the input is empty).
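As a quick check, the single-field input passes through unchanged:
$ awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After'
After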
This can be done in pure bash (which means no fork, no external process). Read into an array split on '-', then slice the array:
$ IFS=-
$ read -ra val <<< After-u-math-how-however
$ echo "${val[*]}"
After-u-math-how-however
$ echo "${val[*]:0:2}"
After-u
awk '$0 = $2 ? $1 FS $2 : $1' FS=-
This assigns to $0 either the first two fields joined by FS (when a 2nd field exists) or just the first field; for these inputs the assigned value is non-empty and therefore true as a pattern, so the default action prints the modified line.
Result
After-u
After
This will do it in awk:
echo "After" | awk -F "-" '{printf "%s", $1; if (NF >= 2) printf "-%s", $2; print ""}'