Regular expression to search column in text file - regex

I am having trouble getting a regular expression that will search for an input term in the specified column. If the term is found in that column, then it needs to output that whole line.
These are my variables:
sreg = search word #Example: Adam
file = text file #Example: Contacts.txt
sfield = column number #Example: 1
The text file is in this format, with a space as the field separator and many contact entries:
First Last Email Phone Category
Adam aster junfmr# 8473847548 word
Jeff Williams 43wadsfddf# 940342221995 friend
JOhn smart qwer#qwer 999999393 enemy
yooun yeall adada 111223123 other
zefir sentr jjdirutk#jd 8847394578 other
I've tried with no success:
grep "$sreg" "$file" | cut -d " " -f"$sfield"-"$sfield"
awk -F, '{ if ($sreg == $sfield) print $0 }' "$file"
awk -v s="$sreg" -v c="$sfield" '$c == s { print $0 }' "$file"
Thanks for any help!

awk may be the best solution for this:
awk -v field="$field" -v name="$name" '$field==name' "$file"
This checks whether field number $field has the value $name. If so, awk automatically prints the whole line (printing the record is the default action when a condition is true).
For example:
$ field=1
$ name="Adam"
$ file="your_file"
$ awk -v field="$field" -v name="$name" '$field==name' "$file"
Adam aster junfmr# 8473847548 word
As you can see, we give the parameters using -v var="$bash_var", so that you can use them inside awk.
Also, whitespace is awk's default field separator, so you don't need to specify it for this space-separated file.
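If the file used another delimiter, say commas, you would only need to add an explicit -F (just a variant for illustration; not needed for the space-separated sample here):
awk -F',' -v field="$field" -v name="$name" '$field==name' "$file"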

This works for me:
awk -v f="$sfield" -v reg="$sreg" '{if ($f ~ reg) {print $0}}' "$file"
The major problem is that you need an indirection from $sfield (e.g. "1") to the corresponding awk field (e.g. $1).
I tried using backticks ` and also ${!sfield}, but they don't work here, since awk does not accept them. What finally worked was passing the shell variable into awk and converting it to an awk-internal variable (using -v).
Within awk you cannot access shell variables directly at all, so I had to pass $sreg the same way.
Update: I think using "~" instead of "==" is better, because the original requirement was about matching a regular expression.
For example, with sreg=Ad the ~ comparison still matches the Adam line, while == would not.
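A quick check, assuming the sample data above is saved as the Contacts.txt from the question:
$ sreg=Ad
$ sfield=1
$ awk -v f="$sfield" -v reg="$sreg" '{if ($f ~ reg) {print $0}}' Contacts.txt
Adam aster junfmr# 8473847548 word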

Related

Remove hostnames from a single line that follow a pattern in bash script

I need to cat a file and edit a single line that contains multiple domain names, removing any domain name that contains a certain 4-letter pattern, e.g. ozar.
This will be used in a bash script, so the number of domain names can vary. I will save the result to a CSV later on, but right now returning a string is fine.
I tried multiple commands, loops, and if statements, but sending the output to a variable I can use later in the script proved to be another difficult task.
Example file
$ cat file.txt
ozarkzshared.com win.ad.win.edu win_fl.ozarkzsp.com ap.allk.org allk.org ozarkz.com website.com
What I attempted (that was close)
domains_1=$(cat /tmp/file.txt | sed 's/ozar*//g')
domains_2=$( cat /tmp/file.txt | printf '%s' "${string##*ozar}")
Goal
echo "$domain_x"
win.ad.win.edu ap.allk.org allk.org website.com
If all the domains are on a single line separated by spaces, this might work:
awk '/ozar/ {next} 1' RS=" " file.txt
This sets RS, your record separator, then skips any record that matches the keyword. If you wanted to be able to skip a substring provided in a shell variable, you could do something like this:
$ s=ozar
$ awk -v re="$s" '$0 ~ re {next} 1' RS=" " file.txt
Note that the ~ operator is comparing a regular expression, not precisely a substring. You could leverage the index() function if you really want to check a substring:
$ awk -v s="$s" 'index($0,s) {next} 1' RS=" " file.txt
Note that all of the above is awk, which isn't what you asked for. If you'd like to do this with bash alone, the following might be for you:
while read -r -a a; do
    for i in "${a[@]}"; do
        [[ "$i" = *"$s"* ]] || echo "$i"
    done
done < file.txt
This assigns each line of input to the array $a[], then steps through that array testing for a substring match and printing if there is none. Text processing in bash is MUCH less efficient than in a more specialized tool like awk or sed. YMMV.
You want to delete each word starting at ozar and running up to the next space delimiter:
$ sed 's/ozar[^ ]*//g' file
win.ad.win.edu win_fl. ap.allk.org allk.org website.com
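If the win_fl. leftover is not wanted and the whole win_fl.ozarkzsp.com entry should go, a variant (a sketch along the same lines, with some whitespace cleanup) is to delete the entire space-delimited word containing the pattern:
$ sed 's/[^ ]*ozar[^ ]*//g; s/  */ /g; s/^ *//' file
win.ad.win.edu ap.allk.org allk.org website.com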

awk: match regex with or operator of two shell variables

I am trying to combine a few things with awk.
First, the use of shell variables inside awk.
Second, using a shell variable to match a pattern, which points to the ~ operator.
Finally, I want to use some kind of or operator to match two shell variables.
Putting all together in foo.sh:
#!/bin/bash
START_TEXT="My start text"
END_TEXT="My end text"
awk -v start=$START_TEXT -v end=$END_TEXT '$0 ~ start || $0 ~ end { print $2 }' myfile
Which fails to run:
$ ./foo.sh
awk: fatal: cannot open file `text' for reading (No such file or directory)
So I think the OR-operator (||) does not work well with regex ~ operator.
I was guessing I may need to do the OR-thing inside the regex.
So I tried these two:
awk -v start=$START_TEXT -v end=$END_TEXT '$0~/start|end/ { print $2 }' myfile
awk -v start=$START_TEXT -v end=$END_TEXT '$0~start|end { print $2 }' myfile
With same failed result.
And even this thing fails...
awk -v start=$START_TEXT '$0~start { print $2 }' myfile
So I am doing something really wrong...
Any hints how to achieve this?
You can do the regex OR like this:
awk -v start="$START_TEXT" -v end="$END_TEXT" '$0~ start "|" end { print $2 }' myfile
awk treats the right-hand side of the ~ operator as a regex, so we can build a single pattern by concatenating the two strings with the | (or) operator between them.
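For instance, with the variables from the question, the concatenation builds one alternation pattern:
$ awk -v start="$START_TEXT" -v end="$END_TEXT" 'BEGIN { print start "|" end }'
My start text|My end text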
Also there's another way to pass variables into awk, like this:
awk '$0~ start "|" end { print $2 }' start="$START_TEXT" end="$END_TEXT" myfile
This is more concise, but since it is less intuitive, use it with caution.
Well, it seems @jxc pointed out my problem in the comments: the shell variables need to be quoted.
awk -v start="$START_TEXT" -v end="$END_TEXT" '$0~start || $0~end { print $2 }' myfile
That made it work!
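A quick sanity check, using a hypothetical myfile with one line of each kind:
$ printf '%s\n' 'My start text here' 'something unrelated' 'My end text here' > myfile
$ awk -v start="$START_TEXT" -v end="$END_TEXT" '$0~start || $0~end { print $2 }' myfile
start
end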

Remove a line from file that starts with a number in bash

I am trying to create a simple CSV editor in bash,
and I struggle with removing a line. The user passes in the ID of
the line to remove (each row is defined with an ID as the first column).
This is an example file structure:
ID,Name,Surname
0,Mark,Twain
1,Cristopher,Jones
So, having the id saved in a variable and the file name in another variable (say it's file.csv), I attempt to remove the line from bash like this:
read -p "Pass the object's ID: " idtoremove
fname=file.csv
sed -i -e "'/^$idtoremove*,/d'" $fname
However, this has no effect on the file. What could be wrong with this line?
Also, how can I replace a line starting with a given ID with a string from a variable? This is another problem I will have to face, but I have no idea how to approach it.
The following script could help you. It asks the user to enter an id.
cat script.ksh
echo "Please enter the id to be removed:"
read value
awk -v val="$value" -F, '$1!=val' Input_file
In case you want to save the output into Input_file itself, append > tmp_file && mv tmp_file Input_file to the above awk command.
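That is, a sketch of the in-place variant:
awk -v val="$value" -F, '$1!=val' Input_file > tmp_file && mv tmp_file Input_file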
With sed:
cat script.ksh
echo "Please enter the id to be removed:"
read value
sed "/^$value,/d" Input_file
Use the sed -i.bak option in the above sed command to save the output into Input_file itself and keep a backup of the original Input_file too.
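Applied to the command above, the in-place form would be:
sed -i.bak "/^$value,/d" Input_file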
This is best done in awk:
awk -v id="$idtoremove" -F, '$1 != id' file.csv
If you're using GNU awk then you can also save in place:
awk -i inplace -v id="$idtoremove" -F, '$1 != id' file.csv
For other awk versions use:
awk -v id="$idtoremove" -F, '$1 != id' file.csv > $$.csv &&
mv $$.csv file.csv
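As for the second part of the question (replacing the row that starts with a given ID), a minimal sketch in the same spirit, assuming the replacement row is held in a hypothetical shell variable newrow:
awk -v id="$idtoremove" -v row="$newrow" -F, '$1 == id {print row; next} {print}' file.csv > $$.csv &&
mv $$.csv file.csv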

bash replace part of text separated with marker that includes a keyword

I would like to replace everything between colons (:) if there's a keyword in it.
Having
TEXT="/Something/like-this:/How/can-one-replace/text/separated/with/colon/that-includes/a/keyword?:There/may-be/multiple/keywords:/Thanks:/keyword"
with:
sed -e 's/regex here that searches for keyword/\/some\/path/g' <<< $TEXT
To get:
/Something/like-this:/some/path:/some/path:/Thanks:/some/path
P.S.
Another example to make it clearer: how can paths that include hello be replaced with another path?
/opt/hello/bin:/bin:/home/user/hello:/home/user/bin:/media/hello
=>
/some/path:/bin:/some/path:/home/user/bin:/some/path
My apologies for the unclear question.
I think you need this,
$ sed -r 's~^([^:]+):.*:([^:]+):(.*)$~\1:/Replacement:/Replacement:\2:/Replacement~g' file
/Something/like-this:/Replacement:/Replacement:/Thanks:/Replacement
Or something like this,
$ sed -r 's~^([^:]+):.*:([^:]+):(.*)$~\1:/*Replacement*:/*Replacement*:\2:/*Replacement*~g' file
/Something/like-this:/*Replacement*:/*Replacement*:/Thanks:/*Replacement*
Or
it may be like this, if you assign some path to the Replacement variable,
$ Replacement=/foo/bar
$ sed -r "s~^([^:]+):.*:([^:]+):(.*)$~\1:/*$Replacement*:/*$Replacement*:\2:/*$Replacement*~g" file
/Something/like-this:/*/foo/bar*:/*/foo/bar*:/Thanks:/*/foo/bar*
Or
You may try this also,
awk -v RS=: -v var=/path -v ORS=: '{sub (/.*hello.*/,var)}1' file
Example:
$ echo '/opt/hello/bin:/bin:/home/user/hello:/home/user/bin:/media/hello' | awk -v RS=: -v var=/foo/bar -v ORS=: '{sub (/.*hello.*/,var)}1'
/foo/bar:/bin:/foo/bar:/home/user/bin:/foo/bar:
Explanation:
The awk built-in variables RS (record separator) and ORS (output record separator) are set to :. So awk breaks the string whenever it finds : in the input and treats the text after each : as the next record.
ORS is set to :, so awk prints the records with : as the separator (this is also why the output above ends with a trailing :; see the note after this list).
-v var=/foo/bar , Replacement string is assigned to a variable var.
sub (/.*hello.*/,var), if the record matches this regex, it replaces the whole record with the value in the variable var.
1, to print all the records.
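Because ORS is appended after every record, the output ends with a trailing :. If that matters, one small cleanup (an add-on sketch, not part of the original answer) is to strip it afterwards:
$ echo '/opt/hello/bin:/bin:/home/user/hello:/home/user/bin:/media/hello' | awk -v RS=: -v var=/foo/bar -v ORS=: '{sub (/.*hello.*/,var)}1' | sed 's/:$//'
/foo/bar:/bin:/foo/bar:/home/user/bin:/foo/bar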
My version (it first doubles the colons and wraps the line in colons so every path is bounded by :, replaces any :...keyword...: field, then undoes the extra colons):
sed 's/:/::/g;s/^/:/;s/$/:/;s/:[^:]*keyword[^:]*:/:REPLACEMENT:/g;s/^://;s/:$//;s/::/:/g'
With bash
IFS=: read -ra arr <<<'/opt/hello/bin:/bin:/home/user/hello:/home/user/bin:/media/hello'
v=$(IFS=:; printf "%s\n" "${arr[*]/*hello*/\/some\/path}")
echo $v
/some/path:/bin:/some/path:/home/user/bin:/some/path

Using awk to grab only numbers from a string

Background:
I have a column that should get user input in form of "Description text ref12345678". I have existing scripts that grab the reference number but unfortunately some users add it incorrectly so instead of "ref12345678" it can be "ref 12345678", "RF12345678", "abcd12345678" or any variation. Naturally the wrong formatting breaks some of the triggered scripts.
For now I can't control the user input to this field, so I want to make the scripts later in the pipeline just to get the number.
At the moment I'm stripping the letters with awk '{gsub(/[[:alpha:]]/, "")}; 1', but substitution seems like an inefficient solution. (I know I can do this also with sed -n 's/.*[a-zA-Z]//p' and tr -d '[[:alpha:]]' but they are essentially the same and I want awk for additional programmability).
The question is: is there a way to set awk to either print only the numbers from a string, or to set delimiters around the numeric items in a string? (Or is substitution really the most efficient solution for this problem?)
So in summary: how do I use awk for $ echo "ref12345678" to print only "12345678" without substitution?
if awk is not a must:
grep -o '[0-9]\+'
example:
kent$ echo "ref12345678"|grep -o '[0-9]\+'
12345678
with awk for your example:
kent$ echo "ref12345678"|awk -F'[^0-9]*' '$0=$2'
12345678
You can also try the following with awk assuming there will be only one number in a string:
awk '{print ($0+0)}'
This converts the entire string to a number; the way awk's string-to-number conversion works, only the leading numeric part is kept. Thus, for example:
echo "19 trees"|awk '{print ($0+0)}'
will produce:
19
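Note that this relies on the number being at the start of the string; with a leading prefix such as ref, the numeric conversion yields 0:
echo "ref12345678"|awk '{print ($0+0)}'
will produce:
0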
In awk you can specify multiple conditions, like:
($3 ~ /[[:digit:]]/ && $3 !~ /[[:alpha:]]/ && $3 !~ /[[:punct:]]/) {print $3}
This prints $3 only when it contains digits and no alphabetic or punctuation characters;
!~ means the field does not match the pattern.
grep works perfectly:
$ echo "../Tin=300_maxl=9_rdx=1.1" | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'
300
9
1.1
Step by step explanation:
-E
Use extended regex.
-o
Return only the matches, not the context
[+-]?[0-9]+([.][0-9]+)?
Match numbers which are identified as:
[+-]?
An optional leading sign
[0-9]+
One or more numbers
([.][0-9]+)?
An optional period followed by one or more numbers.
It is convenient to put the output in an array:
arr=($(echo "../Tin=300_maxl=9_rdx=1.1" | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'))
and then use it like this
Tin=${arr[0]}
maxl=${arr[1]}
etc..
Another option (assuming GNU awk) involves specifying a non-numeric regular expression as a separator
awk -F '[^0-9]+' '{OFS=" "; for(i=1; i<=NF; ++i) if ($i != "") print($i)}'
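For example:
$ echo "ref12345678" | awk -F '[^0-9]+' '{OFS=" "; for(i=1; i<=NF; ++i) if ($i != "") print($i)}'
12345678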