print line that matches first field (bash) - regex

I'm trying to read user input, match it against the first field of a CSV file, and print the entire matching line. Here's what I've come up with:
#!/bin/bash
echo "enter number: "
read USERINPUT
LINENUMBER=$(awk -v FS=',' '{print $1}' < test.csv | grep -n "$USERINPUT")
FULLLINE=$(sed -n $LINENUMBER\p test.csv)
echo $FULLLINE
The problem I'm running into: say I set USERINPUT=4, but my CSV file has several lines whose first field is 4, 421, 444, etc. I match all of them. How do I make
grep -n "$USERINPUT"
only match exactly what it is set to and nothing else?

Instead of printing the first column of every line, then using grep, you should just do the whole thing in awk:
line_number=$(awk -F, -v s="$number" '$1==s{print NR}' test.csv)
If you just want to print the line, that's simple:
awk -F, -v s="$number" '$1==s' test.csv
By the way, instead of using an echo followed by a read, you can use read -p which allows you to specify a prompt:
read -p "enter number: " number
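Putting the pieces above together, a minimal sketch of the whole script might look like the following. The function name print_matching_line and the sample CSV contents are assumptions made for illustration; they are not part of the original question.

```shell
#!/bin/bash
# Hypothetical helper: print every line of a CSV whose first field equals
# the given value. The whole field is compared, not a regex; awk compares
# numerically when both sides look like numbers, otherwise as strings.
print_matching_line() {
    local value=$1 file=$2
    awk -F, -v s="$value" '$1 == s' "$file"
}

# Sample data standing in for test.csv (assumed contents).
csv=$(mktemp)
printf '%s\n' '4,apple' '421,banana' '444,cherry' > "$csv"

# In an interactive script this would be: read -r -p "enter number: " number
number=4
result=$(print_matching_line "$number" "$csv")
echo "$result"
rm -f "$csv"
```

With the sample data only the line whose first field is exactly 4 is printed; 421 and 444 are skipped.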

#!/bin/bash
read -p "enter number: " num
grep "^$num," test.csv

The -o grep option prints only what matches the regular expression.
E.g.
grep -o '.*USERINPUT.*'
or
grep -o '^USERINPUT.*'
etc.

#!/bin/bash
echo "enter number: "
read USERINPUT
# for a var assignation and print content
FULLLINE=$(egrep "^${USERINPUT%% *}," test.csv )
echo $FULLLINE
# for only a print
egrep "^${USERINPUT%% *}," test.csv
This uses egrep with delimiters around the input: the start-of-line anchor before it and a comma after it.
It also applies a small input cleanup via ${VarName%% *}, which removes everything from the first space onward.
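To see what the ${VarName%% *} expansion does to different inputs, here is a small sketch. The sample values are made up for illustration.

```shell
#!/bin/bash
# ${var%% *} removes the longest suffix that begins with a space,
# i.e. everything from the first space onward.
USERINPUT='42 extra words'
trimmed=${USERINPUT%% *}
echo "[$trimmed]"

# Caveat: a leading space makes the whole value disappear,
# because the first space is at the very start of the string.
USERINPUT=' 42'
leading=${USERINPUT%% *}
echo "[$leading]"
```

The second case suggests that input with a leading space would need separate handling before this expansion is applied.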

Related

Passing shell variable to awk in for-loop

I'm writing a script to print column and row numbers of cells which match a given string and then output it to a text file. The individual awk commands work fine in terminal and I've resolved other syntax issues, but .txt that is output still comes up empty. I think I have a problem with passing shell variables to awk.
#!/bin/bash
echo Literal or regex string to find:
read string
echo File path to find string match in:
read filename
echo "Matches for $string were found in the following cells:" > results.txt
for string in filename
do
awk -v awkvar="$string" -F"," '{for(i=1;i<=NF;i++){if ($i ~ /awkvar/){print i}}}' $filename >> results.txt | echo -e "\n" >> results.txt
awk -v awkvar="$string" '/awkvar/{print NR}' $filename >> results.txt | echo -e "\n" >> results.txt
done
Problem Resolved
I've rewritten the script as follows:
#!/bin/bash
# Prompt for input: 1. enter file name or path that you want searched; 2. enter the literal or regex string
echo File name or path to find matches in:
read file
echo Literal or regex string to find:
read string
# Define variable and test if any matches are to be found; if not, notification is sent to terminal, but if matches exist, their row numbers (as summary rows) and individual column numbers will be output to a .txt file in the home directory. NB: you need to escape minus symbol with brackets, [-], so that it's not confused with an invalid grep option!
matchesFound=$(cat $file | grep -E -c "$string")
if [ $matchesFound -eq 0 ];
then
echo "No matches exist."
else
printf "Summary Row No: \n`awk -v awkvar="$string" '$0 ~ awkvar{print NR}' $file`" > results_for_$string.txt
printf "\nInstance Column No: \n`awk -v awkvar="$string" -F"," '{for(i=1;i<=NF;i++){if ($i ~ awkvar){print i}}}' $file`" >> results_for_$string.txt
fi
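You can't use an awk variable inside a /.../ regex literal: awk treats /awkvar/ as the literal text "awkvar", not as the variable's value. Put the variable on the right-hand side of ~ without the slashes, or use awk's index() function for a plain substring check: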
awk -v awkvar="$string" -F"," '{for(i=1;i<=NF;i++){if ($i ~ awkvar){print i}}}' $filename >> results.txt | echo -e "\n" >> results.txt
awk -v awkvar="$string" 'index($0,awkvar){print NR}' $filename >> results.txt | echo -e "\n" >> results.txt
This answer only addresses the awk code shown in the question.
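The difference between the three forms can be seen side by side in a small sketch. The sample data and the variable names literal, regex and substr are assumptions chosen for illustration.

```shell
#!/bin/bash
data=$(mktemp)
printf '%s\n' 'a,b.c,d' 'x,bxc,z' 'awkvar,q,r' > "$data"
string='b.c'

# /awkvar/ is a regex literal: it searches for the text "awkvar" itself.
literal=$(awk -v awkvar="$string" '/awkvar/ {print NR}' "$data")

# $0 ~ awkvar uses the variable's value as a dynamic regex,
# so the "." in "b.c" matches any character.
regex=$(awk -v awkvar="$string" '$0 ~ awkvar {print NR}' "$data")

# index() looks for the value as a plain substring, with no regex meaning.
substr=$(awk -v awkvar="$string" 'index($0, awkvar) {print NR}' "$data")

echo "literal:   $literal"
echo "regex:     $(echo $regex)"
echo "substring: $substr"
rm -f "$data"
```

Here the literal form reports only row 3, the dynamic regex reports rows 1 and 2, and index() reports only row 1.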

Removing rows that contains "(null)" value from a text file

I would like to remove any row within a .txt file that contains "(null)". The (null) value is always in the 3rd column. I would like to add this to a script that I already have.
Txt file example:
39|1411|XXYZ
40|1416|XXX
41|1420|(null)
In this example I would like to remove the third row.
I'm guessing it's an awk -F, but I'm not sure where to go from there.
You are on the right track with using -F.
$ awk -F '|' '$3 != "(null)"' file.txt
39|1411|XXYZ
40|1416|XXX
You set the field separator to |, then print all lines where the third field is not equal to (null). This uses awk's default of "print the line" if there's no action associated with a pattern.
If you relax the requirement to specifically test the third field, and there is no other place for the "(null)" substring to occur, you can get the same result with
grep -vF '(null)' file.txt
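Since the filter is meant to go into an existing script, one way to apply it to the file itself is to write to a temporary file and move it into place. This is only a sketch: the use of mktemp and the sample file are assumptions, not part of the question.

```shell
#!/bin/bash
# Sample input standing in for the real .txt file (assumed contents).
file=$(mktemp)
printf '%s\n' '39|1411|XXYZ' '40|1416|XXX' '41|1420|(null)' > "$file"

tmp=$(mktemp)
# Keep only rows whose third field is not "(null)"; replace the
# original file only if awk succeeded.
awk -F'|' '$3 != "(null)"' "$file" > "$tmp" && mv "$tmp" "$file"

filtered=$(cat "$file")
echo "$filtered"
rm -f "$file"
```

Redirecting awk's output straight back onto its own input file would truncate the file before it is read, which is why the temporary file is used.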
With awk:
awk '-F|' '$3 != "(null)"' < input-file
Here is a sed:
$ sed '/(null)$/d' file
39|1411|XXYZ
40|1416|XXX
The $ ensures that (null) is at the end of the line. If you want to ensure that (null) is the entire final column:
$ sed '/|(null)$/d' file
And if you want to be extra sure that it is the third column:
$ sed '/^[^|]*|[^|]*|(null)$/d' file
Or with grep:
$ grep -v '^[^|]*|[^|]*|(null)$' file
(But instead of this last one, just use awk...)
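Escaping differs between basic and extended regular expressions. In basic syntax (plain sed) a bare | and bare parentheses are literal characters. In extended syntax (sed -E) they are operators and need a backslash to be matched literally. A small sketch comparing the two, with made-up sample data:

```shell
#!/bin/bash
file=$(mktemp)
printf '%s\n' '39|1411|XXYZ' '40|1416|XXX' '41|1420|(null)' '(null)|7|ok' > "$file"

# Basic regular expressions: | ( ) are ordinary characters.
bre=$(sed '/^[^|]*|[^|]*|(null)$/d' "$file")

# Extended regular expressions: escape | ( ) to match them literally.
ere=$(sed -E '/^[^|]*\|[^|]*\|\(null\)$/d' "$file")

echo "$bre"
rm -f "$file"
```

Both commands delete only the row whose third column is (null); the row with (null) in the first column is kept.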
Use grep:
grep -v '|.*|(null)' in_file
Here, grep uses option -v : print lines that do not match.
Or use Perl:
perl -F'[|]' -lane 'print if $F[2] ne "(null)";' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'[|]' : Split into @F on literal |, rather than on whitespace.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
I would like to remove any row within a .txt file that contains "(null)"
If you wish to do that using AWK let file.txt content be
39|1411|XXYZ
40|1416|XXX
41|1420|(null)
then
awk '!index($0,"(null)")' file.txt
will output
39|1411|XXYZ
40|1416|XXX
Explanation: index returns the position of the first occurrence of the substring ((null) in this case), or 0 if it is not found. Negating the result gives true for 0 and false for anything else, and AWK prints each line for which the result is true.

Validating specific column in grep

Ok this is driving me crazy. I have a text file with the following content:
"1","2","3","4","text","2020-01-01","2020-12-13","4"
"1","2","3","4","text","2020-12-07","2020-12-03","22"
"1","2","3","4","text","2020-12-12","2020-04-11","21"
"1","2","3","4","text","2020-05-21","2020-03-23","453"
etc.
I want to filter lines on which the second date is in December. I tried things like:
grep '.*(\d{4}-\d{2}-\d{2}).*(2020-12-).*' > output.txt
grep '.*\d{4}-\d{2}-\d{2}.*2020-12-.*' > output.txt
grep -P '.*\d{4}-\d{2}-\d{2}.*2020-12-.*' > output.txt
But nothing seems to work. Is there any way to accomplish this with either grep, egrep, sed or awk?
You need to use -P option of grep to enable perl compatible regular expressions, could you please try following. Written and tested with your shown samples.
grep -P '("\d+",){4}"[a-zA-Z]+","2020-12-\d{2}"' Input_file
Explanation: Adding explanation for above, following is only for explanation purposes.
grep ##Starting grep command from here.
-P ##Mentioning -P option for enabling PCRE regex with grep.
'("\d+",){4} ##Looking for " digits " comma this combination 4 times here.
"[a-zA-Z]+", ##Then looking for " alphabets ", with this one.
"2020-12-\d{2}" ##Then looking for " 2020-12-07 date " which OP needs.
' Input_file ##Mentioning Input_file name here.
I suggest an alternate solution awk due to input data structured in rows and columns using a common delimiter:
awk -F, '$7 ~ /-12-/' file
"1","2","3","4","text","2020-01-01","2020-12-13","4"
"1","2","3","4","text","2020-12-07","2020-12-03","22"
Use either grep -P or egrep for short:
$ cat test.txt
"1","2","3","4","text","2020-01-01","2020-12-13","4"
"1","2","3","4","text","2020-12-07","2020-12-03","22"
"1","2","3","4","text","2020-12-12","2020-04-11","21"
"1","2","3","4","text","2020-05-21","2020-03-23","453"
$
$ grep -P '^"([^"]*","){6}2020-12-' test.txt
"1","2","3","4","text","2020-01-01","2020-12-13","4"
"1","2","3","4","text","2020-12-07","2020-12-03","22"
$
$ egrep '^"([^"]*","){6}2020-12-' test.txt
"1","2","3","4","text","2020-01-01","2020-12-13","4"
"1","2","3","4","text","2020-12-07","2020-12-03","22"
Explanation:
^" - expect a " to start
([^"]*","){6} - scan over all chars other than ", followed by ","; repeat that 6 times
2020-12- - expect 2020-12-
The problem is in:
egrep '.*\d{4}-\d{2}-\d{2}.2020-12-.' > output.txt
                          ^ HERE
The . just matches a single character, but you want to skip ",", so change to:
egrep '.*\d{4}-\d{2}-\d{2}.+2020-12-.' > output.txt
                          ^^ HERE
The . becomes a .+.
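The awk answer above tests for -12- anywhere in the seventh comma-separated field, quotes included. A slightly stricter variant splits on the three-character sequence "," so that the fields carry no inner quotes, then checks the month position directly. This is a sketch under the assumption that no field contains that sequence inside its text.

```shell
#!/bin/bash
file=$(mktemp)
cat > "$file" <<'EOF'
"1","2","3","4","text","2020-01-01","2020-12-13","4"
"1","2","3","4","text","2020-12-07","2020-12-03","22"
"1","2","3","4","text","2020-12-12","2020-04-11","21"
"1","2","3","4","text","2020-05-21","2020-03-23","453"
EOF

# With -F'","' the seventh field is the bare date, e.g. 2020-12-13.
# Characters 6-7 of a YYYY-MM-DD date are the month.
december=$(awk -F'","' 'substr($7, 6, 2) == "12"' "$file")
echo "$december"
rm -f "$file"
```

Only the first two sample rows are printed, since their second date falls in December.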

Remove hostnames from a single line that follow a pattern in bash script

I need to cat a file and edit a single line containing multiple domain names, removing any domain name that contains a certain 4-letter pattern, e.g. ozar.
This will be used in a bash script so the number of domain names can range, I will save this to a csv later on but right now returning a string is fine.
I tried multiple commands, loops, and if statements but sending the output to variable I can use further in the script proved to be another difficult task.
Example file
$ cat file.txt
ozarkzshared.com win.ad.win.edu win_fl.ozarkzsp.com ap.allk.org allk.org ozarkz.com website.com
What I attempted (that was close)
domains_1=$(cat /tmp/file.txt | sed 's/ozar*//g')
domains_2=$( cat /tmp/file.txt | printf '%s' "${string##*ozar}")
Goal
echo domain_x
win.ad.win.edu ap.allk.org allk.org website.com
If all the domains are on a single line separated by spaces, this might work:
awk '/ozar/ {next} 1' RS=" " file.txt
This sets RS, your record separator, then skips any record that matches the keyword. If you wanted to be able to skip a substring provided in a shell variable, you could do something like this:
$ s=ozar
$ awk -v re="$s" '$0 ~ re {next} 1' RS=" " file.txt
Note that the ~ operator is comparing a regular expression, not precisely a substring. You could leverage the index() function if you really want to check a substring:
$ awk -v s="$s" 'index($0,s) {next} 1' RS=" " file.txt
Note that all of the above is awk, which isn't what you asked for. If you'd like to do this with bash alone, the following might be for you:
while read -r -a a; do
for i in "${a[@]}"; do
[[ "$i" = *"$s"* ]] || echo "$i"
done
done < file.txt
This assigns each line of input to the array $a[], then steps through that array testing for a substring match and printing if there is none. Text processing in bash is MUCH less efficient than in a more specialized tool like awk or sed. YMMV.
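Since the goal in the question is a single space-separated string held in a variable, the bash loop can collect the surviving names into an array and join them. A minimal sketch, assuming the domains are all on the first line of the file; the names all, kept and domains are made up for illustration.

```shell
#!/bin/bash
file=$(mktemp)
echo 'ozarkzshared.com win.ad.win.edu win_fl.ozarkzsp.com ap.allk.org allk.org ozarkz.com website.com' > "$file"
s=ozar

# Split the first line of the file into an array of words.
read -r -a all < "$file"

# Keep only the words that do not contain the substring in $s.
kept=()
for d in "${all[@]}"; do
    [[ $d == *"$s"* ]] || kept+=("$d")
done

# "${kept[*]}" joins the elements with the first character of IFS (a space).
domains="${kept[*]}"
echo "$domains"
rm -f "$file"
```

The variable domains can then be reused later in the script, for example when writing the CSV.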
You want to delete each matching word up to the next space delimiter:
$ sed 's/ozar[^ ]*//g' file
win.ad.win.edu win_fl. ap.allk.org allk.org website.com

Extract all numbers from a text file and store them in another file

I have a text file that has lots of lines. I want to extract all the numbers from that file.
The file contains text and numbers, and each line contains only one number.
How can I do this using sed or awk in a bash script?
I tried
#! /bin/bash
sed 's/\([0-9.0-9]*\).*/\1/' <myfile.txt >output.txt
but this didn't work.
grep can handle this:
grep -Eo '[0-9\.]+' myfile.txt
-o tells to print only the matches and [0-9\.]+ is a regular expression to match numbers.
To put all numbers on one line and save them in output.txt:
echo $(grep -Eo '[0-9\.]+' myfile.txt) >output.txt
Text files should normally end with a newline character. The use of echo above ensures that this happens.
Non-GNU grep:
If your grep does not support the -o flag, try:
echo $(tr ' ' '\n' <myfile.txt | grep -E '[0-9\.]+') >output.txt
This uses tr to replace all spaces with newlines (so each number appears separately on a line) and then uses grep to search for numbers.
tr -sc '0-9.' ' ' < "$file"
Will transform every string of non-digit-or-period characters into a single space.
You can also use Bash:
while IFS= read -r line; do
if [[ $line =~ [0-9\.]+ ]]; then
echo "$BASH_REMATCH"
fi
done <myfile.txt >output.txt
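The patterns above accept any run of digits and periods, so a sentence-ending "." next to a number, or a string like 1.2.3, is matched as well. If that matters, a stricter pattern can require digits with at most one decimal part. A sketch with made-up sample lines:

```shell
#!/bin/bash
file=$(mktemp)
printf '%s\n' 'width is 12.5 cm.' 'count: 7' 'no digits here.' > "$file"

# One or more digits, optionally followed by a period and more digits.
numbers=$(grep -Eo '[0-9]+(\.[0-9]+)?' "$file")
echo "$numbers"
rm -f "$file"
```

With this pattern the trailing periods in the sample text are ignored and only 12.5 and 7 are printed.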