line and string position of grep match - regex

I need to find a way to output the exact coordinates of a grep match from one file to another. So say 'patterns' contains a list of string patterns to match. 'Search' is a line-based text (ASCII) file containing the text to search in.
with:
grep -onf patterns search
I get the line number and the pattern that matched on that line, but not where in the line the pattern matched, and that is what I need. The solution isn't restricted to grep; awk etc. is also fine!
Can you guys help?

Untested:
awk 'NR==FNR{strings[$0]; next} {for (string in strings) if ( (idx = index($0,string)) > 0 ) print string, FNR, idx }' file1 file2
Since you're using -f with grep I assume it's strings you want to match on, not regexps.
The above just builds an array of strings from the contents of the first file. Then, for each line of the second file, it looks up the index of each string on that line and, if the string occurs, prints the string, the line number and the index (starting position) of where that string first appears on that line.

Using awk you can do:
awk -v s="needle" 'i=index($0, s) {print NR, i}' file
This will print line # and line position of the searched item.
UPDATE:
while read -r line; do
awk -v s="$line" 'i=index($0, s) {print s ":" NR "," i}' searches
done < patterns
OR pure awk based:
awk 'FNR==NR{a[$0];next} {for (i in a) {if (p=index($0, i)) print i ":" NR "," p} }' patterns searches
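Note that index() only reports the first occurrence on a line. If every occurrence is wanted, one possible sketch is to keep searching in the remainder of the line. The sample files and the names tmp, pats, off and rest below are made up for illustration, and the patterns are treated as literal strings, not regexps:

```shell
# Hypothetical sample files, created only for this demonstration.
tmp=$(mktemp -d)
printf '%s\n' 'ab' 'cd' > "$tmp/patterns"
printf '%s\n' 'xxabxxab' 'cd ab' > "$tmp/search"

# Print pattern:line,column for every occurrence, not just the first.
out=$(awk '
  NR == FNR { pats[$0]; next }        # first file: collect the literal strings
  {
    for (p in pats) {
      if (p == "") continue           # an empty pattern would never advance
      off = 0                         # number of characters already skipped
      rest = $0
      while ((i = index(rest, p)) > 0) {
        print p ":" FNR "," (off + i)
        off += i
        rest = substr(rest, i + 1)    # continue just past the match start
      }
    }
  }' "$tmp/patterns" "$tmp/search" | sort)   # sort: for-in order is unspecified

printf '%s\n' "$out"
rm -rf "$tmp"
```

Because the search restarts one character after each match start, overlapping occurrences are reported as well.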

Related

get the last word in body of text

Given a body of text that can span a varying number of lines, I need to use a grep, sed or awk solution to search through many files for the same pattern and get the last word in the body.
A file can include formats such as these where the word I want can be named anything
call function1(input1,
input2, #comment
input3) #comment
returning randomname1,
randomname2,
success3

call function1(input1,
input2,
input3)
returning randomname3,
randomname2,
randomname3

call function1(input1,
input2,
input3)
returning anothername3,
randomname2, anothername3
I need to print out results as
success3
randomname3
anothername3
I also need the filename and line information for each match.
I've tried
pcregrep -M 'function1.*(\s*.*){6}(\w+)$' filename.txt
which is too greedy and I still need to print out just the specific grouped value and not the whole pattern. The words function1 and returning in my sample code will always be named as this and can be hard coded within my expression.
Last word of code blocks
Split file in blocks using awk's record separator RS. A record will be defined as a block of text, records are separated by double newlines.
A record consists of fields, each two consecutive fields are separated by white space or a single newline.
Now all we have to do is print the last field for each record, resulting in following code:
awk 'BEGIN{ FS="[\n\t ]"; RS="\n\n"} { print $NF }' file
Explanation:
FS this is the field separator and is set to either a newline, a tab or a space: [\n\t ].
RS this is the record separator and is set to a double newline: \n\n
print $NF this will print the field $ with index NF, which is a variable containing the number of fields. Hence this prints the last field.
Note: To capture all paragraphs the file should end in double newline, this can easily be achieved by pre processing the file using: $ echo -e '\n\n' >> file.
Alternate solution based on comments
A more elegant and simple solution is as follows:
awk -v RS='' '{ print $NF }' file
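With RS set to the empty string, awk reads blank-line-separated paragraphs as records and always treats newlines as field separators, so no trailing blank lines are needed. As a rough sketch, the paragraph-mode one-liner can also be restricted to blocks that mention both function1 and returning. The sample file below is invented for the demonstration:

```shell
# Hypothetical sample input: two matching blocks and one unrelated paragraph.
tmp=$(mktemp -d)
printf '%s\n' \
  'call function1(input1,' \
  'input2) #comment' \
  'returning randomname1,' \
  'success3' \
  '' \
  'unrelated text' \
  'lastword' \
  '' \
  'call function1(a)' \
  'returning x, anothername3' > "$tmp/sample.txt"

# Paragraph mode: each blank-line-separated block is one record.
out=$(awk -v RS='' '/function1/ && /returning/ { print $NF }' "$tmp/sample.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The unrelated paragraph is skipped because it matches neither keyword.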
How about the following awk solution:
awk 'NF == 0 {if(last) print last; last=""} NF > 0 {last=$NF} END {print last}' file
Here $NF is the last "word" on the line, where NF stands for the number of fields. The last variable always holds the last word of the most recent non-empty line, and it is printed when an empty line, marking the end of a paragraph, is encountered.
New version that only prints paragraphs containing function1:
awk 'NF == 0 {if(last && hasF) print last; last=hasF=""}
NF > 0 {last=$NF; if(/function1/)hasF=1}
END {if(hasF) print last}' filename.txt
This will produce the output you show from the input file you posted:
$ awk -v RS= '{print $NF}' file
success3
randomname3
anothername3
If you want to print FILENAME and line number like you mention then this may be what you want:
$ cat tst.awk
NF { nr=NR; last=$NF; next }
{ prt() }
END { prt() }
function prt() { if (nr) print FILENAME, nr, last; nr=0 }
$ awk -f tst.awk file
file 6 success3
file 13 randomname3
file 20 anothername3
If that doesn't do what you want, edit your question to provide clearer, more truly representative and accurate sample input and expected output.
This is the perl version of Shellfish's awk solution (plus the keywords):
perl -00 -nE '/function1/ and /returning/ and say ((split)[-1])' file
or, with one regex:
perl -00 -nE '/^(?=.*function1)(?=.*returning).*?(\S+)\s*$/s and say $1' file
But the key is the -00 option which reads the file a paragraph at a time.

Regex match as many of strings as possible

I don't know if this is possible or makes sense, but what I'm trying to do is grep or awk a file matching for multiple strings, but only showing the match that matches the most strings.
So I would have a file like:
cat,dog,apple,bark,chair
apple,chair,wall
cat,wall
phone,key,bark,nut
cat,dog,key
phone,dog,key
table,key,chair
I want to match a single line that includes the most of these strings: cat|dog|table|key|wall. Not necessarily having to include all of them, but whatever line matches the most, print it.
So for example, I would want it to display this output:
cat,dog,key
Since it is the line that includes most of the strings that are being searched for.
I've tried using:
cat filename \
|egrep -iE 'cat' \
|egrep -iE 'dog' \
|egrep -iE 'table' \
|egrep -iE 'key' \
|egrep -iE 'wall'
But it will only display lines that show ALL strings, I have also tried:
egrep -iE 'cat|dog|table|key|wall' filename
But that shows any line that matches any one of those strings.
Is regex capable of doing something like this?
Use awk, and increment a counter for each word that matches. If the counter is higher than the highest count, save this line.
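awk 'BEGIN {max = 0}
{ count=0;
# match whole comma-separated words; in awk regexes \b is a backspace, not a word boundary
if (/(^|,)cat(,|$)/) count++;
if (/(^|,)dog(,|$)/) count++;
...
if (count > max) { saved = $0; max = count; }
}
END { print saved; }' filename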
$ awk -F, -v r='^(cat|dog|table|key|wall)$' '{c=0;for (i=1;i<=NF;i++)if ($i~r)c++; if (c>max){max=c;most=$0}} END{print most}' file
cat,dog,key
How it works
-F,
This sets the field separator to a comma.
-v r='^(cat|dog|table|key|wall)$'
This sets the variable r to a regex matching your words of interest. The regex begins with ^ and ends with $. This assures that only whole words are matched.
c=0;for (i=1;i<=NF;i++)if ($i~r)c++
This sets the variable c to the number of matches on the current line.
if (c>max){max=c;most=$0}
If the number of matches on the current line, c, exceeds the previous maximum, max, then update max and set most to the current line.
END{print most}
When we are done reading the file, print the line with the most matches.
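As a variation on the same idea, the words can be kept in an array and compared exactly against each field, which avoids building a regex at all. This is only a sketch: the variable names words, want, max and most are illustrative, and the sample file reproduces the data from the question:

```shell
# Sample data from the question, written to a temporary file.
tmp=$(mktemp -d)
printf '%s\n' \
  'cat,dog,apple,bark,chair' \
  'apple,chair,wall' \
  'cat,wall' \
  'phone,key,bark,nut' \
  'cat,dog,key' \
  'phone,dog,key' \
  'table,key,chair' > "$tmp/file"

out=$(awk -F, -v words='cat,dog,table,key,wall' '
  BEGIN {
    n = split(words, w, ",")
    for (k = 1; k <= n; k++) want[w[k]]     # set of wanted words
  }
  {
    c = 0
    for (i = 1; i <= NF; i++) if ($i in want) c++   # exact field match
    if (c > max) { max = c; most = $0 }             # first line wins a tie
  }
  END { print max, most }' "$tmp/file")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Here the match count is printed in front of the winning line so the result is easier to check.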
To make the problem more interesting I created two input files:
InFile1 ...
cat|dog|table|key|wall
InFile2 ...
cat,dog,apple,bark,chair
apple,chair,wall
cat,wall phone,key,bark,nut
cat,dog,key
phone,dog,key
table,key,chair
Note that InFile2 differs from the original post
in that it contains two lines each with three matches.
Hence, there is a "tie" for first place and both are
reported.
This code ...
awk -F, '{if (NR==FNR) r=$0; else {count=0
for (j=1;j<=NF;j++) if ($j ~ r) count++
a[FNR]=count" matching words in "$0
if (max<count) max=count}}
END{for (j=1;j<=FNR;j++) if (1==index(a[j],max)) print a[j]}' \
$InFile1 $InFile2 >$OutFile
... produced this OutFile ...
3 matching words in cat,dog,key
3 matching words in table,key,dog,banana
Daniel B. Martin

regex - match exactly to a string portion in awk

I have a file where one column contains strings that are composed of characters separated by ,
example:
a123456, a54321, a12312
I need to find lines that contain a specific number in the comma separated list.
example: I want to find all lines that contain only a12345.
I tried to use the following:
awk ' $1~/a12345/ {print}'
but this prints out the line containing:
a123456, a54321, a12312
because the regex is matching the first 6 characters in a123456, I guess.
My question is, how can I make a regex that will only print out the lines that contain an exact match?
$ awk '/(^|[^[:alnum:]])a12345([^[:alnum:]]|$)/' file
$ awk '/(^|[^[:alnum:]])a123456([^[:alnum:]]|$)/' file
a123456, a54321, a12312
With GNU awk you could use word-delimiters:
$ awk '/\<a12345\>/' file
$ awk '/\<a123456\>/' file
a123456, a54321, a12312
Try using word match of grep like below:
grep -w a123456 myfile.txt
If the match has to be at the start of the line, use something like:
egrep -w ^a123456 myfile.txt
With awk:
awk -F ',\\s*' '$1 == "a12345"' filename
This splits the line along commas (optionally followed by whitespace) and selects only those lines whose first field is exactly "a12345". Note that \s is a GNU awk extension. This will work even if the field contains characters after "a12345" that count as a word boundary, which is to say that
a12345.foo, bar, baz
is filtered out.
If more than a single field is to be tested, then you'll have to test all fields:
awk -F ',\\s*' 'function check() { for(i = 1; i <= NF; ++i) { if($i == "a12345") return 1; } return 0 } check()' filename
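The any-field test can also be written without a helper function and without \s, which may help with awks other than gawk. A minimal sketch, assuming fields are separated by a comma followed by optional spaces; the sample lines and the variable name target are made up:

```shell
# Hypothetical sample lines; only the second contains the exact field a12345.
tmp=$(mktemp -d)
printf '%s\n' \
  'a123456, a54321, a12312' \
  'a99999, a12345, a00001' \
  'a12345.foo, bar, baz' > "$tmp/file"

out=$(awk -F ', *' -v target='a12345' '
  {
    for (i = 1; i <= NF; i++)
      if ($i == target) { print; next }   # print the line once, then move on
  }' "$tmp/file")

printf '%s\n' "$out"
rm -rf "$tmp"
```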

Remove \n newline if string contains keyword

I'd like to know if I can remove a \n (newline) only if the current line has one or more keywords from a list; for instance, I want to remove the \n if it contains the words hello or world.
Example:
this is an original
file with lines
containing words like hello
and world
this is the end of the file
And the result would be:
this is an original
file with lines
containing words like hello and world this is the end of the file
I'd like to use sed, or awk and, if needed, grep, wc or whatever commands work for this purpose. I want to be able to do this on a lot of files.
Using awk you can do:
awk '/hello|world/{printf "%s ", $0; next} 1' file
this is an original
file with lines
containing words like hello and world this is the end of the file
Here is a simple one using sed:
sed -r ':a;$!{N;ba};s/((hello|world)[^\n]*)\n/\1 /g' file
Explanation
:a;$!{N;ba} reads the whole file into the pattern space, like this: this is an original\nfile with lines\ncontaining words like hello\nand world\nthis is the end of the file$
s/((hello|world)[^\n]*)\n/\1 /g searches for the keywords hello or world and replaces the next \n with a space.
The g flag of sed's substitute command applies the replacement to all matches of the regexp, not just the first.
A non-regex approach:
awk '
BEGIN {
# define the word list
w["hello"]
w["world"]
}
{
printf "%s", $0
for (i=1; i<=NF; i++)
if ($i in w) {
printf " "
next
}
print ""
}
'
or a perl one-liner
perl -pe 'BEGIN {@w = qw(hello world)} s/\n/ / if grep {$_ ~~ @w} split'
To edit the file in-place, do:
awk '...' filename > tmpfile && mv tmpfile filename
perl -i -pe '...' filename
This might work for you (GNU sed):
sed -r ':a;/^.*(hello|world).*\'\''/M{$bb;N;ba};:b;s/\n/ /g' file
This checks if the last line, of a possible multi-line, contains the required string(s) and if so reads another line until end-of-file or such that the last line does not contain the/those string(s). Newlines are removed and the line printed.
$ awk '{ORS=(/hello|world/?FS:RS)}1' file
this is an original
file with lines
containing words like hello and world this is the end of the file
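One edge case with the awk one-liners: if the very last input line contains a keyword, the output ends with a space and no final newline. A possible sketch that always terminates the output, using a hypothetical flag named pending and a made-up sample file whose last line contains world:

```shell
# Hypothetical sample file whose last line contains a keyword.
tmp=$(mktemp -d)
printf '%s\n' \
  'this is an original' \
  'file with lines' \
  'containing words like hello' \
  'and world' > "$tmp/file"

out=$(awk '
  {
    printf "%s", $0
    if (/hello|world/) { printf " "; pending = 1 }   # join with the next line
    else               { print "";  pending = 0 }    # normal line ending
  }
  END { if (pending) print "" }                      # close a dangling line
' "$tmp/file")

printf '%s\n' "$out"
rm -rf "$tmp"
```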
sed -n '
:beg
/hello/ b keep
/world/ b keep
H;s/.*//;x;s/\n/ /g;p;b
: keep
H;s/.*//
$ b beg
' YourFile
This is a bit harder because the current line has to be checked while a previous hello or world line may already be held.
Principle:
On every pattern match, keep the line in the hold buffer.
Otherwise, load the hold buffer, replace each \n with a space and print the content (the swap and the emptying of the current line are needed because of the limited buffer operations available).
A pattern match on the last line is a special case (the line would normally be held and so never printed).

how to replace the next string after match (every) two blank lines?

is there a way to do this kind of substitution in Awk, sed, ...?
I have a text file with sections separated by two blank lines:
section1_name_x
dklfjsdklfjsldfjsl


section2_name_x
dlskfjsdklfjsldkjflkj


section_name_X
dfsdjfksdfsdf
I would like to replace every "section_name_x" with "#section_name_x". That is, how can I modify the first string that follows every pair of blank lines?
Thanks,
Steve,
awk '
(NR==1 || blank==2) && $1 ~ /^section/ {sub(/section/, "#&")}
{
print
if (length)
blank = 0
else
blank ++
}
' file
#section1_name_x
dklfjsdklfjsldfjsl


#section2_name_x
dlskfjsdklfjsldkjflkj


#section_name_X
dfsdjfksdfsdf
hm....
Given your example data why not just
sed 's/^section[0-9]*_name.*/#&/' file > newFile && mv newFile file
some seds support sed -i OR sed -i"" to overwrite the existing file, avoiding the && mv ... shown above.
The reg ex says, section must be at the beginning of the line, and can optionally contain a number or NO number at all.
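In the replacement part of a sed s command, & stands for the text that was matched, so a # can be put in front of the line without retyping it. A quick check on made-up sample data (the temporary file and its contents are illustrative only):

```shell
# Hypothetical sample file with two sections separated by two blank lines.
tmp=$(mktemp -d)
printf '%s\n' \
  'section1_name_x' \
  'dklfjsdklfjsldfjsl' \
  '' \
  '' \
  'section_name_X' \
  'dfsdjfksdfsdf' > "$tmp/file"

# & in the replacement is the matched text, so the match is kept after the "#".
out=$(sed 's/^section[0-9]*_name/#&/' "$tmp/file")

printf '%s\n' "$out"
rm -rf "$tmp"
```

This relies only on the section lines starting with "section", not on counting blank lines, so it fits the sample data but not necessarily every file.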
IHTH
In gawk you can use the RT builtin variable:
gawk '{$1="#"$1; print $0 RT}' RS='\n\n' file
* Update *
Thanks to @EdMorton I realized that my first version was incorrect.
What happens:
Assigning to $1 causes the record to be rebuilt, which is not good in this case since any sequence of white space is replaced by a single space between fields, and by the null string at the beginning and at the end of the record.
Using print adds an additional newline to the output.
The correct version:
gawk '{printf "%s", "#" $0 RT}' RS='\n\n\n' file