BASH: Search a string and display the exact number of times a substring occurs inside it, including overlapping matches - regex

I've searched all over and still can't find the answer to this simple question. I'm sure it's easy. Please help if you know how to accomplish this.
sample.txt is:
AAAAA
I want to count exactly how many times the combination "AAA" occurs. If you just use, for example,
grep -o 'AAA' sample.txt | wc -l
We receive a 1. This is the same as counting the number of times AAA appears with a standard text editor's search box, which treats each hit as a block and continues searching after it. However, I want the complete number of matches, starting from each individual character, which is exactly 3: in AAAAA, a match of AAA begins at positions 1, 2 and 3.
In other words, I am looking for the maximum possible, literally exact number of occurrences of "AAA" in sample.txt, counted starting from every individual character, not just the non-overlapping blocks a normal text editor search finds.
How do we accomplish this, preferably in AWK? SED, GREP and anything else I can include in a Bash script is fine as well.

This might work for you (GNU sed & wc):
sed -r 's/^[^A]*(AA?[^A]+)*AAA/AAA\nAA/;/^AAA/P;D' sample.txt | wc -l
Lose any characters other than A's, and single or double A's. Then print a triple A, lose the first A, and repeat. Finally, count the number of lines printed.
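A quick check (assuming sample.txt contains the single line AAAAA from the question); each substitution prints one AAA line, so wc -l yields the overlapping count:
$ printf 'AAAAA\n' > sample.txt
$ sed -r 's/^[^A]*(AA?[^A]+)*AAA/AAA\nAA/;/^AAA/P;D' sample.txt | wc -l
3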

This isn't a trivial problem in bash. As far as I know, standard utils don't support this kind of overlapping search. You can, however, use standard Bash features to implement the behavior yourself. Here's how I would attack the problem, but there are other ways:
#!/bin/bash
search_term="AAA"
text=$(cat sample.txt)
term_len=${#search_term}
occurrences=0
# While the remaining text is at least as long as the search term
while [ "${#text}" -ge "$term_len" ]; do
    # Look at just the first $term_len characters
    text_substr=${text:0:${term_len}}
    # If we see the search term, increment occurrences
    if [ "$text_substr" = "$search_term" ]; then
        ((occurrences++))
    fi
    # Remove the first character from the main text
    # (e.g. "AAAAA" becomes "AAAA")
    text=${text:1}
done
printf "%d occurrences of %s\n" "$occurrences" "$search_term"

This is the awk version:
echo "AAAAA AAA AAAABBAAA" \
| gawk -v pat="AAA" '{
    for (i = 1; i <= NF; i++) {
        # current field length
        m = length($i)
        # search pattern length
        n = length(pat)
        # slide a window of length n over the field, stopping at the
        # last position where a full-length match can still start
        for (l = 1; l <= m - n + 1; l++) {
            sstr = substr($i, l, n)
            # print i " " $i " sub:" sstr
            # substring matches pattern
            if (sstr ~ pat) {
                count++
            } else {
                print "contiguous count on field " i " = " count
                # uncomment next line if non-contiguous matches are not needed
                # break
            }
        }
        print "total count on field " i " = " count
        count = 0
    }
}'

I posted this on another of the OP's posts, but it was ignored, maybe because I did not add notes and an explanation. It is just a different approach, and any discussion is welcome.
$ awk -v sample="$(<sample.txt)" '{ x = sample; n = 0 } $0 != "" {
    while (t = index(x, $0)) { n++; x = substr(x, t+1) }
    print $0, n
}' combinations
Explanation:
The variables:
sample: the raw sample text, slurped in from the file sample.txt with the -v argument
x: the target string; before each test, its value is reset to sample
$0: the test string, fed in one line at a time from the file combinations
n: the counter, i.e. the number of occurrences of the test string ($0)
t: the position of the first character of the matched test string ($0) in the target string (x)
Update: Added $0 != "" before the main while loop to skip EMPTY test strings, which lead to an infinite loop.
The code:
awk -v sample="$(<sample.txt)" '
# reset the target string (with the sample text) and the counter "n"
{ x = sample; n = 0 }
# below is the main block, where $0 != "" skips an EMPTY test string
($0 != "") {
    # the function index(x, $0) returns the position (assigned to "t") of the first
    # character of the matched test string ($0) in the target string (x).
    # when no match is found, it returns zero, and we thus step out of the while loop.
    while (t = index(x, $0)) {
        n++                 # increment the number of matches
        x = substr(x, t+1)  # drop everything up to and including position t, so the
                            # next search starts one character past the current match
    }
    print $0, n             # print the test string and the count
}
' combinations
awk's index() is much faster than regex matching, and it avoids the expensive brute-force, window-by-window string comparisons of the other approaches. Attached are the tested sample.txt and combinations files:
$ more sample.txt
AAAAAHHHAAHH
HAAAAHHHAAHH
AAHH
$ more combinations
AA
HH
AAA
HHH
AAH
HHA
ZK
Tested Environment: GNU Awk 4.0.2, Centos 7.3
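For reference, running the one-liner against these two files should print the following overlapping counts (hand-checked; each test string is counted across the whole slurped sample):
$ awk -v sample="$(<sample.txt)" '{ x = sample; n = 0 } $0 != "" {
    while (t = index(x, $0)) { n++; x = substr(x, t+1) }
    print $0, n
}' combinations
AA 10
HH 7
AAA 5
HHH 2
AAH 5
HHA 2
ZK 0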

Related

Using grep and regex to extract words from a file that contain only one kind of vowel

I have a large dictionary file that contains one word per line.
I want to extract all lines that contain only one kind of vowel, so "see" and "best" and "levee" and "whenever" would be extracted, but "like" or "house" or "and" wouldn't. It's fine for me to go over the file a few times, changing the vowel I'm looking for each time.
This command: grep -io '\b[eqwrtzpsdfghjklyxcvbnm]*\b' dictionary.txt
returns words containing no vowels other than E, but it also gives me words like BBC or BMW that contain no vowel at all. How can I make the contained vowel a requirement?
How about
grep -i '^[^aiou]*e[^aiou]*$'
?
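Applied to the sample words from the question, this keeps exactly the words whose only vowel is "e" while requiring at least one "e" (a quick check):
$ printf '%s\n' see best levee whenever like house and BBC BMW | grep -i '^[^aiou]*e[^aiou]*$'
see
best
levee
whenever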
Here is an Awk attempt which collects all the hits in a single pass over the input file, then prints each bucket.
awk 'BEGIN { split("a:e:i:o:u", vowel, ":")
             c = "[b-df-hj-np-tv-z]"
             for (v in vowel)
                 regex = (regex ? regex "|" : "") "^" c "*" vowel[v] c "*(" vowel[v] c "*)*$" }
     $0 ~ regex { for (v in vowel) if ($0 ~ vowel[v]) {
         hit[v] = (hit[v] ? hit[v] ORS : "") $0
         next } }
     END { for (v in vowel) {
         printf "=== %s ===\n", vowel[v]
         print hit[v] } }' /usr/share/dict/words
You'll notice that it prints words with syllabic y like jolly and cycle. A more complex regex should fix that, though the really thorny cases (like rhyme) need a more sophisticated model of English orthography.
The regex is clumsy because Awk does not support backreferences; an earlier version of this answer contained a simpler regex which would work with grep -E or similar, but it then collected all matches in the same bucket.
Demo: https://ideone.com/wNrvPu
Using -P (perl) option:
^(?=.*e)[^aiou]+$
Explanation:
^          # beginning of line
(?=.*e)    # positive lookahead, make sure there is at least one "e"
[^aiou]+   # 1 or more of any character that is not a vowel
$          # end of line
cat file.txt
see
best
levee
whenever
like
house
and
BBC
BMW
grep -P '^(?=.*e)[^aiou]+$' file.txt
see
best
levee
whenever

Regex: find elements regardless of order

If I have the string:
geo:FR, host:www.example.com
(In reality the string is more complicated and has more fields.)
I want to extract the "geo" value and the "host" value, but I am facing a problem when the order of the keys changes, as in the following:
host:www.example.com, geo:FR
I tried this line:
sed 's/.*geo:\([^ ]*\).*host:\([^ ]*\).*/\1,\2/'
But it only works on the first string.
Is there a way to do it in a single regex, and if not, what's the best approach?
I suggest extracting each text you need with a separate sed command:
s="geo:FR, host:www.example.com"
host="$(sed -n 's/.*host:\([^[:space:],]*\).*/\1/p' <<< "$s")"
geo="$(sed -n 's/.*geo:\([^[:space:],]*\).*/\1/p' <<< "$s")"
See the online demo, echo "$host and $geo" prints
www.example.com and FR
for both inputs.
Details
-n suppresses line output and p prints the matches
.* - matches any 0+ chars up to the last...
host: - host: substring and then
\([^[:space:],]*\) - captures into Group 1 any 0 or more chars other than whitespace and a comma
.* - the rest of the line.
The result is just the contents of Group 1 (see \1 in the replacement pattern).
Whenever you have tag/name-to-value pairs in your input, I find it best (clearest, simplest, most robust, easiest to enhance, etc.) to first create an array that contains that mapping (f[] below); then you can simply access the values by their tags:
$ cat file
geo:FR, host:www.example.com
host:www.example.com, geo:FR
foo:bar, host:www.example.com, stuff:nonsense, badgeo:uhoh, geo:FR, nastygeo:wahwahwah
$ cat tst.awk
BEGIN { FS=":|, *"; OFS="," }
{
    for (i=1; i<=NF; i+=2) {
        f[$i] = $(i+1)
    }
    print f["geo"], f["host"]
}
$ awk -f tst.awk file
FR,www.example.com
FR,www.example.com
FR,www.example.com
The above will work using any awk in any shell on every UNIX box.
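Because the mapping is built before anything is printed, enhancing the script really is just another lookup. For instance (a hypothetical extension), printing the stuff value as a third column only requires changing the print statement:
print f["geo"], f["host"], f["stuff"]
On lines that have no stuff tag, the last column simply comes out empty.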
Here I've used GNU Awk to convert your delimited key:value pairs to valid shell assignments. With Bash, you can load these assignments into your current shell using <(process substitution):
# source the file descriptor generated by proc sub
. < <(
    # use comma-space as field separator, literal apostrophe as variable q
    awk -F', ' -v q=\' '
        # change every foo:bar in the line to foo='bar' on its own line
        { for (f=1; f<=NF; f++) print gensub(/:(.*)/, "=" q "\\1" q, 1, $f) }
    # use here-string to load text; remove everything but the first quote to use standard input
    ' <<< 'host:www.example.com, geo:FR'
)
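After sourcing, the assignments are live in the current shell; a quick check (continuing the example above):
$ echo "$host, $geo"
www.example.com, FR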

Search text for multiple lines matching string 1 which are not separated by string 2

I've got a file looking like this:
abc|100|test|line|with|multiple|information|||in|different||fields
abc|100|another|test|line|with|multiple|information|in||different|fields|
abc|110|different|looking|line|with|some|supplementary|information
abc|100|test|line|with|multiple|information|||in|different||fields
abc|110|different|looking|line|with|some|other|supplementary|information
abc|110|different|looking|line|with|additional||information
abc|100|another|test|line|with|multiple|information|in||different|fields|
abc|110|different|looking|line|with|supplementary|information
I'm looking for a regexp to use with sed / awk / (e)grep (it actually doesn't matter to me which of these, as any would be fine) to find the following in the above-mentioned text:
abc|100|test|line|with|multiple|information|||in|different||fields
abc|110|different|looking|line|with|some|other|supplementary|information
abc|110|different|looking|line|with|additional||information
I want to get back a |100| line if it is followed by at least two |110| lines before another |100| line appears. The result should contain the initial |100| line together with all |110| lines that follow but not the following |100| line.
sed -ne '/|100|/,/|110|/p'
provides me a list of all |100| lines which are followed by at least one |110| line. But it doesn't check whether the |110| line has been repeated more than once, so I get back results I'm not looking for.
sed -ne '/|100|/,/|100|/p'
returns a list of all |100| lines together with the content between them, up to and including the next |100| line.
Trying to find lines between search patterns has always been a nightmare to me. I have spent hours of trial and error on similar problems which finally worked, but I never really understood why. I hope someone might be so kind as to save me the headache this time and maybe explain how the pattern does its work. I'm quite sure I'll face this kind of problem again, and then I could finally help myself.
Thank you for any help on this one!
Regards
Manuel
I'd do this in awk.
awk -F'|' '$2==100&&c>2{print b} $2==100{c=1;b=$0;next} $2==110&&c{c++;b=b RS $0;next} {c=0;b=""}' file
Broken out for easier reading:
awk -F'|' '
# If we're starting a new section and conditions have been met, print buffer
$2==100 && c>2 {print b}
# Start a section with a new count and a new buffer...
$2==100 {c=1;b=$0;next}
# Add to buffer
$2==110 && c {c++;b=b RS $0;next}
# Finally, zero everything if we encounter lines that don't fit the pattern
{c=0;b=""}
' file
Rather than using a regex, this steps through the file using the field delimiters you've specified. Upon seeing the "start" condition, it begins keeping a buffer. As subsequent lines match your "continue" condition, the buffer grows. Once we see the start of a new section, we print the buffer if the counter is big enough.
Works for me on your sample data.
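For reference, here is the run on the posted sample, which produces exactly the requested section:
$ awk -F'|' '$2==100&&c>2{print b} $2==100{c=1;b=$0;next} $2==110&&c{c++;b=b RS $0;next} {c=0;b=""}' file
abc|100|test|line|with|multiple|information|||in|different||fields
abc|110|different|looking|line|with|some|other|supplementary|information
abc|110|different|looking|line|with|additional||information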
Here's a GNU awk specific answer: use |100| as the record separator, |110| as the field separator, and look for records with at least 3 fields.
gawk '
BEGIN {
    # a newline, the first pipe-delimited column, then the "100" value
    RS = "(\n[^|]+[|]100[|])"
    FS = "[|]110[|]"
}
NF >= 3 { print RT $0 }  # RT is the actual text matching the RS pattern
' file
In AWK, the field separator is set to the pipe character and the second field is compared to 100 and 110 on each line. $0 represents the current line from the input file.
BEGIN { FS = "|" }
{
if($2 == 100) {
one_hundred = 1;
one_hundred_one = 0;
var0 = $0
}
if($2 == 110) {
one_hundred_one += 1;
if(one_hundred_one == 1 && one_hundred = 1) var1 = $0;
if(one_hundred_one == 2 && one_hundred = 1) var2 = $0;
}
if(one_hundred == 1 && one_hundred_one == 2) {
print var0
print var1
print var2
}
}
awk -f foo.awk input.txt
abc|100|test|line|with|multiple|information|||in|different||fields
abc|110|different|looking|line|with|some|other|supplementary|information
abc|110|different|looking|line|with|additional||information

Using awk to find a domain name containing the longest repeated word

For example, let's say there is a file called domains.csv with the following:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
I'm trying to use linux awk regex expressions to find the line that contains the longest repeated¹ word, so in this case it will return the line
5,letswelcomewelcomeyou.org
How do I do that?
¹ Meaning "immediately repeated", i.e., abcabc, but not abcXabc.
A pure awk implementation would be rather long-winded as awk regexes don't have backreferences, the usage of which simplifies the approach quite a bit.
I've added one line to the example input file for the case of multiple longest words:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
And this gets the lines with the longest repeated sequence:
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile
Since this is pretty anti-obvious, let's split up what this does and look at the output at each stage:
Remove the first column with the line number, to avoid matches for line numbers with repeating digits:
$ cut -d ',' -f 2 infile
helloguys.ca
byegirls.com
hellohelloboys.ca
hellobyebyedad.com
letswelcomewelcomeyou.org
letscomewelcomewelyou.org
Get all lines with a repeated sequence, extract just that repeated sequence:
... | grep -Eo '(.*)\1'
ll
hellohello
ll
byebye
welcomewelcome
comewelcomewel
Get the length of each of those lines:
... | awk '{ print length(), $0 }'
2 ll
10 hellohello
2 ll
6 byebye
14 welcomewelcome
14 comewelcomewel
Sort by the first column, numerically, descending:
...| sort -k 1,1 -nr
14 welcomewelcome
14 comewelcomewel
10 hellohello
6 byebye
2 ll
2 ll
Print the second of these columns for all lines where the first column (the length) has the same value as on the first line:
... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
welcomewelcome
comewelcomewel
Pipe this into grep, using the -f - argument to read stdin as a file:
... | grep -f - infile
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
Limitations
While this can handle the bbwelcomewelcome case mentioned in comments, it will trip on overlapping patterns such as welwelcomewelcome, where it only finds welwel, but not welcomewelcome.
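The failure is easy to reproduce in isolation (hypothetical input):
$ echo 'welwelcomewelcome' | grep -Eo '(.*)\1'
welwel
The leftmost match welwel wins, and grep continues searching only after it, so the longer welcomewelcome is never reported.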
Alternative solution with more awk, less sort
As pointed out by tripleee in comments, this can be simplified by folding the two awk steps and the sort into a single awk step, likely improving performance:
$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile
Let's look at that awk step in more detail, with expanded variable names for clarity:
{
    # New longest match: throw away stored longest matches, reset index
    if (length() > max_len) {
        max_len = length()
        delete arr_longest
        idx = 1
    }
    # Add line to longest matches
    if (length() >= max_len)
        arr_longest[idx++] = $0
}
# Print all the longest matches
END {
    for (idx in arr_longest)
        print arr_longest[idx]
}
Benchmarking
I've timed the two solutions on the top one million domains file mentioned in the comments:
First solution (with sort and two awk steps):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.742s
user 1m57.873s
sys 0m0.045s
Second solution (just one awk step, no sort):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.603s
user 1m56.514s
sys 0m0.045s
And the Perl solution by Casimir et Hippolyte:
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 0m5.249s
user 0m5.234s
sys 0m0.000s
What we learn from this: ask for a Perl solution next time ;)
Interestingly, if we know that there will be just one longest match and simplify the commands accordingly (just head -1 instead of the second awk command in the first solution, or no tracking of multiple longest matches with awk in the second solution), the time gained is only in the range of a few seconds.
Portability remark
Apparently, BSD grep can't do grep -f - to read from stdin. In this case, the output of the pipeline up to that point has to be redirected to a temp file, and this temp file then used with grep -f.
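A sketch of that workaround, reusing the second solution's pipeline (the temp file name is arbitrary):
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
     END{for (i in a){print a[i]}}' > matches.tmp
grep -f matches.tmp infile
rm matches.tmp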
A way with perl:
perl -F, -ane 'if (@m = $F[1] =~ /(?=(.+)\1)/g) {
    @m = sort { length $b <=> length $a } @m;
    $cl = length $m[0];
    if ($l < $cl) { @res = ($_); $l = $cl; } elsif ($l == $cl) { push @res, $_; }
}
END { print @res; }' file
The idea is to find all longest overlapping repeated strings for each position in the second field; the match array is then sorted so that the longest substring becomes the first item in the array ($m[0]).
Once done, the length of the current repeated substring ($cl) is compared with the stored length (of the previous longest substring). When the current repeated substring is longer than the stored length, the result array is overwritten with the current line; when the lengths are the same, the current line is pushed into the result array.
details:
command line options:
-F, sets the field separator to ,
-ane (e: execute the following code; n: read a line at a time and put its content in $_; a: autosplit, using the defined FS, and put the fields in the @F array)
The pattern:
/
(?=         # open a lookahead assertion
    (.+)\1  # capture group 1 and a backreference to group 1
)           # close the lookahead
/g          # all occurrences
This is a well-known pattern for finding all overlapping results in a string. The idea is to use the fact that a lookahead doesn't consume characters (a lookahead only means "check if this subpattern follows at the current position"; it doesn't match any character). To obtain the characters matched inside the lookahead, all you need is a capture group.
Since a lookahead matches nothing, the pattern is tested at each position (and doesn't care whether the characters have already been captured in group 1 before).
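A quick illustration of the trick in isolation (a made-up one-liner): at each position, the engine captures the longest immediately repeated substring starting there, without consuming anything:
$ perl -le '@m = ("hellohelloboys" =~ /(?=(.+)\1)/g); print "@m"'
hello l l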

Linux grep command for words beginning with ? character

I'm struggling a bit with a grep command in an assignment.
I need to find every word starting with an 'a' in a document and then have word count determine how many there are. Since some words start with capital letters, I've done a tr 'A-Z' 'a-z'. I can easily get grep to find all the 'a' letters in the document, and also lines starting with an 'a'. But for some reason I can't get it to grep words that start with an 'a'.
Hope you can help me.
THX everybody, this helped me out a lot.
It is quite hard to understand Linux IMO, but I'll get there eventually.
Again, thanks for all the help, much appreciated.
You should be able to do
grep -Eow "[Aa]\w+" | wc -l
Which says match only whole words (-w) that begin with an "a" ([Aa]) followed by 1 or more word characters (\w+).
The -o option prints only the matched output.
Example
echo " Aest test aest test" | grep -Eow "[Aa]\w+" | wc -l # returns 2
If you're using GNU awk, you can change the record separator to any run of whitespace (so each word becomes a record) and keep a count:
awk -v RS='\\s+' '/^[Aa]/ { ++count } END { print count + 0 }' file
The + 0 just makes the output a bit clearer in case there are no matches (it prints 0 rather than an empty string). More correct would be if (NR) print count + 0, so no input => no output, but you might consider that overkill.
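A quick check on a made-up file:
$ printf 'An apple and a banana\nnothing here\n' > file
$ awk -v RS='\\s+' '/^[Aa]/ { ++count } END { print count + 0 }' file
4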
On other versions of awk, you could just loop through each word on the line manually:
awk '{ for (i = 1; i <= NF; ++i) if ($i ~ /^[Aa]/) ++count } END { print count + 0 }' file
Adding the counting option to Martin's script:
grep -Eowc "[Aa]\w+"
Note, though, that -c counts matching lines rather than individual matches, so this can report a smaller number than grep -Eow ... | wc -l when a line contains more than one word starting with "a".