Linux grep command for words beginning with a given character - regex

I'm struggling a bit with a grep command in an assignment.
I need to find every word starting with an 'a' in a document and then have word count determine how many there are. Since some words start with capital letters, I've run tr 'A-Z' 'a-z' first. I can easily get grep to find all the 'a' letters in the document, and also lines starting with an 'a', but for some reason I can't get grep to match words that start with an 'a'.
Hope you can help me.
Thanks everybody, this helped me out a lot.
It is quite hard to understand Linux IMO, but I'll get there eventually.
Again, thanks for all the help, much appreciated.
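For reference, a minimal pipeline along the lines the question describes might look like this (a sketch; document.txt is a placeholder name):
tr 'A-Z' 'a-z' < document.txt | grep -ow 'a[a-z]*' | wc -l
It lowercases everything first, then -o prints each whole word (-w) starting with 'a' on its own line, and wc -l counts them.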

You should be able to do
grep -Eow "[Aa]\w+" | wc -l
Which says: find whole words (-w) that begin with an "a" ([Aa]) and are followed by one or more word characters (\w+).
The -o option prints only the matched output.
Example
echo " Aest test aest test" | grep -Eow "[Aa]\w+" | wc -l # returns 2

If you're using GNU awk, then change the record separator to any spaces (so each word becomes a record) and keep a count:
awk -v RS='\\s+' '/^[Aa]/ { ++count } END { print count + 0 }' file
The + 0 just makes the output a bit clearer in case there are no matches (it prints 0 rather than an empty string). More correct would be if (NR) print count + 0, so no input => no output, but you might consider that overkill.
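A sketch of that stricter variant (same GNU awk requirement):
awk -v RS='\\s+' '/^[Aa]/ { ++count } END { if (NR) print count + 0 }' file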
On other versions of awk, you could just loop through each word on the line manually:
awk '{ for (i = 1; i <= NF; ++i) if ($i ~ /^[Aa]/) ++count } END { print count + 0 }' file

Adding the counting option (-c) to Martin's script:
grep -Eowc "[Aa]\w+"
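One caveat: -c counts matching lines, not individual matches, so it can differ from -o ... | wc -l when several matching words share a line:
echo "aest apple" | grep -Eowc "[Aa]\w+"        # prints 1 (one matching line)
echo "aest apple" | grep -Eow "[Aa]\w+" | wc -l # prints 2 (two matching words)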


Using grep and regex to extract words from a file that contain only one kind of vowel

I have a large dictionary file that contains one word per line.
I want to extract all lines that contain only one kind of vowel, so "see" and "best" and "levee" and "whenever" would be extracted, but "like" or "house" or "and" wouldn't. It's fine for me having to go over the file a few times, changing the vowel I'm looking for each time.
This command: grep -io '\b[eqwrtzpsdfghjklyxcvbnm]*\b' dictionary.txt
returns no words containing any vowel other than E, but it also gives me words like BBC or BMW that contain no vowel at all. How can I make the contained vowel a requirement?
How about
grep -i '^[^aiou]*e[^aiou]*$'
?
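To run this over every vowel in one go, a small bash loop can generate the class of excluded vowels for each pass (a sketch; dictionary.txt as in the question):
for v in a e i o u; do
  others=$(printf 'aeiou' | tr -d "$v")  # the four vowels to exclude
  echo "=== $v ==="
  grep -i "^[^$others]*$v[^$others]*$" dictionary.txt
done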
Here is an Awk attempt which collects all the hits in a single pass over the input file, then prints each bucket.
awk 'BEGIN { split("a:e:i:o:u", vowel, ":")
c = "[b-df-hj-np-tv-z]"
for (v in vowel)
regex = (regex ? regex "|" : "") "^" c "*" vowel[v] c "*(" vowel[v] c "*)*$" }
$0 ~ regex { for (v in vowel) if ($0 ~ vowel[v]) {
hit[v] = ( hit[v] ? hit[v] ORS : "") $0
next } }
END { for (v in vowel) {
printf "=== %s ===\n", vowel[v]
print hit[v] } }' /usr/share/dict/words
You'll notice that it prints words with syllabic y like jolly and cycle. A more complex regex should fix that, though the really thorny cases (like rhyme) need a more sophisticated model of English orthography.
The regex is clumsy because Awk does not support backreferences; an earlier version of this answer contained a simpler regex which would work with grep -E or similar, but it collected all matches in the same bucket.
Demo: https://ideone.com/wNrvPu
Using grep's -P (Perl-compatible regex) option:
^(?=.*e)[^aiou]+$
Explanation:
^ # beginning of line
(?=.*e) # positive lookahead, make sure we have at least one "e"
[^aiou]+ # one or more characters, none of which is a, i, o or u (e is still allowed)
$ # end of line
cat file.txt
see
best
levee
whenever
like
house
and
BBC
BMW
grep -P '^(?=.*e)[^aiou]+$' file.txt
see
best
levee
whenever

BASH: Search a string and display the exact number of times a substring occurs inside it

I've searched all over and still can't find this simple answer. I'm sure it's so easy. Please help if you know how to accomplish this.
sample.txt is:
AAAAA
I want to find the exact number of times the combination "AAA" occurs. If you just use, for example,
grep -o 'AAA' sample.txt | wc -l
we receive 1. This is the same as searching for the number of times AAA occurs with a standard text editor's search box. However, I want the complete number of matches, counted starting from each individual character, which is exactly 3.
In other words, I am looking for the literal exact number of occurrences of "AAA" starting from every individual character of sample.txt, not just the non-overlapping blocks a normal text-editor search finds.
How do we accomplish this, preferably in AWK? SED, GREP and anything else are fine as well, as long as I can include it in a Bash script.
This might work for you (GNU sed & wc):
sed -r 's/^[^A]*(AA?[^A]+)*AAA/AAA\nAA/;/^AAA/P;D' sample.txt | wc -l
Lose any characters other than A's, and single or double A's. Then print a triple A, lose the first A, and repeat. Finally, count the number of lines printed.
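If your grep supports PCRE (GNU grep's -P), a zero-width lookahead is another way to get the overlapping count; a minimal sketch:
grep -oP 'A(?=AA)' sample.txt | wc -l   # prints 3 for AAAAA
Each match consumes only one character while the lookahead checks for the remaining two, so overlapping occurrences are all counted.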
This isn't a trivial problem in bash. As far as I know, standard utils don't support this kind of searching. You can however use standard bash features to implement this behavior in a function. Here's how I would attack the problem, but there are other ways:
#!/bin/bash
search_term="AAA"
text=$(cat sample.txt)
term_len=${#search_term}
occurences=0
# While the text is greater than or equal to the search term length
while [ "${#text}" -ge "$term_len" ]; do
# Look at just the length of the search term
text_substr=${text:0:${term_len}}
# If we see the search term, increment occurences
if [ "$text_substr" = "$search_term" ]; then
((occurences++))
fi
# Remove the first character from the main text
# (e.g. "AAAAA" becomes "AAAA")
text=${text:1}
done
printf "%d occurences of %s\n" "$occurences" "$search_term"
This is the awk version
echo "AAAAA AAA AAAABBAAA" \
| gawk -v pat="AAA" '{
for(i=1; i<=NF; i++){
# current field length
m=length($i)
#search pattern length
n=length(pat)
for(l=1 ; l<m; l++){
sstr=substr($i,l,n)
#print i " " $i " sub:" sstr
# substring matches pattern
if(sstr ~ pat){
count++
}else{
print "contiguous count on field " i " = " count
# uncomment next line if non-contiguous matches are not needed
#break
}
}
print "total count on field " i " = " count
count=0
}
}'
I posted this on the OP's other question, but it was ignored, maybe because I did not add notes and an explanation. It is just a different approach, and any discussion is welcome.
$ awk -v sample="$(<sample.txt)" '{ x = sample; n = 0 } $0 != "" {
    while (t = index(x, $0)) { n++; x = substr(x, t+1) }
    print $0, n
}' combinations
Explanation:
The variables:
sample: the raw sample text slurped in from the file sample.txt with the -v argument
x: the targeting string; before each test, the value is reset to sample
$0: the testing string from the file combinations; each line feeds one testing string
n: the counter, i.e. the number of occurrences of the testing string ($0)
t: the position of the first character of the matched testing string ($0) in the targeting string (x)
Update: Added $0 != "" before the main block to skip EMPTY testing strings, which otherwise lead to an infinite loop.
The code:
awk -v sample="$(<sample.txt)" '
# reset the targeting string(with the sample text) and the counter "n"
{ x = sample; n = 0 }
# below the main block where $0 != "" to skip the EMPTY testing string
($0 != ""){
# the function index(x, $0) returns the position(assigned to "t") of the first character
# of the matched testing string($0) in the targeting string(x).
# when no match is found, it returns zero and thus step out of the while loop.
while(t=index(x,$0)) {
n++; # increment the number of matches
x = substr(x, t+1) # modify the targeting string, dropping everything up to and including position t
}
print $0, n # print the testing string and the counts
}
' combinations
awk's index() is much faster than regex matching, and it avoids expensive brute-force string comparisons. Attached are the tested sample.txt and combinations files:
$ more sample.txt
AAAAAHHHAAHH
HAAAAHHHAAHH
AAHH
$ more combinations
AA
HH
AAA
HHH
AAH
HHA
ZK
Tested Environment: GNU Awk 4.0.2, CentOS 7.3

shell regex not matching (number, length and K letter)

My regex is not matching. I tried with:
'^([0-9]{7,8})+([K|0-9]{1})'
'#[0-9]{7,8}[K 0-9]{1}'
'#[0-9]{7,8}[K 0-9]{1}'
"^([0-9]{7,8})+([K|0-9]{1})$"
I need to match 7 or 8 digits (1234567 or 12345678) followed by a K or a digit (\d is not supported here).
e.g.: 123456789 12345678K 12345678
With this regex '^[0-9]{7,8}[K|0-9]{1}' I get:
ok 184587939
ok 17678977K
ok 186613074
ok 18661307Z (wrong: the check on the last character doesn't work)
invalido 18R613074
ok 1845879398888888 (wrong: the length is not enforced)
ok 18458793U
invalido 18661G074
invalido 18661G07T
ok 18458793
invalido 1845E793
#!/bin/bash
var='^([0-9]{7,8})+([K|0-9]{1})$'
for LINEA in `cat rut.txt ` # LINEA holds each line read from rut.txt
do
rut=`echo $LINEA | cut -d ":" -f1` # extract the rut
rut=$(echo $rut | tr 'a-z' 'A-Z')
while :
do
if [[ $rut =~ $var ]];then
#$rut >> rutok.txt
echo $rut | cat >> rutok.txt
echo $rut' ok'
break
else
#$rut >> rutinv.txt
echo $rut | cat >> rutinv.txt
echo $rut' inv'
break
fi
done
done
exit 0
If I understand you correctly, I believe this is what you're after.
^([0-9]{8}|[0-9]{7}K)$
Your question is extremely unclear, but in addition, your code is incredibly overcomplicated. You seem to be looking for simply grep and grep -v. Looping over the lines in a file with for is definitely wrong (see http://mywiki.wooledge.org/DontReadLinesWithFor), but a while read -r loop is also usually an antipattern. In the general case, these are examples of an Awk script begging to be born.
cut -d : -f1 rut.txt |
tr a-z A-Z |
awk '/^([0-9]{7,8})([K0-9])$/ { print >> "rutok.txt"; next }
     { print >> "rutinv.txt" }'
The {1} repetition is just superfluous (otherwise we'd have to write h{1}t{1}t{1}p{1} etc. to match a simple string!), and inside a character class you just enumerate the characters and ranges; if you don't want to match a literal | character, don't put it in the character class.
([0-9]{7,8})+ means one or more repetitions of 7 or 8 digits, so 7, 8, 14, 15, 16, 21, ... digits. According to the exposition, this is not what you want.
I'm guessing the regex above covers your case: seven or eight digits and a K or one more digit.
Moreover, returning to your shell script, you should not use uppercase for your own variables (these are reserved for system use); and you need to quote your strings unless you specifically want the shell to perform token splitting and wildcard expansion on the values (though that's probably likely to be harmless here). Finally, the while loop does not seem to serve any useful purpose, as you break out of it on the first iteration in every case.
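For completeness, here is the same split done with plain grep and grep -v, as suggested above (a sketch; assumes the uppercased first field, as in the awk pipeline):
cut -d : -f1 rut.txt | tr a-z A-Z > ruts.tmp
grep -E '^[0-9]{7,8}[K0-9]$' ruts.tmp >> rutok.txt
grep -vE '^[0-9]{7,8}[K0-9]$' ruts.tmp >> rutinv.txt
rm ruts.tmp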
You can try this:
^([0-9]{7,8})([K0-9])$
For
184587939
17678977K
186613074
and ^([0-9]{7,8})([A-Z0-9])$ for
184587939
17678977K
186613074
17678977A
18661307B
17678977U
18661307L

Using awk to find a domain name containing the longest repeated word

For example, let's say there is a file called domains.csv with the following:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
I'm trying to use Linux awk regular expressions to find the line that contains the longest repeated[1] word, so in this case, it will return the line
5,letswelcomewelcomeyou.org
How do I do that?
[1] Meaning "immediately repeated", i.e., abcabc, but not abcXabc.
A pure awk implementation would be rather long-winded as awk regexes don't have backreferences, the usage of which simplifies the approach quite a bit.
I've added one line to the example input file for the case of multiple longest words:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
And this gets the lines with the longest repeated sequence:
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile
Since this is pretty anti-obvious, let's split up what this does and look at the output at each stage:
Remove the first column with the line number to avoid matches for lines numbers with repeating digits:
$ cut -d ',' -f 2 infile
helloguys.ca
byegirls.com
hellohelloboys.ca
hellobyebyedad.com
letswelcomewelcomeyou.org
letscomewelcomewelyou.org
Get all lines with a repeated sequence, extract just that repeated sequence:
... | grep -Eo '(.*)\1'
ll
hellohello
ll
byebye
welcomewelcome
comewelcomewel
Get the length of each of those lines:
... | awk '{ print length(), $0 }'
2 ll
10 hellohello
2 ll
6 byebye
14 welcomewelcome
14 comewelcomewel
Sort by the first column, numerically, descending:
...| sort -k 1,1 -nr
14 welcomewelcome
14 comewelcomewel
10 hellohello
6 byebye
2 ll
2 ll
Print the second of these columns for all lines where the first column (the length) has the same value as on the first line:
... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
welcomewelcome
comewelcomewel
Pipe this into grep, using the -f - argument to read stdin as a file:
... | grep -f - infile
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
Limitations
While this can handle the bbwelcomewelcome case mentioned in comments, it will trip on overlapping patterns such as welwelcomewelcome, where it only finds welwel, but not welcomewelcome.
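A quick way to see this (an illustration, not part of the original answer):
echo welwelcomewelcome | grep -Eo '(.*)\1'   # prints welwel, the leftmost match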
Alternative solution with more awk, less sort
As pointed out by tripleee in the comments, this can be simplified by combining the two awk steps and the sort step into a single awk step, likely improving performance:
$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile
Let's look at that awk step in more detail, with expanded variable names for clarity:
{
# New longest match: throw away stored longest matches, reset index
if (length() > max_len) {
max_len = length()
delete arr_longest
idx = 1
}
# Add line to longest matches
if (length() >= max_len)
arr_longest[idx++] = $0
}
# Print all the longest matches
END {
for (idx in arr_longest)
print arr_longest[idx]
}
Benchmarking
I've timed the two solutions on the top one million domains file mentioned in the comments:
First solution (with sort and two awk steps):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.742s
user 1m57.873s
sys 0m0.045s
Second solution (just one awk step, no sort):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.603s
user 1m56.514s
sys 0m0.045s
And the Perl solution by Casimir et Hippolyte:
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 0m5.249s
user 0m5.234s
sys 0m0.000s
What we learn from this: ask for a Perl solution next time ;)
Interestingly, if we know that there will be just one longest match and simplify the commands accordingly (just head -1 instead of the second awk command for the first solution, or no keeping track of multiple longest matches with awk in the second solution), the time gained is only in the range of a few seconds.
Portability remark
Apparently, BSD grep can't do grep -f - to read from stdin. In this case, the output of the pipe up to that point has to be redirected to a temp file, and this temp file then used with grep -f.
A way with perl:
perl -F, -ane 'if (@m=$F[1]=~/(?=(.+)\1)/g) {
  @m = sort { length $b <=> length $a } @m;
  $cl = length $m[0];
  if ($l < $cl) { @res = ($_); $l = $cl; } elsif ($l == $cl) { push @res, ($_); }
}
END { print @res; }' file
The idea is to find all longest overlapping repeated strings for each position in the second field; the match array is then sorted so that the longest substring becomes the first item in the array ($m[0]).
Once done, the length of the current repeated substring ($cl) is compared with the stored length (of the previous longest substring). When the current repeated substring is longer than the stored length, the result array is overwritten with the current line, when the lengths are the same, the current line is pushed into the result array.
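Run against the six-line example input from the first answer, this should print both longest lines:
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org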
details:
command line option:
-F, set the field separator to ,
-ane (e: execute the following code; n: read a line at a time and put its content in $_; a: autosplit, using the defined FS, and put the fields in the @F array)
The pattern:
/
(?= # open a lookahead assertion
(.+)\1 # capture group 1 and backreference to the group 1
) # close the lookahead
/g # all occurrences
This is a well-known pattern to find all overlapping results in a string. The idea is to use the fact that a lookahead doesn't consume characters (a lookahead only means "check if this subpattern follows at the current position", but it doesn't match any character). To obtain the characters matched in the lookahead, all that you need is a capture group.
Since a lookahead matches nothing, the pattern is tested at each position (and doesn't care if the characters have been already captured in group 1 before).
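A tiny illustration of the overlapping captures (a sketch):
perl -le 'print for "aaa" =~ /(?=(a+))/g'
which prints aaa, aa, and a: one capture per starting position.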

Linux: counting spaces and other characters in file

Problem:
I need to match an exact format for a mailing machine software program. It expects a certain format. I can count the number of new lines, carriage returns, tabs ...etc. using tools like
cat -vte
and
od -c
and
wc -l ( or wc -c )
However, I'd like to know the exact number of leading and trailing spaces between characters
and sections of text. Tabs as well.
Question:
How would you go about analyzing and then matching a template exactly using common Unix
tools + Perl or Python? One-liners preferred. Also, what's your advice for matching
a DOS-encoded file? Would you translate it to Unix line endings first, then analyze, or leave it as is?
UPDATE
Using this to see individual spaces [assumes no '%' chars in the file]:
sed 's/ /%/g' filename.000
Plan to build a script that analyzes each line's tab and space content.
Using @shiplu's solution, with a nod to the anti-cat crowd:
while read l;do echo $l;echo $((`echo $l | wc -c` - `echo $l | tr -d ' ' | wc -c`));done<filename.000
Still needs some tweaks for Windows, but it's well on its way.
SAMPLE TEXT
Key for reading:
newlines marked with \n
Carriage returns marked with \r
Unknown space/tab characters marked with [:space:] ( need counts on those )
\r\n
\n
[:space:]Institution Anon LLC\r\n
[:space:]123 Blankety St\r\n
[:space:]Greater Abyss, AK 99999\r\n
\n
\n
[:space:] 10/27/2011\r\n
[:space:]Requested materials are available for pickup:\r\n
[:space:]e__\r[:space:] D_ \r[:space:] _O\r\n
[:space:]Bathtime for BonZo[:space:] 45454545454545[:space:] 10/27/2011\r\n
[:space:]Bathtime for BonZo[:space:] 45454545454545[:space:] 10/27/2011\r\n
\n
\n
\n
\n
\n
\n
[:space:] Pantz McManliss\r\n
[:space:] Gibberish Ave\r\n
[:space:] Northern Mirkwood, ME 99999\r\n
( untold variable amounts of \n chars go here )
UPDATE 2
Using IFS with read gives similar results to the ruby posted by someone below.
while IFS='' read -r line
do
printf "%s\n" "$line" | sed 's/ /%/g' | grep -o '%' | wc -w
done < filename.000
perl -nlE'say 0+( () = /\s/g );'
Unlike the currently accepted answer, this doesn't split the input into fields, discarding the result. It also doesn't needlessly create an array just to count the number of values in a list.
Idioms used:
0+( ... ) imposes scalar context like scalar( ... ), but it's clearer because it tells the reader a number is expected.
List assignment in scalar context returns the number of elements returned by its RHS, so 0+( () = /.../g ) gives the number of times /.../g matched.
-l, when used with -n, will cause the input to be "chomped", so this removes line feeds from the count.
If you're just interested in spaces (U+0020) and tabs (U+0009), the following is faster and simpler:
perl -nE'say tr/ \t//;'
In both cases, you can pass the input via STDIN or via a file named by an argument.
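For example, reading from the file named in the question's update:
perl -nE'say tr/ \t//;' filename.000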
Regular expressions in Perl or Python would be the way to go here.
Perl Regular Expressions
Python Regular Expressions
Regular Expressions Cheat Sheet
Yes, it may take an initial time investment to learn "perl, schmerl, zwerl" but once you've gained experience with an extremely powerful tool like Regular Expressions, it can save you an enormous amount of time down the road.
perl -nwE 'print; for my $s (/([\t ]+)/g) { say "Count: ", length $s }' input.txt
This will count individual groups of tabs or spaces, instead of counting all the whitespace in the entire line. For example, a line with four leading spaces and eight spaces between the words:
    foo        bar
Will print
    foo        bar
Count: 4
Count: 8
You may wish to skip single spaces (spaces between words). I.e. don't count the spaces in Bathtime for BonZo. If so, replace + with {2,} or whatever minimum you think is appropriate.
Counting blanks:
sed 's/[^ ]//g' FILE | tr -d "\n" | wc -c
This counts blanks before, behind and between text. Do you want to count newlines, tabs, etc. in the same go and sum them up, or as a separate step?
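If the goal is just the total number of spaces and tabs in the whole file, tr can do it in one pass (a sketch):
tr -dc ' \t' < FILE | wc -c
The -d deletes and -c complements the set, so everything except spaces and tabs is removed before counting the remaining bytes.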
If you want to count the number of spaces in pm.txt, this command will do,
cat pm.txt | while read l;
do echo $((`echo $l | wc -c` - `echo $l | tr -d ' ' | wc -c`));
done;
If you want to count the number of spaces, \r, \n, \t use this,
cat pm.txt | while read l;
do echo $((`echo $l | wc -c` - `echo $l | tr -d ' \r\n\t' | wc -c`));
done;
read will strip any leading whitespace. If you don't want that, there is a nasty workaround. First split your file so that there is only one line per file, using
split -l 1 -d pm.txt
After that there will be a bunch of x* files. Now loop through them:
for x in x*; do echo $((`cat $x | wc -c` - `cat $x | tr -d ' \r\n\t' | wc -c`)); done
Remove those files with rm x*.
In case Ruby counts (it does count :)
ruby -lne 'puts $_.scan(/\s/).size'
and now some Perl (slightly less intuitive IMHO):
perl -lne 'print scalar(@{[/(\s)/g]})'
If you ask me, I'd write a simple C program to do the counting and formatting all in one go. But that's just me. By the time I got finished fiddle-farting around with perl, schmerl, zwerl I'd have wasted half a day.