grep not counting characters accurately when they are clearly in file - regex

I'm trying to count the number of times '(' appears in a file. I get a number back, but it's never accurate.
Why won't grep accurately count the occurrences of this character? The count needs to span all lines and include every occurrence.
I imagine my regex is off, but it's so simple.
log.txt:
(eRxîó¬Pä^oË'AqŠêêÏ-04ây9Í&ñ­ÖbèaïÄ®h0FºßôÊ$&Ð>0dÏ“ ²ˆde^áä­ÖÚƒíZÝ*ö¨tM
variable 1
paren )
(¼uC¼óµr\=Œ"J§ò<ƒu³ÓùËP
<åÐ#ô{ô
½ÊªÆÏglTµ¥>¦³éùoÏWÛz·ób(ÈIH|TT]
variable 0
paren )
Output:
$ grep -o "(" log.txt | wc -l
1
EDIT:
I had a weird mix of encodings, so I dumped the file to hex, one byte per line, and counted the 0x28 ('(') bytes:
od -An -v -tx1 hex.txt | tr ' ' '\n' | grep -c '^28$'

You might have encoding issues, if you interpret a single-byte encoding in a multibyte locale. Here's an approach that deletes everything except ( (in a single-byte locale), then counts the remaining characters:
LC_ALL=C <log.txt tr -c -d '(' | wc -c
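For instance, on a small made-up ASCII sample (just to show the counting, independent of line breaks):

```shell
# keep only '(' bytes, then count them; LC_ALL=C treats input as raw bytes
printf '(a(b\nc(\n' | LC_ALL=C tr -c -d '(' | wc -c   # prints 3
```

Note that tr -c -d '(' also deletes the newlines, so wc -c returns exactly the number of parentheses.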

Dump the unknown encoding to hex, one byte per line, and count the 0x28 bytes (grep -c counts matching lines, which is why the one-byte-per-line step matters):
od -An -v -tx1 hex.txt | tr ' ' '\n' | grep -c '^28$'

Using sed (then counting with wc, because doing the counting inside sed alone is a bit heavy):
sed -e '1h;1!H;$!d' -e 'x;s/[^(]//g' yourfile | tr -d '\n' | wc -c
using awk
awk -F '(' '{ Total += NF - 1 } END { print Total }' YourFile
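A quick sanity check of the awk idea on made-up input (NF-1 is the number of '(' separators on each line, summed over all lines):

```shell
# two lines containing four '(' in total
printf '((a(\nb(\n' | awk -F '(' '{ total += NF - 1 } END { print total }'   # prints 4
```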

Related

trying to find number of directories in a path using a regexp

I need to find how many directories are in a given path.
For example testdir1/testdir2/testdir3/ should return three directories accounting for the last / that has no text after it.
This is all in a bash environment.
This is what I tried and came up with, and somewhat works, but I get four directories instead of three:
tr '/.' '\n' <<< testdir1/testdir2/testdir3/ | wc -l
How would I write "find all /'s except a trailing one with no text after it"?
Your help would be greatly appreciated.
You can replace the here-string <<< with process substitution <( ), because the here-string appends an extra newline:
tr '/' '\n' < <(printf '%s' testdir1/testdir2/testdir3/) | wc -l
Output
3
Or just remove the trailing slash /
tr '/' '\n' <<< testdir1/testdir2/testdir3 | wc -l
Output
3
Using AWK
:=>echo "testdir1/testdir2/testdir3/" | awk -F'/' '{ if ($NF == "") print NF-1; else print NF }'
3
:=>
Explanation:
awk -F'/' -- set the field separator to /
$NF -- the value of the last field; NF is the number of fields in the current record
if ($NF == "") print NF-1; else print NF -- if the last field is empty (the path ends in /), print NF-1, otherwise print NF
Edit: Using grep
:=>echo "testdir1/testdir2/testdir3/" | grep -o '/' | wc -l
3
Would you please try the following:
tr -cd "/" <<< "testdir1/testdir2/testdir3/" | wc -c
tr -cd "/" removes all characters of the input other than "/" and the output will be ///. Then wc -c counts the number of bytes.
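If this is needed in more than one place, the tr -cd counting can be wrapped in a tiny helper (the function name count_dirs is made up; like the one-liner, it counts slashes, so it expects a trailing /):

```shell
# count_dirs: count the path components of a slash-terminated path
# by counting '/' characters, e.g. "a/b/c/" -> 3
count_dirs() {
    printf '%s' "$1" | tr -cd '/' | wc -c
}

count_dirs 'testdir1/testdir2/testdir3/'   # prints 3
```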

grep command to find out how many times any character is followed by '.'

I have to find out, using grep, how often any character is followed by a period (.), and then sort the results in ascending order.
For example in this string: "Find my input. Output should be obtained. You need to find output."
The output should be something like this:
d 1
t 2
What I have done so far :
cat filename | grep -o "*." | sort -u
But it is not working as intended.
Any ideas how to solve this? I have to perform this operation on huge library of books in .txt files.
A step-by-step refinement with GNU grep:
grep -o '.\.' filename | sort | uniq -c
Output:
1 d.
2 t.
grep -Po '.(?=\.)' filename | sort | uniq -c
Output:
1 d
2 t
grep -Po '.(?=\.)' filename | sort | uniq -c | awk '{print $2,$1}'
Output:
d 1
t 2
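To double-check, here is the same pipeline run against the sample sentence from the question via echo instead of a file (requires GNU grep built with PCRE support for -P):

```shell
echo "Find my input. Output should be obtained. You need to find output." \
  | grep -Po '.(?=\.)' | sort | uniq -c | awk '{print $2,$1}'
# prints:
# d 1
# t 2
```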
With single GNU awk process:
awk -v FPAT='.[.]' 'BEGIN{ PROCINFO["sorted_in"]="#ind_str_asc" }
{ for(i=1;i<=NF;i++) a[substr($i,1,1)]++ }
END{ for(i in a) print i,a[i] }' filename
The output:
d 1
t 2
This one works too, though note that rev reverses the whole line, so it is only reliable while the counts stay single-digit:
echo "Find my input. Output should be obtained. You need to find output."| grep -o ".\." | sort | uniq -c | rev | tr -d .

Using math and grep

I need to find the number of invalid email addresses in a file, basically any line that contains "#" but is not in the correct format.
I am using this to count the number of valid email addresses:
grep -Ei '[A-Z0-9.-]+#[A-Z0-9.-]+\.[A-Z]{3}' $1 | wc -l
and this to calculate how many lines contain #:
grep -E '#' $1 | wc -l
Is there a way to subtract the number of valid emails from the number of lines that contain # anywhere, before printing the result with wc -l?
grep has a -c option to print the count of matching lines directly; you should leverage that instead of spawning another process and an anonymous pipe:
grep -c '<pattern>' file.txt
To subtract the counts from two searches, you can directly subtract them using command substitution:
echo $(( $(grep -c '<pattern_1>' file.txt) - $(grep -c '<pattern_2>' file.txt) ))
If you fancy, you can use two variables as well:
count_1=$(grep -c '<pattern_1>' file.txt)
count_2=$(grep -c '<pattern_2>' file.txt)
echo $(( count_1 - count_2 ))
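Putting it together, a minimal sketch with made-up sample data (the patterns are the ones from the question):

```shell
# three lines contain '#'; two match the "valid" pattern, one does not
data='alice#example.com
user#host.org
bad#line'
valid=$(printf '%s\n' "$data" | grep -cEi '[A-Z0-9.-]+#[A-Z0-9.-]+\.[A-Z]{3}')
total=$(printf '%s\n' "$data" | grep -c '#')
echo $(( total - valid ))   # number of invalid addresses; prints 1
```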

grep matching but not printing if line end in dos ^M

I need to search in multiple files for a PATTERN, if found display the file, line and PATTERN surrounded by a few extra chars. My problem is that if the line matching the PATTERN ends with ^M (CRLF) grep prints an empty line instead.
Create a file like this: first line "a^M", second line "a", third line empty, fourth line "a" (not followed by a newline).
a^M
a

a
Without trying to match a few chars after the PATTERN all occurrences are found and displayed:
# grep -srnoEiI ".{0,2}a" *
1:a
2:a
4:a
If I try to match any chars at the end of the PATTERN, it prints an empty line instead of line one, the one ending in CRLF:
# grep -srnoEiI ".{0,2}a.{0,2}" *
2:a
4:a
How can I change this to act as expected?
P.S. I would like to fix this with grep, but I will accept other solutions, for example in awk.
EDIT:
Based on the answers below, I chose to strip the \r, using --color=always so grep keeps its coloring through the pipe to tr:
grep --color=always -srnoEiI ".{0,2}a.{0,2}" * | tr -d '\r'
Here's a simpler case that reproduces your problem:
# Output
echo $'a\r' | grep -o "a"
# No output
echo $'a\r' | grep -o "a."
This is because the ^M matches like a regular character, and makes your terminal overwrite its output (this is purely cosmetic).
How you want to fix this depends on what you want to do.
# Show the output in hex format to ensure it's correct
$ echo $'a\r' | grep -o "a." | od -t x1 -c
0000000 61 0d 0a
a \r \n
# Show the output in visually less ambiguous format
$ echo $'a\r' | grep -o "a." | cat -v
a^M
# Strip the carriage return
$ echo $'a\r' | grep -o "a." | tr -d '\r'
a
awk -v pattern="a" '$0 ~ pattern && !/\r$/ {print NR ": " $0}' file
or
sed -n '/a/{/\r$/!{=;p}}' ~/tmp/srcfile | paste -d: - -
Both of these do: find the pattern, see if the line does not end in a carriage return, print the line number and the line. For the sed, the line number is on its own line, so we have to join two consecutive lines with a colon.
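For instance, on the four-line sample from the question (a^M, a, empty line, a), the awk version prints only the lines without a trailing carriage return:

```shell
printf 'a\r\na\n\na\n' \
  | awk -v pattern="a" '$0 ~ pattern && !/\r$/ {print NR ": " $0}'
# prints:
# 2: a
# 4: a
```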
You could use pcregrep:
pcregrep -n '.{0,2}a.{0,2}' inputfile
For your sample input:
$ printf $'a\r\na\n\na\n' | pcregrep -n '.{0,2}a.{0,2}'
1:a
2:a
4:a
A couple more ways:
Use the dos2unix utility to convert the dos-style line endings to unix-style:
dos2unix myfile.txt
Or preprocess the file using tr to remove the CR characters, then pipe to grep:
$ tr -d '\r' < myfile.txt | grep -srnoEiI ".{0,2}a.{0,2}"
1:a
2:a
4:a
$
Note that dos2unix may need to be installed on whatever OS you are using, whereas tr will be available on any POSIX-compliant OS.
You can use awk with a custom field separator:
awk -F '[[:blank:]\r]' '/.{0,2}a.{0,2}/{print FILENAME, NR, $1}' OFS=':' file
TESTING:
Your grep command:
grep -srnoEiI ".{0,2}a.{0,2}" file|cat -vte
file:1:a^M$
file:2:a$
file:4:a$
Suggested awk command:
awk -F '[[:blank:]\r]' '/.{0,2}a.{0,2}/{print FILENAME, NR, $1}' OFS=':' file|cat -vte
file:1:a$
file:2:a$
file:4:a$

Counting regex pattern matches in one line using sed or grep?

I want to count the number of matches on one single line (or across all lines, as there will always be only one line).
That is, more than just one match per line as in
echo "123 123 123" | grep -c -E "123" # Result: 1
Better example:
echo "1 1 2 2 2 5" | grep -c -E '([^ ])( \1){1}' # Result: 1, expected: 2 or 3
You could use grep -o then pipe through wc -l:
$ echo "123 123 123" | grep -o 123 | wc -l
3
Maybe below:
echo "123 123 123" | sed "s/123 /123\n/g" | wc -l
( maybe ugly, but my bash fu is not that great )
Maybe you should convert spaces to newlines first:
$ echo "1 1 2 2 2 5" | tr ' ' $'\n' | grep -c 2
3
Why not use awk?
You could use awk '{print gsub(your_regex,"&")}'
to print the number of matches on each line, or
awk '{c+=gsub(your_regex,"&")}END{print c}'
to print the total number of matches. Note that relative speed may vary depending on which awk implementation is used, and which input is given.
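For example, counting how many times 2 appears in the OP's second input (note that awk's ERE has no backreferences, so the \1-style pattern from the question won't carry over directly; /2/ here is just an illustrative regex):

```shell
# gsub replaces each match with itself and returns the number of replacements
echo "1 1 2 2 2 5" | awk '{ print gsub(/2/, "&") }'   # prints 3
```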
This might work for you:
sed -n -e ':a' -e 's/123//p' -e 'ta' file | sed -n '$='
With GNU sed this can be shortened to:
sed -n ':;s/123//p;t' file | sed -n '$='
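Applied to the first example from the question (the portable form; each successful substitution prints one line, and the second sed prints the line count):

```shell
echo "123 123 123" | sed -n -e ':a' -e 's/123//p' -e 'ta' | sed -n '$='   # prints 3
```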