Using math and grep - regex

I need to find the number of invalid email addresses in a file: basically, any line that contains "@" but is not in the correct format.
I am using this to count the number of valid email addresses:
grep -Ei '[A-Z0-9.-]+@[A-Z0-9.-]+\.[A-Z]{3}' "$1" | wc -l
and this to count how many lines contain @:
grep -E '@' "$1" | wc -l
Is there a way to subtract the number of valid emails from the number of lines that contain @ anywhere, before printing with wc -l?

grep has a -c option that prints the number of matching lines directly; use it instead of spawning another process and an anonymous pipe:
grep -c '<pattern>' file.txt
To subtract the counts from two searches, you can combine command substitution with arithmetic expansion:
echo $(( $(grep -c '<pattern_1>' file.txt) - $(grep -c '<pattern_2>' file.txt) ))
If you fancy, you can use two variables as well:
count_1=$(grep -c '<pattern_1>' file.txt)
count_2=$(grep -c '<pattern_2>' file.txt)
echo $(( count_1 - count_2 ))
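Applied to the patterns from the question, the subtraction could look like the sketch below. The function name count_invalid is an assumption made for illustration, and the "valid" pattern is simply the one the question uses, not a complete email validator.

```bash
# count_invalid is a hypothetical helper name, not part of the original answer.
count_invalid() {
    local file=$1 with_at valid
    # Lines that contain "@" anywhere.
    with_at=$(grep -c '@' -- "$file")
    # Lines that contain something shaped like an address (pattern taken from the question).
    valid=$(grep -Eic '[A-Z0-9.-]+@[A-Z0-9.-]+\.[A-Z]{3}' -- "$file")
    echo $(( with_at - valid ))
}
```

Calling count_invalid emails.txt (emails.txt being a placeholder file name) prints the difference. The subtraction is meaningful because every line that matches the address pattern also contains @.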

How to zero pad a sequence of integers separated with dot in bash

I need to shape versions in a specific format.
For instance:
V1=1.0.1
V2=4.0.1
V3=3.1.101
...
They need to be zero-padded as follows:
V1=001.000.001.000
V2=004.000.001.000
V3=003.001.101.000
...
Any idea how I can do that?
EDIT:
I succeeded using printf as follows:
printf "%03d.%03d.%03d.000\n" $(echo $V3 | grep -o '[^-]*$' | cut -d. -f1) $(echo $V3 | grep -o '[^-]*$' | cut -d. -f2) $(echo $V3 | grep -o '[^-]*$' | cut -d. -f3)
output:
003.001.101.000
Any better suggestions?
Let's try with sed, taking as input a text file named versions.txt that lists the versions. I split the instructions into separate commands for simplicity:
# Add '00' before each sub-version number
sed -i -r 's/([=\.])([0-9])/\100\2/g' versions.txt
# Remove '00' if sub-version number had 3 digits
sed -i -r 's/([=\.])00([0-9]{3})/\1\2/g' versions.txt
# Remove '0' if sub-version number had 2 digits
sed -i -r 's/([=\.])00([0-9]{2})/\10\2/g' versions.txt
# Add the final '.000' after each version
sed -i -r 's/([0-9]{3}\.[0-9]{3}\.[0-9]{3})/\1\.000/g' versions.txt
Another sed-approach:
sed -r 's/\b([0-9]{1})(\.|$)/00\1\2/g;s/\b([0-9]{2})(\.|$)/0\1\2/g;s/(([0-9]{3}\.|$){3})/\1.000/g'
You can try with awk:
awk -F'[=.]' '{                        # Split fields on = and .
    split($0, a, FS, seps)             # gawk only: a[] gets the fields, seps[] the separators
    for (i = 1; i <= NF; i++) {        # Loop through all fields
        if (i > 1)
            a[i] = sprintf("%03d", $i) # Zero-pad each version number to 3 digits
        printf "%s%s", a[i], seps[i]   # Print the field and the separator that followed it
    }
    print ".000"                       # Append the final component and a newline
}' file
If the version is in a bash variable, you could use a simpler awk one-liner:
V3="3.1.101"; awk -F. '{for(i=1;i<5;i++){$i=sprintf("%03d",$i)}}1' OFS='.' <<<"$V3"
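For comparison, the printf idea from the question's EDIT can be done without grep or cut by letting read split the string on dots. This is only a sketch: pad_version is a hypothetical name, and it assumes the input has exactly three numeric components.

```bash
# pad_version is a hypothetical helper; it assumes exactly three numeric components.
pad_version() {
    local major minor patch
    # Split the argument on "." into three variables.
    IFS=. read -r major minor patch <<< "$1"
    # 10# forces base 10 so that an input such as 08 is not parsed as octal.
    printf '%03d.%03d.%03d.000\n' "$((10#$major))" "$((10#$minor))" "$((10#$patch))"
}

pad_version 3.1.101   # prints 003.001.101.000
```

The trailing .000 is hard-coded in the format string, as in the question's own printf attempt.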

grep not counting characters accurately when they are clearly in file

I'm trying to count the number of times '(' appears in a file. I get a number back, but it's never accurate.
Why won't grep accurately count the occurrences of this character? The count must span multiple lines and include every occurrence.
I imagine my regex is off, but it's so simple.
log.txt:
(eRxîó¬Pä^oË'AqŠêêÏ-04ây9Í&ñ­ÖbèaïÄ®h0FºßôÊ$&Ð>0dÏ“ ²ˆde^áä­ÖÚƒíZÝ*ö¨tM
variable 1
paren )
(¼uC¼óµr\=Œ"J§ò<ƒu³ÓùËP
<åÐ#ô{ô
½ÊªÆÏglTµ¥>¦³éùoÏWÛz·ób(ÈIH|TT]
variable 0
paren )
Output:
$ grep -o "(" log.txt | wc -l
1
EDIT:
I had a weird mix of encodings, so I dumped the file and then counted the hex values.
hexdump -C hex.txt | grep "28" | wc -l
You might have encoding issues, if you interpret a single-byte encoding in a multibyte locale. Here's an approach that deletes everything except ( (in a single-byte locale), then counts the remaining characters:
LC_ALL=C <log.txt tr -c -d '(' | wc -c
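Dump the file in the unknown encoding one byte per line and count the hex value 28, which is ( in ASCII. Matching whole lines avoids the pitfalls of grepping hexdump -C output, where grep "28" | wc -l counts lines rather than occurrences and can also match the offset column:
hexdump -v -e '1/1 "%02x\n"' hex.txt | grep -c '^28$'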
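Using sed to strip everything except ( and then counting with wc (doing the count in sed alone would be cumbersome). H, not h, appends each line to the hold space so that all lines are accumulated, and tr removes the trailing newline so that wc -c counts only the parentheses:
sed -e '1h;1!H;$!d' -e 'x;s/[^(]//g' yourfile | tr -d '\n' | wc -c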
Using awk (the NF guard skips empty lines, which would otherwise contribute -1):
awk -F '(' 'NF { Total += NF - 1 } END { print Total + 0 }' YourFile
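One possible explanation for the output of 1 in the question, offered here as an assumption rather than something the thread confirms, is that grep treated log.txt as binary and printed a single "Binary file ... matches" line, which wc -l then counted. The sketch below wraps the byte-oriented tr approach from above and a grep variant that forces text mode. The helper names are hypothetical, and CHAR is assumed to be a single byte that is not special to tr.

```bash
# count_char is a hypothetical helper: count_char CHAR FILE
count_char() {
    # In the C locale every byte is a valid character, so tr deletes all bytes
    # except CHAR and wc -c counts what is left.
    LC_ALL=C tr -c -d "$1" < "$2" | wc -c | tr -d ' '
}

# count_char_grep does the same with grep. -a (supported by GNU and BSD grep,
# not required by POSIX) treats the file as text; -F matches CHAR literally.
count_char_grep() {
    LC_ALL=C grep -a -o -F -e "$1" -- "$2" | wc -l | tr -d ' '
}
```

Usage would be count_char '(' log.txt. The final tr -d ' ' only strips the padding that some wc implementations print.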

Selectively extract number from file name

I have a list of files named in the format AA13_11BB, CC290_23DD, EE92_34RR. I need to extract only the numbers after the _ character, not the ones before. For those three file names, I would like to get 11, 23, 34 as output and, after each extraction, store the number in a variable.
I'm very new to bash and regex. Currently, from AA13_11BB, I am able to either obtain 13_11:
for imgs in $DIR; do
LEVEL=$(echo $imgs | egrep -o [_0-9]+);
done
or two separate numbers 13 and 11:
LEVEL=$(echo $imgs | egrep -o [0-9]+)
May I please have some advice how to obtain my desired output? Thank you!
Use egrep with sed:
LEVEL=$(echo $imgs | egrep -o '_[0-9]+' | sed 's/_//' )
To complement the existing helpful answers, using the core of hjpotter92's answer:
The following processes all filenames in $DIR in a single command and reads all extracted tokens into an array:
IFS=$'\n' read -d '' -ra levels < \
<(printf '%s\n' "$DIR"/* | egrep -o '_[0-9]+' | sed 's/_//')
IFS=$'\n' read -d '' -ra levels splits the input into lines and stores them as elements of array ${levels[@]}.
<(...) is a process substitution that allows the output from a command to act as an (ephemeral) input file.
printf '%s\n' "$DIR"/* uses pathname expansion to output each filename on its own line.
egrep -o '_[0-9]+' | sed 's/_//' is the same as in hjpotter92's answer - it works equally on multiple input lines, as is the case here.
To process the extracted tokens later, use:
for level in "${levels[@]}"; do
echo "$level" # work with $level
done
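As a variation on the array approach above, bash 4 or later can fill the array with mapfile. The sketch below is not from the original answers: collect_levels is a hypothetical name, and it strips the directory part first so that an underscore in the path cannot be mistaken for the one in the file name.

```bash
# collect_levels is a hypothetical helper; it fills the global array "levels".
# Requires bash 4+ for mapfile.
collect_levels() {
    local dir=$1
    # First s command: drop everything up to the last slash (the directory part).
    # Second s command: keep only the digits after the last underscore;
    # -n with the p flag prints only the names where that substitution matched.
    mapfile -t levels < <(printf '%s\n' "$dir"/* |
        sed -nE 's|.*/||; s/.*_([0-9]+).*/\1/p')
}
```

After collect_levels "$DIR", the array can be iterated with the same for loop as above.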
You can do it in one sed using the regex .*_([0-9]+).* (escape it properly for sed):
sed "s/.*_\([0-9]\+\).*/\1/" <<< "AA13_11BB"
It replaces the line with the first captured group (the sub-regex inside the ()), outputting:
11
In your script:
LEVEL=$(sed "s/.*_\([0-9]\+\).*/\1/" <<< $imgs)
Update: as suggested by @mklement0, in both BSD sed and GNU sed you can shorten the command using the -E option:
LEVEL=$(sed -E "s/.*_([0-9]+).*/\1/" <<< $imgs)
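Using grep with the -P flag (GNU grep built with PCRE support); [0-9]+ matches a number of any length after the underscore:
for imgs in $DIR
do
LEVEL=$(echo "$imgs" | grep -Po '(?<=_)[0-9]+')
echo "$LEVEL"
done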
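The same extraction can also be sketched in bash alone, without spawning grep or sed, by using the [[ =~ ]] operator and BASH_REMATCH. The function name extract_level is an assumption for illustration; it returns a non-zero status when the name has no underscore followed by digits.

```bash
# extract_level is a hypothetical helper: prints the digits after the first
# underscore that is followed by at least one digit.
extract_level() {
    local re='_([0-9]+)'
    [[ $1 =~ $re ]] || return 1
    printf '%s\n' "${BASH_REMATCH[1]}"
}

LEVEL=$(extract_level AA13_11BB)
echo "$LEVEL"   # prints 11
```

Keeping the regex in a variable avoids quoting surprises on the right-hand side of =~.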

Bash: how to take a number from string? (regular expression maybe)

I want to get a count of symbols in a file.
wc -c f1.txt | grep [0-9]
But this code returns the whole line in which grep found numbers. I want to return only 38. How?
You can use awk:
wc -c f1.txt | awk '{print $1}'
OR using grep -o:
wc -c f1.txt | grep -o "[0-9]\+"
OR using bash regex capabilities:
re="^ *([0-9]+)" && [[ "$(wc -c f1.txt)" =~ $re ]] && echo "${BASH_REMATCH[1]}"
Pass the data to wc on stdin instead of as a file argument, so that wc has no filename to print: nchars=$(wc -c < f1.txt)
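A small sketch of the stdin approach follows. byte_count is a hypothetical name. Note that wc -c counts bytes, while wc -m counts characters, so the two can differ for multibyte text.

```bash
# byte_count is a hypothetical helper that prints only the number.
byte_count() {
    local n
    n=$(wc -c < "$1")
    # Arithmetic expansion drops the leading spaces that some wc implementations print.
    echo $(( n ))
}
```

For example, nchars=$(byte_count f1.txt) stores the bare number in a variable.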

Get total number of matches for a regex via standard unix command

Let's say that I want to count the number of "o" characters in the text
oooasdfa
oasoasgo
My first thought was to do grep -c o, but this returns 2, because grep returns the number of matching lines, not the total number of matches. Is there a flag I can use with grep to change this? Or perhaps I should be using awk, or some other command?
This will print the number of matches:
echo "oooasdfa
oasoasgo" | grep -o o | wc -l
You can use the shell (bash):
$ var=$(<file)
$ echo $var
oooasdfa oasoasgo
$ o="${var//[^o]/}"
$ echo ${#o}
6
Or awk:
$ awk '{m=gsub("o","");sum+=m}END{print sum}' file
6
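The grep -o | wc -l idea from the first answer can be wrapped in a small function, sketched below. count_matches is a hypothetical name, and -o is assumed to be available (it is supported by GNU and BSD grep but is not required by POSIX).

```bash
# count_matches is a hypothetical helper: count_matches PATTERN FILE
count_matches() {
    # -o prints each match on its own line, so wc -l counts matches rather than lines.
    grep -o -e "$1" -- "$2" | wc -l | tr -d ' '
}
```

With the two sample lines from the question saved in a file, count_matches o file prints 6, in agreement with the bash and awk results above.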