uniq treats lines as equal when they are not

I would expect different output from this command:
$ echo -e "あいうえお\nオエウイア" | uniq -c
2 あいうえお
The two lines are not the same.
Compare to this example, working as expected:
$ echo -e "aiueo\noeuia" | uniq -c
1 aiueo
1 oeuia
Is this a Unicode or UTF-8 issue? I did not find any option to support "exotic" characters.
Edit: I am experiencing a similar problem when using sort with Japanese input. Input of the form a\nb\na\nb\n (or, omitting '\n', abab) stays that way; I would expect it to be aabb or at least bbaa.

There you go:
echo -e "あいうえお\nオエウイア" | uni2ascii -q | uniq -c | ascii2uni
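This is typically a collation issue: in some UTF-8 locales the library's string comparison treats these two lines as equal, which affects both uniq and sort. A workaround worth trying (a sketch, assuming that is the cause here) is to force byte-wise comparison with the C locale:
echo -e "あいうえお\nオエウイア" | LC_ALL=C uniq -c
The same LC_ALL=C override in front of sort addresses the similar problem mentioned in the edit.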

grep command to find out how many times any character is followed by '.'

I have to find out how often any character is followed by a period (.) with the help of grep. After counting how many times each character is followed by a period, I have to sort the result in ascending order.
For example in this string: "Find my input. Output should be obtained. You need to find output."
The output should be something like this:
d 1
t 2
What I have done so far :
cat filename | grep -o "*." | sort -u
But it is not working as intended.
Any ideas how to solve this? I have to perform this operation on huge library of books in .txt files.
A step-by-step approach with GNU grep:
grep -o '.\.' filename | sort | uniq -c
Output:
1 d.
2 t.
To drop the trailing period, use a lookahead assertion:
grep -Po '.(?=\.)' filename | sort | uniq -c
Output:
1 d
2 t
To match the requested output format, swap the columns with awk:
grep -Po '.(?=\.)' filename | sort | uniq -c | awk '{print $2,$1}'
Output:
d 1
t 2
With a single GNU awk process:
awk -v FPAT='.[.]' 'BEGIN{ PROCINFO["sorted_in"]="@ind_str_asc" }  # traverse the array in ascending key order
{ for(i=1;i<=NF;i++) a[substr($i,1,1)]++ }  # each field is a character followed by "."; count that character
END{ for(i in a) print i,a[i] }' filename
The output:
d 1
t 2
This one works too:
echo "Find my input. Output should be obtained. You need to find output."| grep -o ".\." | sort | uniq -c | rev | tr -d .

Using math and grep

I need to find the number of invalid email addresses in a file, basically any line that contains "@" but is not in the correct format.
I am using this to count the number of valid email addresses:
grep -Ei '[A-Z0-9.-]+@[A-Z0-9.-]+\.[A-Z]{3}' $1 | wc -l
and this to calculate how many lines contain @:
grep -E '@' $1 | wc -l
Is there a way I can subtract the number of valid emails from the number of lines that contain @ anywhere, before printing with wc -l?
grep has a -c option to print just the count of matching lines; use that instead of spawning another process and an anonymous pipe:
grep -c '<pattern>' file.txt
To subtract the counts from two searches, you can subtract them directly using command substitution:
echo $(( $(grep -c '<pattern_1>' file.txt) - $(grep -c '<pattern_2>' file.txt) ))
If you fancy, you can use two variables as well:
count_1=$(grep -c '<pattern_1>' file.txt)
count_2=$(grep -c '<pattern_2>' file.txt)
echo $(( count_1 - count_2 ))
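If you would rather make a single pass over the file, here is a sketch with awk (it reuses the simplified pattern from the question and assumes an awk with {n} interval support, such as GNU awk):
awk '/@/ { total++ }                                              # every line containing @
     /[A-Za-z0-9.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{3}/ { valid++ }     # lines matching the question's pattern
     END { print total - valid }' "$1"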

grep not counting characters accurately when they are clearly in file

I'm trying to count the number of times '(' appears in a file. I get a number back, but it's never accurate.
Why won't grep accurately count the occurrences of this character? It needs to work across multiple lines and count every occurrence.
I imagine my regex is off, but it's so simple.
log.txt:
(eRxîó¬Pä^oË'AqŠêêÏ-04ây9Í&ñ­ÖbèaïÄ®h0FºßôÊ$&Ð>0dÏ“ ²ˆde^áä­ÖÚƒíZÝ*ö¨tM
variable 1
paren )
(¼uC¼óµr\=Œ"J§ò<ƒu³ÓùËP
<åÐ#ô{ô
½ÊªÆÏglTµ¥>¦³éùoÏWÛz·ób(ÈIH|TT]
variable 0
paren )
Output:
$ grep -o "(" log.txt | wc -l
1
EDIT:
I had a weird mix of encodings, so I dumped the file and counted the hex values.
hexdump -C hex.txt | grep "28" | wc -l
You might have encoding issues if you interpret a single-byte encoding in a multibyte locale. (GNU grep likely detected the file as binary and printed a single "Binary file ... matches" line rather than the individual matches, which would explain why wc -l reported 1.) Here's an approach that deletes everything except '(' (in a single-byte locale), then counts the remaining characters:
LC_ALL=C <log.txt tr -c -d '(' | wc -c
Dump the unknown encoding and count hex values.
hexdump -C hex.txt | grep "28" | wc -l
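Note that grepping the hexdump output for "28" counts matching lines rather than bytes, and it can also hit offsets or neighbouring byte values. A sketch that counts each 0x28 byte individually (assumes a POSIX od; the file name follows the question's log.txt):
od -An -tx1 log.txt | tr ' ' '\n' | grep -c '^28$'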
Using sed (then counting with wc, because doing it in sed alone is a bit heavy):
sed -e '1h;1!H;$!d' -e 'x;s/[^(]//g' yourfile | tr -d '\n' | wc -c
Using awk:
awk -F '(' '{ Total += NF - 1 } END { print Total }' YourFile

Selectively extract number from file name

I have a list of files with names of the form AA13_11BB, CC290_23DD, EE92_34RR. I need to extract only the numbers after the _ character, not the ones before. For those three file names I would like to get 11, 23, 34 as output, and after each extraction, store the number in a variable.
I'm very new to bash and regex. Currently, from AA13_11BB, I am able to either obtain 13_11:
for imgs in $DIR; do
    LEVEL=$(echo $imgs | egrep -o [_0-9]+);
done
or two separate numbers 13 and 11:
LEVEL=$(echo $imgs | egrep -o [0-9]+)
May I please have some advice how to obtain my desired output? Thank you!
Use egrep with sed:
LEVEL=$(echo $imgs | egrep -o '_[0-9]+' | sed 's/_//' )
To complement the existing helpful answers, using the core of hjpotter92's answer:
The following processes all filenames in $DIR in a single command and reads all extracted tokens into array:
IFS=$'\n' read -d '' -ra levels < \
<(printf '%s\n' "$DIR"/* | egrep -o '_[0-9]+' | sed 's/_//')
IFS=$'\n' read -d '' -ra levels splits the input into lines and stores them as elements of array ${levels[@]}.
<(...) is a process substitution that allows the output from a command to act as an (ephemeral) input file.
printf '%s\n' "$DIR"/* uses pathname expansion to output each filename on its own line.
egrep -o '_[0-9]+' | sed 's/_//' is the same as in hjpotter92's answer - it works equally on multiple input lines, as is the case here.
To process the extracted tokens later, use:
for level in "${levels[@]}"; do
    echo "$level" # work with $level
done
You can do it in one sed using the regex .*_([0-9]+).* (escape it properly for sed):
sed "s/.*_\([0-9]\+\).*/\1/" <<< "AA13_11BB"
It replaces the line with the first captured group (the sub-regex inside the ()), outputting:
11
In your script:
LEVEL=$(sed "s/.*_\([0-9]\+\).*/\1/" <<< $imgs)
Update: as suggested by @mklement0, in both BSD sed and GNU sed you can shorten the command using the -E parameter:
LEVEL=$(sed -E "s/.*_([0-9]+).*/\1/" <<< $imgs)
Using grep with the -P flag:
for imgs in $DIR
do
    LEVEL=$(echo $imgs | grep -Po '(?<=_)[0-9]{2}')
    echo $LEVEL
done
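If you want to avoid external tools inside the loop, here is a pure-bash sketch using parameter expansion (it assumes each name contains exactly one _ with the digits immediately after it):
name=AA13_11BB
level=${name#*_}           # strip everything up to and including the underscore -> 11BB
level=${level%%[^0-9]*}    # strip the trailing non-digits -> 11
echo "$level"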

Why is this grep filter slow?

I want to get the first two letters of every word in the BSD dict word list, excluding words that are only one letter long.
Without the one-letter exclusion it runs extremely fast:
time cat /usr/share/dict/web2 | cut -c 1-2 | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
real 0m0.227s
user 0m0.375s
sys 0m0.021s
grepping on '..', however, is painfully slow:
time cat /usr/share/dict/web2 | cut -c 1-2 | grep '..' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
real 1m16.319s
user 1m0.694s
sys 0m10.225s
What's going on here?
The problem is the UTF-8 locale; there is an easy workaround for a 100x speedup.
What's really slow on the Mac is the UTF-8 locale.
Replace grep '..' with LC_ALL=C grep '..' and your command will run over 100x faster.
This is probably true of Linux as well, except that a given Linux distro is more likely to default to the C locale.
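A sketch of the original pipeline with the override in place (it assumes the same dictionary file; only the grep stage needs the locale changed):
time cat /usr/share/dict/web2 | cut -c 1-2 | LC_ALL=C grep '..' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null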
I don't know why it is so awful, but one quick way to speed it up is to invert your grep(1) expression with -v and throw away all one-character lines:
$ time cat /usr/share/dict/words | cut -c 1-2 | grep -v '^.$' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
real 0m0.086s
user 0m0.090s
sys 0m0.000s
This might run a little faster and also removes the need for a separate cut:
cat /usr/share/dict/web2 | egrep -o '^.{2}' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
It might be even faster if you cut down on the excessive pipes and the useless use of cat:
awk 'length($0) > 1 { a[toupper(substr($0,1,2))]++ } END{ for(i in a) print i,a[i] }' file