uniq treats lines as equal when they are not

I would expect different output from this command:
$ echo -e "あいうえお\nオエウイア" | uniq -c
2 あいうえお
The two lines are not the same.
Compare to this example, working as expected:
$ echo -e "aiueo\noeuia" | uniq -c
1 aiueo
1 oeuia
Is this a Unicode or UTF-8 issue? I did not find any option to support "exotic" characters.
Edit: I am experiencing a similar problem when using sort with Japanese input. Input of the form a\nb\na\nb\n (or, omitting '\n', abab) stays that way; I would expect it to be aabb or at least bbaa.

There you go:
echo -e "あいうえお\nオエウイア" | uni2ascii -q | uniq -c | ascii2uni
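This is typically a collation issue: in some UTF-8 locales the library's string comparison treats these two lines as equal, which affects both uniq and sort. A workaround worth trying (a sketch, assuming that is the cause here) is to force byte-wise comparison with the C locale:
echo -e "あいうえお\nオエウイア" | LC_ALL=C uniq -c
The same LC_ALL=C override in front of sort addresses the similar problem mentioned in the edit.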

grep command to find out how many times any character is followed by '.'

I have to find out how often any character is followed by a period (.) with the help of grep. After counting how many times each character is followed by a period, I have to sort the result in ascending order.
For example in this string: "Find my input. Output should be obtained. You need to find output."
The output should be something like this:
d 1
t 2
What I have done so far :
cat filename | grep -o "*." | sort -u
But it is not working as intended.
Any ideas how to solve this? I have to perform this operation on huge library of books in .txt files.
A step-by-step approach with GNU grep:
grep -o '.\.' filename | sort | uniq -c
Output:
1 d.
2 t.
To drop the trailing period, use a lookahead assertion:
grep -Po '.(?=\.)' filename | sort | uniq -c
Output:
1 d
2 t
To match the requested output format, swap the columns with awk:
grep -Po '.(?=\.)' filename | sort | uniq -c | awk '{print $2,$1}'
Output:
d 1
t 2
With a single GNU awk process:
awk -v FPAT='.[.]' 'BEGIN{ PROCINFO["sorted_in"]="@ind_str_asc" }  # traverse the array in ascending key order
{ for(i=1;i<=NF;i++) a[substr($i,1,1)]++ }  # each field is a character followed by "."; count that character
END{ for(i in a) print i,a[i] }' filename
The output:
d 1
t 2
This one works too:
echo "Find my input. Output should be obtained. You need to find output."| grep -o ".\." | sort | uniq -c | rev | tr -d .

Using math and grep

I need to find the number of invalid email addresses in a file, basically any line that contains "@" but is not in the correct format.
I am using this to count the number of valid email addresses:
grep -Ei '[A-Z0-9.-]+@[A-Z0-9.-]+\.[A-Z]{3}' $1 | wc -l
and this to calculate how many lines contain @:
grep -E '@' $1 | wc -l
Is there a way I can subtract the number of valid emails from the number of lines that contain @ anywhere, before printing with wc -l?
grep has a -c option to print just the count of matching lines; use that instead of spawning another process and an anonymous pipe:
grep -c '<pattern>' file.txt
To subtract the counts from two searches, you can subtract them directly using command substitution:
echo $(( $(grep -c '<pattern_1>' file.txt) - $(grep -c '<pattern_2>' file.txt) ))
If you fancy, you can use two variables as well:
count_1=$(grep -c '<pattern_1>' file.txt)
count_2=$(grep -c '<pattern_2>' file.txt)
echo $(( count_1 - count_2 ))
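If you would rather make a single pass over the file, here is a sketch with awk (it reuses the simplified pattern from the question and assumes an awk with {n} interval support, such as GNU awk):
awk '/@/ { total++ }                                              # every line containing @
     /[A-Za-z0-9.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{3}/ { valid++ }     # lines matching the question's pattern
     END { print total - valid }' "$1"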

grep not counting characters accurately when they are clearly in file

I'm trying to count the number of times '(' appears in a file. I get a number back, but it's never accurate.
Why won't grep accurately count the occurrences of this character? It needs to work across multiple lines and count every occurrence.
I imagine my regex is off, but it's so simple.
log.txt:
(eRxîó¬Pä^oË'AqŠêêÏ-04ây9Í&ñ­ÖbèaïÄ®h0FºßôÊ$&Ð>0dÏ“ ²ˆde^áä­ÖÚƒíZÝ*ö¨tM
variable 1
paren )
(¼uC¼óµr\=Œ"J§ò<ƒu³ÓùËP
<åÐ#ô{ô
½ÊªÆÏglTµ¥>¦³éùoÏWÛz·ób(ÈIH|TT]
variable 0
paren )
Output:
$ grep -o "(" log.txt | wc -l
1
EDIT:
I had a weird mix of encodings, so I dumped the file and counted the hex values.
hexdump -C hex.txt | grep "28" | wc -l
You might have encoding issues if you interpret a single-byte encoding in a multibyte locale. (GNU grep likely detected the file as binary and printed a single "Binary file ... matches" line rather than the individual matches, which would explain why wc -l reported 1.) Here's an approach that deletes everything except '(' (in a single-byte locale), then counts the remaining characters:
LC_ALL=C <log.txt tr -c -d '(' | wc -c
Dump the unknown encoding and count hex values.
hexdump -C hex.txt | grep "28" | wc -l
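Note that grepping the hexdump output for "28" counts matching lines rather than bytes, and it can also hit offsets or neighbouring byte values. A sketch that counts each 0x28 byte individually (assumes a POSIX od; the file name follows the question's log.txt):
od -An -tx1 log.txt | tr ' ' '\n' | grep -c '^28$'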
Using sed (then counting with wc, because doing it in sed alone is a bit heavy):
sed -e '1h;1!H;$!d' -e 'x;s/[^(]//g' yourfile | tr -d '\n' | wc -c
Using awk:
awk -F '(' '{ Total += NF - 1 } END { print Total }' YourFile

Selectively extract number from file name

I have a list of files with names of the form AA13_11BB, CC290_23DD, EE92_34RR. I need to extract only the numbers after the _ character, not the ones before. For those three file names I would like to get 11, 23, 34 as output, and after each extraction, store the number in a variable.
I'm very new to bash and regex. Currently, from AA13_11BB, I am able to either obtain 13_11:
for imgs in $DIR; do
    LEVEL=$(echo $imgs | egrep -o [_0-9]+);
done
or two separate numbers 13 and 11:
LEVEL=$(echo $imgs | egrep -o [0-9]+)
May I please have some advice how to obtain my desired output? Thank you!
Use egrep with sed:
LEVEL=$(echo $imgs | egrep -o '_[0-9]+' | sed 's/_//' )
To complement the existing helpful answers, using the core of hjpotter92's answer:
The following processes all filenames in $DIR in a single command and reads all extracted tokens into array:
IFS=$'\n' read -d '' -ra levels < \
<(printf '%s\n' "$DIR"/* | egrep -o '_[0-9]+' | sed 's/_//')
IFS=$'\n' read -d '' -ra levels splits the input into lines and stores them as elements of array ${levels[@]}.
<(...) is a process substitution that allows the output from a command to act as an (ephemeral) input file.
printf '%s\n' "$DIR"/* uses pathname expansion to output each filename on its own line.
egrep -o '_[0-9]+' | sed 's/_//' is the same as in hjpotter92's answer - it works equally on multiple input lines, as is the case here.
To process the extracted tokens later, use:
for level in "${levels[@]}"; do
    echo "$level" # work with $level
done
You can do it in one sed using the regex .*_([0-9]+).* (escape it properly for sed):
sed "s/.*_\([0-9]\+\).*/\1/" <<< "AA13_11BB"
It replaces the line with the first captured group (the sub-regex inside the ()), outputting:
11
In your script:
LEVEL=$(sed "s/.*_\([0-9]\+\).*/\1/" <<< $imgs)
Update: as suggested by @mklement0, in both BSD sed and GNU sed you can shorten the command using the -E parameter:
LEVEL=$(sed -E "s/.*_([0-9]+).*/\1/" <<< $imgs)
Using grep with the -P flag:
for imgs in $DIR
do
    LEVEL=$(echo $imgs | grep -Po '(?<=_)[0-9]{2}')
    echo $LEVEL
done
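If you want to avoid external tools inside the loop, here is a pure-bash sketch using parameter expansion (it assumes each name contains exactly one _ with the digits immediately after it):
name=AA13_11BB
level=${name#*_}           # strip everything up to and including the underscore -> 11BB
level=${level%%[^0-9]*}    # strip the trailing non-digits -> 11
echo "$level"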

Why is this grep filter slow?

I want to get the first two letters of every word in the BSD dict word list, excluding words that are only one letter long.
Without the one-letter exclusion it runs extremely fast:
time cat /usr/share/dict/web2 | cut -c 1-2 | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
real 0m0.227s
user 0m0.375s
sys 0m0.021s
grepping on '..', however, is painfully slow:
time cat /usr/share/dict/web2 | cut -c 1-2 | grep '..' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
real 1m16.319s
user 1m0.694s
sys 0m10.225s
What's going on here?
The problem is the UTF-8 locale; there is an easy workaround for a 100x speedup.
What's really slow on the Mac is the UTF-8 locale.
Replace grep '..' with LC_ALL=C grep '..' and your command will run over 100x faster.
This is probably true of Linux as well, except that a given Linux distro is more likely to default to the C locale.
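A sketch of the original pipeline with the override in place (it assumes the same dictionary file; only the grep stage needs the locale changed):
time cat /usr/share/dict/web2 | cut -c 1-2 | LC_ALL=C grep '..' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null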
I don't know why it is so awful, but one quick way to speed it up is to invert your grep(1) expression with -v and throw away all one-character lines:
$ time cat /usr/share/dict/words | cut -c 1-2 | grep -v '^.$' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
real 0m0.086s
user 0m0.090s
sys 0m0.000s
This might run a little faster and also removes the need for a separate cut:
cat /usr/share/dict/web2 | egrep -o '^.{2}' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
It might be even faster if you cut down on the excessive pipes and the useless use of cat:
awk 'length($0) > 1 { a[toupper(substr($0,1,2))]++ } END{ for(i in a) print i,a[i] }' file