How to find repeated lines with `uniq -d`

My data is in /tmp/1:
9367543
9105616
9108177
8948074
8860323
9170406
9105616
I run this and get nothing:
cat /tmp/1 | uniq -d
This is strange, since the man page says uniq -d should:
-d Only output lines that are repeated in the input.
How can I use uniq -d correctly?

You have to sort your data before you use uniq. It only removes/detects duplicates on adjacent lines.
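A minimal demonstration of the adjacency requirement (using printf here to stand in for a file):
printf 'a\nb\na\n' | uniq -d
prints nothing, because the two a lines are not adjacent, while
printf 'a\nb\na\n' | sort | uniq -d
prints a.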

Try this to double-check; it will output any lines that are duplicated:
cat /tmp/1 | awk 'seen[$0]++ == 1'
Oh, this is your problem:
cat /tmp/1 | sort | uniq -d
Sort it before running uniq!

awk '{_[$0]++}END{for(i in _)if(_[i]>1) print i}' /tmp/1
or just
awk '_[$0]++ == 1' file
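The terse _[$0]++ == 1 relies on the post-increment returning the old count. Written out with comments (the array renamed to count for readability, otherwise equivalent):
awk '{
  count[$0]++            # count how many times this exact line has appeared
  if (count[$0] == 2)    # fire on the second occurrence only,
      print              # so each duplicated line is printed exactly once
}' file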

Related

Grep with a regular expression

I need the content between the fourth and fifth "|" on all lines starting with FHEAD. The goal is to use a regular expression with grep to read the files.
I have this expression, which returns all content between "|":
(?<=\|)(.*?)(?=\|)
The goal in the example below would be to return
1047
8401-
FHEAD|1|PRMPC|20200217103050|1047|S
TMBPE|FHEAD|2|MOD
FHEAD|3|8401|230008|8401-|8401-Dcto|8401-Dcto 10FHEAD|1|235211|20190206000001|20190402235959|2||1||8||
TPGRP|4|240184
TGLIST|5|235213||||FHEAD
TLITM|6|101029605
TLITM|7|FHEAD101052978
Can someone help me? Thanks in advance.
To print the content of the fifth field (non-empty) on lines starting with FHEAD:
awk -F'|' '$1=="FHEAD" && $5!=""{print $5}' file
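Run against the sample input above, that produces exactly the two requested values:
awk -F'|' '$1=="FHEAD" && $5!=""{print $5}' file
1047
8401-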
awk -F '|' '$5=="1047" || $5=="8401-"{ print $0 }' inputfile.txt
The above will find "1047" or "8401-" in the fifth column of "inputfile.txt".
grep -E "\|1047\||\|8401-\|" inputfile.txt
The above will do the same with grep (but it will not be restricted to column 5).
EDIT:
I must have missed the 'starting with FHEAD'....
awk -F\| '/^FHEAD/{ print $5 }' inputfile.txt
or with grep
grep -e '^FHEAD|\(.[^|]*|\)\{3\}\(.[^|]*\)' -o inputfile.txt | grep '.[^|]*|*' -o | grep -v '|$'
a combination of grep and cut:
grep -e '^FHEAD' inputfile.txt | cut -d'|' -f 5
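If you want a grep-only variant that is restricted to lines starting with FHEAD, a sketch using GNU grep's PCRE mode (-P, assuming your grep was built with PCRE support):
grep -oP '^FHEAD\|(?:[^|]*\|){3}\K[^|]*' inputfile.txt
Here \K discards everything matched so far, so only the fifth field is printed.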

grep command to find out how many times any character is followed by '.'

I have to find out, with the help of grep, how often each character is followed by a period (.). After finding how many times each character is followed by a period, I have to sort the result in ascending order.
For example in this string: "Find my input. Output should be obtained. You need to find output."
The output should be something like this:
d 1
t 2
What I have done so far :
cat filename | grep -o "*." | sort -u
But it is not working as intended.
Any ideas how to solve this? I have to perform this operation on a huge library of books in .txt files.
An iterative approach with GNU grep:
grep -o '.\.' filename | sort | uniq -c
Output:
1 d.
2 t.
To drop the trailing period, use a Perl-compatible lookahead:
grep -Po '.(?=\.)' filename | sort | uniq -c
Output:
1 d
2 t
Finally, swap the two columns with awk to get the requested format:
grep -Po '.(?=\.)' filename | sort | uniq -c | awk '{print $2,$1}'
Output:
d 1
t 2
With a single GNU awk process:
awk -v FPAT='.[.]' 'BEGIN{ PROCINFO["sorted_in"]="#ind_str_asc" }
{ for(i=1;i<=NF;i++) a[substr($i,1,1)]++ }
END{ for(i in a) print i,a[i] }' filename
The output:
d 1
t 2
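For reference, here is the same gawk program with comments added (behavior unchanged):
awk -v FPAT='.[.]' '    # a field is any single character followed by a literal dot
BEGIN{ PROCINFO["sorted_in"]="#ind_str_asc" }   # iterate array keys in ascending order
{ for(i=1;i<=NF;i++) a[substr($i,1,1)]++ }      # tally the character preceding each dot
END{ for(i in a) print i,a[i] }' filename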
This one works too (though note that rev also reverses the digits of the count, so it is only reliable while every count is a single digit):
echo "Find my input. Output should be obtained. You need to find output." | grep -o ".\." | sort | uniq -c | rev | tr -d .

Extract filenames that match a pattern, remove duplicates and store them in an array

I would like to know the easiest way to list part of each filename in a directory, without any duplicates.
Example:
A directory has files like this:
Stack1_over_flow.txt
Stack2_exchange.txt
Meta_stack.txt
Stack1_over_flow.txt
Meta_stack.txt
Now I want the result to be:
Stack1
Stack2
Meta
That is, extract the string that occurs before the first occurrence of "_" and remove any duplicates of that string.
ls -1 | awk '{split($0,a,"_"); print a[1]}' | sort -b | uniq
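To try it without creating files, you can feed the sample names in with printf instead of ls:
printf '%s\n' Stack1_over_flow.txt Stack2_exchange.txt Meta_stack.txt Stack1_over_flow.txt | awk '{split($0,a,"_"); print a[1]}' | sort -b | uniq
Meta
Stack1
Stack2
(sort reorders the prefixes alphabetically, which is also what makes uniq reliable here).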
Only files, with find:
find . -maxdepth 1 -type f -printf "%f\n" | awk '{split($0,a,"_"); print a[1]}' | sort -b | uniq
Using sed:
ls -1 | sed -r 's/([a-zA-Z0-9])_.*/\1/' | uniq
You can even try this:
ls -1 | cut -d "_" -f1 | uniq

Basic grep/sed/awk script to find duplicates

I'm starting out with regular expressions and grep and I want to find out how to do this. I have this list:
1. 12493 6530
2. 12475 5462
3. 12441 5450
4. 12413 5258
5. 12478 4454
6. 12416 3859
7. 12480 3761
8. 12390 3746
9. 12487 3741
10. 12476 3557
...
And I want to get the contents of the middle column only (so NF==2 in awk?). The delimiter here is a space.
I then want to find which numbers appear more than once (duplicates). How would I go about doing that? Thank you, I'm a beginner.
Using awk:
awk '{count[$2]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file
But you don't have duplicate numbers in the 2nd column.
the second column in awk is $2
count[$2]++ increments an array value, with the number being processed as the key
the END block is executed at the end, where we test each array value to find those greater than 1
And more concisely (credit to jthill):
awk '++count[$2]==2{print $2}' file
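Since the sample data has no duplicates in column 2, here is a quick check with one value repeated (input invented for the demonstration):
printf '1. 12493 6530\n2. 12475 5462\n3. 12493 5450\n' | awk '++count[$2]==2{print $2}'
12493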
Using perl:
perl -anE '$h{$F[1]}++; END{ say for grep $h{$_} > 1, keys %h }'
Iterate over the lines and build a hash (%h/$h{...}) with the count (++) of the second column values ($F[1]); after that (END{ ... }), say all hash keys whose count ($h{$_}) is > 1.
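A note on the flags: -n loops over the input, -a autosplits each line into @F (zero-indexed, so $F[1] is the second column), and -E enables say (perl 5.10+). Invoked with a file argument it looks like:
perl -anE '$h{$F[1]}++; END{ say for grep $h{$_} > 1, keys %h }' file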
With the data stored in test, using a combination of awk, sort, uniq and sed:
cat test | awk -v x=2 '{print $x}' | sort | uniq -c | sed '/^ *1 /d' | awk -v x=2 '{print $x}'
Explanation:
awk -v x=2 '{print $x}'
selects the 2nd column
uniq -c
counts the occurrences of each number
sed '/^ *1 /d'
deletes all entries that appear only once (uniq -c left-pads the count with spaces, hence the leading ' *')
awk -v x=2 '{print $x}'
strips the count again, leaving just the number

Using sed to get only the line number from "grep -in"

Which regexp should I use to get only the line number from the grep -in output?
The usual output is something like this:
241113:keyword
I need sed to output only "241113".
I suggest cut
grep -in keyword ... | cut -d: -f1
If you insist on sed:
grep -in keyword ... | sed 's/:.*$//'
You don't need to use sed. Cut is enough. Just pipe grep's output to
cut -d ':' -f 1
As an example:
grep -n blabla file.txt | cut -d ':' -f 1
Personally, I like awk
grep -in 'search' file | awk --field-separator : '{print $1}'
As said in other answers, cut is the right tool; but if you really want to use a swiss-army knife, you can also use awk:
grep -in keyword ... | awk -F: '{print $1}'
or using grep again:
grep -in keyword ... | grep -oE '^[0-9]+'
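For example, on the output shown in the question:
echo '241113:keyword' | grep -oE '^[0-9]+'
241113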
Just in case someone is wondering if all this could be done without grep, i.e. with sed alone ...
echo '
a
b
keyword
c
keyWord
x
y
keyword
Keyword
z
' |
sed -n '/[Kk][Ee][Yy][Ww][Oo][Rr][Dd]/{=;}'
#sed -n '/[Kk][Ee][Yy][Ww][Oo][Rr][Dd]/{=;q;}' # only line number of first match
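With GNU sed, the bracket-class workaround for case-insensitivity can be replaced by the I address modifier (a GNU extension):
sed -n '/keyword/I=' file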