Basic grep/sed/awk script to find duplicates - regex

I'm starting out with regular expressions and grep and I want to find out how to do this. I have this list:
1. 12493 6530
2. 12475 5462
3. 12441 5450
4. 12413 5258
5. 12478 4454
6. 12416 3859
7. 12480 3761
8. 12390 3746
9. 12487 3741
10. 12476 3557
...
And I want to get the contents of the middle column only (so NF==2 in awk?). The delimiter here is a space.
I then want to find which numbers are there more than once (duplicates). How would I go about doing that? Thank you, I'm a beginner.

Using awk:
awk '{count[$2]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file
But you don't have duplicate numbers in the 2nd column.
the second column in awk is $2
count[$2]++ increments an array element, using the number in the second column as the key
the END block is executed at the end; there we test each array entry and print the keys whose count is greater than 1
More concisely (credit to jthill):
awk '++count[$2]==2{print $2}' file
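For example, if the sample list also contained a hypothetical extra line such as 11. 12475 3300 (so that 12475 appears twice in the 2nd column), the one-liner would print that value as soon as it is seen for the second time:
$ awk '++count[$2]==2{print $2}' file
12475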

Using perl:
perl -anE '$h{$F[1]}++; END{ say for grep $h{$_} > 1, keys %h }' file
Iterate over the lines and build a hash (%h/$h{...}) that counts (++) the second-column values ($F[1]); afterwards (END{ ... }) say every hash key whose count ($h{$_}) is > 1.

With the data stored in test,
Using a combination of the awk, sort, uniq and sed commands:
cat test | awk -v x=2 '{print $x}' | sort | uniq -c | sed '/^ *1 /d' | awk -v x=2 '{print $x}'
Explanation:
awk -v x=2 '{print $x}'
selects 2nd column
uniq -c
counts the appearance of each number
sed '/^ *1 /d'
deletes all the entries that appear only once (the count from uniq -c is padded with leading spaces)
awk -v x=2 '{print $x}'
strips the count off again, leaving just the number
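To make the sed step concrete: with the same hypothetical duplicate 12475 as in the example above, the intermediate output of sort | uniq -c looks roughly like this (the count is right-justified with leading spaces):
      1 12390
      2 12475
      1 12493
      ...
sed '/^ *1 /d' then drops every count-1 line, and the final awk prints only the surviving numbers, here 12475.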

Related

count of extracted text for each number

I have a text file with a lot of SQL queries that look something like this...
select * from sometable where customernos like '%67890%';
select name, city from sometable where customernos like '%67890%';
select * from othertable where customernos like '%12345%';
I can get the count using a command like this...
grep 67890 file.txt | wc -l
But is there any way I can get the count of all customer numbers report like...
12345 1
67890 2
Could you please try the following.
awk '
match($0,/%[0-9]{5}/){
val[substr($0,RSTART+1,RLENGTH-1)]++
}
END{
for(i in val){
print i,val[i]
}
}' Input_file
For the shown samples, the output will be as follows.
12345 1
67890 2
Explanation: a commented version of the above code.
awk ' ##Starting awk program from here.
match($0,/%[0-9]{5}/){ ##Using match function to match a % followed by 5 digits here.
val[substr($0,RSTART+1,RLENGTH-1)]++ ##Incrementing array val, using the matched digits (the sub-string minus the leading %) as the index.
}
END{ ##Starting END block of this program from here.
for(i in val){ ##Traversing through val here.
print i,val[i] ##Printing value of i and value of array val with index i here.
}
}' Input_file ##Mentioning Input_file name here.
This might work for you (GNU grep, sort, uniq and awk):
grep -Eo '\b[0-9]{5}\b' file | sort -n | uniq -c | awk '{print $2,$1}'
Find the 5-digit numbers, sort them, count the occurrences of each, and then reverse the columns.
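To see what each stage contributes for the sample queries: grep -Eo '\b[0-9]{5}\b' file emits 67890, 67890 and 12345 (one per line), sort -n groups equal numbers together, and uniq -c turns them into padded counts:
      1 12345
      2 67890
The final awk then just swaps the two columns to produce the requested report.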
Just for fun, here is a sed solution:
sed -nE 'H;$!d;x;s/[^0-9]/ /g;s/ +/ /g;
:a;x;s/.*/1/;x;tb;
:b;s/^(( \S+\b).*)\2\b/\1/;Tc;x;s/.*/expr & + 1/e;x;tb;
:c;G;s/^ (\S+)(.*)\n(.*)/\1 \3\n\2/;/^[0-9]{5} /P;s/.*\n//;/\S/ba' file
Slurp the file into memory.
Space separate numbers.
Reduce multiple occurrences of the first number to one and count the occurrences.
Print the first number and its occurrences if it fits the criteria.
Repeat with all other numbers.

Add delimiters at specific indexes

I want to add a delimiter in some indexes for each line of a file.
I have a file with data:
10100100010000
20200200020000
And I know the offset of each column (2, 5 and 9)
With this sed command: sed 's/\(.\{2\}\)/&,/;s/\(.\{6\}\)/&,/;s/\(.\{11\}\)/&,/' myFile
I get the expected output:
10,100,1000,10000
20,200,2000,20000
but with a large number of columns (~200) and rows (300k) it is really slow.
Is there an efficient alternative?
1st solution: with GNU awk, could you please try the following:
awk -v OFS="," '{$1=$1}1' FIELDWIDTHS="2 3 4 5" Input_file
2nd solution: using sed, try the following.
sed 's/\(..\)\(...\)\(....\)\(.....\)/\1,\2,\3,\4/' Input_file
3rd solution: awk solution using substr.
awk 'BEGIN{OFS=","} {print substr($0,1,2) OFS substr($0,3,3) OFS substr($0,6,4) OFS substr($0,10,5)}' Input_file
In the substr solution above, substr($0,10,5) takes 5 characters starting at position 10; if you instead want everything from the 10th position to the end of the line, use substr($0,10).
Output will be as follows.
10,100,1000,10000
20,200,2000,20000
Modifying your sed command to make it add all the separators in one shot would likely make it perform better:
sed 's/^\(.\{2\}\)\(.\{3\}\)\(.\{4\}\)/\1,\2,\3,/' myFile
Or with extended regular expression:
sed -E 's/(.{2})(.{3})(.{4})/\1,\2,\3,/' myFile
Output:
10,100,1000,10000
20,200,2000,20000
With GNU awk for FIELDWIDTHS:
$ awk -v FIELDWIDTHS='2 3 4 *' -v OFS=',' '{$1=$1}1' file
10,100,1000,10000
20,200,2000,20000
You'll need a newer version of gawk for * at the end of FIELDWIDTHS to mean "whatever's left"; with older versions just choose a large number like 999.
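Since the real data has ~200 columns, you probably don't want to write the width list by hand. As a rough sketch (the helper name offsets_to_widths is made up here), you could derive FIELDWIDTHS from the known offsets and feed it to the gawk command above:
# hypothetical helper: turn cut offsets ("2 5 9") into a FIELDWIDTHS spec ("2 3 4 *")
offsets_to_widths() {
  local prev=0 widths= o
  for o in "$@"; do
    widths+="$((o - prev)) "   # each width is the gap since the previous offset
    prev=$o
  done
  echo "${widths}*"            # '*' = whatever is left (newer gawk; use 999 on older versions)
}
awk -v FIELDWIDTHS="$(offsets_to_widths 2 5 9)" -v OFS=',' '{$1=$1}1' file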
If you start the substitutions from the back, you can use the number flag to s to specify which occurrence of any character you'd like to append a comma to:
$ sed 's/./&,/9;s/./&,/5;s/./&,/2' myFile
10,100,1000,10000
20,200,2000,20000
You could automate that a bit further by building the command with a printf statement:
printf -v cmd 's/./&,/%d;' 9 5 2
sed "$cmd" myFile
or even wrap that in a little shell function so we don't have to care about listing the columns in reverse order:
gencmd() {
  local arr
  # Sort arguments in descending order
  IFS=$'\n' arr=($(sort -nr <<< "$*"))
  printf 's/./&,/%d;' "${arr[@]}"
}
sed "$(gencmd 2 5 9)" myFile

Printing Both Matching and Non-Matching Patterns

I am trying to compare two files and then return one of the files' columns upon a match. The code that I am using right now is excluding non-matching patterns and just printing out matching patterns. I need to print all results, both matching and non-matching, using grep.
File 1:
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
File 2:
F
A
B
Z
C
P
E
Current Result:
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
Expected Result:
F
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
P
E
Bash Code:
while IFS=',' read point lat lon; do
check=`grep "${point} /home/aaron/file2 | awk '{print $1}'`
echo "${check},${lat},${lon}"
done < /home/aaron/file1
In awk:
$ awk -F, 'NR==FNR{a[$1]=$0;next}{print ($1 in a?a[$1]:$1)}' file1 file2
F
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
P
E
Explained:
$ awk -F, ' # field separator to ,
NR==FNR { # file1
a[$1]=$0 # hash record to a, use field 1 as key
next
}
{
print ($1 in a?a[$1]:$1) # print match if found, else nonmatch
}
' file1 file2
If you don't care about order, there's a join binary in GNU coreutils that does just what you need:
$ sort file1 > sortedFile1
$ sort file2 > sortedFile2
$ join -t, -a 2 sortedFile1 sortedFile2
A,42.4,-72.2
B,47.2,-75.9
C,41.7,-95.2
E
F
P
Z,38.3,-70.7
It relies on files being sorted and will not work otherwise.
Now will you please get out of my /home/ ?
Another join-based solution, preserving the order:
f() { nl -nln -s, -w1 "$1" | sort -t, -k2; }; join -t, -j2 -a2 <(f file1) <(f file2) |
sort -t, -k2 |
cut -d, -f2 --complement
F
A,42.4,-72.2,2
B,47.2,-75.9,3
Z,38.3,-70.7,4
C,41.7,-95.2,5
P
E
It cannot beat the awk solution, but here is another alternative utilizing the unix toolchain, based on the decorate-undecorate pattern.
Problems with your current solution:
1. You are missing a double-quote in grep "${point} /home/aaron/file2.
2. You should start with the other file for printing all lines in that file
while IFS=',' read point; do
echo "${point}$(grep "${point}" /home/aaron/file1 | sed 's/[^,]*,/,/')"
done < /home/aaron/file2
3. The grep can give more than one result. Which one do you want (head -1)?
An improvement would be
while IFS=',' read point; do
echo "${point}$(grep "^${point}," /home/aaron/file1 | sed -n '1s/[^,]*,/,/p')"
done < /home/aaron/file2
4. Using while is the wrong approach.
For small files it will get the work done, but you will get stuck with larger files. The reason is that you will call grep for each line in file2, reading file1 over and over again.
Better is using awk or some other solution.
Another solution is using sed with the output of another sed command:
sed -r 's#([^,]*),(.*)#s/^\1$/\1,\2/#' /home/aaron/file1
This will give commands for the second sed.
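For the sample file1 the generated script looks like this:
s/^A$/A,42.4,-72.2/
s/^B$/B,47.2,-75.9/
s/^Z$/Z,38.3,-70.7/
s/^C$/C,41.7,-95.2/
Feeding that script into a second sed run over file2 then reproduces every line of file2, expanded where a match exists: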
sed -f <(sed -r 's#([^,]*),(.*)#s/^\1$/\1,\2/#' /home/aaron/file1) /home/aaron/file2

delimiter inside regular expression Awk

I have a term like aa-and-bb in the 10th column of a tab-delimited file, file.tsv.
I can get aa-and-bb as
cat file.tsv | awk 'BEGIN{FS="\t"};{print $10}'
How do I further get aa from aa-and-bb?
You can use split():
awk -F'\t' '{ split($10, arr, "-"); print arr[1] }' file.tsv
If you can guarantee no other -s in fields 1-9, you can add - as a separator:
awk -F'\t|-' '{print $10}'
I am guessing that all three terms, aa, and, and bb are variable, and you want only the first term.
cat file.tsv | awk 'BEGIN{FS="\t"};{print $10}' | sed 's/-.*$//'
$ awk -F'\t' '{sub(/-.*$/, "", $10);print $10}' file.tsv
aa
But it is not 100% clear how your data looks, so we are just guessing here that you want to split on the dash.

Extract substring from rows with regex and remove rows with duplicate substring

I have a text file with some rows in the following form
*,[anything, even blanks],[dog|log|frog],[dog|log|frog],[0|1],[0|1],[0|1]
I would like to remove duplicate rows that have the same value for * (case insensitive), i.e. anything left of ,[anything, even blanks],[dog|log|frog],[dog|log|frog],[0|1],[0|1],[0|1]
For example here's a sample text file
test,bar,log,dog,0,0,0
one
foo,bar,log,dog,0,0,0
/^test$/,bar,log,dog,0,0,0
one
FOO,,frog,frog,1,1,1
The resulting text file should have the duplicate foo removed (order does not matter to me so long as the duplicates are removed, leaving 1 unique)
test,bar,log,dog,0,0,0
one
/^test$/,bar,log,dog,0,0,0
one
FOO,,frog,frog,1,1,1
What's the simplest bash command I could do to achieve this?
awk -F, '!seen[tolower($1)]++' file
This keeps the first occurrence of each key: seen[tolower($1)] is 0 the first time a lower-cased first field is seen, so !seen[...]++ is true and the line is printed (the default action); for later occurrences the count is non-zero and the line is suppressed.
You can do this with awk like this (since you don't care which of the duplicates gets kept):
awk -F, '{lines[tolower($1)]=$0}END{for (l in lines) print lines[l]}'
If you wanted to keep the first instead:
awk -F, '{if (lines[tolower($1)]!=1) { print; lines[tolower($1)]=1 } }'
Search for
(?:(?<=\n)|^)(.*)((?:,(?:d|l|fr)og){2}(?:,[01]){3})(?=\n)([\s\S]*)(?<=\n).*\2(?:\n|$)
...and replace with
$1$2$3
#!/bin/bash
for line in $(cat "$1")
do
  key=$( echo "${line%%,*}" | awk '{print tolower($0)}')
  found=0
  for k in "${keys[@]}" ; do [[ "$k" == "$key" ]] && found=1 && break ; done
  (( found )) && continue
  echo "$line"
  keys=( "${keys[@]}" "$key" )
done
This uses a plain array instead of an associative array (hash), which is less performant, but it seems to work.
This might work for you (GNU sed):
cat -n file |
sort -fk2,2 |
sed -r ':a;$!N;s/^.{7}([^,]*),[^,]*(,(d|l|fr)og){2}(,[01]){3}\n(.{7}\1,[^,]*(,(d|l|fr)og){2}(,[01]){3})$/\5/i;ta;P;D' |
sort -n |
sed -r 's/^.{7}//'
Number each line.
Sort by the first key (ignoring case).
Remove duplicates (based on the specific criteria).
Sort the reduced file back into its original order.
Remove the line numbers.