Extract unique lines by pattern - regex

I have a bash regex question: I need to print the unique lines from a list.
The list contains email addresses. Some lines are repeated many times, and some share the same ID and password but use a different mail provider.
The list looks as follows:
firstman#gmail.com:pass1234
someguy#yahoo.com:onepass789
secondman#gmail.com:looksPass
firstman#yahoo.com:pass1234
thirdman#cox.net:mypas345
someguy#mail.com:onepass789
firstman# and someguy# each appear twice, but with different mail providers.
I need to get following output:
firstman#gmail.com:pass1234
someguy#yahoo.com:onepass789
secondman#gmail.com:looksPass
thirdman#cox.net:mypas345
uniq -u only does part of the job: it compares whole lines, whereas I need to compare only the parts outside the #emailprovider: pattern.
How can I "discard" this pattern while extracting unique lines?

With AWK you can say:
awk -F'[#:]' '!seen[$1,$3]++' inputlist
yields:
firstman#gmail.com:pass1234
someguy#yahoo.com:onepass789
secondman#gmail.com:looksPass
thirdman#cox.net:mypas345
-F'[#:]' sets the field separator to either "#" or ":".
Then $1 holds the string before "#" and $3 holds the string after ":".
The condition '!seen[$1,$3]++' tells AWK to print the line only if the ($1,$3) pair has not been seen before.
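For readers new to the idiom, here is a long-hand sketch of what '!seen[$1,$3]++' does. The variable names input, dedup and key are only illustrative; the sample data is the list from the question.

```shell
# Sample input from the question.
input='firstman#gmail.com:pass1234
someguy#yahoo.com:onepass789
secondman#gmail.com:looksPass
firstman#yahoo.com:pass1234
thirdman#cox.net:mypas345
someguy#mail.com:onepass789'

# Long-hand equivalent of the one-liner.
dedup=$(printf '%s\n' "$input" | awk -F'[#:]' '
{
    key = $1 SUBSEP $3        # user name and password, provider ignored
    if (!(key in seen)) {     # first time this pair shows up
        seen[key] = 1
        print                 # keep the whole original line
    }
}')
printf '%s\n' "$dedup"
```

The short form works because seen[$1,$3]++ yields the old count (0 the first time), and ! turns that into "true, so print".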

If you flip the fields around, you can use uniq --skip-fields=1 (or -f 1) so that the first field is ignored when lines are compared for uniqueness.
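One possible reading of the "flip the fields" idea, as a rough sketch only: move the provider into a blank-separated first field, sort so that equal user:pass pairs are adjacent, and let uniq -f 1 skip the provider. It assumes that neither user names nor passwords contain blanks, "#" or ":", that sort supports -s (GNU and BSD sort do), and that losing the original line order is acceptable.

```shell
input='firstman#gmail.com:pass1234
someguy#yahoo.com:onepass789
secondman#gmail.com:looksPass
firstman#yahoo.com:pass1234
thirdman#cox.net:mypas345
someguy#mail.com:onepass789'

# 1. Flip: "user#provider:pass" becomes "provider user:pass".
# 2. Stable sort on field 2 so equal user:pass pairs become adjacent.
# 3. uniq -f 1 ignores field 1, the provider, when comparing.
# 4. Flip back to the original layout.
flipped_dedup=$(printf '%s\n' "$input" |
    awk -F'[#:]' '{ print $2, $1 ":" $3 }' |
    LC_ALL=C sort -s -k2,2 |
    uniq -f 1 |
    awk '{ split($2, a, ":"); print a[1] "#" $1 ":" a[2] }')
printf '%s\n' "$flipped_dedup"
```

Unlike the awk one-liner, the result comes out sorted by user name rather than in input order.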

You can use the following awk command instead:
awk -F# '!s[$1]{s[$1]=1;print}' filename

Related

Extract multiple independent regex matches per line

For the file below, I want to extract the two strings following "XC:Z:" and "XM:Z:". For example:
1st line output should be this: "TGGTCGGCGCGT, GAGTCCGT"
2nd line output should be this: "GAAGCCGCTTCC, ACCGACGG"
The original version of the file has a few more columns and millions of rows than the following example, but it should give you the idea:
MOUSE_10 XC:Z:TGGTCGGCGCGT RG:Z:A XM:Z:GAGTCCGT ZP:i:33
MOUSE_10 XC:Z:GAAGCCGCTTCC NM:i:0 XM:Z:ACCGACGG AS:i:16
MOUSE_10 ZP:i:36 XC:Z:TCCCCGGGTACA NM:i:0 XM:Z:GGGACGGG ZP:i:28
MOUSE_10 XC:Z:CAAATTTGGAAA RG:Z:A NM:i:1 XM:Z:GCAGATAG
In addition, each of following criteria would be a bonus but is not mandatory if you can get it to work:
use standard bash tools: awk, sed, grep, etc. (no GAWK, csvtools,...)
assume we don't know the order in which XC and XM appear (although I'm fairly certain XC is almost first, but I am unsure how to check). In the output, however, the XC-string should always be before the XM-string, if at all possible.
The answers to "awk extract multiple groups from each line" come awfully close to it, but whenever I try using match(...) I get a "syntax error near unexpected token" message.
Looking forward to your solutions!
Thanks,
Felix
With sed you can capture non-space characters after XC:Z: and XM:Z:
sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/p;' file
You can add a second s command for the reversed order (XM before XC); swapping the back-references keeps the XC value first:
sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/;s/.*XM:Z:\([^[:blank:]]*\).*XC:Z:\([^[:blank:]]*\).*/\2, \1/;p;' file
The following awk solution may also help.
awk '
/XC:Z:/{
match($0,/XC:[^ ]*/);
num=split(substr($0,RSTART,RLENGTH),a,":");
match($0,/XM:[^ ]*/);
num1=split(substr($0,RSTART,RLENGTH),b,":");
print a[num],b[num1]
}' Input_file
Output will be as follows.
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
If we don't know the order in which XC and XM appear
You can try this sed
sed -E 'h;s/(XC:Z:.*XM:Z:)//;tA;x;s/(.*XM:Z:)([^[:blank:]]*)(.*XC:Z:)([^[:blank:]]*)(.*)/\4,\2/;b;:A;x;s/(.*XC:Z:)([^[:blank:]]*)(.*XM:Z:)([^[:blank:]]*)(.*)/\2,\4/' infile
explanation :
sed -E '
h
# keep the line in the hold space
s/(XC:Z:.*XM:Z:)//;x;tA
# if XC:Z: comes before XM:Z:, go to A, but first restore the pattern space with x
s/(.*XM:Z:)([^[:blank:]]*)(.*XC:Z:)([^[:blank:]]*)(.*)/\4,\2/
# XM:Z: comes before XC:Z:, so extract the interesting parts and reorder them
b
# that is all for this line
:A
s/(.*XC:Z:)([^[:blank:]]*)(.*XM:Z:)([^[:blank:]]*)(.*)/\2,\4/
# XC:Z: comes before XM:Z:, so extract the interesting parts
' infile
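To see the branching at work, here is a small sketch that runs a trimmed version of the script on two made-up lines, one in each order. The second sample line is not from the question's file; it is the question's second row with XC and XM swapped. The capture groups are reduced to the two that are actually needed.

```shell
sample='MOUSE_10 XC:Z:TGGTCGGCGCGT RG:Z:A XM:Z:GAGTCCGT ZP:i:33
MOUSE_10 XM:Z:ACCGACGG NM:i:0 XC:Z:GAAGCCGCTTCC AS:i:16'

pairs=$(printf '%s\n' "$sample" | sed -E '
h
s/XC:Z:.*XM:Z://
x
tA
s/.*XM:Z:([^[:blank:]]*).*XC:Z:([^[:blank:]]*).*/\2,\1/
b
:A
s/.*XC:Z:([^[:blank:]]*).*XM:Z:([^[:blank:]]*).*/\1,\2/
')
printf '%s\n' "$pairs"
```

Both lines come out with the XC value first. Lines that contain neither order of the two tags are printed unchanged, as in the original script.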
another awk
$ awk '{c=p=""; # need to reset c and p before each line
for(i=1;i<=NF;i++) # for all fields in the line
if($i~/^XC:Z:/) c=substr($i,6) # check pattern from the start of field
else if($i~/^XM:Z:/) p=substr($i,6) # otherwise check the other pattern
if(c && p) print c,p}' file # if both matched print
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
This prints the last matches if there are multiple instances on the same line. Here is another one with slightly different characteristics.
$ awk 'function s(x) {return ($i~x)?substr($i,6):""}
{c=p="";
for(i=1;i<=NF;i++) {
c=c?c:s("^XC:Z:"); p=p?p:s("^XM:Z:");
if(c && p)
{print c,p; next}}}' file
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
This keeps the first match of each tag and stops at the first point where both have been seen. If they appear in pairs, it prints the first pair.
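The difference between the two variants only shows up when a tag is repeated. A quick sketch on a made-up line (the values AAA to DDD are placeholders, not real data):

```shell
line='MOUSE_10 XC:Z:AAA XC:Z:BBB XM:Z:CCC XM:Z:DDD'

# First variant: later fields overwrite earlier ones, so the last match wins.
last=$(printf '%s\n' "$line" | awk '{c=p=""
  for(i=1;i<=NF;i++)
    if($i~/^XC:Z:/) c=substr($i,6)
    else if($i~/^XM:Z:/) p=substr($i,6)
  if(c && p) print c,p}')

# Second variant: a value is kept once set, and the line is done
# as soon as both values are known.
first=$(printf '%s\n' "$line" | awk 'function s(x) {return ($i~x)?substr($i,6):""}
  {c=p=""
   for(i=1;i<=NF;i++) {
     c=c?c:s("^XC:Z:"); p=p?p:s("^XM:Z:")
     if(c && p) {print c,p; next}}}')

printf 'last wins:  %s\nfirst wins: %s\n' "$last" "$first"
```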
Using POSIX awk, you can only use the string-function match(s,ere) as defined by IEEE Std 1003.1-2008 :
match(s, ere)
Return the position, in characters, numbering from 1, in
string s where the extended regular expression ere occurs, or zero if
it does not occur at all. RSTART shall be set to the starting position
(which is the same as the returned value), zero if no match is found;
RLENGTH shall be set to the length of the matched string, -1 if no
match is found.
The patterns you want to match are XM:Z:[^[:blank:]]* and XC:Z:[^[:blank:]]*. This however assumes you do not have any string which contains something like PXM:Z: (i.e. an extra non-blank character advancing the searched string). When the pattern is found in the line $0, then you only need to extract the important parts, which start 5 characters later.
The following code does the above:
awk '{match($0,/XM:Z:[^[:blank:]]*/);xm=substr($0,RSTART+5,RLENGTH-5)}
{match($0,/XC:Z:[^[:blank:]]*/);xc=substr($0,RSTART+5,RLENGTH-5)}
{print xc","xm}' <file>
As you can see, the first line extracts XM, the second XC and the third prints the outcome with comma-separator ",".
Remark - The following assumptions are made here :
each line contains both an xm and xc string
no strings of the type [^[:blank:]]X[CM]:Z:[^[:blank:]]* exist
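If those two assumptions are too strong for your data, a guarded variant could look like the sketch below. The helper name tag is hypothetical, the sample lines are made up to trigger both cases, and [ \t] is used in place of [[:blank:]] only to stay friendly to older awk builds.

```shell
sample='MOUSE_10 XC:Z:TGGTCGGCGCGT RG:Z:A XM:Z:GAGTCCGT ZP:i:33
MOUSE_10 PXM:Z:BOGUS XM:Z:ACCGACGG XC:Z:GAAGCCGCTTCC
MOUSE_10 XC:Z:ONLYXC NM:i:0'

guarded=$(printf '%s\n' "$sample" | awk '
# tag(line, name): value of the blank-delimited field name:Z:..., or "".
function tag(line, name,    padded) {
    padded = " " line    # so a tag at line start also has a blank before it
    if (!match(padded, "[ \t]" name ":Z:[^ \t]*")) return ""
    return substr(padded, RSTART + 6, RLENGTH - 6)   # skip blank plus 5-char prefix
}
{
    xc = tag($0, "XC"); xm = tag($0, "XM")
    if (xc != "" && xm != "") print xc "," xm        # skip lines missing a tag
}')
printf '%s\n' "$guarded"
```

Requiring a blank before the tag means PXM:Z:BOGUS is not mistaken for an XM tag, and the line that has no XM tag at all is dropped instead of producing a half-empty pair.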
If you are willing to use gawk, you can use its patsplit function. A single regex, /X[CM]:Z:[^[:blank:]]*/, is enough: one call returns both matched strings, each still carrying its XC:Z: or XM:Z: prefix. Afterwards you can easily sort them and strip the prefixes.
The following lines do exactly the same in gawk
gawk '{patsplit($0,a,/X[MC]:Z:[^[:blank:]]*/) }
{xc=(a[1]~/^XC/)?a[1]:a[2]; xm=(a[1]~/^XC/)?a[2]:a[1]}
{print substr(xc,6)","substr(xm,6)}' <file>
Nonetheless, I believe the awk solution is cleaner from a symmetric point of view.

remove duplicate lines (only first part) from a file

I have a list like this
ABC|Hello1
ABC|Hello2
ABC|Hello3
DEF|Test
GHJ|Blabla1
GHJ|Blabla2
And I want it to be this:
ABC|Hello1
DEF|Test
GHJ|Blabla1
So I want to remove the lines whose part before the | duplicates that of an earlier line,
keeping only the first one.
A simple way using awk
$ awk -F"|" '!seen[$1]++ {print $0}' file
ABC|Hello1
DEF|Test
GHJ|Blabla1
The trick here is to set the appropriate field separator, "|" in this case, after which the individual columns can be accessed starting with $1. In this answer, I maintain an array seen keyed by the values of $1, and print a line only if its $1 value has not been seen previously.
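To make the counter visible, this small sketch prints, for three of the sample lines, the value that seen[$1]++ returns and the resulting decision. The names trace and before are only illustrative.

```shell
trace=$(printf '%s\n' 'ABC|Hello1' 'ABC|Hello2' 'DEF|Test' |
    awk -F'|' '{ before = seen[$1]++          # old count, 0 on first sight
                 print $0, "seen_before=" before, (before ? "drop" : "keep") }')
printf '%s\n' "$trace"
```

A line is kept exactly when the old count is 0, which is what the leading ! in the one-liner tests.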

Trim, from a list of email addresses, all addresses matching a "forbidden domain" list

I have a list of email addresses (in a text file, one address per line):
u1#d1.com
u2#d1.com
u3#d1.com
u1#d2.com
u1#d3.com
u1#d4.com
u2#d4.com
I also have a list of domains (in a text file, one domain per line):
d1.com
d2.com
I am trying to write two bash scripts:
One that will return a list that excludes any email address that matches ANY ONE of the domains in the second list (I will consider those ones the "good" ones)
One that will return a list with ONLY the email addresses that match ANY ONE of the domains in the second list (I will delete users from my site who belong to those addresses)
What's the best, easiest way to get this done? I am rusty with bash, and am finding it tricky. The regular expression is basic.
Please note that I am not after complete solutions, but "key commands" to make this happen.
Use grep command, like:
grep -f allowed_domains emails
to get the allowed emails, where "allowed_domains" is the second file you show in the question and "emails" is the first one. Add "-v" for the not allowed emails.
If you want something stricter, add a "#" at the start of each allowed_domain line. For example:
cat allowed_domains | xargs -L1 printf "#%s\n" | grep -f - emails
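A variation on the same idea, as a sketch: since grep -f reads each line as a regular expression, the "." in a domain matches any character, so adding -F (fixed strings) may be safer. The temporary directory and the file names emails, domains and patterns are made up for the demo; the data is from the question.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'u1#d1.com' 'u2#d1.com' 'u3#d1.com' 'u1#d2.com' \
              'u1#d3.com' 'u1#d4.com' 'u2#d4.com' > "$workdir/emails"
printf '%s\n' 'd1.com' 'd2.com' > "$workdir/domains"

# Prefix each domain with "#" and match it as a fixed string.
sed 's/^/#/' "$workdir/domains" > "$workdir/patterns"
matching=$(grep -F -f "$workdir/patterns" "$workdir/emails")
remaining=$(grep -v -F -f "$workdir/patterns" "$workdir/emails")
rm -r "$workdir"

printf 'matching the list:\n%s\nnot matching:\n%s\n' "$matching" "$remaining"
```

This is still a substring match, so "#d1.com" would also hit an address ending in "#d1.community"; the awk approach, which compares the whole domain field, does not have that problem.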
You can use this awk command:
awk -F# 'NR==FNR{dom[$0]; next} {print > (($2 in dom)? "bad.txt":"good.txt")}' file2 file1
cat good.txt
u1#d3.com
u1#d4.com
u2#d4.com
cat bad.txt
u1#d1.com
u2#d1.com
u3#d1.com
u1#d2.com

how to grep exact string match across 2 files

I've UTF-8 plain text lists of usernames, 1 per line, in list1.txt and list2.txt. Note, in case pertinent, that usernames may contain regex characters e.g. ! ^ . ( and such as well as spaces.
I want to get and save to matches.txt a list of all unique values occurring in both lists. I've little command line expertise but this almost gets me there:
grep -Ff list1.txt list2.txt > matches.txt
...but that is treating "jdoe" and "jdoe III" as a match, returning "jdoe III" as the matched value. This is incorrect for the task. I need the per-line pattern match to be the whole line, i.e. from ^ to $. I've tried adding the -x flag but that gets no matches at all (edit: see comment to accepted answer - I got the flag order wrong).
I'm on OS X 10.9.5 and I don't have to use grep - another command line (tool) solving the problem will do.
All you need to do is add the -x flag to your grep query:
grep -Fxf list1.txt list2.txt > matches.txt
The -x flag will restrict matches to full line matches (each PATTERN becomes ^PATTERN$). I'm not sure why your attempt at -x failed. Maybe you put it after the -f, which must be immediately followed by the first file?
This awk will be handier than grep here:
awk 'FNR==NR{a[$0]; next} $0 in a' list1.txt list2.txt > matches.txt
$0 is the line, FNR is the line number within the current file, and NR is the overall line number (they are only the same while reading the first file). a[$0] is an associative array (hash) whose key is the line. next ensures that further clauses (the $0 in a) do not run if the current clause (the one for the first file) did. $0 in a is true when the current line is a key in the array a, so only lines present in both files are displayed. The order is their order of occurrence in the second file.
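A small sketch to check the whole-line behaviour on names that contain regex characters and spaces. The user names and the temporary directory are invented for the demo.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'jdoe' 'a.b!' 'mary ann' > "$workdir/list1.txt"
printf '%s\n' 'jdoe III' 'a.b!' 'axb!' 'mary ann' 'jdoe' > "$workdir/list2.txt"

# Hash lookup on the whole line: no regex involved at all.
matches=$(awk 'FNR==NR { a[$0]; next } $0 in a' \
    "$workdir/list1.txt" "$workdir/list2.txt")

# grep with fixed strings (-F) and whole-line matching (-x) for comparison.
same=$(grep -Fxf "$workdir/list1.txt" "$workdir/list2.txt")
rm -r "$workdir"

printf '%s\n' "$matches"
```

Neither "jdoe III" nor "axb!" is reported, and both commands give the same list in the order of the second file.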
A very simple and straightforward way to do it that doesn't require one to do all sorts of crazy things with grep is as follows
cat list1.txt list2.txt|grep match > matches.txt
Not only that, but it's also easier to remember, (especially if you regularly use cat).
grep -Fwf file1 file2 would match word to word !!

Extracting a string between two patterns in bash with a regex

I have a string with key/value pairs in a bash variable. The value I want is hidden like this.
{"keyIDontCareAbout"=>"valueIDontCareAbout",
"keyForValueIWant"=>"valueIWant",
...............bunch more keys
}
What should I use to extract that value? sed, awk, expr match?
My thinking is this, I should extract the string that is preceded by "keyForValueIWant"=>" and is followed by " but I'm having a hard time deciding which tool to use.
expr match seems bad, because it grabs a string at the end of an expression or at the beginning of one, but my string is in the middle of a bunch of characters.
Basically, I can't figure out the regex syntax for a substring between two other substrings.
You can use the following sed command:
valueOfInterest=$(sed -n '/keyForValueIWant/ s/.*=>"\([^"]*\).*/\1/p' <<< "$input")
-n disables output by default. The regex /keyForValueIWant/ restricts the following action to the line or lines that match it. The substitute command then reduces the line to the value, and its p flag prints the result.
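A quick sketch of the command on a sample input. The third key/value pair is invented to stand in for the elided keys, and the address is tightened to "keyForValueIWant"=> (with the quotes) as an extra precaution, so that a value which merely contains the key name does not trigger the substitution. printf is used instead of a here-string only to keep the sketch portable beyond bash.

```shell
input='{"keyIDontCareAbout"=>"valueIDontCareAbout",
"keyForValueIWant"=>"valueIWant",
"anotherKey"=>"anotherValue"
}'

valueOfInterest=$(printf '%s\n' "$input" |
    sed -n '/"keyForValueIWant"=>/ s/.*=>"\([^"]*\).*/\1/p')
printf '%s\n' "$valueOfInterest"
```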
Try awk as follows:
# Specify key of interest.
key='keyForValueIWant'
# Extract matching value, assuming that the input data is
# in shell variable $input:
value=$(awk -F'("|=>)' -v key="$key" '$2==key { print $5; exit }' <<<"$input")
# Print result.
echo "Value for $key: [$value]"
-F'("|=>)' tells awk to split each line into fields based on " or => as separators - effectively, this will put the key in field 2 ($2), and the value in field 5 ($5)
The key of interest is passed as a shell variable ($key) to awk as a variable of the same name (-v key=...).
If the input line's key matches the specified key ($2==key), the 5th field - containing the value - is printed (print $5).
exit ensures that processing stops once a match is found to prevent unnecessary parsing of the remainder of the file (note: this assumes that the keys are true keys, i.e., that they are unique in the input file).
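If you need several values, the same awk call can be wrapped in a small function. This is only a sketch: lookup is a hypothetical helper name, it reads the text from standard input, and it assumes one key/value pair per line as in the question.

```shell
input='{"keyIDontCareAbout"=>"valueIDontCareAbout",
"keyForValueIWant"=>"valueIWant",
"anotherKey"=>"anotherValue"
}'

# lookup KEY: print the value stored under KEY, or nothing if KEY is absent.
lookup() {
    awk -F'("|=>)' -v key="$1" '$2 == key { print $5; exit }'
}

wanted=$(printf '%s\n' "$input" | lookup keyForValueIWant)
other=$(printf '%s\n' "$input" | lookup keyIDontCareAbout)
missing=$(printf '%s\n' "$input" | lookup noSuchKey)
printf 'wanted=[%s] other=[%s] missing=[%s]\n' "$wanted" "$other" "$missing"
```

The leading "{" on the first line ends up in $1, so the key is still found in $2 there.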