Finding Duplicates (Regex) - regex

I have a CSV containing a list of 500 members with their phone numbers. I tried diff tools, but none of them seem to find duplicates.
Can I use regex to find duplicate rows by members' phone numbers?
I'm using TextMate on Mac.
Many thanks

What duplicates are you searching for? The whole lines or just the same phone number?
If it is the whole line, then try this:
sort phonelist.txt | uniq -c | sort -n
and at the bottom you will see all lines that occur more than once.
If it is just the phone number in some column, then use this:
awk -F ';' '{print $4}' phonelist.txt | sort | uniq -c | sort -n
Replace the '4' with the number of the column containing the phone number and the ';' with the separator actually used in your file. Note the extra sort before uniq -c: uniq only collapses adjacent duplicate lines.
Or give us a few example lines from this file.
EDIT:
If the data format is: name,mobile,phone,uniqueid,group, then use the following:
awk -F ',' '{print $3}' phonelist.txt | sort | uniq -c | sort -n
in the command line.
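If you also want to see the full duplicate rows rather than just the counts, here is a small two-pass awk sketch along the same lines (assuming the comma-separated name,mobile,phone,uniqueid,group layout with the phone number in column 3):
# first pass counts each phone number, second pass prints every row whose
# phone number occurs more than once
awk -F ',' 'NR==FNR {count[$3]++; next} count[$3] > 1' phonelist.txt phonelist.txt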

Yes. For one way to do it, look here. But you would probably not want to do it this way.

You can simply parse this file and check which rows are duplicated. I think regex is the worst tool for this problem.

What language are you using? In .NET, with little effort you could load the CSV file into a DataTable and find/remove the duplicate rows. Afterwards, write your DataTable back to another CSV file.
Heck, you can load this file into Excel, sort by a field, and find the duplicates manually. 500 isn't THAT many.

Use Perl.
Load the CSV file into an array and take the column you want to check for duplicates (the phone numbers), storing those values in a second array; then reduce that array to its unique values using:
my %seen;
my @unique = grep { !$seen{$_}++ } @array2;
After that, all you need to do is loop over the unique array (phone numbers), and inside that loop iterate over array #1 (the lines). Compare each line's phone number against the one from the unique array and, if it matches, output that line to another CSV file.

Related

Extract a string from vcf file

I need to extract the RS=368138379 string from lines like the following in a VCF file with a few billion lines. I am wondering how we can use grep -o "" and a regular expression to quickly extract that?
AF_ESP=0.0001;ALLELEID=359042;CLNDISDB=MedGen:C0678202,OMIM:266600;CLNDN=Inflammatory_bowel_disease_1;CLNHGVS=NC_000006.11:g.31779521C>T;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=association;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=HSPA1L:3305;MC=SO:0001583|missense_variant;ORIGIN=4;RS=368138379
Thanks very much indeed.
Something along the lines of RS=\d+ should do the trick for the expression you're looking for.
Let's say test.log contains your data; then you can use:
grep -oE "RS=[0-9]+" test.log
If you also want to print the line numbers:
grep -noE "RS=[0-9]+" test.log
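If you only want the numeric ID without the RS= prefix, one option (a minimal sketch using the same test.log) is to strip it with cut:
grep -oE "RS=[0-9]+" test.log | cut -d= -f2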
Best to avoid using grep to parse VCF/BCF files. Use bcftools query instead:
bcftools query -f '%INFO/RS\n' -e 'INFO/RS="."' clinvar.vcf.gz
A simple zgrep -oE "RS=[0-9]+" clinvar.vcf.gz will miss RS values for records that contain more than one ID, which can be pipe-delimited:
##INFO=<ID=RS,Number=.,Type=String,Description="dbSNP ID (i.e. rs number)">
Number is . when the number of possible values varies, is unknown, or is unbounded. Please see: https://samtools.github.io/hts-specs/VCFv4.2.pdf
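If the goal is one dbSNP ID per output line, here is a minimal sketch building on the command above (assuming the IDs inside the RS field are pipe-delimited, as the header suggests, and writing to a hypothetical rs_ids.txt):
# split pipe-delimited RS values onto separate lines and de-duplicate
bcftools query -f '%INFO/RS\n' -e 'INFO/RS="."' clinvar.vcf.gz | tr '|' '\n' | sort -u > rs_ids.txt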

remove duplicate lines (only first part) from a file

I have a list like this
ABC|Hello1
ABC|Hello2
ABC|Hello3
DEF|Test
GHJ|Blabla1
GHJ|Blabla2
And I want it to be this:
ABC|Hello1
DEF|Test
GHJ|Blabla1
So I want to remove the lines whose part before the | is a duplicate, and keep only the first one.
A simple way using awk
$ awk -F"|" '!seen[$1]++ {print $0}' file
ABC|Hello1
DEF|Test
GHJ|Blabla1
The trick here is to set the appropriate field separator ("|" in this case), after which the individual columns can be accessed starting with $1. In this answer, I am maintaining a unique-value array seen and printing a line only if the value of $1 has not been seen previously.
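Since printing the whole record is awk's default action when a pattern has no action block, the same command can be written a bit more tersely; this is purely a stylistic variant of the one above:
awk -F"|" '!seen[$1]++' file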

How to use awk and grep on 300GB .txt file?

I have a huge .txt file, 300GB to be more precise, and I would like to put all the distinct strings from the first column that match my pattern into a different .txt file.
awk '{print $1}' file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt
This is what I've tried, and as far as I can see it works fine but the problem is that after some time I get the following error:
awk: program limit exceeded: maximum number of fields size=32767
FILENAME="file_name" FNR=117897124 NR=117897124
Any suggestions?
The error message tells you:
line 117897124 has too many fields (more than 32767).
You'd better check it out:
sed -n '117897124{p;q}' file_name
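If you just want to see how many whitespace-separated fields that line has, without tripping awk's limit again, something along these lines should do:
sed -n '117897124{p;q}' file_name | wc -w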
Use cut to extract 1st column:
cut -d ' ' -f 1 < file_name | ...
Note: You may change ' ' to whatever the field separator is. The default is $'\t'.
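Here is a rough sketch of the original pipeline with cut doing the column extraction instead of the first awk (assuming space-separated columns, as in the original command); cut is line-oriented and does not impose awk's field-count limit:
cut -d ' ' -f 1 file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt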
The 'number of fields' is the number of 'columns' in the input file, so if one of the lines is really long, then that could potentially cause this error.
I suspect that the awk and grep steps could be combined into one:
sed -n 's/\(^pattern...\).*/\1/p' some_file | awk '!seen[$0]++' > test1.txt
That might evade the awk problem entirely (that sed command substitutes any leading text which matches the pattern, in place of the entire line, and if it matches, prints out the line).
Seems to me that your awk implementation has an upper limit of 117,897,124 on the number of records it can read in one go. The limits can vary with your implementation and your OS.
Maybe a sane way to approach this problem is to write a small script that uses split to break the large file into smaller ones with no more than 100,000,000 records each.
In case you don't want to split the file, you could look for the limits file corresponding to your awk implementation. You might be able to set the Number of Records value to unlimited, although I believe that is not a good idea, as you might end up using a lot of resources...
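A rough sketch of the split-based approach, using hypothetical chunk_ file names and reusing the original extraction pipeline on each piece:
# break the file into pieces of at most 100,000,000 lines each
split -l 100000000 file_name chunk_
# extract per chunk, then de-duplicate across all of them at the end
for f in chunk_*; do
    awk '{print $1}' "$f" | grep -o '/ns/.*'
done | awk '!seen[$0]++' > test1.txt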
If you have enough free space on disk (Vim creates a temporary .swp file), I suggest using Vim. Vim's regex dialect has small differences from standard regex, but you can convert from standard regex to Vim regex with this tool: http://thewebminer.com/regex-to-vim
The error message says your input file contains too many fields for your awk implementation. Just change the field separator to be the same as the record separator and you'll only have 1 field per line and so avoid that problem, then merge the rest of the commands into one:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\// && !seen[$0]++' file_name
If holding the seen[] array in memory is a problem, then try:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\//' file_name | sort -u
There may be an even simpler solution but since you haven't posted any sample input and expected output, we're just guessing.

Extracting Last column using VI

I have a CSV file which contains some 1000 fields with values; the headers are something like below:
v1,v2,v3,v4,v5....v1000
I want to extract the last column, i.e. v1000, and its values.
I tried %s/,[^,]*$//, but this turns out to do the exact opposite of what I expected. Is there any way to invert this expression in vi?
I know it can be done with awk as awk -F "," '{print $NF}' myfile.csv, but I want to make it happen in vi with a regular expression. Please also note that I have vi and not Vim, and I am working on UNIX, so I can't use the visual-mode trick either.
Many thanks in advance, any help is much appreciated.
Don't you just want
%s/.*,\s*//
.*, matches everything up to the last comma (the match is greedy), and the \s* is there to remove whitespace if it's there.
You already accepted an answer, but note that you can still use awk or other nice UNIX tools from within vi or Vim. The technique below manipulates the contents of the buffer through an external command, :{range}!{cmd}.
As a demo, let's rearrange the records in a CSV file with the sort command:
first,last,email
john,smith,john#example.com
jane,doe,jane#example.com
:2,$!sort -t',' -k2
The -k2 flag sorts the records by the second field.
Extracting the last column with awk is as easy as:
:%!awk -F "," '{print $NF}'
Don't forget cut!
:%!cut -d , -f 6
Where 6 is the number of the last field.
Or if you don't want to count the number of fields:
:%!rev | cut -d , -f 1 | rev

How can I write data from txt file to database?

If I have a txt file with a certain number of rows and columns (the number of columns is unknown at the beginning; columns are separated by tabs), how can I import the data into the database? I have managed to iterate through the first row to count the number of columns and create a table accordingly, but now I need to go through each row and insert the data into the respective columns. How can I do that?
Example of the txt file:
Name Size Population GDP
aa 2344 1234 12
bb 2121 3232 15
... ... .. ..
.. .. .. ..
The table has been created:
CREATE TABLE random (id INT, Name char(20), Size INT, Population INT, GDP INT)
The difficult part is reading in the text fields. According to your definition, the field titles are separated by spaces. Is this true for the text fields?
A generic process (a rough shell sketch follows this outline) is:
Create an SQL CREATE statement from the header text.
Execute the SQL statement.
While reading a line of text doesn't fail, do:
    Parse the text into variables.
    Create an SQL INSERT statement using the field names and the values from the variables.
    Execute the SQL statement.
End-While
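A minimal shell sketch of that loop, assuming a tab-separated file named data.txt with the header on line 1, exactly the four data columns from the example, and a PostgreSQL database named mydb reachable via psql (real data would also need proper quoting/escaping of the text values):
# skip the header, read each line into variables, emit one INSERT per line,
# and pipe the generated statements into the database
tail -n +2 data.txt | while IFS=$'\t' read -r name size population gdp; do
    printf "INSERT INTO random (Name, Size, Population, GDP) VALUES ('%s', %s, %s, %s);\n" \
        "$name" "$size" "$population" "$gdp"
done | psql mydb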
Another solution is to convert the TXT file into tab- or comma-separated fields. Check your database documentation to see whether there is a function for loading files, and also which characters it expects as column separators.
If you need specific help, please ask a more specific or detailed question.
Using PostgreSQL's COPY command, something like:
COPY random FROM 'filename' WITH DELIMITER E'\t';
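Note that COPY ... FROM 'filename' reads the file on the database server and needs the appropriate privileges there; if the file sits on the client machine, psql's \copy meta-command is the usual alternative. A hedged example, assuming the data has first been converted to a comma-separated data.csv with a header row and a database named mydb:
psql mydb -c "\copy random(Name, Size, Population, GDP) FROM 'data.csv' WITH (FORMAT csv, HEADER)"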
Something like this might work.
The basic idea is to use print statements to transform each line into SQL commands.
Then you can execute these commands using an SQL command interpreter.
sed -E "s/^([^[:space:]]+)/'\1'/; s/[[:space:]]+/,/g" textfile.txt | awk 'NR != 1 {print "INSERT INTO random (Name,Size,Population,GDP) VALUES (" $0 ");"}' > sqlcommands.txt
For an unknown number of columns, this might work:
sed -E "s/^([^[:space:]]+)/'\1'/; s/[[:space:]]+/,/g" textfile.txt | awk 'NR != 1 {print "INSERT INTO random VALUES (ID," $0 ");"}' > sqlcommands.txt
Replace ID with the id value you need, but note that you will have to run this separately for each ID value.
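To run the generated statements, any SQL command-line client will do; for example, with PostgreSQL (assuming a hypothetical database named mydb):
psql mydb -f sqlcommands.txt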
I work with Sybase, where the "bcp" utility does this. A quick Google search for "postgres bcp" brings up this:
http://lists.plug.phoenix.az.us/pipermail/plug-devel/2000-October/000103.html
I realize it's not the best answer, but it's good enough to get you going, I hope.
Oh, and you may need to change your text format to make it comma- or tab-delimited. Use sed for that.
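For example, a hedged one-liner that turns runs of spaces or tabs into commas (assuming no field itself contains whitespace):
sed -E 's/[[:space:]]+/,/g' textfile.txt > textfile.csv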