remove duplicate links from a file - regex

I have a big file with links something like this:
http://aaaa1.com/weblink/link1.html#XXX
http://aaaa1.com/weblink/link1.html#XYZ
http://aaaa1.com/weblink/link2.html#XXX
http://bbbb.com/web1/index.php?index=1
http://bbbb.com/web1/index1.php?index=1
http://bbbb.com/web1/index1.php?index=2
http://bbbb.com/web1/index1.php?index=3
http://bbbb.com/web1/index1.php?index=4
http://bbbb.com/web1/index1.php?index=5
http://bbbb.com/web1/index2.php?index=666
http://bbbb.com/web1/index3.php?index=666
http://bbbb.com/web1/index4.php?index=5
I want to remove all the duplicate links and be left with:
http://aaaa1.com/weblink/link1.html#XXX
http://aaaa1.com/weblink/link2.html#XXX
http://bbbb.com/web1/index.php?index=1
http://bbbb.com/web1/index1.php?index=1
http://bbbb.com/web1/index2.php?index=666
http://bbbb.com/web1/index3.php?index=666
http://bbbb.com/web1/index4.php?index=5
How can I do this?

Could you please try the following.
awk -F'[#?]' '!a[$1]++' Input_file
Explanation of the above code:
awk -F'[#?]' ' ##Start the awk script, setting the field separator to # (hash) or ? (question mark), as per the OP's sample Input_file.
!a[$1]++ ##Use $1 (the first field, i.e. the link without its #... or ?... suffix) as an index into array a and post-increment its count.
##The ! makes the condition true only the first time a given $1 is seen, so each base link is printed once and later duplicates are not.
' Input_file ##Mention the Input_file name here.
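To see what the separator does, splitting one of the sample links on # or ? leaves just the base URL in $1 (illustrative):
$ echo 'http://aaaa1.com/weblink/link1.html#XXX' | awk -F'[#?]' '{print $1}'
http://aaaa1.com/weblink/link1.html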

I hope this will clear all the duplicate links from your file, though it requires the duplicate lines to be exactly identical:
sort -u your-link-file.txt
And if you want to store the result in another file, then use this:
sort -u your-link-file.txt > result.txt
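Note that sort -u also reorders the lines. If you want to drop exact duplicates while keeping the original order, a common awk idiom (a sketch, assuming duplicates are byte-for-byte identical) is:
awk '!seen[$0]++' your-link-file.txt > result.txt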

Is there a way to use regex with awk to execute a command only when the pattern is matched?

I'm trying to write a bash script that gets user input, checks a .txt file for the line that contains that input, then plugs that line into a wget statement to commence a download.
In testing, awk seems to print out every line, not just pattern-matched lines.
chosen=DSC01985
awk -v c="$chosen" 'BEGIN {FS="/"; /c/}
{print $8, "found", c}
END{print " done"}' ./imgLink.txt
The above should read from imgLink.txt, search for the pattern, and report that the pattern was found. Instead it prints the 8th field of every line in the file.
I have tried moving /c/ out of the BEGIN block, but to no avail.
What's going on here?
Example input:
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01533.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01536.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01543.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01558.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01565.jpg
etc.
Example output:
...
DSC02028.jpg found DSC01985
DSC02030.jpg found DSC01985
DSC02032.jpg found DSC01985
DSC02038.jpg found DSC01985
DSC02042.jpg found DSC01985
etc.
You were close in your attempt, but you can't search for an awk variable with /c/: that matches the literal pattern c, not the variable's value, so you need $0 ~ c instead. Could you please try the following, considering that the string you want to look for appears in the URL values which you have currently xxxed out in your post:
awk -v c="$chosen" -F'/' '$0 ~ c{print $NF " found " c}' Input_file
I'm not sure why you wrote done in your END block; you could add it back here if you need it. Also, $NF means the last field of the current line; you can adjust what is printed as per your need.
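For completeness, here is a minimal sketch of how the matched line could feed the wget download mentioned in the question (the exit stops at the first match; beyond the names taken from the question, everything here is illustrative):
chosen=DSC01985
url=$(awk -v c="$chosen" '$0 ~ c{print; exit}' ./imgLink.txt)
[ -n "$url" ] && wget "$url"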

How to grep specific letters in a sequence using grep

I have a file containing this form of information:
>scaffold1|size69534
ACATAAGAGSTGATGATAGATAGATGCAGATGACAGATGANNGTGANNNNNNNNNNNNNTAGAT
>scaffold2|size68281
ATAGAGATGAGACAGATGACAGANNNNAGATAGATAGAGCAGATAGACANNNNAGATAGAG
>scaffold3|size67203
ATAGAGTAGAGAGAGAGTACAGATAGAGGAGAGAGATAGACNNNNNNACATYYYYYYYYYYYYYYYYY
>scaffold4|size66423
ACAGATAGCAGATAGACAGATNNNNNNNAGATAGTAGACSSSSSSSSSS
and so on
But I guess there is something abnormal in the sequences, so what I want to do is grep all the letters that are not A, C, T, G or N in the lines after each scaffold
(I want to search just in the sequence lines, not in the >scaffold|size lines).
In the example above it would grep the YYYYYYYYYYYYYYYYY after scaffold3 and the SSSSSSSSSS after scaffold4.
I hope I'm clear enough; please tell me if you need any clarification.
Thank you
Could you please try the following, considering that you don't want empty lines in the output.
awk '!/^>/{gsub(/[ACTGN]/,"");if(NF){print}}' Input_file
Explanation: adding a detailed explanation of the above code here.
awk ' ##Start the awk program.
!/^>/{ ##If a line does NOT start with >, then do the following.
gsub(/[ACTGN]/,"") ##Globally substitute A, C, T, G and N with the empty string on the current line.
if(NF){ ##If the line is NOT empty after the substitution, then:
print ##Print the current line, i.e. only the leftover characters.
}
}
' Input_file ##Mention the Input_file name here.
Output will be as follows.
S
YYYYYYYYYYYYYYYYY
SSSSSSSSSS
Let's assume you don't just need to know which sequences contain invalid characters - you also want to know which scaffold each sequence belongs to. This can be done; how to do it depends on the exact output format you need, and also on the exact structure of the data.
Just for illustration, I will make the following simplifying assumptions: the "sequences" may only contain uppercase letters (which may be the valid ones or invalid ones - but there can't be punctuation marks, or digits, etc.); and the labels (the rows that begin with a >) don't contain any uppercase letters. Note - if the sequences only contain letters, then it's not too hard to pre-process the file to convert the sequences to all-uppercase and the labels to all-lowercase, so the solution below will still work.
In some versions of grep, the matched (invalid) characters will appear in a different color, which I find quite helpful.
grep --no-group-separator -B 1 '[BDEFHIJKLMOPQRSUVWXYZ]' input_file
OUTPUT:
>scaffold1|size69534
ACATAAGAGSTGATGATAGATAGATGCAGATGACAGATGANNGTGANNNNNNNNNNNNNTAGAT
>scaffold3|size67203
ATAGAGTAGAGAGAGAGTACAGATAGAGGAGAGAGATAGACNNNNNNACATYYYYYYYYYYYYYYYYY
>scaffold4|size66423
ACAGATAGCAGATAGACAGATNNNNNNNAGATAGTAGACSSSSSSSSSS
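If you also want just the offending characters printed next to their scaffold label, rather than the whole sequence, a small awk sketch along the same lines (assuming every label line starts with >) would be:
awk '/^>/{label=$0; next} {bad=$0; gsub(/[ACTGN]/,"",bad); if (bad != "") print label, bad}' input_file
On the sample above this would print:
>scaffold1|size69534 S
>scaffold3|size67203 YYYYYYYYYYYYYYYYY
>scaffold4|size66423 SSSSSSSSSS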
Use grep -v to remove the scaffold lines, and grep -oP to select the runs of undesired letters.
grep -v '^>' test.txt | grep -oP '[^ACGTN]+'
Output from the sample data:
S
YYYYYYYYYYYYYYYYY
SSSSSSSSSS

remove duplicate lines (only first part) from a file

I have a list like this
ABC|Hello1
ABC|Hello2
ABC|Hello3
DEF|Test
GHJ|Blabla1
GHJ|Blabla2
And I want it to be this:
ABC|Hello1
DEF|Test
GHJ|Blabla1
So I want to remove the lines whose part before the | is a duplicate,
and only keep the first one.
A simple way using awk:
$ awk -F"|" '!seen[$1]++ {print $0}' file
ABC|Hello1
DEF|Test
GHJ|Blabla1
The trick here is to set the appropriate field separator ("|" in this case), after which the individual columns can be accessed, starting with $1. In this answer, I maintain a unique-value array seen and print a line only if the value of $1 has not been seen previously.
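Since printing the line is awk's default action when a condition is true, the {print $0} can also be dropped, with identical behavior:
$ awk -F"|" '!seen[$1]++' file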

Adding commas when necessary to a csv file using regex

I have a csv file like the following:
entity_name,data_field_name,type
Unit,id
Track,id,LONG
The second row is missing a comma. I wonder if there might be some regex or awk-like tool that can append commas to the end of the line for rows where commas are missing?
Update
I know the requirements are a little vague. There might be several alternative ways to narrow down the requirements such as:
The header row should define the number of columns (and commas) that is valid for the whole file. The script should read the header row first and find out the correct number of columns.
The number of columns might be passed as an argument to the script.
The number of columns can be hardcoded into the script.
I didn't narrow down the requirements at first because I was OK with any of them. Of course, the first alternative is the best, but I wasn't sure whether it would be easy to implement.
Thanks for all the great answers and comments. Next time, I will state acceptable alternative requirements explicitly.
You can use this awk command to pad all rows, starting from the 2nd row, with empty cell values based on the number of columns in the header row, so the number of columns isn't hard-coded. The NF{$nc=$nc} part assigns the last expected field to itself, which forces awk to rebuild the record, creating any missing trailing fields as empty:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} NF{$nc=$nc} 1' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
Earlier solution:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} {printf "%s", $0;
for (i=NF+1; i<=nc; i++) printf "%s", OFS; print ""}' file
I would use sed,
sed 's/^[^,]*,[^,]*$/&,/' file
Example:
$ echo 'Unit,id' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,
$ echo 'Unit,id,bar' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,bar
Try this (the OFS setting matters: once a field is modified, awk rebuilds the line with OFS, which defaults to a space):
$ awk -F, -v OFS=, 'NF==2{$2=$2","}1' file
Output:
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
With another awk:
awk -F, 'NF==2{$3=""}1' OFS=, yourfile.csv
To present balance to all the awk solutions, the following could be a vim-only solution:
:v/,.*,/norm A,
Rationale:
/,.*,/ searches for two commas in a line
:v applies a command globally to each line NOT matching the search
norm A, enters normal mode and appends a , to the end of the line
This MIGHT be all you need, depending on the info you haven't shared with us in your question:
$ awk -F, '{print $0 (NF<3?FS:"")}' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
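If you first want to see which rows are short before fixing anything, a quick check in the same spirit (the header's column count is taken as the reference; this is just a sketch) could be:
$ awk -F, 'NR==1{nc=NF} NF!=nc{print NR": "$0}' file
2: Unit,id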

shell script : how to avoid null character when i read values from a file

I have many values in a file, and on one line I am getting a null value. The null value comes as the last value in the file. I want to skip the null value and take the last integer value from the file. Can anyone help me with a regular expression for doing that? I can post what I tried here.
value=`cat $working_tmp_dir/numbers.txt|tail -3| head -1|cut -f2 -d'='|cut -b 1-8`
When I tried the above I am not getting the last integer value; it gives me null.
Sample values in the file are:
date=11052015
date=11062015
date=11092015
date=11122015
date=11192015
date=12172015
date=20160202
date="null value coming here"
The spaces between the numbers are just a formatting issue.
Please help me with this.
This awk command should work:
awk -F= '$2+0 == $2{n=$2} END{print n}' file
20160202
$2+0 == $2 will be true only if $2 represents a number.
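To see the numeric test in action on its own (illustrative one-liners):
$ echo 'date=20160202' | awk -F= '{print ($2+0 == $2)}'
1
$ echo 'date=null' | awk -F= '{print ($2+0 == $2)}'
0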
You could just parse your file with awk to get the last line that matches the pattern.
Based on your updated sample output, you could use this if the lines begin with date=:
value=$(awk '/date=[0-9]/{a=$0}END{print a}' $working_tmp_dir/numbers.txt | grep -oP "\d+")
This finds each line where date= is followed by a digit and saves it in the variable a; it does this for every match, so the last match ends up as the final value of the variable, and grep then extracts the digits.
Also you could do this:
value=$(tail -n2 $working_tmp_dir/numbers.txt | grep -m1 -oP "(?<=date=)\d+$")
So based on the sample input this would set the variable value to 20160202.
Any reason why the following isn't sufficient?
grep -E '^date=[0-9]+$' your_file | tail -1 | cut -d= -f2
You simply filter the lines whose value is a plain integer with grep, pick the last one with tail, and strip the date= prefix with cut.
Thanks for all the comments. I tried everything you guys posted, made some changes myself, and learned a lot of things from you. Thank you so much. At last I tried this and got the solution:
value=`grep -Po "=\d+" $working_tmp_dir/numbers.txt |tail -1| sed 's/=//'`
This is working perfectly for me.