Extract a string from vcf file - regex

I need to extract the string RS=368138379 from lines like the following in a VCF file with a few billion lines. I am wondering how we can use grep -o "" with a regular expression to quickly extract that?
AF_ESP=0.0001;ALLELEID=359042;CLNDISDB=MedGen:C0678202,OMIM:266600;CLNDN=Inflammatory_bowel_disease_1;CLNHGVS=NC_000006.11:g.31779521C>T;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=association;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=HSPA1L:3305;MC=SO:0001583|missense_variant;ORIGIN=4;RS=368138379
Thanks very much indeed.

Something along the lines of RS=\d+ should do the trick for the expression you're looking for.

Let's say test.log contains your data; you can use:
grep -oE "RS=[0-9]+" test.log
If you want to print also the line numbers:
grep -noE "RS=[0-9]+" test.log
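As a side note, if your grep has PCRE support (GNU grep's -P option), the \d shorthand from the answer above can be used directly:
grep -oP "RS=\d+" test.log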

Best to avoid using grep to parse VCF/BCF files. Use bcftools query instead:
bcftools query -f '%INFO/RS\n' -e 'INFO/RS="."' clinvar.vcf.gz
A simple zgrep -oE "RS=[0-9]+" clinvar.vcf.gz will miss RS values for records that contain more than one ID, which can be pipe-delimited:
##INFO=<ID=RS,Number=.,Type=String,Description="dbSNP ID (i.e. rs number)">
Number is . when the number of possible values varies, is unknown, or is unbounded. Please see: https://samtools.github.io/hts-specs/VCFv4.2.pdf
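For illustration (the trailing ID here is made up): given a record whose INFO field ends in RS=368138379|12345, the simple grep keeps only the first ID,
printf 'ORIGIN=4;RS=368138379|12345\n' | grep -oE "RS=[0-9]+"
RS=368138379
whereas bcftools query -f '%INFO/RS\n' would print the full 368138379|12345 value.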

Regex a string with unknown number of parameters

Let's say I have millions of strings in a text file in this format:
st=expand&c=22&t=button&k=fun HTTP
This is a string we can look at as a hash with keys st, c, t and k. Some of the strings in the text file might not have a given &KEY=VALUE present and might thus look like this:
st=expand&k=fun HTTP
How would one use sed to change the string to the following
expand,,,fun
that is, even though the key=value isn't present, we still add a comma. We can assume that we have a fixed key set [st,c,t,k].
What I've tried is something like (just an idea!!)
sed 's/\(st=\|c=\|t=\|k=\)\([\(^\&\|HTTP\)])\(\&\|HTTP\)/\3,/g' big_file
but obviously, if c isn't there, it isn't adding a comma since it doesn't find any. Any ideas how to approach this? Using awk might also be acceptable (or any other fast text-processing utility)
Thanks!
Input data example
st=expand&c=22&t=button&k=fun HTTP
c=22&t=button&k=fun HTTP
st=expand&c=22&t=party&k=fun HTTP
st=expand&c=22&k=fun HTTP
st=expand HTTP
HTTP
Output data
expand,22,button,fun
,22,button,fun
expand,22,party,fun
expand,22,,fun
expand,,,
,,,
You can use this sed:
sed -E 's/(st=([^& ]*)|)(.*c=([^& ]*)|)(.*t=([^& ]*)|)(.*k=([^& ]*)|) HTTP/\2,\4,\6,\8/' file
expand,22,button,fun
,22,button,fun
expand,22,party,fun
expand,22,,fun
expand,,,
,,,
Whenever you have name=value pairs in your input data, it's simplest, clearest, and usually most efficient to first create a name->value array and then print the values by name in whatever order you want, e.g.:
$ cat tst.awk
BEGIN { FS="[&= ]"; OFS="," }
{
    delete n
    for (i=1; i<NF; i+=2) {
        n[$i] = $(i+1)
    }
    print n["st"], n["c"], n["t"], n["k"]
}
$ awk -f tst.awk file
expand,22,button,fun
,22,button,fun
expand,22,party,fun
expand,22,,fun
expand,,,
,,,
Another pattern for sed to try:
sed -r "s/(st=(\w+))?(&?c=(\w+))?(&t=(\w+))?(&k=(\w+))?( HTTP)/\2,\4,\6,\8/g" big_file
expand,22,button,fun
,22,button,fun
expand,22,party,fun
expand,22,,fun
expand,,,
How about something like this? It's not perfectly strict, but as long as your data follows the format you described on every line, it will work.
Regex:
^(?:st=([^&\n]*))?&?(?:c=([^&\n]*))?&?(?:t=([^&\n]*))?&?(?:k=([^&\n]*))? HTTP$ (must be run once per line or with multi-line and global options enabled)
Substitution:
\1,\2,\3,\4
Try it here: https://regex101.com/r/nE1oP7/2
EDIT: If you are using sed, you will need to change the non-capturing groups to regular ones ((?:) to ()) and update the backreferences accordingly (\2,\4,\6,\8). Demo: http://ideone.com/GNRNGp
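For a rough idea of what that converted command could look like with sed -E (a sketch, not the code from the linked demo):
sed -E 's/^(st=([^& ]*))?&?(c=([^& ]*))?&?(t=([^& ]*))?&?(k=([^& ]*))? HTTP$/\2,\4,\6,\8/' file
Like the original regex, it expects every line to end in " HTTP".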

Match fixed string + numbers 0-10 with grep

I have a list of files such as this:
Sample_lane1-Bob10_R1.fastq.gz
Sample_lane1-Bob1_R1.fastq.gz
Sample_lane1-Bob2_R1.fastq.gz
Sample_lane1-Bob4_R1.fastq.gz
Sample_lane1-Bob5_R1.fastq.gz
Sample_lane1-Bob7_R1.fastq.gz
Sample_lane1-Bob8_R1.fastq.gz
Sample_lane1-Bob9_R1.fastq.gz
Sample_lane2-Bob10_R1.fastq.gz
Sample_lane2-Bob1_R1.fastq.gz
Sample_lane2-Bob3_R1.fastq.gz
Sample_lane2-Bob4_R1.fastq.gz
Sample_lane2-Bob6_R1.fastq.gz
Sample_lane2-Bob7_R1.fastq.gz
Sample_lane2-Bob8_R1.fastq.gz
Sample_lane2-Bob9_R1.fastq.gz
Sample_lane3-Bob11_R1.fastq.gz
Sample_lane3-Bob12_R1.fastq.gz
Sample_lane3-Bob13_R1.fastq.gz
Sample_lane3-Bob15_R1.fastq.gz
Sample_lane3-Bob16_R1.fastq.gz
Sample_lane3-Bob18_R1.fastq.gz
Sample_lane3-Bob19_R1.fastq.gz
Sample_lane3-Bob20_R1.fastq.gz
Sample_lane5-Bob11_R1.fastq.gz
Sample_lane5-Bob12_R1.fastq.gz
Sample_lane5-Bob16_R1.fastq.gz
Sample_lane5-Bob17_R1.fastq.gz
Sample_lane5-Bob19_R1.fastq.gz
Sample_lane5-Bob20_R1.fastq.gz
Sample_lane8-Sample1_R1.fastq.gz
Sample_lane8-Sample2_R1.fastq.gz
Sample_lane8-Sample3_R1.fastq.gz
Sample_lane8-Sample4_R1.fastq.gz
Sample_lane8-Sample5_R1.fastq.gz
I want to return only the files that are labeled 'Bob1' through 'Bob10' in order to perform some downstream actions, and I want to return the files labeled 'Bob11' through 'Bob20' similarly.
I have been trying to use grep for this with a regular expression, but have not been able to match both 'Bob' and the adjacent numeric range. For example, this is one of the many lines that have not worked:
grep -E "Bob#([10|0-9])"
I have tried many different combinations of Bob, 10|0-9, ", (), and [] in different places based on different tutorials I have found online but none have worked so far.
EDIT: For completeness, this solution given by @anubhava solved the above question:
grep -E "Bob(10|[0-9])_"
I did not specifically ask for a regex to match the other half of the range, 'Bob11' through 'Bob20', but the same idea covers it:
grep -E "Bob(1[1-9]|20)_"
You can use this regex for grep against a file:
grep -E "Bob(10|[0-9])_" file
However, if you are using a glob pattern in a directory, then use this extended glob:
shopt -s extglob
printf "%s\n" *Bob@(10|[[:digit:]])_*
Output:
Sample_lane1-Bob10_R1.fastq.gz
Sample_lane1-Bob1_R1.fastq.gz
Sample_lane1-Bob2_R1.fastq.gz
Sample_lane1-Bob4_R1.fastq.gz
Sample_lane1-Bob5_R1.fastq.gz
Sample_lane1-Bob7_R1.fastq.gz
Sample_lane1-Bob8_R1.fastq.gz
Sample_lane1-Bob9_R1.fastq.gz
Sample_lane2-Bob10_R1.fastq.gz
Sample_lane2-Bob1_R1.fastq.gz
Sample_lane2-Bob3_R1.fastq.gz
Sample_lane2-Bob4_R1.fastq.gz
Sample_lane2-Bob6_R1.fastq.gz
Sample_lane2-Bob7_R1.fastq.gz
Sample_lane2-Bob8_R1.fastq.gz
Sample_lane2-Bob9_R1.fastq.gz
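The second half of the range can be handled the same way, e.g. (a sketch, not part of the original answer):
printf "%s\n" *Bob@(1[1-9]|20)_*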
If you use a tool that can do math instead of relying on a regexp then you can select any range you like:
$ awk -F'-Bob|_' '$3+0>7 && $3+0<13' file
Sample_lane1-Bob10_R1.fastq.gz
Sample_lane1-Bob8_R1.fastq.gz
Sample_lane1-Bob9_R1.fastq.gz
Sample_lane2-Bob10_R1.fastq.gz
Sample_lane2-Bob8_R1.fastq.gz
Sample_lane2-Bob9_R1.fastq.gz
Sample_lane3-Bob11_R1.fastq.gz
Sample_lane3-Bob12_R1.fastq.gz
Sample_lane5-Bob11_R1.fastq.gz
Sample_lane5-Bob12_R1.fastq.gz
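For the two halves of the range asked about, the same approach might look like this (a sketch along the same lines):
awk -F'-Bob|_' '$3+0>=1 && $3+0<=10' file
awk -F'-Bob|_' '$3+0>=11 && $3+0<=20' file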

Regex command line change format of each line

I have a file that contains lines in a format similar to this...
/data/file.geojson?10,20,30,40
/data/file.geojson?bbox=-5.20751953125,49.05227025601607,3.0322265625,56.46249048388979
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-2.8482055664062496,54.38935426009769,-0.300750732421875,55.158473983815306
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
I've tried a combination of grep, sed, gawk, and | (pipes) to try to pattern match and then change the format to be more like this...
[10,40],[30,40],[30,20],[10,20]
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979].....
Hopefully you get the idea from the first line so I don't have to type out all the examples manually!
I've got the hang of the regex to match the coordinates. In fact, the input file is the result of extracting from Apache access logs. It might be easier to read/understand answers if they just match positive integers; I will then be able to slot in a more complicated pattern to match the right range.
To arrange the results the way you wish, it is important to be able to access the last four values of each line.
No pattern matching is required if you use awk. You can split the input strings by a set of delimiters and reassemble the resulting fields. 40 can be accessed as $(NF), 30 as $(NF-1) and so on.
awk -F'[?,=]' '{
    printf "[%s,%s],[%s,%s],[%s,%s],[%s,%s]\n",
        $(NF-3), $(NF), $(NF-1), $(NF),
        $(NF-1), $(NF-2), $(NF-3), $(NF-2)
}' file
I'm using ?, , or = as the field delimiters. This makes it simple to access the columns of interest.
Output:
[10,40],[30,40],[30,20],[10,20]
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979],[3.0322265625,49.05227025601607],[-5.20751953125,49.05227025601607]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-2.8482055664062496,55.158473983815306],[-0.300750732421875,55.158473983815306],[-0.300750732421875,54.38935426009769],[-2.8482055664062496,54.38935426009769]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
Btw, sed can also be used here:
sed -r 's/.*[?=]([^,]+),([^,]+),([^,]+),(.*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
The command captures each of the trailing numbers in a separate capturing group and reassembles them in the replacement part.
Not all versions of sed support the + quantifier. The most compatible version would look like this :)
sed 's/.*[?=]\([^,]\{1,\}\),\([^,]\{1,\}\),\([^,]\{1,\}\),\(.*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
Here sed strips off everything prior to the numbers, then awk splits on commas and outputs them in a different order, assuming the data is in a file called "td.txt":
sed 's/^[^0-9-]*//' td.txt | awk -F, '{print "["$1","$4"],["$3","$4"],["$3","$2"],["$1","$2"],"}'
This might work for you (GNU sed):
sed -r 's/^.*\?[^-0-9]*([^,]*),([^,]*),([^,]*),([^,]*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
Or with more toothpicks:
sed 's/^.*\?[^-0-9]*\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
You can use the following to match:
(\/data\/file\.geojson\?(?:bbox=)?)([0-9.-]+),([0-9.-]+),([0-9.-]+),([0-9.-]+)
And replace with the following:
$1[$2,$3],[$4,$5]
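To run that match and replacement from the command line, one possibility (a sketch; the non-capturing group is converted to a plain group so it works with sed -E, which shifts the later backreferences) is:
sed -E 's|(/data/file\.geojson\?(bbox=)?)([0-9.-]+),([0-9.-]+),([0-9.-]+),([0-9.-]+)|\1[\3,\4],[\5,\6]|' file
As written in the answer above, this keeps the URL prefix and produces two coordinate pairs per line.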

How to use awk and grep on 300GB .txt file?

I have a huge .txt file, 300GB to be more precise, and I would like to put all the distinct strings from the first column that match my pattern into a different .txt file.
awk '{print $1}' file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt
This is what I've tried, and as far as I can see it works fine but the problem is that after some time I get the following error:
awk: program limit exceeded: maximum number of fields size=32767
FILENAME="file_name" FNR=117897124 NR=117897124
Any suggestions?
The error message tells you:
line 117897124 has too many fields (>32767).
You'd better check it out:
sed -n '117897124{p;q}' file_name
Use cut to extract 1st column:
cut -d ' ' -f 1 < file_name | ...
Note: You may change ' ' to whatever the field separator is. The default is $'\t'.
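Plugged into the original pipeline, that could look like this (a sketch based on the command in the question):
cut -d ' ' -f 1 < file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt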
The 'number of fields' is the number of 'columns' in the input file, so if one of the lines contains a very large number of columns, that could cause this error.
I suspect that the awk and grep steps could be combined into one:
sed -n 's/\(^pattern...\).*/\1/p' some_file | awk '!seen[$0]++' > test1.txt
That might evade the awk problem entirely (that sed command substitutes any leading text which matches the pattern, in place of the entire line, and if it matches, prints out the line).
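As a concrete (hypothetical) instance using the /ns/ pattern from the question, and assuming the text of interest starts at the beginning of the first whitespace-delimited column, that sed step might be:
sed -n 's|^\(/ns/[^[:space:]]*\).*|\1|p' file_name | awk '!seen[$0]++' > test1.txt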
It seems to me that your awk implementation has an upper limit on the number of records it can read in one go, here 117,897,124. The limits can vary according to your implementation and your OS.
Maybe a sane way to approach this problem is to program a custom script that uses split to split the large file into smaller ones, with no more than 100,000,000 records each.
In case you don't want to split the file, maybe you could look for the limits file corresponding to your awk implementation. Maybe you can set the Number of Records value to unlimited, although I believe that is not a good idea, as you might end up using a lot of resources...
If you have enough free space on disk (Vim creates a temporary .swp file), I suggest using Vim. Vim regex differs slightly from standard regex, but you can convert from standard regex to Vim regex with this tool: http://thewebminer.com/regex-to-vim
The error message says your input file contains too many fields for your awk implementation. Just change the field separator to be the same as the record separator and you'll only have 1 field per line and so avoid that problem, then merge the rest of the commands into one:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\// && !seen[$0]++' file_name
If the memory needed for the seen[] array is a problem, then try:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\//' file_name | sort -u
There may be an even simpler solution but since you haven't posted any sample input and expected output, we're just guessing.

Finding Duplicates (Regex)

I have a CSV containing a list of 500 members with their phone numbers. I have tried diff tools but none of them seem to find duplicates.
Can I use regex to find duplicate rows by members' phone numbers?
I'm using Textmate on Mac.
Many thanks
What duplicates are you searching for? The whole lines or just the same phone number?
If it is the whole line, then try this:
sort phonelist.txt | uniq -c | sort -n
and you will see at the bottom all lines, that occur more than once.
If it is just the phone number in some column, then use this:
awk -F ';' '{print $4}' phonelist.txt | sort | uniq -c | sort -n
replace the '4' with the number of the column with the phone number and the ';' with the real separator you are using in your file.
Or give us a few example lines from this file.
EDIT:
If the data format is: name,mobile,phone,uniqueid,group, then use the following:
awk -F ',' '{print $3}' phonelist.txt | sort | uniq -c | sort -n
in the command line.
Yes. For one way to do it, look here. But you would probably not want to do it this way.
You can simply parse this file and check which rows are duplicated. I think regex is the worst solution for this problem.
What language are you using? In .NET, with little effort you could load the CSV file into a DataTable and find/remove the duplicate rows. Afterwards, write your DataTable back to another CSV file.
Heck, you can load this file into Excel, sort by a field, and find the duplicates manually. 500 isn't THAT many.
Use Perl.
Load the CSV file into an array, extract the column you want to check (phone numbers) into another array, then check for duplicates in that array using:
my %seen;
my @unique = grep !$seen{$_}++, @array2;
After that, all you need to do is loop over the unique array (phone numbers), and inside that loop iterate over array #1 (the lines). Compare the phone numbers, and if one matches, output that line into another CSV file.
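The nested loops can also be avoided entirely by checking a hash while reading the file. A minimal one-liner sketch, assuming a comma-separated file named members.csv with the phone number in the third column (as in the name,mobile,phone,... format mentioned above):
perl -F',' -ane 'print if $seen{$F[2]}++' members.csv > duplicates.csv
This prints every row whose phone number has already appeared, i.e. the duplicate rows.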