Remove unnecessary information from SNP rsid names - regex

I have a dataset of SNPs that aren't coded the way I need them. Instead of being coded just as "rsNUMBER", they also carry information about the chip analyses, for example GSA-rsNUMBER or psy-rsNUMBER.
Some have the chip information at the end instead, e.g. rsNUMBER_CNV_SULT1A3.
Is there a way to remove the chip information? My data is in PLINK binary format (.bed, .bim, and .fam).

You can use Perl to get a simple hack working:
echo -e "1 rs123-bob 0 123456 N N\n1 bob-rs123 0 123456 N N" | perl -pe 's/(\S+\s+)\S*(rs[0-9]+)\S*(.*)/$1$2$3/'
The above assumes .bim format.
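To apply this to your data, you can rewrite the ID column of the .bim file directly; PLINK reads variant IDs only from the .bim, so the .bed and .fam files stay untouched. A minimal sketch, assuming your fileset is named mydata:
# edit the .bim in place, keeping the original as mydata.bim.bak;
# the variant ID is column 2, so the same substitution applies
perl -i.bak -pe 's/(\S+\s+)\S*(rs[0-9]+)\S*(.*)/$1$2$3/' mydata.bim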

Related

Extract a string from vcf file

I need to extract the RS=368138379 string from lines like the following in a VCF file of a few billion lines. How can we use grep -o "" and a regular expression to extract it quickly?
AF_ESP=0.0001;ALLELEID=359042;CLNDISDB=MedGen:C0678202,OMIM:266600;CLNDN=Inflammatory_bowel_disease_1;CLNHGVS=NC_000006.11:g.31779521C>T;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=association;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=HSPA1L:3305;MC=SO:0001583|missense_variant;ORIGIN=4;RS=368138379
Thanks very much indeed.
Something along the lines of RS=\d+ should do the trick for the expression you're looking for.
Let's say test.log contains your log; you can use:
grep -oE "RS=[0-9]+" test.log
If you want to print the line numbers as well:
grep -noE "RS=[0-9]+" test.log
Best to avoid using grep to parse VCF/BCF files. Use bcftools query instead:
bcftools query -f '%INFO/RS\n' -e 'INFO/RS="."' clinvar.vcf.gz
A simple zgrep -oE "RS=[0-9]+" clinvar.vcf.gz will miss RS values for records that contain more than one ID, which can be pipe-delimited:
##INFO=<ID=RS,Number=.,Type=String,Description="dbSNP ID (i.e. rs number)">
Number is . when the number of possible values varies, is unknown, or is unbounded. Please see: https://samtools.github.io/hts-specs/VCFv4.2.pdf
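If bcftools isn't available, a grep pattern that also accepts pipe-delimited lists is a workable fallback. A sketch, assuming the IDs are purely numeric:
# match RS=<id> as well as RS=<id>|<id>|..., then split to one ID per line
zgrep -oE 'RS=[0-9]+(\|[0-9]+)*' clinvar.vcf.gz | sed 's/^RS=//' | tr '|' '\n'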

iterate over apache 2 log file names and compare numbers using linux bash

Here is an example of the logs in my /var/www/apache2/log folder:
./no_domain_access.log.7.gz
./no_domain_access.log.8.gz
./no_domain_access.log.9.gz
./no_domain_error.log.10.gz
./no_domain_error.log.11.gz
./no_domain_error.log.12.gz
./no_domain_error.log.13.gz
./no_domain_error.log.14.gz
./no_domain_error.log.15.gz
./no_domain_error.log.16.gz
./no_domain_error.log.17.gz
./no_domain_error.log.18.gz
./no_domain_error.log.19.gz
./no_domain_error.log.20.gz
and so on, up to 50...
I would like to iterate over those files and remove all log files numbered greater than 5.
Regex syntax lets me match the numbers with patterns like [1-9] or {1,2}, but those also match the log files I don't want to delete (the single-digit 1-5 logs I wish to keep).
How can I match only file names with numbers higher than 5?
Thanks!
You can use an awk one-liner for this:
printf '%s\n' *[0-9].gz | awk -F '.' '$(NF-1) > 5'
This awk command uses the dot as the field separator and compares $(NF-1) (that is, the numeric field just before the extension) with the number 5.
To delete these files, use:
printf '%s\n' *[0-9].gz | awk -F '.' '$(NF-1) > 5' | xargs rm
xargs takes the file names from awk's output and rm deletes them.
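If GNU find is available, the numeric range can be encoded in the regex itself and the deletion done in one command. A sketch, assuming no rotation number goes above 99:
# [6-9] covers the single digits above 5, [1-9][0-9] covers 10-99
find /var/www/apache2/log -maxdepth 1 -regextype posix-extended \
    -regex '.*\.log\.([6-9]|[1-9][0-9])\.gz' -delete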
Use the bash regex operator =~ to extract the number, and list the file if the number is greater than 5:
for file in /var/www/apache2/log/*.gz; do
    test -f "$file" || continue
    [[ $file =~ ^.*log\.([[:digit:]]+).*$ ]] && (( BASH_REMATCH[1] > 5 )) && printf "%s\n" "$file"
done
If you just want to delete the files, replace printf "%s\n" with rm.
Find with regular expressions
find . -regex './no_domain_access.log.*gz' ! -regex './no_domain_access.log.[1-5].gz'
This finds all files matching no_domain_access.log.*gz, then uses a second, negated regex to drop the files numbered 1 to 5 from the results.
Without regular expressions, using shell globs and entirely native & portable POSIX shell code:
rm -f no_domain_access.log.[6-9].gz no_domain_access.log.[0-9][0-9].gz
It's easier in bash:
rm -f no_domain_access.log.{6..50}.gz
These are probably created with logrotate or a similar log rotation utility. You might want to just change its configuration to only store five logs.
If it's controlled by logrotate, you can read up on it with man logrotate; in its configuration you'll probably find something like this:
/var/log/no_domain_access.log {
    rotate 50
    daily
}
Change the 50 to 5 and you're done. You will probably still have to clean up the existing older logs using one of the commands above.

How to extract values between 2 known strings

I have some huge files containing mixed binary and XML data. I want to extract all values between two XML tags that occur multiple times in the file. The pattern is as follows: <C99><F1>050</F1><F2>random value</F2></C99>. The XML portions are not pretty-printed; everything is on a single line.
I need all the values between <F1> and </F1> from <C99> where the value is in the range 050 to 999 (<F1> exists under other elements as well, but I need only the F1 values from C99). I need to count them, to see how many C99 elements have an F1 value between 050 and 999.
I would like a hint on how I could easily reach and extract those values (using cat and grep? or sed?). Sorting and counting are easy once the values are exported to a file.
My temporary solution:
After removing all binary data from the file, I can run the following command:
grep -o "<C99><F1>..." filename > file.txt
This exports the first 12 characters of every string starting with <C99><F1>:
<C99><F1>001
<C99><F1>056
<C99><F1>123
<C99><F1>445
.....
Once they're exported to a text file, I replace <C99><F1> with nothing and then sort and count the remaining values.
Thank you!
Using XMLStarlet:
$ xml sel -t -v '//C99/F1[. >= 50 and . <= 999]' -nl data.xml | wc -l
Not much of a hint there, sorry.
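If the binary content keeps the file from parsing as XML, a plain grep pipeline can still work, since everything is on a single line. A rough sketch, assuming every F1 value is exactly three digits (-a makes grep treat the binary file as text):
grep -ao '<C99><F1>[0-9]\{3\}' filename |
  sed 's/.*>//' |
  awk '$0+0 >= 50 && $0+0 <= 999 { count++ } END { print count+0 }'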

How to find lines using patterns in a file in UNIX

I am trying to use a .txt file with around 5000 patterns (one per line) to search another file of 18000 lines for any matches. So far I've tried every form of grep and awk I can find on the internet and it's still not working, so I am completely stumped.
Here's some text from each file.
Pattern.txt
rs2622590
rs925489
rs2798334
rs6801957
rs6801957
rs13137008
rs3807989
rs10850409
rs2798269
rs549182
There are no extra spaces or anything.
File.txt
snpid hg18chr bp a1 a2 zscore pval CEUmaf
rs3131972 1 742584 A G 0.289 0.7726 .
rs3131969 1 744045 A G 0.393 0.6946 .
rs3131967 1 744197 T C 0.443 0.658 .
rs1048488 1 750775 T C -0.289 0.7726 .
rs12562034 1 758311 A G -1.552 0.1207 0.09167
rs4040617 1 769185 A G -0.414 0.6786 0.875
rs4970383 1 828418 A C 0.214 0.8303 .
rs4475691 1 836671 T C -0.604 0.5461 .
rs1806509 1 843817 A C -0.262 0.7933 .
The file.txt was downloaded directly from a med directory.
I'm pretty new to UNIX so any help would be amazing!
Sorry, edit: I have definitely tried every single thing you guys are recommending and the result is blank. Am I maybe missing a syntax issue or something in my text files?
P.P.S. I know there are matches, as individual greps work. I'll move this question to unix.stackexchange. Thanks for your answers guys, I'll try them all out.
Issue solved: I was obviously using DOS carriage returns. I didn't know about this before, so thank you to everyone who answered. For future users who hit this issue, here is the solution that worked:
dos2unix *
awk 'NR==FNR{p[$0];next} $1 in p' Patterns.txt File.txt > Output.txt
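To confirm the DOS line endings before converting, a quick check:
file Pattern.txt File.txt     # DOS files are reported "with CRLF line terminators"
cat -v Pattern.txt | head -3  # DOS lines end with a visible ^M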
You can use grep -Fw here:
grep -Fw -f Pattern.txt File.txt
Options used are:
-F - Fixed string search to treat input as non-regex
-w - Match full words only
-f file - Read pattern from a file
I don't know if it's what you want or not, but this will print every line from File.txt whose first field equals a string from Patterns.txt:
awk 'NR==FNR{p[$0];next} $1 in p' Patterns.txt File.txt
If that is not what you want, tell us what you do want. If it is what you want but doesn't produce the expected output, then one or both of your files contains control characters courtesy of being created in Windows, so run dos2unix or similar on both of them first.
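For readers newer to awk, here is the same one-liner expanded with comments:
awk '
    NR == FNR { p[$0]; next }   # first file: store each pattern as an array key
    $1 in p                     # second file: print lines whose first field is a stored pattern
' Patterns.txt File.txt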
Use a shell script to feed your patterns file to fgrep:
#!/bin/bash
# Usage: ./match.sh Pattern.txt File.txt
PATTERNS=$1
DATA=$2
fgrep -f "$PATTERNS" "$DATA"

SED: Inserting an existing pattern, to several other places on the same line

Again a SED question from me :)
So, same as last time, I'm wrestling with phone numbers. This time the problem is a bit different.
I have this kind of organization currently in my text file:
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555
Now, every area code can have an unknown number of phone numbers, and the phone numbers are not fixed in length either.
What I would like to know is how I could combine the area code and phone number to get something like this:
4444-111111, 4444-2222222, 4444-33333333
My first idea was to again add a line break before each phone number, match these sections with a regex, and then append the first remembered item to the second, the first to the third, and so on:
\1-\2, \1-\3, etc.
But since sed only remembers 9 back-references and there can be more than 10 numbers on a line, this doesn't work. The non-fixed length of the phone number list also makes this a no-go.
I'm again looking primarily for a sed option, as I've been trying to get proficient with it - but more efficient solutions with other tools are of course definitely welcome!
$ sed '1d;s/NUM:/ /g' input.txt | awk '{for(i=2;i<=NF;i++)printf("%s-%s%s", $1, $i, i==NF?"\n":",")}'
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
This might work for you:
sed '1d;:a;s/^\(\S*\)\(.*\)NUM:/\1\2,\1-/;ta;s/[^,]*,//;s/ //g' file
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
or:
awk 'NR>1{gsub(/NUM:/,","$1"-");sub(/[^,]*,/,"");gsub(/ /,"");print}' file
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
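If the sed version looks opaque, here is the same command with each step on its own line and commented (GNU sed accepts comments inside a script):
sed '
    # delete the header line
    1d
    # loop: rewrite the last remaining NUM: as ,<areacode>-
    :a
    s/^\(\S*\)\(.*\)NUM:/\1\2,\1-/
    ta
    # drop the now-redundant leading areacode field, then remove leftover spaces
    s/[^,]*,//
    s/ //g
' file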
TXR:
#(collect)
#area #(coll :mintimes 1)NUM:#{num /[0-9]+/}#(end)
#(output)
#(rep)#area-#num, #(last)#area-#num#(end)
#(end)
#(end)
Run:
$ txr phone.txr phone.txt
4444-111111, 4444-2222222, 4444-33333333
5555-1111111, 5555-2222, 5555-3333333, 5555-44444444, 5555-5555555
$ cat phone.txt
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555