How to extract values between 2 known strings - regex

I have some huge files containing mixed binary and XML data. I want to extract all values between two XML tags that occur multiple times in the file. The pattern looks like this: <C99><F1>050</F1><F2>random value</F2></C99>. The XML portions are not formatted; everything is on a single line.
I need all values between <F1> and </F1> inside <C99> where the value is in the range 050 to 999 (<F1> exists under other fields as well, but I only need the F1 values from C99). I need to count them, to see how many C99 elements have an F1 value between 050 and 999.
I would like a hint on how I could easily reach and extract those values (using cat and grep? or sed?). Sorting and counting is easy to do once the values are exported to a file.
My temporary solution:
After removing all binary data from the file, I can run the following command:
cat filename | grep -o "<C99><F1>..." > file.txt
This exports the first 12 characters of every string starting with <C99><F1>:
<C99><F1>001
<C99><F1>056
<C99><F1>123
<C99><F1>445
.....
Once they are exported to a text file, I replace <C99><F1> with nothing and then sort and count the remaining values.
Thank you!

Using XMLStarlet:
$ xml sel -t -v '//C99/F1[. >= 50 and . <= 999]' -nl data.xml | wc -l
Not much of a hint there, sorry.
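Since the file is not well-formed XML (binary mixed in), a plain grep pipeline along the lines of your temporary solution may be the more practical route. A sketch, assuming the tag sequence always looks exactly like <C99><F1>NNN, and with filename and C99_F1.txt as placeholder names:
grep -ao '<C99><F1>[0-9]\{3\}' filename | sed 's/.*<F1>//' | awk '$0+0 >= 50 && $0+0 <= 999' | tee C99_F1.txt | wc -l
Here -a makes grep treat the binary file as text, sed strips the tag prefix, awk keeps only values from 050 to 999, and wc -l counts them (the values themselves end up in C99_F1.txt).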

Related

Remove unnecessary information from SNP rsid names

I have a dataset of SNPs which aren't coded the way I need them. Instead of being coded just as "rsNUMBER", they also carry information about the chip analysis, for example GSA-rsNUMBER or psy-rsNUMBER.
Some also have the chip information at the end, e.g. rsNUMBER_CNV_SULT1A3.
Is there a way to remove the chip information? My data is in plink binary format (.bed, .bim, and .fam).
You can use Perl to get a simple hack working:
echo -e "1 rs123-bob 0 123456 N N\n1 bob-rs123 0 123456 N N\n" | perl -p -e "s/(\S+\s+)\S*(rs[0-9]+)\S*(.*)/\1\2\3/g;
Above assumes .bim format.
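To run it against the real file, a minimal sketch (data.bim is a placeholder name for your fileset; -i.bak keeps a backup copy of the original .bim):
perl -pi.bak -e 's/(\S+\s+)\S*(rs[0-9]+)\S*(.*)/$1$2$3/' data.bim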

Extract a string from vcf file

I need to extract the RS=368138379 string from lines like the following in a VCF file that is a few thousand million lines long. I am wondering how we can use grep -o with a regular expression to quickly extract it?
AF_ESP=0.0001;ALLELEID=359042;CLNDISDB=MedGen:C0678202,OMIM:266600;CLNDN=Inflammatory_bowel_disease_1;CLNHGVS=NC_000006.11:g.31779521C>T;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=association;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=HSPA1L:3305;MC=SO:0001583|missense_variant;ORIGIN=4;RS=368138379
Thanks very much indeed.
Something along the lines of RS=\d+ should do the trick for the expression you're looking for.
Let's say test.log contains your data; you can use:
grep -oE "RS=[0-9]+" test.log
If you want to print also the line numbers:
grep -noE "RS=[0-9]+" test.log
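If your grep supports Perl-compatible regular expressions (GNU grep's -P option), \K can drop the RS= prefix so only the number itself is printed:
grep -oP "RS=\K[0-9]+" test.log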
Best to avoid using grep to parse VCF/BCF files. Use bcftools query instead:
bcftools query -f '%INFO/RS\n' -e 'INFO/RS="."' clinvar.vcf.gz
A simple zgrep -oE "RS=[0-9]+" clinvar.vcf.gz will miss RS values for records that contain more than one ID, which can be pipe-delimited:
##INFO=<ID=RS,Number=.,Type=String,Description="dbSNP ID (i.e. rs number)">
Number is . when the number of possible values varies, is unknown, or is unbounded. Please see: https://samtools.github.io/hts-specs/VCFv4.2.pdf
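If some records really do carry several pipe-delimited IDs in RS, a small sketch to split them onto separate lines before counting (tr is just one way to do the splitting):
bcftools query -f '%INFO/RS\n' -e 'INFO/RS="."' clinvar.vcf.gz | tr '|' '\n' | wc -l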

grep regular expression not matching zero correctly

I'm having some issues with grep regular expressions. I'm trying to grep some ASCII-coded hexadecimal data where the characters are all lower case.
My grep statement is as follows:
grep -E "01[a-f0-9]{2}81[a-f0-9]0" log.log
Most of the matches in the file look OK, except there are numerous matches like the following:
010481ec070000
01b481ec070000
01508129070521
I can't work out why these strings are matching. They should not match because 81 must be followed by a hex character then a zero.
I have done some further investigation. If I place these three strings in a separate file and grep that file, I get no matches. Not quite sure what is going on here.
This is grep 2.12.
Here is part of the raw data in the file. These are all lines that have matched, and they still match after exporting LC_ALL=C:
input data : 011a81a907000b3002004070eaa3d2240fa81272011763dd0040002001
input data : 010481e1070000
input data : 010481ea070000
input data : 011a81a207000b980f0040681f2b11d2f60202dc003669ba0140006100
input data : 014681ab07002140010040d2e457f8c00494ed5e014362bf0240006101ae0500404ee311f402feb2165401c562450240005801db08044068f09ff6a6005af953008062470640004d01
input data : 010481e3070000
input data : 013081ac070016c0000040f6d963fcb4f7e8127c0103637b0140006f01bf0200408ae344fdd2043eed72018362a30240006f01
input data : 010481e4070000
input data : 011a81ad07000b5c06006064f96804901154fed2008e66ff0f4000a401
input data : 010481e5070000
input data : 014681ae070021170d004069f196134cf6a805b4000769b6034000be014e0e004092e80820da0b82fbfa000c6c5c014000bf01880a004020d9ce21f4efd40954011469a1004000ae01
input data : 011a81a607000bef0d0060d60dd6edf8f18e104e015b63d3014000da00
input data : 011a81af07000b4c0800401cfbb0184a0c28f7fa00516931024000e101
input data : 015c81a007002c12050020f2ff640028007afd00801205f70540000400280c00404f016a0a10fbd0012a00e769ff0f400018005d020040e3fabd21e00830f4d200c769d80140000300030a004042030
Try executing it with the environment variable LC_ALL=C. The locale affects the way grep interprets character ranges.
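For example, as a one-off override on the same command (same pattern and file as above):
LC_ALL=C grep -E "01[a-f0-9]{2}81[a-f0-9]0" log.log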
Assuming that the command is exactly as you say... The quotes are right, there's no filename glob going on before grep gets the arguments, you don't have {0} instead of 0, etc....
I wonder if the -a (treat binary file as text) is the culprit. Binary output could be processed by the terminal. (That's how we change colors or do curses positioning or whatnot.)
What if you had binary in there that erased part of the line? Say control-H's...
What happens if you pipe the grep output through od -c (or perhaps od -a or od -t a, if you have it)?
What happens if you store the output in a file, pull out just one such line with grep, and look at it with od?
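For instance, something like this (od -c spells out control characters such as backspaces literally):
grep -E "01[a-f0-9]{2}81[a-f0-9]0" log.log | od -c | less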

How can I write data from txt file to database?

If I have a txt file with a certain number of rows and columns (number of columns unknown at the beginning; columns are separated by tabs), how can I export the data into the database? I have managed to iterate through the first row to count the number of columns and create a table accordingly, but now I need to go through each row and insert the data into the respective columns. How can I do that?
Example of the txt file:
Name Size Population GDP
aa 2344 1234 12
bb 2121 3232 15
... ... .. ..
.. .. .. ..
The table has been created:
CREATE TABLE random (id INT, Name char(20), Size INT, Population INT, GDP INT);
The difficult part is reading in the text fields. According to your definition, the field titles are separated by spaces. Is this true for the text fields?
A generic process is:
Create an SQL CREATE statement from the header text.
Execute the SQL statement.
While reading a line of text doesn't fail, do:
    Parse the text into variables.
    Create an SQL INSERT statement using the field names and values from the variables.
    Execute the SQL statement.
End-While
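A minimal shell sketch of that loop, assuming the tab-separated textfile.txt from the question with the four example columns, the random table already created, and a PostgreSQL database named mydb (the database name is just a placeholder; note that no quoting or escaping of the values is done here):
tail -n +2 textfile.txt | while IFS=$'\t' read -r name size population gdp; do
    echo "INSERT INTO random (Name, Size, Population, GDP) VALUES ('$name', $size, $population, $gdp);"
done | psql -d mydb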
Another solution is to convert the TXT file into tab or comma separated fields. Check your database documentation to see if there is a function for loading files and also discover the characters used for separating columns.
If you need specific help, please ask a more specific or detailed question.
Using PostgreSQL's COPY command, something like:
COPY random FROM 'filename' WITH DELIMITER E'\t';
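COPY FROM a file is read on the database server and usually needs elevated rights; if the file sits on the client, psql's \copy meta-command is the usual alternative. A sketch, with mydb and the file names as placeholders (the header row is stripped first, and the id column is simply left NULL):
tail -n +2 textfile.txt > data.tsv
psql -d mydb -c "\copy random (Name, Size, Population, GDP) FROM 'data.tsv'"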
Something like this might work.
The basic idea is to use print statements to transform each line into an SQL command; you can then execute these commands with an SQL command interpreter.
sed -E "s/^([^ \t]+)/'\1'/; s/[ \t]+/,/g" textfile.txt | awk 'NR != 1 {print "INSERT INTO random (Name, Size, Population, GDP) VALUES (" $0 ");"}' > sqlcommands.txt
For the unknown number of columns, this might work:
sed -E "s/^([^ \t]+)/'\1'/; s/[ \t]+/,/g" textfile.txt | awk 'NR != 1 {print "INSERT INTO random VALUES (ID," $0 ");"}' > sqlcommands.txt
Replace ID with the id value you need, but you will have to run it separately for each ID value.
I work with Sybase, where the "bcp" utility does this. A quick google on "postgres bcp" brings up this:
http://lists.plug.phoenix.az.us/pipermail/plug-devel/2000-October/000103.html
I realize it's not the best answer, but it's good enough to get you going, I hope.
Oh, and you may need to change your text format to make it comma- or tab-delimited. Use sed for that.

Finding Duplicates (Regex)

I have a CSV containing a list of 500 members with their phone numbers. I tried diff tools but none of them seem to find duplicates.
Can I use a regex to find duplicate rows by members' phone numbers?
I'm using TextMate on Mac.
Many thanks.
What duplicates are you searching for? The whole lines or just the same phone number?
If it is the whole line, then try this:
sort phonelist.txt | uniq -c | sort -n
and at the bottom you will see all the lines that occur more than once.
If it is just the phone number in some column, then use this:
awk -F ';' '{print $4}' phonelist.txt | sort | uniq -c | sort -n
Replace the 4 with the number of the column that holds the phone number, and the ';' with the actual separator used in your file.
Or give us a few example lines from this file.
EDIT:
If the data format is: name,mobile,phone,uniqueid,group, then use the following:
awk -F ',' '{print $3}' phonelist.txt | sort | uniq -c | sort -n
in the command line.
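If you only want to see the numbers that occur more than once (rather than counts for everything), sort plus uniq -d is a handy variation on the same idea, using the same example layout:
awk -F ',' '{print $3}' phonelist.txt | sort | uniq -d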
Yes. For one way to do it, look here. But you would probably not want to do it this way.
You can simply parse this file and check which rows are duplicated. I think a regex is the worst solution for this problem.
What language are you using? In .NET, with little effort you could load the CSV file into a DataTable and find/remove the duplicate rows. Afterwards, write your DataTable back to another CSV file.
Heck, you can load this file in to Excel and sort by a field and find the duplicates manually. 500 isn't THAT many.
Use Perl.
Load the CSV file into an array, match the column you want to check for duplicates (the phone numbers), and store those values in another array; then check for duplicates in that array using:
my %seen;
my @unique = grep { !$seen{$_}++ } @array2;
After that, all you need to do is loop over the unique array (the phone numbers), and inside that loop iterate over array #1 (the lines). Compare each line's phone number against the unique array, and if it matches, output that line to another CSV file.
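If all you really need are the rows whose phone number has already appeared, a short awk one-liner in the same spirit may be simpler; a sketch assuming the phone number is in column 3 of a comma-separated file (members.csv and duplicates.csv are placeholder names):
awk -F ',' 'seen[$3]++' members.csv > duplicates.csv
This prints a line only when its column-3 value has been seen before, so the output holds the second and later occurrences of each phone number.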