NGS Analysis: How can I extract the different variants in two VCF files - compare

I am currently working on my thesis and I am trying to analyze the results of NGS sequencing Illumina. I am not really familiar with bioinformatics and in this part of my project, I am trying two compare two vcf files corresponding to the results of healthy tissue and tumor tissue. I want to compare these vcf files and remove their similarities. More specifically I want to remove the information of the healthy tissue from the tumor one. Have you any suggestions on which tool I should use or any way that I can do my analysis? If you can help me I would be more than thankful. Thank you in advance!

I understand your problem. First thing I would recommend is to use Unix software (I don't know which OS you're running) called VCFtools. It's pretty simple to use. But if You want to do all the processing with, for example python, you can use the pandas library for python which helps to process data in column format or PyVCF library, which is a parser for VCF files. I can help you more if you can provide some example data you're processing.

Related

Converting RTF tables into xml

I was given the task to convert great amount of RTF tables into XML ones (around or way more than 100.000), but I have no idea how to even start it and i cannot get help from the lead developer, because ironically he had never written a line of code.
I was thinking about c++ as I need t to be fast, but I'm open to any ideas.
What I need is some information I can start the project with or any library/program I could use for my help, thank you.
EDIT: I have XSD schemas to work with.
Found the solution after looking for a while. I can use LibreOffice to save it as html or other various forms that will keep the table as it is and also give a clear code i can pull an XSD on to make it valid also.

Implementing Data Frames in OCaml

I have been learning OCaml on my own and I've been really impressed with the language. I wanted to develop a small machine learning library for practice but I've been presented with a problem.
In Python one can use Pandas to load data files then pass it to a library like Scikit-Learn very easily. I would like to emulate the same process in OCaml. However, there doesn't seem to be any data frames library in OCaml. I've checked 'ocaml-csv' but it doesn't really seem to be doing what I want. I also looked into 'Frames' from Haskell but it uses TemplateHaskell but I believe a simpler way to do the same thing should be possible if Pandas can simply load the data file into memory without compile-time metaprogramming.
Does anyone know how data frames are implemented in Pandas or R, a quick search on Google doesn't seem to return useful links.
Is it possible to use a parser generator such as Menhir to parse CSV files? Also, I'm unsure how static typing works with data frames.
Would you have a reference about the format of data frames? It may not be so hard to add to ocaml-csv if CSV is the underlying representation. The better is to open an issue with a request and the needed information.

How to write to, edit, and retrieve specific cells from an Excel doc with C++?

Basically, I want to be to be able to pass data between Excel cells and
my C++ program. I don't have any experience in Excel/C++ interactions and I haven't been able to find a coherent explanation or documentation on any websites. If someone could link me some references or provide one themselves it would be much appreciated. Thanks.
If this is for a Windows system, you could always use one of the available managed Excel libraries, such as OfficeWriter or Aspose.
There also might be similar libraries specifically for c++, I know we (OfficeWriter) used to make one.
Edit: Looks like there are a few out there, like LibXL and BasicExcel.
If the application will run on an end user machine with Excel installed, you can easily use the Excel interop and keep Excel hidden.
In addition to LibXL and BasicExcel mentioned by smoore, there is:
ExcelFormat Library is an improved version of the BasicExcel library and will allow you to read and write simple values. It is free.
xlslib will also read and write simple values, I have not tried it tho. It is also free.
Number Duck, is a commercial library that I have written, It supports reading and writing values, formulas and pictures. The website has examples of how to use the features.

How to Show C++ Results in Excel

I am trying to create C++ code that allows User Input in selecting a variety of fields, then it will calculate many different angles and show the users the results, as well as a graph.
However, it has been suggested to us by our lecturer that it may be a good idea to write the code to these calculations etc in C++, then input the results into Excel.
Does anyone have any idea how to do this? Literally looking for a way for the user to fill in the required values on C++ and then to be AUTOMATICALLY taken to the excel file to show the results in the table and graph format.
If this is not possible, is there a way to display the results in the table and graph format through C++?
Thanks very much in advance
Excel provides COM interface which you can use from your C++ application.
This can be done in the way described in this article:
http://support.microsoft.com/kb/216686
This link might also be useful:
http://www.codeproject.com/Articles/10886/How-to-use-Managed-C-to-Automate-Excel
I think the second link would be better for you as its more of a step by step guide which should help you to workout the answer.
Use COM Automation to automate excel.
The best way to do this is to use the vole library by Matthew Wilson at
http://vole.sourceforge.net/
Take a look at the examples. I do not think there is an example for excel, but there is one for microsoft word at http://www.codeproject.com/KB/COM/VOLE_word.aspx
I have used vole in the past, and it makes it a whole lot easier

library for doing diffs

I've been tasked with creating a tool that can diff and merge the configuration files for my company's product. The configurations are stored as either XML or URL-encoded strings. I'm looking for a library, preferably open source with a license compatible with commercial software, that can do these diffs. Our app is written in C++, so C++ libraries would be best, but I'm willing to look at libraries that are C#-specific since I can write a wrapper that exposes it to C++ via COM. Three-way diffs would be ideal, but two-way is acceptable. If it has an understanding of XML, that would also be a plus (since XML nodes can be reordered without changing the document, etc). Any library suggestions? Should I even consider writing my own diff tools in the hopes of giving it semantic knowledge of our formats?
Thanks to this similar question, I've already discovered this google library, which seems really great, but I'm still looking for other options. It also seems to be able to output the diffs in HTML format (using the <ins> and <del> tags that I didn't know existed before I discovered it), which could be really handy, but it seems to be a unified diff only. I'm going to need to display the results in a web browser, and probably have to build an interface for doing the merges in the browser as well. I don't expect a library to be able to help with these tasks, but it must produce output in a format that is amenable to me building this on top of it. I'm currently envisioning something along the lines of TortoiseMerge (side-by-side diffs, not unified), except browser-based. Any tips/tricks/design ideas on how to present this would be appreciated too.
Subversion comes with libsvn_diff and libsvn_delta licensed under Apache Software License.
Here is a C++ library that can diff what the author calls semistructured data. It deals nicely with HTML and XML. Since your data is XML it would make a lot of sense to use this instead of plain text diff. This is especially the case when the files are machine generated.
I am currently trying to use this library to build a tool that diffs Visual Studio project files. These are basically XML files and using a plain diff tool like Winmerge is too painful because Visual Studio pretty much mucks up the whole file by crazy reordering. The idea is to do some kind of a structured diff to address the problem.
For diffing the XML I would propose that you normalize it first: sort all the elements in alphabetic order, then generate a stream of tokens/xml that represents the original document but is independent of the original formatting. After running the diff, parse the result to get a tree containing what was added / removed.