I am currently working on my thesis and I am trying to analyze the results of Illumina NGS sequencing. I am not really familiar with bioinformatics, and in this part of my project I am trying to compare two VCF files corresponding to the results of healthy tissue and tumor tissue. I want to compare these VCF files and remove their similarities. More specifically, I want to remove the variants of the healthy tissue from the tumor file. Do you have any suggestions on which tool I should use, or any way I can do my analysis? If you can help me I would be more than thankful. Thank you in advance!
I understand your problem. The first thing I would recommend is a Unix tool (I don't know which OS you're running) called VCFtools. It's pretty simple to use. But if you want to do all the processing with, for example, Python, you can use the pandas library, which helps with processing data in column format, or the PyVCF library, which is a parser for VCF files. I can help you more if you can provide some example data you're processing.
I need to increment/add/renumber numbers (BibTeX keys) selected using regex over several hundred TeX files, maintaining the sequence from one file to the next, when sorted in alphanumeric order.
Files:
latex-01.tex
latex-02.tex
latex-03.tex
etc
Each file contains something like:
Text ... [bibkey01a] ...
More text [bibkey02] ...
I know it is easily possible to do this on one file. I have found several similar pages on Stack Overflow and other forums, but all of them deal with only one file at a time.
I could open each file, increment/add/renumber the numbers using Text Pastry or Sublime-Evaluate, manually carry over the proper value to the next file, and repeat the procedure for all the files.
That is possible, but a daunting task when one has several hundred related files whose values need to be renumbered in a continuous, related way. Also, it would be quite easy to make a mistake and carry over the wrong number.
How can I automatically increment/add/renumber numbers in Sublime Text 3 over many related files in a continuous way?
It seems Sublime Text 3 plus extensions cannot do what I need at the moment.
Of course I can do it with a script. I believe Emacs can do it too, using helm-swoop and wgrep, and then a replace expression that contains Elisp code.
Hi guys, I am learning to program with MPI and I came across this question.
Let's say that in the current working directory I have 10 files. Each file contains a column of numbers.
I want to divide the work among all processors, so for example if I use, say, two nodes, I want node 1 to read the first 5 files and the second node to read the rest.
Thank you for any help.
There are no metadata operations in MPI-IO aside from opening/creating a file or deleting it. I suppose it was hard to standardize over Windows, Unix, and, I don't know... VAX-y styles back in the old days?
The nice thing about MPI is that it provides a good basis for libraries. Write an "MPI-IO metadata" library... and share it with us!
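As for the splitting itself, the usual approach is a static block decomposition by rank: each process computes which slice of the (sorted) file list is its own. A small sketch; the mpi4py usage in the comment is an assumption (any MPI binding exposing rank and size works the same way):

```python
def files_for_rank(files, rank, size):
    """Contiguous block decomposition: split len(files) items over `size`
    ranks as evenly as possible; the first `extra` ranks get one more file."""
    n = len(files)
    base, extra = divmod(n, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return files[start:stop]

# With mpi4py, each rank would do something like:
#   from mpi4py import MPI
#   comm = MPI.COMM_WORLD
#   mine = files_for_rank(sorted(glob.glob("*.txt")),
#                         comm.Get_rank(), comm.Get_size())
```

Every rank runs the same code on the same sorted list, so no communication is needed to agree on the split.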
I've just started learning the Windows APIs and C++ programming.
I was thinking about starting a personal project (to improve my coding and to help me understand the Windows APIs better), and I've decided to write a command-line ("cmd") file renamer that basically takes:
1) a path
2) a keyword
3) the desired format
4) versioned or not (i.e., numbered; if you had 20 episodes of the same show, you wouldn't want to truncate the episode number)
5) special cases to delete (like when you're downloading a torrent, files have a [309u394] attached to the name, and most of the time an initial [WE-RIP-TV-SHOWS-HDTV-FANSUBS-GROUPS-ETC])
I am building the logic as follows:
the program takes the path (input 1),
performs a full file indexing, then compares the files found against the keyword given (input 2) (use regex?),
reformats the file name (inputs 3, 4, 5),
and saves the file name.
Questions:
A) Is my logic flow proper? Any suggestions to improve it?
B) Should I use regex to check against the file name, keyword, and desired format? (I'm not good with regex yet.) I mean, is it the best way to perform the huge number of comparisons?
Regular expressions should do the trick. You could also use the Boost library; it has some really neat facilities, including Boost.Regex, which is probably faster than the functions you'll find around (:
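To make the "special cases to delete" step concrete, here is the idea in Python (the same patterns work with std::regex or boost::regex in C++). The sample file name is invented for illustration:

```python
import re

def clean_name(name):
    """Strip bracketed release tags like [GROUP] or [a1b2c3] and tidy
    the leftover whitespace."""
    name = re.sub(r"\[[^\]]*\]", "", name)    # drop every [...] tag
    name = re.sub(r"\s{2,}", " ", name)       # collapse doubled spaces
    name = re.sub(r"\s+(?=\.)", "", name)     # no stray space before the extension
    return name.strip()

print(clean_name("[SUBGROUP] My Show - 07 [a1b2c3].mkv"))
# prints "My Show - 07.mkv"
```

Note the episode number survives because only bracketed runs are removed; handling "versioned or not" (input 4) would be a separate pattern.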
I've got two directories containing ~20 GB of music files (mostly mp3, some ogg), and I would like to detect all duplicate songs. There are two complicating factors:
A song may have different filenames in the two directories.
Two files containing the same song may have different ID3 tags and thus have different checksums.
What is a good approach to solving this?
The way I have gone about this in the past is to use genpuid, which comes from MusicIP. This closed-source software creates an audio fingerprint of a file regardless of format, ID3 tags, checksum, etc.
More information can be found here.
This should ensure the greatest number of positive duplicate matches and minimize false positives. It can also correct incorrect ID3 tags.
Here's what I would do (or have done before)...
Load all songs into iTunes (bear with me).
(Note: if you can just use iTunes here, then stop... I assume your list of dupes is long and unmanageable.)
Delete all songs, sending them to the trash can; this way you get rid of the directory structure.
Obviously, don't "empty trash". Rescue the songs to a folder on your desktop.
Use software like MediaMonkey, Dupe Eliminator, or even iTunes itself to identify the duplicates. (Dupe Eliminator is good in that it checks a varying number of factors - artist, length, file size and whatnot - and guesses what is a dupe and what isn't.)
Reload into iTunes, this time checking "Auto arrange songs", which will drop your new, dupe-less list into a nice by-artist-by-album arrangement.
... voila! (or if you read digg: "...profit!")
/mp
If you have a library that can parse the files, you can run a hash on the audio data alone. This will not help you if the song is a different rip or has been recompressed/transcoded/etc.
Are the ID3/OGG-equiv artist and song metatags accurate? If they are, you could use those.
Edit: If they're not, perhaps they could be made to be... If you're only dealing with whole albums, there are several tools that will get all the tag data based on the number of tracks and their lengths.
If you're dealing with mixes of albums and single files, it gets more complicated.
I'm sure there are more elegant solutions out there, but if the audio data is equivalent, then stripping the ID3 tags and hashing should do the trick. After hashing, you can put the ID3 tags back if you like.
Perhaps the Last.fm API would be useful. It includes a track.getInfo call which returns XML including the track's length, artist name, track number, etc. You could compare tracks and see if they have more than N fields equal and if so, assume they're the same track.
I have no idea about whether they're going to be OK with you submitting API requests for 40 GB of music, though.
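The "more than N fields equal" test itself is simple once you have the metadata as dictionaries. A sketch; the field names and threshold are made up for illustration, not taken from the Last.fm response format:

```python
def probably_same(a, b, min_matches=3):
    """Treat two tracks as the same song if at least `min_matches` of the
    compared metadata fields are present and agree."""
    fields = ("artist", "title", "duration", "track_number")
    matches = sum(1 for f in fields
                  if a.get(f) is not None and a.get(f) == b.get(f))
    return matches >= min_matches
```

The `is not None` guard keeps two tracks with missing metadata from matching on absent fields.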
How about something like this: find a library to get an MP3's length as well as a pointer to the audio data (it looks like there are a couple of libraries out there that can do this), do a first-pass filter based on song lengths, and for the songs that have matching lengths, checksum their audio data. This is similar to this script for finding duplicate files/images.
Some adaptation of ffTES has worked great for me for a very similar task.
I was faced with the same problem, so I wrote a command-line program that tries to detect similar audio files by comparing acoustic fingerprints: https://github.com/derat/soundalike
It uses the fpcalc utility from Chromaprint to generate the fingerprints, and then builds a lookup table to find possible matches before comparing fingerprints more rigorously.
It worked pretty well when I ran it against my music library, but there are various flags to tune its behavior if needed. If it works for you (or if it doesn't), let me know!