Detecting duplicate music files - mp3

I've got two directories containing ~20 GB of music files (mostly mp3, some ogg), and I would like to detect all duplicate songs. There are two complicating factors:
A song may have different filenames in the two directories.
Two files containing the same song may have different ID3 tags and thus have different checksums.
What is a good approach to solving this?

The way I have gone about this in the past is to use genpuid, which comes from MusicIP. This closed-source software creates an acoustic fingerprint of a file regardless of format, ID3 tags, checksum, etc.
More information can be found here.
This should maximize the number of true duplicate matches while minimizing false positives. It can also correct wrong ID3 tags.

Here's what I would do (or have done before)...
Load all songs into iTunes (bear with me).
(Note: if iTunes alone can handle your duplicates at this point, then stop here ... I'm assuming your list of dupes is long and unmanageable.)
Delete all songs, sending them to the trash can; this way you get rid of the directory structure.
Obviously, don't "empty trash". Rescue the songs to a folder on your desktop.
Use software like MediaMonkey, Dupe Eliminator or even iTunes itself to identify the duplicates. Dupe Eliminator is good in that it checks a number of factors (artist, length, file size and whatnot) and guesses what is a dupe and what isn't.
Reload into iTunes, this time checking "Auto arrange songs", which will drop your new, dupe-less list into a nice by-artist-by-album arrangement.
... voila! (or if you read digg: "...profit!")
/mp

If you have a library that can parse the files, you can run the hash on the audio data. This will not help you if the song is a different rip or has been recompressed/transcoded/etc.

Are the ID3 (or Ogg equivalent) artist and song metadata tags accurate? If they are, you could use those.
Edit: If they're not, perhaps they could be made to be... If you're only dealing with whole albums, there are several tools that will get all the tag data based on the number of tracks and their lengths.
If you're dealing with mixes of albums and single files, it gets more complicated.

I'm sure there's more elegant solutions out there - but if the audio data is equivalent, then stripping the ID3 tags and hashing should do the trick. After hashing, you can put the ID3 tags back if you like.
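If the files are plain MP3s, here is a minimal Python sketch of that strip-and-hash idea (it assumes only ID3v1/ID3v2 tags are present; APE tags or an ID3v2 footer would need extra handling):

```python
import hashlib

def audio_hash(path):
    """Hash an MP3's audio bytes, skipping the ID3v2 (front) and ID3v1 (back) tags."""
    with open(path, "rb") as f:
        data = f.read()
    start, end = 0, len(data)
    # ID3v2: 10-byte header; bytes 6-9 hold the tag size as a 28-bit synchsafe integer.
    if data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        start = 10 + size
    # ID3v1: fixed 128-byte block at the end of the file, starting with "TAG".
    if end >= 128 and data[-128:-125] == b"TAG":
        end -= 128
    return hashlib.sha1(data[start:end]).hexdigest()
```

Group the files by the returned digest; any digest shared by more than one path is a byte-identical audio stream with (possibly) different tags.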

Perhaps the Last.fm API would be useful. It includes a track.getInfo call which returns XML including the track's length, artist name, track number, etc. You could compare tracks and see if they have more than N fields equal and if so, assume they're the same track.
I have no idea whether they're going to be OK with you submitting API requests for 40 GB of music, though.
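If you wanted to try this, here is a hedged sketch of what the lookup might look like (you need your own API key, and the exact JSON field names should be checked against the Last.fm docs):

```python
import requests

API_KEY = "your_lastfm_api_key"  # assumption: you have registered for a key

def track_info(artist, title):
    """Fetch basic metadata for a track via Last.fm's track.getInfo method."""
    resp = requests.get(
        "https://ws.audioscrobbler.com/2.0/",
        params={
            "method": "track.getInfo",
            "api_key": API_KEY,
            "artist": artist,
            "track": title,
            "format": "json",  # ask for JSON instead of the default XML
        },
        timeout=10,
    )
    resp.raise_for_status()
    track = resp.json().get("track", {})
    return {
        "name": track.get("name"),
        "artist": track.get("artist", {}).get("name"),
        "duration_ms": int(track.get("duration") or 0),
    }

def probably_same(a, b, min_matches=2):
    """Treat two tracks as duplicates if at least min_matches fields agree."""
    return sum(1 for k in a if a[k] and a[k] == b[k]) >= min_matches
```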

How about something like this: find a library to get the mp3's length as well as a pointer to the audio data (looks like there are a couple libraries out there that can do this), do a first pass filter based on song lengths, and for the songs that have matching lengths checksum their audio data. Similar to this script for finding duplicate files / images.
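As a concrete sketch of that first pass, using the mutagen library for the lengths (just one of the libraries that can do this; any tag library exposing track length would work):

```python
from collections import defaultdict
from pathlib import Path

from mutagen import File as MutagenFile  # pip install mutagen

def length_buckets(root):
    """First pass: bucket MP3/Ogg files by track length rounded to the nearest second."""
    buckets = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.suffix.lower() not in (".mp3", ".ogg"):
            continue
        audio = MutagenFile(str(path))
        if audio is not None and audio.info is not None:
            buckets[round(audio.info.length)].append(path)
    # Only buckets with more than one file can contain duplicates.
    return {length: paths for length, paths in buckets.items() if len(paths) > 1}
```

The second pass would then checksum the audio data (tags stripped) within each bucket, as described in the other answers.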

Some adaptation of ffTES has worked great for me for a very similar task.

I was faced with the same problem, so I wrote a command-line program that tries to detect similar audio files by comparing acoustic fingerprints: https://github.com/derat/soundalike
It uses the fpcalc utility from Chromaprint to generate the fingerprints, and then builds a lookup table to find possible matches before comparing fingerprints more rigorously.
It worked pretty well when I ran it against my music library, but there are various flags to tune its behavior if needed. If it works for you (or if it doesn't), let me know!
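For a rough idea of what the fingerprint comparison involves (this is a simplified sketch, not how soundalike itself does it), fpcalc's raw output can be compared bit by bit:

```python
import subprocess

def fingerprint(path):
    """Run Chromaprint's fpcalc and return the raw fingerprint as a list of 32-bit ints."""
    out = subprocess.run(
        ["fpcalc", "-raw", path], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        if line.startswith("FINGERPRINT="):
            return [int(x) for x in line.split("=", 1)[1].split(",")]
    raise ValueError(f"no fingerprint in fpcalc output for {path}")

def similarity(fp_a, fp_b):
    """Fraction of matching bits over the overlapping part of two fingerprints."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        return 0.0
    diff_bits = sum(bin(a ^ b).count("1") for a, b in zip(fp_a[:n], fp_b[:n]))
    return 1.0 - diff_bits / (32.0 * n)
```

A pair scoring above some threshold (say 0.9, to be tuned) is a candidate duplicate; a real tool also handles time offsets between the two recordings, which this sketch ignores.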

Related

Not sure what I'm looking for (matrix, DB-like structure) to organize files by tags

I was messing around organizing my music files when I asked myself why neither Windows nor Linux offers a way to organize a folder by custom tags in a database-like manner rather than hierarchically.
The problem I wanted to solve is the following:
I have music files:
A titled "tempest" from Beethoven, classical music in a piano only version.
B titled "whatever" from Mozart, classical music orchestral
D titled "one winged angel" from Uematsu, classical style, game ost, orchestral
C titled "one winged angel" same as before, violin only, cover from Taylor Davis.
And whatever "main" information i use for grouping, makes listing files by any other category immpossible.
Hence i whished to save files in an hidden folder with a simple increasing number.format, and have a program in which i can add files, add categories, search by tags, and end up with a list of the files i want. E.g. today i want to listen to all piano only pieces independently of their composer-time period.
I started making a structure of vectors containing vectors (aka matrix) but indexing lines and column by string started getting complicated when i want to remove a column.
And searching files by tag would require me to have each tag as an object knowing all files that use it, and it starts becomming more similar to a 3d matrix.
I though it would be better to think of this as a database, started with sqllite but ended with the problem of being unable to remove columns (i know i can create a copy etcc, but i wanted to avoid messy workarounds).
Also an sql-like database wouldn't allow me to have an area dedicated to a list of random tags for each file without a definite category.
Is there any existing library that rather then working as an sql database offers me something similar to a search/insert optimized matrix for strings? I don't think i was the first one thinking about that, someone must have done something similar.
This is very similar to what I want to achieve (strictly speaking about functionality), but rather than having only a bunch of random tags, I'd like to have some categories AND a set of random tags.
The problem with random tags alone is that you can't use the same word when it refers to different things. For example, if the title of a piece is A and there's a film named A with a piece titled B, filtering on A in the mass of tags would return both, while with categories I could filter specifically on pieces titled A. But the loose set of additional, uncategorized tags is useful too, for information you don't want to fill in for most files and that would take up pointless space in a standard database.
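For what it's worth, the usual relational answer to "I can't remove columns" is to not store tags as columns at all, but as rows in a many-to-many link table; a category then becomes just an attribute of a tag (and can be NULL for the free-form ones). A minimal SQLite sketch with illustrative table and column names:

```python
import sqlite3

conn = sqlite3.connect("music_tags.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS files (
    id   INTEGER PRIMARY KEY,           -- the "simple increasing number" filename
    path TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS tags (
    id       INTEGER PRIMARY KEY,
    category TEXT,                      -- NULL for uncategorized "random" tags
    value    TEXT NOT NULL,
    UNIQUE (category, value)
);
CREATE TABLE IF NOT EXISTS file_tags (  -- many-to-many link between files and tags
    file_id INTEGER REFERENCES files(id),
    tag_id  INTEGER REFERENCES tags(id),
    PRIMARY KEY (file_id, tag_id)
);
""")
conn.commit()

# e.g. "all piano-only pieces, regardless of composer or period"
rows = conn.execute("""
    SELECT f.path
    FROM files f
    JOIN file_tags ft ON ft.file_id = f.id
    JOIN tags t       ON t.id = ft.tag_id
    WHERE t.category = 'instrumentation' AND t.value = 'piano only'
""").fetchall()
```

Adding or removing a category is then just inserting or deleting rows, with no ALTER TABLE involved.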

How to read/restore big data file (SEGY format) with C/C++?

I am working on a project which needs to deal with large seismic data in SEGY format (from several GB to TB). The data represents a 3D underground structure.
The data structure is like:
1st trace, 2,3,5,3,5,....,6
2nd trace, 5,6,5,3,2,....,3
3rd trace, 7,4,5,3,1,....,8
...
What I want to ask is: in order to read and deal with the data fast, do I have to convert the data into another form, or is it better to read from the original SEGY file? And is there any existing C package to do that?
If you need to access it multiple times and
if you need to access it randomly and
if you need to access it fast
then load it to a database once.
Do not reinvent the wheel.
When dealing with data of that size, you may not want to convert it into another form unless you have to, though some software does do just that. I found a list of free geophysics software on Wikipedia that looks promising; many of the packages are open source and read/write SEGY files.
Since you are a newbie to programming, you may want to consider whether the Python library segpy suits your needs rather than a C/C++ option.
Several GB is rather medium-sized, if we are talking about poststack data.
You may use SEGY and convert on the fly, or you may invent your own format; it depends on what you need to do. Without changing the SEGY format, it is enough to create indexes to the traces. If the SEGY is saved as inlines, access through inlines is faster, although crossline access is not too bad either.
If it is 3D seismic, the best way to get equally quick access to all inlines/crosslines is to have your own format based on bins of, e.g., 8x8 traces; by loading whole bins and then selecting traces, access time can be very quick, 2-3 seconds. Or you may use an SSD, or have 2.5x as much RAM as the size of your SEGY.
To quickly access time slices you have two options: 3D bins, or a second file stored as time slices (the quickest way). I did something like that 10 years ago; access time to a 12 GB SEGY was acceptable, 2-3 seconds in all three directions.
SEGY in database? Wow ... ;)
The answer depends upon the type of data you need to extract from the SEG-Y file.
If you need to extract only the headers (Textual header, Binary header, Extended Textual File headers and Trace headers), they can be easily extracted from the SEG-Y file by opening the file as binary and reading the relevant information from the respective locations specified in the data exchange formats (rev2). The extraction might depend upon the type of data (post-stack or pre-stack). Also, some headers might require conversion from one format to another (e.g. Textual Headers are mostly encoded in EBCDIC format). The complete details about the byte locations and encoding formats can be read from the above documentation.
The extraction of trace data is a bit trickier and depends upon various factors like the encoding, whether the number of trace samples is given in the trace headers, etc. A careful reading of the documentation and getting to know the type of SEG-Y data you are working with will make this task a lot easier.
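To illustrate the header part, here is a rough Python sketch (the question mentions C/C++, but the byte layout is the same either way). It assumes the standard big-endian layout of a 3200-byte EBCDIC textual header followed by a 400-byte binary header; the byte positions are the conventional ones and should be double-checked against the documentation above:

```python
import struct

def read_segy_headers(path):
    """Read the textual header and a few common binary-header fields of a SEG-Y file."""
    with open(path, "rb") as f:
        # 3200-byte textual header, traditionally EBCDIC-encoded (code page cp500/cp037).
        textual = f.read(3200).decode("cp500", errors="replace")
        # 400-byte binary file header, big-endian.
        binary = f.read(400)
        # Conventional positions (file bytes, 1-indexed) relative to the binary header:
        sample_interval_us = struct.unpack(">h", binary[16:18])[0]  # bytes 3217-3218
        samples_per_trace  = struct.unpack(">h", binary[20:22])[0]  # bytes 3221-3222
        format_code        = struct.unpack(">h", binary[24:26])[0]  # bytes 3225-3226
    return textual, sample_interval_us, samples_per_trace, format_code
```

Trace data then follows as repeated 240-byte trace headers plus samples, whose size and decoding depend on the format code and sample count read above.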
Since you are working with the extracted data, I would recommend using existing libraries (segpy is one of the best Python libraries I have come across). There are also numerous freely available SEG-Y readers; a very nice list has already been mentioned by Daniel Waechter, and you can choose whichever one suits your requirements and the file format supported.
I recently tried to do something similar using C++ (although it has only been tested on post-stack data). The project can be found here.

How to generate a random word from a real language

How can I generate a random word from a real language?
Does anybody know of an internet API with this functionality?
For example, I send an HTTP request to 'ht_tp://www.any...api.com/getword?lang=en' and I get the response 'Town'. Or 'Fast'. Or 'Received'... Or I send an HTTP request to 'ht_tp://www.any...api.com/getword?lang=ru' and I get the response 'Ходить'. Or 'Шапка'. Or 'Отправлено'... Any form (noun, adjective, verb, etc.) of a word from any language would do.
I found the resource 'http://www.randomlists.com/random-words', but it is not in JSON format, it is English only, and there is no guarantee it will keep working long term.
Any ideas, please?
See this answer: https://stackoverflow.com/questions/824422/can-i-get-an-english-dictionary-word-list-somewhere Download a word dictionary, stick it in a database, and fetch a random record (or read a random line from the file) each time. This way you don't depend on a 3rd-party API, and you can extend it to all the languages you can find word lists for.
You can download the OpenOffice dictionaries. They come as extensions (.oxt), which are nothing more than ZIP files; you can open them with 7-Zip or the like. Inside you will find lots of files; the interesting ones for you are the *.dic files. They will also contain resolutions or number words.
When you encounter something like abandon/LdS, get rid of the /LdS part; it is used by hunspell.
Take these *.dic files, use their names as keys, put them into a database, and pick a random word from there for a given language code.
Update
Older, but easier to access, are the archived hunspell dictionaries from OpenOffice.
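A minimal Python sketch of picking a random word from one of those .dic files (the file encoding varies per language and is declared in the matching .aff file, so treat the encoding argument as an assumption):

```python
import random

def random_word(dic_path, encoding="utf-8"):
    """Pick a random word from a hunspell/OpenOffice .dic file."""
    with open(dic_path, encoding=encoding, errors="replace") as f:
        lines = f.read().splitlines()
    # The first line of a .dic file is usually the entry count; skip it if numeric.
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]
    entry = random.choice(lines)
    # Entries look like "abandon/LdS"; the part after '/' is hunspell affix data.
    return entry.split("/", 1)[0]
```

Wrap that in a tiny web handler keyed on language code and you have the getword?lang=xx endpoint from the question.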
This question can be viewed in two ways and therefore I give two answers:
To collect words, I would run a spider on websites with known language (Wikipedia is a good starting point) and strip HTML tags.
To generate words from a real language is trickier. Using statistics from the collected words, it is possible to use Markov chains to produce statistically plausible words. I have tried letter-by-letter generation, and that works poorly. It is probably a better approach to use syllable construction instead.
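A toy sketch of the letter-level version (order 2 here; the same idea applies to syllables, which as noted tends to work better):

```python
import random
from collections import defaultdict

def build_chain(words, order=2):
    """Collect letter-transition statistics from a list of real words."""
    chain = defaultdict(list)
    for w in words:
        w = "^" * order + w.lower() + "$"   # pad with start/end markers
        for i in range(len(w) - order):
            chain[w[i:i + order]].append(w[i + order])
    return chain

def generate(chain, order=2, max_len=12):
    """Walk the chain letter by letter until the end marker is produced."""
    state, out = "^" * order, []
    while len(out) < max_len:
        nxt = random.choice(chain[state])
        if nxt == "$":
            break
        out.append(nxt)
        state = state[1:] + nxt
    return "".join(out)

# words = list of real words scraped as described above
# chain = build_chain(words); print(generate(chain))
```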

Structure for storing data from thousands of files on a mobile device

I have more than 32000 binary files that store a certain kind of spatial data. I access the data by file name. The files range in size from 0-400kb. I need to be able to access the content of these files randomly and at various time points. I don't like the idea of having 32000+ separate files of data installed on a mobile device (even though the total file size is < 100mb). I want to merge the files into a single structure that will still let me access the data I need just as quickly. I'd like suggestions as to what the best way to do this is. Any suggestions should have C/C++ libs for accessing the data and should have a liberal license that allows inclusion in commercial, closed-source applications without any issue.
The only thing I've thought of so far is storing everything in an sqlite database, though I'm not sure if this is the best method, or what considerations I need to take into account for storing blob data with quick look up times (ie, what schema I'd use).
Why not roll your own?
Your requirements sound pretty simple and straightforward. Just bundle everything into a single binary file and add an index at the beginning telling which file starts where and how big it is.
30 lines of C++ code, max. Invest a good 10 minutes designing a good interface for it so you can replace the implementation if and when the need arises.
That is of course if the data is read only. If you need to change it as you go, it gets hairy fast.
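A rough sketch of that layout (in Python for brevity; the ~30-line C++ version the answer has in mind would be a direct translation), assuming read-only data:

```python
import json
import struct

def pack(bundle_path, files):
    """Write all files into one bundle: [8-byte index size][JSON index][blobs...]."""
    index, blobs, offset = {}, [], 0
    for name, path in files.items():
        with open(path, "rb") as src:
            data = src.read()
        index[name] = (offset, len(data))   # where the blob starts and how big it is
        blobs.append(data)
        offset += len(data)
    header = json.dumps(index).encode("utf-8")
    with open(bundle_path, "wb") as out:
        out.write(struct.pack("<Q", len(header)))
        out.write(header)
        for blob in blobs:
            out.write(blob)

def read_entry(bundle_path, name):
    """Random access: read a single entry without touching the rest of the bundle."""
    with open(bundle_path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        index = json.loads(f.read(header_len))
        offset, size = index[name]
        f.seek(8 + header_len + offset)
        return f.read(size)
```

Keeping the index small and reading it once at startup gives you one seek plus one read per lookup, which is about as fast as 32000 separate files would be, without the file-count overhead.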

Looking for Ideas: How would you start to write a geo-coder?

Because the open source geo-coders cannot begin to compare to Google's or even Yahoo's, I would like to start a project to create a good open source geo-coder. Just to clarify, a geo-coder takes some text (usually with some constraints) and returns one or more lat/lon pairs.
I realize that this is a difficult and gargantuan task, so I am wondering how you might get started. What would you read? What algorithms would you familiarize yourself with? What code would you review?
And also, assuming you were going to develop this very agilely, what would you want the first prototype to be able to do?
EDIT: Let's set aside the data question for now. I am going to use OpenStreetMap data, along with a database of waypoints that I have. I would later plan to include other data sets as well, and I realize the geo-coder would be inherently limited by the quality of the original data.
The first (and probably blocking) problem would be: where do you get your data from? (unless you are willing to pay thousands of dollars for proprietary sets).
You could build a geocoding-api on top of OpenStreetMap (they publish their data in dumps on a regular basis) I guess, but that one was still very incomplete last time I checked.
Algorithms are easy. Good mapping data, however, is expensive. Very expensive.
Google drove their cars all over the world, collecting this data among other things.
From a .NET point of view these articles might be interesting for you:
Writing Your Own GPS Applications: Part I
Writing Your Own GPS Applications: Part 2
Writing GIS and Mapping Software for .NET
I've only glanced at the articles but they've been on CodeProject's 'Most Popular' list for a long time.
And maybe this CodePlex project which the author of the articles above made available.
I would start at the absolute beginning by figuring out how you're going to get the data that matches a street address with a geocode. Either Google had people going around with GPS units, OR they got the information from some existing source. That existing source may have been... (all guesses)
The Postal Service
Some existing maps(printed)
A bunch of enthusiastic users that were early adopters of GPS technology who were more than willing to enter in street addresses and GPS coordinates
Some government entity (or entities)
Their own satellites
etc
I guess what I'm getting at is the information was either imported from somewhere or was input by someone via some interface. As my starting point I would look at how to get that information. In an open source situation, you may be able to get a bunch of enthusiastic people to enter information.
So for my first prototype, boring as it would be, I would create a form for entering information.
Then you need to know the math for figuring out the closest distance (as the crow flies). From there, try to figure out how to include roads. (My guess is you would have to have a data point for each and every curve, where you hold the geocode location of the curve and the angle of the road on a north/south and east/west vector. You'd probably need to take incline into account, too, to get accurate road measurements.)
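For the "as the crow flies" part, that math is the haversine formula; a quick sketch:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle ("as the crow flies") distance between two lat/lon points, in km."""
    r = 6371.0  # mean Earth radius in kilometres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```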
That's just where I'd start.
But in all honesty, I wouldn't even start on this. Other programmers have done it already, I'm more interested in what hasn't already been done.
I would:
get my free raw data from somewhere like http://ipinfodb.com/ip_database.php
load it into a database, denormalizing for fast lookups
design my API
build it out as a RESTful web service
return results in varying formats: JSON, XML, CSV, raw text
The first prototype should accept a ZIP code and return lat/lon in raw text.
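A minimal sketch of that first prototype, assuming you already have a ZIP-to-coordinates table (the CSV name and columns here are hypothetical) and using Flask for the REST part:

```python
import csv

from flask import Flask  # pip install flask

app = Flask(__name__)

# Hypothetical CSV with columns: zip,lat,lon
ZIPS = {}
with open("zip_latlon.csv", newline="") as f:
    for row in csv.DictReader(f):
        ZIPS[row["zip"]] = (row["lat"], row["lon"])

@app.route("/geocode/<zip_code>")
def geocode(zip_code):
    if zip_code not in ZIPS:
        return "unknown ZIP", 404
    lat, lon = ZIPS[zip_code]
    return f"{lat},{lon}"   # raw text, as described for the first prototype

if __name__ == "__main__":
    app.run()
```

JSON/XML/CSV output formats can be layered on later by adding a format query parameter.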