Is there a way to build an easy related posts app in Django?

This has been my nightmare for the last 4 weeks:
I can't come up with a solution for a "related posts" app in Django/Python that takes the user's input and returns the existing post that most closely matches it. I've tried using LIKE statements, but they don't seem to be sensitive enough; I also need typos to be taken into account.
Is there a library that could save me from all this pain and suffering?

Well, I suppose there are a few different ways to normalize the user input to produce desirable results (although I'm not sure to what extent libraries exist for them). One of the easiest ways to get related posts is to compare the tags present on each post (granted your posts have tags). If you wanted to go another route, I would take the following steps: remove stop words from the subject, run some kind of stemmer on the remainder, and finally treat the remaining words as "tags" to compare with other posts. For the sake of efficiency, it would probably be a good idea to run these steps as a batch process over all of your current posts and store the resulting "tags". As for typos, I'm sure a multitude of spelling-corrector libraries exist (I found this one after a few seconds with Google).
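As a rough sketch of those steps (this assumes the NLTK library for stop words and stemming; the posts' precomputed tag_set attribute is a hypothetical name for the batch-computed keywords):

import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))   # requires nltk.download("stopwords") once
stemmer = PorterStemmer()

def extract_tags(text):
    # Lower-case, split into words, drop stop words, stem the rest.
    words = re.findall(r"[a-z']+", text.lower())
    return {stemmer.stem(w) for w in words if w not in STOP_WORDS}

def related_posts(query, posts, min_overlap=1):
    # Rank posts by how many stemmed keywords they share with the query.
    query_tags = extract_tags(query)
    scored = []
    for post in posts:
        overlap = len(query_tags & post.tag_set)   # post.tag_set: precomputed in a batch job
        if overlap >= min_overlap:
            scored.append((overlap, post))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [post for _, post in scored]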

Related

How to get most similar words to a document in gensim doc2vec?

I have built a gensim Doc2vec model. Let's call it doc2vec. Now I want to find the most relevant words to a given document according to my doc2vec model.
For example, I have a document about "java" with the tag "doc_about_java". When I ask for similar documents, I get documents about other programming languages and topics related to java. So my document model works well.
Now I want to find the most relevant words to "doc_about_java".
I followed the solution from the closed question How to find most similar terms/words of a document in doc2vec?, but it gives me seemingly random words; the word "java" is not even among the first 100 similar words:
docvec = doc2vec.docvecs['doc_about_java']
print doc2vec.most_similar(positive=[docvec], topn=100)
I also tried like this:
print doc2vec.wv.similar_by_vector(doc2vec["doc_about_java"])
but it didn't change anything. How can I find the most similar words to a given document?
Not all Doc2Vec modes even train word-vectors. In particular, the PV-DBOW mode (dm=0), which often works very well for doc-vector comparisons, leaves word-vectors at randomly-assigned (and unused) positions.
So that may explain why the results of your initial attempt to get a list-of-related-words seem random.
To get word-vectors, you'd need to use PV-DM mode (dm=1), or add optional concurrent word-vector training to PV-DBOW (dm=0, dbow_words=1).
(If this isn't the issue, there may be other problems in your training setup, so you should show more detail about your data source, size, and code.)
(Separately, your alternate attempt, by using doc2vec["doc_about_java"], is retrieving a word-vector for "doc_about_java" (which may not be present at all). To get the doc-vector, use doc2vec.docvecs["doc_about_java"], as in your first code block.)
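For illustration, a minimal training-and-query sketch with word-vector training enabled; the tiny corpus is a placeholder, and note that parameter and attribute names have shifted slightly across gensim versions (e.g. size vs. vector_size, docvecs vs. dv):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus: in practice, build TaggedDocument objects from your real data.
documents = [
    TaggedDocument(words=["java", "virtual", "machine", "bytecode"], tags=["doc_about_java"]),
    TaggedDocument(words=["python", "interpreter", "dynamic", "typing"], tags=["doc_about_python"]),
]

model = Doc2Vec(
    documents,
    vector_size=100,   # called "size" in very old gensim versions
    dm=0,              # PV-DBOW for the doc-vectors
    dbow_words=1,      # also train word-vectors alongside the doc-vectors
    min_count=1,
    epochs=20,
)

docvec = model.docvecs["doc_about_java"]           # model.dv[...] in gensim 4.x
print(model.wv.most_similar(positive=[docvec], topn=10))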

Cycling through URLs to download csv files

I have a list of URLs which access a webservice; each URL downloads a .csv file. I want to be able to cycle through them using a date field, which appears in a specific format in the URL itself, thereby downloading the data day by day. Access seems fairly slow, as even a manual URL entry can take a couple of minutes to complete, and I suspect the issue is at the webservice's end.
The URL is in the format:
http://web.service.com/ws/XYZ/data/?key=mysecretkeyf&field1=X&start=YYYY-MM-DD 00:00&end= YYYY-MM-DD 00:00&field=Y&format=csv
So the way I envision it (and I am keen to take advice) is to use variables for the start year, month and day fields, cycling on to the next URL as each .csv is downloaded, with the code ending when the current date is reached.
Any ideas most welcome.
The coding is really straightforward, which makes me wonder whether you are looking to code this yourself or asking if there is a service out there that would do it for you. If you are coding, I'd choose a language that works well for you. @Vivek mentioned Python, which is what I would choose as well.
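If you go the coding route, a minimal sketch along these lines would work; it assumes the requests library, and the URL template and parameter names are simply copied from the question (the key, start date, and field values are placeholders):

import datetime
import requests

BASE = "http://web.service.com/ws/XYZ/data/"
API_KEY = "mysecretkey"                      # placeholder

day = datetime.date(2015, 1, 1)              # first day you want
today = datetime.date.today()                # stop once we reach the current date

while day < today:
    nxt = day + datetime.timedelta(days=1)
    params = {
        "key": API_KEY,
        "field1": "X",
        "start": day.strftime("%Y-%m-%d 00:00"),
        "end": nxt.strftime("%Y-%m-%d 00:00"),
        "field": "Y",
        "format": "csv",
    }
    # The service is slow, so allow a generous timeout.
    resp = requests.get(BASE, params=params, timeout=600)
    resp.raise_for_status()
    with open("data_%s.csv" % day.isoformat(), "wb") as f:
        f.write(resp.content)
    day = nxt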
If you do not want to go the coding route, you could check out DownThemAll. I have used this utility for batch downloading where you have to increment numbers in parts of the URL. It may be a good non-programming solution.

How to generate a random word from a real language

How can I generate a random word from a real language?
Does anybody know of an API on the internet with this functionality?
For example, I send an HTTP request to 'ht_tp://www.any...api.com/getword?lang=en' and I get the response 'Town'. Or 'Fast'. Or 'Received'... Or I send an HTTP request to 'ht_tp://www.any...api.com/getword?lang=ru' and I get the response 'Ходить'. Or 'Шапка'. Or 'Отправлено'... Any form (noun, adjective, verb, etc.) of any word in any language.
I found the resource 'http://www.randomlists.com/random-words', but it is not in JSON format, it only covers English, and there is no guarantee it will keep working for long.
Any ideas are welcome.
See this answer: https://stackoverflow.com/questions/824422/can-i-get-an-english-dictionary-word-list-somewhere Download a word dictionary, stick it in the database, and fetch a random record (or read a random line from the file) each time. That way you don't depend on a 3rd-party API, and you can extend it to all the languages you can find word lists for.
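For illustration, a minimal sketch of the file-based variant; the file name and the one-word-per-line layout are assumptions about whatever word list you download:

import random

def load_words(path="words_en.txt"):
    # One word per line, as most downloadable word lists are laid out.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

WORDS = load_words()

def random_word():
    return random.choice(WORDS)

print(random_word())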
You can download the OpenOffice dictionaries. They come as extensions (.oxt), which are nothing more than ZIP files, so you can open them with 7-Zip or a similar tool. Inside you will find lots of files; the interesting ones for you are the *.dic files. They will also contain resolutions or number words.
When you encounter something like abandon/LdS, get rid of the /LdS part; it is affix information used by hunspell.
Take these *.dic files, use their names as keys, put them into a database, and pick a random word from there for a given language code.
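As a rough sketch of that idea (the file names, encodings, and exact .dic layout are assumptions; real hunspell dictionaries declare their encoding in the companion .aff file):

import random

def load_dic(path, encoding="utf-8"):
    words = []
    with open(path, encoding=encoding, errors="ignore") as f:
        next(f, None)                              # the first line is usually a word count
        for line in f:
            word = line.strip().split("/", 1)[0]   # drop hunspell affix flags such as /LdS
            if word:
                words.append(word)
    return words

# File names are placeholders; use whichever *.dic files you extracted.
DICTS = {
    "en": load_dic("en_US.dic"),
    "ru": load_dic("ru_RU.dic"),
}

def random_word(lang):
    return random.choice(DICTS[lang])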
Update
Older, but easier to access: the archived hunspell dictionaries from OpenOffice.
This question can be viewed in two ways, and therefore I give two answers:
To collect words, I would run a spider on websites in a known language (Wikipedia is a good starting point) and strip the HTML tags.
To generate words from a real language is trickier. Using statistics from the collected words, it is possible to use Markov chains to produce statistically plausible words. I have tried letter-by-letter generation, and that works poorly; it is probably a better approach to construct words from syllables instead.
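As a toy illustration of the Markov-chain idea, here is a letter-level version; as noted above, syllable-level states would normally give nicer results, and the seed word list is just a placeholder:

import random
from collections import defaultdict

def build_model(words):
    # Record, for each letter, which letters follow it ("^" = start, "$" = end).
    transitions = defaultdict(list)
    for word in words:
        padded = "^" + word.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            transitions[a].append(b)
    return transitions

def generate(transitions, max_len=12):
    out, state = [], "^"
    while len(out) < max_len:
        state = random.choice(transitions[state])
        if state == "$":
            break
        out.append(state)
    return "".join(out)

model = build_model(["town", "fast", "received", "walking", "letter"])
print(generate(model))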

solr PatternReplaceCharFilterFactory working unexpectedly

I am relatively new to Solr, so please forgive me if I'm missing something obvious. I have an application that allows users to search for musical artists. The indexing comes from a read-only database with correct spellings, so on the index side I have it figured out.
On the query side, however, I need to anticipate various spelling errors/differences and want to help Solr find those instances. From our old home-grown search solution, I have a list of regexes and the artists they apply to. When I was trying to translate those to Solr using the PatternReplaceCharFilterFactory, I noticed that some worked perfectly while others didn't work at all ... with seemingly no rhyme or reason between them.
For example:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="em[ei]n[ei]m" replacement="Eminem"/>
accurately captures the common misspellings of Eminem. But for the band 311:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[Tt]hree [Ee]leven" replacement="311"/>
Does not work. Another example is Nine Inch Nails:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="((nine|9).*inch.*nails\b)|(n\.? ?i\.? ?n\.?\b)" replacement="Nine Inch Nails"/>
works perfectly for finding the most common patterns for the band's name. But for Eve 6:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[Ee]ve.{0,4}([Ss]ix|6)" replacement="Eve 6"/>
Like the 311 pattern, this one does not work. Is there something fundamental I'm missing in the usage of this filter? I've tried a number of variations on the regexes mentioned above (even going so far as using literals like 'three eleven'), but still with no success. I've tried making the filter in question the only PatternReplaceCharFilterFactory in the analyzer. I also know for sure that these items are in the index correctly, because when I search for the correct spelling it returns the proper results.
Any suggestions?
Snowdall
I suspect the problem is not with your char filter, but with what comes after it in the chain, specifically the tokenizer. If you use the standard tokenizer, it will get rid of the numbers you have just put into your stream. If you don't need the text to be split into tokens, you could look at KeywordTokenizerFactory instead.
In general, the best way to troubleshoot this in Solr 4+ is the Analysis screen in the Admin Web UI. It allows you to enter your text against a particular field type and see what happens to it after each component in the analysis chain.
I would recommend using the SynonymFilter for the kind of application you describe. It allows you to provide an external file where you list words and their synonyms, like:
eminem <=> emenem
nine <=> 9
If you precede this with a LowerCaseFilter, you won't have to fuss about case normalization in your synonyms. You should be able to handle the 311 case too, as long as you don't tokenize (i.e. use a KeywordTokenizer, as Alexander Rafalovitch suggested).

Detecting duplicate music files

I've got two directories containing ~20 GB of music files (mostly mp3, some ogg), and I would like to detect all duplicate songs. There are two complicating factors:
A song may have different filenames in the two directories.
Two files containing the same song may have different ID3 tags and thus have different checksums.
What is a good approach to solving this?
The way I have gone about this in the past is to use genpuid, which comes from MusicIP. This closed-source software creates an audio fingerprint of a file regardless of format, ID3 tags, checksum, etc.
More information can be found here.
This should ensure the greatest number of positive duplicate matches while minimizing false positives. It can also fix incorrect ID3 tags.
Here's what I would do (or have done before)...
Load all the songs into iTunes (bear with me)
(Note: if you can do the whole job with iTunes here, then stop ... I assume your list of dupes is long and unmanageable)
Delete all the songs, sending them to the trash can; this way you get rid of the directory structure
Obviously, don't "empty trash". Rescue the songs to a folder on your desktop
Use software like MediaMonkey, Dupe Eliminator or even iTunes itself to identify the duplicates. Dupe Eliminator is good in that it checks a varying number of factors (artist, length, file size and whatnot) and guesses what is a dupe and what isn't
Reload into iTunes, this time checking "Auto arrange songs", which will drop your new, dupe-less list into a nice by-artist, by-album arrangement
... voila! (or, if you read Digg: "...profit!")
/mp
If you have a library that can parse the files, you can run the hash on just the audio data. This will not help you if the song is a different rip or has been recompressed/transcoded/etc.
Are the ID3/OGG-equivalent artist and song meta tags accurate? If they are, you could use those.
Edit: If they're not, perhaps they could be made to be... If you're only dealing with whole albums, there are several tools that will fill in all the tag data based on the number of tracks and their lengths.
If you're dealing with mixes of albums and single files, it gets more complicated.
I'm sure there are more elegant solutions out there, but if the audio data is equivalent, then stripping the ID3 tags and hashing what remains should do the trick. After hashing, you can put the ID3 tags back if you like.
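As a rough sketch of that idea for MP3s (this only handles the common cases: an ID3v2 block at the start and an ID3v1 block in the last 128 bytes; ID3v2 footers, APE tags, etc. are ignored for brevity):

import hashlib

def audio_hash(path):
    with open(path, "rb") as f:
        data = f.read()

    start = 0
    if data[:3] == b"ID3" and len(data) > 10:
        # ID3v2 size: 4 "sync-safe" bytes (7 bits each) following the 10-byte header.
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        start = 10 + size

    end = len(data)
    if len(data) >= 128 and data[-128:-125] == b"TAG":   # trailing ID3v1 block
        end -= 128

    return hashlib.sha1(data[start:end]).hexdigest()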
Perhaps the Last.fm API would be useful. It includes a track.getInfo call which returns XML including the track's length, artist name, track number, etc. You could compare tracks and see if they have more than N fields equal, and if so, assume they're the same track.
I have no idea whether they're going to be OK with you submitting API requests for 40 GB of music, though.
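For illustration, a minimal sketch of calling track.getInfo through the Last.fm web API's JSON variant; the API key is a placeholder, and the artist/track values would come from your files' tags:

import requests

API_KEY = "YOUR_LASTFM_API_KEY"   # placeholder

def track_info(artist, track):
    resp = requests.get(
        "http://ws.audioscrobbler.com/2.0/",
        params={
            "method": "track.getInfo",
            "api_key": API_KEY,
            "artist": artist,
            "track": track,
            "format": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("track", {})

info = track_info("Nine Inch Nails", "Hurt")
print(info.get("duration"), info.get("artist", {}).get("name"))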
How about something like this: find a library that can give you the MP3's length as well as a pointer to the audio data (it looks like there are a couple of libraries out there that can do this), do a first-pass filter based on song lengths, and for the songs with matching lengths, checksum their audio data. Similar to this script for finding duplicate files/images.
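A rough sketch of that two-pass idea, using the mutagen library (an assumption; any tag/audio library that reports track length would do) to group files by rounded duration before doing any expensive hashing:

import os
from collections import defaultdict
from mutagen import File as MutagenFile

def find_candidates(root):
    by_length = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            if not name.lower().endswith((".mp3", ".ogg")):
                continue
            path = os.path.join(dirpath, name)
            audio = MutagenFile(path)
            if audio is not None and audio.info is not None:
                by_length[round(audio.info.length)].append(path)
    # Only lengths shared by two or more files need the expensive audio comparison.
    return {length: paths for length, paths in by_length.items() if len(paths) > 1}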
Some adaptation of ffTES has worked great for me for a very similar task.
I was faced with the same problem, so I wrote a command-line program that tries to detect similar audio files by comparing acoustic fingerprints: https://github.com/derat/soundalike
It uses the fpcalc utility from Chromaprint to generate the fingerprints, and then builds a lookup table to find possible matches before comparing fingerprints more rigorously.
It worked pretty well when I ran it against my music library, but there are various flags to tune its behavior if needed. If it works for you (or if it doesn't), let me know!