How to generate a random word from a real language - web services

How can I generate a random word from a real language?
Does anybody know of an API on the internet with this functionality?
For example, I send an HTTP request to 'ht_tp://www.any...api.com/getword?lang=en' and I get the response 'Town'. Or 'Fast'. Or 'Received'... Or I send an HTTP request to 'ht_tp://www.any...api.com/getword?lang=ru' and I get the response 'Ходить'. Or 'Шапка'. Or 'Отправлено'... Any form (noun, adjective, verb, etc.) of any word of the given language.
I found the resource 'http://www.randomlists.com/random-words', but it isn't JSON, it's English only, and there's no guarantee it will keep working in the long term.
Any ideas are appreciated.

See this answer: https://stackoverflow.com/questions/824422/can-i-get-an-english-dictionary-word-list-somewhere Download a word dictionary, stick it in a database, and fetch a random record (or read a random line from the file) each time. This way you don't depend on a 3rd-party API, and you can extend it to any language you can find words for.
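A minimal sketch of that approach, assuming a plain one-word-per-line list such as /usr/share/dict/words (the path is just an example):

#include <fstream>
#include <iostream>
#include <random>
#include <string>
#include <vector>

// Load a one-word-per-line dictionary and print a random entry.
int main() {
    std::ifstream in("/usr/share/dict/words");   // any plain-text word list works
    std::vector<std::string> words;
    for (std::string line; std::getline(in, line); )
        if (!line.empty()) words.push_back(line);

    if (words.empty()) return 1;

    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> pick(0, words.size() - 1);
    std::cout << words[pick(rng)] << '\n';
}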

You can download the OpenOffice dictionaries. They come as extensions (.oxt), which are nothing more than ZIP files; you can open them with 7-Zip or similar. Inside you will find lots of files; the interesting ones for you are the *.dic files. They will also contain entries such as abbreviations or number words.
When you encounter something like abandon/LdS, get rid of the /LdS; those flags are used by hunspell.
Take these *.dic files, use their names as keys, put them into a database, and pick a random word from there for a given language code.
Update
Older, but easier to access: the archived hunspell dictionaries from OpenOffice.
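A rough sketch of turning such *.dic files into per-language word lists (the first line of a hunspell .dic file is an entry count, and everything after a '/' is an affix flag; the file names below are just examples):

#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Read a hunspell-style .dic file: skip the entry count on the first line,
// and strip the affix flags after '/' (e.g. "abandon/LdS" -> "abandon").
std::vector<std::string> loadDic(const std::string& path) {
    std::ifstream in(path);
    std::vector<std::string> words;
    std::string line;
    std::getline(in, line);                    // first line: entry count, ignore
    while (std::getline(in, line)) {
        std::string word = line.substr(0, line.find('/'));
        if (!word.empty()) words.push_back(word);
    }
    return words;
}

int main() {
    // Keyed by language code, as suggested above; the paths are assumptions.
    std::unordered_map<std::string, std::vector<std::string>> byLang;
    byLang["en"] = loadDic("en_US.dic");
    byLang["ru"] = loadDic("ru_RU.dic");
    return 0;
}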

This question can be viewed in two ways, and therefore I give two answers:
To collect words, I would run a spider on websites with a known language (Wikipedia is a good starting point) and strip the HTML tags.
To generate words from a real language is trickier. Using statistics from the collected words, it is possible to use Markov chains to produce statistically plausible words. I have tried letter-by-letter generation, and it works poorly; a better approach is probably to build words from syllables instead.
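As an illustration of the second idea, here is a minimal letter-by-letter (order-1) Markov chain sketch; syllable-based states would work the same way, just with different keys, and the tiny training corpus here is only a placeholder:

#include <iostream>
#include <map>
#include <random>
#include <string>
#include <vector>

int main() {
    // Training corpus: in practice, the words collected by the spider.
    std::vector<std::string> corpus = {"town", "fast", "received", "random", "language"};

    // transitions[c] = letters observed after c ('^' marks the word start, '$' the end).
    std::map<char, std::vector<char>> transitions;
    for (const auto& w : corpus) {
        char prev = '^';
        for (char c : w) { transitions[prev].push_back(c); prev = c; }
        transitions[prev].push_back('$');
    }

    std::mt19937 rng{std::random_device{}()};
    std::string word;
    char state = '^';
    while (true) {
        auto& next = transitions[state];
        char c = next[std::uniform_int_distribution<std::size_t>(0, next.size() - 1)(rng)];
        if (c == '$' || word.size() > 12) break;
        word += c;
        state = c;
    }
    std::cout << word << '\n';
}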

Related

What is the best way to encrypt hardcoded strings in C++?

Warning: C++ noob
I've read multiple posts on Stack Overflow about string encryption. However, they don't answer my questions.
I must insert one or two hardcoded strings in my code, but I would like to make them difficult to read in plain text when someone is debugging or reverse engineering. That's not all: my strings are URLs, so a simple packet analyzer (e.g. Wireshark) can read them.
I said "difficult" because I know that, when the code runs, the string is decrypted somewhere (in RAM?) as plain text and somebody can read it. So, assuming that it is not possible to completely secure my string, what is the best way of encrypting/decrypting it in C++?
I was thinking of something like this:
// I've omitted all the #includes and the main stuff, of course...
string encryptedUrl = "Ajdu67gGHhbh34590Hb6vfu6gu"; // URL encrypted with some known algorithm
// decrypt() stands in for whatever decryption routine I end up using
URLDownloadToFile(NULL, decrypt(encryptedUrl).c_str(), "C:\\temp.txt", 0, NULL);
What about packet analyzing? I'm sure there's no way to hide the URL, but maybe I'm missing something? Thank you, and sorry for my poor English!
Edit 1: What does my application do?
It's a simple login script. My application downloads a text file from a URL. This file contains an encrypted string that is read using the fstream library. The string is then decrypted and used to log in to another site. It is very weak, because there's no database, no salt, no hashing. My goal is to ensure that neither the URL nor the login string is "easy" to read from a static analysis of the binary, and that both are as hard as possible to read with dynamic analysis (debugging, reverse engineering, etc.).
If you want to stymie packet inspectors, the bare minimum requirement is to use https with a hard-coded server certificate baked into your app.
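For instance, with libcurl you could pin the expected server public key so the transfer fails if someone tries to interpose their own certificate (the URL and hash below are placeholders, not anything from the question):

#include <curl/curl.h>

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (!curl) return 1;

    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/login.txt");   // placeholder URL
    // Pin the expected server public key (base64 of its SHA-256 hash).
    curl_easy_setopt(curl, CURLOPT_PINNEDPUBLICKEY,
                     "sha256//AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=");
    CURLcode res = curl_easy_perform(curl);       // fails if the pin does not match

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}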
There is no panacea for encrypting in-app content. A determined hacker with the right skills will get at the plain url, no matter what you do. The best you can hope for is to make it difficult enough that most people will just give up. The way to achieve this is to implement multiple diverse obfuscation and tripwire techniques. Including, but not limited to:
Store parts of the encrypted url and the password (preferably a one-time key) in different locations and bring them together in code (a minimal sketch of this appears after the list).
Hide the encrypted parts in large strings of randomness that look indistinguishable from the parts.
Bring the parts together piecemeal. E.g., concatenate the first and second third of the encrypted url into a single buffer in one initialisation function, concatenate this buffer with the last third in a different, unrelated init function, and use the final concatenation in yet another function, all called from different random places in your code.
Detect when the app is running under a debugger and have different functions trash the encrypted contents at different times.
Detection should be done at various call sites using different techniques, not by calling a single "DetectDebug" function or testing a global bool, both of which create a single point of attack.
Don't use obvious names, like, "DecryptUrl" for the relevant functions.
Harvest parts of the key from seemingly unrelated, but consistent, sources. E.g., read the clock and only use a few of the high bits (high enough that they won't change for the foreseeable future, but low enough that they're not all zero), or use a random sampling of non-volatile results from initialisation code.
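A minimal sketch of the first and third points, with made-up fragments and key (in a real build the fragments would be generated by a tool and scattered across unrelated source files):

#include <string>

// XOR-obfuscated fragments of the URL, stored separately (placeholders).
static const unsigned char kPart1[] = {0x13, 0x0f, 0x0f, 0x0b};   // decodes to "http"
static const unsigned char kPart2[] = {0x08, 0x41, 0x54, 0x54};   // decodes to "s://"
static const unsigned char kKey = 0x7b;

static std::string appendDecoded(std::string acc, const unsigned char* p, size_t n) {
    for (size_t i = 0; i < n; ++i) acc += static_cast<char>(p[i] ^ kKey);
    return acc;
}

// Called from two unrelated init paths; each adds only its own fragment,
// and further fragments would complete the URL elsewhere in the code.
std::string buildFirstHalf()               { return appendDecoded({}, kPart1, sizeof kPart1); }
std::string buildUrl(const std::string& s) { return appendDecoded(s, kPart2, sizeof kPart2); }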
This is just the tip of the iceberg and will only throw novices off the scent. None of it is going to stop, or even significantly slow down, a skillful attacker, who will simply intercept calls to the SSL library using a stealth debugger. You therefore have to ask yourself:
How much is it worth to me to protect this url, and from what kind of attacker?
Can I somehow change the system design so that I don't need to secure the url?
Try XorSTR [1, 2]. It's what I used to use when trying to hamper static analysis. Most results will come from game-cheat forums, and there is an HTML generator too.
However, as others have mentioned, getting the strings is still easy for anyone who puts a breakpoint on URLDownloadToFile. Still, you will have made their life a bit harder if they are trying to do static analysis.
I am not sure what your URLs do or what your goal is in all this, but XorStr + anti-debug + packing the binary will stop most amateurs from reverse engineering your application.

Extracting key words from HTML to C++ under linux

I am working on a simple client-server project. The client is written in Java; it sends keywords to a C++ server running under Linux and receives a list of URLs with the best ranks (based on the number of occurrences of the keywords). The server's job is to go through some URLs in search of the keywords and return the best-fitting URLs. And now the problem: I have to parse HTML sites to find occurrences of the keywords, plus I need to extract links from each visited page to search those as well. My question is which library I can use to do that. Remember, only C++ Linux libraries are suitable for me. There were some similar topics, so I tried to go through most of them, but some of the libraries parse only HTML files, and I don't want to download every site I visit, but rather parse it on the fly and just store its rank and URL. Some of them also look a bit complicated to me - for instance, first parsing the HTML to XML or something else and only then working on the result in C++. Is there something simple and sufficient to do what I need? Any advice will be appreciated.
I don't think regular expressions are appropriate for HTML parsing. I'm using libxml2, and I enjoy it very much - easy to use, portable and lightning fast.
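For instance, libxml2's HTML parser can work on a page already held in memory and pull the links out via XPath, without writing anything to disk (the buffer here stands in for whatever your downloader produced):

#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#include <cstdio>
#include <string>

// Extract all href attributes from an HTML document held in memory.
void printLinks(const std::string& html, const char* baseUrl) {
    htmlDocPtr doc = htmlReadMemory(html.c_str(), (int)html.size(), baseUrl, nullptr,
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (!doc) return;

    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    xmlXPathObjectPtr res = xmlXPathEvalExpression(BAD_CAST "//a/@href", ctx);
    if (res && res->nodesetval) {
        for (int i = 0; i < res->nodesetval->nodeNr; ++i) {
            xmlChar* href = xmlNodeGetContent(res->nodesetval->nodeTab[i]);
            if (href) { std::printf("%s\n", (const char*)href); xmlFree(href); }
        }
    }
    xmlXPathFreeObject(res);
    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
}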
To get URLs from the web using C/C++ you could use the libcurl library. To parse URLs and other not too easy stuff from the site you can use a regex library.
Separating the HTML tags from the real content can also be done without the use of a library.
For more advanced stuff one could use Qt, which offers classes such as QWebPage (which uses WebKit) that allow one to access the DOM model of the page and extract individual HTML objects (e.g. single cells of a table) rather easily.
You can try xerces-c. It's a powerful library for XML parsing. It supports reading XML on the fly, with both DOM and SAX parsing.

Testing if a string contains one of several thousand substrings

I'm going to be running through live Twitter data and attempting to pull out tweets that mention, for example, movie titles. Assuming I have a list of ~7000 hard-coded movie titles I'd like to check against, what's the best way to select the relevant tweets? This project is in its infancy, so I'm open to looking into any solution (i.e. it's language agnostic). Any help would be greatly appreciated.
Update: I'd be curious if anyone has any insight into how the Yahoo! Placemaker API solves this problem. It can take a text string and return a geocoded JSON result of all the locations mentioned in it.
You could try Wu and Manber's A Fast Algorithm For Multi-Pattern Searching.
The multi-pattern matching problem lies at the heart of virus scanning, so you might look to scanner implementations for inspiration. ClamAV, for example, is open source and some papers have been published describing its algorithms:
Lin, Lin and Lai: A Hybrid Algorithm of Backward Hashing and Automaton Tracking for Virus Scanning (a variant of Wu-Manber; the paper is behind the IEEE paywall).
Cha, Moraru, et al: SplitScreen: Enabling Efficient, Distributed Malware Detection
If you use compiled regular expressions, it should be pretty fast - especially if you put lots of titles into one expression.
Efficiently searching for many terms in a long character sequence would require a specialized algorithm to avoid testing for every term at every position.
But since it sounds like you have short strings with a known pattern, you should be able to use something fairly simple. Store the set of titles you care about in a hash table or tree. Parse out "string1" and "string2" from each tweet using a regex, and test whether they are contained in the set.
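A sketch of that set-lookup idea; since movie titles can span several words, this version checks every 1-to-5-word window of the tweet against a lower-cased title set rather than a single regex capture:

#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

static std::string lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c){ return std::tolower(c); });
    return s;
}

// Return the titles (from the ~7000-entry, lower-cased set) that appear in the tweet.
std::vector<std::string> findTitles(const std::string& tweet,
                                    const std::unordered_set<std::string>& titles) {
    std::vector<std::string> words, hits;
    std::istringstream iss(lower(tweet));
    for (std::string w; iss >> w; ) words.push_back(w);

    for (size_t i = 0; i < words.size(); ++i) {
        std::string window;
        for (size_t len = 1; len <= 5 && i + len <= words.size(); ++len) {
            if (len > 1) window += ' ';
            window += words[i + len - 1];
            if (titles.count(window)) hits.push_back(window);
        }
    }
    return hits;
}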
Working off what erickson suggested, the most feasible approach is to search for the connecting phrase ("is better than" in your example) and then check for one of the 7,000 terms. You could instead narrow the set by creating 7,000 searches for "[movie] is better than" and then filtering manually on the second movie, but you'll probably hit the search rate limit pretty quickly.
You could speed up the searching by using a dedicated search service like Solr instead of using text parsing. You might be able to pull out titles quickly using some natural language processing service (OpenCalais?), but that would be better suited to batch processing.
For simultaneously searching for a large number of possible targets, the Rabin-Karp algorithm can often be useful.
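A compact sketch of multi-pattern Rabin-Karp: one rolling-hash pass over the text per distinct pattern length, with hash hits confirmed by a real string comparison (the 64-bit overflow hash is fine for a sketch, but not collision-proof):

#include <map>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::string> multiSearch(const std::string& text,
                                     const std::vector<std::string>& patterns) {
    const unsigned long long B = 257;          // rolling-hash base
    std::vector<std::string> found;

    // Group pattern hashes by pattern length.
    std::map<size_t, std::unordered_map<unsigned long long, std::vector<std::string>>> byLen;
    for (const auto& p : patterns) {
        unsigned long long h = 0;
        for (unsigned char c : p) h = h * B + c;
        byLen[p.size()][h].push_back(p);
    }

    for (const auto& [len, table] : byLen) {
        if (len == 0 || len > text.size()) continue;
        unsigned long long h = 0, top = 1;
        for (size_t i = 0; i + 1 < len; ++i) top *= B;                     // B^(len-1)
        for (size_t i = 0; i < len; ++i) h = h * B + (unsigned char)text[i];

        for (size_t i = 0; ; ++i) {
            auto it = table.find(h);
            if (it != table.end())
                for (const auto& p : it->second)
                    if (text.compare(i, len, p) == 0) found.push_back(p);  // confirm hit
            if (i + len >= text.size()) break;
            h = (h - (unsigned char)text[i] * top) * B + (unsigned char)text[i + len];
        }
    }
    return found;
}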

Looking for Ideas: How would you start to write a geo-coder?

Because the open source geo-coders cannot begin to compare to Google's or even Yahoo's, I would like to start a project to create a good open source geo-coder. Just to clarify, a geo-coder takes some text (usually with some constraints) and returns one or more lat/lon pairs.
I realize that this is a difficult and gargantuan task, so I am wondering how you might get started. What would you read? Which algorithms would you familiarize yourself with? What code would you review?
And also, assuming you were going to develop this very agilely, what would you want the first prototype to be able to do?
EDIT: Let's set aside the data question for now. I am going to use OpenStreetMap data, along with a database of waypoints that I have. I would later plan to include other data sets as well, and I realize the geo-coder would be inherently limited by the quality of the original data.
The first (and probably blocking) problem would be: where do you get your data from? (Unless you are willing to pay thousands of dollars for proprietary data sets.)
You could build a geocoding API on top of OpenStreetMap (they publish their data in dumps on a regular basis), I guess, but that data was still very incomplete last time I checked.
Algorithms are easy. Good mapping data, however, is expensive. Very expensive.
Google drove their cars all over the world, collecting this data among other things.
From a .NET point of view these articles might be interesting for you:
Writing Your Own GPS Applications: Part I
Writing Your Own GPS Applications: Part 2
Writing GIS and Mapping Software for .NET
I've only glanced at the articles but they've been on CodeProject's 'Most Popular' list for a long time.
And maybe this CodePlex project which the author of the articles above made available.
I would start at the absolute beginning by figuring out how you're going to get the data that matches a street address with a geocode. Either Google had people going around with GPS units, OR they got the information from some existing source. That existing source may have been... (all guesses)
The Postal Service
Some existing maps(printed)
A bunch of enthusiastic users, early adopters of GPS technology, who were more than willing to enter street addresses and GPS coordinates
Some government entity (or entities)
Their own satellites
etc
I guess what I'm getting at is the information was either imported from somewhere or was input by someone via some interface. As my starting point I would look at how to get that information. In an open source situation, you may be able to get a bunch of enthusiastic people to enter information.
So for my first prototype, boring as it would be, I would create a form for entering information.
Then you need to know the math for figuring out the closest distance (as the crow flies). From there, try to figure out how to include roads. (My guess is you would need a data point for each and every curve, where you hold the geocoded location of the curve and the angle of the road on a north/south and east/west vector. You'd probably need to take incline into account, too, to get accurate road measurements.)
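For the crow-flies part, the haversine formula over lat/lon pairs is the usual starting point; a minimal version:

#include <cmath>

// Great-circle ("as the crow flies") distance in kilometres between two
// latitude/longitude points, using the haversine formula.
double haversineKm(double lat1, double lon1, double lat2, double lon2) {
    const double kEarthRadiusKm = 6371.0;
    const double kPi = 3.14159265358979323846;
    const double kDegToRad = kPi / 180.0;

    double dLat = (lat2 - lat1) * kDegToRad;
    double dLon = (lon2 - lon1) * kDegToRad;
    double a = std::sin(dLat / 2) * std::sin(dLat / 2) +
               std::cos(lat1 * kDegToRad) * std::cos(lat2 * kDegToRad) *
               std::sin(dLon / 2) * std::sin(dLon / 2);
    return 2.0 * kEarthRadiusKm * std::asin(std::sqrt(a));
}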
That's just where I'd start.
But in all honesty, I wouldn't even start on this. Other programmers have done it already, I'm more interested in what hasn't already been done.
get my free raw data from somewhere like http://ipinfodb.com/ip_database.php
load it into a database, denormalizing for fast lookups
design my API
build it out as a RESTful web service
return results in varying formats: JSON, XML, CSV, raw text
The first prototype should accept a ZIP code and return lat/lon in raw text.
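A sketch of the prototype's core lookup, leaving the REST/HTTP layer out; the zip,lat,lon CSV layout and file name are assumptions:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

struct LatLon { double lat, lon; };

// Load a "zip,lat,lon" CSV into memory for O(1) lookups.
std::unordered_map<std::string, LatLon> loadZips(const std::string& path) {
    std::unordered_map<std::string, LatLon> table;
    std::ifstream in(path);
    for (std::string line; std::getline(in, line); ) {
        std::istringstream row(line);
        std::string zip, lat, lon;
        if (std::getline(row, zip, ',') && std::getline(row, lat, ',') && std::getline(row, lon, ','))
            table[zip] = {std::stod(lat), std::stod(lon)};
    }
    return table;
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    auto zips = loadZips("zip_latlon.csv");       // assumed file name
    auto it = zips.find(argv[1]);
    if (it == zips.end()) { std::cout << "not found\n"; return 1; }
    std::cout << it->second.lat << "," << it->second.lon << "\n";   // raw-text response
}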

Detecting duplicate music files

I've got two directories containing ~20 GB of music files (mostly mp3, some ogg), and I would like to detect all duplicate songs. There are two complicating factors:
A song may have different filenames in the two directories.
Two files containing the same song may have different ID3 tags and thus have different checksums.
What is a good approach to solving this?
The way I have gone about this in the past is to use genpuid, which comes from MusicIP. This closed-source software creates an audio fingerprint of a file regardless of format, ID3 tags, checksum, etc.
More information can be found here.
This should ensure the largest number of positive duplicate matches and minimize false positives. It can also correct wrong ID3 tags.
Here's what I would do (or have done before)...
Load all songs into iTunes (bear with me)
(Note: if you can't use iTunes here, then stop... I assume your list of dupes is long and unmanageable)
Delete all songs, sending them to the trash can; this way you get rid of the directory structure
Obviously, don't "empty trash". Rescue the songs to a folder on your desktop
Use software like MediaMonkey, Dupe Eliminator or even iTunes itself to identify the duplicates. Dupe Eliminator is good in that it checks a number of factors - artist, length, file size and whatnot - and guesses what is a dupe and what isn't
Reload into iTunes, this time with "Auto arrange songs" checked, which will drop your new, dupe-free list into a nice by-artist-by-album arrangement
... voila! (or if you read digg: "...profit!")
/mp
If you have a library that can parse the files, you can run the hash on the audio data. This will not help you if the song is a different rip or has been recompressed/transcoded/etc.
Are the ID3 (or OGG-equivalent) artist and song meta tags accurate? If they are, you could use those.
Edit: If they're not, perhaps they could be made to be... If you're only dealing with whole albums, there are several tools that will get all the tag data based on the number of tracks and their lengths.
If you're dealing with mixes of albums and single files, it gets more complicated.
I'm sure there are more elegant solutions out there - but if the audio data is equivalent, then stripping the ID3 tags and hashing should do the trick. After hashing, you can put the ID3 tags back if you like.
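A sketch of that idea: skip a leading ID3v2 tag (its size is stored as a syncsafe integer in the header) and a trailing 128-byte ID3v1 tag, then hash what remains (FNV-1a here purely for brevity; swap in a stronger hash if collisions matter):

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hash an MP3's audio bytes, skipping a leading ID3v2 tag and a trailing ID3v1 tag.
uint64_t hashAudio(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> data((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());

    size_t begin = 0, end = data.size();
    // ID3v2 header: "ID3", version (2 bytes), flags (1 byte), 4-byte syncsafe size.
    if (data.size() >= 10 && data[0] == 'I' && data[1] == 'D' && data[2] == '3') {
        size_t tagSize = (data[6] & 0x7f) << 21 | (data[7] & 0x7f) << 14 |
                         (data[8] & 0x7f) << 7  | (data[9] & 0x7f);
        begin = std::min(data.size(), tagSize + 10);
    }
    // ID3v1 tag: last 128 bytes, starting with "TAG".
    if (end - begin >= 128 && data[end - 128] == 'T' && data[end - 127] == 'A' && data[end - 126] == 'G')
        end -= 128;

    uint64_t h = 0xcbf29ce484222325ull;           // FNV-1a 64-bit offset basis
    for (size_t i = begin; i < end; ++i) { h ^= data[i]; h *= 0x100000001b3ull; }
    return h;
}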
Perhaps the Last.fm API would be useful. It includes a track.getInfo call which returns XML including the track's length, artist name, track number, etc. You could compare tracks and see if they have more than N fields equal and if so, assume they're the same track.
I have no idea about whether they're going to be OK with you submitting API requests for 40gb of music, though.
How about something like this: find a library to get the MP3's length as well as a pointer to the audio data (it looks like there are a couple of libraries out there that can do this), do a first-pass filter based on song lengths, and for the songs with matching lengths, checksum their audio data. Similar to this script for finding duplicate files/images.
Some adaptation of ffTES has worked great for me for a very similar task.
I was faced with the same problem, so I wrote a command-line program that tries to detect similar audio files by comparing acoustic fingerprints: https://github.com/derat/soundalike
It uses the fpcalc utility from Chromaprint to generate the fingerprints, and then builds a lookup table to find possible matches before comparing fingerprints more rigorously.
It worked pretty well when I ran against my music library, but there are various flags to tune its behavior if needed. If it works for you (or if it doesn't), let me know!