I have an English-language forum site written in Perl that is continually bombarded with spam in Russian. Is there a way, using Perl and a regex, to detect Russian text so I can block it?
You can use the following to detect Cyrillic characters (the script used for Russian):
[\u0400-\u04FF]+
Note that \uXXXX is the Java/JavaScript notation; in a Perl regex the same class is written [\x{0400}-\x{04FF}]+ (or you can simply use \p{Cyrillic}).
If you really just want Russian characters, the Unicode Cyrillic code chart gives the exact range used for the basic Russian alphabet: [\x{0410}-\x{044F}]. Of course you'd also need to consider Ё (U+0401) and ё (U+0451), which are used in Russian but sit just outside that range.
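For completeness, here is roughly what that looks like in Perl. This is a minimal sketch, assuming the posted text has already been decoded from bytes to characters; $raw_body and block_post() are placeholder names, not from the original question:
use Encode;
my $body = decode('UTF-8', $raw_body);
if ($body =~ /[\x{0400}-\x{04FF}]/) {
    block_post();   # at least one Cyrillic character found
}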
Using the Unicode Cyrillic range as JG suggests is fine if everything is actually encoded that way. However, this is spam, and for the most part it is not. Additionally, spammers will very often use a mix of charsets within a single message, which further defeats that approach.
I find that the best way (or at least the preliminary step in the process) of detecting Russian spam is to grep for the most commonly used charsets:
koi8-r
windows-1251
iso-8859-5
The next step after that would be to try some language-detection algorithm on what remains. If it's a big enough problem, use a paid service such as Google Translate (which also "detects") or Xerox. These services provide, IMO, the best language detection around.
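A rough sketch of that preliminary grep, run against the raw, undecoded message (headers plus body); $raw_message and flag_as_suspect() are placeholder names:
my @suspect_charsets = ('koi8-r', 'windows-1251', 'iso-8859-5');
for my $cs (@suspect_charsets) {
    if ($raw_message =~ /\Q$cs\E/i) {
        flag_as_suspect($cs);   # e.g. hold the post for moderation
        last;
    }
}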
I have a project where I'm trying to enable other, possibly hostile, coders to label various properties in lowercase. These labels will be displayed in differing contexts: embedded in HTML, saved and manipulated in Postgres, used as attribute labels in JavaScript, and manipulated in the shell (say, saving a data file as продажи.zip), as well as in various data-analysis tools like graph-tool, etc.
I've worked on multilingual projects before, but they were either smaller customers that didn't need to especially worry about sophisticated attacks or they were projects that I came to after the multilingual aspect was in place, so I wasn't the one responsible for verifying security.
I'm pretty sure these should be safe, but I don't know if there are gotchas I need to look out for, like, say, a special [TAB] or [QUOTE] character in the Chinese character set that might escape my escaping.
Am I ok with these in my regex filter?
dash = '-'
english = 'a-z'
italian = ''
russian = 'а-я'
ukrainian = 'ґї'
german = 'äöüß'
spanish = 'ñ'
french = 'çéâêîôûàèùëï'
portuguese = 'ãõ'
polish = 'ąćęłńóśźż'
turkish = 'ğışç'
dutch = 'áíúýÿìò'
swedish = 'å'
danish = 'æø'
norwegian = ''
estonian = ''
romanian = 'șî'
greek = 'α-ωίϊΐόάέύϋΰήώ'
chinese = '([\p{Han}]+)'
japanese = '([\p{Hiragana}\p{Katakana}]+)'
korean = '([\p{Hangul}]+)'
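For what it's worth, here is a minimal sketch (not the asker's actual filter) of how those classes could be combined into one anchored allowlist, assuming Perl and input already decoded from UTF-8:
use utf8;
my $allowed = qr/^[-a-zа-яґїäöüßñçéâêîôûàèùëïãõąćęłńóśźżğışçáíúýÿìòåæøșîα-ωίϊΐόάέύϋΰήώ\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]+$/;
print "ok\n" if 'продажи' =~ $allowed;            # Cyrillic label passes
print "rejected\n" unless 'rm -rf' =~ $allowed;   # space and most ASCII punctuation do not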
If you restrict yourself to text encodings with a 7-bit-ASCII-compatible subset, you're reasonably safe treating anything above 0x7F (U+007F) as "safe" when interacting with most sane-ish programming languages and tools. If you use Perl 6 you're out of luck ;)
You should avoid supporting, or take special care with, input or output of text in the Shift-JIS encoding, where the ¥ symbol sits at 0x5C, the position where \ would usually reside. This offers opportunities for nefarious trickery that exploits encoding conversions.
Avoid or take extra care with other non-ASCII-compatible encodings too. EBCDIC is one, but you're unlikely to ever meet it in the wild. UTF-16 and UTF-32, obviously, but if you misprocess those the results are glaringly obvious.
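The trap is not just the ¥ glyph: in Shift-JIS the second byte of many double-byte characters is itself 0x5C, i.e. a literal backslash byte. A quick demonstration with Perl's core Encode module:
use Encode;
my $bytes = encode('shiftjis', "\x{8868}");   # the kanji 表
printf "%v02X\n", $bytes;                     # prints 95.5C
# A byte-oriented backslash-escaper would see and mangle that 0x5C,
# corrupting the character during "harmless" string processing.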
Reading:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Personally I think your approach is backwards. You should define input and output functions to escape and unescape strings according to the lexical syntaxes of each target tool or language, rather than trying to prohibit any possible metacharacter. But then I don't know your situation, and maybe it's just impractical for what you're doing.
I'm not quite sure what your actual issue is. If you correctly convert your text to the target format, then you don't care what the text could possibly be. This will ensure both proper conversion AND security.
For instance:
If your text is to be included in HTML, it should be escaped using appropriate HTML quoting functions.
Example:
Wrong:
// XXX DON'T DO THIS XXX
echo "<span>".$variable."</span>";
Right:
// Actual encoding function varies based on your environment
echo "<span>".htmlspecialchars($variable)."</span>";
Yes, this will also properly handle text containing & or <.
If your text is to be used in an SQL query, you should use parameterised queries.
Example:
Wrong:
// XXX DON'T DO THIS XXX
perform_sql_query("SELECT this FROM that WHERE thing=".$variable);
Right:
// Actual syntax and function will vary
perform_sql_query("SELECT this FROM that WHERE thing=?", [$variable]);
If your text is to be included in JSON, just use appropriate JSON-encoding functions.
Example:
Wrong:
// XXX DON'T DO THIS XXX
echo '{"this":"'.$variable.'"}';
Right:
// actual syntax and function may vary
echo json_encode(["this" => $variable]);
The shell is a bit more tricky, and it's often a pain to deal with non-ASCII characters in many environments (e.g. FTP, or doing an scp between different systems). So don't use explicit names for files; use identifiers (numeric id, UUID, hash...) and store the mapping to the actual name somewhere else (in a database).
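A minimal sketch of that identifier approach, assuming a DBI handle $dbh and a files(id, original_name) table (both hypothetical):
use Digest::SHA qw(sha256_hex);
my $id = sha256_hex($original_name . time());
$dbh->do('INSERT INTO files (id, original_name) VALUES (?, ?)',
         undef, $id, $original_name);
# Store the upload as "$id.zip" on disk; look up "продажи.zip" in the
# database whenever you need to show the real name to a user.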
I am writing a piece of code in C++ wherein I need a word-to-syllable converter. Is there any open-source standard algorithm available, or any other links which could help me build one?
For a word like invisible, the syllables would be in-viz-uh-ble.
It should ideally be able to parse even complex words like "invisible".
I have already found a link to an algorithm in Perl and Python, but I want to know if any library is available in C++.
Thanks a lot.
Your example shows a phonetic representation of the word, not simply a split into syllables. This is a complex NLP issue.
Take a look at Soundex and Metaphone. There are C/C++ implementations of both.
Also, many dictionaries provide the IPA notation of words. Take a look at the Wiktionary API.
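If you just want to see what these algorithms produce before committing to a C++ implementation, the CPAN modules Text::Soundex and Text::Metaphone make for a quick experiment:
use Text::Soundex;
use Text::Metaphone;
print soundex('invisible'), "\n";     # I512
print Metaphone('invisible'), "\n";   # a rough phonetic key, not true syllables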
For detecting syllables in words, you could adapt a project of mine to your needs.
It's called tinyhyphenator.
It gives you an integer list of all possible hyphenation indices within a word. For German it is quite exact. You would have to obtain the index list and insert the hyphens yourself.
By "adapt" I mean adding a specification of English syllables. Take a look at the source code; it is meant to be quite self-explanatory.
For layout we have our famous "Lorem ipsum" text to test how things look.
What I am looking for is a set of files containing text encoded in several different encodings, which I can use in my JUnit tests to test some methods that deal with character encoding when reading text files.
Example:
Having an ISO 8859-1 encoded test file and a Windows-1252 encoded test file. The Windows-1252 file has to trigger the differences in the region 0x80–0x9F; in other words, it must contain at least one character from this region to distinguish it from ISO 8859-1.
Maybe the best set of test files is one where the test file for each encoding contains every one of its characters once. But maybe I am not aware of something - we all love this encoding stuff, right? :-)
Is there such a set of test-files for character-encoding issues out there?
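If nothing ready-made turns up, generating a small distinguishing file yourself is easy. A sketch in Perl (Encode is a core module): it writes three characters that Windows-1252 places in 0x80-0x9F, exactly the region where ISO 8859-1 has only C1 control codes:
use Encode;
open my $fh, '>:raw', 'cp1252-sample.txt' or die $!;
print {$fh} encode('cp1252', "\x{20AC}\x{201E}\x{2026} plus plain ASCII\n");
close $fh;
# The file now contains the bytes 0x80 0x84 0x85 (euro sign, low double
# quotation mark, ellipsis), so a reader assuming ISO 8859-1 must differ.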
The Wikipedia article on diacritics is pretty comprehensive; unfortunately you have to extract the characters manually. There may also be mnemonic sentences for each language. For instance, in Polish we use:
Zażółć gęślą jaźń
which contains all 9 Polish diacritics in one correct sentence. Another useful thing to search for is pangrams: sentences using every letter of the alphabet at least once:
in Spanish, "El veloz murciélago hindú comía feliz cardillo y kiwi. La cigüeña tocaba el saxofón detrás del palenque de paja." (all 27 letters and diacritics).
in Russian, "Съешь же ещё этих мягких французских булок, да выпей чаю" (all 33 Russian Cyrillic alphabet letters).
The Wikipedia article List of pangrams contains an exhaustive summary. Anyone care to wrap this in a simple:
public interface NationalCharacters {
String spanish();
String russian();
//...
}
library?
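Not a library, but a minimal sketch of the idea, seeded only with the sentences quoted above (in Perl rather than Java):
use utf8;
binmode STDOUT, ':encoding(UTF-8)';
my %pangrams = (
    polish  => 'Zażółć gęślą jaźń',
    spanish => 'El veloz murciélago hindú comía feliz cardillo y kiwi. La cigüeña tocaba el saxofón detrás del palenque de paja.',
    russian => 'Съешь же ещё этих мягких французских булок, да выпей чаю',
);
sub national_characters { $pangrams{ lc shift } }
print national_characters('russian'), "\n";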
How about trying to use the ICU test suite files? I don't know if they are what you need for your test, but they seem to have pretty complete from/to UTF mapping files at least: Link to the repo for ICU test files
I don't know of any complete text documents, but if a simple overview of each character set is enough to start with, there are files available on the ftp.unicode.org server.
Here's CP1250 (Windows-1250), for example. The first column is the hexadecimal byte value, and the second the Unicode value.
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT
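The files share a simple format, so they're easy to consume programmatically. A sketch of a reader in Perl: lines starting with '#' are comments, data lines are tab-separated, and the Unicode column is empty for undefined byte positions:
open my $fh, '<', 'CP1250.TXT' or die $!;
while (my $line = <$fh>) {
    next if $line =~ /^#/;
    my ($byte, $uni) = split /\t/, $line;
    next unless defined $uni && $uni =~ /^0x/;
    printf "byte %s maps to U+%04X\n", $byte, hex($uni);
}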
There are a few ready-to-use, comprehensive Unicode test files available for straightforward download.
From w3c
There's a nice test file by w3.org covering maths, linguistics, Greek, Georgian, Russian, Thai, runes, and Braille, among many others, in a single file:
https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html
Coming from w3.org, it should be nice to use, shouldn't it?
Cutting out the HTML part
If you want to get the "original txt file" without risking your editor messing it up: 1) download it, 2) tail+head it, 3) check with a diff:
wget https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html
tail -n +8 UTF-8-demo.html | head -n -3 > UTF-8-demo.txt
diff UTF-8-demo.html UTF-8-demo.txt
This generates UTF-8-demo.txt without human intervention and without risk of losing data; the diff should show only the HTML header and footer lines that were stripped.
More from w3c
There are many more files one level up in the directory structure, still inside the dir utf-8-test:
https://www.w3.org/2001/06/utf-8-test/
From github
There's a very interesting file here too, with ALL printable chars (including Chinese, Braille, Arabic, etc.):
https://raw.githubusercontent.com/bits/UTF-8-Unicode-Test-Documents/master/UTF-8_sequence_separated/utf8_sequence_0-0x10ffff_assigned_printable.txt
Want non-printable characters too?
There are also many more test files in the same repo:
https://github.com/bits/UTF-8-Unicode-Test-Documents
and also a generator, in case you don't trust the committed file and want to generate it yourself.
My personal choice
I have decided that for my projects I'll start with 2 files: The specific one I pointed out from w3c and the specific one I pointed out from the github repo by bits.
Well, I have used an online tool to create text character sets from Lorem Ipsum. I believe it can help you. I don't have one which has all the different charsets on a single page.
http://generator.lorem-ipsum.info/
I have a C++ program that was written by a Russian-speaking developer, so it contains Cyrillic characters. When I open the sources they are displayed as garbage. How do I solve this on Windows?
The actual problem is your IDE/editor doesn't display Cyrillic characters correctly. You solve this by changing the IDE/editor settings to use a font that contains Cyrillic characters - for example, Courier New if you're on Windows.
Well, assuming they've actually used ISO C++ and not some weird Russian variant, the language constructs and standard-library calls will be in English (or its strange cousin, American).
The only things you'll really need to convert are the strings (such as user output or logging), code comments, and variable names.
And even the comments and variable names may not have to change; however, they may make the code harder for a non-Russian reader to understand.
If the code contains characters that your current editor doesn't understand, well, you need to get yourself an editor that does. Or get your Russian friends to turn it into English for you.
Don't think that there is a different C++ programming language in Russia; you just need to translate the strings into the other language, i.e. English. Care must be taken when processing input, since that is where you may find handling of single characters.
A better approach would be to prepare a localization: read all strings from a resource or file, so you can select the resource that matches your target language.
If you mean that the strings of the program are written in Russian and you want to add English texts, you need to first internationalize (i18n) the program, using a library like Gettext instead of static strings; then you need to add support for the English locale.
If you mean that the variables and the comments are in Russian and you want them in English, well.. find a translator ;)
Find a translator and give him the code.
I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try to locate Māori I've been trying various regexes like:
preg_match('/m(.{1})ori/i', $page_title)
which also returns page titles containing "Moorings", but not "Māori". How do preg_match/preg_replace see characters like "ā", and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for UTF-8 mode in regexes.
You're better off, on the whole, doing an iconv('UTF-8', 'ASCII//TRANSLIT', $string) on both the name and the search term and comparing those.
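In Perl terms, the same folding can be done with the core Unicode::Normalize module; a minimal sketch of accent-insensitive matching (assuming the strings are already decoded to characters):
use utf8;
use Unicode::Normalize;
sub strip_marks {
    my $s = NFD(shift);    # "ā" (U+0101) decomposes to "a" + combining macron
    $s =~ s/\p{Mn}//g;     # drop the combining marks
    return $s;
}
print "match\n" if strip_marks('Māori History') =~ /maori/i;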
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside ASCII. I don't know whether the string $page_title is being treated as a Unicode string or a dumb byte string. If it's the byte-string option, you're going to have to use two dots there to catch it instead, or {1,4}, and even then you'd have to verify that the up-to-four bytes you grab between the m and the o form a single valid UTF-8 character. This is all moot if PHP does Unicode right; I haven't used it in years, so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways: as a single character (U+0101), or as TWO Unicode characters ('a' plus a combining diacritic from the U+0300 block). You're likely only ever going to get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexes.