I'm looking at an open source QT4 game (http://cockatrice.de), and it uses QTranslator for internationalization. However, every phrase that refers to the player uses a masculine pronoun ("his hand", "his deck", "he does such-and-such", etc.)
My first thought for fixing this would be to just replace every instance of "his" or "he" with a variable that is set to the correct gendered pronoun, but I don't know how this would affect the translations. But for translating/internationalization, this might break, especially if the pronoun's gender affects the rest of the phrase.
Has anyone else worked with this sort of problem before? Is it possible to separate out the pronoun and the phrase in at least the easy languages (like English, in this case)? will the translation file have to include a copy of each phrase for each gendered pronoun? (doubling, at least, the size of the translation file)?
some sample code from how it's currently set up:
calling source:
case CaseNominative: return hisOwn ? tr("his hand", "nominative") : tr("%1's hand", "nominative").arg(ownerName);
case CaseGenitive: return hisOwn ? tr("of his hand", "genitive") : tr("of %1's hand", "genitive").arg(ownerName);
case CaseAccusative: return hisOwn ? tr("his hand", "accusative") : tr("%1's hand", "accusative").arg(ownerName);
english translation file:
<message>
<location filename="../src/cardzone.cpp" line="51"/>
<source>his hand</source>
<comment>nominative</comment>
<translation type="unfinished"></translation>
</message>
maybe split:
tr("his hand")
into:
tr("his").append("hand")
and then translate "his", "her", "its", ... seperately.
but there will be some problems in languages like italian, where there are personal suffixes...
alternative:
tr("his hand");
tr("her hand");
//...
and translate them all.
EDIT: the second alternative shown is also used in many other games (like oolite, ...), because it's the only way to be sure there will not be problems (like suffixes instead of prefixes...) in other languages.
btw what would Konami or Nintendo say when they'll realize you are creating an open source Yu-Gi-Oh!/Pokemon playing field? nobody will buy their cards :-P
Related
We have a problem here...
We have a text having different patterns of sentences.
We want to get the sentence having a particular word.
Eg:
One further point, by way of providing another model. The analysis in
the second paragraph could lead in the following direction. 'The
Destructors' deals with, obviously, destruction, whilst the book of
Genesis deals with creation. The vocabulary is similar: Blackie
notices that 'chaos had advanced', an ironic reversal of God's
imposing of form on a void. Furthermore, the phrase 'streaks of light
came in through the closed shutters where they worked with the
seriousness of creators', used in the context of destruction, also
parodies the creation of light and darkness in the early passages of
the Biblical book. Greene's ironic use of the vocabulary of the Bible
might be making the point that, for him, the Second World War
signalled the end of a particular Christian era. Now, it is perfectly
arguable that the rise of fascism is linked to this, or that it is the
cause. The cult of personality and secular leadership has, for Greene,
taken over from the key role of the church in Western societies. In
this way the two main themes identified above - the tension between
individual and community, and religion - are linked. In terms of essay
writing this link could well be made after the discussion of the theme
of the individual and the community, and its links with the theme of
leadership. This might be the general conclusion to the essay. After
thoughtful consideration and interpretation a student may well decide
that this is what the (destructors.)' boils down to: Greene is making a
clear link between the rise of fascism and the decline of the Church's
influence. Despite the fact that fascism has been recently defeated,
Greene sees the lack of any contemporary values which could provide
social cohesion as providing the potential for its reappearance.
In the above text, we have bold words (destructors). We want to get the sentences which are having the word "destructors".
The word "destructors" can be present in different formats. Eg: (destructors), (DesTrucTors), (Des.tructors), DESTRUCTORS, destructors, des-tructors.
When we tried writing a regex to match the sentences, we are failing to get the sentences at some conditions(like we are getting half sentences, etc.,).
Could you please help us with this.
If this information doesn't help you to solve, please let us know. Will update it.
Thank you...
I'm not too sure about Python, but I believe this might work:
for match in re.finditer(r"[^.]*destructors[^.]*\.[^\w\s]*", subject, re.IGNORECASE):
# match start: match.start()
# match end (exclusive): match.end()
# matched text: match.group()
In any case, I think the regex you want is:
[^.]*destructors[^.]*\.[^\w\s]*
with the case insensitive and global flags set.
It will be helpful if you could provide the regex pattern which you have tried with so far. The best I can come up with is,
str_text='your text here containing DESTRUCTORS'
match=re.search('pass all the destructors combination here', str_text, flags=re.IGNORECASE)
Try for more patterns available for string formatting with regex here,https://docs.python.org/3/library/re.html
I am importing from nltk.stem.snowball import SnowballStemmer
and I have a string as follows:
text_string="Hi Everyone If you can read this message youre properly using parseOutText Please proceed to the next part of the project"
I run this code on it:
words = " ".join(stemmer.stem(word) for word in text_string.split(" "))
and I get the following which has a couple of 'e' missing. Can't figure out what is causing it. Any suggestions? Thanks for the feedbacks
"hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project"
You're using it correctly; it's the stemmer that's acting weird. It could be caused by too little training data, or the wrong balance, or simply the wrong conclusion by the stemmer's statistical algorithm. We can't expect perfection, but it's annoying when it happens with common words. It's also stemming "everything" to "everyth", as if it's a verb. At least here it's clear what it's doing. But "-e" is not a suffix in English...
The stemmer allows the option ignore_stopwords=True, which will suppress stemming of words in the stopword list (these are common words, usually irregular, that Porter thought fit to exclude from the training set because he got worse results when they are included.) Unfortunately it doesn't help with the particular examples you ask about.
It seems hard to detect a sentence boundary in a text. Quotation marks like .!? may be used to delimite sentences but not so accurate as there may be ambiguous words and quotations such as U.S.A or Prof. or Dr. I am studying Tperlregex library and Regular Expression Cookbook by Jan Goyvaerts but I do not know how to write the expression that detects sentence?
What may be comparatively accurate expression using Tperlregex in delphi?
Thanks
First, you probably need to arrive at your own definition of what a "sentence" is, then implement that definition. For example, how about:
He said: "It's OK!"
Is it one sentence or two? A general answer is irrelevant. Decide whether you want it to interpret it as one or two sentences, and proceed accordingly.
Second, I don't think I'd be using regular expressions for this. Instead, I would scan each character and try to detect sequences. A period by itself may not be enough to delimit a sentence, but a period followed by whitespace or carriage return (or end of string) probably does. This immediately lets you weed out U.S.A (periods not followed by whitespace).
For common abbreviations like Prof. an Dr. it may be a good idea to create a dictionary - perhaps editable by your users, since each language will have its own set of common abbreviations.
Each language will have its own set of punctuation rules too, which may affect how you interpret punctuation characters. For example, English tends to put a period inside the parentheses (like this.) while Polish does the opposite (like this). The same difference will apply to double quotes, single quotes (some languages don't use them at all, sometimes they are indistinguishable from apostrophes etc.). Your rules may well have to be language-specific, at least in part.
In the end, you may approximate the human way of delimiting sentences, but there will always be cases that can throw the analysis off. For example, assuming that you have a dictionary that recognizes "Prof." as an abbreviation, what are you going to do about
Most people called him Professor Jones, but to me he was simply The Prof.
Even if you have another sentence that follows and starts with a capital letter, that still won't help you know where the sentence ends, because it might as well be
Most people called him Professor Jones, but to me he was simply Prof. Bill.
Check my tutorial here http://code.google.com/p/graph-expression/wiki/SentenceSplitting. This concrete example can be easily rewritten to regular expressions and some imperative code.
It will be wise to use a NLP processor with a pre-trained model. EnglishSD.nbin is one such model that is available for OpenNLP and it can be used in Visual Studio with SharpNLP.
The advantage of using this method is numerous. For example consider the input
Prof. Jessica is a wonderful woman. She is a native of U.S.A. She is married to Mr. Jacob Jr.
If you are using a regex split, for example
string[] sentences = Regex.Split(text, #"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");
Then the above input will be split as
Prof.
Jessica is a wonderful woman.
She is a native of U.
S.
A.
She is married to Mr.
Jacob Jr.
However the desired output is
Prof. Jessica is a wonderful woman.
She is a native of U.S.A. She is married to Mr. Jacob Jr.
This kind of logical sentence split can be achieved only using trained models from OpenNLP project. The method is as simple as this.
private string mModelPath = #"C:\Users\ATS\Documents\Visual Studio 2012\Projects\Google_page_speed_json\Google_page_speed_json\bin\Release\";
private OpenNLP.Tools.SentenceDetect.MaximumEntropySentenceDetector mSentenceDetector;
private string[] SplitSentences(string paragraph)
{
if (mSentenceDetector == null)
{
mSentenceDetector = new OpenNLP.Tools.SentenceDetect.EnglishMaximumEntropySentenceDetector(mModelPath + "EnglishSD.nbin");
}
return mSentenceDetector.SentenceDetect(paragraph);
}
where mModelPath is the path of the directory containing the nbin file.
The mSentenceDetector is derived from the OpenNLP dll.
You can get the desired output by
string[] sentences = SplitSentences(text);
Kindly read through this article I have written for integrating SharpNLP with your Application in Visual Studio to make use of the NLP tools
This might be a hard one (if not impossible), but can anyone think of a regular expression that will find a person's name, in say, a resume? I know this won't be 100% accurate, but I can't come up with something.
Let's assume the name only shows up once in the document.
No, you can't use regular expressions for this. The only chance you have is if the document is always in the same format and you can find the name based on the context surrounding it. But this probably isn't the case for you.
If you are asking your applicants to submit their résumé online you could provide a separate field for them to enter their name and any other information you need instead of trying to automatically parse résumés.
Forget it - seriously.
Or expect to get a lot of applications from a Mr C Vitae
In my experience, having written something very similar (but a very long time ago), about 95% of resumes have the person's name as the very first line. You could probably have a pretty loose regex checking for alpha, hyphens, periods, and assume that's the name.
Obviously there's no way to do this 100% accurately, as you said, but this would be close.
Unless you wanted to build an expression that contained every possible name, or-ed together, the expression you are referring to is not "Regular," with a capital R. A good guess might be to go looking for the largest-font words in the document. If they follow a pattern that looks like firstname-lastname, name-initial-name, etc., you could call it a good guess...
That's a really hairy problem to tackle. The regex has to match two words that could be someone's name. The problem with that is that some people, of Hispanic origin, for example, might have a name that's more than 2 words. Also, how would you define two words to match for a name? Would you use a database of common first and last name fields? That might work unless someone has an uncommon name.
I'm reminded of a story of a COBOL teacher in college told me about an individual of Asian origin who's name would break every rule the programmers defined for a bank's internal system. His first name was "O." just the letter O.
The only remotely dependable way to nail down the regex would be if you had something to set off your search with; maybe if a line of text in the resume began with "Name: " then you'd know where to start looking.
tl;dr: People's names and individual resumes are too heavily varied for a regular expression to pick apart.
You could do something like Amazon does for book overviews: SIPs. This would require some after-the-fact double checking by humans but you might find the person's name(s) in there.
Although this seems like a trivial question, I am quite sure it is not :)
I need to validate names and surnames of people from all over the world. Imagine a huge list of miilions of names and surnames where I need to remove as well as possible any cruft I identify. How can I do that with a regular expression? If it were only English ones I think that this would cut it:
^[a-z -']+$
However, I need to support also these cases:
other punctuation symbols as they might be used in different countries (no idea which, but maybe you do!)
different Unicode letter sets (accented letter, greek, japanese, chinese, and so on)
no numbers or symbols or unnecessary punctuation or runes, etc..
titles, middle initials, suffixes are not part of this data
names are already separated by surnames.
we are prepared to force ultra rare names to be simplified (there's a person named '#' in existence, but it doesn't make sense to allow that character everywhere. Use pragmatism and good sense.)
note that many countries have laws about names so there are standards to follow
Is there a standard way of validating these fields I can implement to make sure that our website users have a great experience and can actually use their name when registering in the list?
I would be looking for something similar to the many "email address" regexes that you can find on google.
I sympathize with the need to constrain input in this situation, but I don't believe it is possible - Unicode is vast, expanding, and so is the subset used in names throughout the world.
Unlike email, there's no universally agreed-upon standard for the names people may use, or even which representations they may register as official with their respective governments. I suspect that any regex will eventually fail to pass a name considered valid by someone, somewhere in the world.
Of course, you do need to sanitize or escape input, to avoid the Little Bobby Tables problem. And there may be other constraints on which input you allow as well, such as the underlying systems used to store, render or manipulate names. As such, I recommend that you determine first the restrictions necessitated by the system your validation belongs to, and create a validation expression based on those alone. This may still cause inconvenience in some scenarios, but they should be rare.
I'll try to give a proper answer myself:
The only punctuations that should be allowed in a name are full stop, apostrophe and hyphen. I haven't seen any other case in the list of corner cases.
Regarding numbers, there's only one case with an 8. I think I can safely disallow that.
Regarding letters, any letter is valid.
I also want to include space.
This would sum up to this regex:
^[\p{L} \.'\-]+$
This presents one problem, i.e. the apostrophe can be used as an attack vector. It should be encoded.
So the validation code should be something like this (untested):
var name = nameParam.Trim();
if (!Regex.IsMatch(name, "^[\p{L} \.\-]+$"))
throw new ArgumentException("nameParam");
name = name.Replace("'", "'"); //' does not work in IE
Can anyone think of a reason why a name should not pass this test or a XSS or SQL Injection that could pass?
complete tested solution
using System;
using System.Text.RegularExpressions;
namespace test
{
class MainClass
{
public static void Main(string[] args)
{
var names = new string[]{"Hello World",
"John",
"João",
"タロウ",
"やまだ",
"山田",
"先生",
"мыхаыл",
"Θεοκλεια",
"आकाङ्क्षा",
"علاء الدين",
"אַבְרָהָם",
"മലയാളം",
"상",
"D'Addario",
"John-Doe",
"P.A.M.",
"' --",
"<xss>",
"\""
};
foreach (var nameParam in names)
{
Console.Write(nameParam+" ");
var name = nameParam.Trim();
if (!Regex.IsMatch(name, #"^[\p{L}\p{M}' \.\-]+$"))
{
Console.WriteLine("fail");
continue;
}
name = name.Replace("'", "'");
Console.WriteLine(name);
}
}
}
}
I would just allow everything (except an empty string) and assume the user knows what his name is.
There are 2 common cases:
You care that the name is accurate and are validating against a real paper passport or other identity document, or against a credit card.
You don't care that much and the user will be able to register as "Fred Smith" (or "Jane Doe") anyway.
In case (1), you can allow all characters because you're checking against a paper document.
In case (2), you may as well allow all characters because "123 456" is really no worse a pseudonym than "Abc Def".
I would think you would be better off excluding the characters you don't want with a regex. Trying to get every umlaut, accented e, hyphen, etc. will be pretty insane. Just exclude digits (but then what about a guy named "George Forman the 4th") and symbols you know you don't want like ##$%^ or what have you. But even then, using a regex will only guarantee that the input matches the regex, it will not tell you that it is a valid name.
EDIT after clarifying that this is trying to prevent XSS: A regex on a name field is obviously not going to stop XSS on its own. However, this article has a section on filtering that is a starting point if you want to go that route:
s/[\<\>\"\'\%\;\(\)\&\+]//g;
"Secure Programming for Linux and Unix HOWTO" by David A. Wheeler, v3.010 Edition (2003)
v3.72, 2015-09-19 is a more recent version.
BTW, do you plan to only permit the Latin alphabet, or do you also plan to try to validate Chinese, Arabic, Hindi, etc.?
As others have said, don't even try to do this. Step back and ask yourself what you are actually trying to accomplish. Then try to accomplish it without making any assumptions about what people's names are, or what they mean.
I don’t think that’s a good idea. Even if you find an appropriate regular expression (maybe using Unicode character properties), this wouldn’t prevent users from entering pseudo-names like John Doe, Max Mustermann (there even is a person with that name), Abcde Fghijk or Ababa Bebebe.
You could use the following regex code to validate 2 names separeted by a space with the following regex code:
^[A-Za-zÀ-ú]+ [A-Za-zÀ-ú]+$
or just use:
[[:lower:]] = [a-zà-ú]
[[:upper:]] =[A-ZÀ-Ú]
[[:alpha:]] = [A-Za-zÀ-ú]
[[:alnum:]] = [A-Za-zÀ-ú0-9]
It's a very difficult problem to validate something like a name due to all the corner cases possible.
Corner Cases
Anything anything here
Sanitize the inputs and let them enter whatever they want for a name, because deciding what is a valid name and what is not is probably way outside the scope of whatever you're doing; given the range of potential strange - and legal names is nearly infinite.
If they want to call themselves Tricyclopltz^2-Glockenschpiel, that's their problem, not yours.
A very contentious subject that I seem to have stumbled along here. However sometimes it's nice to head dear little-bobby tables off at the pass and send little Robert to the headmasters office along with his semi-colons and SQL comment lines --.
This REGEX in VB.NET includes regular alphabetic characters and various circumflexed european characters. However poor old James Mc'Tristan-Smythe the 3rd will have to input his pedigree in as the Jim the Third.
<asp:RegularExpressionValidator ID="RegExValid1" Runat="server"
ErrorMessage="ERROR: Please enter a valid surname<br/>" SetFocusOnError="true" Display="Dynamic"
ControlToValidate="txtSurname" ValidationGroup="MandatoryContent"
ValidationExpression="^[A-Za-z'\-\p{L}\p{Zs}\p{Lu}\p{Ll}\']+$">
This one worked perfectly for me in JavaScript:
^[a-zA-Z]+[\s|-]?[a-zA-Z]+[\s|-]?[a-zA-Z]+$
Here is the method:
function isValidName(name) {
var found = name.search(/^[a-zA-Z]+[\s|-]?[a-zA-Z]+[\s|-]?[a-zA-Z]+$/);
return found > -1;
}
Steps:
first remove all accents
apply the regular expression
To strip the accents:
private static string RemoveAccents(string s)
{
s = s.Normalize(NormalizationForm.FormD);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.Length; i++)
{
if (CharUnicodeInfo.GetUnicodeCategory(s[i]) != UnicodeCategory.NonSpacingMark) sb.Append(s[i]);
}
return sb.ToString();
}
This somewhat helps:
^[a-zA-Z]'?([a-zA-Z]|\.| |-)+$
This one should work
^([A-Z]{1}+[a-z\-\.\']*+[\s]?)*
Add some special characters if you need them.