Solution for Turkish-I in C++

Hi all!
In Turkish, one letter of the alphabet behaves differently: I (and i). In English, I and i are the upper- and lowercase forms of the same letter. In Turkish, the lowercase of I is not i but ı (dotless i).
So in a Turkish environment (i.e. Windows with a Turkish locale), a culture-sensitive case-insensitive comparison does not treat "DOMAIN.COM" and "domain.com" as equal. Since email transport and DNS use English (ASCII) casing rules, mail addresses that contain an uppercase I can cause problems.
In C# we can use the InvariantCultureIgnoreCase flag to fix the issue:
// Mock data
string localDomain = "domain.com";
string mailAddress = "USER@DOMAIN.COM";
string domainOfAddress = mailAddress.Split('@')[1];
string localInbox = "";
//
// Local inbox check
// Case-insensitive, culture-sensitive method
bool ignoreCase = true; // Equivalent to StringComparison.CurrentCultureIgnoreCase
if (System.String.Compare(localDomain, domainOfAddress, ignoreCase) == 0)
{
    // This comparison fails in a Turkish environment
    localInbox = mailAddress.Split('@')[0];
}
// Culture-safe version
if (System.String.Compare(localDomain, domainOfAddress, StringComparison.InvariantCultureIgnoreCase) == 0)
{
    // This is the correct/universal method
    localInbox = mailAddress.Split('@')[0];
}
Since I'm not experienced with C++, what would be the C++ equivalents of these two examples?

If you are programming on Windows, you can change the locale to en_US and then use _stricmp, or create a locale object for en_US and then use _stricmp_l:
setlocale( LC_CTYPE, "English" );
assert( _stricmp("DOMAIN", "domain") == 0 );
_locale_t l = _create_locale( LC_CTYPE, "English" );
assert( _stricmp_l("DOMAIN", "domain", l) == 0 );
If Boost is an option for you, a more portable and C++-friendly solution is to use the boost::locale library.
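For the specific case of comparing domain names, a locale-free comparison may be enough, since the question notes that DNS and email transport follow English (ASCII) casing. The sketch below is not taken from the answer above; the helper names asciiToLower and equalsIgnoreAsciiCase are hypothetical. It folds only the 26 ASCII letters and never consults the process locale, so a Turkish locale cannot change the result.

```cpp
#include <cstddef>
#include <string>

// Lowercase only the 26 ASCII letters; every other character is left alone.
// The C and C++ locales are never consulted, so 'I' always maps to 'i'.
inline char asciiToLower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Hypothetical helper: true when a and b are equal ignoring ASCII case.
inline bool equalsIgnoreAsciiCase(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (asciiToLower(a[i]) != asciiToLower(b[i])) return false;
    }
    return true;
}
```

This is only appropriate for ASCII identifiers such as host names; for user-visible text a locale-aware library remains the better choice.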

Related

How to validate e-mail address with C++ using CAtlRegExp

I need to be able to validate various formats of international email addresses in C++. Many of the answers I found online don't cut it, so I thought I would share a solution that works well for me, for anyone who is using the ATL Server Library.
Some background: I started with this post: Using a regular expression to validate an email address. It pointed to http://emailregex.com/, which has a regular expression, in various languages, that supports the RFC 5322 Official Standard for the internet message format.
The regular expression provided is:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
I'm using C++ with the ATL Server Library, which used to be part of Visual Studio. Microsoft has since released it on CodePlex as open source. We still use it for some of the template libraries. My goal is to modify this regular expression so that it works with CAtlRegExp.
The regular expression engine (CAtlRegExp) in ATL is pretty basic. I was able to modify the regular expression as follows:
^{([a-z0-9!#$%&'*+/=?^_`{|}~\-]+(\.([a-z0-9!#$%&'*+/=?^_`{|}~\-]+))*)@((([a-z0-9]([a-z0-9\-]*[a-z0-9])?\.)+[a-z0-9]([a-z0-9\-]*[a-z0-9])?)|(\[(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\.)(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\.)(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\.)((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\]))}$
The only thing that appears to be lost is Unicode support in domain names, which I was able to solve with IdnToAscii, following the C# example in the "How to: Verify that Strings Are in Valid Email Format" article on MSDN.
In this approach the user name and domain name are extracted from the email address. The domain name is converted to ASCII using IdnToAscii, then the two are put back together and run through the regular expression.
Please be aware that error handling was omitted for readability. Code is needed to prevent buffer overruns and to handle other errors. Someone passing an email address over 255 characters will cause this example to crash.
Code:
// Requires <Windows.h> and <atlrx.h>; link with Normaliz.lib for IdnToAscii.
bool WINAPI LocalLooksLikeEmailAddress(LPCWSTR lpszEmailAddress)
{
    bool bRetVal = true ;
    const int ccbEmailAddressMaxLen = 255 ;
    wchar_t achANSIEmailAddress[ccbEmailAddressMaxLen] = { L'\0' } ;
    ATL::CAtlRegExp<> regexp ;
    ATL::CAtlREMatchContext<> regexpMatch ;
    // First pass: split the address into user name and domain name.
    ATL::REParseError status = regexp.Parse(L"^{.+}@{.+}$", FALSE) ;
    if (status == REPARSE_ERROR_OK) {
        if (regexp.Match(lpszEmailAddress, &regexpMatch) && regexpMatch.m_uNumGroups == 2) {
            const CAtlREMatchContext<>::RECHAR* szStart = 0 ;
            const CAtlREMatchContext<>::RECHAR* szEnd = 0 ;
            regexpMatch.GetMatch(0, &szStart, &szEnd) ;
            ::wcsncpy_s(achANSIEmailAddress, szStart, (size_t)(szEnd - szStart)) ;
            regexpMatch.GetMatch(1, &szStart, &szEnd) ;
            wchar_t achDomainName[ccbEmailAddressMaxLen] = { L'\0' } ;
            ::wcsncpy_s(achDomainName, szStart, (size_t)(szEnd - szStart)) ;
            if (bRetVal) {
                // Convert an internationalized domain name to its ASCII (Punycode) form.
                wchar_t achPunycode[ccbEmailAddressMaxLen] = { L'\0' } ;
                if (IdnToAscii(0, achDomainName, -1, achPunycode, ccbEmailAddressMaxLen) == 0)
                    bRetVal = false ;
                else {
                    ::wcscat_s(achANSIEmailAddress, L"@") ;
                    ::wcscat_s(achANSIEmailAddress, achPunycode) ;
                }
            }
        }
    }
    // Second pass: validate the reassembled ASCII address.
    if (bRetVal) {
        status = regexp.Parse(
            L"^{([a-z0-9!#$%&'*+/=?^_`{|}~\\-]+(\\.([a-z0-9!#$%&'*+/=?^_`{|}~\\-]+))*)@((([a-z0-9]([a-z0-9\\-]*[a-z0-9])?\\.)+[a-z0-9]([a-z0-9\\-]*[a-z0-9])?)|(\\[(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\\.)(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\\.)(((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\\.)((2((5[0-5])|([0-4][0-9])))|(1[0-9][0-9])|([1-9]?[0-9]))\\]))}$"
            , FALSE) ;
        if (status == REPARSE_ERROR_OK) {
            bRetVal = regexp.Match(achANSIEmailAddress, &regexpMatch) != 0;
        }
    }
    return bRetVal ;
}
One thing worth mentioning is that this approach did not agree with the results in the C# MSDN article for two of the email addresses. Looking at the original regular expression listed on http://emailregex.com suggests that the MSDN article got it wrong, unless the specification has recently been changed. I decided to go with the regular expression mentioned on http://emailregex.com.
Here are my unit tests, using the same email addresses as the MSDN article:
#include <Windows.h>
#if _DEBUG
#define TESTEXPR(expr) _ASSERTE(expr)
#else
#define TESTEXPR(expr) if (!(expr)) throw ;
#endif
int main()
{
    LPCWSTR validEmailAddresses[] = { L"david.jones@proseware.com",
        L"d.j@server1.proseware.com",
        L"jones@ms1.proseware.com",
        L"j@proseware.com9",
        L"js#internal@proseware.com",
        L"j_9@[129.126.118.1]",
        L"js*@proseware.com", // <== According to https://msdn.microsoft.com/en-us/library/01escwtf(v=vs.110).aspx this is invalid,
                              // but according to http://emailregex.com/, which claims to support the RFC 5322 Official Standard, it is valid.
                              // I'm going with valid.
        L"js@proseware.com9",
        L"j.s@server1.proseware.com",
        L"js@contoso.中国",
        NULL } ;
    LPCWSTR invalidEmailAddresses[] = { L"j.@server1.proseware.com",
        L"\"j\\\"s\\\"\"@proseware.com", // <== According to https://msdn.microsoft.com/en-us/library/01escwtf(v=vs.110).aspx this is valid,
                                         // but according to http://emailregex.com/, which claims to support the RFC 5322 Official Standard, it is not.
                                         // I'm going with invalid.
        L"j..s@proseware.com",
        L"js@proseware..com",
        NULL } ;
    for (LPCWSTR* emailAddress = validEmailAddresses ; *emailAddress != NULL ; ++emailAddress)
    {
        TESTEXPR(LocalLooksLikeEmailAddress(*emailAddress)) ;
    }
    for (LPCWSTR* emailAddress = invalidEmailAddresses ; *emailAddress != NULL ; ++emailAddress)
    {
        TESTEXPR(!LocalLooksLikeEmailAddress(*emailAddress)) ;
    }
}

Want to trim my TCHAR* from the end only

TCHAR* pszBackupPath;
m_Edt_ExportPath.GetWindowText(pszBackupPath, dwcchBackupPath);
StrTrim(pszBackupPath, L" ");
StrTrim(pszBackupPath, L"\\"); //this line has issue
iRet = _tcslen(pszBackupPath);
boRet = PathIsNetworkPath(pszBackupPath);
if (FALSE == boRet)
{
    // MessageBox with string "Entered path is not a network path."
}
boRet = PathIsDirectory(pszBackupPath);
if (FALSE == boRet)
{
    // MessageBox with string "Entered path is not a valid directory."
}
This is part of my MFC code.
I am passing a network path from the UI. But StrTrim(pszBackupPath, L"\\") trims "\\" from both the start and the end, and I want it trimmed from the end only.
I don't know of a direct API for this. Please suggest one.
There is a simple function for that: PathRemoveBackslash (or PathCchRemoveBackslash on Windows 8 and later).
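If a dependency-free variant is preferred, trimming from the end only can also be sketched with the standard library. The helper name trimTrailingBackslashes is hypothetical and it operates on std::wstring rather than a TCHAR buffer, so the text would first have to be copied into a string. Unlike PathRemoveBackslash, which removes a single trailing backslash, this sketch removes all of them.

```cpp
#include <string>

// Hypothetical helper: drop every trailing backslash and leave the front untouched,
// so the leading "\\" of a network path survives.
inline std::wstring trimTrailingBackslashes(std::wstring path) {
    const std::wstring::size_type last = path.find_last_not_of(L'\\');
    // npos means the string is empty or consists only of backslashes: erase everything.
    path.erase(last == std::wstring::npos ? 0 : last + 1);
    return path;
}
```

Note that a drive root such as C:\ would become C: with this sketch, which may or may not be what the caller wants.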

C++ Qt RegExp does not match special characters like # or | or ^

I am currently looking for a solution to my RegExp problem. I have worked through the docs and searched the internet, but it almost seems that nobody uses RegExp; at least, I cannot find what I am looking for.
I am currently working on a password strength checker that follows the SANS Institute password policy. (http://www.sans.org/security-resources/policies/Password_Policy.pdf)
It says that a strong password contains:
“Special” characters (e.g. @#$%^&*()_+|~-=`{}[]:";'<>/ etc)
So I wanted to implement this in my Qt project. The relevant code is:
bool specialChars = contains("[@|#|$|%|\^|&|*|(|)|_|+|\||~|-|=|\|`|{|}|\[|\]|:|\"|;|'|<|>|/]", password);
where contains is:
/**
 * @brief PasswordStrengthChecker::contains
 * @param needle
 * @param hay
 * @return
 */
bool PasswordStrengthChecker::contains(QString needle, QString hay)
{
    QRegExp check(needle);
    for(int i = 0; i < hay.length(); i++)
    {
        if ( check.exactMatch(hay.at(i)) )
        {
            return true ;
        }
    }
    return false;
}
When I then check the results, it shows that # and the other special characters are not matched. What's happening here? The regex approach works fine for other conditions, such as:
bool digits = contains("\\d", password);
bool lowerCase = contains("[a-z]", password);
bool upperCase = contains("[A-Z]", password);
I am looking forward to your help.
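Try escaping the characters that cause problems with a backslash (\). Note that inside a C++ string literal the backslash itself must be doubled (for example "\\^"), and that inside a character class [...] the | is a literal character rather than alternation, so it is not needed as a separator.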
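Since the check only asks whether any one character belongs to a fixed set, a regex may not be needed at all. The sketch below swaps in std::string::find_first_of instead of QRegExp; it is not the approach from the question. The helper name containsSpecialChar is hypothetical, and the character set is copied from the policy excerpt quoted in the question, assuming its first character is @.

```cpp
#include <string>

// Hypothetical helper: true when password contains at least one character
// from the "special" set in the quoted policy excerpt.
inline bool containsSpecialChar(const std::string& password) {
    // No regex escaping is involved; only the double quote needs a backslash,
    // and only because this is a C++ string literal.
    static const std::string specials = "@#$%^&*()_+|~-=`{}[]:\";'<>/";
    return password.find_first_of(specials) != std::string::npos;
}
```

With a QString the same idea would need a conversion first (or a loop over the characters), which is left out here to keep the sketch free of Qt dependencies.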

Perl Win32::API and Pointers

I'm trying to utilize the Win32 API function DsGetSiteName() using Perl's Win32::API module. According to the Windows SDK, the function prototype for DsGetSiteName is:
DWORD DsGetSiteName(LPCTSTR ComputerName, LPTSTR *SiteName)
I successfully wrote a small C++ function using this API to get a better understanding of how it would actually work (I'm learning C++ on my own, but I digress).
Anyhow, from my understanding of the API documentation, the second parameter is supposed to be a pointer to a variable that receives a pointer to a string. In my C++ code, I wrote that as:
LPSTR site;
LPTSTR *psite = &site;
and have successfully called the API using the psite pointer.
Now my question is, is there a way to do the same using Perl's Win32::API? I've tried the following Perl code:
my $site = " " x 256;
my $computer = "devwin7";
my $DsFunc = Win32::API->new("netapi32","DWORD DsGetSiteNameA(LPCTSTR computer, LPTSTR site)");
my $DsResult = $DsFunc->Call($computer, $site);
print $site;
and the result of the call in $DsResult is zero (meaning success), but the data in $site is not what I want, it looks to be a mixture of ASCII and non-printable characters.
Could the $site variable be holding the pointer address of the allocated string? And if so, is there a way using Win32::API to dereference that address to get at the string?
Thanks in advance.
Win32::API can't handle char**. You'll need to extract the string yourself.
use strict;
use warnings;
use feature qw( say state );
use Encode qw( encode decode );
use Win32::API qw( );
use Config; # provides %Config, needed for $Config{ptrsize} below
use constant {
NO_ERROR => 0,
ERROR_NO_SITENAME => 1919,
ERROR_NOT_ENOUGH_MEMORY => 8,
};
use constant PTR_SIZE => $Config{ptrsize};
use constant PTR_FORMAT =>
PTR_SIZE == 8 ? 'Q'
: PTR_SIZE == 4 ? 'L'
: die("Unrecognized ptrsize\n");
use constant PTR_WIN32API_TYPE =>
PTR_SIZE == 8 ? 'Q'
: PTR_SIZE == 4 ? 'N'
: die("Unrecognized ptrsize\n");
# Inefficient. Needs a C implementation.
sub decode_LPCWSTR {
my ($ptr) = @_;
return undef if !$ptr;
my $sW = '';
for (;;) {
my $chW = unpack('P2', pack(PTR_FORMAT, $ptr));
last if $chW eq "\0\0";
$sW .= $chW;
$ptr += 2;
}
return decode('UTF-16le', $sW);
}
sub NetApiBufferFree {
my ($Buffer) = @_;
state $NetApiBufferFree = Win32::API->new('netapi32.dll', 'NetApiBufferFree', PTR_WIN32API_TYPE, 'N')
or die($^E);
$NetApiBufferFree->Call($Buffer);
}
sub DsGetSiteName {
my ($ComputerName) = @_;
state $DsGetSiteName = Win32::API->new('netapi32.dll', 'DsGetSiteNameW', 'PP', 'N')
or die($^E);
my $packed_ComputerName = encode('UTF-16le', $ComputerName."\0");
my $packed_SiteName_buf_ptr = pack(PTR_FORMAT, 0);
$^E = $DsGetSiteName->Call($packed_ComputerName, $packed_SiteName_buf_ptr)
and return undef;
my $SiteName_buf_ptr = unpack(PTR_FORMAT, $packed_SiteName_buf_ptr);
my $SiteName = decode_LPCWSTR($SiteName_buf_ptr);
NetApiBufferFree($SiteName_buf_ptr);
return $SiteName;
}
{
my $computer_name = 'devwin7';
my ($site_name) = DsGetSiteName($computer_name)
or die("DsGetSiteName: $^E\n");
say $site_name;
}
All but decode_LPCWSTR is untested.
I used the WIDE interface instead of the ANSI interface. Using the ANSI interface is needlessly limiting.
PS — I wrote the code to which John Zwinck linked.
I think you're right about $site holding the address of a string. Here's some code that demonstrates the use of an output parameter with Perl's Win32 module:
http://www.perlmonks.org/?displaytype=displaycode;node_id=890698
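To make the "pointer to a variable that receives a pointer to a string" pattern from the question concrete, here is a minimal C++ sketch. It does not call the real API: mockGetSiteName and mockBufferFree are hypothetical stand-ins, and the site name is a placeholder value. The point is only that the callee allocates the buffer and writes its address into the caller's pointer, which is why the caller ends up holding an address that must be dereferenced and later freed.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for an API that returns a string through a char** out-parameter.
// It allocates the buffer itself and stores the buffer's address in *siteName.
inline unsigned long mockGetSiteName(const char* /*computerName*/, char** siteName) {
    static const char kName[] = "Default-First-Site-Name"; // placeholder data
    char* buffer = static_cast<char*>(std::malloc(sizeof kName));
    if (buffer == nullptr) return 8;   // mirrors ERROR_NOT_ENOUGH_MEMORY
    std::memcpy(buffer, kName, sizeof kName);
    *siteName = buffer;                // the caller's pointer now refers to the new buffer
    return 0;                          // 0 means success
}

// Hypothetical stand-in for the matching free routine.
inline void mockBufferFree(void* buffer) { std::free(buffer); }
```

A caller would declare char* site = nullptr, pass &site as the second argument, read the string through site on success, and then hand site to the free routine.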

Unicode Regex; Invalid XML characters

The list of valid XML characters is well known; as defined by the spec, it is:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
My question is whether or not it's possible to make a PCRE regular expression for this (or its inverse) without actually hard-coding the codepoints, by using Unicode general categories. An inverse might be something like [\p{Cc}\p{Cs}\p{Cn}], except that improperly covers linefeeds and tabs and misses some other invalid characters.
I know this isn't exactly an answer to your question, but it's helpful to have it here:
Regular Expression to match valid XML Characters:
[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]
So to remove invalid chars from XML, you'd do something like
// filters control characters but allows only properly-formed surrogate sequences
private static Regex _invalidXMLChars = new Regex(
    @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
RegexOptions.Compiled);
/// <summary>
/// removes any unusual unicode characters that can't be encoded into XML
/// </summary>
public static string RemoveInvalidXMLChars(string text)
{
if (string.IsNullOrEmpty(text)) return "";
return _invalidXMLChars.Replace(text, "");
}
I had our resident regex / XML genius, he of the 4,400+ upvoted post, check this, and he signed off on it.
For systems that internally store code points in UTF-16, it is common to use surrogate pairs (xD800-xDFFF) for code points above 0xFFFF. In those systems you must verify whether you can really use, for example, \u12345 or must specify it as a surrogate pair. (I just found out that in C# you can use \u1234 (16-bit) and \U00001234 (32-bit).)
According to Microsoft "the W3C recommendation does not allow surrogate characters inside element or attribute names." While searching W3s website I found C079 and C078 that might be of interest.
I tried this in java and it works:
private String filterContent(String content) {
return content.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
}
Thank you Jeff.
The above solutions didn't work for me if the hex code was present in the XML as a character reference, e.g.
<element>&#x8;</element>
The following code would break:
string xmlFormat = "<element>{0}</element>";
string invalid = "&#x8;";
string xml = string.Format(xmlFormat, invalid);
xml = Regex.Replace(xml, @"[\x01-\x08\x0B\x0C\x0E\x0F\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
XDocument.Parse(xml);
It returns:
XmlException: '', hexadecimal value 0x08, is an invalid character.
Line 1, position 14.
The following improved regex fixes the problem mentioned above:
&#x([0-8BCEFbcef]|1[0-9A-Fa-f]);|[\x01-\x08\x0B\x0C\x0E\x0F\u0000-\u0008\u000B\u000C\u000E-\u001F]
Here is a unit test that covers the first 300 Unicode characters and verifies that only invalid characters are removed:
[Fact]
public void validate_that_RemoveInvalidData_only_remove_all_invalid_data()
{
string xmlFormat = "<element>{0}</element>";
string[] allAscii = (Enumerable.Range('\x1', 300).Select(x => ((char)x).ToString()).ToArray());
string[] allAsciiInHexCode = (Enumerable.Range('\x1', 300).Select(x => "&#x" + (x).ToString("X") + ";").ToArray());
string[] allAsciiInHexCodeLoweCase = (Enumerable.Range('\x1', 300).Select(x => "&#x" + (x).ToString("x") + ";").ToArray());
bool hasParserError = false;
IXmlSanitizer sanitizer = new XmlSanitizer();
foreach (var test in allAscii.Concat(allAsciiInHexCode).Concat(allAsciiInHexCodeLoweCase))
{
bool shouldBeRemoved = false;
string xml = string.Format(xmlFormat, test);
try
{
XDocument.Parse(xml);
shouldBeRemoved = false;
}
catch (Exception e)
{
if (test != "<" && test != "&") //these char are taken care of automatically by my convertor so don't need to test. You might need to add these.
{
shouldBeRemoved = true;
}
}
int xmlCurrentLength = xml.Length;
int xmlLengthAfterSanitize = Regex.Replace(xml, @"&#x([0-8BCEF]|1[0-9A-F]);|[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "").Length;
if ((shouldBeRemoved && xmlCurrentLength == xmlLengthAfterSanitize) //it wasn't properly Removed
||(!shouldBeRemoved && xmlCurrentLength != xmlLengthAfterSanitize)) //it was removed but shouldn't have been
{
hasParserError = true;
Console.WriteLine(test + xml);
}
}
Assert.Equal(false, hasParserError);
}
Another way to remove invalid XML characters in C# is the XmlConvert.IsXmlChar method (available since .NET Framework 4.0):
public static string RemoveInvalidXmlChars(string content)
{
return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}
Or you can check that all characters are XML-valid:
public static bool CheckValidXmlChars(string content)
{
return content.All(ch => System.Xml.XmlConvert.IsXmlChar(ch));
}
.Net Fiddle - https://dotnetfiddle.net/v1TNus
For example, the vertical tab character (\v) is valid UTF-8 but not valid XML 1.0, and many libraries (including libxml2) miss it and silently output invalid XML.
In PHP, the regex would look like this:
protected function isStringValid($string)
{
$regex = '/[^\x{9}\x{a}\x{d}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u';
return (preg_match($regex, $string, $matches) === 0);
}
This handles all three ranges from the XML specification:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
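The same production can also be written as a plain range check on code points, which avoids regex-engine differences around surrogates altogether. The sketch below is in C++ and assumes the text has already been decoded into UTF-32; the names isXmlChar and removeInvalidXmlChars are hypothetical and not part of any library mentioned above.

```cpp
#include <string>

// True when cp is a legal XML 1.0 character according to the production quoted above.
inline bool isXmlChar(char32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Hypothetical helper: copy text, skipping code points that XML 1.0 forbids.
inline std::u32string removeInvalidXmlChars(const std::u32string& text) {
    std::u32string out;
    out.reserve(text.size());
    for (char32_t cp : text) {
        if (isXmlChar(cp)) out.push_back(cp);
    }
    return out;
}
```

Working on decoded code points means the surrogate range never needs special handling; it simply falls outside the allowed ranges.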