Unicode Regex; Invalid XML characters

Unicode Regex; Invalid XML characters - regex

The list of valid XML characters is well known, as defined by the spec it's:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
My question is whether or not it's possible to make a PCRE regular expression for this (or its inverse) without actually hard-coding the codepoints, by using Unicode general categories. An inverse might be something like [\p{Cc}\p{Cs}\p{Cn}], except that improperly covers linefeeds and tabs and misses some other invalid characters.

I know this isn't exactly an answer to your question, but it's helpful to have it here:
Regular Expression to match valid XML Characters:
[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]
So to remove invalid chars from XML, you'd do something like
// filters control characters but allows only properly-formed surrogate sequences
private static Regex _invalidXMLChars = new Regex(
#"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
RegexOptions.Compiled);
/// <summary>
/// removes any unusual unicode characters that can't be encoded into XML
/// </summary>
public static string RemoveInvalidXMLChars(string text)
{
if (string.IsNullOrEmpty(text)) return "";
return _invalidXMLChars.Replace(text, "");
}
I had our resident regex / XML genius, he of the 4,400+ upvoted post, check this, and he signed off on it.

For systems that internally stores the codepoints in UTF-16, it is common to use surrogate pairs (xD800-xDFFF) for codepoints above 0xFFFF and in those systems you must verify if you really can use for example \u12345 or must specify that as a surrogate pair. (I just found out that in C# you can use \u1234 (16 bit) and \U00001234 (32-bit))
According to Microsoft "the W3C recommendation does not allow surrogate characters inside element or attribute names." While searching W3s website I found C079 and C078 that might be of interest.

I tried this in java and it works:
private String filterContent(String content) {
return content.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
}
Thank you Jeff.

The above solutions didn't work for me if the hex code was present in the xml. e.g.
<element></element>
The following code would break:
string xmlFormat = "<element>{0}</element>";
string invalid = " ";
string xml = string.Format(xmlFormat, invalid);
xml = Regex.Replace(xml, #"[\x01-\x08\x0B\x0C\x0E\x0F\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
XDocument.Parse(xml);
It returns:
XmlException: '', hexadecimal value 0x08, is an invalid character.
Line 1, position 14.
The following is the improved regex and fixed the problem mentioned above:
&#x([0-8BCEFbcef]|1[0-9A-Fa-f]);|[\x01-\x08\x0B\x0C\x0E\x0F\u0000-\u0008\u000B\u000C\u000E-\u001F]
Here is a unit test for the first 300 unicode characters and verifies that only invalid characters are removed:
[Fact]
public void validate_that_RemoveInvalidData_only_remove_all_invalid_data()
{
string xmlFormat = "<element>{0}</element>";
string[] allAscii = (Enumerable.Range('\x1', 300).Select(x => ((char)x).ToString()).ToArray());
string[] allAsciiInHexCode = (Enumerable.Range('\x1', 300).Select(x => "&#x" + (x).ToString("X") + ";").ToArray());
string[] allAsciiInHexCodeLoweCase = (Enumerable.Range('\x1', 300).Select(x => "&#x" + (x).ToString("x") + ";").ToArray());
bool hasParserError = false;
IXmlSanitizer sanitizer = new XmlSanitizer();
foreach (var test in allAscii.Concat(allAsciiInHexCode).Concat(allAsciiInHexCodeLoweCase))
{
bool shouldBeRemoved = false;
string xml = string.Format(xmlFormat, test);
try
{
XDocument.Parse(xml);
shouldBeRemoved = false;
}
catch (Exception e)
{
if (test != "<" && test != "&") //these char are taken care of automatically by my convertor so don't need to test. You might need to add these.
{
shouldBeRemoved = true;
}
}
int xmlCurrentLength = xml.Length;
int xmlLengthAfterSanitize = Regex.Replace(xml, #"&#x([0-8BCEF]|1[0-9A-F]);|[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "").Length;
if ((shouldBeRemoved && xmlCurrentLength == xmlLengthAfterSanitize) //it wasn't properly Removed
||(!shouldBeRemoved && xmlCurrentLength != xmlLengthAfterSanitize)) //it was removed but shouldn't have been
{
hasParserError = true;
Console.WriteLine(test + xml);
}
}
Assert.Equal(false, hasParserError);
}

Another way to remove incorrect XML chars in C# with using XmlConvert.IsXmlChar Method (Available since .NET Framework 4.0)
public static string RemoveInvalidXmlChars(string content)
{
return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}
or you may check that all characters are XML-valid.
public static bool CheckValidXmlChars(string content)
{
return content.All(ch => System.Xml.XmlConvert.IsXmlChar(ch));
}
.Net Fiddle - https://dotnetfiddle.net/v1TNus
For example, the vertical tab symbol (\v) is not valid for XML, it is valid UTF-8, but not valid XML 1.0, and even many libraries (including libxml2) miss it and silently output invalid XML.

In PHP the regex would look like the following way:
protected function isStringValid($string)
{
$regex = '/[^\x{9}\x{a}\x{d}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u';
return (preg_match($regex, $string, $matches) === 0);
}
This would handle all 3 ranges from the xml specification:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Related

Angular Input Restriction Directive - Negating Regular Expressions

EDIT: Please feel free to add additional validations that would be useful for others, using this simple directive.
--
I'm trying to create an Angular Directive that limits the characters input into a text box. I've been successful with a couple common use cases (alphbetical, alphanumeric and numeric) but using popular methods for validating email addresses, dates and currency I can't get the directive to work since I need it negate the regex. At least that's what I think it needs to do.
Any assistance for currency (optional thousand separator and cents), date (mm/dd/yyyy) and email is greatly appreciated. I'm not strong with regular expressions at all.
Here's what I have currently:
http://jsfiddle.net/corydorning/bs05ys69/
HTML
<div ng-app="example">
<h1>Validate Directive</h1>
<p>The Validate directive allow us to restrict the characters an input can accept.</p>
<h3><code>alphabetical</code> <span style="color: green">(works)</span></h3>
<p>Restricts input to alphabetical (A-Z, a-z) characters only.</p>
<label><input type="text" validate="alphabetical" ng-model="validate.alphabetical"/></label>
<h3><code>alphanumeric</code> <span style="color: green">(works)</span></h3>
<p>Restricts input to alphanumeric (A-Z, a-z, 0-9) characters only.</p>
<label><input type="text" validate="alphanumeric" ng-model="validate.alphanumeric" /></label>
<h3><code>currency</code> <span style="color: red">(doesn't work)</span></h3>
<p>Restricts input to US currency characters with comma for thousand separator (optional) and cents (optional).</p>
<label><input type="text" validate="currency.us" ng-model="validate.currency" /></label>
<h3><code>date</code> <span style="color: red">(doesn't work)</span></h3>
<p>Restricts input to the mm/dd/yyyy date format only.</p>
<label><input type="text" validate="date" ng-model="validate.date" /></label>
<h3><code>email</code> <span style="color: red">(doesn't work)</span></h3>
<p>Restricts input to email format only.</p>
<label><input type="text" validate="email" ng-model="validate.email" /></label>
<h3><code>numeric</code> <span style="color: green">(works)</span></h3>
<p>Restricts input to numeric (0-9) characters only.</p>
<label><input type="text" validate="numeric" ng-model="validate.numeric" /></label>
JavaScript
angular.module('example', [])
.directive('validate', function () {
var validations = {
// works
alphabetical: /[^a-zA-Z]*$/,
// works
alphanumeric: /[^a-zA-Z0-9]*$/,
// doesn't work - need to negate?
// taken from: http://stackoverflow.com/questions/354044/what-is-the-best-u-s-currency-regex
currency: /^[+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?$/,
// doesn't work - need to negate?
// taken from here: http://stackoverflow.com/questions/15196451/regular-expression-to-validate-datetime-format-mm-dd-yyyy
date: /(?:0[1-9]|1[0-2])\/(?:0[1-9]|[12][0-9]|3[01])\/(?:19|20)[0-9]{2}/,
// doesn't work - need to negate?
// taken from: http://stackoverflow.com/questions/46155/validate-email-address-in-javascript
email: /^([\w-]+(?:\.[\w-]+)*)#((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$/i,
// works
numeric: /[^0-9]*$/
};
return {
require: 'ngModel',
scope: {
validate: '#'
},
link: function (scope, element, attrs, modelCtrl) {
var pattern = validations[scope.validate] || scope.validate
;
modelCtrl.$parsers.push(function (inputValue) {
var transformedInput = inputValue.replace(pattern, '')
;
if (transformedInput != inputValue) {
modelCtrl.$setViewValue(transformedInput);
modelCtrl.$render();
}
return transformedInput;
});
}
};
});

I am pretty sure, there is better way, probably regex is also not best tool for that, but here is mine proposition.
This way you can only restrict which characters are allowed for input and to force user to use proper format, but you will need to also validate final input after user will finish typing, but this is another story.
The alphabetic, numeric and alphanumeric are quite simple, for input and validating input, as it is clear what you can type, and what is a proper final input. But with dates, mails, currency, you cannot validate input with regex for full valid input, as user need to type it in first, and in a meanwhile the input need to by invalid in terms of final valid input. So, this is one thing to for example restrict user to type just digits and / for a date format, like: 12/12/1988, but in the end you need to check if he typed proper date or just 12/12/126 for example. This need to be checked when answer is submited by user, or when text field lost focus, etc.
To just validate typed character, you can try with this:
JSFiddle DEMO
First change:
var transformedInput = inputValue.replace(pattern, '')
to
var transformedInput = inputValue.replace(pattern, '$1')
then use regular expressions:
/^([a-zA-Z]*(?=[^a-zA-Z]))./ - alphabetic
/^([a-zA-Z0-9]*(?=[^a-zA-Z0-9]))./ - alphanumeric
/(\.((?=[^\d])|\d{2}(?![^,\d.]))|,((?=[^\d])|\d{3}(?=[^,.$])|(?=\d{1,2}[^\d]))|\$(?=.)|\d{4,}(?=,)).|[^\d,.$]|^\$/- currency (allow string like: 343243.34, 1,123,345.34, .05 with or without $)
^(((0[1-9]|1[012])|(\d{2}\/\d{2}))(?=[^\/])|((\d)|(\d{2}\/\d{2}\/\d{1,3})|(.+\/))(?=[^\d])|\d{2}\/\d{2}\/\d{4}(?=.)).|^(1[3-9]|[2-9]\d)|((?!^)(3[2-9]|[4-9]\d)\/)|[3-9]\d{3}|2[1-9]\d{2}|(?!^)\/\d\/|^\/|[^\d/] - date (00-12/00-31/0000-2099)
/^(\d*(?=[^\d]))./ - numeric
/^([\w.$-]+\#[\w.]+(?=[^\w.])|[\w.$-]+\#(?=[^\w.-])|[\w.#-]+(?=[^\w.$#-])).$|\.(?=[^\w-#]).|[^\w.$#-]|^[^\w]|\.(?=#).|#(?=\.)./i - email
Generally, it use this pattern:
([valid characters or structure] captured in group $1)(?= positive lookahead for not allowed characters) any character
in effect it will capture all valid character in group $1, and if user type in an invalid character, whole string is replaced with already captured valid characters from group $1. It is complemented by part which shall exclude some obvious invalid character(s), like ## in a mail, or 34...2 in currency.
With understanding how these regular expression works, despite that it looks quite complex, I think it easy to extend it, by adding additional allowed/not allowed characters.
Regular expression for validating currency, dates and mails are easy to find, so I find it redundant to post them here.
OffTopic. Whats more the currency part in your demo is not working, it is bacause of: validate="currency.us" instead of validate="currency", or at least it works after this modification.

In my opinion it is impossible to create regular expressions that will work for matching things like dates or emails with the
parser you use. This is mainly because you would need non-capturing groups in your
regular expressions (which is possible), which are not replaced by the
inputValue.replace(pattern, '') call you have in your parser function. And this is the
part that is not possible in JavaScript. JavaScript replaces what you put in non-capturing
groups as well.
So... you'll need to go for a different approach. I would suggest to go for positive
regular expressions, which will yield a match when the input is valid.
Then you need of course to change the code of your parser. You could for instance
decide to chop off characters from the end of the input text until what remains passes
the regular expression test. This you could code as follows:
modelCtrl.$parsers.push(function (inputValue) {
var transformedInput = inputValue;
while (transformedInput && !pattern.exec(transformedInput)) {
// validation fails: chop off last character and try again
transformedInput = transformedInput.slice(0, -1);
}
if (transformedInput !== inputValue) {
modelCtrl.$setViewValue(transformedInput);
modelCtrl.$render();
}
return transformedInput;
});
Now life has become a bit easier. Just pay attention that you make your regular
expressions in such a way that they do not reject partial input. So "01/" should be
considered valid for a date, otherwise the user can never get to type in a date. On
the other hand, as soon as it becomes clear that adding characters will no longer
allow for valid input, the regular expression should reject it. So "101" should be
rejected as a date, as you can never add characters at the end to make it a valid date.
Also, all of these regular expressions should check the whole input, so as a consequence
they need to make use of the ^ and $ symbols.
Here is what the regular expression for a (partial) date could look like:
^([0-9]{0,2}|[0-9]{2}[\/]([0-9]{0,2}|[0-9]{2}[\/][0-9]{0,4}))$
This means: an input of 0 to 2 digits is valid, or exactly 2 digits followed by a slash, followed by either:
0 to 2 digits, or
exactly 2 digits followed by a slash, followed by 0 to 4 digits
Admittedly, not as smart as the one you had found, but that one would need a lot of editing to allow for partially entered dates. It is possible, but
it represents a very long expression with a lot of brackets and |.
Once you have all the regular expressions set up, you could think to further improve
the parser. One idea would be to not let it chop off characters from the end, but to
let it test all strings with one character removed somewhere compared to the original,
and see which one passes the test. If there is no way found to remove one character and have
success, then remove two consecutive characters in any place of the input value,
then three, ... etc, until you find a value that passes the test or arrive at an empty value.
This will work better for cases where the user inserts characters half way their input.
Just an idea...

import { Directive, ElementRef, EventEmitter, HostListener, Input, Output, Renderer2 } from '#angular/core';
import { ControlValueAccessor, NG_VALUE_ACCESSOR } from '#angular/forms';
import { CurrencyPipe, DecimalPipe } from '#angular/common';
import { ValueChangeEvent } from '#goomTool/goom-elements/events/value-change-event.model';
const noOperation = () => {
};
#Directive({
selector: '[formattedNumber]',
providers: [{
provide: NG_VALUE_ACCESSOR,
useExisting: FormattedNumberDirective,
multi: true
}]
})
export class FormattedNumberDirective implements ControlValueAccessor {
#Input() public configuration;
#Output() public valueChange: EventEmitter<ValueChangeEvent> = new EventEmitter();
public locale: string = process.env.LOCALE;
private el: HTMLInputElement;
// Keeps track of the value without formatting
private innerInputValue: any;
private specialKeys: string[] =
['Backspace', 'Tab', 'End', 'Home', 'Enter', 'Shift', 'ArrowRight', 'ArrowLeft', 'Delete'];
private onTouchedCallback: () => void = noOperation;
private onChangeCallback: (a: any) => void = noOperation;
constructor(private elementRef: ElementRef,
private decimalPipe: DecimalPipe,
private currencyPipe: CurrencyPipe,
private renderer: Renderer2) {
this.el = elementRef.nativeElement;
}
public writeValue(value: any) {
if (value !== this.innerInputValue) {
if (!!value) {
this.renderer.setAttribute(this.elementRef.nativeElement, 'value', this.getFormattedValue(value));
}
this.innerInputValue = value;
}
}
public registerOnChange(fn: any) {
this.onChangeCallback = fn;
}
public registerOnTouched(fn: any) {
this.onTouchedCallback = fn;
}
// On Focus remove all non-digit ,display actual value
#HostListener('focus', ['$event.target.value'])
public onfocus(value) {
if (!!this.innerInputValue) {
this.el.value = this.innerInputValue;
}
}
// On Blur set values to pipe format
#HostListener('blur', ['$event.target.value'])
public onBlur(value) {
this.innerInputValue = value;
if (!!value) {
this.el.value = this.getFormattedValue(value);
}
}
/**
* Allows special key, Unit Interval, value based on regular expression
*
* #param event
*/
#HostListener('keydown', ['$event'])
public onKeyDown(event) {
// Allow Backspace, tab, end, and home keys . .
if (this.specialKeys.indexOf(event.key) !== -1) {
if (event.key === 'Backspace') {
this.updateValue(this.getBackSpaceValue(this.el.value, event));
}
if (event.key === 'Delete') {
this.updateValue(this.getDeleteValue(this.el.value, event));
}
return;
}
const next: string = this.concatAtIndex(this.el.value, event);
if (this.configuration.angularPipe && this.configuration.angularPipe.length > 0) {
if (!this.el.value.includes('.')
&& (this.configuration.min == null || this.configuration.min < 1)) {
if (next.startsWith('0') || next.startsWith('0.') || next.startsWith('.')) {
if (next.length > 1) {
this.updateValue(next);
}
return;
}
}
}
/* pass your pattern in component regex e.g.
* regex = new RegExp(RegexPattern.WHOLE_NUMBER_PATTERN)
*/
if (next && !String(next).match(this.configuration.regex)) {
event.preventDefault();
return;
}
if (!!this.configuration.minFractionDigits && !!this.configuration.maxFractionDigits) {
if (!!next.split('\.')[1] && next.split('\.')[1].length > this.configuration.minFractionDigits) {
return this.validateFractionDigits(next, event);
}
}
this.innerInputValue = next;
this.updateValue(next);
}
private updateValue(newValue) {
this.onTouchedCallback();
this.onChangeCallback(newValue);
if (newValue) {
this.renderer.setAttribute(this.elementRef.nativeElement, 'value', newValue);
}
}
private validateFractionDigits(next, event) {
// create real-time pattern to validate min & max fraction digits
const regex = `^[-]?\\d+([\\.,]\\d{${this.configuration.minFractionDigits},${this.configuration.maxFractionDigits}})?$`;
if (!String(next).match(regex)) {
event.preventDefault();
return;
}
this.updateValue(next);
}
private concatAtIndex(current: string, event) {
return current.slice(0, event.currentTarget.selectionStart) + event.key +
current.slice(event.currentTarget.selectionEnd);
}
private getBackSpaceValue(current: string, event) {
return current.slice(0, event.currentTarget.selectionStart - 1) +
current.slice(event.currentTarget.selectionEnd);
}
private getDeleteValue(current: string, event) {
return current.slice(0, event.currentTarget.selectionStart) +
current.slice(event.currentTarget.selectionEnd + 1);
}
private transformCurrency(value) {
return this.currencyPipe.transform(value, this.configuration.currencyCode, this.configuration.display,
this.configuration.digitsInfo, this.locale);
}
private transformDecimal(value) {
return this.decimalPipe.transform(value, this.configuration.digitsInfo, this.locale);
}
private transformPercent(value) {
return this.decimalPipe.transform(value, this.configuration.digitsInfo, this.locale) + ' %';
}
private getFormattedValue(value) {
switch (this.configuration.angularPipe) {
case ('decimal'): {
return this.transformDecimal(value);
}
case ('currency'): {
return this.transformCurrency(value);
}
case ('percent'): {
return this.transformPercent(value);
}
default: {
return value;
}
}
}
}
----------------------------------
export const RegexPattern = Object.freeze({
PERCENTAGE_PATTERN: '^([1-9]\\d*(\\.)\\d*|0?(\\.)\\d*[1-9]\\d*|[1-9]\\d*)$', // e.g. '.12% ' or 12%
DECIMAL_PATTERN: '^(([-]+)?([1-9]\\d*(\\.|\\,)\\d*|0?(\\.|\\,)\\d*[1-9]\\d*|[1-9]\\d*))$', // e.g. '123.12'
CURRENCY_PATTERN: '\\$?[-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\\.[0-9]{2})?$', // e.g. '$123.12'
KEY_PATTERN: '^[a-zA-Z\\-]+-[0-9]+', // e.g. ABC-1234
WHOLE_NUMBER_PATTERN: '^([-]?([1-9][0-9]*)|([0]+)$)$' // e.g 1234
});

String replace with dictionary exception handling

I've implemented the answer here to do token replacements of a string:
https://stackoverflow.com/a/1231815/1224021
My issue now is when this method finds a token with a value that is not in the dictionary. I get the exception "The given key was not present in the dictionary." and just return the normal string. What I'd like to happen obviously is all the good tokens get replaced, but the offending one remains au naturale. Guessing I'll need to do a loop vs. the one line regex replace? Using vb.net. Here's what I'm currently doing:
Shared ReadOnly re As New Regex("\$(\w+)\$", RegexOptions.Compiled)
Public Shared Function GetTokenContent(ByVal val As String) As String
Dim retval As String = val
Try
If Not String.IsNullOrEmpty(val) AndAlso val.Contains("$") Then
Dim args = GetRatesDictionary()
retval = re.Replace(val, Function(match) args(match.Groups(1).Value))
End If
Catch ex As Exception
' not sure how to handle?
End Try
Return retval
End Function

The exception is likely thrown in the line
retval = re.Replace(val, Function(match) args(match.Groups(1).Value))
because this is the only place you are keying the dictionary. Make use of the Dictionary.ContainsKey method before accessing it.
retval = re.Replace(val,
Function(match)
return If(args.ContainsKey(match.Groups(1).Value), args(match.Groups(1).Value), val)
End Function)

This is what I got to work vs. the regex, which was also a suggestion on the original thread by Allen Wang: https://stackoverflow.com/a/7957728/1224021
Public Shared Function GetTokenContent(ByVal val As String) As String
Dim retval As String = val
Try
If Not String.IsNullOrEmpty(val) AndAlso val.Contains("$") Then
Dim args = GetRatesDictionary("$")
retval = args.Aggregate(val, Function(current, value) current.Replace(value.Key, value.Value))
End If
Catch ex As Exception
End Try
Return retval
End Function

I know it's been a while since this question was answered, but FYI for anyone wanting to still use the Regex / Dictionary match approach, the following works (based on the sample in the OP question):
retVal = re.Replace(formatString,
match => args.ContainsKey(match.Groups[1].Captures[0].Value)
? args[match.Groups[1].Captures[0].Value]
: string.Empty);
... or my full sample as a string extension method is:
public static class StringExtensions
{
// Will replace parameters enclosed in double curly braces
private static readonly Lazy<Regex> ParameterReplaceRegex = new Lazy<Regex>(() => new Regex(#"\{\{(?<key>\w+)\}\}", RegexOptions.Compiled));
public static string InsertParametersIntoFormatString(this string formatString, string parametersJsonArray)
{
if (parametersJsonArray != null)
{
var deserialisedParamsDictionary = JsonConvert.DeserializeObject<Dictionary<string, string>>(parametersJsonArray);
formatString = ParameterReplaceRegex.Value.Replace(formatString,
match => deserialisedParamsDictionary.ContainsKey(match.Groups[1].Captures[0].Value)
? deserialisedParamsDictionary[match.Groups[1].Captures[0].Value]
: string.Empty);
}
return formatString;
}
}
There are a few things to note here:
1) My parameters are passed in as a JSON array, e.g.: {"ProjectCode":"12345","AnotherParam":"Hi there!"}
2) The actual template / format string to do the replacements on has the parameters enclosed in double curly braces: "This is the Project Code: {{ProjectCode}}, this is another param {{AnotherParam}}"
3) Regex is both Lazy initialized and Compiled to suit my particular use case of:
the screen this code serves may not be used often
but once it is, it will get heavy use
so it should be as efficient on subsequent calls as possible.

C++ Qt RegExp does not match special characters like # or | or ^

I am currently looking for a solution for my RegExp problem. I have worked through the docs and looked through the internet, but it almosts seems like that nobody uses RegExp - well at least, I cannot find what I am looking for.
I am currently working on a password strength checker according to the SANS institute for password policies. (http://www.sans.org/security-resources/policies/Password_Policy.pdf)
It says that a strong password contains:
“Special” characters (e.g. ##$%^&*()_+|~-=`{}[]:";'<>/ etc)
So I wanted to implement such into my Qt Project. The relevant code is:
bool specialChars = contains("[#|#|$|%|\^|&|*|(|)|_|+|\||~|-|=|\|`|{|}|\[|\]|:|\"|;|'|<|>|/]", password);
whereas contains is:
/**
* #brief PasswordStrenghChecker::contains
* #param needle
* #param hay
* #return
*/
bool PasswordStrengthChecker::contains(QString needle, QString hay)
{
QRegExp check(needle);
for(int i = 0; i < hay.length(); i++)
{
if ( check.exactMatch(hay.at(i)) )
{
return true ;
}
}
return false;
}
When I then check the results it shows, that # and any other character is not matched. Whats happening here ? The regex works fine for any other conditions such as:
bool digits = contains("\\d", password);
bool lowerCase = contains("[a-z]", password);
bool upperCase = contains("[A-Z]", password);
I am looking forward to your help.

Try to scape with backslash \ the characters you have found problems

XslCompiledTransform.Transform replaces """ with real quotes

Doing something like this:
using (XmlWriter myMamlHelpWriter = XmlWriter.Create(myFileStream, XmlHelpExToMamlXslTransform.OutputSettings))
{
XmlHelpExToMamlXslTransform.Transform(myMsHelpExTopicFilePath, null, myMamlHelpWriter);
}
where
private static XslCompiledTransform XmlHelpExToMamlXslTransform
{
get
{
if (fMsHelpExToMamlXslTransform == null)
{
// Create the XslCompiledTransform and load the stylesheet.
fMsHelpExToMamlXslTransform = new XslCompiledTransform();
using (Stream myStream = typeof(XmlHelpBuilder).Assembly.GetManifestResourceStream(
typeof(XmlHelpBuilder),
MamlXmlTopicConsts.cMsHelpExToMamlTransformationResourceName))
{
XmlTextReader myReader = new XmlTextReader(myStream);
fMsHelpExToMamlXslTransform.Load(myReader, null, null);
}
}
return fMsHelpExToMamlXslTransform;
}
}
And every time the string """ is replaced with real quotes in the result file.
Cannot understand why this happens...

The reason is that in the XSLT's internal representation, " is exactly the same characer as ". They both represent the ascii code point 0x34. It would seem that when the XslCompiledTransform produces its output, it uses " where it's legal to do so. I would imagine that it would still output " inside an attribute value.
Is it a problem for you that " is produced as " in the output?
I just ran the following XSLT in Visual Studio using an arbitrary input file:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/*">
<xml>
<xsl:variable name="chars">"&apos;<>&</xsl:variable>
<node a='{$chars}' b="{$chars}">
<xsl:value-of select="$chars"/>
</node>
</xml>
</xsl:template>
</xsl:stylesheet>
The output was:
<xml>
<node a=""'<>&" b=""'<>&">"'<>&</node>
</xml>
As you can see, even though all five characters were represented as entities originally, the apostrophies are produced as ' everywhere, and quotation marks are produced as " in text nodes. Furthermore, the a attribute which had ' delimiters uses " delimiters in the output. As I said, as far as the XSLT cares, a quotation mark is just a quotation mark, and an attribute is just an attribute. How those are produced in the output is up to the XSLT processor.
Edit: The root cause of this behavior appears to be the behavior of the XmlWriter class. It looks like the general suggestion for those who want more customized escaping is to extend the XmlTextWriter class. This page has an implementation that looks fairly promising:
public class KeepEntityXmlTextWriter : XmlTextWriter
{
private static readonly string[] ENTITY_SUBS = new string[] { "&apos;", """ };
private static readonly char[] REPLACE_CHARS = new char[] { '\'', '"' };
public KeepEntityXmlTextWriter(string filename) : base(filename, null) { ; }
private void WriteStringWithReplace(string text)
{
string[] textSegments = text.Split(KeepEntityXmlTextWriter.REPLACE_CHARS);
if (textSegments.Length > 1)
{
for (int pos = -1, i = 0; i < textSegments.Length; ++i)
{
base.WriteString(textSegments[i]);
pos += textSegments[i].Length + 1;
// Assertion: Replace the following if-else when the number of
// replacement characters and substitute entities has grown
// greater than 2.
Debug.Assert(2 == KeepEntityXmlTextWriter.REPLACE_CHARS.Length);
if (pos != text.Length)
{
if (text[pos] == KeepEntityXmlTextWriter.REPLACE_CHARS[0])
base.WriteRaw(KeepEntityXmlTextWriter.ENTITY_SUBS[0]);
else
base.WriteRaw(KeepEntityXmlTextWriter.ENTITY_SUBS[1]);
}
}
}
else base.WriteString(text);
}
public override void WriteString( string text)
{
this.WriteStringWithReplace(text);
}
}
On the other hand, the MSDN documentation recommends using XmlWriter.Create() rather than instantiating XmlTextWriters directly.
In the .NET Framework 2.0 release, the recommended practice is to create XmlWriter instances using the XmlWriter.Create method and the XmlWriterSettings class. This allows you to take full advantage of all the new features introduced in this release. For more information, see Creating XML Writers.
One way around that would be to use the same logic as above, but put it in a class that wraps an XmlWriter. This page has a ready-made implementation of an XmlWrappingWriter, that you can modify as needed.
To use the above code with the XmlWrappingWriter, you would subclass the wrapping writer, like this:
public class KeepEntityWrapper : XmlWrappingWriter
{
public KeepEntityWrapper(XmlWriter baseWriter)
: base(baseWriter)
{
}
private static readonly string[] ENTITY_SUBS = new string[] { "&apos;", """ };
private static readonly char[] REPLACE_CHARS = new char[] { '\'', '"' };
private void WriteStringWithReplace(string text)
{
string[] textSegments = text.Split(REPLACE_CHARS);
if (textSegments.Length > 1)
{
for (int pos = -1, i = 0; i < textSegments.Length; ++i)
{
base.WriteString(textSegments[i]);
pos += textSegments[i].Length + 1;
// Assertion: Replace the following if-else when the number of
// replacement characters and substitute entities has grown
// greater than 2.
Debug.Assert(2 == REPLACE_CHARS.Length);
if (pos != text.Length)
{
if (text[pos] == REPLACE_CHARS[0])
base.WriteRaw(ENTITY_SUBS[0]);
else
base.WriteRaw(ENTITY_SUBS[1]);
}
}
}
else base.WriteString(text);
}
public override void WriteString(string text)
{
this.WriteStringWithReplace(text);
}
}
Note this essentially the same code as the KeepEntityXmlTextWriter, but using XmlWrappingWriter as the base class and with a different constructor.
I don't recognize the Guard that the XmlWrappingWriter code is using in two places, but given that you'll be consuming the code yourself, it should be pretty safe to delete the lines like this. They just ensure that a null value isn't passed to the constructor or the (in the above case inaccessible) BaseWriter property:
Guard.ArgumentNotNull(baseWriter, "baseWriter");
To create an instance of the XmlWrappingWriter, you would create an XmlWriter however you need to, and then use:
KeepEntityWrapper wrap = new KeepEntityWrapper(writer);
And then you'd use this wrap variable as the XmlWriter you pass to your XSL transform.

The XSLT processor doesn't know whether a character was represented by a character entity or not. This is because the XML parser substitutes any character entity with its code-value.
Therefore, the XSLT processor would see exactly the same character, regardless whether it was represented as " or as " or as " or as ".
What you want can be achieved in XSLT 2.0 by using the so called "character maps".

Here is the trick you wanted:
replace all & with &
perform XSLT
replace all & with &

How do I check if a filename matches a wildcard pattern

I've got a wildcard pattern, perhaps "*.txt" or "POS??.dat".
I also have list of filenames in memory that I need to compare to that pattern.
How would I do that, keeping in mind I need exactly the same semantics that IO.DirectoryInfo.GetFiles(pattern) uses.
EDIT: Blindly translating this into a regex will NOT work.

I have a complete answer in code for you that's 95% like FindFiles(string).
The 5% that isn't there is the short names/long names behavior in the second note on the MSDN documentation for this function.
If you would still like to get that behavior, you'll have to complete a computation of the short name of each string you have in the input array, and then add the long name to the collection of matches if either the long or short name matches the pattern.
Here is the code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace FindFilesRegEx
{
class Program
{
static void Main(string[] args)
{
string[] names = { "hello.t", "HelLo.tx", "HeLLo.txt", "HeLLo.txtsjfhs", "HeLLo.tx.sdj", "hAlLo20984.txt" };
string[] matches;
matches = FindFilesEmulator("hello.tx", names);
matches = FindFilesEmulator("H*o*.???", names);
matches = FindFilesEmulator("hello.txt", names);
matches = FindFilesEmulator("lskfjd30", names);
}
public string[] FindFilesEmulator(string pattern, string[] names)
{
List<string> matches = new List<string>();
Regex regex = FindFilesPatternToRegex.Convert(pattern);
foreach (string s in names)
{
if (regex.IsMatch(s))
{
matches.Add(s);
}
}
return matches.ToArray();
}
internal static class FindFilesPatternToRegex
{
private static Regex HasQuestionMarkRegEx = new Regex(#"\?", RegexOptions.Compiled);
private static Regex IllegalCharactersRegex = new Regex("[" + #"\/:<>|" + "\"]", RegexOptions.Compiled);
private static Regex CatchExtentionRegex = new Regex(#"^\s*.+\.([^\.]+)\s*$", RegexOptions.Compiled);
private static string NonDotCharacters = #"[^.]*";
public static Regex Convert(string pattern)
{
if (pattern == null)
{
throw new ArgumentNullException();
}
pattern = pattern.Trim();
if (pattern.Length == 0)
{
throw new ArgumentException("Pattern is empty.");
}
if(IllegalCharactersRegex.IsMatch(pattern))
{
throw new ArgumentException("Pattern contains illegal characters.");
}
bool hasExtension = CatchExtentionRegex.IsMatch(pattern);
bool matchExact = false;
if (HasQuestionMarkRegEx.IsMatch(pattern))
{
matchExact = true;
}
else if(hasExtension)
{
matchExact = CatchExtentionRegex.Match(pattern).Groups[1].Length != 3;
}
string regexString = Regex.Escape(pattern);
regexString = "^" + Regex.Replace(regexString, #"\\\*", ".*");
regexString = Regex.Replace(regexString, #"\\\?", ".");
if(!matchExact && hasExtension)
{
regexString += NonDotCharacters;
}
regexString += "$";
Regex regex = new Regex(regexString, RegexOptions.Compiled | RegexOptions.IgnoreCase);
return regex;
}
}
}
}

You can simply do this. You do not need regular expressions.
using Microsoft.VisualBasic.CompilerServices;
if (Operators.LikeString("pos123.txt", "pos?23.*", CompareMethod.Text))
{
Console.WriteLine("Filename matches pattern");
}
Or, in VB.Net,
If "pos123.txt" Like "pos?23.*" Then
Console.WriteLine("Filename matches pattern")
End If
In c# you could simulate this with an extension method. It wouldn't be exactly like VB Like, but it would be like...very cool.

You could translate the wildcards into a regular expression:
*.txt -> ^.+\.txt$
POS??.dat _> ^POS..\.dat$
Use the Regex.Escape method to escape the characters that are not wildcars into literal strings for the pattern (e.g. converting ".txt" to "\.txt").
The wildcard * translates into .+, and ? translates into .
Put ^ at the beginning of the pattern to match the beginning of the string, and $ at the end to match the end of the string.
Now you can use the Regex.IsMatch method to check if a file name matches the pattern.

Just call the Windows API function PathMatchSpecExW().
[Flags]
public enum MatchPatternFlags : uint
{
Normal = 0x00000000, // PMSF_NORMAL
Multiple = 0x00000001, // PMSF_MULTIPLE
DontStripSpaces = 0x00010000 // PMSF_DONT_STRIP_SPACES
}
class FileName
{
[DllImport("Shlwapi.dll", SetLastError = false)]
static extern int PathMatchSpecExW([MarshalAs(UnmanagedType.LPWStr)] string file,
[MarshalAs(UnmanagedType.LPWStr)] string spec,
MatchPatternFlags flags);
/*******************************************************************************
* Function: MatchPattern
*
* Description: Matches a file name against one or more file name patterns.
*
* Arguments: file - File name to check
* spec - Name pattern(s) to search foe
* flags - Flags to modify search condition (MatchPatternFlags)
*
* Return value: Returns true if name matches the pattern.
*******************************************************************************/
public static bool MatchPattern(string file, string spec, MatchPatternFlags flags)
{
if (String.IsNullOrEmpty(file))
return false;
if (String.IsNullOrEmpty(spec))
return true;
int result = PathMatchSpecExW(file, spec, flags);
return (result == 0);
}
}

Some kind of regex/glob is the way to go, but there are some subtleties; your question indicates you want identical semantics to IO.DirectoryInfo.GetFiles. That could be a challenge, because of the special cases involving 8.3 vs. long file names and the like. The whole story is on MSDN.
If you don't need an exact behavioral match, there are a couple of good SO questions:
glob pattern matching in .NET
How to implement glob in C#

For anyone who comes across this question now that it is years later, I found over at the MSDN social boards that the GetFiles() method will accept * and ? wildcard characters in the searchPattern parameter. (At least in .Net 3.5, 4.0, and 4.5)
Directory.GetFiles(string path, string searchPattern)
http://msdn.microsoft.com/en-us/library/wz42302f.aspx

Plz try the below code.
static void Main(string[] args)
{
string _wildCardPattern = "*.txt";
List<string> _fileNames = new List<string>();
_fileNames.Add("text_file.txt");
_fileNames.Add("csv_file.csv");
Console.WriteLine("\nFilenames that matches [{0}] pattern are : ", _wildCardPattern);
foreach (string _fileName in _fileNames)
{
CustomWildCardPattern _patetrn = new CustomWildCardPattern(_wildCardPattern);
if (_patetrn.IsMatch(_fileName))
{
Console.WriteLine("{0}", _fileName);
}
}
}
public class CustomWildCardPattern : Regex
{
public CustomWildCardPattern(string wildCardPattern)
: base(WildcardPatternToRegex(wildCardPattern))
{
}
public CustomWildCardPattern(string wildcardPattern, RegexOptions regexOptions)
: base(WildcardPatternToRegex(wildcardPattern), regexOptions)
{
}
private static string WildcardPatternToRegex(string wildcardPattern)
{
string patternWithWildcards = "^" + Regex.Escape(wildcardPattern).Replace("\\*", ".*");
patternWithWildcards = patternWithWildcards.Replace("\\?", ".") + "$";
return patternWithWildcards;
}
}

For searching against a specific pattern, it might be worth using File Globbing which allows you to use search patterns like you would in a .gitignore file.
See here: https://learn.microsoft.com/en-us/dotnet/core/extensions/file-globbing
This allows you to add both inclusions & exclusions to your search.
Please see below the example code snippet from the Microsoft Source above:
Matcher matcher = new Matcher();
matcher.AddIncludePatterns(new[] { "*.txt" });
IEnumerable<string> matchingFiles = matcher.GetResultsInFullPath(filepath);

The use of RegexOptions.IgnoreCase will fix it.
public class WildcardPattern : Regex {
public WildcardPattern(string wildCardPattern)
: base(ConvertPatternToRegex(wildCardPattern), RegexOptions.IgnoreCase) {
}
public WildcardPattern(string wildcardPattern, RegexOptions regexOptions)
: base(ConvertPatternToRegex(wildcardPattern), regexOptions) {
}
private static string ConvertPatternToRegex(string wildcardPattern) {
string patternWithWildcards = Regex.Escape(wildcardPattern).Replace("\\*", ".*");
patternWithWildcards = string.Concat("^", patternWithWildcards.Replace("\\?", "."), "$");
return patternWithWildcards;
}
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Unicode Regex; Invalid XML characters - regex

I tried this in java and it works: private String filterContent(String content) { return content.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", ""); } Thank you Jeff.

Related

Angular Input Restriction Directive - Negating Regular Expressions

String replace with dictionary exception handling

C++ Qt RegExp does not match special characters like # or | or ^

XslCompiledTransform.Transform replaces """ with real quotes

How do I check if a filename matches a wildcard pattern

Categories

Resources