Parsing non-alphanumeric characters in QueryParser - c++

I am working on a former teammate's code that uses Lucene++ 3.0.3.
A comment in the code claims that QueryParser cannot handle "special characters", and the code works around this by replacing them with a space:
if (((*pos) >= L'A' && (*pos) <= L'Z') ||
((*pos) >= L'a' && (*pos) <= L'z') ||
... ||
(*pos == L'-'))
{
// do nothing, these are OK
} else {
// remaining characters are []{}*
(*pos) = L' ';
}
StandardAnalyzer is the Analyzer being used. (Thanks Mark)
I assume the "special characters" are for combining queries or some sort of wildcard processing, for want of a better term.
Is there a better function that can account for these characters within a query string?

You need to look at which Analyzer is used, because the Analyzer determines the Tokenizer, and the Tokenizer determines which characters are treated as special.
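Separately from the analyzer, Lucene's classic query syntax reserves a set of operator characters (+ - && || ! ( ) { } [ ] ^ " ~ * ? : \), and the usual way to pass them through the parser literally is to put a backslash in front of each one. Java Lucene exposes this as a static QueryParser.escape method; Lucene++ is a port and may offer an equivalent, but that is an assumption worth checking against its QueryParser header. The sketch below only illustrates the idea in JavaScript; escapeLuceneQuery and LUCENE_SPECIAL are hypothetical names, not library API.

```javascript
// Characters treated as operators by Lucene's classic query syntax.
// Assumption: the list follows the Lucene 3.x query syntax documentation.
const LUCENE_SPECIAL = /[+\-!(){}\[\]^"~*?:\\|&]/g;

// Hypothetical helper: prefix every operator character with a backslash
// so the query parser reads it as literal text.
function escapeLuceneQuery(text) {
  return text.replace(LUCENE_SPECIAL, '\\$&');
}

console.log(escapeLuceneQuery('array[1]*'));
```

Escaping only gets the characters past the query parser. An analyzer such as StandardAnalyzer may still drop punctuation when it tokenizes the text.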

Related

Ensure specific characters are in a string, regardless of position, using a regex in Golang

I am building a super simple function to ensure a password contains specific characters. Namely, the password should have the following:
One lowercase letter
One uppercase letter
One digit
One special character
No white space, #, or |
I thought regex would be the simplest way to go about doing this. But, I am having a hard time figuring out how to do this in Golang. Currently, I have a bunch of separate regex MatchString functions which I will combine to get the desired functionality. For example:
lowercaseMatch := regexp.MustCompile(`[a-z]`).MatchString
uppercaseMatch := regexp.MustCompile(`[A-Z]`).MatchString
digitMatch := regexp.MustCompile(`\d`).MatchString
specialMatch := regexp.MustCompile(`\W`).MatchString
badCharsMatch := regexp.MustCompile(`[\s#|]`).MatchString
if lowercaseMatch(pwd) &&
	uppercaseMatch(pwd) &&
	digitMatch(pwd) &&
	specialMatch(pwd) &&
	!badCharsMatch(pwd) {
	/* password OK */
} else {
	/* password BAD */
}
While this makes things pretty readable, I would prefer a more concise regex, but I don't know how to get regex to search for a single character of each of the above categories (regardless of position). Can someone point me in the right direction of how to achieve this? Additionally, if there is a better way to do this than regex, I am all ears.
Thanks!
Since Go's regexp package uses RE2 syntax, it does not support positive lookahead (?=regex), so I'm not sure there is a way to write a single regex that covers all cases.
Instead, you can use the unicode package:
import "unicode"

func verifyPassword(s string) bool {
	var hasNumber, hasUpperCase, hasLowercase, hasSpecial bool
	for _, c := range s {
		switch {
		case unicode.IsNumber(c):
			hasNumber = true
		case unicode.IsUpper(c):
			hasUpperCase = true
		case unicode.IsLower(c):
			hasLowercase = true
		case c == '#' || c == '|' || unicode.IsSpace(c):
			// forbidden characters (including white space) reject the password outright
			return false
		case unicode.IsPunct(c) || unicode.IsSymbol(c):
			hasSpecial = true
		}
	}
	return hasNumber && hasUpperCase && hasLowercase && hasSpecial
}
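For comparison, this is roughly what the single-regex version looks like in an engine that does support lookahead. The sketch uses JavaScript purely as an illustration; it cannot be ported to Go's regexp package, and the name passwordOk is hypothetical. It treats "special character" as anything that is neither a word character nor white space, which is an assumption about the intended rule.

```javascript
// One lookahead per required category, then a body that forbids
// white space, '#' and '|' anywhere in the string.
const passwordOk = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[^\w\s])[^\s#|]+$/;

console.log(passwordOk.test('Passw0rd!'));   // all four categories present
console.log(passwordOk.test('Pass w0rd!'));  // contains white space
```

Each lookahead scans from the start of the string without consuming input, which is why the order of the categories in the password does not matter.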

Regex should allow German Umlauts in C#

I am using following regular expression:
[RegularExpression(@"^[A-Za-z0-9äöüÄÖÜß]+(?:[\._-äöüÄÖÜß][A-Za-z0-9]+)*$", ErrorMessageResourceName = "Error_User_UsernameFormat", ErrorMessageResourceType = typeof(Properties.Resources))]
Now I want to extend it so that it allows German umlauts (äöüÄÖÜß).
The way you added German letters to your regex, it will only be possible to use German letters in the first word.
You need to put the letters into the last character class:
@"^[A-Za-z0-9äöüÄÖÜß]+(?:[._-][A-Za-z0-9äöüÄÖÜß]+)*$"
                                        ^^^^^^^
Also, note that _-ä creates a range inside a character class that matches a lot more than just a _, - and ä (and does not even match - as it is not present in the range).
Note that if you validate on the server side only, and want to match any Unicode letters, you may also consider using
@"^[\p{L}0-9]+(?:[._-][\p{L}0-9]+)*$"
Where \p{L} matches any Unicode letter. Another way to write [\p{L}0-9] would be [^\W_], but in .NET, it would also match all Unicode digits while 0-9 will only match ASCII digits.
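The range pitfall mentioned above is easy to check. The sketch below uses JavaScript only for illustration (character-class ranges behave the same way here as in .NET for these code points); the constant names are hypothetical.

```javascript
// "_-ä" inside a class is a range from U+005F (_) to U+00E4 (ä),
// so it covers a-z, '{', '|' and more, but not '-' (U+002D).
const accidentalRange = /^[._-ä]$/;

// With the hyphen last in the class it is a literal hyphen.
const literalHyphen = /^[._-]$/;

console.log(accidentalRange.test('-'), accidentalRange.test('x'));
console.log(literalHyphen.test('-'), literalHyphen.test('x'));
```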
Replace [A-Za-z0-9äöüÄÖÜß] with \w. In .NET, \w already matches umlauts (note that it also matches the underscore and other Unicode letters and digits).
This works better for me. I modified someone else's code that was posted on Stack Overflow, and it works well for German text.
I added the check (c >= 'Ä' && c <= 'ä'), and now it is closer to what I need. Not all German letters are covered; you need to add your own checks of the form (c >= 'Ö' && c <= 'ö') for any letters you are having an issue with. Keep in mind that each of these checks is a code point range, so it also admits every character that lies between its two endpoints.
public static string RemoveSpecialCharacters(this string str)
{
    StringBuilder sb = new StringBuilder();
    foreach (char c in str)
    {
        // keep digits, ASCII letters, the umlaut ranges, '.' and ' '
        if ((c >= '0' && c <= '9') || (c >= 'Ö' && c <= 'ö') || (c >= 'Ü' && c <= 'ü') || (c >= 'Ä' && c <= 'ä') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == ' ')
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}

How to write a regular expression for a text field that accepts all characters except a comma (,) and rejects white space at both ends

How do I write a regular expression for a text field that accepts all characters except a comma (,) and does not accept white space at either end? I have tried
[^,][\B ]
but it does not work.
For example, 'product generic no' should be accepted, while 'product,generic,no' and ' product generic no ' should be rejected.
I suggest a solution without a regular expression. Since you said you are using JS, the function is in JavaScript:
function isItInvalid(str) {
var last = str.length - 1;
return (last < 2 ||
str[0] == ' ' ||
str[last] == ' ' ||
str.indexOf(',') != -1);
}
EDIT: Just made it a bit more readable. It also checks if the string is at least 3 chars.
Something like below:
/^\S[^,]*\S$/
Using a Perl regular expression
/^\S[^,]*\S$/
This should work from 2 characters up, but fails in the edge case where the string has only one non-comma character. To cover that too:
/^((\S[^,]*\S)|([^\s,]))$/
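One thing to watch in the patterns above: \S also matches a comma, so a comma in the first or last position slips through. A possible variant, sketched here under the assumption that commas must be rejected everywhere, uses [^\s,] for both end positions. The name noCommaTrimmed is hypothetical.

```javascript
// First character: neither white space nor comma.
// Optional tail: any non-comma characters, ending in a character that is
// neither white space nor comma. This also accepts one-character strings.
const noCommaTrimmed = /^[^\s,](?:[^,]*[^\s,])?$/;

console.log(noCommaTrimmed.test('product generic no'));
console.log(noCommaTrimmed.test(',ab'));
console.log(/^\S[^,]*\S$/.test(',ab')); // the earlier pattern lets this through
```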

Regex to validate passwords with characters restrictions

I need to validate a password with these rules:
6 to 20 characters
Must contain at least one digit;
Must contain at least one letter (case insensitive);
Can contain the following characters: ! @ # $ % & *
The following expression matches all but the last requirement. What can I do with the last one?
((?=.*\d)(?=.*[A-z]).{6,20})
I'm not completely sure I have this right, but since your last requirement is "Can contain the following characters: !@#$%&*" I am assuming that other special characters are not allowed. In other words, the only allowed characters are letters, digits, and the special characters !@#$%&*.
If this is the correct interpretation, the following regex should work:
^((?=.*\d)(?=.*[a-zA-Z])[a-zA-Z0-9!@#$%&*]{6,20})$
Note that I changed your character class [A-z] to [a-zA-Z], because [A-z] will also include the following characters: [\]^_`
I also added beginning and end of string anchors to make sure you don't get a partial match.
^(?=.*\d)(?=.*[a-zA-Z])[a-zA-Z0-9!@#$%&*]{6,20}$
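A quick way to sanity-check the pattern and the [A-z] remark, sketched in JavaScript on the assumption that the allowed specials are ! @ # $ % & * and nothing else. The name passwordPattern is hypothetical.

```javascript
// Two lookaheads enforce "at least one digit" and "at least one letter";
// the character class plus {6,20} enforces the allowed set and the length.
const passwordPattern = /^(?=.*\d)(?=.*[a-zA-Z])[a-zA-Z0-9!@#$%&*]{6,20}$/;

console.log(passwordPattern.test('abc123'));   // letters and digits, 6 chars
console.log(passwordPattern.test('abc123^'));  // '^' is not in the allowed set

// [A-z] spans the ASCII gap between 'Z' and 'a', so it also matches '_'.
console.log(/[A-z]/.test('_'), /[a-zA-Z]/.test('_'));
```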
Regex could be:-
^(?=.*\d)(?=.*[a-zA-Z])[a-zA-Z0-9!@#$%&*]{6,20}$
How about this in JavaScript:
function checkPwd(str) {
    if (str.length < 6) {
        return("too_short");
    } else if (str.length > 20) {
        return("too_long");
    } else if (str.search(/\d/) == -1) {
        return("no_num");
    } else if (str.search(/[a-zA-Z]/) == -1) {
        return("no_letter");
    } else if (str.search(/[^a-zA-Z0-9\!\@\#\$\%\^\&\*\(\)\_\+]/) != -1) {
        return("bad_char");
    }
    return("ok");
}

Regular expression to match one or more characters of every type?

There are 3 types of characters: A-Z, a-z and 0-9.
How do I write a regular expression that matches words containing at least one character of each of the three types?
For example:
Match: abAcc88, Ua8za8, 88aA
No match: abc, 118, aa7, xxZZ, XYZ111
This boost::regex re("^[A-Za-z0-9]+$"); doesn't work.
Thanks
At least IMO, trying to do this all with one regex is a poor idea. Though it's possible to make it work, you end up with an unreadable mess. The intent isn't apparent at all.
IMO, you'd be a lot better off expressing the logic more directly (though using a regex or two in the process won't hurt):
boost::regex lower("[a-z]");
boost::regex upper("[A-Z]");
boost::regex digit("[0-9]");
// s is the std::string being tested
if (boost::regex_search(s, lower) && boost::regex_search(s, upper) && boost::regex_search(s, digit)) {
    // it passes
} else {
    // it fails
}
It takes little more than a glance for anybody with even the most minimal exposure to REs to figure out what this is doing (and even with no exposure to REs, it probably doesn't take really massive brilliance to figure out that a-z means "the characters from a to z").
Assuming you're testing each word separately:
boost::regex re("(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])");
No need for anchors.
Actually, in case boost doesn't support lookarounds:
boost::regex re(".*[a-z].*([A-Z].*[0-9]|[0-9].*[A-Z])|.*[A-Z].*([a-z].*[0-9]|[0-9].*[a-z])|.*[0-9].*([a-z].*[A-Z]|[A-Z].*[a-z])");
This is every combination, as @Bill has pointed out.
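Both patterns can be tried against the examples from the question. The sketch below uses JavaScript only to exercise the regexes (they are searched unanchored, as in the answer); the variable names are hypothetical. Neither pattern verifies that the word consists solely of letters and digits, so that would need a separate check such as ^[A-Za-z0-9]+$.

```javascript
const withLookahead = /(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])/;
const withAlternation = /.*[a-z].*([A-Z].*[0-9]|[0-9].*[A-Z])|.*[A-Z].*([a-z].*[0-9]|[0-9].*[a-z])|.*[0-9].*([a-z].*[A-Z]|[A-Z].*[a-z])/;

// Examples taken from the question.
const shouldMatch = ['abAcc88', 'Ua8za8', '88aA'];
const shouldNotMatch = ['abc', '118', 'aa7', 'xxZZ', 'XYZ111'];

for (const word of shouldMatch.concat(shouldNotMatch)) {
  console.log(word, withLookahead.test(word), withAlternation.test(word));
}
```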
(\w*[a-z]\w*[A-Z]\w*[0-9]\w*)|(\w*[a-z]\w*[0-9]\w*[A-Z]\w*)|(\w*[A-Z]\w*[a-z]\w*[0-9]\w*)|(\w*[A-Z]\w*[0-9]\w*[a-z]\w*)|(\w*[0-9]\w*[A-Z]\w*[a-z]\w*)|(\w*[0-9]\w*[a-z]\w*[A-Z]\w*)
l = lower, U = upper, N = number
1. `(\w*[a-z]\w*[A-Z]\w*[0-9]\w*)` Match words __l__U___N___
2. `(\w*[a-z]\w*[0-9]\w*[A-Z]\w*)` Match words __l__N___U___
3. `(\w*[A-Z]\w*[a-z]\w*[0-9]\w*)` Match words __U__l___N___
4. `(\w*[A-Z]\w*[0-9]\w*[a-z]\w*)` Match words __U__N___l___
5. `(\w*[0-9]\w*[A-Z]\w*[a-z]\w*)` Match words __N__U___l___
6. `(\w*[0-9]\w*[a-z]\w*[A-Z]\w*)` Match words __N__l___U___
Well, if we're gonna go the non-regex route, then why not take it all the way ;-)
const char* c = "abAcc88";
char b = 0b000;
for (; *c; c++) b |= 48 <= *c && *c <= 57 ? 0b001 :
(65 <= *c && *c <= 90 ? 0b010 :
(97 <= *c && *c <= 122 ? 0b100 :
0b000 ));
if (b == 0b111)
{
std::cout << "pass" << std::endl;
}
(It's not readable, etc.; I'm kidding.)