Powershell find non printing characters [duplicate] - regex

I would appreciate your help on this, since I do not know which range of characters to use, or whether there is a character class like the [[:cntrl:]] that I have found in Ruby.
By non-printable, I mean all characters that are not shown in, e.g., the output when one prints the input string; I want to delete them. Please note, I am looking for a C# regex; I do not have a problem with my code.

You may remove all control and other non-printable characters with
s = Regex.Replace(s, @"\p{C}+", string.Empty);
The \p{C} Unicode category class matches all control characters, even those outside the ASCII table because in .NET, Unicode category classes are Unicode-aware by default.
Breaking it down into subcategories
To only match basic control characters you may use \p{Cc}+, see 65 chars in the Other, Control Unicode category. It is equal to a [\u0000-\u001F\u007F-\u009F]+ regex.
To only match 161 other format chars including the well-known soft hyphen (\u00AD), zero-width space (\u200B), zero-width non-joiner (\u200C), zero-width joiner (\u200D), left-to-right mark (\u200E) and right-to-left mark (\u200F) use \p{Cf}+. The equivalent including astral plane code points is a (?:[\xAD\u0600-\u0605\u061C\u06DD\u070F\u08E2\u180E\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]|\uD804[\uDCBD\uDCCD]|\uD80D[\uDC30-\uDC38]|\uD82F[\uDCA0-\uDCA3]|\uD834[\uDD73-\uDD7A]|\uDB40[\uDC01\uDC20-\uDC7F])+ regex.
To match 137,468 Other, Private Use code points you may use \p{Co}+, or its equivalent including astral plane code points, (?:[\uE000-\uF8FF]|[\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uDBBF\uDBFF][\uDC00-\uDFFD])+.
To match the 2,048 Other, Surrogate code units (the halves of the surrogate pairs that .NET strings use to encode astral characters, including many emojis), you may use \p{Cs}+, or a [\uD800-\uDFFF]+ regex.
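For readers outside .NET: the same category-based filtering can be sketched in Python. This is only an illustration, not part of the answer above. Python's built-in re module does not support \p{...}, so the sketch uses unicodedata.category instead; the helper name strip_by_category is made up for this example.

```python
import unicodedata

def strip_by_category(text: str, prefixes: tuple = ("C",)) -> str:
    """Drop every character whose Unicode general category starts with one of `prefixes`.

    ("C",) removes all "Other" categories (Cc, Cf, Co, Cs, Cn);
    ("Cc",) removes control characters only.
    """
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(prefixes)
    )

# zero-width space (Cf), BEL (Cc) and soft hyphen (Cf) hidden in "abcd"
sample = "a\u200bb\x07c\u00add"
print(strip_by_category(sample))                   # abcd
print(ascii(strip_by_category(sample, ("Cc",))))   # only the BEL is removed
```

Passing ("Cc",) mirrors the narrower \p{Cc} pattern, so format characters such as the zero-width space survive.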

You can try:
string s = "Täkörgåsmrgås";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
Updated answer after comments:
Documentation about non-printable character:
https://en.wikipedia.org/wiki/Control_character
Char.IsControl Method:
https://msdn.microsoft.com/en-us/library/system.char.iscontrol.aspx
Maybe you can try:
string input; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

To remove all control and other non-printable characters
Regex.Replace(s, @"\p{C}+", String.Empty);
To remove the control characters only (if you don't want to remove the emojis 😎)
Regex.Replace(s, @"\p{Cc}+", String.Empty);

You can try this:
public static string TrimNonAscii(this string value)
{
    // Remove every run of characters outside printable ASCII (space through tilde).
    string pattern = "[^ -~]+";
    Regex reg_exp = new Regex(pattern);
    return reg_exp.Replace(value, "");
}

Related

Dart RegExp white spaces is not recognized

I'm trying to implement a regex pattern for username that allows English letters, Arabic letters, digits, dash and space.
The following pattern always returns no match if the input string has a space even though \s is included in the pattern
Pattern _usernamePattern = r'^[a-zA-Z0-9\u0621-\u064A\-\s]{3,30}$';
I also tried replacing \s with " " and \\s but the regex always returns no matches for any input that has a space in it.
Edit: It turns out that Flutter adds a Unicode "Right-To-Left Mark" or "Left-To-Right Mark" character when a text field contains a mix of LTR and RTL languages. This additional mark is an invisible Unicode character that gets added to the text, and the regex above was failing because of it. To resolve the issue, simply do a replaceAll for these characters. Read more here: https://github.com/flutter/flutter/issues/56514.
This is a fairly nasty problem and worth documenting in an answer here.
As documented in the source:
/// When LTR text is entered into an RTL field, or RTL text is entered into an
/// LTR field, [LRM](https://en.wikipedia.org/wiki/Left-to-right_mark) or
/// [RLM](https://en.wikipedia.org/wiki/Right-to-left_mark) characters will be
/// inserted alongside whitespace characters, respectively. This is to
/// eliminate ambiguous directionality in whitespace and ensure proper caret
/// placement. These characters will affect the length of the string and may
/// need to be parsed out when doing things like string comparison with other
/// text.
While this is well-intended it can cause problems when you work with mixed LTR/RTL text patterns (as it is the case here) and have to ensure exact field length, etc.
The suggested solution is to remove all left-right-marks:
void main() {
final String lrm = 'aaaa \u{200e}bbbb';
print('lrm: "$lrm" with length ${lrm.length}');
final String lrmFree = lrm.replaceAll(RegExp(r'\u{200e}', unicode: true), '');
print('lrmFree: "$lrmFree" with length ${lrmFree.length}');
}
Related: right-to-left (RTL) in flutter
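The Dart snippet above strips only the left-to-right mark (U+200E). As a rough sketch of the same idea covering both marks, here is a Python version; it is an illustration under assumptions, not the Flutter API. The username pattern is copied from the question, and the names MARKS and is_valid_username are invented for this example.

```python
import re

# Pattern from the question: English letters, Arabic letters, digits, dash, whitespace.
USERNAME = re.compile(r"[a-zA-Z0-9\u0621-\u064A\-\s]{3,30}")

# Left-to-right mark (U+200E) and right-to-left mark (U+200F).
MARKS = re.compile("[\u200e\u200f]")

def is_valid_username(raw: str) -> bool:
    """Strip the invisible direction marks, then validate what is left."""
    cleaned = MARKS.sub("", raw)
    return USERNAME.fullmatch(cleaned) is not None

with_mark = "abc \u200edef"
print(USERNAME.fullmatch(with_mark) is not None)  # False: the mark is not in the class
print(is_valid_username(with_mark))               # True once the mark is removed
```

Stripping before validating keeps the pattern itself unchanged, which may be preferable to adding the marks to the character class and letting them count toward the 3-30 length limit.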

Regex replacing special characters in a string

I have numerical values that contain special characters and I would like to replace those special characters with "x"
I already tried [^\w*], but it only works when there is one special character.
When there is more than one, as in 1234?12?, it won't capture the second special character. What am I doing wrong?
Here is something you could use. It removes all non-numeric characters. Good luck!
var str = "rt5121212?232?2*dse%e&323"
var pattern = /[^0-9]/g;
var sanitized = str.replace(pattern,'');
console.log(sanitized);
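The question actually asks to replace each special character with "x" rather than delete it. A minimal Python sketch of that variant follows; the function name mask_non_digits is made up here. The key point is that a substitution applied to every match (the g flag in JavaScript, the default behavior of re.sub in Python) handles more than one special character.

```python
import re

def mask_non_digits(value: str, mask: str = "x") -> str:
    """Replace every character that is not an ASCII digit with `mask`."""
    return re.sub(r"[^0-9]", mask, value)

print(mask_non_digits("1234?12?"))  # 1234x12x
```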

split text into words and exclude hyphens

I want to split a text into its individual words using regular expressions. The obvious solution would be to use the regex \\b, but unfortunately it also splits words at the hyphen.
So I am looking for an expression that does exactly the same as \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String [] b = s.split("\\b+");
for (int i = 0; i < b.length; i++){
System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
@Matmarbon's solution is already quite close, but it does not fit 100%; it gives me
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
Also, for anybody who needs this for another purpose (e.g. inserting something), this is closer to an equivalent of the \b solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
You can use this:
(?<!-)\\b(?!-)
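To see the first suggestion ([^\w\-]+) in action, here is a small Python sketch; the thread itself uses Java, and split_words is a name invented for this example. Note that, unlike splitting on \b, this approach drops the punctuation tokens entirely, and an empty string can appear at either end of the result, so it is filtered out.

```python
import re

def split_words(text: str) -> list:
    """Split on runs of characters that are neither word characters nor hyphens."""
    return [word for word in re.split(r"[^\w\-]+", text) if word]

s = ("This is my text! It uses some odd words like user-generated "
     "and need therefore a special regex.")
print(split_words(s))
```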

Regular expression for alphanumeric and underscores

Is there a regular expression which checks if a string contains only upper and lowercase letters, numbers, and underscores?
To match a string that contains only those characters (or an empty string), try
"^[a-zA-Z0-9_]*$"
This works for .NET regular expressions, and probably a lot of other languages as well.
Breaking it down:
^ : start of string
[ : beginning of character group
a-z : any lowercase letter
A-Z : any uppercase letter
0-9 : any digit
_ : underscore
] : end of character group
* : zero or more of the given characters
$ : end of string
If you don't want to allow empty strings, use + instead of *.
As others have pointed out, some regex languages have a shorthand form for [a-zA-Z0-9_]. In the .NET regex language, you can turn on ECMAScript behavior and use \w as a shorthand (yielding ^\w*$ or ^\w+$). Note that in other languages, and by default in .NET, \w is somewhat broader, and will match other sorts of Unicode characters as well (thanks to Jan for pointing this out). So if you're really intending to match only those characters, using the explicit (longer) form is probably best.
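The difference between a Unicode-aware \w and the explicit class is easy to check. The sketch below uses Python rather than .NET, so treat it as an analogy: in Python 3, \w on str patterns is Unicode-aware by default and the re.ASCII flag narrows it, roughly like RegexOptions.ECMAScript does in .NET.

```python
import re

unicode_w = re.compile(r"\w+")            # Unicode-aware by default in Python 3
ascii_w = re.compile(r"\w+", re.ASCII)    # \w restricted to [a-zA-Z0-9_]
explicit = re.compile(r"[a-zA-Z0-9_]+")   # the explicit form from the answer

def verdicts(candidate: str) -> tuple:
    """Return whether each pattern matches the whole candidate string."""
    return tuple(
        p.fullmatch(candidate) is not None
        for p in (unicode_w, ascii_w, explicit)
    )

for candidate in ("nut_cracker_12", "caf\u00e9", "has space"):
    print(repr(candidate), verdicts(candidate))
```

Only the accented word separates the three patterns: the Unicode-aware \w accepts it, while the ASCII-only and explicit forms reject it.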
There's a lot of verbosity in here, and I'm deeply against it, so, my conclusive answer would be:
/^\w+$/
\w is equivalent to [A-Za-z0-9_], which is pretty much what you want (unless we introduce Unicode to the mix).
Using the + quantifier you'll match one or more characters. If you want to accept an empty string too, use * instead.
You want to check that each character matches your requirements, which is why we use:
[A-Za-z0-9_]
And you can even use the shorthand version:
\w
Which is equivalent (in some regex flavors, so make sure you check before you use it). Then to indicate that the entire string must match, you use:
^
To indicate the string must start with that character, then use
$
To indicate the string must end with that character. Then use
\w+ or \w*
To indicate "1 or more", or "0 or more". Putting it all together, we have:
^\w*$
Although it's more verbose than \w, I personally appreciate the readability of the full POSIX character class names ( http://www.zytrax.com/tech/web/regex.htm#special ), so I'd say:
^[[:alnum:]_]+$
However, while the documentation at the above links states that \w will "Match any character in the range 0 - 9, A - Z and a - z (equivalent of POSIX [:alnum:])", I have not found this to be true. Not with grep -P anyway. You need to explicitly include the underscore if you use [:alnum:] but not if you use \w. You can't beat the following for short and sweet:
^\w+$
Along with readability, using the POSIX character classes (http://www.regular-expressions.info/posixbrackets.html) means that your regex can work on non ASCII strings, which the range based regexes won't do since they rely on the underlying ordering of the ASCII characters which may be different from other character sets and will therefore exclude some non-ASCII characters (letters such as œ) which you might want to capture.
Um...question: Does it need to have at least one character or no? Can it be an empty string?
^[A-Za-z0-9_]+$
Will do at least one upper or lower case alphanumeric or underscore. If it can be zero length, then just substitute the + for *:
^[A-Za-z0-9_]*$
If letters with diacritics need to be included (such as ç, c with cedilla), then use the word character class \w, which in Unicode-aware flavors matches the same characters as above plus accented letters:
^\w+$
Or
^\w*$
Use
^([A-Za-z]|[0-9]|_)+$
...if you want to be explicit, or:
^\w+$
...if you prefer concise (Perl syntax).
In computer science, an alphanumeric value often means that the first character is not a digit but a letter or an underscore. After that, the characters can be 0-9, A-Z, a-z, or underscore (_).
Here is how you would do that:
Tested under PHP:
$regex = '/^[A-Za-z_][A-Za-z\d_]*$/';
Or take
^[A-Za-z_][A-Za-z\d_]*$
and place it in your development language.
Use lookaheads to do the "at least one" stuff. Trust me, it's much easier.
Here's an example that would require 1-10 characters, containing at least one digit and one letter:
^(?=.*\d)(?=.*[A-Za-z])[A-Za-z0-9]{1,10}$
Note: I could have used \w, but then ECMA/Unicode considerations come into play, increasing the character coverage of the \w "word character".
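A quick way to convince yourself that the two lookaheads do what is claimed is to run the pattern against a few inputs. This is a Python sketch using the regex exactly as given above; the function name has_digit_and_letter is invented for the example.

```python
import re

# Same pattern as above: 1-10 alphanumerics, at least one digit and one letter.
PATTERN = re.compile(r"^(?=.*\d)(?=.*[A-Za-z])[A-Za-z0-9]{1,10}$")

def has_digit_and_letter(value: str) -> bool:
    return PATTERN.fullmatch(value) is not None

for value in ("abc123", "abcdef", "123456", "a1", "abc1234567890", "a_1"):
    print(value, has_digit_and_letter(value))
```

Each lookahead scans from the start of the string without consuming anything, so the two conditions are checked independently before the character class enforces the allowed set and the length.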
This works for me. I found this in the O'Reilly's "Mastering Regular Expressions":
/^\w+$/
Explanation:
^ asserts position at start of the string
\w+ matches any word character (equal to [a-zA-Z0-9_])
"+" Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the string
Verify yourself:
const regex = /^\w+$/;
const str = `nut_cracker_12`;
let m;
if ((m = regex.exec(str)) !== null) {
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Try these multi-lingual extensions I have made for string.
IsAlphaNumeric - The string must contain at least one alpha (letter in Unicode range, specified in charSet) and at least one number (specified in numSet). Also, the string should consist only of alpha and numbers.
IsAlpha - The string should contain at least one alpha (in the language charSet specified) and consist only of alpha.
IsNumeric - The string should contain at least one number (in the language numSet specified) and consist only of numbers.
The charSet/numSet range for the desired language can be specified. The Unicode ranges are available on Unicode Chart.
API:
public static bool IsAlphaNumeric(this string stringToTest)
{
    // English
    const string charSet = "a-zA-Z";
    const string numSet = @"0-9";
    // Greek
    //const string charSet = @"\u0388-\u03EF";
    //const string numSet = @"0-9";
    // Bengali
    //const string charSet = @"\u0985-\u09E3";
    //const string numSet = @"\u09E6-\u09EF";
    // Hindi
    //const string charSet = @"\u0905-\u0963";
    //const string numSet = @"\u0966-\u096F";
    return Regex.Match(stringToTest, @"^(?=[" + numSet + @"]*?[" + charSet + @"]+)(?=[" + charSet + @"]*?[" + numSet + @"]+)[" + charSet + numSet + @"]+$").Success;
}
public static bool IsNumeric(this string stringToTest)
{
    // English
    const string numSet = @"0-9";
    // Hindi
    //const string numSet = @"\u0966-\u096F";
    return Regex.Match(stringToTest, @"^[" + numSet + @"]+$").Success;
}
public static bool IsAlpha(this string stringToTest)
{
    // English
    const string charSet = "a-zA-Z";
    return Regex.Match(stringToTest, @"^[" + charSet + @"]+$").Success;
}
Usage:
// English
string test = "AASD121asf";
// Greek
//string test = "Ϡϛβ123";
// Bengali
//string test = "শর৩৮";
// Hindi
//string test = @"क़लम३७ख़";
bool isAlphaNum = test.IsAlphaNumeric();
The following regex matches alphanumeric characters and underscore:
^[a-zA-Z0-9_]+$
For example, in Perl:
#!/usr/bin/perl -w
my $arg1 = $ARGV[0];
# Check that the string contains *only* one or more alphanumeric chars or underscores
if ($arg1 !~ /^[a-zA-Z0-9_]+$/) {
print "Failed.\n";
} else {
print "Success.\n";
}
This should work in most of the cases.
/^[\d]*[a-z_][a-z\d_]*$/gi
And by most I mean,
abcd True
abcd12 True
ab12cd True
12abcd True
1234 False
Explanation
^ ... $ - anchor the pattern to the start and end of the string
[\d]* - match zero or more digits
[a-z_] - match a letter or an underscore
[a-z\d_]* - match zero or more letters, digits or underscores
/gi - match globally across the string and case-insensitively
For those of you looking for unicode alphanumeric matching, you might want to do something like:
^[\p{L}\p{Nd}_]+$
Further reading is at Unicode Regular Expressions (Unicode Consortium) and at Unicode Regular Expressions (Regular-Expressions.info).
For me the issue was that I wanted to distinguish between alpha, numeric and alphanumeric, so to ensure an alphanumeric string contains at least one alpha and at least one numeric, I used (note the outer group, so that ^ and $ apply to both alternatives):
^(?:([a-zA-Z_]{1,}\d{1,})+|(\d{1,}[a-zA-Z_]{1,})+)$
Here is the regex for what you want, with a quantifier to specify at least 1 character and no more than 255 characters:
^[a-zA-Z0-9_]{1,255}$
I believe you are not taking accented Latin and other Unicode characters into account in your matches.
For example, if you need to accept "ã" or "ü", "\w" won't work in flavors where it is ASCII-only.
You can, alternatively, use this approach:
^[A-ZÀ-Ýa-zà-ý0-9_]+$
^\w*$ will work for the below combinations:
1
123
1av
pRo
av1
For Java, where only alphanumeric characters (upper or lower case) and underscores are allowed:
^ asserts the start of the string.
[a-zA-Z0-9_]+ matches one or more alphanumeric characters or underscores.
$ asserts the end of the string.
public class RegExTest {
    public static void main(String[] args) {
        // Prints false: '#' is not in the allowed set.
        System.out.println("_C#".matches("^[a-zA-Z0-9_]+$"));
    }
}
To check the entire string and not allow empty strings, try
^[A-Za-z0-9_]+$
This works for me. You can try:
[\\p{Alnum}_]
Required Format
Allow these three:
0142171547295
014-2171547295
123abc
Don't allow other formats:
validatePnrAndTicketNumber(){
    let alphaNumericRegex=/^[a-zA-Z0-9]*$/;
    let numericRegex=/^[0-9]*$/;
    // [0-9]{3} rather than [1-9]{3}, so that a prefix such as "014" is accepted
    let numericdashRegex=/^(([0-9]{3})\-?([0-9]{10}))$/;
    this.currBookingRefValue = this.requestForm.controls["bookingReference"].value;
    if(this.currBookingRefValue.length == 14 && this.currBookingRefValue.match(numericdashRegex)){
        this.requestForm.controls["bookingReference"].setErrors({'pattern': false});
    }else if(this.currBookingRefValue.length ==6 && this.currBookingRefValue.match(alphaNumericRegex)){
        this.requestForm.controls["bookingReference"].setErrors({'pattern': false});
    }else if(this.currBookingRefValue.length ==13 && this.currBookingRefValue.match(numericRegex) ){
        this.requestForm.controls["bookingReference"].setErrors({'pattern': false});
    }else{
        this.requestForm.controls["bookingReference"].setErrors({'pattern': true});
    }
}
<input name="booking_reference" type="text" [class.input-not-empty]="bookingRef.value"
class="glyph-input form-control floating-label-input" id="bookings_bookingReference"
value="" maxlength="14" aria-required="true" role="textbox" #bookingRef
formControlName="bookingReference" (focus)="resetMessageField()" (blur)="validatePnrAndTicketNumber()"/>

Capturing a repeated group

I am attempting to parse a string like the following using a .NET regular expression:
H3Y5NC8E-TGA5B6SB-2NVAQ4E0
and return the following using Split:
H3Y5NC8E
TGA5B6SB
2NVAQ4E0
I validate each character against a specific character set (note that the letters 'I', 'O', 'U' & 'W' are absent), so using string.Split is not an option. The number of characters in each group can vary and the number of groups can also vary. I am using the following expression:
([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8}-?){3}
This will match exactly 3 groups of 8 characters each. Any more or less will fail the match.
This works insofar as it correctly matches the input. However, when I use the Split method to extract each character group, I just get the final group. RegexBuddy complains that I have repeated the capturing group itself and that I should put a capture group around the repeated group. However, none of my attempts to do this achieve the desired result. I have been trying expressions like this:
(([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?){4}
But this does not work.
Since I generate the regex in code, I could just expand it out by the number of groups, but I was hoping for a more elegant solution.
Please note that the character set does not include the entire alphabet. It is part of a product activation system. As such, any characters that can be accidentally interpreted as numbers or other characters are removed. e.g. The letters 'I', 'O', 'U' & 'W' are not in the character set.
The hyphens are optional since a user does not need to type them in, but they can be there if the user has done a copy & paste.
BTW, you can replace the [ABCDEFGHJKLMNPQRSTVXYZ0123456789] character class with a more readable one that uses .NET character class subtraction:
[A-Z\d-[IOUW]]
If you just want to match 3 groups like that, why don't you use this pattern 3 times in your regex and just use the captured groups 1, 2 and 3 to form the new string?
([A-Z\d-[IOUW]]{8})-([A-Z\d-[IOUW]]{8})-([A-Z\d-[IOUW]]{8})
In PHP I would return (I don't know .NET)
return "$1 $2 $3";
I have discovered the answer I was after. Here is my working code:
static void Main(string[] args)
{
    string pattern = @"^\s*((?<group>[ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?){3}\s*$";
    string input = "H3Y5NC8E-TGA5B6SB-2NVAQ4E0";
    Regex re = new Regex(pattern);
    Match m = re.Match(input);
    if (m.Success)
        // Captures holds every value the repeated group matched, not just the last one.
        foreach (Capture c in m.Groups["group"].Captures)
            Console.WriteLine(c.Value);
}
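The Captures collection used above is a .NET feature. In engines that keep only the last value of a repeated group (Python's built-in re is one of them), a common workaround is to validate the whole string first and then extract the blocks in a second pass. The following is a sketch of that two-step approach, not the poster's code; the names CHARSET, WHOLE, BLOCK and parse_key are invented for this example.

```python
import re
from typing import List, Optional

# Allowed characters from the question: no I, O, U or W.
CHARSET = "ABCDEFGHJKLMNPQRSTVXYZ0123456789"

# Step 1 pattern: exactly three blocks of eight, hyphens optional.
WHOLE = re.compile(rf"\s*(?:[{CHARSET}]{{8}}-?){{3}}\s*")
# Step 2 pattern: a single block.
BLOCK = re.compile(rf"[{CHARSET}]{{8}}")

def parse_key(text: str) -> Optional[List[str]]:
    """Return the three blocks, or None if the key is malformed."""
    if WHOLE.fullmatch(text) is None:
        return None
    return BLOCK.findall(text)

print(parse_key("H3Y5NC8E-TGA5B6SB-2NVAQ4E0"))
print(parse_key("H3Y5NC8E-TGA5B6SB"))  # None: only two blocks
```

Because the block size is fixed at eight, the second pass also works when the user omits the hyphens.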
After reviewing your question and the answers given, I came up with this:
RegexOptions options = RegexOptions.None;
Regex regex = new Regex(@"([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})", options);
string input = @"H3Y5NC8E-TGA5B6SB-2NVAQ4E0";
MatchCollection matches = regex.Matches(input);
for (int i = 0; i != matches.Count; ++i)
{
    string match = matches[i].Value;
}
Since the "-" is optional, you don't need to include it. I am not sure what you were using the {4} at the end for. This will find the matches based on what you want; then, using the MatchCollection, you can access each match to rebuild the string.
Why use Regex? If the groups are always split by a -, can't you use Split()?
Sorry if this isn't what you intended, but if your string always has the hyphen separating the groups, then instead of using regex couldn't you use the String.Split() method?
Dim stringArray As Array = someString.Split("-")
What are the defining characteristics of a valid block? We'd need to know that in order to really be helpful.
My generic suggestion: validate the charset in a first step, then split and parse in a separate method based on what you expect. If this is in a web site/app, then you can use the ASP Regex validation on the front end and break it up on the back end.
If you're just checking the value of the group, with group(i).value, then you will only get the last one. However, if you want to enumerate over all the times that group was captured, use group(2).captures(i).value, as shown below.
system.text.RegularExpressions.Regex.Match("H3Y5NC8E-TGA5B6SB-2NVAQ4E0","(([ABCDEFGHJKLMNPQRSTVXYZ0123456789]+)-?)*").Groups(2).Captures(i).Value
Mike,
You can use character set of your choice inside character group. All you need is to add "+" modifier to capture all groups. See my previous answer, just change [A-Z0-9] to whatever you need (i.e. [ABCDEFGHJKLMNPQRSTVXYZ0123456789])
You can use this pattern:
Regex.Split("H3Y5NC8E-TGA5B6SB-2NVAQ4E0", "([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?")
But you will need to filter out empty strings from resulting array.
Citation from MSDN:
If multiple matches are adjacent to one another, an empty string is inserted into the array.
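The empty strings are easy to see in a quick experiment. Python's re.split behaves similarly to Regex.Split in this respect (text captured by a group is included in the result, and adjacent matches produce empty strings between them), so the following sketch uses Python as a stand-in; the variable names are made up.

```python
import re

pattern = r"([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?"
parts = re.split(pattern, "H3Y5NC8E-TGA5B6SB-2NVAQ4E0")
print(parts)   # captured blocks interleaved with empty strings

# Filter out the empty strings to keep only the blocks.
blocks = [part for part in parts if part]
print(blocks)
```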