Regular expression to restrict Extended ASCII character set - regex

I have a multilingual application that creates XML files, but Extended ASCII characters from 168 to 254 (¿⌐¬½¼¡«»░▓│┤╡╢╖╕╣║╗╜╛┐└┴┬├) are not allowed in XML tags, so I would like to prevent users from entering them.
I tried restricting everything except alphanumerics, underscore and dash, but that also rejects accented characters such as ó, ç and õ, which are part of Extended ASCII. Here is the regex: "^[a-zA-Z0-9\s.\-_]+$"
A second option was to create a string of all symbols from 168 to 254 and check whether the input contains any of them, but I am not sure if that is a reliable and accurate solution.
What is the best way to filter input for the Extended ASCII character set?
Link to Extended ASCII character set table

Instead, you can use a range in a negated character class to exclude a specific range of characters by their hex codes:
[^\xA8-\xFE]
The above regex matches any character except those in the given range. \xA8 and \xFE are the hex codes for the range you posted, [168, 254].
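One caveat is worth checking before relying on this range. In engines that operate on Unicode strings, \xNN denotes the code point U+00NN, not byte NN of the linked code page table. The following is a minimal Python sketch of that difference, assuming the linked table is code page 437; the helper name `code_points` is made up for illustration.

```python
import re

# Same range as in the pattern above, without the negation.
RANGE = re.compile(r"[\xA8-\xFE]")

def code_points(text):
    """Return the Unicode code point of every character as a hex string."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Bytes 168 and 176 of code page 437, decoded to Unicode characters.
cp437_chars = bytes([168, 176]).decode("cp437")

print(cp437_chars, code_points(cp437_chars))
print(bool(RANGE.search("ó")))  # accented letter, code point U+00F3
print(bool(RANGE.search("░")))  # shade character, code point U+2591
```

In this sketch the accented letter falls inside the range while the shade character does not, so on Unicode input the range may reject the very letters the question wants to keep.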

Although @Oded's suggestion was applicable, I used the following solution:
Dim filteredInput as string
Private const XML_RESTRICTED_CHARACTERS as string ="[☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼#$%&()*+,-./:;<=>?#[\]^_`¢£¥₧ƒªº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀αßΓπΣσµτΦΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■""}{]"
filteredInput =Regex.Replace(strInput.ToLower(), XML_RESTRICTED_CHARACTERS, "")

"Second option was to create a string of all symbols from 168 to 254 and check if string contains any of them but not sure if it is reliable and accurate solution."
Yes, this is a reliable and accurate solution. It is also more lightweight than regular expressions.
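A minimal sketch of that lookup approach in Python, assuming the symbols are the ones code page 437 places at positions 168 to 254. The names `FORBIDDEN`, `find_forbidden` and `is_allowed` are illustrative, not part of any library.

```python
# Hypothetical blacklist: every character CP437 places at bytes 168-254.
FORBIDDEN = frozenset(bytes(range(168, 255)).decode("cp437"))

def find_forbidden(text):
    """Return the forbidden characters present in text, sorted by code point."""
    return sorted(set(text) & FORBIDDEN)

def is_allowed(text):
    """True when text contains none of the forbidden characters."""
    return FORBIDDEN.isdisjoint(text)

print(is_allowed("coração"))    # accented letters outside the blacklist
print(is_allowed("tag│name"))   # contains a box-drawing character
print(find_forbidden("a░b│"))
```

Note that this range also contains a few letters, for example ß, so it may be worth reviewing the blacklist against the languages the application supports.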

Related

Special characters in EBS Search Strings?

I am working on the EBS configuration side of the SAP ERP system where I am trying to define Search Strings for the MT940 format (as per SAP SPRO activity "Define Search String for Electronic Bank Statement", for instance see this blog post).
I am trying to create a search pattern that is able to identify special characters in the MT940 format, for example ?/!/>, etc.
My search pattern: \C*######\C*
The text that I use to identify the mapping:
:86:306?00CCY RECD?20/BI/**?651234?**/BO/DE652004ED
In this case, I defined:
\C* to look for special characters; these will be skipped based on the mapping.
# repeated six times to look for a sequence of 6 digits.
My results from the test:
1 651234
2 652004
3 651234
4 652004
The result I want to achieve with the search pattern is: 651234
I understand that the repetition is caused by the * symbol. However, if I leave that symbol out, the search pattern ends in an error.
My problem is that I do not understand how to make SAP search strings identify special characters. Furthermore, how can I identify whether a character is a letter?
Below is the Search String definition from the SAP documentation of SPRO activity "Define Search String for Electronic Bank Statement":
String for searches in text. A search string consists of normal characters (that is, letters and digits) and other characters:
| Or
( ) Grouping
+ Repeats the previous character once or several times
* "Zero" or repeats the previous character several times
? Any individual character you want
# Any of the digits 0 to 9
^ Start of a line
$ End of a line
\ Escape symbol
Examples:
The search string "ab" fits each position in a character string in which the letter "b" follows the letter "a".
The search string "(A+|B)+C" fits "AC", "BC", "AAAAAC" or "ABAAC".
"(A+|B+)C" fits "AC", "BC" and "AAAAAC", but not "ABAAC".
"\*C" fits "*C"; the effect of the escape symbol is that "*" is not interpreted as a special character.
This is the first time I have asked a question, so I apologize if the format is not correct or the text is too long.
Many thanks for your time and help!
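To pin down the intended result, here is a minimal sketch in Python's regex syntax, not in SAP search-string syntax. It treats the question marks around the number as literal delimiters. By the documentation quoted above, the SAP equivalent would presumably escape them with the backslash, as in \?######\?, but that is an untested assumption.

```python
import re

# Sample text copied from the question.
text = ":86:306?00CCY RECD?20/BI/**?651234?**/BO/DE652004ED"

# Literal '?', six digits (captured), literal '?'.
pattern = re.compile(r"\?(\d{6})\?")

matches = pattern.findall(text)
print(matches)  # ['651234']
```

The second number, 652004, is not returned because it is surrounded by letters rather than question marks.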

Regex to exclude non-ASCII but keep Nordic characters

I have a macro in which I use Regex to strip a text of all non-ASCII characters (in order to create folder names).
I am relatively new to Regex and I was wondering how to strip all non-ASCII but still include Nordic characters, as the macro goes through Scandinavian data. Basically, I would need to include characters 128 to 165 from this table
Here is my code so far:
Public Function GetStrippedText(txt As String) As String
Dim regEx As Object
Set regEx = CreateObject("vbscript.regexp")
regEx.Pattern = "[^\u0000-\u007F]"
GetStrippedText = regEx.Replace(txt, "")
End Function
I understand that I need to include this range in there somehow "[^\u0000-\u007F]", I just don't know where to find the associated code or how to include it.
To the best of my knowledge, there are a few points to highlight here:
Not all extended (or non-) ASCII tables follow the same character encoding. The table you linked seems to follow CP437, whereas Excel works with Unicode, which you can test using the UNICODE function in Excel. Here is a link to see the difference it makes in hex codes. So you most likely need to pick a range of interest within the "Latin-1 Supplement" block, which can be found here. For this exercise I went with the characters À-ÿ, which is the range \u00C0-\u00FF.
Next, your current character class covers normal ASCII characters; however, I believe you are only interested in 0020-007F, as you probably don't want to include the control characters 0000-001F.
Thirdly, you did not set the Global parameter to True, which means your current UDF only replaces the first character it finds outside your character class. You need to set this parameter to replace all characters outside the defined character class.
So to conclude, the below might work for you:
Public Function GetStrippedText(txt As String) As String
Dim regEx As Object
Set regEx = CreateObject("vbscript.regexp")
regEx.Global = True
regEx.Pattern = "[^\u0020-\u007F\u00C0-\u00FF]"
GetStrippedText = regEx.Replace(txt, "")
End Function
For your understanding; [^\u0020-\u007F\u00C0-\u00FF] means:
[....] - The brackets tell us this is a character class
^ - The caret means it's a negated character class
\u0020-\u007F - means the characters run from index 32 till index 127 and \u00C0-\u00FF runs from 192 till 255.
In this same fashion you can extend the amount of character ranges.
Note1: Instead of Unicode, you could also just use the Hex codes: "[^\x20-\x7F\xC0-\xFF]"
Note2: You could also create a character class without Unicode or Hex ranges. Simply concatenate the characters of interest instead.
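For comparison, the same character class can be tried outside VBA. This is a minimal Python sketch using the ranges from the answer above; the function name `get_stripped_text` simply mirrors the UDF, and the sample strings are made up.

```python
import re

# Keep printable ASCII (U+0020-U+007F) and the Latin-1 letters U+00C0-U+00FF;
# everything else is removed.
OUTSIDE_RANGES = re.compile(r"[^\u0020-\u007F\u00C0-\u00FF]")

def get_stripped_text(txt):
    """Remove every character outside the two allowed ranges."""
    return OUTSIDE_RANGES.sub("", txt)

print(get_stripped_text("Malmö–Åre\t№1"))  # MalmöÅre1
print(get_stripped_text("Ærøskøbing"))     # Ærøskøbing
```

Python's `sub` replaces all occurrences by default, so there is no counterpart to the Global flag here.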

Can you restrict two characters based on their ASCII order in regex?

Let's say I have a string of 2 characters. Using regex (as a thought exercise), I want to accept it only if the first character has an ascii value bigger than that of the second character.
ae should not match because a is before e in the ASCII table.
ea, za and aA should match for the opposite reason
f$ should match because $ is before letters in the ascii table.
It doesn't matter if aa or a matches or not, I'm only interested in the base case. Any flavor of regex is allowed.
Can it be done ? What if we restrict the problem to lowercase letters only ? What if we restrict it to [abc] only ? What if we invert the condition (accept when the characters are ordered from smallest to biggest) ? What if I want it to work for N characters instead of 2 ?
I guess that'd be almost impossible for me to do it then, however bobble-bubble impressively solved the problem with:
^~*\}*\|*\{*z*y*x*w*v*u*t*s*r*q*p*o*n*m*l*k*j*i*h*g*f*e*d*c*b*a*`*_*\^*\]*\\*\[*Z*Y*X*W*V*U*T*S*R*Q*P*O*N*M*L*K*J*I*H*G*F*E*D*C*B*A*@*\?*\>*\=*\<*;*\:*9*8*7*6*5*4*3*2*1*0*\/*\.*\-*,*\+*\**\)*\(*'*&*%*\$*\#*"*\!*$(?!^)
bobble bubble RegEx Demo
Maybe for abc only or some short sequences we would approach solving the problem with some expression similar to,
^(abc|ab|ac|bc|a|b|c)$
^(?:abc|ab|ac|bc|a|b|c)$
that might help you to see how you would go about it.
RegEx Demo 1
You can simplify that to:
^(a?b?c?)$
^(?:a?b?c?)$
RegEx Demo 2
but I'm not so sure about it.
The number of characters you want to allow is independent of the ordering problem, because you can add a separate assertion for it, such as:
(?!.{n})
where n-1 is the maximum number of characters allowed, which in this case would be
(?!.{3})^(?:a?b?c?)$
(?!.{3})^(a?b?c?)$
RegEx Demo 3
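A quick way to see what these expressions accept is to run them against a few inputs. The sketch below uses Python's `re` module as an assumption; the demos above may use a different flavor. Note that a?b?c? accepts characters in ascending order, which is the inverted variant mentioned in the question, and that it also accepts the empty string.

```python
import re

ordered = re.compile(r"^(?:a?b?c?)$")
# Same pattern with a leading lookahead that rejects 3 or more characters.
ordered_max2 = re.compile(r"(?!.{3})^(?:a?b?c?)$")

for candidate in ["ac", "ca", "abc", "ab", ""]:
    print(repr(candidate),
          bool(ordered.match(candidate)),
          bool(ordered_max2.match(candidate)))
```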
A regex is not the best tool for the job.
But it's doable. A naive approach is to enumerate all the printable ASCII characters, each followed by the range of characters below it:
\x21[ -\x20]|\x22[ -\x21]|\x23[ -\x22]|\x24[ -\x23]|\x25[ -\x24]|\x26[ -\x25]|\x27[ -\x26]|\x28[ -\x27]|\x29[ -\x28]|\x2a[ -\x29]|\x2b[ -\x2a]|\x2c[ -\x2b]|\x2d[ -\x2c]|\x2e[ -\x2d]|\x2f[ -\x2e]|\x30[ -\x2f]|\x31[ -\x30]|\x32[ -\x31]|\x33[ -\x32]|\x34[ -\x33]|\x35[ -\x34]|\x36[ -\x35]|\x37[ -\x36]|\x38[ -\x37]|\x39[ -\x38]|\x3a[ -\x39]|\x3b[ -\x3a]|\x3c[ -\x3b]|\x3d[ -\x3c]|\x3e[ -\x3d]|\x3f[ -\x3e]|\x40[ -\x3f]|\x41[ -\x40]|\x42[ -\x41]|\x43[ -\x42]|\x44[ -\x43]|\x45[ -\x44]|\x46[ -\x45]|\x47[ -\x46]|\x48[ -\x47]|\x49[ -\x48]|\x4a[ -\x49]|\x4b[ -\x4a]|\x4c[ -\x4b]|\x4d[ -\x4c]|\x4e[ -\x4d]|\x4f[ -\x4e]|\x50[ -\x4f]|\x51[ -\x50]|\x52[ -\x51]|\x53[ -\x52]|\x54[ -\x53]|\x55[ -\x54]|\x56[ -\x55]|\x57[ -\x56]|\x58[ -\x57]|\x59[ -\x58]|\x5a[ -\x59]|\x5b[ -\x5a]|\x5c[ -\x5b]|\x5d[ -\x5c]|\x5e[ -\x5d]|\x5f[ -\x5e]|\x60[ -\x5f]|\x61[ -\x60]|\x62[ -\x61]|\x63[ -\x62]|\x64[ -\x63]|\x65[ -\x64]|\x66[ -\x65]|\x67[ -\x66]|\x68[ -\x67]|\x69[ -\x68]|\x6a[ -\x69]|\x6b[ -\x6a]|\x6c[ -\x6b]|\x6d[ -\x6c]|\x6e[ -\x6d]|\x6f[ -\x6e]|\x70[ -\x6f]|\x71[ -\x70]|\x72[ -\x71]|\x73[ -\x72]|\x74[ -\x73]|\x75[ -\x74]|\x76[ -\x75]|\x77[ -\x76]|\x78[ -\x77]|\x79[ -\x78]|\x7a[ -\x79]|\x7b[ -\x7a]|\x7c[ -\x7b]|\x7d[ -\x7c]|\x7e[ -\x7d]|\x7f[ -\x7e]
Try it online!
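Rather than typing the alternation by hand, it can be generated. This is a minimal Python sketch that rebuilds the same kind of pattern and checks it against the examples from the question; the function name `build_pair_pattern` is made up for illustration.

```python
import re

def build_pair_pattern():
    """One alternative per character: the character, then anything below it."""
    parts = []
    for code in range(0x21, 0x80):
        parts.append(f"\\x{code:02x}[ -\\x{code - 1:02x}]")
    return "|".join(parts)

pair = re.compile(build_pair_pattern())

for candidate in ["ae", "ea", "za", "aA", "f$"]:
    print(candidate, bool(pair.fullmatch(candidate)))
```

`fullmatch` is used so that only two-character strings are tested as a whole.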
A (better) alternative is to enumerate the ascii characters in reverse order and use the ^ and $ anchors to assert there is nothing else unmatched. This should work for any string length:
^\x7f?\x7e?\x7d?\x7c?\x7b?z?y?x?w?v?u?t?s?r?q?p?o?n?m?l?k?j?i?h?g?f?e?d?c?b?a?`?\x5f?\x5e?\x5d?\x5c?\x5b?Z?Y?X?W?V?U?T?S?R?Q?P?O?N?M?L?K?J?I?H?G?F?E?D?C?B?A?@?\x3f?\x3e?\x3d?\x3c?\x3b?\x3a?9?8?7?6?5?4?3?2?1?0?\x2f?\x2e?\x2d?\x2c?\x2b?\x2a?\x29?\x28?\x27?\x26?\x25?\x24?\x23?\x22?\x21?\x20?$
Try it online!
You may replace ? with * if you want to allow duplicate characters.
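The reverse-order pattern can also be generated, which avoids typos in the long literal. This is a minimal Python sketch under the assumption that the alphabet is ASCII 0x20 to 0x7F; the function name `build_descending_pattern` and its `quantifier` parameter are illustrative.

```python
import re

def build_descending_pattern(quantifier="?"):
    """Every character from 0x7F down to 0x20, each made optional."""
    body = "".join(
        re.escape(chr(code)) + quantifier
        for code in range(0x7F, 0x1F, -1)
    )
    return f"^{body}$"

strict = re.compile(build_descending_pattern())      # no duplicates allowed
loose = re.compile(build_descending_pattern("*"))    # duplicates allowed

for candidate in ["ae", "ea", "za", "aA", "f$", "aa"]:
    print(candidate, bool(strict.match(candidate)), bool(loose.match(candidate)))
```

With `?` the string "aa" is rejected, while with `*` it is accepted, which matches the remark about duplicate characters.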
PS: some people can come up with absurdly long regexes when they aren't the right tool for the job: to parse email, HTML or the present question.

Simple regex to match multiple words with spaces, multiple spaces or no spaces, and special characters

I have a string that is delimited by a comma.
The first 3 fields are static.
Fields 4-20 are dynamic and can contain any string even if it has special characters but cannot be empty.
Field 21 is static
Field 22 is dynamic and can contain any string even if it has special characters.
Fields 23,24 are static.
I need to make sure the string matches the above criteria and is a match, but am wondering on how to make fields 4-20 have the option of containing the special characters and not be blank. (Total of 17 between 4-20)
If I remove the requirement of the special characters this seems to work:
Field1\,Field2\,Field3\,+([\w\s\,]+)F21/C\,[\w\s\,]+(F/23\,)(Field24)
with this string
Field1,Field2,Field3,F4,f5,6f 1,f72,f8,F9,F10,F1,f12,f13,f14,f15,f16,f17,f18,f19,f20,F21/C,F22,F/23,Field24
Is there a way to accomplish this with fields 4-20 having special characters and not being empty like "" or " " or am I pushing it too far?
I know I can parse it through c# but I'm experimenting with Regex and it seems pretty powerful.
Thanks
I did not fully understand the problem, but I think this is what you want in the end:
s1,s2,s3,([^ ,]+,){17}s21,[^ ,]+,s23,s24
Replace each sX with the relevant static field.
example:
https://regex101.com/r/EaAPKH/1
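One thing to watch with [^ ,]+ is that it rejects any field containing a space, and the sample string from the question has such a field ("6f 1"). Below is a minimal Python sketch of a variant that allows spaces and special characters inside a field but still rejects empty or whitespace-only fields. It assumes fields never contain commas and that field 22 may be empty; the constant names are made up for illustration.

```python
import re

STATIC_HEAD = ["Field1", "Field2", "Field3"]
FIELD_21 = "F21/C"
STATIC_TAIL = ["F/23", "Field24"]

# Any run without commas that contains at least one non-whitespace character.
NON_BLANK = r"[^,]*[^\s,][^,]*"

pattern = re.compile(
    "^"
    + ",".join(map(re.escape, STATIC_HEAD)) + ","
    + rf"(?:{NON_BLANK},){{17}}"          # fields 4-20
    + re.escape(FIELD_21) + ","
    + r"[^,]*,"                           # field 22 (assumed: may be empty)
    + ",".join(map(re.escape, STATIC_TAIL))
    + "$"
)

sample = ("Field1,Field2,Field3,F4,f5,6f 1,f72,f8,F9,F10,F1,f12,f13,f14,"
          "f15,f16,f17,f18,f19,f20,F21/C,F22,F/23,Field24")

print(bool(pattern.match(sample)))                                 # sample from the question
print(bool(pattern.match(sample.replace(",f5,", ", ,"))))          # blank field
print(bool(pattern.match(sample.replace(",f5,", ",f$5 (x)!,"))))   # special characters
```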

How to match the following?

The data I want to parse has columns with the following format:
Character Big Medium Meaning ImageCode Small Constitutens Lesson Frame Strokes JH JTPL Heisig Story koohiiStory1 koohiiStory2 On-Reading Kun-Reading Examples:
All of those are separated by tabs \t (even though it may not look like it on the browser). Also notice at the end of each line there is a colon :. The problem is that the columns koohiiStory2 and examples may or may not exist and there may also be cases in which the data is corrupt and there is a tab inside Heisig Story but those are the minority.
What I'm trying to match is the values for On-Reading, Kun-Reading and Examples. All of these are distinct from the rest because they don't use standard English characters (romaji); they use Japanese characters instead, with the exception of perhaps a few commas or dots. It is also guaranteed that either Kun-Reading or Examples will end with a colon :, that On-Reading and Kun-Reading will exist, and that all three columns will be consecutive.
Here is some sample data.
How can I parse that to return this?
Alright, I'll give it a shot.
Since the content you expect is mostly non-ASCII characters between a dot followed by a space or tab, and a colon:
(?<=\.(\s|\t)) // Positive lookbehind for a dot followed by a space or tab
[^\w]+ // One or more non-word characters
(?=\:) // Positive lookahead for a ':'
Working sample on regex101
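When porting this to an engine whose \w is Unicode-aware, the pattern needs care, because kana and kanji then count as word characters. The following is a minimal Python sketch that uses the `re.ASCII` flag to restore the ASCII-only meaning of \w. The sample line is invented for illustration and is not taken from the real data.

```python
import re

# re.ASCII makes \w mean [a-zA-Z0-9_], so Japanese text counts as "non-word".
READINGS = re.compile(r"(?<=\.[ \t])[^\w]+(?=:)", re.ASCII)

# Hypothetical row: tab-separated columns, story ends with a dot, line ends with ':'.
line = "木\ttree\tThe story ends here.\tモク、ボク\tき\t木曜日:"

match = READINGS.search(line)
columns = match.group(0).split("\t") if match else []
print(columns)  # ['モク、ボク', 'き', '木曜日']
```

Splitting the match on tabs then yields the On-Reading, Kun-Reading and Examples values as separate items.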