Regex to exclude non-ASCII but keep Nordic characters

I have a macro in which I use Regex to strip a text of all non-ASCII characters (in order to create folder names).
I am relatively new to Regex and was wondering how to strip all non-ASCII characters while still keeping the Nordic ones, as the macro goes through Scandinavian data. Basically, I would need to include characters 128 to 165 from this table.
Here is my code so far:
Public Function GetStrippedText(txt As String) As String
Dim regEx As Object
Set regEx = CreateObject("vbscript.regexp")
regEx.Pattern = "[^\u0000-\u007F]"
GetStrippedText = regEx.Replace(txt, "")
End Function
I understand that I need to include this range in there somehow "[^\u0000-\u007F]", I just don't know where to find the associated code or how to include it.

To the best of my knowledge, there are a few points to highlight here:
Not all extended (non-ASCII) tables follow the same character encoding. The table you linked seems to follow CP437, whereas Excel works with Unicode, which you can test using the UNICODE function in Excel. Here is a link to see the difference it makes in hex codes. So you most likely need to pick a range of interest within the "Latin-1 Supplement" block, which can be found here. For this exercise I went with the characters from À to ÿ, which is the range \u00C0-\u00FF.
Next, your current character class covers the normal ASCII characters; however, I believe you are only interested in 0020-007F, as you probably don't want to keep the control characters 0000-001F.
Thirdly, you did not set the Global property to True, which means your current UDF will only replace the first character it finds outside your character class. You need to set this property to replace all characters outside the defined character class.
So to conclude, the below might work for you:
Public Function GetStrippedText(txt As String) As String
    Dim regEx As Object
    Set regEx = CreateObject("vbscript.regexp")
    ' Replace every match, not just the first one
    regEx.Global = True
    ' Keep U+0020-U+007F and U+00C0-U+00FF; match everything else
    regEx.Pattern = "[^\u0020-\u007F\u00C0-\u00FF]"
    GetStrippedText = regEx.Replace(txt, "")
End Function
For your understanding, [^\u0020-\u007F\u00C0-\u00FF] means:
[....] - The brackets tell us this is a character class
^ - The caret means it's a negated character class
\u0020-\u007F - means the characters run from code point 32 to 127, and \u00C0-\u00FF runs from 192 to 255.
In the same fashion you can add further character ranges.
Note1: Instead of Unicode escapes, you could also use hex escapes: "[^\x20-\x7F\xC0-\xFF]"
Note2: You could also create a character class without Unicode or hex ranges. Simply list the characters of interest inside the brackets instead.
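For comparison, the same negated character class can be tried outside VBA. The following is only a sketch in JavaScript, not part of the original answer; the function name stripText is made up for illustration, and the g flag plays the role of Global = True:

```javascript
// Hypothetical helper mirroring GetStrippedText from the answer above.
// Keeps U+0020-U+007F and U+00C0-U+00FF and removes everything else.
function stripText(txt) {
  // The "g" flag replaces every match, like regEx.Global = True in VBA.
  return txt.replace(/[^\u0020-\u007F\u00C0-\u00FF]/g, "");
}

// \u00E5 = å, \u00E6 = æ, \u00D8 = Ø, \u263A = ☺ (outside both ranges)
console.log(stripText("Bl\u00E5b\u00E6r\u263A \u00D8l"));
```

Control characters such as tabs are removed as well, because the kept range starts at \u0020.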

Related

Regex to insert space with certain characters but avoid date and time

I made a regex which inserts a space wherever any of the characters
-:\*_/;, is present. For example, JET*AIRWAYS\INDIA/858701/IDBI 05/05/05;05:05:05 a/c should be JET* AIRWAYS\ INDIA/ 858701/ IDBI 05/05/05; 05:05:05 a/c
The regex I used is (?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)
I have added some word exceptions like a/c, w/d, etc. The \D conditions are there to keep date/time values from being separated, but this created an issue: numbers followed by the above-mentioned characters never get split.
My requirement is
1. Insert a space after characters -:\*_/;,
2. but date and time should not get split which may have / :
3. need exception on words like a/c w/d
The following is the full code
Private Function formatColon(oldString As String) As String
Dim reg As New RegExp: reg.Global = True: reg.Pattern = "(?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)" '"(\D:|\D/|\D-|^w/d)"
Dim newString As String: newString = reg.Replace(oldString, "$1 ")
formatColon = XtraspaceKill(newString)
End Function
I would use 3 replacements.
1. Replace all date and time special characters with a special macro that should never be found in your text, e.g. for 05/15/2018 4:06 PM, something based on your name:
05MANUMOHANSLASH15MANUMOHANSLASH2018 4MANUMOHANCOLON06 PM
You can encode exceptions too, like this:
aMANUMOHANSLASHc
2. Now run your original regex to replace all special characters.
3. Finally, replace the macros MANUMOHANSLASH and MANUMOHANCOLON back with / and :.
Meanwhile, let me tell you why this is complicated in a single regex.
If trying to do this in a single regex, you have to ask, for each / or :, "Am I a part of a date or time?"
To answer that, you need to use lookahead and lookbehind assertions, the latter of which Microsoft has finally added support for.
But given a /, you don't know if you're between the first and second, or second and third parts of the date. Similar for time.
The number of cases you need to consider will render your regex unmaintainably complex.
So please just use a few separate replacements :-)
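The three-replacement approach could be sketched as follows. This is JavaScript rather than VBA, and the date and time shapes (dd/dd/dd or dd/dd/dddd, d:dd with optional seconds) as well as the function names are assumptions made for illustration; only the placeholder tokens and the list of exceptions come from the posts above:

```javascript
// Placeholder tokens taken from the answer; they must never occur in real input.
const SLASH = "MANUMOHANSLASH";
const COLON = "MANUMOHANCOLON";

// Swap "/" and ":" for placeholders inside one matched fragment.
const protect = (fragment) =>
  fragment.replace(/\//g, SLASH).replace(/:/g, COLON);

function insertSpaces(text) {
  // Step 1: hide separators in dates, times and the listed abbreviations.
  // The date and time patterns are assumptions about the data.
  let s = text
    .replace(/\b\d{1,2}\/\d{1,2}\/\d{2,4}\b/g, protect)
    .replace(/\b\d{1,2}:\d{2}(?::\d{2})?\b/g, protect)
    .replace(/\b(?:a\/c|w\/d|m\/s|s\/w|m\/o)\b/gi, protect);

  // Step 2: add a space after each remaining special character,
  // unless whitespace or the end of the string already follows.
  s = s.replace(/[-:\\*_\/;,](?!\s|$)/g, "$& ");

  // Step 3: put the original separators back.
  return s.split(SLASH).join("/").split(COLON).join(":");
}

console.log(insertSpaces("JET*AIRWAYS\\INDIA/858701/IDBI 05/05/05;05:05:05 a/c"));
```

Each step stays a simple pattern, so new exceptions can be added to step 1 without touching the other two.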

simple regex to matching multiple word with spaces/multiple space or no spaces and special characters

I have a string that is delimited by a comma.
The first 3 fields are static.
Fields 4-20 are dynamic and can contain any string even if it has special characters but cannot be empty.
Field 21 is static
Field 22 is dynamic and can contain any string even if it has special characters.
Fields 23,24 are static.
I need to make sure the string matches the above criteria and is a match, but am wondering on how to make fields 4-20 have the option of containing the special characters and not be blank. (Total of 17 between 4-20)
If I remove the requirement of the special characters this seems to work:
Field1\,Field2\,Field3\,+([\w\s\,]+)F21/C\,[\w\s\,]+(F/23\,)(Field24)
with this string
Field1,Field2,Field3,F4,f5,6f 1,f72,f8,F9,F10,F1,f12,f13,f14,f15,f16,f17,f18,f19,f20,F21/C,F22,F/23,Field24
Is there a way to accomplish this with fields 4-20 having special characters and not being empty like "" or " " or am I pushing it too far?
I know I can parse it through c# but I'm experimenting with Regex and it seems pretty powerful.
Thanks
I did not fully understand the problem,
but I think this is what you want, bottom line:
s1,s2,s3,([^ ,]+,){17}s21,[^ ,]+,s23,s24
Replace each sX with the relevant static field.
example:
https://regex101.com/r/EaAPKH/1
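Note that [^ ,]+ rejects any field containing a space, including "6f 1" from the sample string. A variation that allows inner spaces but still rejects blank fields could look like the sketch below. It is written in JavaScript instead of C#, it swaps in a different field pattern from the one in the answer, and the names FIELD, rowPattern and isValidRow are made up for illustration:

```javascript
// One dynamic field: no commas, and at least one non-whitespace character.
const FIELD = String.raw`[^,]*[^\s,][^,]*`;

// Field1, Field2, ... are the sample static values from the question.
const rowPattern = new RegExp(
  `^Field1,Field2,Field3,(?:${FIELD},){17}F21/C,${FIELD},F/23,Field24$`
);

function isValidRow(row) {
  return rowPattern.test(row);
}

const sample =
  "Field1,Field2,Field3,F4,f5,6f 1,f72,f8,F9,F10,F1,f12,f13,f14,f15,f16,f17,f18,f19,f20,F21/C,F22,F/23,Field24";
console.log(isValidRow(sample)); // true
console.log(isValidRow(sample.replace(",f5,", ", ,"))); // false: field 5 is blank
```

Because a field can never contain a comma, the {17} quantifier counts exactly the fields 4 to 20.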

Regular expression to restrict Extended ASCII character set

I have a multilingual application which creates XML files, but the Extended ASCII characters from 168 to 254 (¿⌐¬½¼¡«»░▓│┤╡╢╖╕╣║╗╜╛┐└┴┬├) are not supposed to appear in XML tags, so I would like to prevent users from entering them.
I tried restricting everything besides alphanumerics, underscore and dash, but that would not allow accented characters such as ó ç õ, which are part of Extended ASCII. Here is the regex: "^[a-zA-Z0-9\s.\-_]+$"
Second option was to create a string of all symbols from 168 to 254 and check if string contains any of them but not sure if it is reliable and accurate solution.
What is the best way to filter input for the Extended ASCII character set?
Link to Extended ASCII character set table
Rather, you can use a range in a character class to exclude a specific range of characters by their hex codes:
[^\xA8-\xFE]
The above regex will match any character except those in the given range. Those are the hex codes for the range you posted, [168, 254].
Although @Oded's suggestion was applicable, I used the following solution:
Dim filteredInput as string
Private const XML_RESTRICTED_CHARACTERS as string ="[☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼#$%&()*+,-./:;<=>?#[\]^_`¢£¥₧ƒªº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀αßΓπΣσµτΦΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■""}{]"
filteredInput =Regex.Replace(strInput.ToLower(), XML_RESTRICTED_CHARACTERS, "")
"Second option was to create a string of all symbols from 168 to 254 and check if string contains any of them but not sure if it is reliable and accurate solution."
Yes, this is a reliable and accurate solution. It is also more lightweight than regular expressions.
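A minimal sketch of that containment check, in JavaScript rather than VB.NET; the names RESTRICTED and containsRestricted are made up for illustration, and the character list is copied from the question. The last line shows why the explicit list is safer here: in a Unicode-based regex engine the range \xA8-\xFE refers to code points U+00A8 to U+00FE, which include accented letters such as ó (U+00F3), not the CP437 box-drawing symbols.

```javascript
// Characters to reject, copied verbatim from the question.
const RESTRICTED = "¿⌐¬½¼¡«»░▓│┤╡╢╖╕╣║╗╜╛┐└┴┬├";

function containsRestricted(input) {
  // Iterate by code point and look each character up in the list.
  for (const ch of input) {
    if (RESTRICTED.includes(ch)) return true;
  }
  return false;
}

console.log(containsRestricted("cora\u00E7\u00E3o")); // false: ç and ã are not listed
console.log(containsRestricted("tag\u2502name"));     // true: │ (U+2502) is listed
console.log(/[\xA8-\xFE]/.test("\u00F3"));            // true: the hex range matches ó
```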

Regex replace filename in javascript

I'm having trouble with a regular expression. I have several images whose file names need changing. I've done them by hand; it was quick, easy and painless. However, I wanted to know how to do it as a simple regex replacement in JavaScript, and that's where it doesn't quite work out. The image is called "multi blossom 02.png" and it's going to be resized and saved as a JPEG with the name "iOS_multi_BLOSSOM_2048.jpg". The others are of the same form but have different nouns: winter, leaf, circus, etc.
The file-name is structured as follows:
"multi" at the start (lower case),
white space,
the noun (lower case),
white-space,
a number (that may have a preceding 0 and may be one or two digits),
file extension which may be .png or .psd (lowercase).
It then needs to be changed to:
iOS_multi (camel case as written),
noun (UPPERCASE),
2048 (new fixed size),
new file extension .jpg(lowercase).
I know that ([a-z]+\s) matches "multi" and that (\s\d+.[a-z]+$) will match the numbers and file extension, but have no idea how to successfully match the bit in the middle as well. And do the uppercase on the noun. But I'm sure there is someone else that does. Thank you.
In JavaScript you cannot do this with a plain replacement string, as a replacement pattern cannot uppercase the matched text (replace would need a callback function for that). However, the match method returns an array which you can then manipulate.
var oldImageName = "multi blossom 02.png";
var matches = oldImageName.match(/multi (\w+) \d{1,2}\.(?:png|psd)/);
var newImageName = "iOS_multi_" + matches[1].toUpperCase() + "_2048.jpg";
Note: this assumes that the "noun" is a single word with no spaces
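As an alternative sketch, String.prototype.replace also accepts a function as its second argument, which makes the uppercase step possible in a single call. The function name renameImage is made up for illustration, and the pattern assumes the file-name structure described in the question:

```javascript
function renameImage(oldName) {
  // Group 1 captures the noun; the callback builds the new name from it.
  return oldName.replace(
    /^multi\s+([a-z]+)\s+\d{1,2}\.(?:png|psd)$/,
    (_match, noun) => `iOS_multi_${noun.toUpperCase()}_2048.jpg`
  );
}

console.log(renameImage("multi blossom 02.png")); // iOS_multi_BLOSSOM_2048.jpg
```

A name that does not fit the pattern is returned unchanged, so it is easy to spot files that were skipped.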
I was searching for "javascript Regex to replace characters that Windows doesn't accept in a filename" but found nothing,
so here is a regex to strip the characters that the Windows filesystem does not allow in a filename (/ \ : * ? < > | "):
var originalFileName = 'some filename:with"forbidden/>\\? chars.in';
var strippedFileName = originalFileName.replace(/[\/\\:*?<>|"]+/g, "");
console.log(strippedFileName);

Strategy to replace spaces in string

I need to store a string with its spaces replaced by some other character. When I retrieve it, I need to turn that character back into spaces. I have thought of this strategy: while storing, I replace (space with _a) and (_a with _aa), and while retrieving, I replace (_a with space) and (_aa with _a), so that even if the user enters _a in the string it will be handled. But I don't think this is a good strategy. Please let me know if anyone has a better one.
Replacing spaces with something is a problem when something is already in the string. Why don't you simply encode the string - there are many ways to do that, one is to convert all characters to hexadecimal.
For instance
Hello world!
is encoded as
48656c6c6f20776f726c6421
The space is 0x20. Then you simply decode the string back (hex to ASCII).
This way there are no spaces in the encoded string.
-- Edit - optimization --
You replace all % and all spaces in the string with %xx where xx is the hex code of the character.
For instance
Wine having 12% alcohol
becomes
Wine%20having%2012%25%20alcohol
%20 is space
%25 is the % character
This way, neither % nor (space) are a problem anymore - Decoding is easy.
Encoding algorithm
- replace all `%` with `%25`
- replace all ` ` with `%20`
Decoding algorithm
- replace all `%xx` with the character having `xx` as hex code
(You may even optimize further, since you only need to encode two characters: use %1 for % and %2 for the space. But I recommend the %xx solution as it is more portable, and it can be reused later on if you need to encode more characters.)
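The encoding and decoding algorithms above could be sketched like this in JavaScript; the function names encodeSpaces and decodeSpaces are made up for illustration:

```javascript
function encodeSpaces(s) {
  // Encode "%" first, otherwise the "%" of a freshly written "%20"
  // would be encoded a second time.
  return s.replace(/%/g, "%25").replace(/ /g, "%20");
}

function decodeSpaces(s) {
  // A single pass over every %xx sequence, so a decoded "%" is never re-read.
  return s.replace(/%([0-9A-Fa-f]{2})/g, (_match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

console.log(encodeSpaces("Wine having 12% alcohol")); // Wine%20having%2012%25%20alcohol
console.log(decodeSpaces("Wine%20having%2012%25%20alcohol")); // Wine having 12% alcohol
```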
I'm not sure your solution will work. When reading, how would you
distinguish between strings that were originally " a" and strings that
were originally "_a": if I understand correctly, both will end up
"_aa".
In general, given a situation where a specific set of characters cannot
appear as such, but must be encoded, the solution is to choose one of
allowed characters as an "escape" character, remove it from the set of
allowed characters, and encode all of the forbidden characters
(including the escape character) as a two (or more) character sequence
starting with the escape character. In C++, for example, a new line is
not allowed in a string or character literal. The escape character is
\; because of that, it must be encoded as an escape sequence as well.
So we have "\n" for a new line (the choice of n is arbitrary), and
"\\" for a \. (The choice of \ for the second character is also
arbitrary, but it is fairly usual to use the escape character, escaped,
to represent itself.) In your case, if you want to use _ as the
escape character, and "_a" to represent a space, the logical choice
would be "__" to represent a _ (but I'd suggest something a little
more visually suggestive—maybe ^ as the escape, with "^_" for
a space and "^^" for a ^). When reading, anytime you see the escape
character, the following character must be mapped (and if it isn't one
of the predefined mappings, the input text is in error). This is simple
to implement, and very reliable; about the only disadvantage is that in
an extreme case, it can double the size of your string.
Do you want to implement this using C/C++? I think you should split your string into multiple parts, separated by spaces.
If your string is like this: "a__b" (multiple consecutive spaces), it will be split into:
sub[0] = "a";
sub[1] = "";
sub[2] = "b";
Hope this will help!
With a normal string that can use X distinct characters, you cannot encode it with only X-1 characters while spending just one output character per input character.
You can use a combination of 2 chars to replace a given character (this is exactly what you are trying in your example).
To do this, loop through your string to count the spaces, allocate a new character array of the resulting length, and replace each space with "//" (this is just an example). The problem with this approach is that you cannot have "//" in your input string.
Another approach would be to use a rarely used char, for example "^", to replace the spaces.
The last approach, a popular one, is a combination of these two. It is used in Unix and PHP to put a syntax character into a string as a literal. If you want a " inside a string, you simply write it as \", etc.
Why don't you use the Replace function?
String* stringWithoutSpace= stringWithSpace->Replace(S" ", S"replacementCharOrText");
So now stringWithoutSpace contains no spaces. When you want to put those spaces back,
String* stringWithSpacesBack= stringWithoutSpace ->Replace(S"replacementCharOrText", S" ");
I think just coding to ascii hexadecimal is a neat idea, but of course doubles the amount of storage needed.
If you want to do this using less memory, then you will need two-letter sequences, and have to be careful that you can go back easily.
You could e.g. replace blank by _a, but you also need to take care of your escape character _. To do this, replace every _ by __ (two underscores). You need to scan through the string once and do both replacements simultaneously.
This way, in the resulting text all original underscores will be doubled, and the only other occurrence of an underscore will be in the combination _a. You can safely translate this back. Whenever you see an underscore, you need a lookahead of 1 to see what follows. If an a follows, then this was a blank before. If _ follows, then it was an underscore before.
Note that the point is to replace your escape character (_) in the original string, and not the character sequence to which you map the blank. Your idea of replacing _a breaks, as you do not know whether _aa was originally _a or " a" (a blank followed by a).
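The escape scheme described above (_ becomes __, blank becomes _a) could be sketched as follows. This is JavaScript rather than C++, and the function names encodeBlanks and decodeBlanks are made up for illustration:

```javascript
function encodeBlanks(s) {
  // One pass: both replacements happen simultaneously.
  return s.replace(/[_ ]/g, (ch) => (ch === "_" ? "__" : "_a"));
}

function decodeBlanks(s) {
  // Every "_" is an escape character; look at the one character after it.
  return s.replace(/_([\s\S]?)/g, (_match, next) => {
    if (next === "a") return " ";
    if (next === "_") return "_";
    throw new Error(`Invalid escape sequence "_${next}"`);
  });
}

console.log(encodeBlanks(" a")); // _aa
console.log(encodeBlanks("_a")); // __a
console.log(decodeBlanks(encodeBlanks("_a b"))); // _a b
```

The two inputs that collided in the original idea, " a" and "_a", now encode to different strings.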
I'm guessing that there is more to this question than appears; for example, that the strings you are storing must not only be free of spaces, but must also look like words or some such. You should be clear about your requirements (and you might consider satisfying the curiosity of the spectators by explaining why you need to do such things).
Edit: As JamesKanze points out in a comment, the following won't work in the case where you can have more than one consecutive space. But I'll leave it here anyway, for historical reference. (I modified it to compress consecutive spaces, so it at least produces unambiguous output.)
std::string out;
char prev = 0;
for (char ch : in) {  // `in` is the input std::string
    if (ch == ' ') {
        if (prev != ' ') out.push_back('_');
    } else {
        if (prev == '_' && ch != '_') out.push_back('_');
        out.push_back(ch);
    }
    prev = ch;
}
if (prev == '_') out.push_back('_');