Perl Regular Expression cannot find fancy quotes “

I was trying to find the fancy quotes “ from a string using the following Perl regular expression but it returns false.
$text = "NBN “a joint venture with Telstra”";
if ($text =~ m/“/)
{
print "found";
}
I also tried using the "\x93" character code, but that does not work either. I am stuck here.
Any help is appreciated.
Regards,
Allen

Depending on the encoding of the string you are trying to match, you might need to do different things. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
If the input string is encoded in UTF-8, then you need to specify that encoding in your perl script - one way to do that is with use encoding 'UTF-8'.
You can also specify use utf8 if you want the encoding of the script itself to be UTF-8. You are probably better off, though, knowing the code point of the character you are checking for, and specifying it directly:
use utf8;
use encoding 'UTF-8';
$text = "NBN “a joint venture with Telstra”"; # Make sure to quote this string properly
if ($text =~ m/\N{U+201C}/) # “ is the same as U+201C LEFT DOUBLE QUOTATION MARK
{
print "found";
}

See the "Demoroniser" and for your specific problem, the discussion of just the "smart" quotes bit of it on Perlmonks Re^3: Reg Ex to strip MS smart quotes.
This advice is assuming - perhaps incorrectly - that your database's "fancy quotes" have come from some piece of Microsoft software producing Windows-1252 encoded text - if you've got UTF-8 instead, Avi's already pointed you in the right direction.
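To see why "\x93" and “ can both refer to the same character, it helps to look at the bytes under each encoding. The following is only a minimal sketch in Python (used here purely for illustration; the helper name has_left_quote is hypothetical), assuming you know which encoding your input bytes are in:

```python
import re

LEFT_DQUOTE = "\u201c"  # U+201C LEFT DOUBLE QUOTATION MARK

# In Windows-1252 the left smart quote is the single byte 0x93;
# in UTF-8 the same character is the three bytes E2 80 9C.
cp1252_bytes = LEFT_DQUOTE.encode("cp1252")
utf8_bytes = LEFT_DQUOTE.encode("utf-8")


def has_left_quote(raw: bytes, encoding: str) -> bool:
    """Decode raw bytes with the stated encoding, then search by code point."""
    text = raw.decode(encoding)
    return re.search(LEFT_DQUOTE, text) is not None
```

Decoding with the wrong encoding is what makes the match fail: UTF-8 bytes read as Windows-1252 turn the quote into three unrelated characters, so the search for U+201C finds nothing.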

I recently came across some smart quotes which I couldn't eliminate using the regex-es mentioned in the above posts only. I had to do a trick which I found out entirely by trial and error:
First convert to iso-8859-1 using Encode::encode.
Next, convert the fancy quotes (using the 4 regular expressions mentioned above).
Next convert the string to UTF-8 using Encode::encode (I needed this since I was using the string in an iOS app and reading it from a SQLite database using “NSString stringWithUTF8String:” - may not be relevant to you).
Hope this helps someone.

Related

Powershell: Replace all occurrences of different substrings starting with same Unicode char (Regex?)

I have a string:
[33m[TEST][90m [93ma wonderful testorius line[90m ([37mbite me[90m) which ends here.
You are not able to see it (as stackoverflow will remove it when I post it) but there is a special Unicode char before every [xxm where xx is a variable number and [ as well as m are fixed. You can find the special char here: https://gist.githubusercontent.com/mlocati/fdabcaeb8071d5c75a2d51712db24011/raw/b710612d6320df7e146508094e84b92b34c77d48/win10colors.cmd
So, it is like this (the special char is displayed here with a $):
$[33m[TEST]$[90m $[93ma wonderful testorius line$[90m ($[37mbite me$[90m) which ends here.
Now, I want to remove all $[xxm substrings in this line as it is only for colored monitor output but should not be saved to a log file.
So the expected outcome should be:
[TEST] a wonderful testorius line (bite me) which ends here.
I tried to use RegEx but I don't understand it (perhaps it is extra confusing due to the special char and the open bracket) and I am not able to use wildcards in a normal .Replace ("this","with_that") operation.
How am I able to accomplish this?
In this simple case, the following -replace operation will do, but note that this is not sufficient to robustly remove all variations of ANSI / Virtual Terminal escape sequences:
# Sample input.
# Note: `e is used as a placeholder for ESC and replaced with actual ESC chars.
# ([char] 0x1b)
# In PowerShell (Core) 7+, "..." strings directly understand `e as ESC.
$formattedStr = '`e[33m[TEST]`e[90m `e[93ma wonderful testorius line`e[90m (`e[37mbite me`e[90m) which ends here.' -replace '`e', [char] 0x1b
# \x1b is a regex escape sequence that expands to an ESC char.
$formattedStr -replace '\x1b\[\d*m'
Generally speaking, it's advisable to look for options on programs producing such for-display-formatted strings to make them output plain-text strings instead, so that the need to strip escape sequences after the fact doesn't even arise.
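For comparison, here is a rough Python sketch of the same idea. It widens the pattern slightly to also accept semicolon-separated parameters (e.g. ESC[1;31m), which is an assumption beyond the question's sample; like the -replace above, it only handles color/style (SGR) sequences, not every escape sequence. The names SGR_PATTERN and strip_sgr are made up for this example.

```python
import re

ESC = "\x1b"

# Matches SGR (color/style) sequences such as ESC[33m or ESC[1;31m.
SGR_PATTERN = re.compile(r"\x1b\[[0-9;]*m")


def strip_sgr(text: str) -> str:
    """Remove SGR escape sequences, leaving the plain text."""
    return SGR_PATTERN.sub("", text)


sample = (
    f"{ESC}[33m[TEST]{ESC}[90m {ESC}[93ma wonderful testorius line"
    f"{ESC}[90m ({ESC}[37mbite me{ESC}[90m) which ends here."
)
print(strip_sgr(sample))
# prints: [TEST] a wonderful testorius line (bite me) which ends here.
```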

Regex Error - (incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)

I'm trying to do something which seems like it should be very simple. I'm trying to see if a specific string e.g. 'out of stock' is found within a page's source code. However, I don't care if the string is contained within an html comment or javascript. So prior to doing my search, I'd like to remove both of these elements using regular expressions. This is the code I'm using.
urls.each do |url|
  response = HTTP.get(url)
  if response.status.success?
    source_code = response.to_s
    # Remove comments
    source_code = source_code.gsub(/<!--(.*?)-->/su, '')
    # Remove scripts
    source_code = source_code.gsub(/<script(.*?)<\/script>/msu, '')
    if source_code.match(/out of stock/i)
      # Flag URL for further processing
    end
  end
end
This works for 99% of all the urls I tried it with, but certain urls have become problematic. When I try to use these regular expressions on the source code returned for the url "https://www.sunski.com" I get the following error message:
Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string))
The page is definitely UTF-8 encoded, so I don't really understand the error message. A few people on stack overflow recommended using the # encoding: UTF-8 comment at the top of the file, but this didn't work.
If anyone could help with this it would be hugely appreciated. Thank you!
The Net::HTTP standard library only returns binary (ASCII-8BIT) strings. See the long-standing feature request: Feature #2567: Net::HTTP does not handle encoding correctly. So if you want UTF-8 strings you have to manually set their encoding to UTF-8 with String#force_encoding:
source_code.force_encoding(Encoding::UTF_8)
If the website's character encoding isn't UTF-8 you have to implement a heuristic based on the Content-Type header or <meta>'s charset attribute but even then it might not be the correct encoding. You can validate a string's encoding with String#valid_encoding? if you need to deal with such cases. Thankfully most websites use UTF-8 nowadays.
Also, as @WiktorStribiżew already wrote in the comments, the regexp encoding modifiers s (Windows-31J) and u (UTF-8) aren't necessary here, and they only very rarely are. That holds especially for the latter, since modern Ruby defaults to UTF-8 (or, if sufficient, its subset US-ASCII) anyway. In other programming languages they may have a different meaning, e.g. in Perl s means single line.
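The general pattern (turn the raw bytes into text with a known encoding first, then apply text regexes) can be sketched in Python as follows. This is only an illustration, not the Ruby fix itself; the function names are hypothetical, and the charset detection is a deliberately crude heuristic that looks only at the Content-Type header.

```python
import re


def decode_body(raw: bytes, content_type: str = "") -> str:
    """Decode an HTTP body using the charset from Content-Type, else UTF-8."""
    match = re.search(r"charset=([\w-]+)", content_type, re.IGNORECASE)
    charset = match.group(1) if match else "utf-8"
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: fall back to UTF-8, replacing bad bytes.
        return raw.decode("utf-8", errors="replace")


def strip_comments_and_scripts(html: str) -> str:
    """Drop HTML comments and <script> elements before searching the text."""
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    return re.sub(r"<script.*?</script>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
```

Once the body is a properly decoded string, the later search for 'out of stock' no longer mixes a text pattern with a binary string.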

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for UTF-8 mode in regexes.
On the whole, you're better off doing an iconv('utf-8','ascii//TRANSLIT',$string) on both name and search term and comparing those.
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to do double dots there to catch it instead, or {1,4}. And even then you're going to have to verify the up to four bytes you grab between the M and the o form a singular valid UTF-8 character. This is all moot if PHP does unicode right, I haven't used it in years so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways; one as a single character (U+0101) and one as TWO unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely just only going to ever get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexps.
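The two spellings of ā mentioned above (precomposed U+0101 versus 'a' plus a combining macron) can be made to compare equal by normalizing and then dropping combining marks. Below is a small sketch in Python rather than PHP, purely to illustrate the technique; the function names fold and suggest are invented for this example, and stripping every combining mark is an assumption that may be too aggressive for some languages.

```python
import unicodedata


def fold(text: str) -> str:
    """Case-fold text and drop combining marks such as the macron."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold()


def suggest(term: str, titles: list[str]) -> list[str]:
    """Return the titles whose folded form contains the folded search term."""
    needle = fold(term)
    return [title for title in titles if needle in fold(title)]
```

Folding both the cached page titles and the user's input this way plays the same role as the iconv transliteration suggested earlier: "maori" then finds "Māori History" without matching "Moorings".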

How can I make a case-insensitive regexp match for Russian letters?

I have list of catalog paths and need to filter out some of them. My match pattern is in a non-Unicode encoding.
I tried the following:
require 5.004;
use POSIX qw(locale_h);
my $old_locale = setlocale(LC_ALL);
setlocale(LC_ALL, "ru_RU.cp1251");
@{$data -> {doc_folder_rights}} =
    grep {
        # catalog path pattern in $_REQUEST{q}
        $_->{doc_folder} =~/$_REQUEST{q}/i;
    }
    @{$data -> {doc_folder_rights}};
setlocale(LC_ALL, $old_locale);
What I need is case-insensitive regexp pattern matching when the pattern contains Russian letters.
There are several (potential) issues with your code:
Your code filters out all doc_folders that do not match the regexp in $_REQUEST{q}, however the question suggests that you want to do the opposite.
You might have an encoding issue. Setting the locale (using setlocale) changes Perl's handling of upper- and lower-case conversions, but it does not change any encoding. You need to ensure that $_REQUEST{q} is interpreted correctly.
For simplicity you can assume that any Perl-string contains Unicode-data in some internal representation that you need not know about in detail. Only when Perl does I/O there is an implicit or explicit conversion. When reading from stdin, ARGV or environment, Perl assumes that the bytes are encoded using the current locale and implicitly converts.
If you have an encoding issue, there are several ways to fix it:
Fix the environment in which Perl runs so that it knows about the correct locale from the very start. That will fix the implicit conversion.
In the unlikely case that $_REQUEST is loaded from a filehandle, you could explicitly tell Perl to convert using binmode($fh, ":encoding(cp1251)");. Do that prior to reading $_REQUEST.
There is the $string = Encode::decode(Encoding, $octets) function that tells Perl to forget its assumption about the encoding of $octets and instead treat the contents of $octets as byte-stream that needs to be converted to Unicode using Encoding. You need to do that before touching the contents of $octets, or strange things may happen.
Since $_REQUEST was probably loaded by some cgi-module, and was probably url-encoded in transit, you could just tell the cgi-module how to correctly do the decoding.
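The decode-first approach from the list above can be illustrated with a short Python sketch (shown in Python only to make the principle concrete; the function name matching_folders and the sample data are hypothetical). The incoming pattern is assumed to arrive as cp1251 bytes; once it is decoded to Unicode text, ordinary case-insensitive matching works for Cyrillic letters.

```python
import re


def matching_folders(folders: list[str], pattern_bytes: bytes,
                     encoding: str = "cp1251") -> list[str]:
    """Decode the byte pattern, then match it case-insensitively."""
    pattern = pattern_bytes.decode(encoding)  # bytes -> Unicode text
    regex = re.compile(pattern, re.IGNORECASE)
    return [folder for folder in folders if regex.search(folder)]
```

The key point is that case folding is applied to characters, not bytes, so the conversion has to happen before the regex engine ever sees the pattern.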

RegEx to parse or validate Base64 data

Is it possible to use a RegEx to validate, or sanitize Base64 data? That's the simple question, but the factors that drive this question are what make it difficult.
I have a Base64 decoder that can not fully rely on the input data to follow the RFC specs. So, the issues I face are issues like perhaps Base64 data that may not be broken up into 78 (I think it's 78, I'd have to double check the RFC, so don't ding me if the exact number is wrong) character lines, or that the lines may not end in CRLF; in that it may have only a CR, or LF, or maybe neither.
So, I've had a hell of a time parsing Base64 data formatted as such. Due to this, examples like the following become impossible to decode reliably. I will only display partial MIME headers for brevity.
Content-Transfer-Encoding: base64
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
Ok, so parsing that is no problem, and is exactly the result we would expect. And in 99% of the cases, using any code to at least verify that each char in the buffer is a valid base64 char, works perfectly. But, the next example throws a wrench into the mix.
Content-Transfer-Encoding: base64
http://www.stackoverflow.com
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
This is a version of Base64 encoding that I have seen in some viruses and other things that attempt to take advantage of some mail readers' desire to parse MIME at all costs, versus ones that go strictly by the book, or rather the RFC, if you will.
My Base64 decoder decodes the second example to the following data stream. And keep in mind here, the original stream is all ASCII data!
[0x]86DB69FFFC30C2CB5A724A2F7AB7E5A307289951A1A5CC81A5CC81CDA5B5C1B19481054D0D
2524810985CD94D8D08199BDC8814DD1858DAD3DD995C999B1BDDC8195E1B585C1B194B8
Anyone have a good way to solve both problems at once? I'm not sure it's even possible, outside of doing two transforms on the data with different rules applied, and comparing the results. However if you took that approach, which output do you trust? It seems that ASCII heuristics is about the best solution, but how much more code, execution time, and complexity would that add to something as complicated as a virus scanner, which this code is actually involved in? How would you train the heuristics engine to learn what is acceptable Base64, and what isn't?
UPDATE:
Due to the number of views this question continues to get, I've decided to post the simple RegEx that I've been using in a C# application for 3 years now, with hundreds of thousands of transactions. Honestly, I like the answer given by Gumbo the best, which is why I picked it as the selected answer. But to anyone using C#, and looking for a very quick way to at least detect whether a string, or byte[] contains valid Base64 data or not, I've found the following to work very well for me.
[^-A-Za-z0-9+/=]|=[^=]|={3,}$
And yes, this is just for a STRING of Base64 data, NOT a properly formatted RFC1341 message. So, if you are dealing with data of this type, please take that into account before attempting to use the above RegEx. If you are dealing with Base16, Base32, Radix or even Base64 for other purposes (URLs, file names, XML Encoding, etc.), then it is highly recommended that you read RFC4648 that Gumbo mentioned in his answer, as you need to be well aware of the charset and terminators used by the implementation before attempting to use the suggestions in this question/answer set.
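Note that the RegEx above matches what is INVALID, so a string passes when the pattern finds nothing. A quick sketch of that usage in Python (not the original C# code; the name looks_like_base64 is made up for this example):

```python
import re

# Any hit means the string is NOT plain Base64: a character outside the
# allowed set, an '=' followed by something other than '=', or three or
# more '=' at the end.
INVALID = re.compile(r"[^-A-Za-z0-9+/=]|=[^=]|={3,}$")


def looks_like_base64(s: str) -> bool:
    """Return True when the 'invalid' pattern finds nothing in s."""
    return INVALID.search(s) is None
```

This is a fast detection check rather than strict validation: it does not verify that the length is a multiple of four, and the character class as written also tolerates '-'.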
From the RFC 4648:
Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII data.
So it depends on the purpose of usage of the encoded data if the data should be considered as dangerous.
But if you’re just looking for a regular expression to match Base64 encoded words, you can use the following:
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
This one is good, but will match an empty String
This one does not match the empty string:
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{4})$
The answers presented so far fail to check that the Base64 string has all pad bits set to 0, as required for it to be the canonical representation of Base64 (which is important in some environments, see https://www.rfc-editor.org/rfc/rfc4648#section-3.5) and therefore, they allow aliases that are different encodings for the same binary string. This could be a security problem in some applications.
Here is the regexp that verifies that the given string is not just valid base64, but also the canonical base64 string for the binary data:
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$
The cited RFC considers the empty string as valid (see https://www.rfc-editor.org/rfc/rfc4648#section-10) therefore the above regex also does.
The equivalent regular expression for base64url (again, refer to the above RFC) is:
^(?:[A-Za-z0-9_-]{4})*(?:[A-Za-z0-9_-][AQgw]==|[A-Za-z0-9_-]{2}[AEIMQUYcgkosw048]=)?$
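To make the pad-bit point concrete, here is a small Python sketch using the standard-alphabet regex from above. The sample strings are chosen for illustration: "YQ==" is the canonical encoding of the single byte "a", while "YR==" carries the same leading bits but has non-zero pad bits. The names CANONICAL_B64 and is_canonical_base64 are made up for this example.

```python
import base64
import re

CANONICAL_B64 = re.compile(
    r"^(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$"
)


def is_canonical_base64(s: str) -> bool:
    """True only for Base64 text whose unused pad bits are all zero."""
    return CANONICAL_B64.fullmatch(s) is not None


print(base64.b64encode(b"a").decode())  # prints: YQ==
print(is_canonical_base64("YQ=="))      # prints: True
print(is_canonical_base64("YR=="))      # prints: False
```

Anything produced by a conforming encoder passes this check, so it can be used to reject aliases before the data is compared or signed.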
Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something like
my $sanitized_str = join q{}, grep {!/[^A-Za-z0-9+\/=]/} split /\n/, $str;
say decode_base64($sanitized_str);
might be what you want. It produces
This is simple ASCII Base64 for StackOverflow example.
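The same line-filtering idea, sketched in Python for comparison (the function name decode_lenient is hypothetical). It keeps only lines made up entirely of Base64 characters and uses splitlines() so that CR, LF and CRLF line endings are all handled, which is one of the concerns raised in the question.

```python
import base64
import re

B64_LINE = re.compile(r"[A-Za-z0-9+/=]+")


def decode_lenient(body: str) -> bytes:
    """Drop lines containing non-Base64 characters, then decode the rest."""
    kept = [line.strip() for line in body.splitlines()
            if B64_LINE.fullmatch(line.strip())]
    return base64.b64decode("".join(kept))
```

The limitation is the same as with the Perl one-liner: a stray line that happens to consist only of letters and digits would still be kept, so this is a heuristic, not a guarantee.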
The best regexp I have found so far is here:
https://www.npmjs.com/package/base64-regex
which in the current version looks like:
module.exports = function (opts) {
opts = opts || {};
var regex = '(?:[A-Za-z0-9+\/]{4}\\n?)*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)';
return opts.exact ? new RegExp('(?:^' + regex + '$)') :
new RegExp('(?:^|\\s)' + regex, 'g');
};
Here's an alternative regular expression:
^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$
It satisfies the following conditions:
The string length must be a multiple of four - (?=(.{4})*$)
The content must be alphanumeric characters or + or / - [A-Za-z0-9+/]*
It can have up to two padding (=) characters on the end - ={0,2}
It accepts empty strings
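The conditions above can be checked quickly with a short Python sketch (the names B64_LOOKAHEAD and is_base64_shape are made up for this example). It uses fullmatch so that a trailing newline is not silently accepted, which is a Python-specific precaution rather than part of the original regex.

```python
import re

B64_LOOKAHEAD = re.compile(r"^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$")


def is_base64_shape(s: str) -> bool:
    """Check alphabet, padding count, and length-multiple-of-four."""
    return B64_LOOKAHEAD.fullmatch(s) is not None


print(is_base64_shape("YW55IGNhcm5hbCBwbGVhcw=="))  # prints: True
print(is_base64_shape("YW55IGNhcm5hbCBwbGVhc3"))    # prints: False
print(is_base64_shape(""))                          # prints: True
```

Because it checks only the shape, this pattern does not enforce canonical pad bits, so a non-canonical string such as "YR==" is accepted as well.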
To validate base64 image we can use this regex
/^data:image\/(?:gif|png|jpeg|bmp|webp)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}/
private validBase64Image(base64Image: string): boolean {
  // Same pattern, extended to also accept SVG data URLs.
  const regex = /^data:image\/(?:gif|png|jpeg|bmp|webp|svg\+xml)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}/;
  // !! coerces the empty-string case to a real boolean.
  return !!base64Image && regex.test(base64Image);
}
The shortest regex to check RFC 4648 compliance while enforcing canonical encoding (i.e. all pad bits set to 0):
^(?=(.{4})*$)[A-Za-z0-9+/]*([AQgw]==|[AEIMQUYcgkosw048]=)?$
This is actually a mix of two of the answers above.
I found a solution that works very well
^(?:([a-z0-9A-Z+\/]){4})*(?1)(?:(?1)==|(?1){2}=|(?1){3})$
It will match the following strings
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
YW55IGNhcm5hbCBwbGVhcw==
YW55IGNhcm5hbCBwbGVhc3U=
YW55IGNhcm5hbCBwbGVhc3Vy
while it won't match any of these invalid ones
YW5#IGNhcm5hbCBwbGVhcw==
YW55IGNhc=5hbCBwbGVhcw==
YW55%%%%IGNhcm5hbCBwbGVhc3V
YW55IGNhcm5hbCBwbGVhc3
YW55IGNhcm5hbCBwbGVhc
YW***55IGNhcm5hbCBwbGVh=
YW55IGNhcm5hbCBwbGVhc==
YW55IGNhcm5hbCBwbGVhc===
My simplified version of Base64 regex:
^[A-Za-z0-9+/]*={0,2}$
Simplification is that it doesn't check that its length is a multiple of 4. If you need that - use other answers. Mine is focusing on simplicity.
To test it: https://regex101.com/r/zdtGSH/1