Wide character in print when using special characters - regex

I want to split a long sentence on the dot (.) character, as long as the dot is not wrapped in any kind of brackets, like (), （）, 【】, 〔〕, etc., and there are at least three words to its left. I use the following code, but it gives the Wide character in print warning.
my $a = "hi hello world. ";
$a .= "【 hi hello world. 】";
my @list = split /(?<!\.)(?<=(?:[\w'’]{1,30} ){2}[\w'’]{1,30})\. (?![^()〈〉【】（）\[\]〔〕\{\}]*[\)）\]〉】〕\}])/, $a;
The expected result would be for $a to split into:
hi hello world
and
【 hi hello world. 】
I'm using perl v5.31.3 on macOS Big Sur.
p.s. In the project, I'm also using XML::LibXML::Reader. I'm not sure whether adding use utf8::all; is allowed.

Decode your inputs, encode your outputs
The warning is the result of your attempt to write something other than bytes[1] to a file handle.
You need to encode your outputs, either explicitly, or by adding an encoding layer to the file handle.
use open ':std', ':encoding(UTF-8)';
If your source code is encoded using UTF-8, you need to tell perl that by using use utf8;. Otherwise, it assumes the source code is encoded using ASCII.[2]
If you accept arguments, these are also inputs that need to be decoded. You can use the following:
use Encode qw( decode_utf8 );
@ARGV = map { decode_utf8($_) } @ARGV;
[1] For this purpose, a byte is a value between 0 and 255 inclusive. And since we're talking about printing, we're talking about a character (which is to say string element) with such a value.
[2] Although string and regex literals are 8-bit clean.
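Putting the pieces together, here is a minimal sketch of the script with those fixes applied (the split pattern is taken verbatim from the question; its variable-length lookbehind needs perl 5.30 or newer, which matches the asker's 5.31.3, and perl may warn that it is experimental, which is unrelated to the wide-character issue):
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # the source below contains non-ASCII literals
use open ':std', ':encoding(UTF-8)';   # decode STDIN, encode STDOUT/STDERR as UTF-8

my $text = "hi hello world. ";
$text .= "【 hi hello world. 】";

# Split on ". " only when at least three words precede the dot and the dot
# is not wrapped in any kind of bracket (pattern from the question).
my @list = split /(?<!\.)(?<=(?:[\w'’]{1,30} ){2}[\w'’]{1,30})\. (?![^()〈〉【】（）\[\]〔〕\{\}]*[\)）\]〉】〕\}])/, $text;

print "$_\n" for @list;   # prints the two expected pieces without "Wide character in print" warnings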

Related

Odd substitution behaviour in perl substitution of rtf file

I am trying to use the perl module "RTF::Writer" for strings of text that must be a mix of formats. This is proving more complicated than I anticipated. I am just trying a test at the moment with:
$rtf->paragraph( \'\b', "Name: $name, le\cf1 ng\cf0 th $len" );
but this writes:
{\pard
\b
Name: my_name, le\'061 ng\'060 th 7
\par}
where \'061 should be \cf1 and \'060 should be \cf0.
I then tried to remedy this with a perl 1-liner:
perl -pi -e "s/\'06/\cf/g"
but this made things worse; I do not know what "\^F" represents in vi, but that is what it shows.
It did not matter if I escaped the backslashes or not.
Can anyone explain this behavior, and what to do about it?
Can anyone suggest how to get the RTF::Writer to create the file as desired from the start?
Thanks
\ is a special character in double-quoted string literals. If you want a string that contains \, you need to use \\ in the literal. To create the string \cf1, you need to use "\\cf1". ("\cf" means Ctrl-F, which is to say the byte 06.)
Alternatively, \ is only special if followed by \ or a delimiter in single-quoted string literals. So the string \cf1 could also be created from '\cf1'.
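A tiny throwaway script (not from the original post, purely for illustration) makes the difference visible:
use strict;
use warnings;

print "\\cf1", "\n";   # prints \cf1  (backslash escaped in a double-quoted literal)
print '\cf1',  "\n";   # prints \cf1  (backslash not special before 'c' in single quotes)
print "\cf1",  "\n";   # prints a Ctrl-F byte (0x06) followed by "1"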
Both produce the string you want, but they don't produce the document you want. That's because there's a second problem.
When you pass a string to RTF::Writer, it's expected to be text to render. But you are passing a string you wanted included as is in the final document. You need to pass a reference to a string if you want to provide raw RTF. \'...', \"..." and \$str all produce a reference to a string.
Fixed:
use RTF::Writer qw( );
my $name = "my_name";
my $rtf = RTF::Writer->new_to_file("greetings.rtf");
$rtf->prolog( 'title' => "Greetings, hyoomon" );
$rtf->paragraph( \'\b', "Name: $name, le", \'\cf1', "ng", \'\cf0', "th".length($name));
$rtf->close;
Output from the call to paragraph:
{\pard
\b
Name: my_name, le\cf1
ng\cf0
th7
\par}
Note that I didn't use the following because it would be a code injection bug:
$rtf->paragraph(\("\\b Name: $name, le\\cf1 ng\\cf0 th".length($name)));
Don't pass text such as the contents of $name using \...; use that for raw RTF only.

Including regex on variable before matching string

I'm trying to find and extract occurrences of words, read from one text file, in another text file. So far I can only find a word when it is written correctly and not munged ("a" changed to "#" or "i" changed to "1"). Is it possible to add a regex to my strings for matching, or something similar? This is my code so far:
sub getOccurrenceOfStringInFileCaseInsensitive
{
my $fileName = $_[0];
my $stringToCount = $_[1];
my $numberOfOccurrences = 0;
my @wordArray = wordsInFileToArray ($fileName);
foreach (@wordArray)
{
my $numberOfNewOccurrences = () = (m/$stringToCount/gi);
$numberOfOccurrences += $numberOfNewOccurrences;
}
return $numberOfOccurrences;
}
The routine receives the name of a file and the string to search. The routine wordsInFileToArray () just gets every word from the file and returns an array with them.
Ideally I would like to perform this search directly reading from the file in one go instead of moving everything to an array and iterating through it. But the main question is how to hard code something into the function that allows me to capture munged words.
Example: I would like to extract both lines from the file.
example.txt:
russ1#anh#ck3r
russianhacker
# this variable also will be read from a blacklist file
$searchString = "russianhacker";
getOccurrenceOfStringInFileCaseInsensitive ("example.txt", $searchString);
Thanks in advance for any responses.
Edit:
The possible substitutions will be defined by an user and the regex must be set to fit. A user could say that a common substitution is to change the letter "a" to "#" or even "1". The possible change is completely arbitrary.
When searching for a specific word ("russian" for example) this could be done with something like:
(m/russian/i); # would just match the word as it is
(m/russi[a#1]n/i); # would match the munged word
But I'm not sure how to do that if I have the string to match stored in a variable, such as:
$stringToSearch = "russian";
This is sort of a full-text search problem, so one method is to normalize the document strings before matching against them.
use strict;
use warnings;
use Data::Munge 'list2re';
...
my %norms = (
'#' => 'a',
'1' => 'i',
...
);
my $re = list2re keys %norms;
s/($re)/$norms{$1}/ge for @wordArray;
This approach only works if there's only a single possible "normalized form" for any given word, and may be less efficient anyway than just trying every possible variation of the search string if your document is large enough and you recompute this every time you search it.
As a note, your regex m/$randomString/gi should be m/\Q$randomString/gi, as you don't want any regex metacharacters in $randomString to be interpreted that way. See the docs for quotemeta.
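For example (a one-line sketch of the change, not from the original code), the match inside the loop becomes:
my $numberOfNewOccurrences = () = (m/\Q$stringToCount\E/gi);   # \Q..\E: treat the blacklist entry as literal text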
There are parts of the problem which aren't specified precisely enough (yet).
Some of the roll-your-own approaches, which depend on those details, are:
If user-defined substitutions are global (replace every occurrence of a character in every string), the user can submit a mapping, as a hash say, and you can fix them all (see the sketch after this list). The process will identify all candidates for the words (along with the actual, unmangled words, if found). There may be false positives, so also plan on some post-processing.
If the user can supply a list of substitutions along with the words that they apply to (the mangled or the corresponding unmangled ones), then we can have a more targeted run.
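As a rough sketch of the first approach (the substitution map, the helper name munged_pattern, and the test strings here are made up for illustration): expand each letter of the search word into a character class built from the user-supplied mapping, then match with the resulting pattern.
use strict;
use warnings;

# Hypothetical user-supplied substitutions: letter => other characters it may show up as.
my %subs = ( a => [ '#', '1' ], i => [ '1', '!' ], e => [ '3' ] );

sub munged_pattern {
    my ($word) = @_;
    my $pattern = join '', map {
        exists $subs{ lc $_ }
            ? '[' . quotemeta( join '', $_, @{ $subs{ lc $_ } } ) . ']'
            : quotemeta($_)
    } split //, $word;
    return qr/$pattern/i;
}

my $re = munged_pattern('russian');
print "matched\n" if 'russi#n' =~ $re;   # '#' is accepted where 'a' is expected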
Before this is clarified, here is another way: use a module for approximate ("fuzzy") matching.
The String::Approx module seems to fit quite a few of your requirements.
The match of the target with a given string relies on the notion of the Levenshtein edit distance: how many insertions, deletions, and replacements ("edits") it takes to make the given string into the sought target. The maximum accepted number of edits can be set.
A simple-minded example:
use warnings;
use strict;
use feature 'say';
use String::Approx qw(amatch);
my $target = qq(russianhacker);
my @text = qw(that h#cker was a russ1#anh#ck3r);
my @matches = amatch($target, ["25%"], @text);
say for @matches; #==> russ1#anh#ck3r
See the documentation for what the module offers, but at least two comments are in order.
First, note that the second argument in amatch specifies the acceptable percentage deviation from the target string. For this particular example we need to allow every fourth character to be "edited." So much room for tweaking can result in accidental matches, which then need to be filtered out, so there will be some post-processing to do.
Second -- we didn't catch the easier one, h#cker. The module takes a fixed "pattern" (target), not a regex, and can search for only one at a time. So, in principle, you need a pass for each target string. This can be improved a lot, but there'll be more work to do.
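One simple way around the one-target-at-a-time limitation is to loop over the targets; a rough sketch (the @targets blacklist here is made up for illustration):
use strict;
use warnings;
use String::Approx qw(amatch);

my @targets = qw(russianhacker hacker);               # hypothetical blacklist words
my @text    = qw(that h#cker was a russ1#anh#ck3r);

for my $target (@targets) {
    # amatch returns the elements of @text that approximately match $target,
    # allowing edits on up to 25% of the characters.
    my @hits = amatch( $target, ["25%"], @text );
    print "$target: @hits\n" if @hits;
}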
Please study the documentation; the module offers a whole lot more than this simple example.
I ended up solving the problem by including the regex directly in the variable that I'll use to match against the lines of my file. It looks something like this:
sub getOccurrenceOfMungedStringInFile
{
my $fileName = $_[0];
my $mungedWordToCount = $_[1];
my $numberOfOccurrences = 0;
open (my $inputFile, "<", $fileName) or die "Can't open file: $!";
$mungedWordToCount =~ s/a/\[a\#4\]/gi;
while (my $currentLine = <$inputFile>)
{
chomp ($currentLine);
$numberOfOccurrences += () = ($currentLine =~ m/$mungedWordToCount/gi);
}
close ($inputFile) or die "Can't close file: $!";
return $numberOfOccurrences;
}
Where the line:
$mungedWordToCount =~ s/a/\[a\#4\]/gi;
is just one of the substitutions that are needed; others can be added similarly.
I didn't know that Perl would just interpret the regex inside of the variable since I've tried that before and could only get the wanted results defining the variables inside the function using single quotes. I must've done something wrong the first time.
Thanks for the suggestions, people.

How do I implement IPv6 zero compression in Powershell?

For example, here is an IP address: fc06:0000:0002:0760:e000:7201:7003:06b7.
I'd like to replace ":0" (that is, a colon followed by one or more zeros) with ":", so it should look like this: fc06::2:760:e000:7201:7003:6b7
The code:
if (Test-Path $file){
foreach ($line in Get-Content $file){
$line=$line.Replace(":0+",":")
Write-Output $line
}
}
I know that the problem is with :0+. So: how can I say "one or more of the preceding character" (here: zero)? Plain :0+ doesn't work.
Thanks!
Use the built-in [System.Net.IPAddress] class:
$ip = [System.Net.IPAddress]"fc06:0000:0002:0760:e000:7201:7003:06b7"
$ip.IPAddressToString
or
$ip = [System.Net.IPAddress]::Parse("fc06:0000:0002:0760:e000:7201:7003:06b7")
$ip.IPAddressToString
I recommend this over using string replacement because it will deal with the issue of multiple groups of zeroes in different parts of the address, preventing you from creating an invalid IPv6 address. It will not collapse the first consecutive zeros into :: though.
Edit
Actually, this method will use :: if there is more than one group of consecutive zeroes; for example:
$ip = [System.Net.IPAddress]::Parse("fc06:0000:0000:0760:e000:7201:0000:06b7")
$ip.IPAddressToString
Will give you fc06::760:e000:7201:0:6b7.
Edit 2
Here's a way to get exactly what you want:
$ip = [System.Net.IPAddress]::Parse("fc06:0000:0002:0760:e000:7201:7003:06b7")
$ip.IPAddressToString -replace '(?<!(?:^|:)0?:.*?)(?:^|:)0:','::'
This lets the class do the heavy lifting, and in case there's a single zero that could be collapsed, it takes care of it.
Output: fc06::2:760:e000:7201:7003:6b7
(also tested with other addresses that have different sizes of zero blocks in different positions).
Edit 3:
As @Ron Maupin points out, RFC 5952 section 4.2.2 states:
The symbol "::" MUST NOT be used to shorten just one 16-bit 0 field.
For example, the representation 2001:db8:0:1:1:1:1:1 is correct, but
2001:db8::1:1:1:1:1 is not correct.
Given this, I strongly recommend that you don't use the code in Edit 2, or otherwise attempt to collapse a single 16-bit field.
Instead, use the built-in class's zero compression which appears to already be RFC compliant.
The string .replace() method uses literal string arguments, not regular expressions.
Switch to the -replace operator:
$line = $line -replace ':0+', ':'

How to detect Arabic chars using perl regex?

I'm parsing some html pages, and need to detect any Arabic char inside..
Tried various regexes, but no luck.
Does anyone know working way to do that?
Thanks
Here is the page I'm processing: http://pastie.org/2509936
And my code is:
#!/usr/bin/perl
use LWP::UserAgent;
@MyAgent::ISA = qw(LWP::UserAgent);
# set inheritance
$ua = LWP::UserAgent->new;
$q = 'http://pastie.org/2509936';
$request = HTTP::Request->new('GET', $q);
$response = $ua->request($request);
if ($response->is_success) {
if ($response->content=~/[\p{Script=Arabic}]/g) {
print "found arabic";
} else {
print "not found";
}
}
If you're using Perl, you should be able to use the Unicode script matching operator. /\p{Arabic}/
If that doesn't work, you'll have to look up the range of Unicode characters for Arabic, and test them something like this /[\x{0600}\x{0601}...\x{06FF}]/.
EDIT (as I have obviously wandered into tchrist's area of expertise). Skip using $response->content, which always returns a raw byte string, and use $response->decoded_content, which applies any decoding hints it gets from the response headers.
The page you are downloading is UTF-8 encoded, but you are not reading it as UTF-8 (in fairness, there are no hints on the page about what the encoding is
[update: the server does return the header Content-Type: text/html; charset=utf-8, though]).
You can see this if you examine $response->content:
use List::Util 'max';
my $max_ord = max map { ord } split //, $response->content;
print "max ord of response content is $max_ord\n";
If you get a value less than 256, then you are reading this content in as raw bytes, and your strings will never match /\p{Arabic}/. You must decode the input as UTF-8 before you apply the regex:
use Encode;
my $content = decode('utf-8', $response->content);
# now check $content =~ /\p{Arabic}/
Sometimes (and now I am wading well outside my area of expertise) the page you are loading contains hints about how it is encoded, and $response->content may already be decoded correctly. In that case, the decode call above is unnecessary and may be harmful. See other SO posts on detecting the encoding of an arbitrary string.
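Putting this together, a minimal sketch of the original script using decoded_content (which lets HTTP::Response decode the body based on the Content-Type header); the URL is the one from the question:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua       = LWP::UserAgent->new;
my $response = $ua->get('http://pastie.org/2509936');

if ( $response->is_success ) {
    # decoded_content honours the charset from the response headers, so the
    # regex sees characters rather than raw UTF-8 bytes.
    if ( $response->decoded_content =~ /\p{Arabic}/ ) {
        print "found arabic\n";
    }
    else {
        print "not found\n";
    }
}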
Just for the record, at least in .NET regexps, you need to use \p{IsArabic}.

what do I use to match MS Word chars in regEx

I need to find and delete all the non-standard ASCII chars that are in a string (usually delivered there by MS Word). I'm not entirely sure what these characters are... like the fancy apostrophe and the dual directional quotation marks and all that. Is that Unicode? I know how to do it ham-handedly ([a-z] etc.) but I was hoping there was a more elegant way to just exclude anything that isn't on the keyboard.
Probably the best way to handle this is to work with character sets, yes, but for what it's worth, I've had some success with this quick-and-dirty approach, the character class
[\x80-\x9F]
this works because the problem with "Word chars" for me is the ones which are illegal in Unicode, and I've got no way of sanitising user input.
Microsoft apps are notorious for using fancy characters like curly quotes, em-dashes, etc., that require special handling without adding any real value. In some cases, all you have to do is make sure you're using one of their extended character sets to read the text (e.g., windows-1252 instead of ISO-8859-1). But there are several tools out there that replace those fancy characters with their plain-but-universally-supported equivalents. Google for "demoronizer" or "AsciiDammit".
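For instance, in Perl the "read it with the right character set" part might look like this (the file name is only a placeholder):
use strict;
use warnings;

# Read the Word-exported text as Windows-1252 so curly quotes, dashes and
# friends arrive as proper characters instead of stray high bytes.
open my $fh, '<:encoding(cp1252)', 'word_export.txt' or die "Can't open file: $!";
while ( my $line = <$fh> ) {
    # ... process $line ...
}
close $fh;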
I usually use a JEdit macro that replaces the most common of them with a more ascii-friendly version, i.e.:
hyphens and dashes to minus sign;
suspension dots (single char) to multiple dots;
list item dot to asterisk;
etc.
It is easily adaptable to Word/OpenOffice/whatever, and can of course be modified to suit your needs. I wrote an article on this topic:
http://www.megadix.it/node/138
Cheers
What you are probably looking at are Unicode characters in UTF-8 format. If so, just escape them in your regular expression language.
My solution to this problem is to write a Perl script that gives me all of the characters that are outside of the ASCII range (0 - 127):
#!/usr/bin/perl
use strict;
use warnings;
my %seen;
while (<>) {
for my $character (grep { ord($_) > 127 } split //) {
$seen{$character}++;
}
}
print "saw $_ $seen{$_} times, its ord is ", ord($_), "\n" for keys %seen;
I then create a mapping of those characters to what I want them to be and replace them in the file:
#!/usr/bin/perl
use strict;
use warnings;
my %map = (
chr(128) => "foo",
#etc.
);
while (<>) {
s/([\x{80}-\x{FF}])/$map{$1}/;
print;
}
What I would do is use AutoHotKey, Python SendKeys, or some sort of Visual Basic script that would send all possible keys (also with Shift applied and unapplied) to a Word document.
In SendKeys it would be a script of the form
chars = ''.join([chr(i) for i in range(ord('a'), ord('z') + 1)])
nums = ''.join([chr(i) for i in range(ord('0'), ord('9') + 1)])
specials = ''.join(['-', '=', '\\', '/', "'", '.', ',', '`'])
all = chars + nums + specials
SendKeys.SendKeys("""
{LWIN}
{PAUSE .25}
r
winword.exe{ENTER}
{PAUSE 1}
%(all)s
+(%(all)s)
"testQuotationAndDashAutoreplace"{SPACE}-{SPACE}a{SPACE}{BS 3}{LEFT}{BS}
{Alt}{PAUSE .25}{SHIFT}
changeLanguage
%(all)s
+%(all)s
"""%{'all':all})
Then I would save the document as text and use it as a database of all displayable keys in your keyboard layout (you might want to change the default input language more than once to capture absolutely all displayable characters).
If the char is in the resulting text document, it is displayable; otherwise it is not. No need for regexp. You can of course afterwards embed the character ranges within a script or a program.