Odd substitution behaviour in perl substitution of rtf file

I am trying to use the perl module "RTF::Writer" for strings of text that must be a mix of formats. This is proving more complicated than I anticipated. I am just trying a test at the moment with:
$rtf->paragraph( \'\b', "Name: $name, le\cf1 ng\cf0 th $len" );
but this writes:
{\pard
\b
Name: my_name, le\'061 ng\'060 th 7
\par}
where \'061 should be \cf1 and \'060 should be \cf0.
I then tried to remedy this with a perl 1-liner:
perl -pi -e "s/\'06/\cf/g"
but this made things worse: vi now shows "\^F" where the sequence was, and I do not know what that represents.
It did not matter if I escaped the backslashes or not.
Can anyone explain this behavior, and what to do about it?
Can anyone suggest how to get the RTF::Writer to create the file as desired from the start?
Thanks

\ is a special character in double-quoted string literals. If you want a string that contains \, you need to use \\ in the literal. To create the string \cf1, you need to use "\\cf1". ("\cf" means Ctrl-F, which is to say the byte 06.)
Alternatively, \ is only special if followed by \ or a delimiter in single-quoted string literals. So the string \cf1 could also be created from '\cf1'.
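A minimal sketch of the difference between these literals, and of why the one-liner in the question produced ^F. The variable names are illustrative only, and the sample line is copied from the question's output.

```perl
use strict;
use warnings;

my $ctrl    = "\cf1";    # "\cf" is Ctrl-F (byte 0x06), followed by the character "1"
my $escaped = "\\cf1";   # backslash, "c", "f", "1"
my $single  = '\cf1';    # the same four characters as $escaped

printf "ctrl:    %d chars, first is 0x%02X\n", length($ctrl),    ord($ctrl);
printf "escaped: %d chars, first is 0x%02X\n", length($escaped), ord($escaped);
print  "single and escaped are ", ( $single eq $escaped ? "equal" : "different" ), "\n";

# The same rule applies inside s///: the backslash must be doubled in the
# pattern (to match a literal \) and in the replacement (to produce one).
my $line = q{Name: my_name, le\'061 ng\'060 th 7};
( my $patched = $line ) =~ s/\\'06/\\cf/g;
print "$patched\n";
```

The substitution at the end is only there to explain the ^F; it patches the symptom after the file has been written.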
Both produce the string you want, but they don't produce the document you want. That's because there's a second problem.
When you pass a string to RTF::Writer, it's expected to be text to render. But you are passing a string you wanted included as is in the final document. You need to pass a reference to a string if you want to provide raw RTF. \'...', \"..." and \$str all produce a reference to a string.
Fixed:
use RTF::Writer qw( );
my $name = "my_name";
my $rtf = RTF::Writer->new_to_file("greetings.rtf");
$rtf->prolog( 'title' => "Greetings, hyoomon" );
$rtf->paragraph( \'\b', "Name: $name, le", \'\cf1', "ng", \'\cf0', "th".length($name));
$rtf->close;
Output from the call to paragraph:
{\pard
\b
Name: my_name, le\cf1
ng\cf0
th7
\par}
Note that I didn't use the following because it would be a code injection bug:
$rtf->paragraph(\("\\b Name: $name, le\\cf1 ng\\cf0 th".length($name)));
Don't pass text such as the contents of $name using \...; use that for raw RTF only.
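To see why this matters, here is a rough sketch of what happens when text containing RTF's structural characters is treated as raw RTF. The helper rtf_escape_sketch is hypothetical and is not RTF::Writer's own code; it only assumes that \, { and } are the characters that need escaping in plain text.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT RTF::Writer's implementation: escape the three
# characters that have structural meaning in RTF.
sub rtf_escape_sketch {
    my ($text) = @_;
    $text =~ s/([\\{}])/\\$1/g;
    return $text;
}

my $name = 'a}b\c';    # hostile or merely unlucky input

my $unsafe = "\\b Name: $name";                        # $name ends up in the document as raw RTF
my $safe   = "\\b Name: " . rtf_escape_sketch($name);  # $name is treated as text

print "$unsafe\n";
print "$safe\n";
```

In the unsafe string the } from $name would close a group early and \c would be read as a control word. Passing $name as a plain string to paragraph leaves that escaping to the module.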

Related

Is there a Perl regex metacharacter or a way to specify a default value, if a subpattern capture does not match?

Here's the idea. I am parsing command line options but doing it across the entire command line, not by each @ARGV element separately.
program --format="%H:%M:%S" --timeout 12 --nofail
I want the parsing to work with these cases.
--name=value, easy to parse
--name value, pretty easy
--name no value, default the value to 1
Here is the regex which works, except it cannot do the missing value case:
%options = "@ARGV" =~ /--([A-Za-z]+)[= ]([^-]\S*)/g;
That is, match --name=value or --name value but not --name --name; the latter is two names, not a name-value pair.
If a --name has no value following it that matches the second capture in the regex, is there a way, within the regex, to specify a default (in my case 1, to indicate "true")? For example, if a --name has no argument, like --nofail, then set that argument to 1.
Actually, in asking this I figured out a workaround using separate match statements which is fine. However, just out of curiosity, the question still stands, is there a Perl regex way to have a default if a submatch fails?

I don't see how to return a list reflecting a changed input from a regex alone. To change the input we need the s{}{}er operator, since we need code in its replacement part to analyze captures and decide what to change; and we get back a string, not a list, which needs further processing (split).
Here is one such take, with minimal intrusion of code.
Match name and value, with = or space between them, and if value ($2) is undefined give it a value; so we need /e to implement that.† While we are at it, put a space between all name-value pairs. This goes under /r so that the changed string is returned and passed through split:
my %arg = split ' ',
$args =~ s{ --(\w+) (?: =|\s+|\z) ([^-]\S*)? }{ $1.' '.($2//'1 ') }ergx;
The split can be done by another regex instead but that's still extra processing.
A complete program (with more flags added to the input)
use warnings;
use strict;
use feature 'say';
my $args = shift // q(--fmt="%H:%M" --f1 --time 12 --f2 --f3);
say $args;
my %arg = split ' ',
$args =~ s{ --(\w+) (?: =|\s+|\z) ([^-]\S*)? }{ $1 . ' ' . ($2//'1 ') }ergx;
say "$_ => $arg{$_}" for keys %arg;
This prints as expected. But note that there may be edge cases, and in particular having a space inside (a quoted) argument value, like "%H %M", would require a far more complex pattern.
I presume that the regex ask is for play/study. Normally this is done with libraries, like Getopt::Long. If that is somehow not possible then processing @ARGV term by term is nice and easy -- and fast.
† In order to actually do "if value ($2) is undefined give it a value" we need to run code in the replacement part, which is done under the /e modifier.
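For comparison, a small sketch of the library route using Getopt::Long, which ships with Perl. The option names come from the question's example command line; the @args array stands in for what the shell would place in @ARGV, and the choice of GetOptionsFromArray (rather than GetOptions) is only to keep the sketch self-contained.

```perl
use strict;
use warnings;
use Getopt::Long qw( GetOptionsFromArray );

# What the shell would pass for:
#   program --format="%H:%M:%S" --timeout 12 --nofail
my @args = ( '--format=%H:%M:%S', '--timeout', '12', '--nofail' );

my %opt;
GetOptionsFromArray(
    \@args, \%opt,
    'format=s',     # takes a string value
    'timeout=i',    # takes an integer value
    'nofail',       # plain flag: stored as 1 when present
) or die "bad options\n";

print "$_ => $opt{$_}\n" for sort keys %opt;
```

A flag without a value ends up as 1 in %opt, which is the default the question asks for, at the cost of having to declare the options up front.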

Wide character in print when using special characters

I want to split a long sentence with the dot . character as long as the dot is not wrapped in any kinds of brackets, like (), （）, 【】, 〔〕, etc. and there should be at least three words on its left. I use the following code. But it gives the Wide character in print error.
my $a = "hi hello world. ";
$a .= "【 hi hello world. 】";
my @list = split /(?<!\.)(?<=(?:[\w'’]{1,30} ){2}[\w'’]{1,30})\. (?![^()〈〉【】（）\[\]〔〕\{\}]*[\)）\]〉】〕\}])/, $a;
The expected result would be $a splits into:
hi hello world
and
【 hi hello world. 】
I'm using perl v5.31.3 on macOS Big Sur.
p.s. In the project, I'm also using XML::LibXML::Reader. I'm not sure whether adding use utf8::all; is allowed.

Decode your inputs, encode your outputs
The warning is the result of your attempt to write something other than bytes[1] to a file handle.
You need to encode your outputs, either explicitly, or by adding an encoding layer to the file handle.
use open ':std', ':encoding(UTF-8)';
If your source code is encoded using UTF-8, you need to tell perl that by using use utf8;. Otherwise, it assumes the source code is encoded using ASCII.[2]
If you accept arguments, these are also inputs that need to be decoded. You can use the following:
use Encode qw( decode_utf8 );
@ARGV = map { decode_utf8($_) } @ARGV;
[1] For this purpose, a byte is a value between 0 and 255 inclusive. And since we're talking about printing, we're talking about a character (which is to say string element) with such a value.
[2] Although string and regex literals are 8-bit clean.
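A short sketch of the decode/encode round trip described above. The sample string is the bracketed text from the question, written with \x{...} escapes (U+3010 and U+3011) so that the sketch does not depend on the encoding of the source file; with use utf8; the literal brackets could be typed directly.

```perl
use strict;
use warnings;
use Encode qw( decode_utf8 encode_utf8 );

# U+3010 and U+3011 are the lenticular brackets from the question.
my $text = "\x{3010} hi hello world. \x{3011}";

my $bytes = encode_utf8($text);     # characters -> UTF-8 bytes (what a file handle needs)
my $chars = decode_utf8($bytes);    # UTF-8 bytes -> characters (what a regex should see)

binmode STDOUT, ':encoding(UTF-8)'; # encoding layer, so print gets bytes and does not warn
printf "%d characters, %d bytes\n", length($text), length($bytes);
print "$chars\n";
```

The difference between the two lengths is the reason for the rule: string operations such as split should run on the decoded characters, and only the final output should be bytes.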

How to remove backslashes from QString?

Using the QNetworkAccessManager get method I am receiving JSON from a URL.
Doing: qDebug()<<(QString)reply->readAll(); the result is:
"\r\n[{\"id\":\"1\",\"name\":\"Jhon\",\"surname\":\"Snow\",\"phone\":\"358358358\"}]"
So I am doing strReply = strReply.simplified(); , and the result is:
"[{\"id\":\"1\",\"name\":\"Jhon\",\"surname\":\"Snow\",\"phone\":\"358358358\"}]"
But I can't parse that as JSON to use it in my Qt program.
So I think I need to remove every backslash \ and obtain:
"[{"id":"1","name":"Jhon","surname":"Snow","phone":"358358358"}]"
I tried strReply.remove(QRegExp( "\\\" ) ); but any odd number of consecutive \ makes the compiler treat everything after the last \ as part of the string literal.

You're probably running into qDebug's feature that escapes quotes and newlines. Your string most probably doesn't actually have any backslashes in it.
When you're trying to print a string using qDebug(), you need to use qDebug().noquote() if you don't want qDebug() to artificially insert backslashes in the output.
So your string should be fine. It doesn't have any backslashes in it at all.

As described in the documentation, you can remove a character with the remove function:
QString t = "Ali Baba";
t.remove(QChar('a'), Qt::CaseInsensitive);
// Will result in "li Bb"
You can put '\\' instead of 'a' to remove the backslashes from your QString.

Extract a text string with regex

I have a large set of data I need to clean with OpenRefine.
I am quite bad with regex and I can't think of a way to get what I want,
which is extracting a text string between quotes that includes lots of special characters like " ' / \ @ # -
In each cell, it has the same format
caption': u'text I want to extract', u'likes':
Any help would be highly appreciated!

If you want to extract a text string that includes lots of special characters and is located between quotes ' ', you can do it in general this way:
\'[\S\s]*?\'
In your case, if you want to extract only the middle quoted string from this: caption': u'text I want to extract', u'likes': , try this regex:
(?<=u\')[\V]*?(?=\'\,)
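As a rough check of the lookaround idea outside OpenRefine, here is the same approach in Perl against the sample cell from the question. It uses .*? in place of [\V]*? and assumes the caption itself never contains the sequence ', (quote followed by comma); how OpenRefine's own regex functions behave is not shown here.

```perl
use strict;
use warnings;

my $cell = q{caption': u'text I want to extract', u'likes':};

# The lookbehind anchors just after u', the lookahead stops just before ',
my ($caption) = $cell =~ /(?<=u')(.*?)(?=',)/;

print "$caption\n";
```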

We designed OpenRefine with a few smart functions to handle common cases such as yours without using Regex.
Two other cool ways to handle this in OpenRefine.
Using the drop-down menu:
Edit Column
Split into several columns
by separator: '
Using smartSplit
(string s, optional string sep)
returns: array
Returns the array of strings obtained by splitting s with separator sep. Handles quotes properly. Guesses tab or comma separator if "sep" is not given.
value.smartSplit("'")[2]

what do I use to match MS Word chars in regEx

I need to find and delete all the non standard ascii chars that are in a string (usually delivered there by MS Word). I'm not entirely sure what these characters are... like the fancy apostrophe and the dual directional quotation marks and all that. Is that unicode? I know how to do it ham-handed [a-z etc. etc.] but I was hoping there was a more elegant way to just exclude anything that isn't on the keyboard.

Probably the best way to handle this is to work with character sets, yes, but for what it's worth, I've had some success with this quick-and-dirty approach, the character class
[\x80-\x9F]
This works because, for me, the problematic "Word chars" are the ones which are illegal in Unicode, and I've got no way of sanitising user input.

Microsoft apps are notorious for using fancy characters like curly quotes, em-dashes, etc., that require special handling without adding any real value. In some cases, all you have to do is make sure you're using one of their extended character sets to read the text (e.g., windows-1252 instead of ISO-8859-1). But there are several tools out there that replace those fancy characters with their plain-but-universally-supported equivalents. Google for "demoronizer" or "AsciiDammit".

I usually use a jEdit macro that replaces the most common of them with a more ASCII-friendly version, i.e.:
hyphens and dashes to minus sign;
suspension dots (single char) to multiple dots;
list item dot to asterisk;
etc.
It is easily adaptable to Word/OpenOffice/whatever, and can of course be modified to suit your needs. I wrote an article on this topic:
http://www.megadix.it/node/138
Cheers

What you are probably looking at are Unicode characters in UTF-8 format. If so, just escape them in your regular expression language.

My solution to this problem is to write a Perl script that gives me all of the characters that are outside of the ASCII range (0 - 127):
#!/usr/bin/perl
use strict;
use warnings;
my %seen;
while (<>) {
for my $character (grep { ord($_) > 127 } split //) {
$seen{$character}++;
}
}
print "saw $_ $seen{$_} times, its ord is ", ord($_), "\n" for keys %seen;
I then create a mapping of those characters to what I want them to be and replace them in the file:
#!/usr/bin/perl
use strict;
use warnings;
my %map = (
chr(128) => "foo",
#etc.
);
while (<>) {
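# /g replaces every match on the line, not just the first;
# characters without an entry in %map are left unchanged
s/([\x{80}-\x{FF}])/exists $map{$1} ? $map{$1} : $1/ge;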
print;
}
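A compact variant of the same mapping idea, aimed at the characters the question mentions. This is a sketch under two assumptions: the text has already been decoded into characters, and the seven code points in %ascii_for are the ones you care about (the table is illustrative, not a complete list of what Word produces). Anything non-ASCII that is not in the table is deleted.

```perl
use strict;
use warnings;

# Illustrative table: typographic characters and an ASCII stand-in for each.
my %ascii_for = (
    "\x{2018}" => "'",      # left single quote
    "\x{2019}" => "'",      # right single quote / apostrophe
    "\x{201C}" => '"',      # left double quote
    "\x{201D}" => '"',      # right double quote
    "\x{2013}" => '-',      # en dash
    "\x{2014}" => '--',     # em dash
    "\x{2026}" => '...',    # horizontal ellipsis
);

sub to_ascii_sketch {
    my ($text) = @_;
    # Replace each non-ASCII character by its stand-in, or drop it if unmapped.
    $text =~ s{([^\x00-\x7F])}{ $ascii_for{$1} // '' }ge;
    return $text;
}

my $word_text = "\x{201C}It\x{2019}s fine\x{201D} \x{2013} caf\x{E9}\x{2026}";
my $clean     = to_ascii_sketch($word_text);
print "$clean\n";
```

Dropping unmapped characters matches the question's request to delete them; the first script above is still useful for finding out which characters actually occur before deciding on the table.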

What I would do is, use AutoHotKey, or python SendKeys or some sort of visual basic that would send me all possible keys (also with shift applied and unapplied) to a Word document.
In SendKeys it would be a script of the form
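chars = ''.join(chr(i) for i in range(ord('a'), ord('z') + 1))  # a through z inclusive
nums = ''.join(chr(i) for i in range(ord('0'), ord('9') + 1))   # 0 through 9 inclusive
specials = ''.join(['-', '=', '\\', '/', '.', ',', '`'])        # joined into a string so it concatenates below
all = chars + nums + specials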
SendKeys.SendKeys("""
{LWIN}
{PAUSE .25}
r
winword.exe{ENTER}
{PAUSE 1}
%(all)s
+(%(all)s)
"testQuotationAndDashAutoreplace"{SPACE}-{SPACE}a{SPACE}{BS 3}{LEFT}{BS}
{Alt}{PAUSE .25}{SHIFT}
changeLanguage
%(all)s
+%(all)s
"""%{'all':all})
Then I would save the document as text, and use it as a database of all displayable keys in your keyboard layout (you might want to switch the default input language more than once to collect absolutely all displayable characters).
If the char is in the result text document - it is displayable, otherwise not. No need for regexp. You can of course afterward embed the characters range within a script or a program.
