I'm writing a little program in C++, and came across a strange error:
src/Makefile/Tool.cpp:42:3: error: stray ‘\302’ in program
src/Makefile/Tool.cpp:42:3: error: stray ‘\240’ in program
I'm writing this program in Vim and the corresponding line (showing hidden characters) is:
>--->---std::vector<std::string> { "--debug" }$
This question is not about resolving this error, since I only have to copy the line back and the cause of the error disappears.
It seems that the error is caused by some characters that Vim hides even after activating all the relevant options!
The question is about what could have caused those errors.
"\302\240" is UTF-8 for U+00A0 NO-BREAK SPACE. Vim won’t normally highlight it as anything special, so it’s possible for one to sneak in even if you have 'list' mode enabled.
You can highlight them with:
:set listchars+=nbsp:.
or any character you like.
As mentioned above, this is caused by invisible characters in your source.
One way to find them is to dump the file with od -c, which prints non-printable bytes as octal values so that you can "see" these characters:
od -c MyClass.hpp
Then you can see the "strangers" in the octal flow:
0001240 t s t r i n g & n a m e )
0001260 { 302 240 t h i s - > n a m e =
0001300 n a m e ; } \n \n \n \t \t \t \t /
The two bytes 302 240 are the cause of messages like:
error: stray ‘\302’ in program
You can then remove them, and rebuild.
For me this problem came from copying my code over from a web browser.
I tried doing a :%s/ / /g in Vim to replace all the odd spaces with real spaces, but this failed.
I deleted the leading white space of all lines reporting this error and inserted the space characters again. This is labour intensive, but appears to be the only solution I could find.
I had the same issue, and it was the character encoding of the spaces at the start of each line. This happened because I copied from a notes program that was synced with Exchange Server and iCloud. All I needed to do was use Replace All in Notepad to replace all the strange spaces with normal ones, and everything compiled normally again.
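The replace-all step can also be done with a few lines of script. This is only a sketch, assuming the stray characters are U+00A0 NO-BREAK SPACE and that none of them are intentional (for example inside a string literal); the names replace_nbsp and fix_file are hypothetical.

```python
from pathlib import Path


def replace_nbsp(text: str) -> str:
    """Replace every U+00A0 NO-BREAK SPACE with an ordinary space."""
    return text.replace("\u00a0", " ")


def fix_file(path: str) -> None:
    """Rewrite a UTF-8 file in place with no-break spaces replaced."""
    p = Path(path)
    p.write_text(replace_nbsp(p.read_text(encoding="utf-8")), encoding="utf-8")


if __name__ == "__main__":
    print(repr(replace_nbsp("\u00a0\u00a0int x;")))
```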
I am running some simple C++ code in VSCode on a Mac with the Code Runner extension.
There are no warnings or compile errors, but in the terminal output window there is an annoying '%' symbol appended to the output.
What causes it, and how can I get rid of it?
The inverse+bold % indicates a lack of '\n' at the end of the output. zsh prints it so that you can see unterminated lines in a command's output. You can get rid of it by ending the output with a newline, using either std::endl or '\n':
std::cout << z << std::endl;
// or
std::cout << z << '\n';
Keep in mind that std::endl flushes the output buffer and '\n' doesn't.
From the zsh documentation:
PROMPT_SP <D>
Attempt to preserve a partial line (i.e. a line that did not end with a newline) that would otherwise be covered up by the command prompt due to the PROMPT_CR option. This works by outputting some cursor-control characters, including a series of spaces, that should make the terminal wrap to the next line when a partial line is present (note that this is only successful if your terminal has automatic margins, which is typical).
When a partial line is preserved, by default you will see an inverse+bold character at the end of the partial line: a % for a normal user or a # for root. If set, the shell parameter PROMPT_EOL_MARK can be used to customize how the end of partial lines are shown.
NOTE: if the PROMPT_CR option is not set, enabling this option will have no effect. This option is on by default.
I use a translator tool to translate English into Simplified Chinese.
Now there is an issue with the period.
In English, we end a sentence with a full stop ".".
In Simplified Chinese, it is "。", which looks like a small circle.
The translation tool mistakenly adds this "small circle" full stop to every major heading.
Is there a way to use a regex or another method to scan the translated content and remove any Chinese full stop when the line has 20 characters or fewer?
Some test data like below
<h1>这是一个测试。<h1>
这是一个测试,这是一个测试而已,希望去掉不需要的。
测试。
这是一个测试,这是一个测试而已,希望去掉不需要的第二行。
It should turn into:
<h1>这是一个测试<h1>
这是一个测试,这是一个测试而已,希望去掉不需要的。
测试
这是一个测试,这是一个测试而已,希望去掉不需要的第二行。
Difference:
Line 1 has only 10 characters, so its Chinese full stop should be removed.
Line 4 is a subheading with only 4 characters, so its full stop should be removed too.
By the way, I was told 1 Chinese word is two English characters.
Is this possible?
I'm using approach 2:
> Second: maybe this one is more accurate: if there is no comma in this line, it should not have a full stop.
to determine whether a full stop 。 should be removed.
Regex
/^(?=.*。)(?!.*,)([^。]*)。/mg
^ start of a line
(?=.*。) match a line that contains 。
(?!.*,) match a line that doesn't contain ,
([^。]*)。 match everything that is not a full stop, up to the first full stop, and capture it in group 1
Substitution
$1
Check the test cases here
Note that this removes only the first full stop on each line.
If you want to remove all the full stops, you can try (?:\G|^)(?=.*。)(?!.*,)(.*?)。 but this only works in regex engines that support \G, such as PCRE.
Also, if you want to combine the two approaches (the line has no comma and is at most 20 characters long), you can try ^(?=.{1,20}$)(?=.*。)(?!.*,)([^。]*)。
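For readers who want to try these patterns outside an online tester, here is a small sketch using Python's re module, run against the test data from the question. Two things are assumptions rather than part of the answer above: the comma class [,,] accepts both the ASCII and the full-width comma, and the name strip_heading_stops is made up. Python counts each Chinese character as one character, so .{1,20} means 20 characters regardless of script.

```python
import re

# A line that contains a full stop but no comma loses its first full stop.
# [,,] matches the ASCII comma and the full-width comma (an assumption).
NO_COMMA = re.compile(r"^(?=.*。)(?!.*[,,])([^。]*)。", re.MULTILINE)

# Same rule, restricted to lines of at most 20 characters.
SHORT_NO_COMMA = re.compile(
    r"^(?=.{1,20}$)(?=.*。)(?!.*[,,])([^。]*)。", re.MULTILINE
)


def strip_heading_stops(text, pattern=NO_COMMA):
    """Replace each match with group 1, i.e. drop the matched full stop."""
    return pattern.sub(r"\1", text)


sample = "\n".join([
    "<h1>这是一个测试。<h1>",
    "这是一个测试,这是一个测试而已,希望去掉不需要的。",
    "测试。",
    "这是一个测试,这是一个测试而已,希望去掉不需要的第二行。",
])

if __name__ == "__main__":
    print(strip_heading_stops(sample))
```

Passing SHORT_NO_COMMA as the second argument applies the combined rule instead.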
I frequently encounter malformed UTF-8 characters that break my code. I have read some (not all) related questions/answers on Stack Overflow, but found nothing specific to Raku/Perl 6. Is there a fast way to remove these pesky characters from strings? The predefined character classes in "https://docs.raku.org/language/regexes#Predefined_character_classes" just won't do it:
Example: from REPL:
> say "â " ~~ /\w/ # you have to have a space following the "a" with "^" for it to work
「â」
> say "�" ~~ /\w/ # without the space, the character doesn't look normal
Malformed UTF-8 at line 1 col 6
> say "â ".chars # looks like 2 chars, but it says 1 char
1
> say "â ".comb.[0] # strange, the pesky char makes the space precede the cursor as I type
â
> say "â".comb.[0 ] # strange, the pesky char makes the space precede the cursor as I type
â
> say "â".comb.[0] # there is a space following ']' or it won't work
â
> say "â".comb.[0 ] # very strange, must have space before ']'
â
> say "â".comb
(â)
> say "â".comb.[0] .ord # # same here, very strange, it makes space precede the cursor
226
> my $a = Buf.new(226)
Buf:0x<E2>
> say $a.decode
Malformed termination of UTF-8 string
in block <unit> at <unknown file> line 1
> say $a.decode('utf8-c8')
xE2
> for @$a { say $_.chr; }
â
> say (@$a).elems
1
> say "â " ~~ / <alpha> / # again, must have space in the quote
「â」
alpha => 「â」
> say "â " ~~ / <cntrl> /
Nil
This is very troublesome. How can I remove these non-UTF-8 chars? Is there a predefined character class for all good UTF-8 chars, or for good ASCII chars that are model citizens?
Hopefully someone will have a better answer. In the meantime...
There are several very different things going on in your question.
Is there a fast method to find and remove/replace non-ASCII or malformed utf8 characters?
There is supposed to be a nice, obvious, fairly simple one:
say .decode: replacement => '�'
given $buf-that's-supposed-to-be-utf8
This should decode the same way a plain slurp does, except that, instead of just giving up on the decode when it encounters "Malformed UTF-8", it should just replace malformed data with the replacement character you've specified and continue as best it can.
Unfortunately (as far as I know) this doesn't work due to bugs in Rakudo/MoarVM, as outlined in my answer to "decode with replacement does not seem to work".
I did not file an issue at the time I wrote that SO. Your new SO has prompted me to file two bug reports:
.decode's replacement option didn't work in Rakudo v2019.03.01 and presumably still doesn't #3509
decoder replacement options didn't work in Rakudo v2019.03.01 and presumably still don't #1245
Some other options are given in the answers to "error message: Malformed UTF-8".
I see in your REPL examples you've tried .decode('utf8-c8'). This may be your best bet within Raku as it stands.
If none of the above is helpful, I think you're stuck for now with using an external tool to preprocess files before they get to Raku.
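As an illustration of what such an external preprocessing step could look like, here is a minimal Python sketch. It is not Raku and not the author's method; it only shows the decode-with-replacement behaviour described above, using the errors argument of Python's bytes.decode. The function name decode_lossy is hypothetical.

```python
def decode_lossy(data: bytes, drop: bool = False) -> str:
    """Decode UTF-8, tolerating malformed byte sequences.

    With drop=False each malformed sequence becomes U+FFFD (the
    replacement character); with drop=True it is silently removed.
    """
    return data.decode("utf-8", errors="ignore" if drop else "replace")


if __name__ == "__main__":
    # 0xE2 on its own is the start of a three-byte sequence that never ends,
    # the same byte as Buf.new(226) in the question.
    print(decode_lossy(b"abc\xe2"))
    print(decode_lossy(b"abc\xe2", drop=True))
```

The cleaned text can then be written back out as valid UTF-8 before Raku reads it.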
Is there a predefined character class for all good utf-8 chars
utf8 data is not characters. It's just bytes. The data encodes characters, or at least it's supposed to, but it's very important to keep encodings and characters separate in your mind.
If you know how old-fashioned telegrams work, it's like that. There's a message in characters. And then morse code for transmitting it. They're very different things.
When you see "Malformed UTF-8" or similar, it means the decoder is choking on some part of the data (the bytes). They make no sense to it as characters. It's like morse code that doesn't follow the rules for morse code.
Such data is considered to be confusing crap at best and dangerous crap at worst. The Unicode standard requires that it is entirely eliminated before you can do anything with it.
The obvious friendly solution is to replace crap with a user specified replacement character as you asked. In contrast, a regex character class is both the wrong tool and too late.
Example: from REPL
This is another whole ball of wax.
There's:
The encoding used by your (terminal on your) local system;
The characters you see rendered, and the indication of the cursor, when you use your local system;
What's in your cut/paste buffer when you copy from your repl display;
What your browser does with that buffer when you paste into the edit window for an SO question;
What SO's servers do with the contents of the edit window when you click the Post your question button and when SO renders your question;
What my local system, browser, terminal, cut/paste buffer, etc. are doing when I look at your SO question;
Etc.
This complexity exists even if both our systems and both you and I are doing what we're supposed to be doing. So, sure, something is amiss with the cursor and other issues, but I'm not going to try to nail that down with this answer because, unlike the first part of your question I answered above, it's not really to do with raku/do.
I was trying to determine whether the issues you're seeing are due to the REPL, or some other factor. Here's a link to a gist from your input code:
https://gist.github.com/jubilatious1/b99def4cb2d02e6cef5c15b3fd102447
I removed spaces inside the doublequotes to force an error (if any). I inserted a semicolon at the end of every code line before the comment (if any). I moved one problematic line, say $a.decode;, to the very end. Then I tested the gist with a fairly recent version of Rakudo:
~$ raku --version
Welcome to 𝐑𝐚𝐤𝐮𝐝𝐨™ v2020.10.
Implementing the 𝐑𝐚𝐤𝐮™ programming language v6.d.
Built on MoarVM version 2020.10.
Here's the output I see:
~$ raku lisprogtor_unicode_SO.p6
「â」
Nil
1
â
â
â
â
(â)
226
xE2
â
1
「â」
alpha => 「â」
Nil
----
Malformed termination of UTF-8 string
in block <unit> at lisprogtor_unicode_SO.p6 line 36
I'm wondering if this means some/many of the Unicode errors you've encountered are either 1) confined to the REPL, or 2) have been resolved since you first posted?
HTH.
(updated 11/24/2020).
So I'm trying to write a regex to use for a grep command on an SVN status command. I want only files with conflicts to be displayed, and if it's a tree conflict, the extra information SVN provides about it (which is on a line with a > character).
So, here's my description of how SVN outputs lines with conflicts, and then I'll show my regex:
[Single Char Code][Spaces][Letter "C"][Space]Filename
[Spaces][Letter "C"][Space]Filename
[Letter "C"][Space]Filename
This is what I have so far. The second part, after the OR condition, works fine to get the extra line for a tree conflict. It's the first part, where I'm trying to match lines with the letter C under very specific conditions, that I can't get right.
Anyway, I'm not exactly the greatest with Regex, so some help here (plus an explanation of what I'm doing wrong, so I can learn from this) would be great.
CONFLICTS=($(svn status | grep "^(.)*C\s\|>"))
Thanks.
This regex should match your lines:
CONFLICTS=$(svn status | grep '^[ADMRCXI?!~ ]\? *C')
^[ADMRCXI?!~ ]\?: zero or one (\?) status character from the set [ADMRCXI?!~ ] at the start of the line (^)
 *: zero or more spaces
C: the character C
I removed the extra parentheses surrounding the command substitution.
You have to read the description of the svn st output more closely, and try your regex against at least one tree conflict.
I'll start it for you:
> The first seven columns in the output are each one character wide:
>...
> Seventh column: Whether the item is the victim of a tree conflict
>...
> 'C' tree-Conflicted
and note: theoretically any of these 7 columns can be non-empty
Example status output for a tree conflict:
M       wc/bar.c
!     C wc/qaz.c
      >   local missing, incoming edit upon update
D       wc/qax.c
A quick, rough draft of the regexp:
^[enumerate_all_chars_here]{6}C\s
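The column-based idea can be sketched in Python as well. This is a rough illustration, not a replacement for the grep command: it assumes the seven one-character status columns described above, treats a C in column 1 (text conflict), column 2 (property conflict) or column 7 (tree conflict) as a conflict, and also keeps the indented ">" detail line. The names conflict_lines, CONFLICT and TREE_DETAIL are made up.

```python
import re

# Seven one-character status columns, then whitespace, then the path.
# 'C' in column 1, 2 or 7 marks a text, property or tree conflict.
CONFLICT = re.compile(r"^(?:C.{6}|.C.{5}|.{6}C)\s")

# Extra information about a tree conflict is indented and starts with '>'.
TREE_DETAIL = re.compile(r"^\s+>")


def conflict_lines(status_output: str):
    """Return the lines of `svn status` output that describe conflicts."""
    return [
        line
        for line in status_output.splitlines()
        if CONFLICT.match(line) or TREE_DETAIL.match(line)
    ]


sample = "\n".join([
    "M       wc/bar.c",
    "!     C wc/qaz.c",
    "      >   local missing, incoming edit upon update",
    "D       wc/qax.c",
    "C       wc/foo.c",
])

if __name__ == "__main__":
    for line in conflict_lines(sample):
        print(line)
```

To use it on a working copy, feed it the captured output of svn status, for example from subprocess.run(["svn", "status"], capture_output=True, text=True).stdout.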
I have a document in Vim which contains encoding-related characters I want to get rid of (e.g. replace with "").
I have trouble describing their origin in general, so here are examples of how they are displayed in different editors (the tool I want to use to get rid of them is Vim).
in vim:
Oś<9c>więcim (<9c> is a part I would like to get rid of)
in Geany:
(but copy-paste copies without this 'square' sign)
in LibreOffice Calc:
Please note that there are other Polish-language-specific characters in my text which are displayed correctly.
Q: how can I remove it with a regex in Vim?
You can enter the <9c> via :help i_CTRL-V_digit by pressing Ctrl + V (on Windows, often Ctrl + Q instead), followed by X and the hexadecimal number:
:%s/<C-V>x9c//g
Alternatively, the special \%x9c regular expression atom matches that value:
:%s/\%x9c//g
Alternatively, you could also just yank the character when the cursor is on it via yl, and then paste in the :s command-line via <C-R>".
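If the cleanup has to happen outside Vim, the same removal can be scripted. The sketch below assumes the file decodes as UTF-8 and that what Vim shows as <9c> is the C1 control character U+009C in the decoded text; it removes the whole C1 range U+0080 to U+009F, which is an assumption that goes a little beyond the single character in the question. The name strip_c1_controls is hypothetical.

```python
import re

# C1 control characters, U+0080 through U+009F; Vim displays U+009C as <9c>.
C1_CONTROLS = re.compile(r"[\u0080-\u009f]")


def strip_c1_controls(text: str) -> str:
    """Remove C1 control characters, leaving letters such as ś and ę alone."""
    return C1_CONTROLS.sub("", text)


if __name__ == "__main__":
    print(strip_c1_controls("Oś\u009cwięcim"))
```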