Perl utf8 newline replacement - regex

EDIT:
Sorry! It seems the weird line-break behaviour for Arabic and other text is due to something else entirely. Unfortunately, I noticed it at the same time I was playing with this script.
I'm trying to reformat the text field given by TTYtter in Perl. (Source here)
The text is defined as "The actual UTF-8 text of the status update. See twitter-text for details on what is currently considered valid characters." (From Twitter dev pages).
Using
$txtin = $ref->{'text'};
$txtin =~ s/\\n\s*/ \\ /g;
This strips out and replaces newlines fine for 'English' (western?) text, but does some odd things for other languages.
Greek and Arabic text seems to get newlines added to it by this replacement.
I've tried matching on \p{Zl} (found in CPAN-perlunicode.pod), e.g.:
$txtin =~ s/\p{Zl}\s*/ \\ /g;
But that leaves \n in westernized tweets, so it's not matching what I'd expected / hoped for.
So basically, my question is: How do I replace all newline / cr characters in a utf8 blob of text (a tweet), that will work for cyrillic, arabic, kanji & western content in Perl?
Thank you!
EDIT: If you missed the first edit and read this far, this is a question based on a false assumption. It wasn't the newline stripping causing the problem. Apparently it's a text wrap problem totally unrelated to the above. This question now flagged for moderation (since I can't delete it).

\\ matches a single backslash character, so /\\p{Zl}/ matches a backslash followed by the literal string p{Zl}. To match the Unicode property \p{Zl}, you'll want either one more or one fewer backslash at the beginning of the regular expression, depending on whether the input contains backslashes.

s/\\n\s*/ \\ /g does not strip out and replace newlines fine for 'English' (western?) text[1], and it doesn't add newlines to Greek and Arabic text. I don't know what you actually used, but to replace a newline optionally followed by whitespace, use the following on the decoded text:
s/\n\s*/.../g
\n matches a newline.
\\n matches the two characters \n.
\p{Zl} matches U+2028 LINE SEPARATOR (but not a newline).
\\p{Zl} matches the 6 characters \p{Zl}.
A newline is a newline, no matter what other characters might be near it.
How do I replace all newline / cr characters in a utf8 blob of text (a tweet), that will work for cyrillic, arabic, kanji & western content in Perl?
A newline is a newline no matter what other characters may be near it. Same goes for carriage returns.
utf8::decode( my $unicode_chars = $utf8_bytes );
$unicode_chars =~ s/[\r\n]/.../g;
utf8::encode( $utf8_bytes = $unicode_chars );
Or maybe you're asking how to replace vertical whitespace characters?
utf8::decode( my $unicode_chars = $utf8_bytes );
$unicode_chars =~ s/\v/.../g;
utf8::encode( $utf8_bytes = $unicode_chars );
Notes:
[1] Unless they happen to follow a backslash and an "n".
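A minimal sketch of the decode / replace / encode sequence above, wrapped in a helper so it can be tried on non-Latin input. The helper name flatten_newlines and the sample bytes are assumptions for illustration; the " \ " replacement text is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of TTYtter): takes UTF-8 encoded bytes and
# returns UTF-8 encoded bytes, with each CR or LF (plus any whitespace that
# follows it) replaced by " \ ".
sub flatten_newlines {
    my ($utf8_bytes) = @_;
    utf8::decode( my $chars = $utf8_bytes );    # bytes -> characters
    $chars =~ s/[\r\n]\s*/ \\ /g;               # \r and \n are real CR / LF here
    utf8::encode($chars);                       # characters -> bytes
    return $chars;
}

# Greek alpha and beta (as UTF-8 bytes) separated by CRLF plus indentation.
my $tweet = "\xCE\xB1\r\n  \xCE\xB2";
print flatten_newlines($tweet), "\n";

# By contrast, /\\n/ only matches a literal backslash followed by "n".
print "literal backslash-n matched\n" if 'a\nb' =~ /\\n/;
print "real newline not matched\n"    if "a\nb" !~ /\\n/;
```

The surrounding characters play no part in the match, so the same substitution should behave identically for Greek, Arabic, Cyrillic or Latin text.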

Ahhh. Apparently this is one way to close it. See EDIT's in original. Apparently it's a word wrap problem, not related to stripping out newlines.

Related

Regex to exclude quoted strings

I know that there are tons of similar questions; I have read hundreds, but with my limited knowledge of English and my even more limited knowledge of regex, I am still in the fog.
I need to process a fairly large text file that contains paragraphs in two formats: enclosed in quotes or not. In both cases a paragraph can contain one or more carriage returns. I have to process only the paragraphs enclosed in quotes. So "This is \r a phrase" must be processed (actually I have to replace the \r with a dummy character like '#'), while 'This is \r a comment' must be excluded.
I tried this pattern: "[\s\S(\r)]+"
This correctly selects only the enclosed paragraphs, but the regex debugger does not report the \r group to be replaced.
Try this pattern: "[\s\S](\r)[\s\S]"
You need to escape the \ character, since \r means something specific with RegEx.
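For the stated goal (replace \r with '#' only inside double-quoted paragraphs), one possible approach in Perl is to match each quoted span as a whole and rewrite the carriage returns within it. This is only a sketch: it assumes double quotes are balanced and never escaped, and both helper names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every carriage return in a string with '#'.
sub mark_cr {
    my ($s) = @_;
    $s =~ tr/\r/#/;
    return $s;
}

# Hypothetical helper: apply mark_cr only to "double-quoted" spans.
# Assumes double quotes are balanced and never escaped.
sub mark_cr_in_double_quotes {
    my ($text) = @_;
    $text =~ s/("[^"]*")/ mark_cr($1) /ge;   # /e runs the replacement as code
    return $text;
}

my $sample = qq{"This is \r a phrase" 'This is \r a comment'};
print mark_cr_in_double_quotes($sample), "\n";
```

Note that in the pattern from the question, the parentheses sit inside the square brackets, where they are literal characters rather than a capturing group.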

Vim regex to replace zero or one spaces at the end of a line with two spaces [duplicate]

From the question "How to replace a character for a newline in Vim?": you have to use \r when replacing text with a newline, like this:
:%s/%/\r/g
But when replacing ends of lines and newlines with a character, you can do it like this:
:%s/\n/%/g
What section of the manual documents these behaviors, and what's the reasoning behind them?
From http://vim.wikia.com/wiki/Search_and_replace :
When Searching
...
\n is newline, \r is CR (carriage return = Ctrl-M = ^M)
When Replacing
...
\r is newline, \n is a null byte (0x00).
From vim docs on patterns:
\r matches <CR>
\n matches an end-of-line. When matching in a string instead of buffer text, a literal newline character is matched.
Another aspect to this is that \0, which is traditionally NULL, is taken in s//\0/ to mean "the whole matched pattern" (which, by the way, is redundant with, and longer than, &).
So you can't use \0 to mean NULL, so you use \n.
So you can't use \n to mean \n, so you use \r.
So you can't use \r to mean \r, but I don't know who would want to add that char on purpose.
:help NL-used-for-Nul
Technical detail:
<Nul> characters in the file are stored as <NL> in memory. In the display
they are shown as "^@". The translation is done when reading and writing
files. To match a <Nul> with a search pattern you can just enter CTRL-@ or
"CTRL-V 000". This is probably just what you expect. Internally the
character is replaced with a <NL> in the search pattern. What is unusual is
that typing CTRL-V CTRL-J also inserts a <NL>, thus also searches for a <Nul>
in the file. {Vi cannot handle <Nul> characters in the file at all}
First of all, open :h :s to see the section "4.2 Substitute" of documentation on "Change". Here's what the command accepts:
:[range]s[ubstitute]/{pattern}/{string}/[flags] [count]
Notice the description about pattern and string
For the {pattern} see |pattern|.
{string} can be a literal string, or something
special; see |sub-replace-special|.
So now you know that the search pattern and replacement patterns follow different rules.
If you follow the link to |pattern|, it takes you to the section that explains the whole regexp patterns used in Vim.
Meanwhile, |sub-replace-special| takes you to the subsection of "4.2 Substitute", which contains the patterns for substitution, among which is \r for line break/split.
(The shortcut to this part of manual is :h :s%)

Notepad++ weird bug? when replace a huge string

I'm getting CR LF characters after replacing a huge string with Notepad++.
Moreover, the replacement adds line breaks in places where I didn't ask for them.
Weird...
Here is the print screen:
Those CR LF characters weren't there before I used the string replace (or were they hidden? And if so, why did the string replace reveal them?)
Is there a quick (regex?) solution to remove them?
Is there any quick (regex?) solution to remove any character that is NOT [a-z] [A-Z] [0-9] ["|'], or any non-UTF-8 character?
You can just replace \r\n with nothing, and that will remove the line breaks.
To remove any character that is not [a-z][A-Z][0-9]["|'], replace [^A-Za-z0-9"|'] with nothing. But be careful that you've thought of everything you do want to keep: spaces, tabs, other punctuation, etc.
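To illustrate the warning about what the negated class throws away, here is the same character class tried in Perl rather than in Notepad++. This is a sketch only; the helper name keep_allowed and the sample string are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only ASCII letters, digits, ", | and '.
# Everything else, including spaces, commas and CR LF, is removed.
sub keep_allowed {
    my ($s) = @_;
    $s =~ s/[^A-Za-z0-9"|']//g;
    return $s;
}

print keep_allowed(qq{Hello, World! "x"|'y'\r\n}), "\n";
```

The space and the comma disappear along with the line break, which is why it is worth listing every character you want to keep before running the replacement.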

Removing repeated characters, including spaces, in one line

I currently have a string, say $line='55.25040882, 3,,,,,,', that I want to remove all whitespace and repeated commas and periods from. Currently, I have:
$line =~ s/[.,]{2,}//;
$line =~ s/\s{1,}//;
Which works, as I get '55.25040882,3', but when I try
$line =~ s/[.,\s]{2,}//;
It pulls out the ", " and leaves the ",,,,,,". I want to retain the first comma and just get rid of the whitespace.
Is there a way to elegantly do this with one line of regex? Please let me know if I need to provide additional information.
EDIT: Since there were so many solutions, I decided to update my question with the answer below:
$line =~ s/([.,])\1{1,}| |\t//g;
This removes all repeated periods and commas, removes all spaces and tabs, while retaining the \r and \n characters. There are so many ways to do this, but this is the one I settled for. Thanks so much!
This is mostly a critique of Rohit's answer, which seems to contain several misconceptions about character class syntax, especially the negation operator (^). Specifically:
[(^\n^\r)\s] matches ( or ^ or ) or any whitespace character, including linefeed (\n) and carriage return (\r). In fact, they're each specified twice (since \s matches them too), though the class still only consumes one character at a time.
^[\n\r]|\s matches a linefeed or carriage return at the beginning of the string, or any whitespace character anywhere (which makes the first part redundant, since any whitespace character includes linefeed and carriage return, and anywhere includes the beginning of the string).
Inside a character class, the caret (^) negates the meaning of everything that follows iff it appears immediately after the opening [; anywhere else, it's just a caret. All other metacharacters except \ lose their special meanings entirely inside character classes. (But the normally non-special characters, - and ], become special.)
Outside a character class, ^ is an anchor.
Here's how I would write the regex:
$line =~ s/([.,])\1+|\h+//g;
Explanation:
Since you finally went with ([.,])\1{1,}, I assume you want to match repeated periods or repeated commas, not things like ., or ,.. Success with regexes means learning to look at text the way the regex engine does, and it's not intuitive. You'll help yourself a lot if you try to describe each problem the way the regex engine would, if it could speak.
{1,} is not incorrect, but why add all that clutter to your regex when + does the same thing?
\h matches horizontal whitespace, which includes spaces and tabs, but not linefeeds or carriage returns. (That only works in Perl, AFAIK. In Ruby/Oniguruma, \h matches a hex digit; in every other flavor I know of, it's a syntax error.)
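A short sketch of how that substitution behaves on the string from the question and on a line with mixed punctuation. The wrapper name squeeze and the second sample string are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution discussed above.
sub squeeze {
    my ($line) = @_;
    # ([.,])\1+ : a period or comma followed by one or more copies of itself
    # \h+       : horizontal whitespace only (needs Perl 5.10 or later)
    $line =~ s/([.,])\1+|\h+//g;
    return $line;
}

print squeeze('55.25040882, 3,,,,,,'), "\n";
print squeeze("1.,2\t3\n");
```

The second call shows the two points made in the explanation: a mixed pair such as ., is left alone because the backreference must repeat the same character, and the trailing linefeed survives because \h does not match it.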
You can try using:
my $line = '55.25040...882, 3,,,,,,';
$line =~ s/[^\S\n\r]|[.,]{2,}//g;  # removes whitespace other than \n and \r, and runs of 2+ periods/commas
print $line;
OUTPUT:
55.25040882,3
[^\S\n\r]|[.,]{2,} -> matches either [^\S\n\r] or [.,]{2,}
[.,]{2,} -> matches a run of two or more periods and/or commas
[^\S\n\r] -> matches any whitespace character except linefeed (\n) and carriage return (\r)

How to test to see if a string is only whitespace in perl

What's a good way to test whether a string consists only of whitespace characters, using a regex?
if ($string =~ /^\s*$/) {
    # is 100% whitespace (remember, 100% of the empty string is also whitespace)
    # use /^\s+$/ if you want to exclude the empty string
}
(I have decided to edit my post to include concepts in the below conversation with tobyodavies.)
In most instances, you want to determine whether or not something is whitespace, because whitespace is relatively insignificant and you want to skip over a string consisting of merely whitespace. So, I think what you want to determine is whether or not there are significant characters.
So I tend to use the reverse test, $str =~ /\S/, which determines the predicate "string contains at least one significant character".
However, to apply your particular question, this can be determined in the negative by testing: $str !~ /\S/
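The two tests can be put side by side to see how they treat the empty string. This is a minimal sketch; the helper names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers; the names are for illustration only.

# True when the string has no non-whitespace character (empty string included).
sub is_blank {
    my ($s) = @_;
    return $s !~ /\S/ ? 1 : 0;
}

# True when the string is non-empty and consists only of whitespace.
sub is_nonempty_blank {
    my ($s) = @_;
    return $s =~ /^\s+$/ ? 1 : 0;
}

for my $s ( '', " \t\n", ' x ' ) {
    printf "blank=%d nonempty_blank=%d\n", is_blank($s), is_nonempty_blank($s);
}
```

The only input on which the two disagree is the empty string, so the choice between them comes down to whether an empty value should count as "only whitespace".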
Your regex statement should look for ^\s+$. It will require at least one whitespace.
In case you were wondering, "white space is defined as [\t\n\f\r\p{Z}]". See http://userguide.icu-project.org/strings/regexp.
\t Match a HORIZONTAL TABULATION, \u0009.
\n Match a LINE FEED, \u000A.
\f Match a FORM FEED, \u000C.
\r Match a CARRIAGE RETURN, \u000D.
\p{UNICODE PROPERTY NAME} Match any character with the specified Unicode Property.