Notepad++ weird bug? when replace a huge string

I'm getting CR LF characters after replacing a huge string with Notepad++.
Moreover, the replacement adds line breaks in places where I didn't ask for them.
Weird...
Here is the print screen:
Those CR LF characters weren't there before I used string replace (or were they hidden? And if so, why did the string replace reveal them?)
Is there a quick (regex?) solution to remove them?
Is there any quick (regex?) solution to remove any character that is NOT [a-z] [A-Z] [0-9] ["|'], or any non-UTF-8 character?

You can just replace \r\n with nothing, and that will remove the line breaks.
To remove any character that is not [a-z][A-Z][0-9]["|'], replace [^A-Za-z0-9"|'] with nothing. But be careful that you've thought of everything you do want to keep: spaces, tabs, other punctuation, etc.
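To see what that whitelist replacement does before running it on a real file, here is a minimal Python sketch. It uses Python's re module rather than Notepad++'s own regex engine, and the function name strip_unwanted, its keep parameter, and the sample string are all made up for illustration.

```python
import re

# Whitelist from the answer: letters, digits, double quote, pipe, single quote.
ALLOWED = "A-Za-z0-9\"|'"

def strip_unwanted(text: str, keep: str = "") -> str:
    """Remove every character outside the whitelist.

    `keep` (hypothetical helper parameter) lists extra characters to preserve,
    e.g. a space or a tab.
    """
    pattern = re.compile("[^" + ALLOWED + re.escape(keep) + "]")
    return pattern.sub("", text)

sample = "It's a \"test\" | 42\r\nnext line"

# With the bare whitelist, spaces and the CR LF pair disappear as well.
print(strip_unwanted(sample))

# Keeping spaces shows why the whitelist needs some thought first.
print(strip_unwanted(sample, keep=" "))
```

The first call also drops the spaces, which is the "be careful" case the answer warns about.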

Related

Regex in "Find and Replace"; how to match \n (newline character)?

I'm not sure whether I just haven't found the correct way or whether this is a bug.
I wanted to check a reference manual, but there doesn't seem to be one.
In Jupyter's Find and Replace screen, there's an icon .* to check when I want to use regex.
Mostly it works fine, but if I try to match a line break (\n), it does not match it unless it is that very character. For example, I want to match every line that doesn't end with , and join that line to the next one. I'd match [^,]\n and replace with ,, which would remove the line break. I could try [^,]$, but replacing this wouldn't remove the line break.
How do I do this?

There are several variants of the newline character, e.g. \r or \n (or the two together as \r\n).
Regex Pattern
Anyway, here is a pattern that uses a negative lookbehind to check that there is no comma before the line break, and that covers the newline variants.
(?<!,)(\r?(\n|\r))
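For anyone who wants to try the pattern outside Jupyter's dialog, here is a small Python sketch using the re module. The function name join_unterminated_lines is made up. One caveat I'm adding myself: on Windows line endings, for a line that ends with a comma, the lookbehind blocks the match at \r but the pattern can still match the bare \n that follows (because the character before it is \r, not a comma). So this sketch normalizes line endings to \n first, which is my assumption and not part of the answer.

```python
import re

# Pattern from the answer: a line break that is NOT preceded by a comma.
JOIN_PATTERN = re.compile(r"(?<!,)(\r?(\n|\r))")

def join_unterminated_lines(text: str) -> str:
    """Join every line that does not end with a comma to the next line."""
    # Assumption: normalize \r\n and lone \r to \n so the lookbehind
    # always sees the last visible character of the line.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return JOIN_PATTERN.sub("", normalized)

print(repr(join_unterminated_lines("a,\nb\nc,\nd")))

# Without normalization, only the \n of a CRLF after a comma is removed.
print(repr(JOIN_PATTERN.sub("", "a,\r\nb")))
```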

RegExp to match visible non-letter characters before line break

I am working on a VBScript regexp that detects a tag that contains text followed by a CRLF before the closing tag.
I am currently using \w+[:;?!.,""\)\]-~]*(\s)*(\r\n\s*)(<\/.*>)
Reading the expression from the end: I match any closing tag, a CRLF optionally followed by whitespace, optional whitespace before the CRLF, and, optionally, any visible non-letter characters that follow a word.
This is to match things like
myword! CRLF</tag>
mywordCRLF</tag>
myword CRLF</tag>
myword...CRLF </tag>
etc.
However, I do not want to match the case below, because I need to detect tags that contain TEXT followed by a line break.
</otherclosingtag> CRLF </tag>
I am concerned about the \w+[:;?!.,""\)\]-~]* part. It doesn't look right to me, since I would need to list quite a large number of characters there.
I tried replacing it with \S or \W, but they both seem to match the CR and LF characters as well.
Any ideas?
Cheers!

How about using a non-greedy modifier:
\w+\W*?\r\n\s*(<\/.*>)
or
\w+[^\r\n]*\r\n\s*(<\/.*>)

The solution that I used:
\w+[^\r\n<>]*(\r\n\s*)(<\/.*>)
It matches a word (so not a tag), then anything that is not CR, LF, < or > (so it doesn't match openingtag> CRLF</closingtag>).
This is a modified version of what M42 proposed; I added <> to make sure we won't match a tag.
Thanks for the suggestions!
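A quick way to check the accepted pattern against the samples from the question is a small harness like the one below. It is only a sketch in Python's re module, which is not the VBScript RegExp engine, so behaviour could differ in edge cases. The helper name has_text_before_closing_tag is made up.

```python
import re

# Pattern from the accepted solution, unchanged.
TAG_AFTER_TEXT = re.compile(r"\w+[^\r\n<>]*(\r\n\s*)(<\/.*>)")

def has_text_before_closing_tag(s: str) -> bool:
    """True if a word, optional trailing characters and a CRLF precede a closing tag."""
    return TAG_AFTER_TEXT.search(s) is not None

# Samples from the question, with CRLF written out as \r\n.
should_match = [
    "myword! \r\n</tag>",
    "myword\r\n</tag>",
    "myword \r\n</tag>",
    "myword...\r\n </tag>",
]
should_not_match = [
    "</otherclosingtag> \r\n </tag>",
]

for s in should_match:
    print(repr(s), has_text_before_closing_tag(s))
for s in should_not_match:
    print(repr(s), has_text_before_closing_tag(s))
```

The last sample is rejected because the > of the first closing tag sits between the word and the CRLF, and > is excluded by the character class.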

Try this:
^.*[\n\t\s]*</.*>$ --> BAD
^.*[\r\n\t\s]*</.*>$

Perl utf8 newline replacement

EDIT:
Sorry! Seems the weird line break behaviour for arabic and other text is due to something else entirely. Unfortunately I noticed it the same time I was playing with this script.
I'm trying to reformat the text field given by TTYtter in Perl. (Source here)
The text is defined as "The actual UTF-8 text of the status update. See twitter-text for details on what is currently considered valid characters." (From Twitter dev pages).
Using
$txtin = $ref->{'text'};
$txtin =~ s/\\n\s*/ \\ /g;
Strips out and replaces newlines fine for 'English' (Western?) text, but does some odd things for other languages.
Greek & Arabic text seems to get newlines added to it using this replacement method.
I've tried matching on \p{Zl} (found in CPAN's perlunicode.pod), e.g.:
$txtin =~ s/\p{Zl}\s*/ \\ /g;
But that leaves \n in westernized tweets, so it's not matching what I'd expected / hoped for.
So basically, my question is: How do I replace all newline / CR characters in a UTF-8 blob of text (a tweet), in a way that works for Cyrillic, Arabic, kanji & Western content in Perl?
Thank you!
EDIT: If you missed the first edit and read this far, this is a question based on a false assumption. It wasn't the newline stripping causing the problem. Apparently it's a text wrap problem totally unrelated to the above. This question now flagged for moderation (since I can't delete it).

\\ matches a single backslash character, so /\\p{Zl}/ matches a backslash followed by the literal string p{Zl}. To match the character class \p{Zl}, you'll want either one more or one fewer backslash at the beginning of the regular expression, depending on whether the input contains backslashes.

s/\\n\s*/ \\ /g does not strip out and replace newlines fine for 'English' (Western?) text [1], and it doesn't add newlines to Greek and Arabic text. I don't know what you actually used, but to replace a newline optionally followed by whitespace, use the following on the decoded text:
s/\n\s*/.../g
\n matches a newline.
\\n matches the two characters \n.
\p{Zl} matches U+2028 LINE SEPARATOR (but not a newline).
\\p{Zl} matches the 6 characters \p{Zl}.
A newline is a newline, no matter what other characters might be near it.
How do I replace all newline / cr characters in a utf8 blob of text (a tweet), that will work for cyrillic, arabic, kanji & western content in Perl?
A newline is a newline no matter what other characters may be near it. Same goes for carriage returns.
utf8::decode( my $unicode_chars = $utf8_bytes );
$unicode_chars =~ s/[\r\n]/.../g;
utf8::encode( $utf8_bytes = $unicode_chars );
Or maybe you're asking how to replace vertical whitespace characters?
utf8::decode( my $unicode_chars = $utf8_bytes );
$unicode_chars =~ s/\v/.../g;
utf8::encode( $utf8_bytes = $unicode_chars );
Notes:
[1] Unless they happen to follow a backslash and an "n".
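The same decode, replace, encode round trip can be sketched in Python for comparison. This is an illustration, not a translation of TTYtter's code: the function name replace_newlines and the single-space replacement are my own choices. The VERTICAL_WS class lists the characters that Perl's \v covers (U+000A to U+000D, U+0085, U+2028, U+2029).

```python
import re

# Newline and carriage return only, as in s/[\r\n]/.../g.
NEWLINES = re.compile(r"[\r\n]")

# Explicit equivalent of Perl's \v (vertical whitespace).
VERTICAL_WS = re.compile("[\n\x0b\x0c\r\x85\u2028\u2029]")

def replace_newlines(utf8_bytes: bytes, replacement: str = " ",
                     pattern: re.Pattern = NEWLINES) -> bytes:
    """Decode UTF-8 bytes, replace line-break characters, and re-encode."""
    text = utf8_bytes.decode("utf-8")
    return pattern.sub(replacement, text).encode("utf-8")

greek = "Καλημέρα\nκόσμε".encode("utf-8")
print(replace_newlines(greek).decode("utf-8"))

# \\n in a regex means a literal backslash followed by "n", not a newline.
print(re.search(r"\\n", "a\nb"))    # real newline: no match
print(re.search(r"\\n", "a\\nb"))   # backslash + n: match
```

The script of the surrounding text makes no difference once the bytes are decoded; only the choice of pattern decides which characters count as line breaks.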

Ahhh. Apparently this is one way to close it. See EDIT's in original. Apparently it's a word wrap problem, not related to stripping out newlines.

Parameterize block of text using Notepad++

I have the following text in Notepad++
A
B
C
D
I would like to "parameterize" this text and turn it into this using a regex or some other native Notepad++ command(s) or plugin:
'A', 'B', 'C', 'D'
Note that I want the end text to be on one line with no trailing comma, if possible. This question gets me close, but I am left with a trailing comma and the text is not compacted to one line. Is there any way to accomplish this in Notepad++ without using a macro?

Try this in Regex Search Mode.
Search for (\w)\r\n
Replace with '\1', (including the trailing space after the comma)
But you will have to remove the trailing comma and space manually from the end of the line.

You can do it in two steps:
First, search for e.g. (\w+) and replace with '$1'
\w+ finds one or more word characters (letters, digits and the underscore).
Second, search for (\s+) and replace with , (a comma, plus a space if you want one after it)
\s+ finds whitespace characters, which here means the newline characters at the end of each row. If your text contains whitespace that you want to keep, use [\r\n]+ instead.
This way, if there is no newline after the last letter, there will be no trailing comma.
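The two steps can be tried outside Notepad++ with a short Python sketch. Python's re.sub writes the back-reference as \1 where Notepad++ accepts $1, and the function name parameterize is made up for this example. It uses the [\r\n]+ variant so that spaces inside a line would be kept.

```python
import re

def parameterize(block: str) -> str:
    """Quote each word, then turn the line breaks into ', ' separators."""
    # Step 1: wrap every run of word characters in single quotes.
    quoted = re.sub(r"(\w+)", r"'\1'", block)
    # Step 2: replace each run of line-break characters with a comma and a space.
    return re.sub(r"[\r\n]+", ", ", quoted)

print(parameterize("A\nB\nC\nD"))      # 'A', 'B', 'C', 'D'
print(parameterize("A\r\nB\r\n"))      # a trailing newline leaves a trailing ', '
```

The second call shows the case the answer mentions: a newline after the last letter is what produces the trailing comma.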

Vim regex not matching spaces in a character class

I'm using vim to do a search and replace with this command:
%s/lambda\s*{\([\n\s\S]\)*//gc
I'm trying to match for all word, endline and whitespace characters after a {. For instance, the entirety of this line should match:
lambda {
FactoryGirl.create ...
Instead, it only matches up to the newline and no spaces before FactoryGirl. I've tried manually replacing all the spaces before, just in case there were tab characters instead, but no dice. Can anyone explain why this doesn't work?

The \s is an atom for whitespace; \n, though it looks similar, is syntactically an escape sequence for a newline character. Inside the collection atom [...], you cannot include other atoms, only characters (including some special ones like \n). From :help /[]:
The following translations are accepted when the 'l' flag is not
included in 'cpoptions' {not in Vi}:
\e <Esc>
\t <Tab>
\r <CR> (NOT end-of-line!)
\b <BS>
\n line break, see above |/[\n]|
\d123 decimal number of character
\o40 octal number of character up to 0377
\x20 hexadecimal number of character up to 0xff
\u20AC hex. number of multibyte character up to 0xffff
\U1234 hex. number of multibyte character up to 0xffffffff
NOTE: The other backslash codes mentioned above do not work inside
[]!
So, either specify the whitespace characters literally [ \t\n...], use the corresponding character class expression [[:space:]...], or combine the atom with the collection via logical or \%(\s\|[...]\).

Vim interprets characters inside [ ... ] character classes differently. It isn't taking them literally, though, since that regex wouldn't fully match lambda {sss or lambda {\\\. What \s and \S are interpreted as, I still can't explain.
However, I was able to achieve nearly what I wanted with:
%s/lambda\s*{\([\n a-zA-Z]\)*//gc
That ignores punctuation, which I wanted. This works, but is dangerous:
%s/lambda\s*{\([\n a-zA-Z]\|.\)*//gc
It is dangerous because adding a character such as } after the group causes Vim to hang while matching. So my solution was to add the punctuation I needed into the character class.
