Regex to exclude quoted strings - regex

I know there are tons of similar questions; I have read hundreds, but with my limited English and even more limited knowledge of regex, I am still in the fog.
I need to process a fairly large text file that contains paragraphs in two formats: enclosed in quotes or not. In both cases a paragraph may contain one or more carriage returns. I have to process only the paragraphs enclosed in quotes. So "This is \r a phrase" must be processed (I actually have to replace the \r with a dummy character such as '#'), while 'This is \r a comment' must be excluded.
I tried this pattern: "[\s\S(\r)]+"
This correctly selects only the enclosed paragraphs, but the regex debugger does not report the \r group to be replaced.

Try this pattern: "[\s\S](\r)[\s\S]"
You need to escape the \ character, since \r means something specific with RegEx.
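In the pattern from the question, the parentheses sit inside the character class, where they are literal characters rather than a capture group, which is likely why no \r group is reported. One way around this is a two-step replacement: match each quoted paragraph, then replace the carriage returns inside that match only. A minimal sketch in Python follows. The question does not name a language, so the language, the function name and the assumption that quoted paragraphs never contain an escaped double quote are all mine.

```python
import re


def mask_cr_in_quotes(text: str, dummy: str = "#") -> str:
    """Replace every carriage return inside a double-quoted run with `dummy`.

    Assumption: a quoted paragraph starts and ends with a plain double quote
    and never contains an escaped double quote.
    """
    # "[^"]*" matches one quoted paragraph, including embedded \r and \n.
    # The callback rewrites only the matched text; the rest is left alone.
    return re.sub(r'"[^"]*"', lambda m: m.group(0).replace("\r", dummy), text)


sample = '"This is \r a phrase"\nThis is \r a comment'
print(repr(mask_cr_in_quotes(sample)))
```

Text outside double quotes keeps its carriage returns, so unquoted comment paragraphs are not touched.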

Related

Regex matching, but not inside latex environment

I want to replace quotation marks in a LaTeX document. It's written in German, which means that all quotation marks should be of the form "´text"', but some editors of the document have used these: "text", ´´text''.
The complication is that the document contains highlighted code in the lstlisting environment, where the quotation marks should not be replaced.
I have a regex, that matches text inside the unwanted quotes, even if there are multiple words:
((``((\w+\s*)+)'')|("((\w+\s*)+)"))
I also have a regex, that matches a string ("asdf" in this case), only if it is not inside the lstlisting environment:
"asdf"(?=((?!\\end\{lstlisting\}).)*\\begin\{lstlisting\}?)
They work fine on their own, but when I combine them like this:
((``((\w+\s*)+)'')|("((\w+\s*)+)"))(?=((?!\\end\{lstlisting\}).)*\\begin\{lstlisting\}?)
some of the quoted strings that should be matched are not, and additionally the whole document is matched.
PS: I am currently using Notepad++ for matching, because it allows . to match \n
[EDIT]: It works fine, as long as I limit the first part to single words:
((``((\w)+)'')|("((\w)+)"))(?=((?!\\end\{lstlisting\}).)*\\begin\{lstlisting\}?)
To match words with whitespaces, you can use
(``[\w\s]+''|"[\w\s]+")(?=(?:(?!\\end\{lstlisting\}).)*\\begin\{lstlisting\}?)
If whitespace may appear only between words (not directly after the opening quotes or directly before the closing ones), you will need to unroll the [\w\s]+ part as \w+(?:\s+\w+)*.
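Outside Notepad++, a different technique can avoid the long lookahead altogether: split the document on lstlisting environments and rewrite quotes only in the pieces between them. This is a swapped-in approach, not the regex from the answer above. The sketch below is in Python; the function name is hypothetical, and the replacement form "`text"' (the babel German shorthand) is my assumption about the intended target, so it is exposed as a parameter.

```python
import re

# One whole lstlisting environment; the capture group makes re.split keep it.
LISTING = re.compile(r"(\\begin\{lstlisting\}.*?\\end\{lstlisting\})", re.DOTALL)

# ``text'' or "text", where text consists of word characters and whitespace.
QUOTED = re.compile(r"``([\w\s]+)''" + "|" + r'"([\w\s]+)"')


def fix_quotes(tex: str, opening: str = '"`', closing: str = "\"'") -> str:
    """Rewrite unwanted quotes everywhere except inside lstlisting blocks.

    Assumption: opening/closing default to the babel German shorthand.
    """
    parts = LISTING.split(tex)
    for i, part in enumerate(parts):
        # Odd indices hold the captured listings; leave them untouched.
        if i % 2 == 0:
            parts[i] = QUOTED.sub(
                lambda m: opening + (m.group(1) or m.group(2)) + closing, part
            )
    return "".join(parts)
```

Because the listings are cut out before any quote is matched, quoted strings after the last listing are handled too, which the lookahead version cannot do since it requires a following \begin{lstlisting}.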

RegExp to match visible non-letter characters before line break

I am working on a VBScript regexp that detects a tag which contains text and a CRLF before the closing tag.
I am currently using \w+[:;?!.,""\)\]-~]*(\s)*(\r\n\s*)(<\/.*>)
Reading the expression from the end, I match any closing tag, a CRLF optionally followed by whitespace, optional whitespace before the CRLF, and, optionally, any visible non-letter characters that follow a word.
This is to match things like
myword! CRLF</tag>
mywordCRLF</tag>
myword CRLF</tag>
myword...CRLF </tag>
etc.
However, I do not want to match below, as I need to detect tags containing TEXT and linebreaks.
</otherclosingtag> CRLF </tag>
I am concerned about the \w+[:;?!.,""\)\]-~]* bit as it doesn't look right to me, as I would need to insert quite a large number of characters here.
I tried replacing it with \S, \W but they all seem to match CRLF characters as well.
Any ideas?
Cheers!
How about using non-greedy modifier:
\w+\W*?\r\n\s*(<\/.*>)
or
\w+[^\r\n]*\r\n\s*(<\/.*>)
The solution that I used:
\w+[^\r\n<>]*(\r\n\s*)(<\/.*>)
It matches a word, then anything that is not CR, LF, < or > (so it doesn't match openingtag> CRLF</closingtag>).
This is a modified version of what M42 proposed; I added <> to make sure we don't match a tag.
Thanks for suggestions!
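To check the behaviour of the pattern the asker settled on, it can be run against the sample inputs from the question. The sketch below uses Python rather than VBScript, and the helper name is hypothetical; the pattern itself is taken from the post above with CRLF written as \r\n.

```python
import re

# The pattern from the post above. re.search is used because the word may
# start anywhere in the input.
PATTERN = re.compile(r"\w+[^\r\n<>]*(\r\n\s*)(</.*>)")


def has_text_before_break(fragment: str) -> bool:
    """True if a word (plus optional trailing characters other than
    CR, LF, < or >) is followed by CRLF and then a closing tag."""
    return PATTERN.search(fragment) is not None


for sample in (
    "myword! \r\n</tag>",
    "myword\r\n</tag>",
    "myword...\r\n </tag>",
    "</otherclosingtag> \r\n </tag>",
):
    print(repr(sample), has_text_before_break(sample))
```

The last sample is rejected because the > of the first closing tag sits between the word and the CRLF, and > is excluded from the character class.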
Try this:
^.*[\n\t\s]*</.*>$ --> BAD
^.*[\r\n\t\s]*</.*>$

Perl utf8 newline replacement

EDIT:
Sorry! It seems the weird line-break behaviour for Arabic and other text is due to something else entirely. Unfortunately I noticed it at the same time I was playing with this script.
I'm trying to reformat the text field given by TTYtter in Perl. (Source here)
The text is defined as "The actual UTF-8 text of the status update. See twitter-text for details on what is currently considered valid characters." (From Twitter dev pages).
Using
$txtin = $ref->{'text'};
$txtin =~ s/\\n\s*/ \\ /g;
strips out and replaces newlines fine for 'English' (Western?) text, but does some odd things for other languages.
Greek and Arabic text seems to get newlines added to it by this replacement.
I've tried matching on \p{Zl} (found in CPAN's perlunicode.pod), e.g.:
$txtin =~ s/\p{Zl}\s*/ \\ /g;
But that leaves \n in westernized tweets, so it's not matching what I'd expected / hoped for.
So basically, my question is: How do I replace all newline / cr characters in a utf8 blob of text (a tweet), that will work for cyrillic, arabic, kanji & western content in Perl?
Thank you!
EDIT: If you missed the first edit and read this far, this is a question based on a false assumption. It wasn't the newline stripping causing the problem. Apparently it's a text wrap problem totally unrelated to the above. This question now flagged for moderation (since I can't delete it).
\\ matches a single backslash character, so /\\p{Zl}/ matches a backslash followed by the literal string p{Zl}. To match the character class \p{Zl}, you'll want either one more or one fewer backslash at the beginning of the regular expression, depending on whether the input contains backslashes.
s/\\n\s*/ \\ /g does not strip out and replace newline's fine for 'English' (western?) text[1], and it doesn't add newlines for Greek and Arabic text. I don't know what you actually used, but to replace a newline optionally followed by whitespace, use the following on the decoded text:
s/\n\s*/.../g
\n matches a newline.
\\n matches the two characters \n.
\p{Zl} matches U+2028 LINE SEPARATOR (but not a newline).
\\p{Zl} matches the 6 characters \p{Zl}.
A newline is a newline, no matter what other characters might be near it.
How do I replace all newline / cr characters in a utf8 blob of text (a tweet), that will work for cyrillic, arabic, kanji & western content in Perl?
A newline is a newline no matter what other characters may be near it. Same goes for carriage returns.
utf8::decode( my $unicode_chars = $utf8_bytes );
$unicode_chars =~ s/[\r\n]/.../g;
utf8::encode( $utf8_bytes = $unicode_chars );
Or maybe you're asking how to replace vertical whitespace characters?
utf8::decode( my $unicode_chars = $utf8_bytes );
$unicode_chars =~ s/\v/.../g;
utf8::encode( $utf8_bytes = $unicode_chars );
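The decode, replace, encode sequence is not specific to Perl. As a rough illustration of the same idea, here is a sketch in Python; the function name and the default replacement string are hypothetical, and the input is assumed to be valid UTF-8 bytes.

```python
import re


def replace_line_breaks(utf8_bytes: bytes, replacement: str = " ") -> bytes:
    """Replace each CR or LF, plus any whitespace that follows it."""
    text = utf8_bytes.decode("utf-8")               # bytes -> characters
    text = re.sub(r"[\r\n]\s*", replacement, text)  # works on characters
    return text.encode("utf-8")                     # characters -> bytes


tweet = "γεια\r\n  σου\nمرحبا".encode("utf-8")
print(replace_line_breaks(tweet, " \\ ").decode("utf-8"))

# A doubled backslash in the pattern matches a literal backslash followed
# by "n", not a newline character.
print(re.findall(r"\\n", "a\\nb\nc"))
```

The replacement happens on decoded characters, so it behaves the same whatever script surrounds the line break.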
Notes:
[1] Unless they happen to follow a backslash and an "n".
Ahhh. Apparently this is one way to close it. See EDIT's in original. Apparently it's a word wrap problem, not related to stripping out newlines.

Removing repeated characters, including spaces, in one line

I currently have a string, say $line='55.25040882, 3,,,,,,', from which I want to remove all whitespace and all repeated commas and periods. Currently, I have:
$line =~ s/[.,]{2,}//;
$line =~ s/\s{1,}//;
This works, as I get '55.25040882,3', but when I try
$line =~ s/[.,\s]{2,}//;
It pulls out the ", " and leaves the ",,,,,,". I want to retain the first comma and just get rid of the whitespace.
Is there a way to elegantly do this with one line of regex? Please let me know if I need to provide additional information.
EDIT: Since there were so many solutions, I decided to update my question with the answer below:
$line =~ s/([.,])\1{1,}| |\t//g;
This removes all repeated periods and commas, removes all spaces and tabs, while retaining the \r and \n characters. There are so many ways to do this, but this is the one I settled for. Thanks so much!
This is mostly a critique of Rohit's answer, which seems to contain several misconceptions about character class syntax, especially the negation operator (^). Specifically:
[(^\n^\r)\s] matches ( or ^ or ) or any whitespace character, including linefeed (\n) and carriage return (\r). In fact, they're each specified twice (since \s matches them too), though the class still only consumes one character at a time.
^[\n\r]|\s matches a linefeed or carriage return at the beginning of the string, or any whitespace character anywhere (which makes the first part redundant, since any whitespace character includes linefeed and carriage return, and anywhere includes the beginning of the string).
Inside a character class, the caret (^) negates the meaning of everything that follows iff it appears immediately after the opening [; anywhere else, it's just a caret. All other metacharacters except \ lose their special meanings entirely inside character classes. (But the normally non-special characters, - and ], become special.)
Outside a character class, ^ is an anchor.
Here's how I would write the regex:
$line =~ s/([.,])\1+|\h+//g;
Explanation:
Since you finally went with ([.,])\1{1,}, I assume you want to match repeated periods or repeated commas, not things like ., or ,.. Success with regexes means learning to look at text the way the regex engine does, and it's not intuitive. You'll help yourself a lot if you try to describe each problem the way the regex engine would, if it could speak.
{1,} is not incorrect, but why add all that clutter to your regex when + does the same thing?
\h matches horizontal whitespace, which includes spaces and tabs, but not linefeeds or carriage returns. (It is supported in Perl 5.10+, PCRE and Java 8+, but it is not universal: in Ruby/Oniguruma, \h matches a hex digit, and several other flavors do not support it at all.)
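For flavors without \h, the same effect can be had with a negated class. The sketch below ports the one-line substitution to Python, whose re module has no \h; [^\S\r\n] ("whitespace, but not CR or LF") is my approximation of it, and the function name is hypothetical.

```python
import re


def squeeze(line: str) -> str:
    """Remove runs of a repeated period or comma, and horizontal whitespace."""
    # ([.,])\1+  : the same punctuation character two or more times in a row
    # [^\S\r\n]+ : whitespace other than CR and LF (stand-in for Perl's \h+)
    return re.sub(r"([.,])\1+|[^\S\r\n]+", "", line)


print(squeeze("55.25040882, 3,,,,,,"))
```

A single comma or period never matches the first alternative, so the decimal point and the lone separator are kept, while line terminators pass through unchanged.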
You can try using: -
my $line='55.25040...882, 3,,,,,,';
$line =~ s/[^\S\n\r]|[.,]{2,}//g; # Removes whitespace other than \n and \r, and runs of 2+ periods/commas
print $line
OUTPUT: -
55.25040882,3
[^\S\n\r]|[.,]{2,} -> Matches either [^\S\n\r] or [.,]{2,}.
[.,]{2,} -> Matches a run of two or more consecutive , or . characters (in any mix).
[^\S\n\r] -> Matches any whitespace character except linefeed and carriage return.

How to test to see if a string is only whitespace in perl

What's a good way to test, with a regex, whether a string consists only of whitespace characters?
if($string=~/^\s*$/){
#is 100% whitespace (remember 100% of the empty string is also whitespace)
#use /^\s+$/ if you want to exclude the empty string
}
(I have decided to edit my post to include concepts in the below conversation with tobyodavies.)
In most instances, you want to determine whether or not something is whitespace, because whitespace is relatively insignificant and you want to skip over a string consisting of merely whitespace. So, I think what you want to determine is whether or not there are significant characters.
So I tend to use the reverse test, $str =~ /\S/, which evaluates the predicate "the string contains at least one significant character".
To answer your particular question, test the negation: $str !~ /\S/
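The two tests discussed so far differ only in how they treat the empty string. A small sketch in Python (not Perl; both function names are hypothetical) makes the difference concrete. It uses re.fullmatch instead of ^...$ because Python's $ also matches just before a trailing newline.

```python
import re


def is_blank(s: str) -> bool:
    """True if s contains no non-whitespace character (empty string included)."""
    return re.search(r"\S", s) is None


def is_nonempty_blank(s: str) -> bool:
    """True only if s is one or more whitespace characters."""
    return re.fullmatch(r"\s+", s) is not None


for candidate in ("", " \t\r\n", " x "):
    print(repr(candidate), is_blank(candidate), is_nonempty_blank(candidate))
```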
Your regex statement should look for ^\s+$. It will require at least one whitespace.
In case you were wondering, "white space is defined as [\t\n\f\r\p{Z}]". See http://userguide.icu-project.org/strings/regexp.
\t Match a HORIZONTAL TABULATION, \u0009.
\n Match a LINE FEED, \u000A.
\f Match a FORM FEED, \u000C.
\r Match a CARRIAGE RETURN, \u000D.
\p{UNICODE PROPERTY NAME} Match any character with the specified Unicode Property.