Regular expression replaces invalid words - regex

I have this sentence, and I use regular expressions to replace the word "merda" or "merdas" with ---:
"merda vamerda e mais mmmerda? a merdaaa lol merda, namerda m e r d a mesmo merda"
This is the regular expression I'm using:
m{1,}e{1,}r{1,}d{1,}a{1,}s{1,}|m{1,}e{1,}r{1,}d{1,}a{1,}
and this is the result:
"--- va --- e mais --- ? a --- lol --- , na --- m e r d a mesmo ---"
There are 3 errors here: vamerda and namerda should not be replaced, and it didn't replace m e r d a.
Can you help me please?

How about:
/\bm+\s*e+\s*r+\s*d+\s*a+\s*s*\b/
Explanation:
\b : word boundary
m+ : matches 1 or more m
\s* : matches 0 or more spaces
... same explanation for other letters (e,r,d,a)
s* : matches 0 or more s
\b : word boundary
This will match all expected combinations in the given example.
Edit
According to your comment, you can modify the regex by replacing each \s* with [\s_]*, like this:
\bm+[\s_]*e+[\s_]* and so on ...
or even with:
\bm+[^a-z]* ...
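As a rough illustration, here is a minimal Python sketch of this approach applied to the sentence from the question. One caveat: as written above, the final \s* can also consume the whitespace that follows the word, so the sketch uses a slightly modified tail, (?:\s*s+)?, which is an assumption of this example rather than part of the original answer. The names PATTERN and censor are hypothetical.

```python
import re

# Variant of the pattern above (an assumption of this sketch): the trailing
# "\s*s*" is written as "(?:\s*s+)?" so that a match does not swallow the
# whitespace that follows the word.
PATTERN = re.compile(r"\bm+\s*e+\s*r+\s*d+\s*a+(?:\s*s+)?\b")


def censor(text: str) -> str:
    """Replace every match of PATTERN with '---'."""
    return PATTERN.sub("---", text)


sentence = "merda vamerda e mais mmmerda? a merdaaa lol merda, namerda m e r d a mesmo merda"
print(censor(sentence))
# --- vamerda e mais ---? a --- lol ---, namerda --- mesmo ---
```

The leading \b is what keeps vamerda and namerda untouched, since there is no word boundary between the preceding letter and the m.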

Try putting your regular expression in Rubular
It will give you real-time match results, as you modify your regex.
Here's a link to your expression in Rubular permalink

Try this one:
/\Amerda\s+|\smerda,|\smerda\z|\s+merdas\s+|m\se\sr\sd\sa\s/

Related

Regex: Match a pattern within quoted texts

Text is
lemma A:
"
abx K() bc
"
// comment lemma B
lemma B:
"
abx bc sdsf
"
lemma C:
"
abfdfx K() bc
"
lemma D:
"
abxsf bc
"
I want to find the lemmas which contain K() inside its following quoted text. I have tried Perl regex (?s)^[ ]*lemma.*?"(?!").*?K\( but it overlaps two lemmas. The output should be: lemma A: "..." and lemma C: "...".
If the double quote is at the start of a line, you can match a newline and then the double quote.
Then match any char except the double quote until you match K(
^[ ]*lemma\b.*\R"[^"]*K\(
^ Start of the line (this assumes multiline mode)
[ ]*lemma\b Match optional spaces and lemma
.*\R Match the rest of the line and a newline
"[^"]* Match " followed by optional chars other than "
K\( Match K(
Regex demo
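A minimal Python sketch of the same idea, under two assumptions: Python's re module has no \R, so a plain \n is used instead, and a capturing group around the lemma name is added here purely so the result is easy to print.

```python
import re

text = '''lemma A:
"
abx K() bc
"
// comment lemma B
lemma B:
"
abx bc sdsf
"
lemma C:
"
abfdfx K() bc
"
lemma D:
"
abxsf bc
"
'''

# re.MULTILINE lets ^ match at the start of every line.
# "\n" stands in for \R, which Python's re module does not support.
# The capturing group (\w+) around the lemma name is an addition for this sketch.
pattern = re.compile(r'^[ ]*lemma\b[ ]*(\w+).*\n"[^"]*K\(', re.MULTILINE)

lemmas_with_k = pattern.findall(text)
print(lemmas_with_k)  # ['A', 'C']
```

Because [^"]* cannot cross the closing quote, a lemma without K() in its own quoted text can never borrow the K() of a later lemma.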
You could use:
(?s)^[ ]*lemma[^"]*"[^"]*?K\(
[^"] means "any character but ""
See a demo here

Remove all numbers + symbols from line in Notepad++

Is it possible to remove every line in Notepad++ that contains anything other than the following characters?
a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
, . '
Something like these:
Remove Non-ascii
.*[^\x00-\x7F]+.*
Remove Numbers
.*[0-9]+.*
Text :
example
example'
example,
example.
example123
éxample è
[example/+
example'/é,
example,*
exa'mple--
example#
example"
You may use
^(?![a-zA-Z,.']+$).+$\R?
The regex matches any non-empty line (.+) that does not only consist of ASCII letters, ,, . or '. \R? at the end matches an optional line break.
Details:
^ - start of a line
(?![a-zA-Z,.']+$) - a negative lookahead that fails the match if its pattern matches: [a-zA-Z,.']+ - 1 or more ASCII letters, commas, periods or single quotes up to the end of the line ($)
.+ - 1+ chars other than line break char
$ - end of a line
\R? - an optional line break char (sequence)
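For readers who want to try the pattern outside Notepad++, here is a small Python sketch run on the sample text from the question. It assumes \n line endings and replaces \R? with \n?, because Python's re module does not support \R.

```python
import re

lines = [
    "example", "example'", "example,", "example.",
    "example123", "éxample è", "[example/+", "example'/é,",
    "example,*", "exa'mple--", "example#", 'example"',
]
text = "\n".join(lines)

# re.MULTILINE makes ^ and $ match at every line start and end.
# "\n?" replaces \R?, which Python's re module does not support.
pattern = re.compile(r"^(?![a-zA-Z,.']+$).+$\n?", re.MULTILINE)

cleaned = pattern.sub("", text)
print(cleaned.splitlines())
# ['example', "example'", 'example,', 'example.']
```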
You can remove them like this:
Find what: ^.*[^a-zA-Z.,'].*$
Replace with: ``
Explanation:
.* for any text
the negated character class [^...] for any unwanted character
then .* again for any remaining text
You need to wrap it into ^...$ to match the whole line
If you want to delete the linefeed characters, then you can use \r?\n instead of the $ sign. I.e.: ^.*[^a-zA-Z.,'].*\r?\n
Try replacing every match of this pattern:
^.+?[^a-zA-Z,.'\r\n]+(.|\r?\n)

How to match regular expression exactly in R and pull out pattern

I want to get pattern from my vector of strings
string <- c(
"P10000101 - Przychody netto ze sprzedazy produktów" ,
"P10000102_PL - Przychody nettozy uslug",
"P1000010201_PL - Handlowych, marketingowych, szkoleniowych",
"P100001020101 - - Handlowych,, szkoleniowych - refaktury",
"- Handlowych, marketingowych,P100001020102, - pozostale"
)
As result I want to get exact match of regular expression
result <- c(
"P10000101",
"P10000102_PL",
"P1000010201_PL",
"P100001020101",
"P100001020102"
)
I tried with this pattern = "([PLA]\\d+)" and different combinations of value = T, fixed = T, perl = T.
grep(x = string, pattern = "([PLA]\\d+(_PL)?)", fixed = T)
We can try with str_extract
library(stringr)
str_extract(string, "P\\d+(_[A-Z]+)*")
#[1] "P10000101" "P10000102_PL" "P1000010201_PL" "P100001020101" "P100001020102"
grep is for finding whether the match pattern is present in a particular string or not. For extraction, either use sub or gregexpr/regmatches or str_extract
Using the base R (regexpr/regmatches)
regmatches(string, regexpr("P\\d+(_[A-Z]+)*", string))
#[1] "P10000101" "P10000102_PL" "P1000010201_PL" "P100001020101" "P100001020102"
Basically, the pattern matches P followed by one or more digits (\\d+), followed by zero or more (*) repetitions of _ and one or more uppercase letters.
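The same pattern can be checked outside R. The following Python sketch is only an illustration of the regex itself, not a replacement for the R code above; the helper name first_code is hypothetical.

```python
import re

strings = [
    "P10000101 - Przychody netto ze sprzedazy produktów",
    "P10000102_PL - Przychody nettozy uslug",
    "P1000010201_PL - Handlowych, marketingowych, szkoleniowych",
    "P100001020101 - - Handlowych,, szkoleniowych - refaktury",
    "- Handlowych, marketingowych,P100001020102, - pozostale",
]

# P, one or more digits, then zero or more "_UPPERCASE" suffixes.
pattern = re.compile(r"P\d+(?:_[A-Z]+)*")


def first_code(s):
    """Return the first match in s, or None if there is none."""
    m = pattern.search(s)
    return m.group(0) if m else None


codes = [first_code(s) for s in strings]
print(codes)
```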

Replace nth occurrence of a character by another

I hope this isn't a duplicate; I didn't find an answer, and I need help from regexp wizards.
I have a string and I would like to replace the second space found in it with a \n, but I don't know how to use indices (in this way) in a regular expression:
For example :
# I have :
"a b c d e f"
# I want :
> "a b\nc d e f"
Also, I would like to know how I can "repeat" this replacement: replace every second occurrence of a space with \n.
For example :
"a b c d e f"
> "a b\nc d\ne f"
(\\S+\\s+\\S+)\\s+
You can use this and replace with \1\n or $1\n. See the demo.
https://regex101.com/r/yG7zB9/29
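A short Python sketch of this replacement, covering both cases from the question. The count=1 argument of re.sub, used here to stop after the first replacement, is specific to Python and is not part of the answer above.

```python
import re

s = "a b c d e f"

# Two whitespace-separated tokens are captured, and the whitespace after
# them is replaced by a newline.
pattern = r"(\S+\s+\S+)\s+"

every_second = re.sub(pattern, r"\1\n", s)          # every second space
only_second = re.sub(pattern, r"\1\n", s, count=1)  # just the second space

print(repr(every_second))  # 'a b\nc d\ne f'
print(repr(only_second))   # 'a b\nc d e f'
```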

Non-greedy regular expression match for multicharacter delimiters in awk

Consider the string "AB 1 BA 2 AB 3 BA". How can I match the content between "AB" and "BA" in a non-greedy fashion (in awk)?
I have tried the following:
awk '
BEGIN {
str="AB 1 BA 2 AB 3 BA"
regex="AB([^B][^A]|B[^A]|[^B]A)*BA"
if (match(str,regex))
print substr(str,RSTART,RLENGTH)
}'
This produces no output. I believe the reason for no match is that there is an odd number of characters between "AB" and "BA". If I replace str with "AB 11 BA 22 AB 33 BA", the regex seems to work.
Merge your two negated character classes and remove the [^A] from the second alternation:
regex = "AB([^AB]|B|[^B]A)*BA"
This regex fails on the string ABABA, though - not sure if that is a problem.
Explanation:
AB # Match AB
( # Group 1 (could also be non-capturing)
[^AB] # Match any character except A or B
| # or
B # Match B
| # or
[^B]A # Match any character except B, then A
)* # Repeat as needed
BA # Match BA
Since the only way to match an A in the alternation is by matching a character except B before it, we can safely use the simple B as one of the alternatives.
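The pattern can be sanity-checked in Python. Note the assumption: Python's re is a backtracking engine rather than a POSIX leftmost-longest one like awk's, but since this pattern can never run past a BA, both should agree on this input. The group is made non-capturing so that findall returns whole matches.

```python
import re

# Same pattern as above, with a non-capturing group.
pattern = re.compile(r"AB(?:[^AB]|B|[^B]A)*BA")

matches = pattern.findall("AB 1 BA 2 AB 3 BA")
print(matches)  # ['AB 1 BA', 'AB 3 BA']

# The failure case mentioned above.
print(pattern.search("ABABA"))  # None
```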
The other answer didn't really answer: how to match non-greedily?
Looks like it can't be done in (G)AWK. The manual says this:
awk (and POSIX) regular expressions always match the leftmost, longest
sequence of input characters that can match.
https://www.gnu.org/software/gawk/manual/gawk.html#Leftmost-Longest
The manual as a whole doesn't contain the words "greedy" or "lazy". It mentions Extended Regular Expressions, but for non-greedy matching you'd need Perl-Compatible Regular Expressions. So… no, it can't be done.
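For comparison, this is what the difference looks like in an engine that does have lazy quantifiers. The sketch uses Python's re module, which supports the Perl-style .*? syntax; it is not something awk can run.

```python
import re

s = "AB 1 BA 2 AB 3 BA"

# Greedy: .* runs to the last possible BA.
greedy = re.findall(r"AB(.*)BA", s)

# Lazy: .*? stops at the first BA after each AB.
lazy = re.findall(r"AB(.*?)BA", s)

print(greedy)  # [' 1 BA 2 AB 3 ']
print(lazy)    # [' 1 ', ' 3 ']
```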
For general expressions, I'm using this as a non-greedy match:
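# Shortest match at the leftmost position; sets RSTART and RLENGTH like match().
# m and n are declared as extra parameters so that they stay local.
function smatch(s, r,    m, n) {
    if (!match(s, r)) return 0
    m = RSTART
    n = RLENGTH
    # Shrink the candidate while a shorter match still starts at position m.
    # The RSTART == 1 check rejects matches that begin later in the substring.
    while (n > 0 && match(substr(s, m, n - 1), r) && RSTART == 1)
        n = RLENGTH
    RSTART = m
    RLENGTH = n
    return RSTART
}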
smatch behaves like match, returning:
the position in s where the regular expression r occurs, or 0 if it does not. The variables RSTART and RLENGTH are set to the position and length of the matched string.