Replace nth occurrence of a character by another - regex

I hope this isn't a duplicate; I didn't find an answer, and I need help from regexp wizards.
I have a string and I would like to replace the second space in it with a \n, but I don't know how to use indices (in this way) in a regular expression.
For example:
# I have :
"a b c d e f"
# I want :
> "a b\nc d e f"
Also, I would like to know how I can "repeat" this replacement: replace every second occurrence of a space with \n.
For example:
"a b c d e f"
> "a b\nc d\ne f"

(\\S+\\s+\\S+)\\s+
You can use this and replace with \1\n or $1\n. See the demo:
https://regex101.com/r/yG7zB9/29
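As a minimal sketch of the same idea in Python (the answer does not name a language, so `re.sub` and its `count` argument are assumptions made for illustration): the pattern captures two whitespace-separated tokens and replaces the whitespace after them with a newline.

```python
import re

# Two whitespace-separated tokens (captured), then the whitespace to replace.
PATTERN = re.compile(r"(\S+\s+\S+)\s+")

text = "a b c d e f"

# Replace only the first "second space".
first_only = PATTERN.sub(r"\1\n", text, count=1)

# Replace every second space.
every_second = PATTERN.sub(r"\1\n", text)

print(repr(first_only))    # 'a b\nc d e f'
print(repr(every_second))  # 'a b\nc d\ne f'
```

Because each match consumes two tokens plus the separator after them, repeating the substitution naturally hits every second space.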

Related

Regex: Match a pattern within quoted texts

Text is
lemma A:
"
abx K() bc
"
// comment lemma B
lemma B:
"
abx bc sdsf
"
lemma C:
"
abfdfx K() bc
"
lemma D:
"
abxsf bc
"
I want to find the lemmas that contain K() inside the quoted text that follows them. I have tried the Perl regex (?s)^[ ]*lemma.*?"(?!").*?K\( but the match spans two lemmas. The output should be: lemma A: "..." and lemma C: "...".
If each double quote is at the start of a line, you can match a newline followed by the double quote.
Then match any character except a double quote until you reach K(
^[ ]*lemma\b.*\R"[^"]*K\(
^ Start of a line (this requires multiline mode)
[ ]*lemma\b Match optional spaces and lemma
.*\R Match the rest of the line and a newline
"[^"]* Match " followed by optional chars other than "
K\( Match K(
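The breakdown above can be tried out with a short Python sketch. Two things here are assumptions of this sketch, not part of the answer: Python's `re` has no `\R`, so `\n` stands in for it, and a capture group is added around the lemma name to make the result easy to read.

```python
import re

text = '''lemma A:
"
abx K() bc
"
// comment lemma B
lemma B:
"
abx bc sdsf
"
lemma C:
"
abfdfx K() bc
"
lemma D:
"
abxsf bc
"
'''

# [^"]* cannot cross a double quote, so a match can never run from one
# lemma's quoted text into the next one.
pattern = re.compile(r'^[ ]*lemma\s+(\w+):.*\n"[^"]*K\(', re.MULTILINE)

names = pattern.findall(text)
print(names)  # ['A', 'C']
```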
You could use:
(?s)^[ ]*lemma[^"]*"[^"]*?K\(
[^"] means "any character except a double quote".

gsub not replacing all expected matches in R

Let's say I have the string x <- "AbC" and I want to put an ampersand in between each letter. I would have assumed I could just do gsub("([a-zA-Z])([a-zA-Z])", "\\1 & \\2", x), but that produces "A & bC". Why doesn't gsub recognize the second set of letters that match the regex? It's not like gsub only replaces the first match found. If I have x <- "AbC DE" and run the same command, I get "A & bC D & E".
What am I missing in terms of how gsub is doing its replacement? I would have expected outputs of "A & b & C" and "A & b & C D & E" from the two inputs above.
Because once a character has been consumed by one match, the regex engine will not match it again; that is, matches cannot overlap. In "AbC", the first match consumes "Ab", leaving only "C", which cannot match two letters on its own. Use a lookahead to get around this:
gsub("([a-zA-Z])(?=[a-zA-Z])", "\\1 & ", x, perl=TRUE)
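The same behaviour can be reproduced outside R. Below is a small Python sketch (the helper name `join_letters` is hypothetical) that contrasts the consuming pattern with the lookahead version.

```python
import re


def join_letters(s: str) -> str:
    """Insert ' & ' after every letter that is followed by another letter."""
    # The lookahead checks for a following letter without consuming it,
    # so that letter is still available to start the next match.
    return re.sub(r"([a-zA-Z])(?=[a-zA-Z])", r"\1 & ", s)


# Consuming both letters: "Ab" is used up, and "C" alone cannot match.
consumed = re.sub(r"([a-zA-Z])([a-zA-Z])", r"\1 & \2", "AbC")

print(consumed)                # A & bC
print(join_letters("AbC"))     # A & b & C
print(join_letters("AbC DE"))  # A & b & C D & E
```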

Regular Expression: search multiple string with linefeed delimited by ";"

I have a string like this, which describes a structured data source:
Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
Every field...
... starts with "FieldName"
... ends with ";"
... may contain linefeeds
I need to find, with a regular expression, the values of SampleTestPlan, which appears twice. So...
1st value is:
2
a b
c d
2nd value is
3
e f
g h
i l
I've performed several attempts with such search string:
/SampleTestPlan(.\s)/gm
/SampleTestPlan(.\s);/gm
/SampleTestPlan(.*);/gm
but I need to understand much better how regular expressions work, as I'm definitely a newbie and have a lot to learn.
Thanks in advance to anyone that may help me!
Stefano, Milan, ITALY
You could use the following regex:
(?<=\w\b)[^;]+(?=;)
How it works:
It matches everything that is:
preceded by a word character at a word boundary, i.e. the end of the field name: (?<=\w\b)
followed by a ;
contains anything (at least one character) except a ; (including newlines).
For example, for that input:
Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
It matches 5 times:
whocares
then:
2
a b
c d
then:
abc
then:
3
e f
g h
i l
then:
01
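To see these five matches concretely, here is a minimal Python sketch. One detail worth noting: each raw match starts immediately after the field name, so it begins with the separating space; the `strip()` call is an addition of this sketch to remove it.

```python
import re

data = """Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
"""

# Lookbehind: a word character at a word boundary (end of the field name).
# Lookahead: the terminating semicolon, which is not part of the match.
raw_matches = re.findall(r"(?<=\w\b)[^;]+(?=;)", data)
values = [m.strip() for m in raw_matches]

for value in values:
    print(repr(value))
```

Since this pattern returns the value of every field, you would still need to pick out the ones belonging to SampleTestPlan afterwards.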
Assuming your input is always well formatted like the sample, try this:
/SampleTestPlan(\s+\d+.*?);/sg
Here, the /s modifier makes the dot match newline characters as well.
That would be /SampleTestPlan([^;]+)/g. [^abc] means any character which is not a, b or c.
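A minimal Python sketch of the negated-character-class approach follows. The `\s+` before the group and the trailing `;` are small additions of this sketch so that the captured value does not start with the separating space.

```python
import re

data = """Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
"""

# [^;]+ matches any run of characters other than ";", newlines included,
# so no "dot matches newline" flag is needed.
plans = re.findall(r"SampleTestPlan\s+([^;]+);", data)

for plan in plans:
    print(repr(plan))
# '2\na b\nc d'
# '3\ne f\ng h\ni l'
```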

Lua text parsing, space handling

I'm a newbie to Lua, and I want to parse text like
Phase1:A B Phase2:A B Phase3:W O R D Phase4:WORD
to
Phase1 Phase2 Phase3 Phase4
A A B W O R D WORD
I used string.gmatch(s, "(%w+):(%w+)"), but I can only get
Phase1 Phase2 Phase3 Phase4
A A W WORD
How can I get the missing B, O, R, D back?
Or do I need to write a pattern for every phase? How would I do that?
The input text in your example doesn't have any clear delimiter between the phrases so parsing it accurately with regex is tricky.
This would be much easier to parse if you add a delimiter symbol like a , to separate the phrases.
Phrase1:A B, Phrase2:A B, Phrase3:W O R D,Phrase4:WORD
You can then parse it with this pattern:
s = "Phrase1:A B, Phrase2:A B, Phrase3:W O R D,Phrase4:WORD"
for k, v in s:gmatch "(Phrase%d+):([^,]+)" do
print(k, v)
end
outputs:
Phrase1 A B
Phrase2 A B
Phrase3 W O R D
Phrase4 WORD
If it's not possible to relax the above constraint, you can try this pattern:
s:gmatch "Phrase%d+:%w[%w ]* "
Note that there's a caveat with this pattern: the string you're parsing needs an extra space at the end, or the last phrase won't be matched.
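For comparison, regex engines that support lookahead (Lua patterns do not) can handle the undelimited input directly. The following is a Python sketch under that assumption, not a Lua solution; it reads each value lazily up to the next "key:" or the end of the string, so no trailing space is required.

```python
import re

s = "Phase1:A B Phase2:A B Phase3:W O R D Phase4:WORD"

# Key: word characters before ":".
# Value: as few characters as possible, stopping just before the next
# " key:" or at the end of the string.
pairs = re.findall(r"(\w+):(.*?)(?=\s+\w+:|$)", s)

for key, value in pairs:
    print(key, value)
# Phase1 A B
# Phase2 A B
# Phase3 W O R D
# Phase4 WORD
```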
for k, v in s:gsub('%s*(%w+:)','\0%1'):gmatch'%z(%w+):(%Z*)'
– #Egor Skriptunoff
This pattern works better.

Regular expression replaces invalid words

I have this sentence, and I use a regular expression to replace the word "merda" or "merdas" with ---
"merda vamerda e mais mmmerda? a merdaaa lol merda, namerda m e r d a mesmo merda"
This is the regular expression I'm using:
m{1,}e{1,}r{1,}d{1,}a{1,}s{1,}|m{1,}e{1,}r{1,}d{1,}a{1,}
and this is the result:
"--- va --- e mais --- ? a --- lol --- , na --- m e r d a mesmo ---"
There are 3 errors here: vamerda and namerda should not be replaced, and m e r d a was not replaced.
Can you help me please?
How about:
/\bm+\s*e+\s*r+\s*d+\s*a+\s*s*\b/
explanation:
\b : word boundary
m+ : matches 1 or more m
\s* : matches 0 or more spaces
... same explanation for other letters (e,r,d,a)
s* : matches 0 or more s
\b : word boundary
This will match all expected combinations in the given example.
Edit
According to your comment, you can modify the regex by exchanging each \s* with [\s_]* like :
\bm+[\s_]*e+[\s_]* and so on ...
or even with:
\bm+[^a-z]* ...
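Here is a minimal Python sketch applying this kind of pattern to the sentence from the question. Note one adjustment that is an assumption of this sketch: the `\s*` before the optional `s` is dropped, because in the pattern as written it can also swallow the space that follows a match (turning "merda vamerda" into "---vamerda").

```python
import re

sentence = ("merda vamerda e mais mmmerda? a merdaaa lol merda, "
            "namerda m e r d a mesmo merda")

# \b ... \b keeps the match from starting or ending inside a longer word,
# which is what protects "vamerda" and "namerda".
# \s* between letters allows the spaced-out form "m e r d a".
pattern = re.compile(r"\bm+\s*e+\s*r+\s*d+\s*a+s*\b")

censored = pattern.sub("---", sentence)
print(censored)
# --- vamerda e mais ---? a --- lol ---, namerda --- mesmo ---
```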
Try putting your regular expression in Rubular
It will give you real-time match results, as you modify your regex.
Here's a link to your expression in Rubular permalink
Try this one:
/\Amerda\s+|\smerda,|\smerda\z|\s+merdas\s+|m\se\sr\sd\sa\s/