Switching multi-language substring positions using regex

Raw input with Lithuanian letters:
Ą.BČ
Ą.BČ D Ę
Ą. BČ
Ą. BČ D Ę
Ą BČ
Ą BČ D Ę
Examples below should not be affected.
ĄB ČD DĘ
Expected result:
BČ Ą.
BČ Ą. D Ę
BČ Ą.
BČ Ą. D Ę
BČ Ą
BČ Ą D Ę
ĄB ČD DĘ
What I've tried:
^(.\.? *)([\p{L}\p{N}\p{M}]*)$
With ReplaceAllString substitution like so
$2 $1
I have tried various patterns, but this is the best I could come up with for now.
It manages to capture 1st, 3rd and 5th line and successfully substitute like so:
(Except for some extra spaces at the end of lines)
BČ Ą.
Ą.BČ D Ę
BČ Ą.
Ą. BČ D Ę
BČ Ą
Ą BČ D Ę
ĄB ČD DĘ
Explanation:
There is a set of data with varying entries of the underlying basic
structure [FIRST NAME FIRST LETTER][LASTNAME] which I want to ideally
bring to [LASTNAME][SPACE][FIRST NAME FIRST LETTER][DOT]?
Link to regex101:
regex101
Final solution:
^([\p{L}\p{N}\p{M}](?:\. *| +))([\p{L}\p{N}\p{M}]+)
With ReplaceAllString substitution like so
$2 $1

For your example data, you can omit the anchor $ and match either a dot followed by optional spaces, or 1 or more spaces.
To prevent an empty match for the character class, you can repeat it 1 or more times using + instead of *
^(.(?:\. *| +))([\p{L}\p{N}\p{M}]+)
See a regex demo
Note that the . can match any char including a space. You might also change the dot to a single [\p{L}\p{N}\p{M}]
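The final pattern can be checked outside Go with a small Python sketch. Python's built-in re module has no \p{L} classes, so \w stands in for [\p{L}\p{N}\p{M}] here; this is an approximation, since \w also accepts _ and does not cover combining marks. The dot is captured in its own optional group, a variation that is not in the answer above, so that the separator spaces are dropped rather than carried to the end of the match. The name swap_initial is made up for illustration.

```python
import re

# \w approximates [\p{L}\p{N}\p{M}]; the stdlib re module lacks \p{...}.
# Group 1: the initial, group 2: optional dot, group 3: the last name.
PATTERN = re.compile(r"^(\w)(?:(\.) *| +)(\w+)")

def swap_initial(line: str) -> str:
    """Move a leading initial (with its dot, if any) behind the last name."""
    # An unmatched group 2 expands to an empty string (Python 3.5+).
    return PATTERN.sub(r"\3 \1\2", line)

lines = ["Ą.BČ", "Ą.BČ D Ę", "Ą. BČ", "Ą. BČ D Ę", "Ą BČ", "Ą BČ D Ę", "ĄB ČD DĘ"]
for line in lines:
    print(swap_initial(line))
```

Because the spaces after the initial are matched outside any group, the extra trailing spaces mentioned in the question should not appear with this variant.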

Related

match everything but a given string and do not match single characters from that string

Let's start with the following input.
Input = 'blue, blueblue, b l u e'
I want to match everything that is not the string 'blue'. Note that blueblue should not match, but single characters should (even if present in match string).
From this, if I replace the matches with an empty string, it should return:
Result = 'blueblueblue'
I have tried with [^\bblue\b]+
but this does not match the last four single characters 'b', 'l', 'u', 'e'
Another solution:
(?<=blue)(?:(?!blue).)+(?=blue|$)|^(?:(?!blue).)+(?=blue|$)
Regex demo
If your regex engine supports \K, then we can try:
/blue\K|.*?(?=blue|$)/gm
Demo
This pattern says to match:
blue match "blue"
\K but then forget that match
| OR
.*? match anything else until reaching
(?=blue|$) the next "blue" or the end of the string
Edit:
On JavaScript, we can try the following replacement:
var input = "blue, blueblue, b l u e";
var output = input.replace(/blue|.*?(?=blue|$)/g, (x) => x != "blue" ? "" : "blue");
console.log(output);
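For engines without \K, such as Python's built-in re module, the lookaround pattern quoted earlier in this thread can be used directly as a substitution. The following is a minimal sketch; the function name keep_only_blue is hypothetical.

```python
import re

# Lookaround pattern quoted earlier in the thread: match runs of text that
# contain no "blue" and sit between "blue" occurrences or the string edges.
NOT_BLUE = re.compile(
    r"(?<=blue)(?:(?!blue).)+(?=blue|$)|^(?:(?!blue).)+(?=blue|$)"
)

def keep_only_blue(text: str) -> str:
    """Delete everything except literal occurrences of 'blue'."""
    return NOT_BLUE.sub("", text)

print(keep_only_blue("blue, blueblue, b l u e"))
```

The lookbehind has a fixed width of four characters, which is what Python's re module requires.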

How to regex match everything but long words?

I would like to select all long words from a string: re.findall("[a-z]{3,}")
However, I can only use substitution. Hence I need to replace everything except words of 3 or more letters with a space (e.g. abc de1 fgh ij -> abc fgh).
What would such a regex look like?
The result should be all "[a-z]{3,}" concatenated by spaces. However, you can use substitution only.
Or in Python: Find a regex such that
re.sub(regex, " ", text) == " ".join(re.findall("[a-z]{3,}", text))
Here are some test cases:
import re
solution_regex = "..."
for test_str in ["aaa aa aaa aa",
                 "aaa aa11",
                 "11aaa11 11aa11",
                 "aa aa1aa aaaa"
                 ]:
    expected_str = " ".join(re.findall("[a-z]{3,}", test_str))
    print(test_str, "->", expected_str)
    if re.sub(solution_regex, " ", test_str) != expected_str:
        print("ERROR")
->
aaa aa aaa aa -> aaa aaa
aaa aa11 -> aaa
11aaa11 11aa11 -> aaa
aa aa1aa aaaa -> aaaa
Note that space is no different than any other symbol.
\b(?:[a-z,A-Z,_]{1,2}|\w*\d+\w*)\b
Explanation:
\b means that the substring you are looking for starts and ends at a word boundary
(?: ) - non-capturing group
\w*\d+\w* Any word that contains at least one digit and consists of digits, '_' and letters
Here you can see the test.
You can use the regex
(\s\b(\d*[a-z]\d*){1,2}\b)|(\s\b\d+\b)
and replace with an empty string. Here is Python code for the same:
import re
regex = r"(\s\b(\d*[a-z]\d*){1,2}\b)|(\s\b\d+\b)"
test_str = "abcd abc ad1r ab a11b a1 11a 1111 1111abcd a1b2c3d"
subst = ""
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0)
if result:
    print(result)
here is a demo
In Autoit this works for me
#include <Array.au3>
$a = StringRegExp('abc de1 fgh ij 234234324 sdfsdfsdf wfwfwe', '(?i)[a-z]{3,}', 3)
ConsoleWrite(_ArrayToString($a, ' ') & @CRLF)
Result ==> abc fgh sdfsdfsdf wfwfwe
import re
regex = r"(?:^|\s)[^a-z\s]*[a-z]{0,2}[^a-z\s]*(?:\s|$)"
str = "abc de1 fgh ij"
subst = " "
result = re.sub(regex, subst, str)
print (result)
Output:
abc fgh
Explanation:
(?:^|\s) : non capture group, start of string or space
[^a-z\s]* : 0 or more any character that is not letter or space
[a-z]{0,2} : 0, 1 or 2 letters
[^a-z\s]* : 0 or more any character that is not letter or space
(?:\s|$) : non capture group, space or end of string
With the other ideas posted here, I came up with an answer. I can't believe I missed that:
([^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+
https://regex101.com/r/IIxkki/2
Match either non-letters, or up to two letters bounded by non-letters.
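As a rough check, the pattern above can be run against the test cases from the question. This sketch adds one step that is not part of the regex: the result is passed through strip(), because the substitution leaves a single space wherever the text starts or ends with removed material. The name long_words is made up for illustration.

```python
import re

# Pattern from the answer above: runs of non-letters, or 1-2 letters that
# are not adjacent to another letter, repeated so they form one match.
SHORT_OR_NON_LETTER = re.compile(r"([^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+")

def long_words(text: str) -> str:
    # strip() removes the lone space left at either edge of the result;
    # this is an extra step on top of the pure substitution.
    return SHORT_OR_NON_LETTER.sub(" ", text).strip()

tests = ["aaa aa aaa aa", "aaa aa11", "11aaa11 11aa11", "aa aa1aa aaaa"]
for test_str in tests:
    expected = " ".join(re.findall("[a-z]{3,}", test_str))
    result = long_words(test_str)
    print(repr(test_str), "->", repr(result), result == expected)
```

Without the strip() call, inputs such as "aaa aa11" come back with a trailing space, so the strict equality from the question does not hold at the string edges.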

Difference between (^|\\s)([A-Z]{1,3})(\\s|$) and \\b[A-Z]{1,2}\\b regular expressions in R

I'm trying to clean some short strings (1-3 letters) stored in a column of an R data frame. Specifically, consider the following R script:
df = data.frame( "original" = c("ABCDE FG H",
"IJKL MN OPQRS",
"TUV WX YZ AAAA"))
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}($|\\s)", " ", df$original)
df$filter2 = gsub("\\b[A-Z]{1,2}\\b", " ", df$original)
> df
original | filter1 | filter2 |
1 ABCDE FG H | ABCDE H | ABCDE |
2 IJKL MN OPQRS | IJKL OPQRS | IJKL OPQRS|
3 TUV WX YZ AAAA | TUV YZ AAAA| TUV AAAA |
I don't understand why the first filter (^|\\s)[A-Z]{1,2}($|\\s) doesn't replace "H" in the first row or "YZ" in the third one. I would expect the same result as with \\b[A-Z]{1,2}\\b (the filter2 column). Please don't worry about multiple spaces; they aren't important to me (unless they are the problem :)).
I thought the problem might be the "globality" of the operation, that is, that once it finds the first match it does not replace the second one, but that isn't true, as the following replacement shows:
> gsub("A", "X", "AAAABBBBCCCDDDDAAAAAAAEEE")
[1] "XXXXBBBBCCCDDDDXXXXXXXEEE"
So why are the results different?
The point is that gsub can only match non-overlapping strings. FG being the first expected match, and H the second, you can see that these strings overlap, and thus, after "(^|\\s)[A-Z]{1,2}($|\\s)" consumes the trailing space after FG, H just does not match the pattern.
Look: ABCDE FG H is analyzed from left to right. The expression matches FG , and the regex index is right before H. There is only this letter to match, but (^|\s) requires a space or the start of string - there is none at this location.
To "fix" this and use the same logic, you can use a PCRE regex in gsub with lookarounds:
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}(?=$|\\s)", " ", df$original, perl=TRUE)
or
df$filter1 = gsub("(?<!\\S)[A-Z]{1,2}(?!\\S)", " ", df$original, perl=TRUE)
and if you need to actually consume (to remove) spaces, just add \\s* before (or/and after).
The second expression "\\b[A-Z]{1,2}\\b" contains word boundaries, and they are zero-width assertions that do not consume text, thus, the regex engine can match both FG and H since the spaces are not consumed.
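The same effect can be reproduced outside R. The sketch below uses Python's re.sub, which, like gsub, replaces non-overlapping matches from left to right; it is an illustration of the explanation above, not a translation of the R code.

```python
import re

text = "ABCDE FG H"

# Consuming version: the trailing \s is part of the match, so the space
# before "H" is already used up and "H" cannot satisfy (^|\s) afterwards.
consuming = re.sub(r"(^|\s)[A-Z]{1,2}($|\s)", " ", text)

# Lookaround version: (?<!\S) and (?!\S) only assert and consume nothing,
# so both "FG" and "H" are found.
lookaround = re.sub(r"(?<!\S)[A-Z]{1,2}(?!\S)", " ", text)

print(repr(consuming))
print(repr(lookaround))
```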

Regex to match repeated pattern after a string

I need a regex that extracts a pattern after a specific word (here, Limits::).
I have a test string. Let's say the text is always between delimiters, as in !Limits::****! :
*ksjfl kfj sdfasdfaf dfasf asd sdf a dfasd fdaf ad f afdfaf dfad bla bla ksfajs ldsfskj !Limits::WLo1/WHi1/WHi1/WHi1,WLo2/WHi2/WHi/WHi2,.hier repeated pattern..,WLon/WHin/CLon/CHin!
fasdfakl skdfkas sflas fasf sdf afasf
I want only the words:
WLo1
WHi1
WHi1
WHi1
WLo2
WHi2
WHi
WHi2
.
.
.
WLon
WHin
CLon
CHin
I have tested patterns like (?:!\w+::(?:(\w+)/(\w+)/(\w+)/(\w+)))|(?:,(\w+)/(\w+)/(\w+)/(\w+))+.*!, without success.
Regular expressions:
/(W.*|C.*)(?=\/|!|,)/g : match words beginning with W or C followed by / , !, or ,
/\/|,.*(?=,)|,/ : remove / or , or any characters followed by , or , from string returned from first RegExp
var str = "*ksjfl kfj sdfasdfaf dfasf asd sdf a dfasd fdaf ad f afdfaf dfad bla bla ksfajs ldsfskj !Limits::WLo1/WHi1/WHi1/WHi1,WLo2/WHi2/WHi/WHi2,.hier repeated pattern..,WLon/WHin/CLon/CHin! fasdfakl skdfkas sflas fasf sdf afasf";
var res = str.match(/(W.*|C.*)(?=\/|!|,)/g)[0].split(/\/|,.*(?=,)|,/);
document.body.textContent = res.join(" ")
I don't know what the ending delimiter is, so if it matters, update your question and I'll amend this expression:
/(?<=Limits::)(?:(.+?)\/)+/i
Searches for Limits::, then repeating strings ending with /, your words will be in group 1.
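A repeated capture group keeps only its last repetition in most engines, so a two-step approach may be easier to reason about: first isolate the text between !Limits:: and the closing !, then split it on / and ,. The Python sketch below assumes that ! never occurs inside the list; the shortened sample string and the name extract_limits are made up for illustration.

```python
import re

def extract_limits(text: str) -> list[str]:
    """Return the words between '!Limits::' and the next '!'."""
    # Step 1: grab everything between the two delimiters.
    match = re.search(r"!Limits::([^!]*)!", text)
    if match is None:
        return []
    # Step 2: split the captured list on either separator.
    return re.split(r"[/,]", match.group(1))

sample = "bla bla !Limits::WLo1/WHi1/WHi1/WHi1,WLo2/WHi2/WHi/WHi2,WLon/WHin/CLon/CHin! more text"
for word in extract_limits(sample):
    print(word)
```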

Regular Expression: search multiple string with linefeed delimited by ";"

I have a string like this that describes a structured data source:
Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
Every field...
... is starting with "FieldName"
... is ending with ";"
... may contain linefeed
I need a regular expression that finds the values of SampleTestPlan, which appears twice. So...
1st value is:
2
a b
c d
2nd value is
3
e f
g h
i l
I've performed several attempts with such search string:
/SampleTestPlan(.\s)/gm
/SampleTestPlan(.\s);/gm
/SampleTestPlan(.*);/gm
but I need to understand much better how regular expressions work, as I'm definitely a newbie and have a lot to learn.
Thanks in advance to anyone that may help me!
Stefano, Milan, ITALY
You could use the following regex:
(?<=\w\b)[^;]+(?=;)
See it working live here on regex101!
How it works:
It matches everything that is:
preceded by a word character at a word boundary: \w\b
followed by a ;
contains anything (at least one character) except a ; (including newlines).
For example, for that input:
Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
It matches 5 times:
whocares
then:
2
a b
c d
then:
abc
then:
3
e f
g h
i l
then:
01
Assuming your input is always well formatted like the sample, try this:
/SampleTestPlan(\s+\d+.*?);/sg
Here, the /s modifier makes the dot match newline characters.
You can try this online.
That would be /SampleTestPlan([^;]+)/g. [^abc] means any character which is not a, b or c.
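The negated character class approach can be tried in Python as follows. This is a minimal sketch that adds \s+ before the group, an assumption on top of the pattern above, so that the captured value does not start with the separating space.

```python
import re

data = """Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
"""

# [^;]+ also matches newlines, so no DOTALL flag is needed.
values = re.findall(r"SampleTestPlan\s+([^;]+);", data)
for value in values:
    print(value)
    print("---")
```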