How to regex match everything but long words? - regex

I would like to select all long words from a string: re.findall("[a-z]{3,}", text)
However, I am restricted to substitution only, so I need to replace everything except words of 3 or more letters with a space (e.g. abc de1 fgh ij -> abc fgh).
What would such a regex look like?
The result should be all "[a-z]{3,}" matches joined by spaces, produced by substitution alone.
Or in Python: find a regex such that
re.sub(regex, " ", text) == " ".join(re.findall("[a-z]{3,}", text))
Here are some test cases:
import re
solution_regex="..."
for test_str in ["aaa aa aaa aa",
                 "aaa aa11",
                 "11aaa11 11aa11",
                 "aa aa1aa aaaa"
                 ]:
    expected_str = " ".join(re.findall("[a-z]{3,}", test_str))
    print(test_str, "->", expected_str)
    if re.sub(solution_regex, " ", test_str)!=expected_str:
        print("ERROR")
->
aaa aa aaa aa -> aaa aaa
aaa aa11 -> aaa
11aaa11 11aa11 -> aaa
aa aa1aa aaaa -> aaaa
Note that space is no different than any other symbol.

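\b(?:[a-zA-Z_]{1,2}|\w*\d+\w*)\b
Explanation:
\b - a word boundary, so the match must start and end at the edge of a word
(?: ) - non-capturing group
[a-zA-Z_]{1,2} - a word of one or two letters or underscores (commas inside a character class would match literal commas, so they are left out)
\w*\d+\w* - any word that contains at least one digit and consists of digits, '_' and letters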
Here you can see the test.

You can use the regex
(\s\b(\d*[a-z]\d*){1,2}\b)|(\s\b\d+\b)
and replace each match with an empty string. Here is Python code that does this:
import re
regex = r"(\s\b(\d*[a-z]\d*){1,2}\b)|(\s\b\d+\b)"
test_str = "abcd abc ad1r ab a11b a1 11a 1111 1111abcd a1b2c3d"
subst = ""
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0)
if result:
    print (result)
here is a demo

In AutoIt this works for me:
#include <Array.au3>
$a = StringRegExp('abc de1 fgh ij 234234324 sdfsdfsdf wfwfwe', '(?i)[a-z]{3,}', 3)
ConsoleWrite(_ArrayToString($a, ' ') & @CRLF)
Result ==> abc fgh sdfsdfsdf wfwfwe

import re
regex = r"(?:^|\s)[^a-z\s]*[a-z]{0,2}[^a-z\s]*(?:\s|$)"
test_str = "abc de1 fgh ij"  # renamed from str to avoid shadowing the built-in
subst = " "
result = re.sub(regex, subst, test_str)
print (result)
Output:
abc fgh
Explanation:
(?:^|\s) : non capture group, start of string or space
[^a-z\s]* : 0 or more any character that is not letter or space
[a-z]{0,2} : 0, 1 or 2 letters
[^a-z\s]* : 0 or more any character that is not letter or space
(?:\s|$) : non capture group, space or end of string

With the other ideas posted here, I came up with an answer. I can't believe I missed that:
([^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+
https://regex101.com/r/IIxkki/2
Match either non-letters, or up to two letters bounded by non-letters.
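To check that last regex against the test cases from the question, a minimal Python sketch follows. The helper name keep_long_words and the final strip() are my own additions, not part of the answer: a run at the start or end of the string is also replaced by a space, so the raw re.sub result can differ from the " ".join(...) form by one leading or trailing space.

```python
import re

# Regex from the answer above: runs of non-letters, or 1-2 letters
# that are not adjacent to another letter.
SHORT_OR_NONLETTER = re.compile(r"([^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+")

def keep_long_words(text):
    # Hypothetical helper: replace each run with one space.
    # strip() is an added step (assumption), since a run at either end
    # of the string also turns into a space.
    return SHORT_OR_NONLETTER.sub(" ", text).strip()

tests = ["aaa aa aaa aa", "aaa aa11", "11aaa11 11aa11", "aa aa1aa aaaa"]
for t in tests:
    expected = " ".join(re.findall("[a-z]{3,}", t))
    print(repr(t), "->", repr(keep_long_words(t)), keep_long_words(t) == expected)
```

Without the strip(), inputs that begin or end with a short word or non-letters would not satisfy the strict equality from the question.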

Related

Scala regex on a whole column

I have the following pattern that I could parse using pandas in Python, but struggle with translating the code into Scala.
grade string_column
85 (str:ann smith,14)(str:frank chase,15)
86 (str:john foo,15)(str:al more,14)
In python I used:
df.set_index('grade')['string_column']\
.str.extractall(r'\((str:[^,]+),(\d+)\)')\
.droplevel(1)
with the output:
grade 0 1
85 str:ann smith 14
85 str:frank chase 15
86 str:john foo 15
86 str:al more 14
In Scala I tried to duplicate the approach, but it's failing:
import scala.util.matching.Regex
val pattern = new Regex("((str:[^,]+),(\d+)\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
println(pattern findAllIn(str)).mkString(","))
There are a few notes about the code:
The first opening parenthesis is meant to match a literal ( but is unescaped, so it is parsed as an unmatched group; it should be escaped
In a regular string literal the backslashes must be doubled
In the println you don't need all the parentheses and the dot
findAllIn returns a MatchIterator, and iterating it yields each matched string. Joining those matched strings with a comma will in this case give back the same string with a comma between the matches.
For example
import scala.util.matching.Regex
val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
println(pattern findAllIn str mkString ",")
Output
(str:ann smith,14),(str:frank chase,15)
But if you want to print out the group 1 and group 2 values, you can use findAllMatchIn that returns a collection of Regex Matches:
import scala.util.matching.Regex
val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
pattern findAllMatchIn str foreach(m => {
  println(m.group(1))
  println(m.group(2))
})
Output
str:ann smith
14
str:frank chase
15
In Python, Series.str.extractall only returns captured substrings. In Scala, findAllIn yields the whole matched strings unless you query its matchData, whose Match objects expose the captures through subgroups.
So, to get the captures only in Scala, you need to use
val pattern = """\((str:[^,()]+),(\d+)\)""".r
val str = "(str:ann smith,14)(str:frank chase,15)"
(pattern findAllIn str).matchData foreach {
  m => println(m.subgroups.mkString(","))
}
Output:
str:ann smith,14
str:frank chase,15
See the Scala online demo.
Here, m.subgroups accesses all subgroups (captures) of each match (m).
Also, note you do not need to double backslashes in triple-quoted string literals. \((str:[^,()]+),(\d+)\) matches
\( - a ( char
(str:[^,()]+) - Group 1: str: and one or more chars other than ,, ( and )
, - a comma
(\d+) - Group 2: one or more digits
\) - a ) char.
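For comparison with the Python side of the question, here is a small sketch using only the standard re module (not pandas) that separates the whole match from the two captures described above. The variable names whole and captures are mine, chosen for illustration.

```python
import re

# Same pattern as in the Scala answer above.
pattern = re.compile(r"\((str:[^,()]+),(\d+)\)")
text = "(str:ann smith,14)(str:frank chase,15)"

# group(0) is the whole match, including the literal parentheses.
whole = [m.group(0) for m in pattern.finditer(text)]
# groups() holds only the captures: Group 1 and Group 2.
captures = [m.groups() for m in pattern.finditer(text)]

print(whole)
print(captures)
```

Here group(0) plays the role of the string that findAllIn yields, and groups() plays the role of m.subgroups.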
If you just want to get all matches without captures, you can use
val pattern = """\((str:[^,]+),(\d+)\)""".r
println((pattern findAllIn str).matchData.mkString(","))
Output:
(str:ann smith,14),(str:frank chase,15)
See the online demo.

Replace '-' with space if the next character is a letter not a digit and remove when it is at the start

I have a list of strings, i.e.
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
I want to remove the '-' when it is the first character of a string and is followed by letters rather than digits. If there is a digit or letter before the '-' and letters after it, the '-' should be replaced with a space.
So for the list slist I want the output
["args", "-111111", "20 args", "20 - 20", "20-10", "args deep"]
I have tried
import re

slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
nlist = list()
for estr in slist:
    nlist.append(re.sub("((^-[a-zA-Z])|([0-9]*-[a-zA-Z]))", "", estr))
print (nlist)
and I get the output
['rgs', '-111111', 'rgs', '20 - 20', '20-10', 'argseep']
You may use
nlist.append(re.sub(r"-(?=[a-zA-Z])", " ", estr).lstrip())
or
nlist.append(re.sub(r"-(?=[^\W\d_])", " ", estr).lstrip())
Result: ['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']
See the Python demo.
The -(?=[a-zA-Z]) pattern matches a hyphen before an ASCII letter (-(?=[^\W\d_]) matches a hyphen before any letter), and replaces the match with a space. Since - may be matched at the start of a string, the space may appear at that position, so .lstrip() is used to remove the space(s) there.
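Put together as a runnable sketch, assuming the list from the question (the helper name fix_hyphen is hypothetical, introduced only for illustration):

```python
import re

slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]

def fix_hyphen(s):
    # Replace a hyphen that is directly followed by an ASCII letter with a
    # space; the lookahead does not consume the letter, so it is kept.
    # lstrip() then drops the space produced by a leading hyphen.
    return re.sub(r"-(?=[a-zA-Z])", " ", s).lstrip()

nlist = [fix_hyphen(s) for s in slist]
print(nlist)
```

One caveat: lstrip() also removes any whitespace that was already at the start of a string, which does not matter for this input.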
Here, we might just want to capture the first letter after a starting -, then replace it with that letter only, maybe with an i flag expression similar to:
^-([a-z])
DEMO
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"^-([a-z])"
test_str = ("-args\n"
"-111111\n"
"20-args\n"
"20 - 20\n"
"20-10\n"
"args-deep")
subst = "\\1"
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE | re.IGNORECASE)
if result:
    print (result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Demo
const regex = /^-([a-z])/gmi;
const str = `-args
-111111
20-args
20 - 20
20-10
args-deep`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
RegEx
If this expression wasn't desired, it can be modified or changed in regex101.com.
One option could be to do the replacement in two passes. First match the hyphen at the start when only letters follow:
^-(?=[a-zA-Z]+$)
Regex demo
In the replacement use an empty string.
Then capture one or more letters or digits in group 1, match -, and capture one or more letters in group 2:
^([a-zA-Z0-9]+)-([a-zA-Z]+)$
Regex demo
In the replacement use r"\1 \2"
For example
import re
regex1 = r"^-(?=[a-zA-Z]+$)"
regex2 = r"^([a-zA-Z0-9]+)-([a-zA-Z]+)$"
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
slist = list(map(lambda s: re.sub(regex2, r"\1 \2", re.sub(regex1, "", s)), slist))
print(slist)
Result
['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']
Python demo

Difference between (^|\\s)([A-Z]{1,3})(\\s|$) and \\b[A-Z]{1,2}\\b regular expressions in R

I'm trying to clean some short strings (1-3 letters) stored in a column of an R data frame. Specifically, consider the following R script:
df = data.frame( "original" = c("ABCDE FG H",
"IJKL MN OPQRS",
"TUV WX YZ AAAA"))
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}($|\\s)", " ", df$original)
df$filter2 = gsub("\\b[A-Z]{1,2}\\b", " ", df$original)
> df
original | filter1 | filter2 |
1 ABCDE FG H | ABCDE H | ABCDE |
2 IJKL MN OPQRS | IJKL OPQRS | IJKL OPQRS|
3 TUV WX YZ AAAA | TUV YZ AAAA| TUV AAAA |
I don't understand why the first filter (^|\\s)[A-Z]{1,2}($|\\s) doesn't replace "H" in the first row or "YZ" in the third one. I would expect the same result as with \\b[A-Z]{1,2}\\b (the filter2 column). Please don't worry about multiple spaces; they aren't important to me (unless they are the problem :)).
I thought the problem might be how "global" the operation is, that is, that after finding the first match it does not replace the second one, but that isn't the case, as the following replacement shows:
> gsub("A", "X", "AAAABBBBCCCDDDDAAAAAAAEEE")
[1] "XXXXBBBBCCCDDDDXXXXXXXEEE"
So, why are the results different?
The point is that gsub can only match non-overlapping strings. With " FG " as the first expected match and " H" as the second, the two matches would have to share the space between them, so they overlap; after "(^|\\s)[A-Z]{1,2}($|\\s)" consumes the trailing space after FG, H no longer matches the pattern.
Look: ABCDE FG H is analyzed from left to right. The expression matches " FG " (including both spaces), and the regex index is then right before H. Only this letter is left to match, but (^|\s) requires a space or the start of the string, and there is neither at this location.
To "fix" this and keep the same logic, you can use a PCRE regex in gsub with lookarounds:
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}(?=$|\\s)", " ", df$original, perl=TRUE)
or
df$filter1 = gsub("(?<!\\S)[A-Z]{1,2}(?!\\S)", " ", df$original, perl=TRUE)
and if you need to actually consume (to remove) spaces, just add \\s* before (or/and after).
The second expression "\\b[A-Z]{1,2}\\b" contains word boundaries, and they are zero-width assertions that do not consume text, thus, the regex engine can match both FG and H since the spaces are not consumed.
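The same effect can be reproduced outside R. The sketch below uses Python's re module, on the assumption that its left-to-right, non-overlapping matching behaves like gsub for these patterns; the variable names are mine.

```python
import re

row = "ABCDE FG H"

# Consuming version: the space after FG is part of the first match,
# so no space is left in front of H for (^|\s) to match.
consuming = re.sub(r"(^|\s)[A-Z]{1,2}($|\s)", " ", row)

# Lookaround version: the spaces are only asserted, never consumed,
# so both FG and H can match.
lookaround = re.sub(r"(?<!\S)[A-Z]{1,2}(?!\S)", " ", row)

print(repr(consuming))
print(lookaround.split())
```

The first result still contains H, while the second keeps only ABCDE (plus leftover spaces, which split() discards here).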

Regex to match repeated pattern after a string

I need a regex that extracts a pattern after a specific word (here, Limits::).
I have a test string; let's say the text is always enclosed between the delimiters, as in !Limits::****! :
*ksjfl kfj sdfasdfaf dfasf asd sdf a dfasd fdaf ad f afdfaf dfad bla bla ksfajs ldsfskj !Limits::WLo1/WHi1/WHi1/WHi1,WLo2/WHi2/WHi/WHi2,.hier repeated pattern..,WLon/WHin/CLon/CHin!
fasdfakl skdfkas sflas fasf sdf afasf
I want only the words:
WLo1
WHi1
WHi1
WHi1
WLo2
WHi2
WHi
WHi2
.
.
.
WLon
WHin
CLon
CHin
I have tried (?:!\w+::(?:(\w+)/(\w+)/(\w+)/(\w+)))|(?:,(\w+)/(\w+)/(\w+)/(\w+))+.*!, without success.
Regular expressions:
/(W.*|C.*)(?=\/|!|,)/g : match words beginning with W or C followed by / , !, or ,
/\/|,.*(?=,)|,/ : remove / or , or any characters followed by , or , from string returned from first RegExp
var str = "*ksjfl kfj sdfasdfaf dfasf asd sdf a dfasd fdaf ad f afdfaf dfad bla bla ksfajs ldsfskj !Limits::WLo1/WHi1/WHi1/WHi1,WLo2/WHi2/WHi/WHi2,.hier repeated pattern..,WLon/WHin/CLon/CHin! fasdfakl skdfkas sflas fasf sdf afasf";
var res = str.match(/(W.*|C.*)(?=\/|!|,)/g)[0].split(/\/|,.*(?=,)|,/);
document.body.textContent = res.join(" ")
I don't know what the ending delimiter is, so if it matters, update your question and I'll amend this expression:
/(?<=Limits::)(?:(.+?)\/)+/i
It searches for Limits::, then for repeated strings ending with /; the words are captured by group 1. Note that in most engines a repeated capture group keeps only its last iteration.
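A different route, sketched here in Python, is to do it in two steps instead of one regex: first isolate the text between !Limits:: and the closing !, then split it on / and ,. This is not what either answer above does; it assumes the closing delimiter is the next ! and uses a shortened sample string without the "repeated pattern" placeholder.

```python
import re

# Shortened sample (assumption): same shape as the question's string.
text = ("bla bla ksfajs ldsfskj !Limits::WLo1/WHi1/WHi1/WHi1,"
        "WLo2/WHi2/WHi/WHi2,WLon/WHin/CLon/CHin! fasdfakl skdfkas")

# Step 1: capture everything between "!Limits::" and the next "!".
segment = re.search(r"!Limits::([^!]*)!", text)

# Step 2: split the captured segment on the two separators.
words = re.split(r"[/,]", segment.group(1)) if segment else []

print(words)
```

Splitting in a second step avoids the problem that a repeated capture group only reports its final iteration.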

Get groups with regex and OR

I have something like this
AD ABCDEFG HIJKLMN
AB HIJKLMN
AC DJKEJKW SJKLAJL JSHELSJ
Rule: Always 2 Chars Code (AB|AC|AD) at line beginning then any number of 7 Chars codes following.
With this regex:
^(AB|AC|AD)|((\S{7})?)
in this groovy code sample:
def m= Pattern.compile(/^(AB|AC|AD)|((\S{7})?)/).matcher("AC DJKEJKW SJKLAJL JSHELSJ")
println m.getCount()
I always get 8 as the count, which means it also counts the spaces.
How do I get 4 groups (as expected) without spaces ?
Thanks from a not-yet-regex-expert
Sven
Using this code:
def input = [ 'AD ABCDEFG HIJKLMN', 'AB HIJKLMN', 'AC DJKEJKW SJKLAJL JSHELSJ' ]
def regexp = /^(AB|AC|AD)|((\S{7})+)/
def result = input.collect {
  matcher = ( it =~ regexp )
  println "got $matcher.count for $it"
  matcher.collect { it[0] }
}
println result
I get the output
got 3 for AD ABCDEFG HIJKLMN
got 2 for AB HIJKLMN
got 4 for AC DJKEJKW SJKLAJL JSHELSJ
[[AD, ABCDEFG, HIJKLMN], [AB, HIJKLMN], [AC, DJKEJKW, SJKLAJL, JSHELSJ]]
Is this more what you wanted?
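The extra matches in the question come from the ? quantifier: it lets the second alternative match the empty string at positions where no 7-character code starts. A short Python sketch of the same idea (Python's re rather than Groovy, so treat the exact counts as specific to this engine):

```python
import re

line = "AC DJKEJKW SJKLAJL JSHELSJ"

# Optional group: the second alternative can succeed with an empty match.
optional = [m.group(0) for m in re.finditer(r"^(AB|AC|AD)|((\S{7})?)", line)]

# Required group: only real codes can match.
required = [m.group(0) for m in re.finditer(r"^(AB|AC|AD)|(\S{7})", line)]

print(optional)   # includes empty strings between the codes
print(required)
```

Dropping the empty strings from the first list gives the same four items as the second list.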
This pattern will match your requirements
^A[BCD](?:\s\S{7})+
See it here online on Regexr
Meaning: start with A, then either B, C, or D. This is followed by at least one group consisting of a whitespace character followed by 7 non-whitespace characters.
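Since that pattern validates a line rather than returning the individual codes, one possible way to use it is to validate first and split afterwards. A sketch in Python; the function name parse_line and the use of fullmatch (so that the whole line must conform) are my assumptions, not part of the answer:

```python
import re

# Pattern from the answer above.
LINE = re.compile(r"^A[BCD](?:\s\S{7})+")

def parse_line(line):
    # Hypothetical helper: return the codes if the whole line follows
    # the rule, otherwise None.
    if LINE.fullmatch(line):
        return line.split()
    return None

for line in ["AD ABCDEFG HIJKLMN", "AB HIJKLMN", "AC DJKEJKW SJKLAJL JSHELSJ", "AB HIJ"]:
    print(line, "->", parse_line(line))
```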