I would like to select all long words from a string: re.findall("[a-z]{3,}", text)
However, for some reason I can only use substitution. Hence I need to replace everything except words of 3 or more letters with a space (e.g. abc de1 fgh ij -> abc fgh).
What would such a regex look like?
The result should be all matches of "[a-z]{3,}" joined by spaces, using substitution only.
Or in Python: find a regex such that
re.sub(regex, " ", text) == " ".join(re.findall("[a-z]{3,}", text))
Here are some test cases:
import re

solution_regex = "..."
for test_str in ["aaa aa aaa aa",
                 "aaa aa11",
                 "11aaa11 11aa11",
                 "aa aa1aa aaaa"]:
    expected_str = " ".join(re.findall("[a-z]{3,}", test_str))
    print(test_str, "->", expected_str)
    if re.sub(solution_regex, " ", test_str) != expected_str:
        print("ERROR")
Output:
aaa aa aaa aa -> aaa aaa
aaa aa11 -> aaa
11aaa11 11aa11 -> aaa
aa aa1aa aaaa -> aaaa
Note that space is no different than any other symbol.
\b(?:[a-zA-Z_]{1,2}|\w*\d+\w*)\b
Explanation:
\b - the matched substring must start and end at a word boundary
(?: ) - non-capturing group
[a-zA-Z_]{1,2} - any word of one or two letters or underscores
\w*\d+\w* - any word that contains at least one digit and consists of digits, '_' and letters
Here you can see the test.
You can use the regex
(\s\b(\d*[a-z]\d*){1,2}\b)|(\s\b\d+\b)
and replace each match with an empty string. Here is Python code that does this:
import re

regex = r"(\s\b(\d*[a-z]\d*){1,2}\b)|(\s\b\d+\b)"
test_str = "abcd abc ad1r ab a11b a1 11a 1111 1111abcd a1b2c3d"
subst = ""
# count=0 replaces every match; set count to limit the number of replacements
result = re.sub(regex, subst, test_str, count=0)
if result:
    print(result)
here is a demo
In Autoit this works for me
#include <Array.au3>
$a = StringRegExp('abc de1 fgh ij 234234324 sdfsdfsdf wfwfwe', '(?i)[a-z]{3,}', 3)
ConsoleWrite(_ArrayToString($a, ' ') & @CRLF)
Result ==> abc fgh sdfsdfsdf wfwfwe
import re

regex = r"(?:^|\s)[^a-z\s]*[a-z]{0,2}[^a-z\s]*(?:\s|$)"
text = "abc de1 fgh ij"  # renamed from str to avoid shadowing the builtin
subst = " "
result = re.sub(regex, subst, text)
print(result)
Output:
abc fgh
Explanation:
(?:^|\s) : non capture group, start of string or space
[^a-z\s]* : 0 or more any character that is not letter or space
[a-z]{0,2} : 0, 1 or 2 letters
[^a-z\s]* : 0 or more any character that is not letter or space
(?:\s|$) : non capture group, space or end of string
With the other ideas posted here, I came up with an answer. I can't believe I missed that:
([^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+
https://regex101.com/r/IIxkki/2
Match either non-letters, or up to two letters bounded by non-letters.
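A quick way to check this pattern against the test cases from the question is sketched below. One assumption that is not part of the answer: the result is passed through strip(), because a replaced run at the very start or end of the string leaves a single space there, which the " ".join(...) expectation does not have. The helper name keep_long_words is made up for this illustration.

```python
import re

# Pattern from the answer above: a run made of non-letters and of
# one- or two-letter words that are not part of a longer word.
pattern = re.compile(r"([^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+")

def keep_long_words(text):
    # Replace each such run with one space, then trim the space that a run
    # at either end of the string leaves behind (assumption, see above).
    return pattern.sub(" ", text).strip()

for test_str in ["aaa aa aaa aa", "aaa aa11", "11aaa11 11aa11", "aa aa1aa aaaa"]:
    expected_str = " ".join(re.findall("[a-z]{3,}", test_str))
    print(repr(test_str), "->", repr(keep_long_words(test_str)))
    assert keep_long_words(test_str) == expected_str
```

Without the strip() call, the first test case would yield "aaa aaa " with a trailing space, so whether the bare pattern is acceptable depends on how strictly the equality in the question is meant.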
Suppose I have the string str = "aabaa".
Its distinct (non-repeated) substrings are:
a
b
aa
ab
ba
aab
aba
baa
aaba
abaa
aabaa
Compute the suffix array and the longest common prefix array thereof.
a
    1
aa
    2
aabaa
    1
abaa
    0
baa
Return (n+1)n/2, the number of substring bounds, minus the sum of the longest common prefix array.
(5+1)5/2 - (1+2+1+0) = 15 - 4 = 11.
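The computation above can be sketched in Python roughly as follows. This is a minimal illustration rather than an efficient implementation: it builds the suffix array by sorting the suffixes directly and compares neighbours character by character, which is fine for short strings but slow for long ones. The function names are hypothetical.

```python
def common_prefix_length(a, b):
    # Length of the longest common prefix of two strings.
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k

def count_distinct_substrings(s):
    n = len(s)
    # Naive suffix array: all suffixes in sorted order.
    suffixes = sorted(s[i:] for i in range(n))
    # LCP array: common prefix length of each pair of adjacent suffixes.
    lcp = [common_prefix_length(a, b) for a, b in zip(suffixes, suffixes[1:])]
    # All (start, end) pairs minus the substrings counted more than once.
    return n * (n + 1) // 2 - sum(lcp)

print(count_distinct_substrings("aabaa"))  # 11
```

For "aabaa" the sorted suffixes are a, aa, aabaa, abaa, baa and the LCP values are 1, 2, 1, 0, matching the listing above.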
Is there a way to replace some string in a text using the original matched values?
For instance, I would like to replace all the integers by decimals, as in the following example:
"hello 45 hello 4 bye" --> "hello 45.0 hello 4.0 bye"
I could match all the numbers with findAllIn and then replace them, but I would like to know if there is a better solution.
With regular expressions, you can use $1 to refer to the first capturing group (in parentheses):
val regex = "(\\d+)".r
val text = "hello 45 hello 4 bye"
val result = regex.replaceAllIn(text, "$1.0")
// result: String = hello 45.0 hello 4.0 bye
Use the overload of replaceAllIn that takes a replacer function:
http://www.scala-lang.org/api/current/index.html#scala.util.matching.Regex#replaceAllIn(target:CharSequence,replacer:scala.util.matching.Regex.Match=>String):String
I have a pattern like this:
@ or # + 1 or 2 words + : + 1 or more words + link
So, as you can see, a sentence starts with @ or # and ends with a link,
like:
#justin Trudue:I do not [go there][1]
I wrote the following code to check any sentence with this pattern:
private static void patternFinder(String commentstr) {
    String urlPattern = "@|#({1}|{2}):\\w+(https://\\w+|http://\\w+)";
    Pattern p = Pattern.compile(urlPattern, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    if (m.find()) {
        System.out.println("yes");
    }
}
But this does not work. For example, the following sentence should print "yes", but nothing happens:
#hello sss: xxx
Is there anything wrong with my regex?
Give this a try:
[@#]((?:\w+\s?){1,2}):\s?((?:\w+\s?){1,})((?:http|https):\/\/.+)
Test input:
#hello sss: xxx https://t.co/3WHshzDG7m
#hello sss: xxx another word https://t.co/3WHshzDG7m
#hello sss third: xxx another word https://t.co/3WHshzDG7m
Result
MATCH 1
[1-10] hello sss
[12-16] xxx
[16-39] https://t.co/3WHshzDG7m
MATCH 2
[41-50] hello sss
[52-69] xxx another word
[69-92] https://t.co/3WHshzDG7m
Online demo https://regex101.com/r/zU7lP2/1
Another version, if you do not want to fix the link protocol:
[@#]((?:\w+\s?){1,2}):\s?((?:\w+\s?){1,})((?<=\s)\w+:\/\/.+)
Online demo https://regex101.com/r/zU7lP2/2
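As a rough cross-check of the first pattern, here is a small sketch using Python's re module instead of java.util.regex. It assumes the leading character is either @ or #, and it strips the trailing space that the second group picks up. The function name parse_comment is invented for this example.

```python
import re

# First pattern from the answer, with the fixed http/https protocol.
COMMENT = re.compile(
    r"[@#]((?:\w+\s?){1,2}):\s?((?:\w+\s?){1,})((?:http|https)://.+)"
)

def parse_comment(line):
    # Return (name, text, link), or None if the line does not fit the pattern.
    m = COMMENT.search(line)
    if m is None:
        return None
    name, text, link = (group.strip() for group in m.groups())
    return name, text, link

for line in [
    "#hello sss: xxx https://t.co/3WHshzDG7m",
    "#hello sss: xxx another word https://t.co/3WHshzDG7m",
    "#hello sss third: xxx another word https://t.co/3WHshzDG7m",
]:
    print(parse_comment(line))
```

The third line has three words before the colon, so it yields None, which is consistent with only two matches appearing in the result listing above.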
I have something like this
AD ABCDEFG HIJKLMN
AB HIJKLMN
AC DJKEJKW SJKLAJL JSHELSJ
Rule: each line begins with a 2-character code (AB|AC|AD), followed by any number of 7-character codes.
With this regex:
^(AB|AC|AD)|((\S{7})?)
in this groovy code sample:
def m= Pattern.compile(/^(AB|AC|AD)|((\S{7})?)/).matcher("AC DJKEJKW SJKLAJL JSHELSJ")
println m.getCount()
I always get 8 as the count, which means it also counts the spaces.
How do I get 4 groups (as expected) without the spaces?
Thanks from a not-yet-regex-expert
Sven
Using this code:
def input = [ 'AD ABCDEFG HIJKLMN', 'AB HIJKLMN', 'AC DJKEJKW SJKLAJL JSHELSJ' ]
def regexp = /^(AB|AC|AD)|((\S{7})+)/
def result = input.collect {
    matcher = ( it =~ regexp )
    println "got $matcher.count for $it"
    matcher.collect { it[0] }
}
println result
I get the output
got 3 for AD ABCDEFG HIJKLMN
got 2 for AB HIJKLMN
got 4 for AC DJKEJKW SJKLAJL JSHELSJ
[[AD, ABCDEFG, HIJKLMN], [AB, HIJKLMN], [AC, DJKEJKW, SJKLAJL, JSHELSJ]]
Is this more what you wanted?
This pattern will match your requirements
^A[BCD](?:\s\S{7})+
See it here online on Regexr
That is: start with A, followed by B, C, or D. This is followed by at least one group consisting of a whitespace character and 7 non-whitespace characters.
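To illustrate how this pattern could be used to validate a line and then pull out the codes, here is a minimal Python sketch (not Groovy). It relies on two assumptions that go beyond the answer: the whole line must match, which is enforced with fullmatch, and the codes are then obtained by splitting on whitespace. The name parse_line is hypothetical.

```python
import re

# Pattern from the answer: A plus B, C or D, then one or more groups of
# one whitespace character followed by exactly 7 non-whitespace characters.
LINE = re.compile(r"A[BCD](?:\s\S{7})+")

def parse_line(line):
    # Assumption: the entire line has to fit the rule, so use fullmatch.
    if LINE.fullmatch(line) is None:
        return None
    # Once the shape is confirmed, the codes are the whitespace-separated fields.
    return line.split()

for line in ["AD ABCDEFG HIJKLMN", "AB HIJKLMN", "AC DJKEJKW SJKLAJL JSHELSJ"]:
    print(parse_line(line))
```

Separating validation from extraction sidesteps the counting problem in the question, since the number of codes is simply the length of the returned list.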