I want to split a string on non-alphanumeric characters, except where a particular pattern occurs.
Example :
string_1 = "section (ab) 5(a)"
string_2 = "section -bd, 6(1b)(2)"
string_3 = "section - ac - 12(c)"
string_4 = "Section (ab) 5(1a)(cf) (ad)"
string_5 = "section (ab) 5(a) test (ab) 5 6(ad)"
I want to split these strings so that I get the output below:
["section", "ab", "5(a)"]
["section", "bd", "6(1b)(2)"]
["section", "ac", "12(c)"]
["section", "ab", "5(1a)(cf)", "ad"]
["section", "ab", "5(a)", "test", "ab", "5", "6(ad)"]
To be more precise, I want to split on every non-alphanumeric character except those that are part of this pattern: \d+([\w\(\)]+)
This can be achieved with the following regex in findall:
\b\w+(?:\([^)]*\))*
RegEx Demo
Code:
>>> import re
>>> reg = re.compile(r'\b\w+(?:\([^)]*\))*')
>>> arr = ['section (ab) 5(a)', 'section -bd, 6(1b)(2)', 'section - ac - 12(c)', 'Section (ab) 5(1a)(cf) (ad)', 'section (ab) 5(a) test (ab) 5 6(ad)']
>>> for el in arr:
...     print(reg.findall(el))
...
['section', 'ab', '5(a)']
['section', 'bd', '6(1b)(2)']
['section', 'ac', '12(c)']
['Section', 'ab', '5(1a)(cf)', 'ad']
['section', 'ab', '5(a)', 'test', 'ab', '5', '6(ad)']
You can use
\d+[\w()]+|\w+
See the regex demo.
Details
\d+[\w()]+ - 1+ digits and then 1+ word or ( or ) chars
| - or
\w+ - 1+ word chars.
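As a quick check in Python, the sketch below runs this pattern through re.findall on the five sample strings from the question. The names token_re, samples and results are only illustrative.

```python
import re

# Pattern from the answer above: digits followed by word/paren chars,
# or otherwise a plain run of word chars.
token_re = re.compile(r"\d+[\w()]+|\w+")

samples = [
    "section (ab) 5(a)",
    "section -bd, 6(1b)(2)",
    "section - ac - 12(c)",
    "Section (ab) 5(1a)(cf) (ad)",
    "section (ab) 5(a) test (ab) 5 6(ad)",
]

results = [token_re.findall(s) for s in samples]
for tokens in results:
    print(tokens)
```

Note that the first alternative has to come first: if \w+ were tried first, "5(a)" would be cut down to "5".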
In ElasticSearch, use
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": "\\d+[\\w()]+|\\w+",
"group": 0
}
}
Related
sample_string = "let's could've they'll you're won't"
sample_string.scan(/\w+/)
Above gives me:
["let", "s", "could", "ve", "they", "ll", "you", "re", "won", "t"]
What I want:
["let", "could", "they", "you", "won"]
Been playing around in https://rubular.com/ and trying assertions like \w+(?<=') but no luck.
Given:
> sample_string = "let's could've they'll you're won't"
You can do split and map:
> sample_string.split.map{|w| w.split(/'/)[0]}
=> ["let", "could", "they", "you", "won"]
You can use
sample_string.scan(/(?<![\w'])\w+/)
sample_string.scan(/\b(?<!')\w+/)
See the Rubular demo. The two patterns are equivalent and match:
(?<![\w']) - a location in the string that is not immediately preceded with a word or ' char
\b(?<!') - a word boundary position which is not immediately preceded with a ' char
\w+ - one or more word chars.
See the Ruby demo:
sample_string = "let's could've they'll you're won't"
p sample_string.scan(/(?<![\w'])\w+/)
# => ["let", "could", "they", "you", "won"]
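If you need the same result in Python rather than Ruby, a rough equivalent might look like the sketch below. It assumes Python's re module, which accepts this fixed-width negative lookbehind, and re.findall plays the role of String#scan. The variable names are illustrative.

```python
import re

sample_string = "let's could've they'll you're won't"

# (?<![\w']) : not immediately preceded by a word char or an apostrophe
# \w+        : one or more word chars
words = re.findall(r"(?<![\w'])\w+", sample_string)
print(words)
```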
I would like to capture groups based on a consecutive occurrence of matched groups in any order. And when one set type is repeated without the alternative set type, the alternative set is returned as nil.
So the following:
"123 dog cat cow 456 678 890 sheep"
Would return the following:
[["123", "dog"], [nil, "cat"], ["456", "cow"], ["678", nil], ["890", "sheep"]]
A regular expression can get us part of the way, but I do not believe all the way.
r = /
    (?:          # begin non-capture group
      \d+        # match 1+ digits
      [ ]        # match 1 space
      [^ \d]+    # match 1+ chars other than digits and spaces
    |            # or
      [^ \d]+    # match 1+ chars other than digits and spaces
      [ ]        # match 1 space
      \d+        # match 1+ digits
    |            # or
      [^ ]+      # match 1+ chars other than spaces
    )            # end non-capture group
    /x           # free-spacing regex definition mode
str = "123 dog cat cow 456 678 890 sheep"
str.scan(r).map do |s|
  case s
  when /\d [^ \d]/
    s.split(' ')
  when /[^ \d] \d/
    s.split(' ').reverse
  when /\d/
    [s,nil]
  else
    [nil,s]
  end
end
#=> [["123", "dog"], [nil, "cat"], ["456", "cow"],
# ["678", nil], ["890", "sheep"]]
Note:
str.scan r
#=> ["123 dog", "cat", "cow 456", "678", "890 sheep"]
This regular expression is conventionally written
/(?:\d+ [^ \d]+|[^ \d]+ \d+|[^ ]+)/
Here is another solution that only uses regular expressions incidentally.
def doit(str)
  str.gsub(/[^ ]+/).with_object([]) do |s,a|
    prev = a.empty? ? [0,'a'] : a.last
    case s
    when /\A\d+\z/ # all digits
      if prev[0].nil?
        a[-1][0] = s
      else
        a << [s,nil]
      end
    when /\A\D+\z/ # all non-digits
      if prev[1].nil?
        a[-1][1] = s
      else
        a << [nil,s]
      end
    else
      raise ArgumentError
    end
  end
end
doit str
#=> [["123", "dog"], [nil, "cat"], ["456", "cow"], ["678", nil],
# ["890", "sheep"]]
This uses the form of String#gsub that takes no block and therefore returns an enumerator:
enum = str.gsub(/[^ ]+/)
#=> #<Enumerator: "123 dog cat cow 456 678 890 sheep":gsub(/[^ ]+/)>
enum.next
#=> "123"
enum.next
#=> "dog"
...
enum.next
#=> "sheep"
enum.next
#=> StopIteration (iteration reached an end)
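For comparison, here is a minimal Python sketch of the same pairing idea used in doit. It is not a translation of the Ruby enumerator mechanics: it simply walks the whitespace-separated tokens and fills the empty slot of the previous pair when it can. The function name pair_tokens is made up for this example, and None stands in for Ruby's nil.

```python
import re

def pair_tokens(text):
    """Pair consecutive number/word tokens; a missing partner is None."""
    pairs = []
    for tok in text.split():
        if re.fullmatch(r"\d+", tok):        # all digits
            if pairs and pairs[-1][0] is None:
                pairs[-1][0] = tok           # complete the previous [None, word]
            else:
                pairs.append([tok, None])
        elif re.fullmatch(r"\D+", tok):      # no digits at all
            if pairs and pairs[-1][1] is None:
                pairs[-1][1] = tok           # complete the previous [number, None]
            else:
                pairs.append([None, tok])
        else:                                # mixed token such as "a1"
            raise ValueError(f"unexpected token: {tok!r}")
    return pairs

print(pair_tokens("123 dog cat cow 456 678 890 sheep"))
```

As in the Ruby version, a token that mixes digits and non-digits is treated as an error rather than guessed at.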
I have a list of strings, for example:
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
I want to handle the '-' in each string as follows. If it is the first character and is followed by letters (not digits), it should be removed. If there is a letter or digit before the '-' and letters after it, the '-' should be replaced with a space.
So for the list slist I want the output as
["args", "-111111", "20 args", "20 - 20", "20-10", "args deep"]
I have tried
import re
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
nlist = list()
for estr in slist:
    nlist.append(re.sub("((^-[a-zA-Z])|([0-9]*-[a-zA-Z]))", "", estr))
print(nlist)
and I get the output
['rgs', '-111111', 'rgs', '20 - 20', '20-10', 'argseep']
You may use
nlist.append(re.sub(r"-(?=[a-zA-Z])", " ", estr).lstrip())
or
nlist.append(re.sub(r"-(?=[^\W\d_])", " ", estr).lstrip())
Result: ['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']
See the Python demo.
The -(?=[a-zA-Z]) pattern matches a hyphen before an ASCII letter (-(?=[^\W\d_]) matches a hyphen before any letter), and replaces the match with a space. Since - may be matched at the start of a string, the space may appear at that position, so .lstrip() is used to remove the space(s) there.
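Put together as a self-contained script, the approach described above could be sketched like this. The list comprehension is my own framing; the pattern and the .lstrip() call are the ones from the answer.

```python
import re

slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]

# Replace a hyphen that is directly followed by a letter with a space,
# then strip the space left behind when the hyphen was the first character.
nlist = [re.sub(r"-(?=[^\W\d_])", " ", s).lstrip() for s in slist]
print(nlist)
```

The lookahead does not consume the letter, so only the hyphen itself is replaced.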
Here, we might just want to capture the first letter after a starting -, then replace it with that letter only, maybe with an i flag expression similar to:
^-([a-z])
DEMO
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"^-([a-z])"
test_str = ("-args\n"
"-111111\n"
"20-args\n"
"20 - 20\n"
"20-10\n"
"args-deep")
subst = "\\1"
# You can limit the number of replacements by changing the count argument (0 means no limit)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE | re.IGNORECASE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Demo
const regex = /^-([a-z])/gmi;
const str = `-args
-111111
20-args
20 - 20
20-10
args-deep`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
RegEx
If this expression wasn't desired, it can be modified or changed in regex101.com.
One option is to do two replacements. First, match the hyphen at the start of the string when only letters follow it:
^-(?=[a-zA-Z]+$)
Regex demo
In the replacement use an empty string.
Then capture one or more letters or digits in group 1, match the -, and capture one or more letters in group 2:
^([a-zA-Z0-9]+)-([a-zA-Z]+)$
Regex demo
In the replacement use r"\1 \2"
For example
import re
regex1 = r"^-(?=[a-zA-Z]+$)"
regex2 = r"^([a-zA-Z0-9]+)-([a-zA-Z]+)$"
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
slist = list(map(lambda s: re.sub(regex2, r"\1 \2", re.sub(regex1, "", s)), slist))
print(slist)
Result
['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']
Python demo
In PhpStorm IDE, using the search and replace feature, I'm trying to add .jpg to all strings between quotes that come after $colorsfiles = [ and before the closing ].
$colorsfiles = ["Blue", "Red", "Orange", "Black", "White", "Golden", "Green", "Purple", "Yellow", "cyan", "Gray", "Pink", "Brown", "Sky Blue", "Silver"];
If the "abc" is not in between $colorsfiles = [ and ], there should be no replacement.
The regex that I'm using is
$colorsfiles = \[("(\w*?)", )*
and replace string is
$colorsfiles = ["$2.jpg"]
The current result is
$colorsfiles = ["Brown.jpg"]"Sky Blue", "Silver"];
While the expected output is
$colorsfiles = ["Blue.jpg", "Red.jpg", "Orange.jpg", "Black.jpg", "White.jpg", "Golden.jpg", "Green.jpg", "Purple.jpg", "Yellow.jpg", "cyan.jpg", "Gray.jpg", "Pink.jpg", "Brown.jpg", "Sky Blue.jpg", "Silver.jpg"];
You should have mentioned that you are doing this in an IDE.
I don't use PhpStorm, so I'm posting a solution tested in NetBeans.
Find : "([\w ]+)"([\,\]]{1})
Replace : "$1\.jpg"$2
Why do you need a regex for this? A simple array_map() will do the trick.
<?php
function addExtension($color)
{
return $color.".jpg";
}
$colorsfiles = ["Blue", "Red", "Orange", "Black", "White", "Golden", "Green", "Purple", "Yellow", "cyan", "Gray", "Pink", "Brown", "Sky Blue", "Silver"];
$colorsfiles_with_extension = array_map("addExtension", $colorsfiles);
print_r($colorsfiles_with_extension);
?>
Edit: I've now tested this in PhpStorm. Use the following:
search:
"([a-zA-Z\s]+)"
replace_all:
"$1.jpg"
You may use
(\G(?!^)",\s*"|\$colorsfiles\s*=\s*\[")([^"]+)
and replace with $1$2.jpg. See this regex demo.
The regex matches $colorsfiles = [" or the end of the previous match followed with "," while capturing these texts into Group 1 (later referred to with $1 placeholder) and then captures into Group 2 (later referred to with $2) one or more chars other than a double quotation mark.
Details
(\G(?!^)",\s*"|\$colorsfiles\s*=\s*\[") -
\G(?!^)",\s*" - the end of the previous match (\G(?!^)), ", substring, 0+ whitespaces (\s*) and a " char
| - or
\$colorsfiles\s*=\s*\[" - $colorsfiles, 0+ whitespaces (\s*), =, 0+ whitespaces, [" (note that $ and [ must be escaped to match literal chars)
([^"]+) - Capturing group 2: one or more (+) chars other than " (the negated character class, [^"])
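Outside an IDE, the same scoping can be done in a script. The sketch below is in Python and deliberately swaps in a different technique: Python's built-in re module has no \G anchor, so instead of chaining matches it first locates the $colorsfiles = [...] literal and then rewrites the quoted strings inside it with a nested substitution. The function name add_extension and the sample input are hypothetical.

```python
import re

def add_extension(source, ext=".jpg"):
    """Append ext to every quoted string inside the $colorsfiles = [...] literal."""
    def fix_array(m):
        # m.group(2) is the text between the brackets; rewrite each "..." in it.
        body = re.sub(r'"([^"]+)"', lambda q: '"' + q.group(1) + ext + '"', m.group(2))
        return m.group(1) + body + m.group(3)

    # Group 1: opening "$colorsfiles = [", group 2: array body, group 3: closing "]".
    return re.sub(r'(\$colorsfiles\s*=\s*\[)([^\]]*)(\])', fix_array, source)

php = '$colorsfiles = ["Blue", "Sky Blue", "Silver"]; $other = ["abc"];'
converted = add_extension(php)
print(converted)
```

Strings in other arrays, such as "abc" in the made-up $other array, are left untouched because only the matched literal is rewritten. This assumes the array body contains no ] and no escaped quotes.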
I've been trying to write this regex for a while now. I'd like to create one that matches all the spaces in a text, except those inside a string literal.
Example:
123 Foo "String with spaces"
Space between 123 and Foo would match, as well as the one between Foo and "String with spaces", but only those two.
Thanks
A common, simple strategy for this is to count the number of quotes leading up to your location in the string. If the count is odd, you are inside a quoted string; if the amount is even, you are outside a quoted string. I can't think of a way to do this in regular expressions, but you could use this strategy to filter the results.
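A minimal sketch of that quote-counting strategy in Python, without any regex, might look like this. It assumes double quotes are the only delimiters and that there are no escaped quotes inside strings; the function name spaces_outside_quotes is made up for the example. Tracking a boolean that flips on every quote is equivalent to checking whether the count of quotes seen so far is odd or even.

```python
def spaces_outside_quotes(text):
    """Return the indexes of spaces that are not inside a double-quoted string."""
    positions = []
    inside = False                 # True while an odd number of quotes has been seen
    for i, ch in enumerate(text):
        if ch == '"':
            inside = not inside    # each quote flips the parity
        elif ch == " " and not inside:
            positions.append(i)
    return positions

found = spaces_outside_quotes('123 Foo "String with spaces"')
print(found)
```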
You could use re.findall to match either a string or a space and then afterwards inspect the matches:
import re
# Raw string, so every backslash below reaches the regex engine unchanged.
pattern = r'"(?:\\.|[^\\"])*"|[ ]'
hits = re.findall(pattern, 'foo bar baz "another\\" test" and done')
for h in hits:
    print("found: [%s]" % h)
yields:
found: [ ]
found: [ ]
found: [ ]
found: ["another\" test"]
found: [ ]
found: [ ]
A short explanation:
"          # match a double quote
(?:        # start non-capture group 1
  \\.      # match a backslash followed by any character (except line breaks)
  |        # OR
  [^\\"]   # match any character except a '\' and '"'
)*         # end non-capture group 1 and repeat it zero or more times
"          # match a double quote
|          # OR
[ ]        # match a single space
If every line has the structure 123 Foo "String with spaces", that is, unquoted text followed by quoted text, you could capture the unquoted and the quoted text in two groups and handle them separately.
For example, with the regex (.*)(".*"), $1 contains 123 Foo (including its trailing space) and $2 contains "String with spaces".
Java example:
String aux = "123 Foo \"String with spaces\"";
String regex = "(.*)(\".*\")";
String unquoted = aux.replaceAll(regex, "$1").replace(" ", "");
String quoted = aux.replaceAll(regex, "$2");
System.out.println(unquoted+quoted);
JavaScript example:
<SCRIPT LANGUAGE="JavaScript">
<!--
var str = '1 23 Foo "String with spaces"';
var re = new RegExp('(.*)(".*")');
var unquoted = str.replace(re, "$1"); // text before the quoted part
var quoted = str.replace(re, "$2");   // the quoted part itself
document.write(unquoted.split(' ').join('') + quoted);
// -->
</SCRIPT>