split string with negative regex pattern - regex

I want to split a string on non-alphanumeric characters, except where they are part of a particular pattern.
Example:
string_1 = "section (ab) 5(a)"
string_2 = "section -bd, 6(1b)(2)"
string_3 = "section - ac - 12(c)"
string_4 = "Section (ab) 5(1a)(cf) (ad)"
string_5 = "section (ab) 5(a) test (ab) 5 6(ad)"
I want to split these strings so that I get the output below:
["section", "ab", "5(a)"]
["section", "bd", "6(1b)(2)"]
["section", "ac", "12(c)"]
["section", "ab", "5(1a)(cf)", "ad"]
["section", "ab", "5(a)", "test", "ab", "5", "6(ad)"]
To be more precise, I want to split on every non-alphanumeric character except those that are part of a match of the pattern \d+([\w\(\)]+).

This can be achieved by passing the following regex to findall:
\b\w+(?:\([^)]*\))*
RegEx Demo
Code:
>>> import re
>>> reg = re.compile(r'\b\w+(?:\([^)]*\))*')
>>> arr = ['section (ab) 5(a)', 'section -bd, 6(1b)(2)', 'section - ac - 12(c)', 'Section (ab) 5(1a)(cf) (ad)', 'section (ab) 5(a) test (ab) 5 6(ad)']
>>> for el in arr:
...     print(reg.findall(el))
...
['section', 'ab', '5(a)']
['section', 'bd', '6(1b)(2)']
['section', 'ac', '12(c)']
['Section', 'ab', '5(1a)(cf)', 'ad']
['section', 'ab', '5(a)', 'test', 'ab', '5', '6(ad)']

You can use
\d+[\w()]+|\w+
See the regex demo.
Details
\d+[\w()]+ - 1+ digits and then 1+ word or ( or ) chars
| - or
\w+ - 1+ word chars.
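To check the alternation outside Elasticsearch, it can be run through Python's re.findall. This is only a minimal sketch; the names TOKEN_RE and tokenize are made up for illustration and do not come from the answer.

```python
import re

# Try the "digits followed by word chars or parentheses" branch first,
# then fall back to a plain run of word characters.
TOKEN_RE = re.compile(r"\d+[\w()]+|\w+")

def tokenize(text):
    """Return every token matched by TOKEN_RE, in order of appearance."""
    return TOKEN_RE.findall(text)

samples = [
    "section (ab) 5(a)",
    "section -bd, 6(1b)(2)",
    "Section (ab) 5(1a)(cf) (ad)",
    "section (ab) 5(a) test (ab) 5 6(ad)",
]
for sample in samples:
    print(tokenize(sample))
```

The order of the alternatives matters: with \w+ listed first, 5(a) would be split into 5 and a.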
In Elasticsearch, use
"tokenizer": {
  "my_tokenizer": {
    "type": "pattern",
    "pattern": "\\d+[\\w()]+|\\w+",
    "group": 0
  }
}

Related

Break string into words using scan method + regexp, if word has `'` character, drop this character and everything after it

sample_string = "let's could've they'll you're won't"
sample_string.scan(/\w+/)
Above gives me:
["let", "s", "could", "ve", "they", "ll", "you", "re", "won", "t"]
What I want:
["let", "could", "they", "you", "won"]
Been playing around in https://rubular.com/ and trying assertions like \w+(?<=') but no luck.
Given:
> sample_string = "let's could've they'll you're won't"
You can do split and map:
> sample_string.split.map{|w| w.split(/'/)[0]}
=> ["let", "could", "they", "you", "won"]
You can use
sample_string.scan(/(?<![\w'])\w+/)
sample_string.scan(/\b(?<!')\w+/)
See the Rubular demo. The two patterns are equivalent; they match
(?<![\w']) - a location in the string that is not immediately preceded with a word or ' char
\b(?<!') - a word boundary position which is not immediately preceded with a ' char
\w+ - one or more word chars.
See the Ruby demo:
sample_string = "let's could've they'll you're won't"
p sample_string.scan(/(?<![\w'])\w+/)
# => ["let", "could", "they", "you", "won"]
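The same lookbehind also works outside Ruby. As a rough sketch, here is the first pattern with Python's re.findall (a single-character lookbehind such as [\w'] is fixed-width, so Python's re accepts it); the variable name words is only illustrative.

```python
import re

sample_string = "let's could've they'll you're won't"

# (?<![\w']) rejects any start position that is preceded by a word character
# or an apostrophe, so the fragment after each apostrophe is never matched.
words = re.findall(r"(?<![\w'])\w+", sample_string)
print(words)  # ['let', 'could', 'they', 'you', 'won']
```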

Regex capture optional group in any order

I would like to capture groups based on a consecutive occurrence of matched groups in any order. And when one set type is repeated without the alternative set type, the alternative set is returned as nil.
So the following:
"123 dog cat cow 456 678 890 sheep"
Would return the following:
[["123", "dog"], [nil, "cat"], ["456", "cow"], ["678", nil], ["890", "sheep"]]
A regular expression can get us part of the way, but I do not believe all the way.
r = /
    (?:        # begin non-capture group
      \d+      # match 1+ digits
      [ ]      # match 1 space
      [^ \d]+  # match 1+ chars other than digits and spaces
    |          # or
      [^ \d]+  # match 1+ chars other than digits and spaces
      [ ]      # match 1 space
      \d+      # match 1+ digits
    |          # or
      [^ ]+    # match 1+ chars other than spaces
    )          # end non-capture group
    /x         # free-spacing regex definition mode
str = "123 dog cat cow 456 678 890 sheep"
str.scan(r).map do |s|
  case s
  when /\d [^ \d]/
    s.split(' ')
  when /[^ \d] \d/
    s.split(' ').reverse
  when /\d/
    [s,nil]
  else
    [nil,s]
  end
end
#=> [["123", "dog"], [nil, "cat"], ["456", "cow"],
# ["678", nil], ["890", "sheep"]]
Note:
str.scan r
#=> ["123 dog", "cat", "cow 456", "678", "890 sheep"]
This regular expression is conventionally written
/(?:\d+ [^ \d]+|[^ \d]+ \d+|[^ ]+)/
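For readers more at home in Python, the scan-then-classify idea can be sketched roughly as follows. This is an approximation of the Ruby code above, not a line-by-line port; CHUNK_RE and pair_chunks are hypothetical names.

```python
import re

# Same three alternatives as the Ruby regex: "number word", "word number",
# or a single token on its own.
CHUNK_RE = re.compile(r"\d+ [^ \d]+|[^ \d]+ \d+|[^ ]+")

def pair_chunks(text):
    """Return [number, word] pairs, with None for a missing member."""
    pairs = []
    for chunk in CHUNK_RE.findall(text):
        parts = chunk.split(" ")
        if len(parts) == 2:
            # Two-part chunks come from the first two alternatives;
            # put the numeric part first whichever order it appeared in.
            if not parts[0].isdigit():
                parts.reverse()
            pairs.append(parts)
        elif re.search(r"\d", chunk):
            pairs.append([chunk, None])
        else:
            pairs.append([None, chunk])
    return pairs

print(pair_chunks("123 dog cat cow 456 678 890 sheep"))
# [['123', 'dog'], [None, 'cat'], ['456', 'cow'], ['678', None], ['890', 'sheep']]
```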
Here is another solution that only uses regular expressions incidentally.
def doit(str)
  str.gsub(/[^ ]+/).with_object([]) do |s,a|
    prev = a.empty? ? [0,'a'] : a.last
    case s
    when /\A\d+\z/ # all digits
      if prev[0].nil?
        a[-1][0] = s
      else
        a << [s,nil]
      end
    when /\A\D+\z/ # all non-digits
      if prev[1].nil?
        a[-1][1] = s
      else
        a << [nil,s]
      end
    else
      raise ArgumentError
    end
  end
end
doit str
#=> [["123", "dog"], [nil, "cat"], ["456", "cow"], ["678", nil],
# ["890", "sheep"]]
This uses the form of String#gsub that takes a single argument and no block, and therefore returns an enumerator:
enum = str.gsub(/[^ ]+/)
#=> #<Enumerator: "123 dog cat cow 456 678 890 sheep":gsub(/[^ ]+/)>
enum.next
#=> "123"
enum.next
#=> "dog"
...
enum.next
#=> "sheep"
enum.next
#=> StopIteration (iteration reached an end)
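The token-by-token approach of doit translates fairly directly to Python. The following is a sketch under the same assumptions as the Ruby version (tokens are separated by spaces and are either all digits or contain no digits); pair_tokens is a made-up name.

```python
import re

def pair_tokens(text):
    """Walk the tokens once, filling the empty slot of the last pair when possible."""
    pairs = []
    for token in re.findall(r"[^ ]+", text):
        if re.fullmatch(r"\d+", token):        # all digits
            if pairs and pairs[-1][0] is None:
                pairs[-1][0] = token           # complete a [None, word] pair
            else:
                pairs.append([token, None])
        elif re.fullmatch(r"\D+", token):      # no digits at all
            if pairs and pairs[-1][1] is None:
                pairs[-1][1] = token           # complete a [number, None] pair
            else:
                pairs.append([None, token])
        else:
            raise ValueError(f"mixed token: {token!r}")
    return pairs

print(pair_tokens("123 dog cat cow 456 678 890 sheep"))
# [['123', 'dog'], [None, 'cat'], ['456', 'cow'], ['678', None], ['890', 'sheep']]
```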

Replace '-' with space if the next character is a letter not a digit and remove when it is at the start

I have a list of strings, i.e.
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
I want to remove the '-' when it is the first character and is followed by letters rather than digits. If there is a digit or letter before the '-' and letters after it, the '-' should be replaced with a space.
So for the list slist I want the output as
["args", "-111111", "20 args", "20 - 20", "20-10", "args deep"]
I have tried
import re
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
nlist = list()
for estr in slist:
    nlist.append(re.sub("((^-[a-zA-Z])|([0-9]*-[a-zA-Z]))", "", estr))
print(nlist)
and I get the output
['rgs', '-111111', 'rgs', '20 - 20', '20-10', 'argseep']
You may use
nlist.append(re.sub(r"-(?=[a-zA-Z])", " ", estr).lstrip())
or
nlist.append(re.sub(r"-(?=[^\W\d_])", " ", estr).lstrip())
Result: ['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']
See the Python demo.
The -(?=[a-zA-Z]) pattern matches a hyphen before an ASCII letter (-(?=[^\W\d_]) matches a hyphen before any letter), and replaces the match with a space. Since - may be matched at the start of a string, the space may appear at that position, so .lstrip() is used to remove the space(s) there.
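Put together as a runnable script, the lookahead approach might look like the sketch below. The helper name fix_hyphens is hypothetical; the pattern is the Unicode-aware variant from above.

```python
import re

slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]

def fix_hyphens(text):
    # Replace a hyphen with a space only when a letter follows it,
    # then drop the leading space left behind by a hyphen at the start.
    return re.sub(r"-(?=[^\W\d_])", " ", text).lstrip()

nlist = [fix_hyphens(s) for s in slist]
print(nlist)
# ['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']
```

Note that lstrip() also removes any leading whitespace that was already in the input, which may or may not be wanted.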
Here, we might just want to capture the first letter after a starting -, then replace it with that letter only, maybe with an i flag expression similar to:
^-([a-z])
DEMO
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"^-([a-z])"
test_str = ("-args\n"
"-111111\n"
"20-args\n"
"20 - 20\n"
"20-10\n"
"args-deep")
subst = "\\1"
# count=0 (the default) replaces every match; set it to limit the number of replacements
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE | re.IGNORECASE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Demo
const regex = /^-([a-z])/gmi;
const str = `-args
-111111
20-args
20 - 20
20-10
args-deep`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
RegEx
If this expression wasn't desired, it can be modified or changed in regex101.com.
One option is to do two replacements. First, match the hyphen at the start of the string when only letters follow it:
^-(?=[a-zA-Z]+$)
Regex demo
In the replacement use an empty string.
Then capture one or more letters or digits in group 1, match the -, and capture one or more letters in group 2.
^([a-zA-Z0-9]+)-([a-zA-Z]+)$
Regex demo
In the replacement use r"\1 \2"
For example
import re
regex1 = r"^-(?=[a-zA-Z]+$)"
regex2 = r"^([a-zA-Z0-9]+)-([a-zA-Z]+)$"
slist = ["-args", "-111111", "20-args", "20 - 20", "20-10", "args-deep"]
slist = list(map(lambda s: re.sub(regex2, r"\1 \2", re.sub(regex1, "", s)), slist))
print(slist)
Result
['args', '-111111', '20 args', '20 - 20', '20-10', 'args deep']
Python demo

PhpStorm search and replace multiple times between two strings

In PhpStorm IDE, using the search and replace feature, I'm trying to add .jpg to all strings between quotes that come after $colorsfiles = [ and before the closing ].
$colorsfiles = ["Blue", "Red", "Orange", "Black", "White", "Golden", "Green", "Purple", "Yellow", "cyan", "Gray", "Pink", "Brown", "Sky Blue", "Silver"];
If the "abc" is not in between $colorsfiles = [ and ], there should be no replacement.
The regex that I'm using is
$colorsfiles = \[("(\w*?)", )*
and replace string is
$colorsfiles = ["$2.jpg"]
The current result is
$colorsfiles = ["Brown.jpg"]"Sky Blue", "Silver"];
While the expected output is
$colorsfiles = ["Blue.jpg", "Red.jpg", "Orange.jpg", "Black.jpg", "White.jpg", "Golden.jpg", "Green.jpg", "Purple.jpg", "Yellow.jpg", "cyan.jpg", "Gray.jpg", "Pink.jpg", "Brown.jpg", "Sky Blue.jpg", "Silver.jpg"];
You should have mentioned that you're doing this in an IDE.
I don't use PhpStorm, but here is a solution I tested in NetBeans.
Find : "([\w ]+)"([\,\]]{1})
Replace : "$1\.jpg"$2
Why do you need a regex for this? A simple array_map() will do the trick.
<?php
function addExtension($color)
{
    return $color.".jpg";
}
$colorsfiles = ["Blue", "Red", "Orange", "Black", "White", "Golden", "Green", "Purple", "Yellow", "cyan", "Gray", "Pink", "Brown", "Sky Blue", "Silver"];
$colorsfiles_with_extension = array_map("addExtension", $colorsfiles);
print_r($colorsfiles_with_extension);
?>
Edit: I've tested this in PhpStorm; do it like this:
search:
"([a-zA-Z\s]+)"
replace_all:
"$1.jpg"
You may use
(\G(?!^)",\s*"|\$colorsfiles\s*=\s*\[")([^"]+)
and replace with $1$2.jpg. See this regex demo.
The regex matches $colorsfiles = [" or the end of the previous match followed with "," while capturing these texts into Group 1 (later referred to with $1 placeholder) and then captures into Group 2 (later referred to with $2) one or more chars other than a double quotation mark.
Details
(\G(?!^)",\s*"|\$colorsfiles\s*=\s*\[") -
\G(?!^)",\s*" - the end of the previous match (\G(?!^)), ", substring, 0+ whitespaces (\s*) and a " char
| - or
\$colorsfiles\s*=\s*\[" - $colorsfiles, 0+ whitespaces (\s*), =, 0+ whitespaces, [" (note that $ and [ must be escaped to match literal chars)
([^"]+) - Capturing group 2: one or more (+) chars other than " (the negated character class, [^"])
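Outside the IDE, the same "only inside $colorsfiles = [ ... ]" restriction can be approximated in a script. Python's built-in re module has no \G anchor, so the sketch below swaps in a different technique: one regex locates the array literal, and a second one rewrites the quoted items inside it. The names ARRAY_RE, ITEM_RE and add_extension are assumptions made for this example.

```python
import re

# Group 1: "$colorsfiles = [", group 2: the array body, group 3: the closing "]".
ARRAY_RE = re.compile(r'(\$colorsfiles\s*=\s*\[)([^\]]*)(\])')
# One double-quoted item inside the array body.
ITEM_RE = re.compile(r'"([^"]+)"')

def add_extension(source, ext=".jpg"):
    """Append ext to every quoted item, but only inside the $colorsfiles array."""
    def fix_array(match):
        body = ITEM_RE.sub(lambda item: '"' + item.group(1) + ext + '"', match.group(2))
        return match.group(1) + body + match.group(3)
    return ARRAY_RE.sub(fix_array, source)

php = '$colorsfiles = ["Blue", "Sky Blue"];\n$other = ["abc"];'
print(add_extension(php))
# $colorsfiles = ["Blue.jpg", "Sky Blue.jpg"];
# $other = ["abc"];
```

This assumes the array contains no ] inside an item and no escaped quotes; those cases would need a real PHP parser.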

Regex that matches specific spaces

I've been trying to write this regex for a while now. I'd like to create one that matches all the spaces in a text, except those inside string literals.
Example:
123 Foo "String with spaces"
Space between 123 and Foo would match, as well as the one between Foo and "String with spaces", but only those two.
Thanks
A common, simple strategy for this is to count the number of quotes leading up to your location in the string. If the count is odd, you are inside a quoted string; if the amount is even, you are outside a quoted string. I can't think of a way to do this in regular expressions, but you could use this strategy to filter the results.
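The quote-counting strategy can be sketched without any regex at all. The function below tracks whether an odd number of quotes has been seen so far; spaces_outside_quotes is a made-up name, and the sketch deliberately ignores escaped quotes such as \".

```python
def spaces_outside_quotes(text):
    """Return the indexes of spaces that are not inside a double-quoted string."""
    positions = []
    inside = False  # True while an odd number of quotes has been seen
    for index, char in enumerate(text):
        if char == '"':
            inside = not inside
        elif char == " " and not inside:
            positions.append(index)
    return positions

print(spaces_outside_quotes('123 Foo "String with spaces"'))  # [3, 7]
```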
You could use re.findall to match either a quoted string or a space, and then inspect the matches afterwards:
import re
hits = re.findall(r'"(?:\\.|[^\\"])*"|[ ]', 'foo bar baz "another\\" test\" and done')
for h in hits:
    print("found: [%s]" % h)
yields:
found: [ ]
found: [ ]
found: [ ]
found: ["another\" test"]
found: [ ]
found: [ ]
A short explanation:
"          # match a double quote
(?:        # start non-capture group 1
  \\.      # match a backslash followed by any character (except line breaks)
  |        # OR
  [^\\"]   # match any character except a '\' and '"'
)*         # end non-capture group 1 and repeat it zero or more times
"          # match a double quote
|          # OR
[ ]        # match a single space
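To get from the list of matches to the spaces themselves, one option is to keep only the matches that are a single space and record where they occur. The sketch below uses re.finditer with a raw-string version of the quoted-string-or-space pattern; TOKEN_RE and space_positions are illustrative names.

```python
import re

# A double-quoted string (with backslash escapes) is consumed as one match,
# so the spaces inside it are never seen on their own.
TOKEN_RE = re.compile(r'"(?:\\.|[^\\"])*"|[ ]')

def space_positions(text):
    """Return the indexes of the spaces that lie outside quoted strings."""
    return [m.start() for m in TOKEN_RE.finditer(text) if m.group() == " "]

print(space_positions('123 Foo "String with spaces"'))  # [3, 7]
```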
If your lines always have this structure (unquoted text followed by one quoted string, as in 123 Foo "String with spaces"), you could capture the unquoted and the quoted text in two groups and handle them separately.
For example, with the regex (.*)(".*"), $1 contains 123 Foo (with its trailing space) and $2 contains "String with spaces".
java example.
String aux = "123 Foo \"String with spaces\"";
String regex = "(.*)(\".*\")";
String unquoted = aux.replaceAll(regex, "$1").replace(" ", "");
String quoted = aux.replaceAll(regex, "$2");
System.out.println(unquoted+quoted);
javascript example.
<SCRIPT LANGUAGE="JavaScript">
<!--
var str = '1 23 Foo "String with spaces"';
var re = new RegExp('(.*)(".*")');
var unquoted = str.replace(re, "$1");
var quoted = str.replace(re, "$2");
document.write(unquoted.split(' ').join('') + quoted);
// -->
</SCRIPT>