Python split string without splitting escaped character - python-2.7

Is there a way to split a string without splitting on escaped characters? For example, I have a string that I want to split on ':' but not on '\:':
http\://www.example.url:ftp\://www.example.url
The result should be the following:
['http\://www.example.url' , 'ftp\://www.example.url']

There is a much easier way using a regex with a negative lookbehind assertion:
re.split(r'(?<!\\):', str)
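As a minimal sketch of the lookbehind approach (the helper name split_on_unescaped is hypothetical, and a backslash is assumed to be the escape character), the call can be wrapped so that any single-character delimiter works:

```python
import re

def split_on_unescaped(text, delim=':'):
    # Split wherever delim is NOT immediately preceded by a backslash
    pattern = r'(?<!\\)' + re.escape(delim)
    return re.split(pattern, text)

print(split_on_unescaped(r'http\://www.example.url:ftp\://www.example.url'))
# ['http\\://www.example.url', 'ftp\\://www.example.url']

# Caveat: the lookbehind only inspects one character, so an escaped
# backslash followed by a real delimiter is not split either.
print(split_on_unescaped(r'a\\:b'))
# ['a\\\\:b']
```

The second call shows the main limitation: the regex cannot tell an escaping backslash from an escaped one, so inputs that may contain double escapes need a character-by-character parser.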

As Ignacio says, yes, but not trivially in one go. The issue is that you need lookback to determine if you're at an escaped delimiter or not, and the basic string.split doesn't provide that functionality.
If this isn't inside a tight loop so performance isn't a significant issue, you can do it by first splitting on the escaped delimiters, then performing the split, and then merging. Ugly demo code follows:
# Bear in mind this is not rigorously tested!
def escaped_split(s, delim):
    # split by escaped, then by not-escaped
    escaped_delim = '\\' + delim
    sections = [p.split(delim) for p in s.split(escaped_delim)]
    ret = []
    prev = None  # token still being assembled across escaped delimiters
    for parts in sections:  # for each list of "real" splits
        if prev is None:
            prev = parts[0]
        else:
            # Re-join the pending token to the first item of this section
            prev = escaped_delim.join([prev, parts[0]])
        if len(parts) > 1:
            # The pending token is complete once a real split follows it
            ret.append(prev)
            # Add all the items in the middle
            ret.extend(parts[1:-1])
            prev = parts[-1]
    ret.append(prev)  # the last token is always still pending here
    return ret

s = r'http\://www.example.url:ftp\://www.example.url'
print(escaped_split(s, ':'))
# >>> ['http\\://www.example.url', 'ftp\\://www.example.url']
Alternately, it might be easier to follow the logic if you just split the string by hand.
def escaped_split(s, delim):
    ret = []
    current = []
    itr = iter(s)
    for ch in itr:
        if ch == '\\':
            try:
                # skip the next character; it has been escaped!
                current.append('\\')
                current.append(next(itr))
            except StopIteration:
                pass
        elif ch == delim:
            # split! (add current to the list and reset it)
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret
Note that this second version behaves slightly differently when it encounters double-escapes followed by a delimiter: this function allows escaped escape characters, so that escaped_split(r'a\\:b', ':') returns ['a\\\\', 'b'], because the first \ escapes the second one, leaving the : to be interpreted as a real delimiter. So that's something to watch out for.

An edited version of Henry's answer, with Python 3 compatibility, tests, and fixes for some issues:
def split_unescape(s, delim, escape='\\', unescape=True):
    """
    >>> split_unescape('foo,bar', ',')
    ['foo', 'bar']
    >>> split_unescape('foo$,bar', ',', '$')
    ['foo,bar']
    >>> split_unescape('foo$$,bar', ',', '$', unescape=True)
    ['foo$', 'bar']
    >>> split_unescape('foo$$,bar', ',', '$', unescape=False)
    ['foo$$', 'bar']
    >>> split_unescape('foo$', ',', '$', unescape=True)
    ['foo$']
    """
    ret = []
    current = []
    itr = iter(s)
    for ch in itr:
        if ch == escape:
            try:
                # skip the next character; it has been escaped!
                if not unescape:
                    current.append(escape)
                current.append(next(itr))
            except StopIteration:
                if unescape:
                    current.append(escape)
        elif ch == delim:
            # split! (add current to the list and reset it)
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret

Building on @user629923's suggestion, but much simpler than the other answers:
import re
DBL_ESC = "!double escape!"
s = r"Hello:World\:Goodbye\\:Cruel\\\:World"
# Hide double escapes behind a placeholder, split on unescaped ':', then restore them
list(map(lambda x: x.replace(DBL_ESC, r'\\'), re.split(r'(?<!\\):', s.replace(r'\\', DBL_ESC))))

Here is an efficient solution that handles double-escapes correctly, i.e. any subsequent delimiter is not escaped. It ignores an incorrect single-escape as the last character of the string.
It is very efficient because it iterates over the input string exactly once, manipulating indices instead of copying strings around. Instead of constructing a list, it returns a generator.
def split_esc(string, delimiter):
    if len(delimiter) != 1:
        raise ValueError('Invalid delimiter: ' + delimiter)
    ln = len(string)
    i = 0
    j = 0
    while j < ln:
        if string[j] == '\\':
            if j + 1 >= ln:
                yield string[i:j]
                return
            j += 1
        elif string[j] == delimiter:
            yield string[i:j]
            i = j + 1
        j += 1
    yield string[i:j]
To allow for delimiters longer than a single character, simply advance i and j by the length of the delimiter in the "elif" case. This assumes that a single escape character escapes the entire delimiter, rather than a single character.
Tested with Python 3.5.1.
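The multi-character variant described above could look roughly like the following. This is only a sketch under the stated assumption that one escape character escapes the whole delimiter; the name split_esc_multi and the escape parameter are hypothetical, and a delimiter that itself begins with the escape character is not handled.

```python
def split_esc_multi(string, delimiter, escape='\\'):
    if not delimiter:
        raise ValueError('Delimiter must not be empty')
    ln = len(string)
    step = len(delimiter)
    i = 0  # start of the current token
    j = 0  # scan position
    while j < ln:
        if string[j] == escape:
            if j + 1 >= ln:
                # Dangling escape at the end of the input: drop it
                yield string[i:j]
                return
            if string.startswith(delimiter, j + 1):
                # One escape protects the entire delimiter
                j += 1 + step
            else:
                # Otherwise it protects a single character
                j += 2
        elif string.startswith(delimiter, j):
            yield string[i:j]
            i = j = j + step
        else:
            j += 1
    yield string[i:]

print(list(split_esc_multi(r'a\::b::c', '::')))
# ['a\\::b', 'c']
```

As in the single-character version, the escape characters are left in the yielded tokens, so unescaping remains a separate step.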

There is no builtin function for that.
Here's an efficient, general and tested function, which even supports delimiters of any length:
def escape_split(s, delim):
    i, res, buf = 0, [], ''
    while True:
        j, e = s.find(delim, i), 0
        if j < 0:  # end reached
            return res + [buf + s[i:]]  # add remainder
        while j - e and s[j - e - 1] == '\\':
            e += 1  # number of escapes
        d = e // 2  # number of double escapes
        if e != d * 2:  # odd number of escapes
            buf += s[i:j - d - 1] + s[j]  # add the escaped char
            i = j + 1  # and skip it
            continue  # add more to buf
        res.append(buf + s[i:j - d])
        i, buf = j + len(delim), ''  # start after delim

I think simple C-like parsing would be much simpler and more robust.
def escaped_split(str, ch):
    if len(ch) > 1:
        raise ValueError('Expected split character. Found string!')
    out = []
    part = ''
    escape = False
    for i in range(len(str)):
        if not escape and str[i] == ch:
            out.append(part)
            part = ''
        else:
            part += str[i]
            escape = not escape and str[i] == '\\'
    if len(part):
        out.append(part)
    return out

I have created this method, which is inspired by Henry Keiter's answer, but has the following advantages:
Variable escape character and delimiter
Do not remove the escape character if it is actually not escaping something
This is the code:
def _split_string(self, string: str, delimiter: str, escape: str) -> [str]:
    result = []
    current_element = []
    iterator = iter(string)
    for character in iterator:
        if character == escape:
            try:
                next_character = next(iterator)
                if next_character != delimiter and next_character != escape:
                    # Do not copy the escape character if it is intended to escape either the delimiter or the
                    # escape character itself. Copy the escape character if it is not in use to escape one of these
                    # characters.
                    current_element.append(escape)
                current_element.append(next_character)
            except StopIteration:
                current_element.append(escape)
        elif character == delimiter:
            # split! (add current to the list and reset it)
            result.append(''.join(current_element))
            current_element = []
        else:
            current_element.append(character)
    result.append(''.join(current_element))
    return result
This is test code indicating the behavior:
def test_split_string(self):
    # Verify normal behavior
    self.assertListEqual(['A', 'B'], list(self.sut._split_string('A+B', '+', '?')))
    # Verify that escape character escapes the delimiter
    self.assertListEqual(['A+B'], list(self.sut._split_string('A?+B', '+', '?')))
    # Verify that the escape character escapes the escape character
    self.assertListEqual(['A?', 'B'], list(self.sut._split_string('A??+B', '+', '?')))
    # Verify that the escape character is just copied if it doesn't escape the delimiter or escape character
    self.assertListEqual(['A?+B'], list(self.sut._split_string('A?+B', '\'', '?')))

I know this is an old question, but I recently needed a function like this and could not find one that met my requirements.
Rules:
The escape char only takes effect when followed by the escape char or the delimiter. E.g. if the delimiter is / and the escape char is \, then \a\b\c/abc becomes ['\a\b\c', 'abc'].
Multiple escape chars are themselves escaped (\\ becomes \).
So, for the record, and in case someone is looking for something similar, here is my proposed function:
def str_escape_split(str_to_escape, delimiter=',', escape='\\'):
    """Splits an string using delimiter and escape chars

    Args:
        str_to_escape ([type]): The text to be splitted
        delimiter (str, optional): Delimiter used. Defaults to ','.
        escape (str, optional): The escape char. Defaults to '\'.

    Yields:
        [type]: a list of string to be escaped
    """
    if len(delimiter) > 1 or len(escape) > 1:
        raise ValueError("Either delimiter or escape must be an one char value")
    token = ''
    escaped = False
    for c in str_to_escape:
        if c == escape:
            if escaped:
                token += escape
                escaped = False
            else:
                escaped = True
            continue
        if c == delimiter:
            if not escaped:
                yield token
                token = ''
            else:
                token += c
                escaped = False
        else:
            if escaped:
                token += escape
                escaped = False
            token += c
    yield token
For the sake of sanity, I wrote some tests:
# The structure is:
# 'string_be_split_escaped', [list_with_result_expected]
tests_slash_escape = [
    ('r/casa\\/teste/g', ['r', 'casa/teste', 'g']),
    ('r/\\/teste/g', ['r', '/teste', 'g']),
    ('r/(([0-9])\\s+-\\s+([0-9]))/\\g<2>\\g<3>/g',
     ['r', '(([0-9])\\s+-\\s+([0-9]))', '\\g<2>\\g<3>', 'g']),
    ('r/\\s+/ /g', ['r', '\\s+', ' ', 'g']),
    ('r/\\.$//g', ['r', '\\.$', '', 'g']),
    ('u///g', ['u', '', '', 'g']),
    ('s/(/[/g', ['s', '(', '[', 'g']),
    ('s/)/]/g', ['s', ')', ']', 'g']),
    ('r/(\\.)\\1+/\\1/g', ['r', '(\\.)\\1+', '\\1', 'g']),
    ('r/(?<=\\d) +(?=\\d)/./', ['r', '(?<=\\d) +(?=\\d)', '.', '']),
    ('r/\\\\/\\\\\\/teste/g', ['r', '\\', '\\/teste', 'g'])
]
tests_bar_escape = [
    ('r/||/|||/teste/g', ['r', '|', '|/teste', 'g'])
]

def test(test_array, escape):
    """From input data, test escape functions

    Args:
        test_array ([type]): [description]
        escape ([type]): [description]
    """
    for t in test_array:
        resg = str_escape_split(t[0], '/', escape)
        res = list(resg)
        if res == t[1]:
            print(f"Test {t[0]}: {res} - Pass!")
        else:
            print(f"Test {t[0]}: {t[1]} != {res} - Failed! ")

def test_all():
    test(tests_slash_escape, '\\')
    test(tests_bar_escape, '|')

if __name__ == "__main__":
    test_all()

Note that : doesn't appear to be a character that needs escaping.
The simplest way that I can think of to accomplish this is to split on the character, and then add it back in when it is escaped.
Sample code (In much need of some neatening.):
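def splitNoEscapes(string, char):
    sections = string.split(char)
    # Put the delimiter back on every piece that ended with a backslash;
    # endswith() also copes with empty pieces, where i[-1] would raise IndexError
    sections = [i + (char if i.endswith("\\") else "") for i in sections]
    result = ["" for i in sections]
    j = 0
    for s in sections:
        result[j] += s
        # Only move on to the next slot when this piece ended at a real delimiter
        j += (1 if not s.endswith(char) else 0)
    return [i for i in result if i != ""]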

Related

matching two or more characters that are not the same

Is it possible to write a regex pattern to match abc where each letter is not literal but means that text like xyz (but not xxy) would be matched? I am able to get as far as (.)(?!\1) to match a in ab but then I am stumped.
After getting the answer below, I was able to write a routine to generate this pattern. Using raw re patterns is much faster than converting both the pattern and the text to canonical form and then comparing them.
def pat2re(p, know=None, wild=None):
    r"""return a compiled re pattern that will find pattern `p`
    in which each different character should find a different
    character in a string. Characters to be taken literally
    or that can represent any character should be given as
    `know` and `wild`, respectively.

    EXAMPLES
    ========

    Characters in the pattern denote different characters to
    be matched; characters that are the same in the pattern
    must be the same in the text:

    >>> pat = pat2re('abba')
    >>> assert pat.search('maccaw')
    >>> assert not pat.search('busses')

    The underlying pattern of the re object can be seen
    with the pattern property:

    >>> pat.pattern
    '(.)(?!\\1)(.)\\2\\1'

    If some characters are to be taken literally, list them
    as known; do the same if some characters can stand for
    any character (i.e. are wildcards):

    >>> a_ = pat2re('ab', know='a')
    >>> assert a_.search('ad') and not a_.search('bc')
    >>> ab_ = pat2re('ab*', know='ab', wild='*')
    >>> assert ab_.search('abc') and ab_.search('abd')
    >>> assert not ab_.search('bad')
    """
    import re
    # make a canonical "hash" of the pattern
    # with ints representing pattern elements that
    # must be unique and strings for wild or known
    # values
    m = {}
    j = 1
    know = know or ''
    wild = wild or ''
    for c in p:
        if c in know:
            m[c] = r'\.' if c == '.' else c
        elif c in wild:
            m[c] = '.'
        elif c not in m:
            m[c] = j
            j += 1
            assert j < 100
    h = tuple(m[i] for i in p)
    # build pattern
    out = []
    last = 0
    for i in h:
        if type(i) is int:
            if i <= last:
                out.append(r'\%s' % i)
            else:
                if last:
                    ors = '|'.join(r'\%s' % i for i in range(1, last + 1))
                    out.append('(?!%s)(.)' % ors)
                else:
                    out.append('(.)')
                last = i
        else:
            out.append(i)
    return re.compile(''.join(out))
You may try:
^(.)(?!\1)(.)(?!\1|\2).$
Here is an explanation of the regex pattern:
^ from the start of the string
(.) match and capture any first character (no restrictions so far)
(?!\1) then assert that the second character is different from the first
(.) match and capture any (legitimate) second character
(?!\1|\2) then assert that the third character does not match first or second
. match any valid third character
$ end of string
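To check the pattern from Python, a quick sketch along these lines can be used (the sample strings and the PATTERN name are only illustrative):

```python
import re

# Three characters, all pairwise different
PATTERN = re.compile(r'^(.)(?!\1)(.)(?!\1|\2).$')

for text in ('xyz', 'abc', 'xxy', 'xyx', 'xyy'):
    print(text, bool(PATTERN.match(text)))
# xyz True
# abc True
# xxy False
# xyx False
# xyy False
```

Each additional distinct character needs one more capture group and one more alternative in the negative lookahead, which is what the pat2re routine in the question generates automatically.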

python replace line text with weird characters

How do I replace the following using Python?
GSA*HC*11177*NYSfH-EfC*23130303*0313*1*R*033330103298
STEM*333*3001*0030303238
BHAT*3319*33*33377*23330706*031829*RTRCP
NUM4*41*2*My Break Room Place*****6*1133337
I want to replace all characters after the first occurrence of '*'. Every character must be replaced except '*'.
Example input:
NUM4*41*2*My Break Room Place*****6*1133337
example output:
NUM4*11*1*11 11111 1111 11111*****1*1111111
Fairly simple: use a callback to return group 1 (if matched) unaltered, otherwise return the replacement character 1.
Note - this also would work in multi-line strings.
If you need that, just add (?m) to the beginning of the regex. (?m)(?:(^[^*]*\*)|[^*\s])
You'd probably want to test the string for the * character first.
( ^ [^*]* \* ) # (1), BOS/BOL up to first *
| # or,
[^*\s] # Not a * nor whitespace
Python
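import re

def repl(m):
    # Keep the prefix up to and including the first '*' unchanged
    if m.group(1):
        return m.group(1)
    return "1"

line = 'NUM4*41*2*My Break Room Place*****6*1133337'
# str.find() returns -1 (truthy) when nothing is found, so test membership instead
if '*' in line:
    newstr = re.sub(r'(^[^*]*\*)|[^*\s]', repl, line)
    print(newstr)
else:
    print('* not found in string')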
Output
NUM4*11*1*11 11111 1111 11111*****1*1111111
If you want to use regex, you can use this one: (?<=\*)[^\*]+ with re.sub
import re

inputs = ['GSA*HC*11177*NYSfH-EfC*23130303*0313*1*R*033330103298',
          'STEM*333*3001*0030303238',
          'BHAT*3319*33*33377*23330706*031829*RTRCP',
          'NUM4*41*2*My Break Room Place*****6*1133337']
outputs = [re.sub(r'(?<=\*)[^\*]+', '1', inputline) for inputline in inputs]
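Note that substituting a fixed '1' for each match of (?<=\*)[^\*]+ collapses every field to a single character (the last sample line becomes NUM4*1*1*1*****1*1). If the field lengths and spaces should be preserved, as in the question's example output, one possible sketch is to pass a replacement callback instead; the helper name mask_fields is hypothetical:

```python
import re

def mask_fields(line):
    # Within each run of non-'*' characters that follows a '*',
    # replace every non-whitespace character with '1'.
    return re.sub(r'(?<=\*)[^*]+',
                  lambda m: re.sub(r'\S', '1', m.group()),
                  line)

print(mask_fields('NUM4*41*2*My Break Room Place*****6*1133337'))
# NUM4*11*1*11 11111 1111 11111*****1*1111111
```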

Converting python string to pig latin

def isAlpha(c):
    # 'A'-'Z' is 65-90 and 'a'-'z' is 97-122
    return (ord(c) >= 65 and ord(c) <= 90) or \
           (ord(c) >= 97 and ord(c) <= 122)

# testing first function
print(isAlpha("D"))
print(isAlpha("z"))
print(isAlpha("!"))

s = "AEIOUaeiou"
def isVowel(c):
    return s.find(c) > -1

# testing second function
print(isVowel("A"))
print(isVowel("B"))
print(isVowel("c"))
print(isVowel(" "))
print(isVowel("a"))

def convPigLatin_word(word):
    if isVowel(word[0]):
        word += "way"
    while not isVowel(word[0]):
        word = word[1:] + word[0]
        if isVowel(word[0]):
            word += "ay"
    return word

# testing third function
print(convPigLatin_word("This"))
print(convPigLatin_word("ayyyyyylmao"))

def translate(phrase):
    final = ""
    while phrase.find(" ") != -1:
        n = phrase.find(" ")
        final += convPigLatin_word(phrase[0:n]) + " "
        phrase = phrase[n+1:]
    if phrase.find(" ") == -1:
        final += convPigLatin_word(phrase)
    return final

print(translate("Hello, this is team Number Juan")) #Should be "elloHay, isthay isway eamtay umberNay uanJay"
I tried to write code that transforms a string into pig latin, but I got stuck on non-alphabetic characters. The while loop only works up to the comma. How can I resolve that? I don't know where to use the isAlpha function to check for non-alphabetic characters. Any advice is helpful.
You can iterate through the words of a phrase by using .split(' '). Then you can test them for special characters using .isalpha()
pigLatin = lambda word: word[1:] + word[0] + "ay"

def testChars(word):
    text = ""
    for char in list(word):
        if char.isalpha():
            text += char
        else:
            return pigLatin(text) + char

def testWords(lis):
    words = []
    lis = lis.split(' ')
    for word in lis:
        if not word.isalpha():
            words.append(testChars(word))
        else:
            words.append(pigLatin(word))
    return (' ').join(words)

phrase = "I, have, lots of! special> characters;"
print(testWords(phrase))
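Another way to keep punctuation in place, shown here only as a sketch, is to let a regex find the alphabetic runs and convert each one with a callback. It assumes the same rules as the question's code (a word starting with a vowel gets "way"; otherwise the leading consonants move to the end and "ay" is added), and the names pig_latin_word and translate are illustrative:

```python
import re

VOWELS = "aeiouAEIOU"

def pig_latin_word(word):
    # Words that start with a vowel just get "way" appended
    if word[0] in VOWELS:
        return word + "way"
    # Otherwise move the leading consonants behind the first vowel
    for idx, ch in enumerate(word):
        if ch in VOWELS:
            return word[idx:] + word[:idx] + "ay"
    # No vowel at all (e.g. "my"): leave the letters in place
    return word + "ay"

def translate(phrase):
    # Only runs of ASCII letters are converted; spaces and punctuation
    # are left exactly where they were.
    return re.sub(r"[A-Za-z]+", lambda m: pig_latin_word(m.group()), phrase)

print(translate("Hello, this is team Number Juan"))
# elloHay, isthay isway eamtay umberNay uanJay
```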

Extracting Numbers from a String Without Regular Expressions

I am trying to extract all the numbers from a string composed of digits, symbols and letters.
If the numbers are multi-digit, I have to extract them as multi-digit numbers; e.g. from "shsgd89shs2011%%5swts" I have to pull the numbers out as they appear (89, 2011 and 5).
So far, what I have just loops through and returns all the digits accumulated incrementally, which I like, but I cannot figure out how to make it stop after finishing with one set of digits:
def StringThings(strng):
    nums = []
    number = ""
    for each in range(len(strng)):
        if strng[each].isdigit():
            number += strng[each]
        else:
            continue
        nums.append(number)
    return nums
Running this value: "6wtwyw66hgsgs" returns ['6', '66', '666']
What simple way is there of breaking out of the loop once I have gotten what I needed?
Using your function, just use a temp variable to concat each sequence of digits, yielding the groups each time you encounter a non-digit if the temp variable is not an empty string:
def string_things(strng):
    temp = ""
    for ele in strng:
        if ele.isdigit():
            temp += ele
        elif temp:  # if we have a sequence
            yield temp
            temp = ""  # reset temp
    if temp:  # catch ending sequence
        yield temp
Output
In [9]: s = "shsgd89shs2011%%5swts"
In [10]: list(string_things(s))
Out[10]: ['89', '2011', '5']
In [11]: s ="67gobbledegook95"
In [12]: list(string_things(s))
Out[12]: ['67', '95']
Or you could translate the string replacing letters and punctuation with spaces then split:
from string import ascii_letters, punctuation
s = "shsgd89shs2011%%5swts"
replace = ascii_letters + punctuation
# Python 3 spelling; on Python 2 use string.maketrans instead
tbl = str.maketrans(replace, " " * len(replace))
print(s.translate(tbl).split())
# ['89', '2011', '5']
from itertools import groupby

L2 = []
file_Name1 = 'shsgd89shs2011%%5swts'
for k, g in groupby(file_Name1, str.isdigit):
    a = list(g)
    if k:  # this group is a run of digits
        L2.append("".join(a))
print(L2)
Result: ['89', '2011', '5']
Updated to account for trailing numbers:
def StringThings(strng):
    nums = []
    number = ""
    for each in range(len(strng)):
        if strng[each].isdigit():
            number += strng[each]
            if each == len(strng) - 1:
                # The string ends with a digit: flush the final number
                nums.append(number)
        elif each != 0 and strng[each - 1].isdigit():
            # A run of digits has just ended: store it and reset
            nums.append(number)
            number = ""
    return nums

print(StringThings("shsgd89shs2011%%5swts34"))
# returns ['89', '2011', '5', '34']
So, when we reach a character which is not a number, and if the previously observed character was a number, append the contents of number to nums and then simply empty our temporary container number, to avoid it containing all the old stuff.
Note, I don't know Python so the solution may not be very pythonic.
Alternatively, save yourself all the work and just do:
import re
print(re.findall(r'\d+', 'shsgd89shs2011%%5swts'))

Find a comma within a string?

Not sure if this is possible... but I need to find (and replace) all commas within strings, which I'm going to run on a PHP code file. i.e., something like "[^"]+,[^"]+" except that'll search on the wrong side of the strings too (the first quote is where a string ends, and the last one where it begins). I can run it multiple times to get all the commas, if necessary. I'm trying to use the Find-and-Replace feature in Komodo. This is a one-off job.
Well, here's my script so far, but it isn't working right. It worked on a small test file, but on the full file it's replacing commas outside of strings. Bah.
import sys, re
pattern = ','
replace = '~'
in_str = ''
out_str = ''
quote = None
in_file = open('infile.php', 'r')
out_file = open('outfile.php', 'w')
is_escaped = False # ...
while 1:
    ch = in_file.read(1)
    if not ch: break
    if ch in ('"',"'"):
        if quote is None:
            quote = ch
        elif quote == ch:
            quote = None
            out_file.write(out_str)
            out_file.write(re.sub(pattern,replace,in_str))
            in_str = ''
            out_str = ''
    if ch != quote and quote is not None:
        in_str += ch
    else:
        out_str += ch
out_file.write(out_str)
out_file.write(in_str)
in_file.close()
out_file.close()
I take it you're trying to find string literals in the PHP code (i.e. places in the code where someone has specified a string between quote marks: $somevar = "somevalue"; )
In this case, it may be easier to write a short piece of parsing code than a regex (since it will be complicated in the regex to distinguish the quote marks that begin a string literal from the quote marks that end it).
Some pseudocode:
inquote = false
while (!eof)
    c = get_next_character()
    if (c == QUOTE_MARK)
        inquote = !inquote
    if (c == COMMA)
        if (inquote)
            delete_current_character()
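The pseudocode above could be turned into Python roughly as follows. This is a sketch, not a PHP parser: it assumes only single- and double-quoted literals with backslash escapes, ignores comments and heredocs, and replaces the commas with '~' (as in the question's script) rather than deleting them. The function name replace_commas_in_strings is hypothetical.

```python
def replace_commas_in_strings(source, replacement='~'):
    out = []
    quote = None     # quote character that opened the current literal, if any
    escaped = False  # True when the previous character was a backslash in a literal
    for ch in source:
        if quote is None:
            # Outside a literal: copy everything, watch for an opening quote
            if ch in ('"', "'"):
                quote = ch
            out.append(ch)
        elif escaped:
            # Character after a backslash is copied verbatim
            out.append(ch)
            escaped = False
        elif ch == '\\':
            out.append(ch)
            escaped = True
        elif ch == quote:
            # Matching quote closes the literal
            quote = None
            out.append(ch)
        elif ch == ',':
            out.append(replacement)
        else:
            out.append(ch)
    return ''.join(out)

print(replace_commas_in_strings('f("a,b", \'c,d\', e)'))
# f("a~b", 'c~d', e)
```

Tracking which quote character opened the literal, and whether the previous character was a backslash, is what keeps a quote of the other kind or an escaped quote from ending the string early.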