Regex expression to parse an interesting CSV? - regex

I need to parse a CSV file using AWK. A line in the CSV could look like this:
"hello, world?",1 thousand,"oneword",,,"last one"
Some important observations:
- A field inside a quoted string can contain commas and multiple words.
- An unquoted field can contain multiple words.
- A field can be empty, shown by two commas in a row.
Any clues on writing a regular expression to split this line up properly?
Thanks!

As many have observed, CSV is a harder format than it first appears, with many edge cases and ambiguities. For instance, in your example, is ',,,' a field containing a comma or two blank fields?
Perl, Python, Java, etc. are better equipped to deal with CSV because they have well-tested libraries for it. A regex will be more fragile.
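As a minimal sketch of the library route, Python's standard csv module parses the sample line from the question directly. The variable names are only for illustration.

```python
import csv
import io

line = '"hello, world?",1 thousand,"oneword",,,"last one"'

# csv.reader accepts any iterable of lines; StringIO wraps the single sample line.
fields = next(csv.reader(io.StringIO(line)))

for number, field in enumerate(fields, start=1):
    print(number, repr(field))
```

The reader treats the adjacent commas as empty fields, so the sample line yields six fields.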
With AWK, I have had some success with this AWK function. It works under AWK, gawk and nawk.
#!/usr/bin/awk -f
#**************************************************************************
#
# This file is in the public domain.
#
# For more information email LoranceStinson+csv@gmail.com.
# Or see http://lorance.freeshell.org/csv/
#
# Parse a CSV string into an array.
# The number of fields found is returned.
# In the event of an error a negative value is returned and csverr is set to
# the error. See below for the error values.
#
# Parameters:
# string = The string to parse.
# csv = The array to parse the fields into.
# sep = The field separator character. Normally ,
# quote = The string quote character. Normally "
# escape = The quote escape character. Normally "
# newline = Handle embedded newlines. Provide either a newline or the
# string to use in place of a newline. If left empty embedded
# newlines cause an error.
# trim = When true spaces around the separator are removed.
# This affects parsing. Without this a space between the
# separator and quote result in the quote being ignored.
#
# These variables are private:
# fields = The number of fields found thus far.
# pos = Where to pull a field from the string.
# strtrim = True when a string is found so we know to remove the quotes.
#
# Error conditions:
# -1 = Unable to read the next line.
# -2 = Missing end quote.
# -3 = Missing separator.
#
# Notes:
# The code assumes that every field is preceded by a separator, even the
# first field. This makes the logic much simpler, but also requires a
# separator be prepended to the string before parsing.
#**************************************************************************
function parse_csv(string,csv,sep,quote,escape,newline,trim, fields,pos,strtrim) {
# Make sure there is something to parse.
if (length(string) == 0) return 0;
string = sep string; # The code below assumes ,FIELD.
fields = 0; # The number of fields found thus far.
while (length(string) > 0) {
# Remove spaces after the separator if requested.
if (trim && substr(string, 2, 1) == " ") {
if (length(string) == 1) return fields;
string = substr(string, 2);
continue;
}
strtrim = 0; # Used to trim quotes off strings.
# Handle a quoted field.
if (substr(string, 2, 1) == quote) {
pos = 2;
do {
pos++
if (pos != length(string) &&
substr(string, pos, 1) == escape &&
(substr(string, pos + 1, 1) == quote ||
substr(string, pos + 1, 1) == escape)) {
# Remove escaped quote characters.
string = substr(string, 1, pos - 1) substr(string, pos + 1);
} else if (substr(string, pos, 1) == quote) {
# Found the end of the string.
strtrim = 1;
} else if (newline && pos >= length(string)) {
# Handle embedded newlines if requested.
if (getline == -1) {
csverr = "Unable to read the next line.";
return -1;
}
string = string newline $0;
}
} while (pos < length(string) && strtrim == 0)
if (strtrim == 0) {
csverr = "Missing end quote.";
return -2;
}
} else {
# Handle an empty field.
if (length(string) == 1 || substr(string, 2, 1) == sep) {
csv[fields] = "";
fields++;
if (length(string) == 1)
return fields;
string = substr(string, 2);
continue;
}
# Search for a separator.
pos = index(substr(string, 2), sep);
# If there is no separator the rest of the string is a field.
if (pos == 0) {
csv[fields] = substr(string, 2);
fields++;
return fields;
}
}
# Remove spaces after the separator if requested.
if (trim && pos != length(string) && substr(string, pos + strtrim, 1) == " ") {
trim = strtrim
# Count the number of spaces found.
while (pos < length(string) && substr(string, pos + trim, 1) == " ") {
trim++
}
# Remove them from the string.
string = substr(string, 1, pos + strtrim - 1) substr(string, pos + trim);
# Adjust pos with the trimmed spaces if a quoted string was not found.
if (!strtrim) {
pos -= trim;
}
}
# Make sure we are at the end of the string or there is a separator.
if ((pos != length(string) && substr(string, pos + 1, 1) != sep)) {
csverr = "Missing separator.";
return -3;
}
# Gather the field.
csv[fields] = substr(string, 2 + strtrim, pos - (1 + strtrim * 2));
fields++;
# Remove the field from the string for the next pass.
string = substr(string, pos + 1);
}
return fields;
}
{
num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
if (num_fields < 0) {
printf "ERROR: %s (%d) -> %s\n", csverr, num_fields, $0;
} else {
printf "%s -> \n", $0;
printf "%s fields\n", num_fields;
for (i = 0;i < num_fields;i++) {
printf "%s\n", csv[i];
}
printf "|\n";
}
}
Running it on your example data produces:
"hello, world?",1 thousand,"oneword",,,"last one" ->
6 fields
hello, world?
1 thousand
oneword


last one
|
An example Perl solution:
$ echo '"hello, world?",1 thousand,"oneword",,,"last one"' |
perl -lnE 'for(/(?:^|,)("(?:[^"]+|"")*"|[^,]*)/g) { s/"$//; s/""/"/g if (s/^"//);
say}'
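For comparison, the Perl one-liner above can be roughly ported to Python. This is a sketch rather than a full CSV parser: the names FIELD_RE and split_csv_line are made up for illustration, the quoted-field part uses [^"] instead of [^"]+ to limit backtracking, and fields are assumed not to contain newlines.

```python
import re

# A field follows the start of the line or a comma, and is either a quoted
# string (with "" standing for an escaped quote) or a run of non-comma characters.
FIELD_RE = re.compile(r'(?:^|,)("(?:[^"]|"")*"|[^,]*)')

def split_csv_line(line):
    fields = []
    for match in FIELD_RE.finditer(line):
        field = match.group(1)
        if len(field) >= 2 and field.startswith('"') and field.endswith('"'):
            # Drop the surrounding quotes and collapse doubled quotes.
            field = field[1:-1].replace('""', '"')
        fields.append(field)
    return fields

print(split_csv_line('"hello, world?",1 thousand,"oneword",,,"last one"'))
```

Like any regex approach, this does not report malformed input such as a missing closing quote; it simply treats the text as an unquoted field.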

Try this:
^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$
I haven't tested it with AWK though.

Related

Don't allow consecutive same character and some different character together in TextFormField

I want to restrict a TextFormField to accept only numbers separated by commas and, sometimes, dashes. A comma and a dash must never appear next to each other, and the same separator must never appear twice in a row.
Examples:
1,3-4,9-11 is correct
1,,3--4,9-11 is wrong
1,-3-4,9-11 is wrong
1-,3-4,9-11 is wrong
To restrict input to numbers, commas and dashes I'm using:
FilteringTextInputFormatter(
RegExp("[0-9,-]"),
allow: true
)
But this does not prevent the consecutive separators shown in the wrong examples above.
So, how can I restrict my TextFormField to the correct behavior represented in the examples?
Thank you.
Update: I finally followed this approach for this problem.
If you want to validate on submit, you might write the pattern as:
^[0-9]+(?:[,-][0-9]+)*$
If negative lookahead is supported, you can rule out two consecutive separators (any combination of - and ,) while validating during typing.
Note that this will allow , or - at the end:
^(?!.*[,-][,-])[0-9,-]*
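To see how the two patterns behave on the examples from the question, here is a small Python check. It only exercises the regular expressions; the helper names are hypothetical, and Python's re module stands in for Dart's RegExp on the assumption that both accept this syntax.

```python
import re

# Full validation on submit: digits, then any number of separator+digits groups.
SUBMIT_RE = re.compile(r'^[0-9]+(?:[,-][0-9]+)*$')
# Validation while typing: no two separators in a row; a trailing one is allowed.
TYPING_RE = re.compile(r'^(?!.*[,-][,-])[0-9,-]*')

def valid_on_submit(text):
    return SUBMIT_RE.fullmatch(text) is not None

def valid_while_typing(text):
    return TYPING_RE.fullmatch(text) is not None

for sample in ['1,3-4,9-11', '1,,3--4,9-11', '1,-3-4,9-11', '1-,3-4,9-11']:
    print(sample, valid_on_submit(sample))
```

Note that the submit pattern also accepts chained dashes such as 48-58-60, so a rule against those would need extra handling.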
For my problem above, I finally combined FilteringTextInputFormatter with a custom TextInputFormatter specific to my case. I'm adding it below so that anyone who wants to do the same can have a look at this approach:
class RangeTextInputFormatter extends TextInputFormatter {
@override
TextEditingValue formatEditUpdate(
TextEditingValue oldValue,
TextEditingValue newValue,
) {
TextSelection newSelection = newValue.selection;
String truncated = newValue.text;
int oldValueLength = oldValue.text.length;
int newValueLength = newValue.text.length;
// Blocks comma and dash at start.
if ((oldValue.text.isEmpty || oldValue.text == "") &&
(newValue.text[newValueLength - 1] == "," ||
newValue.text[newValueLength - 1] == "-")) {
truncated = oldValue.text;
newSelection = oldValue.selection;
}
// Allows numbers at start.
else if (oldValue.text.isEmpty || oldValue.text == "") {
truncated = newValue.text;
newSelection = newValue.selection;
} else {
// Blocks comma and dash after comma.
if (oldValue.text[oldValueLength - 1] == "," &&
(newValue.text[newValueLength - 1] == "," ||
newValue.text[newValueLength - 1] == "-")) {
truncated = oldValue.text;
newSelection = oldValue.selection;
}
// Blocks comma and dash after dash.
else if (oldValue.text[oldValueLength - 1] == "-" &&
(newValue.text[newValueLength - 1] == "," ||
newValue.text[newValueLength - 1] == "-")) {
truncated = oldValue.text;
newSelection = oldValue.selection;
}
// Blocks dash after number dash number. Ex: 48-58- <- this last dash is blocked
else if (oldValue.text.lastIndexOf('-') != -1) {
if (!(oldValue.text
.substring(oldValue.text.lastIndexOf('-'))
.contains(",")) &&
newValue.text[newValueLength - 1] == "-") {
truncated = oldValue.text;
newSelection = oldValue.selection;
}
}
}
return TextEditingValue(
text: truncated,
selection: newSelection,
composing: TextRange.empty,
);
}
}
Now use it just like FilteringTextInputFormatter:
inputFormatters: [
FilteringTextInputFormatter(RegExp("[0-9,-]"), allow: true),
RangeTextInputFormatter(),
]

Regex to match despite some of the characters not matching pattern?

I'm working with some bioinformatics data, and I've got this sed expression:
sed -n 'N;/.*:\(.*\)\n.*\1/{p;n;p;n;p};D' file.txt
It currently takes a file that is structured such as:
#E00378:1485 1:N:0:ABC
ABCDEF ##should match, all characters present
+
#
#E00378:1485 1:N:1:ABC
XYZABX ##should match, with permutation
+
#
#E00378:1485 1:N:1:ABCDE
ZABCDXFGH ##should match, with permutation
+
#
#E00378:1485 1:N:1:CBA
ABC ##should not match, order not preserved
+
#
Then it returns 4 lines if the sequence after : is found in the second line, so in this case I would get:
#E00378:1485 1:N:0:ABC
ABCDEF
+
#
However, I am looking to expand my search a little, by adding the possibility of searching for any single permutation of the letters, while maintaining the order, such that ABX, ZBC, AHC, ABO would all match the search criteria ABC.
Is a search like this possible to construct as a one-liner? Or should I write a script?
I was thinking it should be possible to programmatically change one of the letters to a * in the pattern space.
I am trying to make something along the lines of an AWK pattern that has a match defined as:
p = "";
p = p "."a[2]a[3]a[4]a[5]a[6]a[7]a[8]"|";
p = p a[1]"."a[3]a[4]a[5]a[6]a[7]a[8]"|";
p = p a[1]a[2]"."a[4]a[5]a[6]a[7]a[8]"|";
p = p a[1]a[2]a[3]"."a[5]a[6]a[7]a[8]"|";
p = p a[1]a[2]a[3]a[4]"."a[6]a[7]a[8]"|";
p = p a[1]a[2]a[3]a[4]a[5]"."a[7]a[8]"|";
p = p a[1]a[2]a[3]a[4]a[5]a[6]"."a[8]"|";
p = p a[1]a[2]a[3]a[4]a[5]a[6]a[7]".";
m = p;
But I can't seem to figure out how to build it programmatically for a sequence of n characters.
Okay, check this out where fuzzy is your input above:
£ perl -0043 -MText::Fuzzy -ne 'if (/.*:(.*?)\n(.*?)\n/) {my ($offset, $edits, $distance) = Text::Fuzzy::fuzzy_index ($1, $2); print "$offset $edits $distance\n";}' fuzzy
3 kkk 0
5 kkd 1
5 kkkkd 1
Since you haven't been 100% clear on your "fuzziness" criteria (and can't be until you have a measurement tool), I'll explain this first. Reference here:
http://search.cpan.org/~bkb/Text-Fuzzy-0.27/lib/Text/Fuzzy.pod
Basically, for each record (which I've assumed are split on # which is the -0043 bit), the output is an offset, how the 1st string can become the 2nd string, and lastly the "distance" (Levenshtein, I would assume) between the two strings.
So..
£ perl -0043 -MText::Fuzzy -ne 'if (/.*:(.*?)\n(.*?)\n/) {my ($offset, $edits, $distance) = Text::Fuzzy::fuzzy_index ($1, $2); print "$_\n" if $distance < 2;}' fuzzy
#E00378:1485 1:N:0:ABC
ABCDEF
+
#
#E00378:1485 1:N:1:ABC
XYZABX
+
#
#E00378:1485 1:N:1:ABCDE
ZABCDXFGH
+
#
See here for installing perl modules like Text::Fuzzy
https://www.thegeekstuff.com/2008/09/how-to-install-perl-modules-manually-and-using-cpan-command/
Example input/output for a record that wouldn't be printed (distance is 3):
#E00378:1485 1:N:1:ABCDE
ZDEFDXFGH
+
#
gives us this (or simply doesn't print with the second perl command)
3 dddkk 3
Awk doesn't have sed back-references, but has more expressiveness to make up the difference. The following script composes the pattern for matching from the final field of the lead line, then applies the pattern to the subsequent line.
#! /usr/bin/awk -f
BEGIN {
FS = ":"
}
# Lead Line has 5 fields
NF == 5 {
line0 = $0
seq = $NF
getline
if (seq != "") {
n = length(seq)
if (n == 1) {
pat = seq
} else {
# ABC -> /.BC|A.C|AB./
pat = "." substr(seq, 2, n - 1)
for (i = 2; i < n; ++i)
pat = pat "|" substr(seq, 1, i - 1) "." substr(seq, i + 1, n - i)
pat = pat "|" substr(seq, 1, n - 1) "."
}
if ($0 ~ pat) {
print line0
print
getline; print
getline; print
next
}
}
getline
getline
}
If the above needs some work to form a different matching pattern, we mostly limit our modification to the lines of pattern composition. By the way... I noticed that sequences repeat -- to make this faster we can implement caching:
#! /usr/bin/awk -f
BEGIN {
FS = ":"
# Noticed that sequences repeat
# -- implement caching of patterns
split("", cache)
}
# Lead Line has 5 fields
NF == 5 {
line0 = $0
seq = $NF
getline
if (seq != "") {
if (seq in cache) {
pat = cache[seq]
} else {
n = length(seq)
if (n == 1) {
pat = seq
} else {
# ABC -> /.BC|A.C|AB./
pat = "." substr(seq, 2, n - 1)
for (i = 2; i < n; ++i)
pat = pat "|" substr(seq, 1, i - 1) "." substr(seq, i + 1, n - i)
pat = pat "|" substr(seq, 1, n - 1) "."
}
cache[seq] = pat
}
if ($0 ~ pat) {
print line0
print
getline; print
getline; print
next
}
}
getline
getline
}
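The pattern-composition step of the AWK scripts above can also be sketched in Python for a sequence of any length. The function name one_substitution_pattern is made up for this example, and re.escape is added on the assumption that a sequence could contain regex metacharacters.

```python
import re

def one_substitution_pattern(seq):
    """Return a regex that matches seq with any single character replaced."""
    if len(seq) <= 1:
        # Mirrors the AWK script: a one-letter sequence is matched as-is.
        return re.escape(seq)
    # ABC -> .BC|A.C|AB.
    return "|".join(
        re.escape(seq[:i]) + "." + re.escape(seq[i + 1:])
        for i in range(len(seq))
    )

pairs = [("ABC", "ABCDEF"), ("ABC", "XYZABX"), ("ABCDE", "ZABCDXFGH"), ("CBA", "ABC")]
for seq, read in pairs:
    print(seq, read, bool(re.search(one_substitution_pattern(seq), read)))
```

Each alternative replaces exactly one position with a wildcard, so order is preserved and at most one letter may differ.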

Regular Expression for conditional replacement of parts of string in US phone number mask (Swift compatible)

I am trying to come up with a regular expression pattern that fulfills the following requirements.
The input is a US phone number format with 3 groups.
I have input strings like these:
(999) 98__-9999 here there is an extra _ at the end of the second section, which I want to delete
(999) 9_8_-9999 here there is an extra _ at the end of the second section, which I want to delete
(999) 9_-9999 here, if the second group's length is < 3 and it ends with _, an _ should be added to pad the second group to 9__ (3 characters)
(999) 98-9999 here, if the second group's length is equal to 3 or it ends with a digit, there shouldn't be any modifications
To sum up:
If secondGroup.length > 3 && secondGroup.lastCharacter == '_' I want to remove this last character
else if secondGroup.length < 3 && secondGroup.lastCharacter == '_' I want to append "_" (or pad with underscores to have 3 characters in total)
else leave the second group as in the input string.
The same should be applied to the first group. The only difference is the delimiters, i.e. (xxx) for the first group and \sxxx- for the second group.
Here is the Swift code I have used to achieve it in a brute-force way by manually manipulating the string (a length of 4 instead of 3 takes into account the leading delimiter, ( or \s):
var componentText = ""
let idx1 = newText.index(of: "(")
let idx2 = newText.index(of: ")")
if let idx1 = idx1, let idx2 = idx2 {
var component0 = newText[..<idx1]
var component1 = newText[idx1..<idx2]
if component1.count > 4 && component1.last == "_" {
component1.popLast()
} else if component1.count < 4 && component1.last == "_" {
component1.append("_")
}
componentText += "\(component0)\(component1))"
} else {
componentText = newText
}
let idx3 = newText.index(of: " ")
let idx4 = newText.index(of: "-")
if let idx2 = idx2, let idx3 = idx3, let idx4 = idx4 {
var component2 = newText[idx2..<idx3]
component2.popFirst()
var component3 = newText[idx3..<idx4]
var component4 = newText[idx4...]
if component3.count > 4 && component3.last == "_" {
component3.popLast()
} else if component3.count < 4 && component3.last == "_" {
component3.append("_")
}
componentText += "\(component2) \(component3)-\(component4)"
} else {
componentText = newText
}
newText = componentText != "" ? componentText : newText
I think that using regular expression this code could be more flexible and much shorter.

Regular Expression to detect and escape multiple single quotes, other than reserved by JSON

I want a regular expression that detects and escapes multiple single quotes with a double backslash (\\). For example, a ' should become \\'
The challenge here is that:
1) It should NOT escape the single quotes that are used by JSON.
Example below:
{'Key1':'Value1','Key2':'Value2'}
It should not escape the single quotes that enclose keys and values. In the above example, none of the quotes should be escaped.
Any single quotes inside values should be escaped.
2) It should escape MULTIPLE single quotes that appear inside a value (in any key-value pair).
Here is the Challenge String which can be used as an example:
Challenge String:
{'AddressUsageId':''asd'','Edit':'Edit','SiteUsage':'Bi'llTo','PaymentTerm':'asd','SalesPerson':'S'A#,#$'%^''&*'()<>?`~','Language':'','PrimaryUsage':''''','InternalLocation':'T'est'}
It should be escaped like below:
{'AddressUsageId':'\'asd\'','Edit':'Edit','SiteUsage':'Bi\'llTo','PaymentTerm':'asd','SalesPerson':'S\'A#,#$\'%^\'\'&*\'()<>?`~','Language':'','PrimaryUsage':'\'\'\'','InternalLocation':'T\'est'}
Single quotes are NOT valid JSON. If you pull your string through jsonlint, it will show you that. The proper way of making a JSON string in PHP is by using json_encode() on an Array or Object. This will automatically escape quotes if they need to be escaped.
As far as your problem goes, you could use something along these lines:
# $s is $json_string without its first 2 and last 2 characters ({' and '})
$s = substr($json_string, 2, -2);
# $a is an array of strings shaped like key':'value
$a = explode("','", $s);
foreach ($a as $i => $keyvalue) {
    $temp = explode("':'", $keyvalue, 2);
    # Now replace all instances of ' with \'
    $temp = str_replace("'", "\\'", $temp);
    # Stitch the key and value back together.
    $a[$i] = implode("':'", $temp);
}
$result = "{'" . implode("','", $a) . "'}";
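The same split-and-escape idea can be tried out in Python. This is a sketch under the assumptions already implied above: the string starts with {' and ends with '}, and neither ',' nor ':' (with the quotes) ever occurs inside a key or value. The function name escape_inner_quotes is made up for illustration.

```python
def escape_inner_quotes(text):
    """Escape single quotes inside the keys/values of a {'k':'v',...} string."""
    body = text[2:-2]                   # drop the leading {' and trailing '}
    pairs = []
    for item in body.split("','"):      # each item looks like key':'value
        key, sep, value = item.partition("':'")
        # Any quote still left inside the key or value is a stray one.
        pairs.append(key.replace("'", "\\'") + sep + value.replace("'", "\\'"))
    return "{'" + "','".join(pairs) + "'}"

sample = "{'SiteUsage':'Bi'llTo','Language':'','InternalLocation':'T'est'}"
print(escape_inner_quotes(sample))
```

Because the delimiters are found by plain substring search, a value that itself contains ',' or ':' between quotes would be split in the wrong place.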
<script>
function removeSingleQuotesFromJSON(str)
{
var array = str.split('');
var strLength = str.length;
var resultStr= "";
for(var i=0; i<strLength; i++)
{
if(i>0)
{
if(array[i] == "'" && array[i-1] != "{" && array[i+1] != "}" && array[i+1] != ":" && array[i-1] != ":" && !(array[i+1] == "," && array[i+2] == "'") && !(array[i-1] == "," && array[i-2] == "'"))
{
resultStr+="\\" ;
}
}
resultStr+=""+array[i];
}
return resultStr;
} </script>

Python split string without splitting escaped character

Is there a way to split a string without splitting escaped character? For example, I have a string and want to split by ':' and not by '\:'
http\://www.example.url:ftp\://www.example.url
The result should be the following:
['http\://www.example.url' , 'ftp\://www.example.url']
There is a much easier way using a regex with a negative lookbehind assertion:
re.split(r'(?<!\\):', str)
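A short, self-contained run of the lookbehind approach on the string from the question follows. The wrapper name split_unescaped is hypothetical, and the second call illustrates a limitation: an escaped backslash directly before the delimiter also prevents the split.

```python
import re

def split_unescaped(text, delim=':'):
    # (?<!\\) is a negative lookbehind: the delimiter must not follow a backslash.
    return re.split(r'(?<!\\)' + re.escape(delim), text)

print(split_unescaped(r'http\://www.example.url:ftp\://www.example.url'))
print(split_unescaped(r'a\\:b'))
```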
As Ignacio says, yes, but not trivially in one go. The issue is that you need lookback to determine if you're at an escaped delimiter or not, and the basic string.split doesn't provide that functionality.
If this isn't inside a tight loop so performance isn't a significant issue, you can do it by first splitting on the escaped delimiters, then performing the split, and then merging. Ugly demo code follows:
# Bear in mind this is not rigorously tested!
def escaped_split(s, delim):
    # split by escaped, then by not-escaped
    escaped_delim = '\\' + delim
    sections = [p.split(delim) for p in s.split(escaped_delim)]
    ret = []
    current = None  # the item still being assembled
    for parts in sections:  # for each list of "real" splits
        if current is None:
            current = parts[0]
        else:
            # An escaped delimiter sat between the previous last item and
            # this first item, so join them back together
            current = escaped_delim.join([current, parts[0]])
        if len(parts) > 1:
            # A real delimiter follows, so the assembled item is complete
            ret.append(current)
            # Add all the items in the middle
            ret.extend(parts[1:-1])
            current = parts[-1]
    # Add the final item, which the loop never closes off
    ret.append(current)
    return ret
s = r'http\://www.example.url:ftp\://www.example.url'
print (escaped_split(s, ':'))
# >>> ['http\\://www.example.url', 'ftp\\://www.example.url']
Alternately, it might be easier to follow the logic if you just split the string by hand.
def escaped_split(s, delim):
    ret = []
    current = []
    itr = iter(s)
    for ch in itr:
        if ch == '\\':
            try:
                # skip the next character; it has been escaped!
                current.append('\\')
                current.append(next(itr))
            except StopIteration:
                pass
        elif ch == delim:
            # split! (add current to the list and reset it)
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret
Note that this second version behaves slightly differently when it encounters double-escapes followed by a delimiter: this function allows escaped escape characters, so that escaped_split(r'a\\:b', ':') returns ['a\\\\', 'b'], because the first \ escapes the second one, leaving the : to be interpreted as a real delimiter. So that's something to watch out for.
An edited version of Henry's answer with Python 3 compatibility, tests, and fixes for some issues:
def split_unescape(s, delim, escape='\\', unescape=True):
    """
    >>> split_unescape('foo,bar', ',')
    ['foo', 'bar']
    >>> split_unescape('foo$,bar', ',', '$')
    ['foo,bar']
    >>> split_unescape('foo$$,bar', ',', '$', unescape=True)
    ['foo$', 'bar']
    >>> split_unescape('foo$$,bar', ',', '$', unescape=False)
    ['foo$$', 'bar']
    >>> split_unescape('foo$', ',', '$', unescape=True)
    ['foo$']
    """
    ret = []
    current = []
    itr = iter(s)
    for ch in itr:
        if ch == escape:
            try:
                # skip the next character; it has been escaped!
                if not unescape:
                    current.append(escape)
                current.append(next(itr))
            except StopIteration:
                if unescape:
                    current.append(escape)
        elif ch == delim:
            # split! (add current to the list and reset it)
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret
Building on @user629923's suggestion, but much simpler than the other answers:
import re
DBL_ESC = "!double escape!"
s = r"Hello:World\:Goodbye\\:Cruel\\\:World"
# list() is needed on Python 3, where map returns a lazy iterator
parts = list(map(lambda x: x.replace(DBL_ESC, r'\\'), re.split(r'(?<!\\):', s.replace(r'\\', DBL_ESC))))
Here is an efficient solution that handles double-escapes correctly, i.e. any subsequent delimiter is not escaped. It ignores an incorrect single-escape as the last character of the string.
It is very efficient because it iterates over the input string exactly once, manipulating indices instead of copying strings around. Instead of constructing a list, it returns a generator.
def split_esc(string, delimiter):
    if len(delimiter) != 1:
        raise ValueError('Invalid delimiter: ' + delimiter)
    ln = len(string)
    i = 0
    j = 0
    while j < ln:
        if string[j] == '\\':
            if j + 1 >= ln:
                yield string[i:j]
                return
            j += 1
        elif string[j] == delimiter:
            yield string[i:j]
            i = j + 1
        j += 1
    yield string[i:j]
To allow for delimiters longer than a single character, simply advance i and j by the length of the delimiter in the "elif" case. This assumes that a single escape character escapes the entire delimiter, rather than a single character.
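The multi-character variant described above might look like the following sketch. The name split_esc_multi is made up for illustration, and it follows the stated assumption that one backslash escapes the whole delimiter; as in split_esc, escape characters are left in the yielded pieces.

```python
def split_esc_multi(string, delimiter):
    if not delimiter:
        raise ValueError('Invalid delimiter: empty string')
    ln = len(string)
    step = len(delimiter)
    i = 0  # start of the current piece
    j = 0  # scan position
    while j < ln:
        if string[j] == '\\':
            if j + 1 >= ln:
                # A lone trailing escape is ignored.
                yield string[i:j]
                return
            if string.startswith(delimiter, j + 1):
                j += 1 + step  # skip the escape and the whole delimiter
            else:
                j += 2         # skip the escape and one escaped character
        elif string.startswith(delimiter, j):
            yield string[i:j]
            j += step
            i = j
        else:
            j += 1
    yield string[i:j]

print(list(split_esc_multi(r'a::b\::c::d', '::')))
```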
Tested with Python 3.5.1.
There is no builtin function for that.
Here's an efficient, general and tested function, which even supports delimiters of any length:
def escape_split(s, delim):
    i, res, buf = 0, [], ''
    while True:
        j, e = s.find(delim, i), 0
        if j < 0:  # end reached
            return res + [buf + s[i:]]  # add remainder
        while j - e and s[j - e - 1] == '\\':
            e += 1  # number of escapes
        d = e // 2  # number of double escapes
        if e != d * 2:  # odd number of escapes
            buf += s[i:j - d - 1] + s[j]  # add the escaped char
            i = j + 1  # and skip it
            continue  # add more to buf
        res.append(buf + s[i:j - d])
        i, buf = j + len(delim), ''  # start after delim
I think simple C-like parsing would be much simpler and more robust.
def escaped_split(str, ch):
    if len(ch) > 1:
        raise ValueError('Expected split character. Found string!')
    out = []
    part = ''
    escape = False
    for i in range(len(str)):
        if not escape and str[i] == ch:
            out.append(part)
            part = ''
        else:
            part += str[i]
            escape = not escape and str[i] == '\\'
    if len(part):
        out.append(part)
    return out
I have created this method, which is inspired by Henry Keiter's answer, but has the following advantages:
- Variable escape character and delimiter
- Does not remove the escape character if it is not actually escaping something
This is the code:
def _split_string(self, string: str, delimiter: str, escape: str) -> [str]:
    result = []
    current_element = []
    iterator = iter(string)
    for character in iterator:
        if character == escape:
            try:
                next_character = next(iterator)
                if next_character != delimiter and next_character != escape:
                    # Do not copy the escape character if it is intended to escape either the delimiter or the
                    # escape character itself. Copy the escape character if it is not in use to escape one of these
                    # characters.
                    current_element.append(escape)
                current_element.append(next_character)
            except StopIteration:
                current_element.append(escape)
        elif character == delimiter:
            # split! (add current to the list and reset it)
            result.append(''.join(current_element))
            current_element = []
        else:
            current_element.append(character)
    result.append(''.join(current_element))
    return result
This is test code indicating the behavior:
def test_split_string(self):
    # Verify normal behavior
    self.assertListEqual(['A', 'B'], list(self.sut._split_string('A+B', '+', '?')))
    # Verify that escape character escapes the delimiter
    self.assertListEqual(['A+B'], list(self.sut._split_string('A?+B', '+', '?')))
    # Verify that the escape character escapes the escape character
    self.assertListEqual(['A?', 'B'], list(self.sut._split_string('A??+B', '+', '?')))
    # Verify that the escape character is just copied if it doesn't escape the delimiter or escape character
    self.assertListEqual(['A?+B'], list(self.sut._split_string('A?+B', '\'', '?')))
I know this is an old question, but I recently needed a function like this and did not find any that met my requirements.
Rules:
- The escape char only takes effect before another escape char or the delimiter. E.g. if the delimiter is / and the escape is \, then \a\b\c/abc becomes ['\a\b\c', 'abc']
- Doubled escape chars are unescaped (\\ becomes \)
So, for the record, and in case someone is looking for something similar, here is my proposed function:
def str_escape_split(str_to_escape, delimiter=',', escape='\\'):
    """Splits a string using delimiter and escape chars
    Args:
        str_to_escape (str): The text to be split
        delimiter (str, optional): Delimiter used. Defaults to ','.
        escape (str, optional): The escape char. Defaults to '\\'.
    Yields:
        str: the next token of the split string
    """
    if len(delimiter) > 1 or len(escape) > 1:
        raise ValueError("Both delimiter and escape must be single-character values")
    token = ''
    escaped = False
    for c in str_to_escape:
        if c == escape:
            if escaped:
                token += escape
                escaped = False
            else:
                escaped = True
            continue
        if c == delimiter:
            if not escaped:
                yield token
                token = ''
            else:
                token += c
                escaped = False
        else:
            if escaped:
                token += escape
                escaped = False
            token += c
    yield token
For the sake of sanity, I wrote some tests:
# The structure is:
# 'string_be_split_escaped', [list_with_result_expected]
tests_slash_escape = [
    ('r/casa\\/teste/g', ['r', 'casa/teste', 'g']),
    ('r/\\/teste/g', ['r', '/teste', 'g']),
    ('r/(([0-9])\\s+-\\s+([0-9]))/\\g<2>\\g<3>/g',
     ['r', '(([0-9])\\s+-\\s+([0-9]))', '\\g<2>\\g<3>', 'g']),
    ('r/\\s+/ /g', ['r', '\\s+', ' ', 'g']),
    ('r/\\.$//g', ['r', '\\.$', '', 'g']),
    ('u///g', ['u', '', '', 'g']),
    ('s/(/[/g', ['s', '(', '[', 'g']),
    ('s/)/]/g', ['s', ')', ']', 'g']),
    ('r/(\\.)\\1+/\\1/g', ['r', '(\\.)\\1+', '\\1', 'g']),
    ('r/(?<=\\d) +(?=\\d)/./', ['r', '(?<=\\d) +(?=\\d)', '.', '']),
    ('r/\\\\/\\\\\\/teste/g', ['r', '\\', '\\/teste', 'g'])
]
tests_bar_escape = [
    ('r/||/|||/teste/g', ['r', '|', '|/teste', 'g'])
]
def test(test_array, escape):
    """From input data, test escape functions
    Args:
        test_array ([type]): [description]
        escape ([type]): [description]
    """
    for t in test_array:
        resg = str_escape_split(t[0], '/', escape)
        res = list(resg)
        if res == t[1]:
            print(f"Test {t[0]}: {res} - Pass!")
        else:
            print(f"Test {t[0]}: {t[1]} != {res} - Failed! ")
def test_all():
    test(tests_slash_escape, '\\')
    test(tests_bar_escape, '|')
if __name__ == "__main__":
    test_all()
Note that : doesn't appear to be a character that needs escaping.
The simplest way that I can think of to accomplish this is to split on the character, and then add it back in when it is escaped.
Sample code (In much need of some neatening.):
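def splitNoEscapes(string, char):
    sections = string.split(char)
    # Re-attach the delimiter to every section that ends with a backslash.
    # endswith() also copes with empty sections, where indexing [-1] would raise.
    sections = [i + (char if i.endswith("\\") else "") for i in sections]
    result = ["" for i in sections]
    j = 0
    for s in sections:
        result[j] += s
        # Stay on the same slot while the delimiter was escaped.
        j += (1 if not s.endswith(char) else 0)
    return [i for i in result if i != ""]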