Find a comma within a string? - regex

Not sure if this is possible, but I need to find (and replace) all commas within strings in a PHP code file. Something like "[^"]+,[^"]+" almost works, except that it also matches on the wrong side of the strings (where the first quote is the end of one string and the last quote is the start of the next). I can run it multiple times to get all the commas, if necessary. I'm trying to use the Find-and-Replace feature in Komodo. This is a one-off job.
Well, here's my script so far, but it isn't working right. It worked on a small test file, but on the full file it's replacing commas outside of strings. Bah.
import sys, re
pattern = ','
replace = '~'
in_str = ''
out_str = ''
quote = None
in_file = open('infile.php', 'r')
out_file = open('outfile.php', 'w')
is_escaped = False # ...
while 1:
    ch = in_file.read(1)
    if not ch: break
    if ch in ('"',"'"):
        if quote is None:
            quote = ch
        elif quote == ch:
            quote = None
            out_file.write(out_str)
            out_file.write(re.sub(pattern,replace,in_str))
            in_str = ''
            out_str = ''
    if ch != quote and quote is not None:
        in_str += ch
    else:
        out_str += ch
out_file.write(out_str)
out_file.write(in_str)
in_file.close()
out_file.close()

I take it you're trying to find string literals in the PHP code (i.e. places in the code where someone has specified a string between quote marks: $somevar = "somevalue"; )
In this case, it may be easier to write a short piece of parsing code than a regex, since it is complicated for a regex to distinguish the quote marks that begin a string literal from the quote marks that end it.
Some pseudocode:
inquote = false
while (!eof)
    c = get_next_character()
    if (c == QUOTE_MARK)
        inquote = !inquote
    if (c == COMMA)
        if (inquote)
            delete_current_character()
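The pseudocode above toggles on every quote mark, so an escaped quote or an apostrophe inside a double-quoted string would throw it off. Below is a minimal Python sketch of the same state machine that also remembers which quote character opened the literal and skips backslash-escaped characters. The function name replace_commas_in_strings and the '~' default are illustrative only, and the sketch assumes the file contains no quotes inside comments or heredocs, which it does not understand.

```python
def replace_commas_in_strings(source, replacement="~"):
    """Replace commas that occur inside '...' or "..." literals in source."""
    out = []
    quote = None     # quote character that opened the current literal, or None
    escaped = False  # True if the previous character was an unescaped backslash
    for ch in source:
        if quote is None:
            if ch in ('"', "'"):
                quote = ch               # entering a string literal
            out.append(ch)
        else:
            if escaped:
                escaped = False          # this character cannot close the literal
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None             # leaving the string literal
            out.append(replacement if ch == "," else ch)
    return "".join(out)


print(replace_commas_in_strings('f("a,b", \'c,d\', e)'))  # f("a~b", 'c~d', e)
```

Because the whole file is processed in one pass, there is no need to run it several times to catch every comma.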

Related

Regex differentiate between all vs few char Uppercase in String

I have to pass a string into a program which, depending on the string, returns exactly one response value. I am having difficulty building patterns for the following cases.
If a string ends with '?' and is not all uppercase, return 'x', no matter what the contents of the string.
If a string ends with '?' and is all uppercase, return 'y'.
If a string ends with '!', or is all uppercase (no question mark at the end), return 'z'.
If a string is only whitespace, return 'a'.
Here are some example strings, at least one for each of the four patterns:
phrase1 = "Simple String with some UPPercase in Between ends with?"
phrase2 = "BIG STRING ALL CAPS ENDS WITH?"
phrase3_a = "ALLCAPSSTRING NOTHING AT THE END OF STRING"
phrase3_b = "Any String with ALL UPPERCASE (or not) but ends with!"
phrase4 = "\t\t\t\t"
I haven't built accurate patterns, and that's what I'm asking about here. After that I plan to use a single re.compile with all the patterns and then finditer to pick the group which is not None. In the code below I have removed the whitespace, since if none of the other patterns match, a whitespace-only phrase will simply return None, which I can handle separately:
phrase=re.sub(r'[\s]','',phrase)
pattern_phrase1 = re.compile (r'[a-zA-Z0-9]\?$')
pattern_phrase2 = re.compile (r'[A-Z0-9]\?$')
pattern_phrase3 = re.compile (r'[A-Z]|[.!$]')
Solution 1 - using isx functions
def hey(phrase):
    # Return values from the question; 'c' is a placeholder for the
    # fallback case, which the question does not specify.
    responses = {'ques': 'x', 'ques_yell': 'y', 'yell': 'z', 'onlycall': 'a', 'what': 'c'}
    phrase = ''.join(phrase.split())  # drop all whitespace
    if phrase == '':
        return responses['onlycall']
    if phrase[-1] == '?':
        if phrase.isupper():
            return responses['ques_yell']
        return responses['ques']
    if phrase.isupper() or phrase[-1] == '!':
        return responses['yell']
    return responses['what']
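Since the question asks for patterns, the same four cases can also be sketched with regular expressions. This is only one possible reading: "all uppercase" is taken to mean at least one A-Z letter and no a-z letter (roughly what str.isupper() checks for ASCII text), and the names UPPER, RULES and classify are made up for this example.

```python
import re

# At least one uppercase letter, and no lowercase letter anywhere.
UPPER = r"(?=[^a-z]*[A-Z])[^a-z]*"

# Rules are tried in order; the first pattern that matches the whole phrase wins.
RULES = [
    ("a", re.compile(r"\s*")),                      # empty or whitespace only
    ("y", re.compile(UPPER + r"\?")),               # shouted question
    ("x", re.compile(r".*\?", re.DOTALL)),          # any other question
    ("z", re.compile(r".*!|" + UPPER, re.DOTALL)),  # ends with '!' or all caps
]


def classify(phrase):
    text = phrase.strip()
    for response, pattern in RULES:
        if pattern.fullmatch(text):
            return response
    return None  # none of the four cases applies


print(classify("BIG STRING ALL CAPS ENDS WITH?"))  # y
```

Ordering the rules from most to least specific avoids having to encode "not all uppercase" in the pattern for 'x'.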

Allow user to pass a separator character by doubling it in C++

I have a C++ function that accepts strings in the format below:
<WORD>: [VALUE]; <ANOTHER WORD>: [VALUE]; ...
This is the function:
std::wstring ExtractSubStringFromString(const std::wstring String, const std::wstring SubString) {
    std::wstring S = std::wstring(String), SS = std::wstring(SubString), NS;
    size_t ColonCount = NULL, SeparatorCount = NULL; WCHAR Separator = L';';
    ColonCount = std::count(S.begin(), S.end(), L':');
    SeparatorCount = std::count(S.begin(), S.end(), Separator);
    if ((SS.find(Separator) != std::wstring::npos) || (SeparatorCount > ColonCount))
    {
        // SEPARATOR NEED TO BE ESCAPED, BUT DON'T KNOW TO DO THIS.
    }
    if (S.find(SS) != std::wstring::npos)
    {
        NS = S.substr(S.find(SS) + SS.length() + 1);
        if (NS.find(Separator) != std::wstring::npos) { NS = NS.substr(NULL, NS.find(Separator)); }
        if (NS[NS.length() - 1] == L']') { NS.pop_back(); }
        return NS;
    }
    return L"";
}
The above function correctly outputs MANGO if I use it like this:
ExtractSubStringFromString(L"[VALUE: MANGO; DATA: NOTHING]", L"VALUE")
However, when I try to escape separators in the following string by doubling them (;;), I still get MANGO instead of ;MANGO;:
ExtractSubStringFromString(L"[VALUE: ;;MANGO;;; DATA: NOTHING]", L"VALUE")
Here, the value assigner is a colon and the separator is a semicolon. I want to allow users to pass colons and semicolons to my function by doubling them, just as double quotes, single quotes, and other characters are escaped in many scripting and programming languages and in the parameters of many command-line programs.
I thought hard but couldn't think of a way to do it. Can anyone please help me with this?
Thanks in advance.
You should search the string for ;; and replace it with a temporary filler character or string, which can later be replaced with the real value.
So basically:
1) Search through the string and replace all instances of ;; with \tempFill. It would be best to pick a combination of characters that is highly unlikely to appear in the original string.
2) Parse the string.
3) Replace all instances of \tempFill with ;
Note: It would be wise to assert that \tempFill (or whatever you choose as the filler) is not in the original string, to prevent bugs. You could use a character such as \n and make sure there are none in the original string.
Disclaimer:
I can almost guarantee there are cleaner and more efficient ways to do this, but this is the simplest.
First, as the substring does not need to be split, I assume that it does not need to be pre-processed to filter escaped separators.
Then, on the main string, the simplest way IMHO is to filter the escaped separators as you search for them. Pseudocode (assuming the enclosing [] have been removed):
last_index = begin_of_string
index_of_current_substring = begin_of_string
loop: search a separator starting at last_index - if not found exit loop
    ok: found one at ix
    if char at ix+1 is a separator (meaning we have an escaped separator)
        remove character at ix from string by copying all characters after it one step to the left
        last_index = ix+1
        continue loop
    else this is a true separator
        search a colon in [ index_of_current_substring, ix [
        if not found: error incorrect string
        say found at c
        compare key_string with string[index_of_current_substring, c [
        if equal - ok we found the key
            value is string[ c+2 (skip a space after the colon), ix [
            return value - search is finished
        else - it is not our key, just continue searching
            index_of_current_substring = ix+1
            last_index = index_of_current_substring
            continue loop
It should now be easy to convert that to C++
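Before porting to C++, the idea can be tried out in a few lines of Python. This is a rough sketch rather than a translation of the pseudocode: it splits the whole string into fields first and then looks up the key, it treats only doubled separators (not doubled colons) as escapes, and the names split_doubled and extract_value are invented for the example.

```python
def split_doubled(text, sep):
    """Split text on single sep characters; a doubled sep yields one literal sep."""
    fields, current, i = [], [], 0
    while i < len(text):
        if text[i] == sep:
            if i + 1 < len(text) and text[i + 1] == sep:
                current.append(sep)          # escaped separator: keep one copy
                i += 2
                continue
            fields.append("".join(current))  # true separator: close the field
            current = []
        else:
            current.append(text[i])
        i += 1
    fields.append("".join(current))
    return fields


def extract_value(text, key, sep=";", assign=":"):
    """Return the value stored under key, or '' if the key is absent."""
    for field in split_doubled(text.strip("[]"), sep):
        name, found, value = field.partition(assign)
        if found and name.strip() == key:
            return value.strip()
    return ""


print(extract_value("[VALUE: ;;MANGO;;; DATA: NOTHING]", "VALUE"))  # ;MANGO;
```

Note that a run such as ;;; is read left to right, as one escaped separator followed by a real one.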

Python - How to replace spaces in simple line of text with a specific character

I am looking for some help in reformatting a line of text, from within a python script, so that I can replace certain characters with others, or spaces with a specific character. For clarity, the text I am trying to reformat is assigned to a variable.
I have searched for this feature but I have not seen how this can be done!
I have tried to write a full function for you to fit your purpose:
def posSubStringReplaceRecursively(string1, search, replace, pos, is_First):
    global string2
    if is_First:
        # Work on a list of characters so that slices can be replaced in place.
        string2 = list(string1)
        is_First = False
    index = string1.rfind(search)
    if index != -1:
        # Record the match, cut it off, and keep searching to the left.
        string1 = string1[:(index+len(search)-1)]
        pos.append(index)
        return posSubStringReplaceRecursively(string1, search, replace, pos, is_First)
    else:
        # pos holds the match positions from right to left, so earlier
        # replacements do not shift the positions still to be processed.
        for i in range(len(pos)):
            string2[pos[i]:pos[i]+len(search)] = replace
        string2 = "".join(string2)
        return string2
Calling the function:
string1 = input("Enter string:\n")
search = input("Enter word to find:\n")
replace = input("Enter word to replace:\n")
print(posSubStringReplaceRecursively(string1, search, replace, [], True))
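For the simple case described in the question, the built-in string methods are probably enough. A minimal sketch, assuming the text is already held in a variable; the sample text and the replacement characters are made up for illustration.

```python
import re

text = "replace spaces in this - simple line"

# Replace every space with a specific character.
underscored = text.replace(" ", "_")

# Replace several different characters in one pass.
table = str.maketrans({" ": "_", "-": "+"})
translated = text.translate(table)

# Collapse any run of whitespace into a single replacement character.
collapsed = re.sub(r"\s+", "_", "too   many\tspaces")

print(underscored)  # replace_spaces_in_this_-_simple_line
print(translated)   # replace_spaces_in_this_+_simple_line
print(collapsed)    # too_many_spaces
```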

C#: How to split a CSV string that has fields containing commas [duplicate]

I have the CSV string below, which I need to split on commas.
Input:
A,"Rakesh,Gaur",B,"A,B",Z
Output:
A
Rakesh,Gaur
B
A,B
Z
A plain string split will not work here, and a regular expression quickly gets unwieldy. If you are not going to use an existing library, you have to keep track of whether or not you are in_quotes. But as you will find out once you start, CSV parsing is complex, and you should use something pre-built. As I recall from my days writing an app that relied heavily on CSV, there are escape characters and such that you will need to account for.
Either way, the pseudocode is as follows:
Stack cells = [""]    // start with one empty cell
in_quotes = false
foreach character in string:
    if character != ',' && character != '"':
        cells.Top = cells.Top + character
    else if character == ',' && in_quotes:
        cells.Top = cells.Top + character
    else if character == ',':
        cells.push("")
    else if character == '"' && in_quotes:
        in_quotes = false
    else if character == '"':
        in_quotes = true
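To illustrate both points (use a library, or track in_quotes yourself), here is a short Python sketch. The first part uses the standard csv module; the second, split_quoted, is a hypothetical helper that mirrors the pseudocode above and, like it, ignores escaped or doubled quotes inside a field.

```python
import csv

line = 'A,"Rakesh,Gaur",B,"A,B",Z'

# Library route: csv.reader takes any iterable of lines.
fields = next(csv.reader([line]))


def split_quoted(text, sep=",", quote='"'):
    """Hand-rolled version of the pseudocode: commas inside quotes are kept."""
    cells, in_quotes = [""], False
    for ch in text:
        if ch == quote:
            in_quotes = not in_quotes   # quote marks toggle the state and are dropped
        elif ch == sep and not in_quotes:
            cells.append("")            # a real separator starts a new cell
        else:
            cells[-1] += ch
    return cells


print(fields)              # ['A', 'Rakesh,Gaur', 'B', 'A,B', 'Z']
print(split_quoted(line))  # ['A', 'Rakesh,Gaur', 'B', 'A,B', 'Z']
```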
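For input without quoted fields, a plain Split is enough (note that it also splits on commas inside quotes, so it does not produce the desired output for the sample input above):
string[] words = yourStringInput.Split(',');
foreach (string word in words)
{
    Console.WriteLine(word);
}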

Python split string without splitting escaped character

Is there a way to split a string without splitting escaped character? For example, I have a string and want to split by ':' and not by '\:'
http\://www.example.url:ftp\://www.example.url
The result should be the following:
['http\://www.example.url' , 'ftp\://www.example.url']
There is a much easier way using a regex with a negative lookbehind assertion:
re.split(r'(?<!\\):', str)
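As a quick check of the lookbehind approach on the string from the question, here is a small runnable sketch. The variable names are arbitrary, and the last line shows a limitation worth knowing about: an escaped backslash directly before a delimiter is still treated as escaping that delimiter.

```python
import re

s = r'http\://www.example.url:ftp\://www.example.url'

# Split on ':' only where it is not immediately preceded by a backslash.
parts = re.split(r'(?<!\\):', s)
print(parts)  # ['http\\://www.example.url', 'ftp\\://www.example.url']

# Limitation: in 'a\\:b' the backslash is itself escaped, yet no split happens.
double_escaped = re.split(r'(?<!\\):', r'a\\:b')
print(double_escaped)  # ['a\\\\:b']
```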
As Ignacio says, yes, but not trivially in one go. The issue is that you need lookback to determine if you're at an escaped delimiter or not, and the basic string.split doesn't provide that functionality.
If this isn't inside a tight loop so performance isn't a significant issue, you can do it by first splitting on the escaped delimiters, then performing the split, and then merging. Ugly demo code follows:
# Bear in mind this is not rigorously tested!
def escaped_split(s, delim):
    # split by escaped, then by not-escaped
    escaped_delim = '\\'+delim
    sections = [p.split(delim) for p in s.split(escaped_delim)]
    ret = []
    for i, parts in enumerate(sections): # for each list of "real" splits
        if i == 0:
            # Nothing to merge with yet: take the first section as it is
            ret.extend(parts)
        else:
            # Rejoin the previous last item and the first item of this
            # section across the escaped delimiter that separated them
            ret[-1] = escaped_delim.join([ret[-1], parts[0]])
            # Add all the remaining items
            ret.extend(parts[1:])
    return ret
s = r'http\://www.example.url:ftp\://www.example.url'
print (escaped_split(s, ':'))
# >>> ['http\\://www.example.url', 'ftp\\://www.example.url']
Alternately, it might be easier to follow the logic if you just split the string by hand.
def escaped_split(s, delim):
    ret = []
    current = []
    itr = iter(s)
    for ch in itr:
        if ch == '\\':
            try:
                # skip the next character; it has been escaped!
                current.append('\\')
                current.append(next(itr))
            except StopIteration:
                pass
        elif ch == delim:
            # split! (add current to the list and reset it)
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret
Note that this second version behaves slightly differently when it encounters double-escapes followed by a delimiter: this function allows escaped escape characters, so that escaped_split(r'a\\:b', ':') returns ['a\\\\', 'b'], because the first \ escapes the second one, leaving the : to be interpreted as a real delimiter. So that's something to watch out for.
An edited version of Henry's answer with Python 3 compatibility, doctests, and some fixes:
def split_unescape(s, delim, escape='\\', unescape=True):
    """
    >>> split_unescape('foo,bar', ',')
    ['foo', 'bar']
    >>> split_unescape('foo$,bar', ',', '$')
    ['foo,bar']
    >>> split_unescape('foo$$,bar', ',', '$', unescape=True)
    ['foo$', 'bar']
    >>> split_unescape('foo$$,bar', ',', '$', unescape=False)
    ['foo$$', 'bar']
    >>> split_unescape('foo$', ',', '$', unescape=True)
    ['foo$']
    """
    ret = []
    current = []
    itr = iter(s)
    for ch in itr:
        if ch == escape:
            try:
                # skip the next character; it has been escaped!
                if not unescape:
                    current.append(escape)
                current.append(next(itr))
            except StopIteration:
                if unescape:
                    current.append(escape)
        elif ch == delim:
            # split! (add current to the list and reset it)
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret
Building on @user629923's suggestion, but much simpler than the other answers:
import re
DBL_ESC = "!double escape!"
s = r"Hello:World\:Goodbye\\:Cruel\\\:World"
# list() is needed on Python 3, where map returns a lazy iterator
list(map(lambda x: x.replace(DBL_ESC, r'\\'), re.split(r'(?<!\\):', s.replace(r'\\', DBL_ESC))))
Here is an efficient solution that handles double-escapes correctly, i.e. any subsequent delimiter is not escaped. It ignores an incorrect single-escape as the last character of the string.
It is very efficient because it iterates over the input string exactly once, manipulating indices instead of copying strings around. Instead of constructing a list, it returns a generator.
def split_esc(string, delimiter):
    if len(delimiter) != 1:
        raise ValueError('Invalid delimiter: ' + delimiter)
    ln = len(string)
    i = 0
    j = 0
    while j < ln:
        if string[j] == '\\':
            if j + 1 >= ln:
                yield string[i:j]
                return
            j += 1
        elif string[j] == delimiter:
            yield string[i:j]
            i = j + 1
        j += 1
    yield string[i:j]
To allow for delimiters longer than a single character, simply advance i and j by the length of the delimiter in the "elif" case. This assumes that a single escape character escapes the entire delimiter, rather than a single character.
Tested with Python 3.5.1.
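The multi-character extension described above might look roughly like the sketch below. It is not part of the original answer: the name split_esc_multi and the escape parameter are assumptions made for illustration, and, as stated, one escape character is taken to protect the entire delimiter that follows it.

```python
def split_esc_multi(string, delimiter, escape="\\"):
    """Yield the pieces of string between unescaped occurrences of delimiter."""
    if not delimiter:
        raise ValueError("Delimiter must not be empty")
    ln, step = len(string), len(delimiter)
    i = j = 0
    while j < ln:
        if string[j] == escape:
            if j + 1 >= ln:
                yield string[i:j]   # a dangling escape at the end is dropped
                return
            # Skip the escape plus whatever it protects: the whole
            # delimiter if one follows, otherwise a single character.
            j += 1 + (step if string.startswith(delimiter, j + 1) else 1)
        elif string.startswith(delimiter, j):
            yield string[i:j]
            j += step
            i = j
        else:
            j += 1
    yield string[i:j]


print(list(split_esc_multi(r"a::b\::c::d", "::")))  # ['a', 'b\\::c', 'd']
```

As in the single-character version, escape characters are left in the output and only indices are moved, so no intermediate strings are built.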
There is no builtin function for that.
Here's an efficient, general and tested function, which even supports delimiters of any length:
def escape_split(s, delim):
    i, res, buf = 0, [], ''
    while True:
        j, e = s.find(delim, i), 0
        if j < 0:  # end reached
            return res + [buf + s[i:]]  # add remainder
        while j - e and s[j - e - 1] == '\\':
            e += 1  # number of escapes
        d = e // 2  # number of double escapes
        if e != d * 2:  # odd number of escapes
            buf += s[i:j - d - 1] + s[j]  # add the escaped char
            i = j + 1  # and skip it
            continue  # add more to buf
        res.append(buf + s[i:j - d])
        i, buf = j + len(delim), ''  # start after delim
I think simple C-like parsing would be much simpler and more robust.
def escaped_split(str, ch):
    if len(ch) > 1:
        raise ValueError('Expected split character. Found string!')
    out = []
    part = ''
    escape = False
    for i in range(len(str)):
        if not escape and str[i] == ch:
            out.append(part)
            part = ''
        else:
            part += str[i]
            escape = not escape and str[i] == '\\'
    if len(part):
        out.append(part)
    return out
I have created this method, which is inspired by Henry Keiter's answer, but has the following advantages:
Variable escape character and delimiter
Does not remove the escape character if it is not actually escaping something
This is the code:
def _split_string(self, string: str, delimiter: str, escape: str) -> [str]:
    result = []
    current_element = []
    iterator = iter(string)
    for character in iterator:
        if character == escape:
            try:
                next_character = next(iterator)
                if next_character != delimiter and next_character != escape:
                    # Do not copy the escape character if it is intended to escape either the delimiter or the
                    # escape character itself. Copy the escape character if it is not in use to escape one of these
                    # characters.
                    current_element.append(escape)
                current_element.append(next_character)
            except StopIteration:
                current_element.append(escape)
        elif character == delimiter:
            # split! (add current to the list and reset it)
            result.append(''.join(current_element))
            current_element = []
        else:
            current_element.append(character)
    result.append(''.join(current_element))
    return result
This is test code indicating the behavior:
def test_split_string(self):
    # Verify normal behavior
    self.assertListEqual(['A', 'B'], list(self.sut._split_string('A+B', '+', '?')))
    # Verify that escape character escapes the delimiter
    self.assertListEqual(['A+B'], list(self.sut._split_string('A?+B', '+', '?')))
    # Verify that the escape character escapes the escape character
    self.assertListEqual(['A?', 'B'], list(self.sut._split_string('A??+B', '+', '?')))
    # Verify that the escape character is just copied if it doesn't escape the delimiter or escape character
    self.assertListEqual(['A?+B'], list(self.sut._split_string('A?+B', '\'', '?')))
I know this is an old question, but I recently needed a function like this and did not find any that met my requirements.
Rules:
The escape char only takes effect when followed by the escape char or the delimiter. Ex.: if the delimiter is / and the escape is \, then \a\b\c/abc becomes ['\a\b\c', 'abc']
Multiple escape chars will be escaped (\\ becomes \).
So, for the record, and in case someone is looking for something similar, here is my proposed function:
def str_escape_split(str_to_escape, delimiter=',', escape='\\'):
    """Splits a string using delimiter and escape chars
    Args:
        str_to_escape (str): The text to be split
        delimiter (str, optional): Delimiter used. Defaults to ','.
        escape (str, optional): The escape char. Defaults to a backslash.
    Yields:
        str: the next token, with escapes resolved
    """
    if len(delimiter) > 1 or len(escape) > 1:
        raise ValueError("Both delimiter and escape must be single characters")
    token = ''
    escaped = False
    for c in str_to_escape:
        if c == escape:
            if escaped:
                token += escape
                escaped = False
            else:
                escaped = True
            continue
        if c == delimiter:
            if not escaped:
                yield token
                token = ''
            else:
                token += c
                escaped = False
        else:
            if escaped:
                token += escape
                escaped = False
            token += c
    yield token
For the sake of sanity, I wrote some tests:
# The structure is:
# 'string_be_split_escaped', [list_with_result_expected]
tests_slash_escape = [
    ('r/casa\\/teste/g', ['r', 'casa/teste', 'g']),
    ('r/\\/teste/g', ['r', '/teste', 'g']),
    ('r/(([0-9])\\s+-\\s+([0-9]))/\\g<2>\\g<3>/g',
     ['r', '(([0-9])\\s+-\\s+([0-9]))', '\\g<2>\\g<3>', 'g']),
    ('r/\\s+/ /g', ['r', '\\s+', ' ', 'g']),
    ('r/\\.$//g', ['r', '\\.$', '', 'g']),
    ('u///g', ['u', '', '', 'g']),
    ('s/(/[/g', ['s', '(', '[', 'g']),
    ('s/)/]/g', ['s', ')', ']', 'g']),
    ('r/(\\.)\\1+/\\1/g', ['r', '(\\.)\\1+', '\\1', 'g']),
    ('r/(?<=\\d) +(?=\\d)/./', ['r', '(?<=\\d) +(?=\\d)', '.', '']),
    ('r/\\\\/\\\\\\/teste/g', ['r', '\\', '\\/teste', 'g'])
]
tests_bar_escape = [
    ('r/||/|||/teste/g', ['r', '|', '|/teste', 'g'])
]
def test(test_array, escape):
    """From input data, test escape functions
    Args:
        test_array (list): pairs of (input string, expected list of tokens)
        escape (str): the escape char to test with
    """
    for t in test_array:
        resg = str_escape_split(t[0], '/', escape)
        res = list(resg)
        if res == t[1]:
            print(f"Test {t[0]}: {res} - Pass!")
        else:
            print(f"Test {t[0]}: {t[1]} != {res} - Failed! ")
def test_all():
    test(tests_slash_escape, '\\')
    test(tests_bar_escape, '|')
if __name__ == "__main__":
    test_all()
Note that : doesn't appear to be a character that needs escaping.
The simplest way that I can think of to accomplish this is to split on the character, and then add it back in when it is escaped.
Sample code (In much need of some neatening.):
def splitNoEscapes(string, char):
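    sections = string.split(char)
    # endswith() avoids an IndexError on empty sections (e.g. two delimiters in a row)
    sections = [i + (char if i.endswith("\\") else "") for i in sections]
    result = ["" for i in sections]
    j = 0
    for s in sections:
        result[j] += s
        # stay on the same result slot while the section ends in a re-added delimiter
        j += (1 if not s.endswith(char) else 0)
    return [i for i in result if i != ""]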