Regex to delete characters before a string - regex

I have a text file like this; it has more than 500 thousand lines:
('12', '9', '56', 'Trojan.Genome.Win32.230770',
'04df65889035a471f8346565600841af',
'9190953854e36a248819e995078a060e0da2e687',
'b6488037431c283da6b9878969fecced695ca746afb738be49103bd57f37d4e4',
'2015-10-16 00:00:00', 'Zillya', '16', 'TROJAN', 'trojan.png',
'2016-01-14 21:35:44'); #line1
('13', '3', '54', 'UnclassifiedMalware',
'069506a02c4562260c971c8244bef301',
'd08e90874401d6f77768dd3983d398d427e46716',
'78e155e6a92d08cb1b180edfd4cc4aceeaa0f388cac5b0f44ab0af97518391a2',
'2015-10-15 00:00:00', 'Comodo', '6', 'MALWARE', 'malware.png',
'2016-01-14 21:35:44'); #line2
I only want to turn the text file into something like this:
Trojan.Genome.Win32.230770, 04df65889035a471f8346565600841af,
9190953854e36a248819e995078a060e0da2e687,
b6488037431c283da6b9878969fecced695ca746afb738be49103bd57f37d4e4
#line1
UnclassifiedMalware, 069506a02c4562260c971c8244bef301,
d08e90874401d6f77768dd3983d398d427e46716,
78e155e6a92d08cb1b180edfd4cc4aceeaa0f388cac5b0f44ab0af97518391a2
#line2
I have tried every regex I could think of, but they didn't work.

If this is supposed to be done multiple times, this solution might be lacking, simply because of a lack of documentation. Just applying a regex to a file (maybe not even saving it) is not really reproducible or understandable for others.
I'm proposing a small Python script to make clear what you are in fact doing. Besides, you'll get full control over the exact format of the output, where it writes to, etc.
# get regex module
import re

filename = 'path/to/your/file.txt'
# open file
with open(filename) as file_handle:
    for line in file_handle:
        # remove leading/trailing whitespace
        line = line.strip()
        # if the line is empty, forget about it
        if not line:
            continue
        # split into data part and comment part
        data, comment = line.split(';')
        comment = comment.strip()
        # transform into comma-separated values,
        # i.e. remove whitespace, parentheses, quotes
        data = re.sub(r"\s|\(|\)|'", '', data)
        # the file is built up like this (TODO: make names more logical)
        nr1, nr2, nr3, \
            name, \
            hash1, hash2, hash3, \
            first_date, discoverer, nr4, \
            category, snapshot_file, last_date = data.split(',')
        # print, or possibly write
        print("{name}, {hash1}, {hash2}, {hash3} {comment}".format(**locals()))

Since this is a comma-delimited file, you can use a regular expression to search and replace, although it won't be nearly as efficient as just splitting up the string in your programming language of choice.
'([^']*)',\s*
will find a single quote, then capture all the text until it encounters the next single quote, followed by the comma and any trailing whitespace.
You would then repeat that a bunch of times, once for each comma-separated field.
It would look a little like this, and then you can choose which fields to substitute back into your text. In this case, you want only fields \4 through \8.
Could it be written so the \1 through \3 are not captured? Certainly, using a non-capturing (?:...) group. Then your substitutions would range from \1 through \5. But this makes it flexible enough that if you want to include or exclude any of the other fields, it's as simple as including or excluding them in the substitution field.
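As a sketch of that repeated-capture-group approach in Python (assuming, for simplicity, that a whole record has been joined onto a single line; the grouping here keeps fields 4 through 7 plus the trailing comment, as one of several reasonable variations):

```python
import re

# One record from the question, joined onto a single line for the example
record = ("('12', '9', '56', 'Trojan.Genome.Win32.230770', "
          "'04df65889035a471f8346565600841af', "
          "'9190953854e36a248819e995078a060e0da2e687', "
          "'b6488037431c283da6b9878969fecced695ca746afb738be49103bd57f37d4e4', "
          "'2015-10-16 00:00:00', 'Zillya', '16', 'TROJAN', 'trojan.png', "
          "'2016-01-14 21:35:44'); #line1")

# Repeat the quoted-field pattern once per field, then skip the rest
field = r"'([^']*)',\s*"
pattern = r"\(" + field * 7 + r".*?;\s*(#\S+)"

# Substitute back only the name, the three hashes, and the comment
result = re.sub(pattern, r"\4, \5, \6, \7 \8", record)
print(result)
```

This keeps the flexibility mentioned above: including or excluding a field is just a matter of adding or dropping its backreference in the replacement string.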


regular expression

I am trying to find a regular expression that satisfies the following needs.
It should treat all spaces as separators until a colon has been passed twice. After that, it should continue to use spaces as separators until a third colon is identified. This third colon should be used as a separator as well, but any spaces before and after this specific colon should not be treated as separators. After this special colon has been identified, no more separators should be found, whether space or colon.
2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf c.w.f.w.NiceController : z rest as async texting: json, special character, spacses.....
I would like to have the separators here identified as follows (separator shown as X):
2019-12-28X13:00:00.112XDEBUGXn-somethingspecial.atX---X[9999-118684]X3894ß8349ß84930ßaa14e38eae18e3ebfXc.w.f.w.NiceControllerXz rest as async texting: json, special character, spacses.....
2019-12-28 X 13:00:00.112 X DEBUG X n-somethingspecial.at X --- X [9999-118684] X 3894ß8349ß84930ßaa14e38eae18e3ebf X c.w.f.w.NiceController X z rest as async texting: json, special character, spacses.....
Exactly 8 separators are found here.
Any ideas how to do this via regular expression?
My current approach does not work; I tried to do this like the following.
Any ideas about this?
Update:
(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?<=DEBUG)\s|(?<=\s---)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=\[[0-9a-z\#\.\-]{15}\])\s|((?<=\[[0-9a-z\#\.\-]{15}\]\s)\s|(?<=\[[0-9a-z\#\.\-]{15}\]\s[a-z0-9]{32})\s)|\s(?=---)|(?<=[a-zA-Z])\s+\:\s
That's my current syntax to identify the separators.
Update 2:
Regex above is faulty.
Update 3:
(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.domain\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s)|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})
This is the current regex. The target approach is to call:
df = pd.read_csv(file_name,
                 sep="(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.domain\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s)|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})",
                 names=['date', 'time', 'level', 'host', 'template', 'threadid', 'logid', 'classmethods', 'line'],
                 engine='python',
                 nrows=100)
This could be extended later to dask, which gives me the chance to parse multiple log files into one dataframe.
The last column, line, is not identified correctly, for reasons as yet unknown.
If that log format is sufficiently regular, you can take the lines apart much more easily with str.split.
The assumptions are that none of the first eight fields have an internal space, and that all of them are always present (or, if not all are present, that the last field, which starts after the colon, is also not present). You can then use the maxsplit argument to str.split in order to stop splitting when the ninth field starts:
def separate(logline):
    fields = logline.split(maxsplit=8)  # 8 space-separated fields + the rest
    if len(fields) > 8:
        # Fix up the ninth field. Perhaps you want to remove the colon:
        fields[8] = fields[8][1:]
        # or perhaps you want the text starting at the first non-whitespace
        # character after the colon:
        #
        # if fields[8][0] == ':':
        #     fields[8] = fields[8].split(maxsplit=1)[1]
        #
        # etc.
    return fields
>>> logline = ( "2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at"
... + " --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf"
... + " c.w.f.w.NiceController"
... + " : z rest as async texting: json, special character, spaces.....")
>>> separate(logline)
['2019-12-28', '13:00:00.112', 'DEBUG', 'n-somethingspecial.at', '---',
'[9999-118684]', '3894ß8349ß84930ßaa14e38eae18e3ebf',
'c.w.f.w.NiceController',
' z rest as async texting: json, special character, spaces.....']
Solution
The current outcome of my problem can be solved via the following regular expression.
(?:(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.hostname\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s))|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})
Minor adaptions may still be needed, but for now it works pretty well.

Extract a text string with regex

I have a large set of data I need to clean with open refine.
I am quite bad with regex and I can't think of a way to get what I want,
which is extracting a text string between quotes that includes lots of special characters like " ' / \ # -
In each cell, it has the same format
caption': u'text I want to extract', u'likes':
Any help would be highly appreciated!
If you want to extract a text string that includes lots of special characters and is located between single quotes ' ', you can do it in general this way:
\'[\S\s]*?\'
In your case, if you want to extract only the middle quoted string from this: caption': u'text I want to extract', u'likes': , try this regex:
(?<=u\')[\V]*?(?=\'\,)
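The same lookbehind/lookahead idea works in Python too. One caveat: Python's re module does not support \V, so a negated newline class stands in for it here; the sample cell is the one from the question:

```python
import re

cell = "caption': u'text I want to extract', u'likes':"
# (?<=u') anchors just after u', [^\n]*? lazily grabs the text,
# and (?=',) stops right before the closing quote and comma
match = re.search(r"(?<=u')[^\n]*?(?=',)", cell)
print(match.group())  # text I want to extract
```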
We designed OpenRefine with a few smart functions to handle common cases such as yours without using Regex.
Two other cool ways to handle this in OpenRefine.
Using drop down menu:
Edit Column
Split into several columns
by separator, with ' as the separator
Using smartSplit:
smartSplit(string s, optional string sep)
returns: array
Returns the array of strings obtained by splitting s with separator sep. Handles quotes properly. Guesses tab or comma separator if "sep" is not given.
value.smartSplit("'")[2]
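Outside OpenRefine, the same index-after-split trick is a one-liner in plain Python (a sketch, with the sample cell hard-coded):

```python
cell = "caption': u'text I want to extract', u'likes':"
# Splitting on the single quote and taking element 2 mirrors
# GREL's value.smartSplit("'")[2]
print(cell.split("'")[2])  # text I want to extract
```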

Strip blank lines from first n lines using sed

I need to strip blank lines from only the first 6 lines of a text file. I've attempted to cobble together a solution using this StackOverflow question and this file but to no avail.
Here's the sed script I'm using (aliased as faprep='~/misc-scripts/fa-prep.sed'); the last command is the one that's failing:
#!/opt/local/bin/sed -f
# Title Treatments
s|<\(/\?\)h1[^>]*\?>|[\1b]|g # Replace <h1></h1> with [b][/b] for saga titles
s|<\(/\?\)h2[^>]*\?>|[\1i]|g # Replace <h2></h2> with [i][/i] for arc titles
s|</\?h3[^>]*\?>||g # Strip <h3 id=""></h3> out without removing chapter title text
# HTML tag strips & substitutions
s|</\?p>||g # Strip all <p></p> tags
s|<\(/\?\)em>|[\1i]|g # Change <em></em> to [i][/i]
s|<\(/\?\)strong>|[\1b]|g # Change <strong></strong> to [b][/b]
# Character code substitutions
s/&\#822[01];/\"/g # Replace “ and ” with straight double quote (")
s/&\#8217;/\'/g # Replace ’ with straight single quote (')
s/&\#8230;/.../g # Replace … with a 3-period ellipsis (...)
s/&\#821[12];/--/g # Replace — with a 2-hyphen em dash (--)
# Final prep; stripping out unnecessary cruft
/<body>/,/<\/body>/!d # Delete everything OUTSIDE the <body></body> tags
/<\/\?body>/d # Then, delete the body tags :3
# Pay attention to meeeeeeee!!!!
1,6{/./!d} # Remove blank lines from around titles??
Here's the command I'm running from terminal, which shows the last line failing to strip whitespace from the first 6 lines of the file (after all of the other modifications have been made, of course):
calyodelphi@dragonpad:~/pokemon-story/compilations $ ch='ch6'; faprep $ch-mmd.html > $ch-fa.txt; head -6 $ch-fa.txt
[b]Hoenn Saga (S1)[/b]

[i]Next City Arc (A2)[/i]

Chapter 6: A Peaceful City Stroll... Or Not
calyodelphi@dragonpad:~/pokemon-story/compilations $
The rest of the file is composed of a blank line after the third title and then paragraphs all separated by blank lines. I want to keep those blank lines, so that only the blank lines between the titles at the very top are stripped.
Just to clarify a few points: this file has Unix line endings, and the lines are supposed to not have spaces. Even viewing in a text editor that shows whitespace, each blank line contains only a newline character.
Since the discussion in the comments made it clear that you want to ignore empty lines in the first six lines of the body tag -- in other words, the first six times that part of the script is reached -- rather than the first six lines of the overall input data, you cannot use the global line counters. Since you're not using the hold buffer, we can use it to build our own counter, though.
So, replace
1,6 { /./! d }
with
x                 # swap in hold buffer
/.\{6\}/! {       # if the counter in it hasn't reached 6
  s/^/./          # increment by one (i.e., append a character)
  x               # swap the input back in
  /./!d           # if it is empty, discard it
  x               # otherwise swap back
}
x                 # and swap back one more time. This dance ensures that the
                  # line from the input is in the pattern space when we drop
                  # out at the bottom to the printing, regardless of which
                  # branches were entered.
Or, if this seems too complicated, use @glennjackman's suggestion and pipe the output of the first sed script through sed '1,6 { /./! d; }', since the second process will have its own line counters working on the preprocessed data. There's no fun in it, but it'll work.
This answer comes courtesy of @Wintermute's comments on my question pointing me in the right direction! I was mistakenly thinking that sed was working on the modified stream when I put that delete statement in at the very end. When I tried a different address (lines 9,14) it worked perfectly, but that was too hackish for me to settle on. Still, it confirmed I needed to think of the stream as still including lines that I thought were already gone.
So I moved the delete statement up above the statement that clears out the <body> tags and everything outside them, and used a regex and the addr1,+N trick here to produce this final result:
The script:
#!/opt/local/bin/sed -f
# Title Treatments
s|<\(/\?\)h1[^>]*\?>|[\1b]|g # Replace <h1></h1> with [b][/b] for saga titles
s|<\(/\?\)h2[^>]*\?>|[\1i]|g # Replace <h2></h2> with [i][/i] for arc titles
s|</\?h3[^>]*\?>||g # Strip <h3 id=""></h3> out without removing chapter title text
# HTML tag strips & substitutions
s|</\?p>||g # Strip all <p></p> tags
s|<\(/\?\)em>|[\1i]|g # Change <em></em> to [i][/i]
s|<\(/\?\)strong>|[\1b]|g # Change <strong></strong> to [b][/b]
# Character code substitutions
s/&\#822[01];/\"/g # Replace “ and ” with straight double quote (")
s/&\#8217;/\'/g # Replace ’ with straight single quote (')
s/&\#8230;/.../g # Replace … with a 3-period ellipsis (...)
s/&\#821[12];/--/g # Replace — with a 2-hyphen em dash (--)
# Final prep; stripping out unnecessary cruft
/<body>/,+6{/^$/d} # Remove blank lines from around titles
/<body>/,/<\/body>/!d # Delete everything OUTSIDE the <body></body> tags
/<\/\?body>/d # Then, delete the body tags :3
And the resulting output:
calyodelphi@dragonpad:~/pokemon-story/compilations $ ch='ch6'; faprep $ch-mmd.html > $ch-fa.txt; head -6 $ch-fa.txt
[b]Hoenn Saga (S1)[/b]
[i]Next City Arc (A2)[/i]
Chapter 6: A Peaceful City Stroll... Or Not
The next two weeks of training passed by too quickly and too slowly at the same time. [rest of paragraph omitted for space]
calyodelphi@dragonpad:~/pokemon-story/compilations $
Thanks @Wintermute! :D

Parsing log files

I'm trying to write a script to simplify the process of searching through a particular application's log files for specific information. I thought maybe there's a way to convert them into an XML tree, and I'm off to a decent start... but the problem is, the application's log files are an absolute mess, if you ask me.
Some entries are simple
2014/04/09 11:27:03 INFO Some.code.function - Doing stuff
Ideally I'd like to turn the above into something like this
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>INFO</Type>
<Source>Some.code.function</Source>
<Sub>Doing stuff</Sub>
</Message>
Other entries are something like this where there's additional information and line breaks
2014/04/09 11:27:04 INFO Some.code.function - Something happens
changes:
this stuff happened
I'd like to turn this last chunk into something like the above, but add the additional info into a section
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>INFO</Type>
<Source>Some.code.function</Source>
<Sub>Doing stuff</Sub>
<details>changes:
this stuff happened</details>
</Message>
and then other messages, errors will be in the form of
2014/04/09 11:27:03 ERROR Some.code.function - Something didn't work right
Log Entry: LONGARSEDGUID
Error Code: E3145
Application: Name
Details:
message information etc etc and more line breaks, this part of the message may add up to an unknown number of lines before the next entry
This last chunk I'd like to convert as the last to above examples, but adding XML nodes for log entry, error code, application, and again, details like so
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>ERROR </Type>
<Source>Some.code.function</Source>
<Sub>Something didn't work right</Sub>
<Entry>LONGARSEDGUID</Entry>
<Code>E3145</Code>
<Application>Name</Application>
<details>message information etc etc and more line breaks, this part of the message may add up to an unknown number of lines before the next entry</details>
</Message>
Now I know that Select-String has a context option which would let me select a number of lines after the line I've filtered, the problem is, this isn't a constant number.
I'm thinking a regular expression would also allow me to select the paragraph chunk before the date string, but regular expressions are not a strong point of mine, and I thought there might be a better way, because the one constant is that new entries start with a date string.
The idea is to break these up into XML or tables of sorts, and from there I'm hoping it might make filtering out non-relevant or recurring messages a little easier.
I have a sample I just tossed on pastebin after removing/replacing a few bits of information for privacy reasons
http://pastebin.com/raw.php?i=M9iShyT2
Sorry this is kind of late; I got tied up with work for a bit there (darn work, expecting me to be productive while on their dime). I ended up with something similar to Ansgar Wiechers's solution, but formatted things into objects and collected those into an array. It doesn't manage the XML that you added later, but this gives you a nice array of objects to work with for the other records. I'll explain the main RegEx line here, and comment in-line where it's practical.
'(^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\d+?] (\w+?) {1,2}(.+?) - (.+)$' is the RegEx that detects the start of a new record. I started to explain it, but there are probably better resources for learning RegEx than me explaining it to you. See this RegEx101.com link for a full breakdown and examples.
$Records=@() #Create empty array that we will populate with custom objects later
$Event = $Null #Make sure nothing is in $Event to give the script a clean start
Get-Content 'C:\temp\test1.txt' | #Load file, and start looping through it line-by-line.
    ?{![string]::IsNullOrEmpty($_)}|%{ #Filter out blank lines, and then perform the following on each line
        if ($_ -match '(^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\d+?] (\w+?) {1,2}(.+?) - (.+)$') { #New Record Detector line! If it finds this RegEx match, it means we're starting a new record.
            if ($Event) { #If there's already a record in progress, add it to the array
                $Records+=$Event
            }
            $Event = New-Object PSObject -Property @{ #Create a custom PSObject with these properties that we just got from that RegEx match
                DateStamp = [datetime](get-date $Matches[1]) #We convert the date/time stamp into an actual DateTime object. That way sorting works better, and you can compare it to real dates if needed.
                Type = $Matches[2]
                Source = $Matches[3]
                Message = $Matches[4]}
Ok, little pause for the cause here. $Matches isn't defined by me, so why am I referencing it? When PowerShell gets matches from a RegEx expression it automagically stores the resulting matches in $Matches. So all the groups that we just matched in parentheses become $Matches[1], $Matches[2], and so on. Yes, it's an array, and there is a $Matches[0], but that is the entire string that was matched against, not just the groups that matched. We now return you to your regularly scheduled script...
        } else { #End of the 'New Record' section. If it's not a new record it does the following
            if($_ -match "^((?:[^ \[])(?:\w| |\.)+?):(.*)$"){
RegEx match again. It starts off by stating that this has to be the beginning of the string with the caret character (^). Then it says (in a non-capturing group noted by the (?:<stuff>) format, which really for my purposes just means it won't show up in $Matches) [^ \[]; that means that the next character can not be a space or an opening bracket (escaped with a \), just to speed things up and skip those lines for this check. If you have things in brackets [] and the first character is a caret, it means 'don't match anything in these brackets'.
I actually just changed this next part to include periods, and used \w instead of [a-zA-Z0-9] because it's essentially the same thing but shorter. \w is a "word character" in RegEx, and includes letters, numbers, and the underscore. I'm not sure why the underscore is considered part of a word, but I don't make the rules, I just play the game. I was using [a-zA-Z0-9], which matches anything between 'a' and 'z' (lowercase), anything between 'A' and 'Z' (uppercase), and anything between '0' and '9'. At the risk of including the underscore character, \w is a lot shorter and simpler.
Then the actual capturing part of this RegEx. This has 2 groups: the first is letters, numbers, underscores, spaces, and periods (escaped with a \ because '.' on its own matches any character). Then a colon. Then a second group that is everything else until the end of the line.
                $Field = $Matches[1] #Everything before the colon is the name of the field
                $Value = $Matches[2].trim() #Everything after the colon is the data in that field
                $Event | Add-Member $Field $Value #Add the Field to $Event as a NoteProperty, with a value of $Value. Those two are actually positional parameters for Add-Member, so we don't have to go and specify what kind of member, specify what the name is, and what the value is. Just Add-Member <[string]name> <value can be a string, array, yeti, whatever... it's not picky>
            } #End of New Field for current record
            else{ #If it didn't find the regex to determine whether it is a new field, then this is just more data from the last field, so don't change the field, just add the data to it
                $Event.$Field += if(![string]::IsNullOrEmpty($Event.$Field)){"`r`n$_"}else{$_}
            }
This is a long explanation for a fairly short bit of code. Really all it does is add data to the field! It has an inverted (prefixed with !) If check to see whether the current field already has data, or whether it is currently null or empty. If the field already has data, it appends a new line and then the data; if it is empty, it skips the newline and just adds the data.
        }
    }
$Records+=$Event #Adds the last event to the array of records.
Sorry, I'm not very good with XML. But at least this gets you clean records.
Edit: Ok, code is notated now, hopefully everything is explained well enough. If something is still confusing perhaps I can refer you to a site that explains better than I can. I ran the above against your sample input in PasteBin.
One possible way to deal with such files is to process them line by line. Each log entry starts with a timestamp and ends when the next line starting with a timestamp appears, so you could do something like this:
Get-Content 'C:\path\to\your.log' | % {
    if ($_ -match '^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}') {
        if ($logRecord) {
            # If a current log record exists, it is complete now, so it can be added
            # to your XML or whatever, e.g.:
            $logRecord -match '^(\d{4}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (\S+) ...'
            $message = $xml.CreateElement('Message')
            $date = $xml.CreateElement('Date')
            $date.InnerText = $matches[1]
            $message.AppendChild($date)
            $time = $xml.CreateElement('Time')
            $time.InnerText = $matches[2]
            $message.AppendChild($time)
            $type = $xml.CreateElement('Type')
            $type.InnerText = $matches[3]
            $message.AppendChild($type)
            ...
            $xml.SelectSingleNode('...').AppendChild($message)
        }
        $logRecord = $_ # start new record
    } else {
        $logRecord += "`r`n$_" # append to current record
    }
}
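The same accumulate-until-the-next-timestamp technique translates directly to other languages; here is a Python sketch for comparison (the sample lines are illustrative):

```python
import re

log_lines = [
    "2014/04/09 11:27:03 INFO Some.code.function - Doing stuff",
    "2014/04/09 11:27:04 INFO Some.code.function - Something happens",
    "changes:",
    "this stuff happened",
]

# A new record starts at a timestamp; everything else is a continuation
timestamp = re.compile(r'^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}')
records = []
current = None
for line in log_lines:
    if timestamp.match(line):
        if current is not None:
            records.append(current)  # previous record is complete
        current = line               # start a new record
    elif current is not None:
        current += "\n" + line       # append to the current record
if current is not None:
    records.append(current)          # don't forget the last record

print(len(records))  # 2
```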

How to split CSV line according to specific pattern

In a .csv file I have lines like the following :
10,"nikhil,khandare","sachin","rahul",viru
I want to split the line using a comma (,). However, I don't want to split words between double quotes (" "). If I split using a comma, I will get an array with the following items:
10
nikhil
khandare
sachin
rahul
viru
But I don't want the items between double-quotes to be split by comma. My desired result is:
10
nikhil,khandare
sachin
rahul
viru
Please help me to sort this out.
The character used for separating fields should not be present in the fields themselves. If possible, replace , with ; for separating fields in the csv file, it'll make your life easier. But if you're stuck with using , as separator, you can split each line using this regular expression:
/((?:[^,"]|"[^"]*")+)/
For example, in Python:
import re
s = '10,"nikhil,khandare","sachin","rahul",viru'
re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]
=> ['10', '"nikhil,khandare"', '"sachin"', '"rahul"', 'viru']
Now to get the exact result shown in the question, we only need to remove those extra " characters:
[e.strip('" ') for e in re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]]
=> ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']
If you really always have such a simple structure, you can split on "," (yes, with the quotes) after discarding the first number and comma.
If not, you can use a very simple form of state machine, parsing your input from left to right. You would have two states: inside quotes and outside. Regular expressions are also a good (and simpler) way if you already know them, as they are basically an equivalent of a state machine, just in another form.
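For what it's worth, Python's standard csv module already implements exactly this quote-aware state machine, so you may not need a regex at all:

```python
import csv
from io import StringIO

line = '10,"nikhil,khandare","sachin","rahul",viru'
# csv.reader honours the double quotes, so the embedded comma survives
# and the quote characters are stripped from the result
fields = next(csv.reader(StringIO(line)))
print(fields)  # ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']
```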