Extract text between single quotes in MATLAB - regex

I have multiple lines in some text files such as
.model sdata1 s tstonefile='../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p' passive=2
I want to extract the text between the single quotes in MATLAB.
Much help would be appreciated.

To get all of the text inside multiple '' blocks, regexp can be used as follows:
regexp(txt,'''(.[^'']*)''','tokens')
This captures text surrounded by ' characters, where the captured text itself does not contain a '. For example, consider this file with two lines (I made up a different file name for the second one),
txt = ['.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'' passive=2 ', char(10), ...
'.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_3port_via_minstub.s00p'' passive=2']
>> stringCell = regexp(txt,'''(.[^'']*)''','tokens');
>> stringCell{:}
ans =
'../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'
ans =
'../data/s_element/isdimm_rcv_via_3port_via_minstub.s00p'
>>
Trivia:
char(10) gives a newline character because 10 is the ASCII code for newline.
The . character in a regexp pattern (regex in the rest of the coding world) usually does not match a newline, which would make this a safer pattern. In MATLAB, a dot in regexp does match a newline, so to disable this, we could add 'dotexceptnewline' as the last input argument to regexp. This is a convenient way to ensure a match never runs across lines and picks up text outside the quotes, but it is not needed here, since matching proceeds left to right and each opening quote is paired with the next closing quote.
Instead of excluding a ' from the match with [^''], the match can be made non-greedy with ? as follows, regexp(txt,'''(.*?)''','tokens').
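For comparison, here is a minimal Python sketch of the same two patterns (negated character class and non-greedy dot). It is only an illustration: Python's re module stands in for MATLAB's regexp, and extract_quoted is a hypothetical helper name, not something from the answer.

```python
import re

# Two sample lines, mirroring the txt variable in the MATLAB example.
txt = (
    ".model sdata1 s tstonefile='../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p' passive=2\n"
    ".model sdata1 s tstonefile='../data/s_element/isdimm_rcv_via_3port_via_minstub.s00p' passive=2"
)

def extract_quoted(text, pattern=r"'([^']*)'"):
    """Return every substring found between a pair of single quotes."""
    return re.findall(pattern, text)

negated = extract_quoted(txt)            # negated character class
lazy = extract_quoted(txt, r"'(.*?)'")   # non-greedy alternative
for path in negated:
    print(path)
```

In Python the dot does not match a newline unless re.DOTALL is passed, so the non-greedy form cannot run across lines here.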

If you plan to use textscan:
fid = fopen('data.txt','r');
rawdata = textscan(fid,'%s','delimiter','''');
fclose(fid);
output = rawdata{:}(2)
As also used in other answers, the single apostrophe ' is represented by a doubled one, '', e.g. for delimiters.
Considering the comment:
fid = fopen('data.txt','r');
rawdata = textscan(fid,'%s','delimiter','\n');
fclose(fid);
lines = rawdata{1,1};
L = size(lines,1);
output = cell(L,1);
for ii=1:L
temp = textscan(lines{ii},'%s','delimiter','''');
output{ii,1} = temp{:}(2);
end

One easy way is to split the string with single quote delimiter and take the even-numbered strings in the output:
str = fileread('test.txt');
out = regexp(str, '''', 'split');
out = out(2:2:end);
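The same split-and-pick idea can be sketched in Python. This assumes the quotes in the text are balanced; between_quotes is a hypothetical name used only for illustration.

```python
def between_quotes(text):
    """Split on single quotes and keep the pieces that were inside a pair."""
    # Pieces alternate outside/inside, so with 0-based indexing the quoted
    # pieces sit at indices 1, 3, 5, ... (MATLAB's 2:2:end is 1-based).
    return text.split("'")[1::2]

sample = "a='first' b='second' c=3"
print(between_quotes(sample))
```

If a line contains an unmatched quote, every piece after it shifts by one, so this is best used on input known to be well formed.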

You can do this using regular expressions. Assuming that there is only one occurrence of text between quotation marks:
% select all chars between single quotation marks.
out = regexp(inputString,'''(.*)''','tokens','once');

After identifying which lines you want to extract info from, you could tokenize them or, if they all have the same form, do something like this:
test='.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'' passive=2';
a=strfind(test,'''') % positions of all single quotes
test=test(a(1)+1:a(2)-1) % keep only the text between the first pair
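A rough Python counterpart of this index-based approach, assuming each line holds at most one quoted value of interest; first_quoted is a hypothetical helper name.

```python
def first_quoted(line):
    """Return the text between the first two single quotes, or None."""
    start = line.find("'")
    if start == -1:
        return None          # no opening quote
    end = line.find("'", start + 1)
    if end == -1:
        return None          # no closing quote
    return line[start + 1:end]

line = ".model sdata1 s tstonefile='../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p' passive=2"
print(first_quoted(line))
```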

Related

REGEX_TOO_COMPLEX error when parsing regex expression

I need to split the CSV file at commas, but the problem is that file can contain commas inside fields. So for an example:
one,two,tree,"four,five","six,seven".
It uses double quotes to escape, but I could not solve it.
I tried to use something like this with this regex, but I got an error: REGEX_TOO_COMPLEX.
data: lv_sep type string,
lv_rep_pat type string.
data(lv_row) = iv_row.
"Define a separator to replace commas in double quotes
lv_sep = cl_abap_conv_in_ce=>uccpi( uccp = 10 ).
concatenate '$1$2' lv_sep into lv_rep_pat.
"replace all commas that are separator with the new separator
replace all occurrences of regex '(?:"((?:""|[^"]+)+)"|([^,]*))(?:,|$)' in lv_row with lv_rep_pat.
split lv_row at lv_sep into table rt_cells.
You can use this regex instead: ,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
DATA: lv_sep TYPE string,
lv_rep_pat TYPE string.
DATA(lv_row) = 'one,two,tree,"four,five","six,seven"'.
"Define a separator to replace commas in double quotes
lv_sep = cl_abap_conv_in_ce=>uccpi( uccp = 10 ).
CONCATENATE '$1$2' lv_sep INTO lv_rep_pat.
"replace all commas that are separator with the new separator
REPLACE ALL OCCURRENCES OF REGEX ',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)' IN lv_row WITH lv_rep_pat.
SPLIT lv_row AT lv_sep INTO TABLE data(rt_cells).
LOOP AT rt_cells into data(cells).
WRITE cells.
SKIP.
ENDLOOP.
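To see what the lookahead does independently of ABAP, here is a small Python sketch using the same idea: a comma counts as a separator only if an even number of double quotes follows it on the line. The names SEPARATOR_COMMA and split_row are hypothetical, and the sketch assumes quotes are balanced and not escaped.

```python
import re

# A comma followed by zero or more complete "..." pairs and then no
# further quotes up to the end of the line is outside any quoted field.
SEPARATOR_COMMA = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def split_row(row):
    """Split a CSV row at commas that are not inside double quotes."""
    return SEPARATOR_COMMA.split(row)

cells = split_row('one,two,tree,"four,five","six,seven"')
for cell in cells:
    print(cell)
```

The quotes stay attached to the cells, as they do when the row is split after replacing the separator commas.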
I never ever touched ABAP, so please see this as pseudo code
I'd recommend using a non-regex solution here:
data: checkedOffsetComma type i,
checkedOffsetQuotes type i,
baseOffset type i,
testString type string value 'val1, "val2, val21", val3'.
LOOP AT SomeFancyConditionYouDefine.
checkedOffsetComma = baseOffset.
checkedOffsetQuotes = baseOffset.
find FIRST OCCURRENCE OF ','(or end of line here) in testString match OFFSET checkedOffsetComma.
write checkedOffsetComma.
find FIRST OCCURRENCE OF '"' in testString match OFFSET checkedOffsetQuotes.
write checkedOffsetQuotes.
*if the next comma is closer than the next quotes
IF checkedOffsetComma < checkedOffsetQuotes.
REPLACE SECTION checkedOffsetComma 1 OF ',' WITH lv_rep_pat.
baseOffset = checkedOffsetComma.
ELSE.
*if we found quotes, we go to the next quotes afterwards and then continue as before after that position
find FIRST OCCURRENCE OF '"' in testString match OFFSET checkedOffsetQuotes.
write baseOffset.
ENDIF.
ENDLOOP.
This assumes that there are no quotes nested inside quoted fields. I didn't test or validate it in any way. I'd be happy if this at least partly compiles :)
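The non-regex idea above (walk the string, track whether you are inside quotes, and only treat commas outside quotes as separators) can be sketched in Python as follows. This is not a translation of the ABAP pseudo code, just one way to express the same approach; split_outside_quotes is a hypothetical name, and nested or escaped quotes are not handled.

```python
def split_outside_quotes(row, delimiter=",", quote='"'):
    """Split row at delimiters that are not inside a quoted section."""
    cells = []
    current = []
    in_quotes = False
    for ch in row:
        if ch == quote:
            in_quotes = not in_quotes   # toggle on every quote character
            current.append(ch)
        elif ch == delimiter and not in_quotes:
            cells.append("".join(current))
            current = []
        else:
            current.append(ch)
    cells.append("".join(current))      # last cell has no trailing delimiter
    return cells

print(split_outside_quotes('val1, "val2, val21", val3'))
```

Because it is a single pass with one flag, there is no regex complexity limit to hit, whatever the length of the row.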

Regex cheating in csv-parsing delimited at comma, ignore in quotes

So, I'm trying to figure out how to write a simple regex for Visual Basic .NET, but am not getting anywhere.
I'm parsing csv files into a list of arrays, but the source csv's are anything but pristine. There are extra/rogue quotes in just enough places to crash the program, and enough sets of quotes to make fixing the data manually cumbersome.
I've written in a bunch of error-checking, and it works about 99.99% of the time. However, with 10,000 lines to parse for each folder, that averages one error per set of csv files. Crash. To get that last 0.01% parsed properly, I've created an If statement that pulls out lines that have an odd number of quotes and removes ALL of them, which triggers a manual error-check. If there are zero quotes, the line is processed as usual. If there's an even number of quotes, the standard Split function cannot ignore delimiters between quotes without a regex.
Could someone help me figure out a regex string that will ignore fields enclosed in quotes?
Here's the code I've been able to think up up to this point.
Thank you in advance
Using filereader1 As New Microsoft.VisualBasic.FileIO.TextFieldParser(files_(i),
System.Text.Encoding.Default) 'system text decoding adds odd characters
filereader1.TextFieldType = FieldType.Delimited
'filereader1.Delimiters = New String() {","}
filereader1.SetDelimiters(",")
filereader1.HasFieldsEnclosedInQuotes = True
For Each c As Char In whole_string
If c = """" Then cnt = cnt + 1
Next
If cnt = 0 Then 'no quotes
split_string = Split(whole_string, ",") 'split by commas
ElseIf cnt Mod 2 = 0 Then 'even number of quotes
split_string = Regex.Split(whole_string, "(?=(([^""]|.)*""([^""]|.)*"")*([^""]|.)*$)")
ElseIf cnt <> 0 Then 'odd number of quotes
whole_string = whole_string.Replace("""", " ") 'delete all quotes
split_string = Split(whole_string, ",") 'split by commas
End If
In VB.NET, there are several ways to proceed.
Option 1
You can use this regex: ,(?![^",]*")
It matches commas that are not inside quotes: a comma , that is not followed (as asserted by the negative lookahead (?![^",]*") ) by characters that are neither a comma nor a quote then a quote.
In VB.NET, something like:
Dim MyRegex As New Regex(",(?![^"",]*"")")
ResultString = MyRegex.Replace(Subject, "|")
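Before relying on Option 1, it may be worth checking it against representative rows. The Python sketch below applies the same pattern to a sample line; the variable names are hypothetical and Python's re is used only as a convenient test bench.

```python
import re

# Same pattern as Option 1: a comma not followed by non-comma,
# non-quote characters and then a double quote.
OPTION_1 = re.compile(r',(?![^",]*")')

subject = 'LIST,410210,2-4,"PUMP, HYDRAULIC PISTON - MAIN",1,,,'
replaced = OPTION_1.sub("|", subject)
print(replaced)
```

On this line the comma inside the quotes is left alone, but so is the comma directly before the opening quote, because the lookahead sees a quote immediately after it. A quoted field with more than one inner comma can also have an inner comma replaced, so this pattern suits only simple data.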
Option 2
This uses this beautifully simple regex: "[^"]*"|(,)
This is a more general solution that is easy to tweak. For a full description, I recommend having a look at the question about regex-matching or replacing a pattern except in certain situations (see Reference). It can make for a very tidy solution that is easy to maintain if you find other cases to handle.
The left side of the alternation | matches complete "quotes". We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
This code should work:
Imports System
Imports System.Text.RegularExpressions
Imports System.Collections.Specialized
Module Module1
Sub Main()
Dim MyRegex As New Regex("""[^""]*""|(,)")
Dim Subject As String = "LIST,410210,2-4,""PUMP, HYDRAULIC PISTON - MAIN"",1,,,"
Dim Replaced As String = MyRegex.Replace(Subject,
Function(m As Match)
' Group 1 is set only for a comma outside quotes
If m.Groups(1).Success Then
Return "|"
Else
' a complete quoted string: put it back unchanged
Return m.Groups(0).Value
End If
End Function)
Console.WriteLine(Replaced)
Console.WriteLine(vbCrLf & "Press Any Key to Exit.")
Console.ReadKey()
End Sub
End Module
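The alternation trick from Option 2 carries over to other regex flavors. A minimal Python sketch, assuming balanced quotes and assuming "|" never occurs in the data; QUOTED_OR_COMMA and mark_separators are hypothetical names.

```python
import re

# Left branch consumes whole quoted strings; right branch captures
# the commas that remain, which are therefore outside quotes.
QUOTED_OR_COMMA = re.compile(r'"[^"]*"|(,)')

def mark_separators(line, marker="|"):
    """Replace commas outside quotes with marker, keep quoted text as is."""
    def swap(match):
        # Group 1 is set only when the right-hand branch matched.
        return marker if match.group(1) is not None else match.group(0)
    return QUOTED_OR_COMMA.sub(swap, line)

line = 'LIST,410210,2-4,"PUMP, HYDRAULIC PISTON - MAIN",1,,,'
marked = mark_separators(line)
fields = marked.split("|")
print(marked)
```

Splitting on the marker afterwards yields the fields, with empty strings for the empty trailing columns.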
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...

Replace all occurrences of space until the first comma regex?

So basically I have a CSV like:
121\sdf\ 34 4333DSssD,23233,TECH,32, ...
that first string is the ID but its supposed to have + not spaces. They got trimmed out, so now on each line until the first comma I need to replace any spaces with +.
Was thinking of using regex for this and re.sub (processing using python) but am having trouble only getting the spaces.
Was hoping StackOverflow could help :D
This can be done without a regex; just partition on the comma and manipulate the left partition
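with open('path/to/input') as infile:
    for line in infile:
        # split once at the first comma; only the left part is the ID
        left, comma, right = line.partition(',')
        # line already ends with a newline, so suppress print's own
        print("%s%s%s" % (left.replace(' ', '+'), comma, right), end='')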
Here is one solution without regular expressions (assuming you have a string with a single line called line, this would probably be inside of a for loop that is iterating over the file object):
pieces = line.split(',', 1)
pieces[0] = pieces[0].replace(' ', '+')
line = ','.join(pieces)
Or with regular expressions:
import re
line = re.sub(r'^[^,]*', lambda m: m.group(0).replace(' ', '+'), line)

Find and Replace first character after a certain pattern

Current text
Variable length text = some string(some more text
Change to
Variable length text = some string(addition some more text
Need to add a certain text after first parenthesis in a line only after "=" character is encountered. Another condition is to ignore patterns like "= (", which essentially means you should ignore patterns with only space between "=" and "("
My Try:
sed -e "s#*=(\w\()#\1(addition#g"
Thanks in anticipation!
Tweak this for your needs:
$ echo 'Variable length text = some string(some more text' |\
sed 's/^[^=]*=[^(]*[[:alnum:]][^(]*(/&addition /'
That matches for:
Beginning of the string
Anything but = any number of times
=
Anything but ( any number of times
An alpha-numeric character
Anything but ( any number of times
(
... and substitutes it with the matched string (&) followed by 'addition '.
The output is
Variable length text = some string(addition some more text
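For readers more comfortable outside sed, the same substitution can be sketched in Python. This assumes ASCII input, so [A-Za-z0-9] stands in for [[:alnum:]]; PATTERN and add_after_paren are hypothetical names.

```python
import re

# Same shape as the sed expression: from the start up to '=', then text
# containing at least one alphanumeric character before the first '('.
PATTERN = re.compile(r'^[^=]*=[^(]*[A-Za-z0-9][^(]*\(')

def add_after_paren(line, addition="addition "):
    """Insert addition right after the first '(' that follows '= text'."""
    # Appending to the whole match plays the role of & in sed.
    return PATTERN.sub(lambda m: m.group(0) + addition, line, count=1)

print(add_after_paren('Variable length text = some string(some more text'))
```

A line such as 'x = (y' is left unchanged, because there is no alphanumeric character between the = and the (.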
In Perl:
s/(.*?=[\s][^(]+?)\((.*)/$1(addition $2/

Regular Expression to split by comma + ignores comma within double quotes. VB.NET

I'm trying to parse csv file with VB.NET.
The csv file contains values like 0,"1,2,3",4, which splits into 5 fields instead of 3. There are many examples in other languages on Stack Overflow, but I can't implement them in VB.NET.
Here is my code so far but it doesn't work...
Dim t As String() = Regex.Split(str(i), ",(?=([^\""]*\""[^\""]*\"")*[^\""]*$)")
Assuming your csv is well-formed (i.e. no " besides those used to delimit string fields, or besides ones escaped like \"), you can split on a comma that's followed by an even number of non-escaped "-marks. (If you're inside a set of "", there's an odd number of quote marks left in the line.)
The regex you've tried looks almost there.
The following looks for a comma followed by an even number of any sort of quote marks:
,(?=([^"]*"[^"]*")*[^"]*$)
To modify it to look for an even number of non-escaped quote marks (assuming quote marks are escaped with a backslash, like \"), I replace each [^"] with ([^"\\]|\\.). This means "match a character that isn't a " and isn't a backslash, OR match a backslash and the character immediately following it".
,(?=(([^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)
See it in action here.
(The reason the backslash is doubled is I want to match a literal backslash).
Now to get it into vb.net you just need to double all your quote marks:
splitRegex = ",(?=(([^""\\]|\\.)*""([^""\\]|\\.)*"")*([^""\\]|\\.)*$)"
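As a sanity check of the escape-aware pattern, here is a small Python sketch. It assumes quotes inside fields are escaped with a backslash, as in the answer; the groups are written as non-capturing so that re.split returns only the fields, and ESCAPE_AWARE_COMMA and split_escaped are hypothetical names.

```python
import re

# One "unit" is either an ordinary character (not a quote, not a
# backslash) or a backslash together with the character after it.
UNIT = r'(?:[^"\\]|\\.)'
ESCAPE_AWARE_COMMA = re.compile(
    r',(?=(?:' + UNIT + r'*"' + UNIT + r'*")*' + UNIT + r'*$)'
)

def split_escaped(row):
    """Split at commas followed by an even number of unescaped quotes."""
    return ESCAPE_AWARE_COMMA.split(row)

row = r'a,"b \" , c",d'   # the middle field contains an escaped quote
parts = split_escaped(row)
print(parts)
```

The escaped quote is consumed together with its backslash, so it does not count toward the even/odd check.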
Instead of a regular expression, try using the TextFieldParser class for reading .csv files. It handles your situation exactly.
TextFieldParserClass
Especially look at the HasFieldsEnclosedInQuotes property.
Example:
Note: I used a string instead of a file, but the result would be the same.
Dim theString As String = "1,""2,3,4"",5"
Using rdr As New StringReader(theString)
Using parser As New TextFieldParser(rdr)
parser.TextFieldType = FieldType.Delimited
parser.Delimiters = New String() {","}
parser.HasFieldsEnclosedInQuotes = True
Dim fields() As String = parser.ReadFields()
For i As Integer = 0 To fields.Length - 1
Console.WriteLine("Field {0}: {1}", i, fields(i))
Next
End Using
End Using
Output:
Field 0: 1
Field 1: 2,3,4
Field 2: 5
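The same "use a real CSV parser instead of a regex" advice applies in other languages. As an illustration only, Python's standard csv module handles the quoted field in the same way; parse_line is a hypothetical helper name.

```python
import csv
import io

def parse_line(line):
    """Parse one CSV line, honoring double-quoted fields."""
    # csv.reader expects an iterable of lines, so wrap the string.
    return next(csv.reader(io.StringIO(line)))

fields = parse_line('1,"2,3,4",5')
for i, field in enumerate(fields):
    print("Field {0}: {1}".format(i, field))
```

As with TextFieldParser, the surrounding quotes are removed from the parsed field.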
This worked great for parsing a Shipping Notice .csv file we receive. Thanks for keeping this solution here.
This is my version of the code:
Try
Using rdr As New IO.StringReader(Row.FlatFile)
Using parser As New FileIO.TextFieldParser(rdr)
parser.TextFieldType = FileIO.FieldType.Delimited
parser.Delimiters = New String() {","}
parser.HasFieldsEnclosedInQuotes = True
Dim fields() As String = parser.ReadFields()
Row.Account = fields(0).ToString().Trim()
Row.AccountName = fields.GetValue(1).ToString().Trim()
Row.Status = fields.GetValue(2).ToString().Trim()
Row.PONumber = fields.GetValue(3).ToString().Trim()
Row.ErrorMessage = ""
End Using
End Using
Catch ex As Exception
Row.ErrorMessage = ex.Message
End Try
It's possible to do it with regex VB.NET in the following way:
,(?=(?:[^"]*"[^"]*")*[^"]*$)
The positive lookahead ((?= ... )) ensures that there is an even number of quotes ahead of the comma to split on (i.e. either they occur in pairs, or there are none).
[^"]* matches non-quote characters.
Given below is a VB.NET example to apply the regex.
Imports System
Imports System.Text.RegularExpressions
Public Class Test
Public Shared Sub Main()
Dim theString As String = "1,""2,3,4"",5"
Dim theStringArray As String() = Regex.Split(theString, ",(?=(?:[^""]*""[^""]*"")*[^""]*$)")
For i As Integer = 0 To theStringArray.Length - 1
Console.WriteLine("theStringArray {0}: {1}", i, theStringArray(i))
Next
End Sub
End Class
'Output:
'theStringArray 0: 1
'theStringArray 1: "2,3,4"
'theStringArray 2: 5