D: split string by comma, but not quoted string

D: split string by comma, but not quoted string - regex

I need to split string by comma, that not quoted like:
foo, bar, "hello, user", baz
to get:
foo
bar
hello, user
baz

Using std.csv:
import std.csv;
import std.stdio;
void main()
{
auto str = `foo,bar,"hello, user",baz`;
foreach (row; csvReader(str))
{
writeln(row);
}
}
Application output:
["foo", "bar", "hello, user", "baz"]
Note that I modified your CSV example data. As std.csv wouldn't correctly parse it, because of space () before first quote (").

You can use next snippet to complete this task:
File fileContent;
string fileFullName = `D:\code\test\example.csv`;
fileContent = File (fileFullName, "r");
auto r = regex(`(?!\B"[^"]*),(?![^"]*"\B)`);
foreach(line;fileContent.byLine)
{
auto result = split(line, r);
writeln(result);
}

If you are parsing a specific file format, splitting by line and using regex often isn't correct, though it will work in many cases. I prefer to read it in character by character and keep a few flags for state (or use someone else's function where appropriate that does it for you for this format). D has std.csv: http://dlang.org/phobos/std_csv.html or my old old csv.d which is minimal but basically works too: https://github.com/adamdruppe/arsd/blob/master/csv.d (haha 5 years ago was my last change to it, but hey, it still works)
Similarly, you can kinda sorta "parse" html with regex... sometimes, but it breaks pretty quickly outside of simple cases and you are better off using an actual html parser (which probably is written to read char by char!)
Back to quoted commas, reading csv, for example, has a few rules with quoted content: first, of course, commas can appear inside quotes without going to the next field. Second, newlines can also appear inside quotes without going to the next row! Third, two quote characters in a row is an escaped quote that is in the content, not a closing quote.
foo,bar
"this item has
two lines, a comma, and a "" mark!",this is just bar
I'm not sure how to read that with regex (eyeballing, I'm pretty sure yours gets the escaped quote wrong at least), but it isn't too hard to do when reading one character at a time (my little csv reader is about fifty lines, doing it by hand). Splitting the lines ahead of time also complicates compared to just reading the characters because you might then have to recombine lines later when you find one ends with a closing quote! And then your beautiful byLine loop suddenly isn't so beautiful.
Besides, when looking back later, I find simple character readers and named functions to be more understandable than a regex anyway.
So, your answer is correct for the limited scope you asked about, but might be missing the big picture of other cases in the file format you are actually trying to read.
edit: one last thing I want to pontificate on, these corner cases in CSV are an example of why people often say "don't reinvent the wheel". It isn't that they are really hard to handle - look at my csv.d code, it is short, pretty simple, and works at everything I've thrown at it - but that's the rub, isn't it? "Everything I've thrown at it". To handle a file format, you need to be aware of what the corner cases are so you can handle them, at least if you want it to be generic and take arbitrary user input. Knowing these edge cases tends to come more from real world experience than just taking a quick glance. Once you know them though, writing the code again isn't terribly hard, you know what to test for! But if you don't know it, you can write beautiful code with hundreds of unittests... but miss the real world case your user just happens to try that one time it matters.

Related

How to get the first digit on the left side of a string with python and regex?

I want to get a specific digit based on the right string.
This stretch of string is in body2.txt
string = "<li>3 <span class='text-info'>quartos</span></li><li>1 <span class='text-info'>suíte</span></li><li>96<span class='text-info'>Área Útil (m²)</span></li>"
with open("body2.txt", 'r') as f:
area = re.compile(r'</span></li><li>(\d+)<span class="text-info">Área Útil')
area = area.findall(f.read())
print(area)
output: []
expected output: 96

You have a quote mismatch. Note carefully the difference between 'text-info' and "text-info" in your example string and in your compiled regex. IIRC escaping quotes in raw strings is a bit of a pain in Python (if it's even possible?), but string concatenation sidesteps the issue handily.
area = re.compile(r'</span></li><li>(\d+)<span class='"'"'text-info'"'"'>Área Útil')
Focusing on the quotes, this is concatenating the strings '...class', "'", 'text-info', "'", and '>.... The rule there is that if you want a single quote ' in a single-quote raw string you instead write '"'"' and try to ignore Turing turning in his grave. I haven't tested the performance, but I think it might behave much like '...class' + "'" + 'text-info' + "'" + '>.... If that's the case, there is a bunch of copying happening behind the scenes, and that strategy has a quadratic runtime in the number of pieces being concatenated (assuming they're roughly the same size and otherwise generally nice for such an analysis). You'd be better off with nearly any other strategy (such as ''.join(...) or using triple quoted raw strings r'''...'''). It might not be a problem though. Benchmark your solution and see if it's good enough before messing with alternatives.
As one of the comments mentioned, you probably want to be parsing the HTML with something more powerful than regex. Regex cannot properly parse arbitrary HTML since it can't parse arbitrarily nested structures. There are plenty of libraries to make the job easier though and handle all of the bracket matching and string munging for you so that you can focus on a high-level description of exactly the data you want. I'm a fan of lxml. Without putting a ton of time into it, something like the following would be roughly equivalent to what you're doing.
from lxml import html
with open("body2.txt", 'r') as f:
tree = html.fromstring(f.read())
area = tree.xpath("//li[contains(span/text(), 'Área Útil')]/text()")
print(area)
The html.fromstring() method parses your data as html. The tree.xpath method uses xpath syntax to query that parsed tree. Roughly speaking it means the following:
// Arbitrarily far down in the tree
li A list node
[*] Satisfying whatever property is in the square brackets
contains(span/text(), 'Área Útil') The li node needs to have a span/text() node containing the text 'Área Útil'
/text() We want any text that is an immediate child of the root li we're describing.
I'm working on a pretty small amount of text here and don't know what your document structure is in the general case. You could add or change any of those properties to better describe the exact document you're parsing. When you inspect an element, any modern browser is able to generate a decent xpath expression to pick out exactly the element you're inspecting. Supposing this snippet came from a larger document I would imagine that functionality would be a time saver for you.

This will get the right digits no matter how / what form the target is in.
Capture group 1 contains the digits.
r"(\d*)\s*<span(?=\s)(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?\sclass\s*=\s*(?:(['\"])\s*text-info\s*\2))\s+(?=((?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]?)+>))\3\s*Área\s+Útil"
https://regex101.com/r/pMATkj/1

Specific Regex Failing on Neko and Native

So I'm working on some cleanup in haxeflixel, and I need to validate a csv map, so I'm using a regex to check if its ok (don't mention the ending commas, I know thats not valid csv but I want to allow it), and I think I have a decent regex for doing that, and it seems to work well on flash, but c++ crashes, and neko gives me this error: An error occured while running pcre_exec....
here is my regex, I'm sorry its long, but I have no idea where the problem is...
^(([ ]*-?[0-9]+[ ]*,?)+\r?\n?)+$
if anyone knows what might be going on I'd appreciate it,
Thanks,
Nico
ps. there are probably errors in my regex for checking csv, but I can figure those out, its kind of enjoyable, I'd rather just know what specifically could be causing this:)
edit: ah, I've just noticed this doesn't happen on all strings, once I narrow it down to what strings, I will post one... as for what I'm checking for, its basically just to make sure theres no weird xml header, or any non integer value in the map file, basically it should validate this:
1,1,1,1
1,1,1,1
1,1,1,1
or this:
1,1,1,1,
1,1,1,1,
1,1,1,1,
but not:
xml blahh blahh>
1,m,1,1
1,1,b,1
1,1,1,1
xml>
(and yes I know thats not valid xml;))
edit: it gets stranger:
so I'm trying to determine what strings crash it, and while this still wouldnt explain a normal map crashing, its definatly weird, and has the same result:
what happens is:
this will fail a .match() test, but not crash:
a
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
while this will crash the program:
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,*a*,1,1,1,1,1,1,1,1,1,1,1,1,1

To be honest, you wrote one of the worst regexps I ever seen. It actually looks like it was written specifically to be as slow as possible. I write it not to offend you, but to express how much you need to learn to write regexps(hint: writing your own regexp engine is a good exercise).
Going to your problem, I guess it just runs out of memory(it is extremely memory intensive). I am not sure why it happens only on pcre targets(both neko and cpp targets use pcre), but I guess it is about memory limits per regexp run in pcre or some heuristics in other targets to correct such miswritten regexps.
I'd suggest something along the lines of
~/^(( *-?[0-9]+ *,)* *-?[0-9]+ *,?\r?\n)*(( *-?[0-9]+ *,)* *-?[0-9]+ *,?\r?\n?)$/
There, "~/" and last "/" are haxe regexp markers.
I wasnt extensively testing it, just a run on your samples, but it should do the job(probably with a bit of tweaking).
Also, just as a hint, I'd suggest you to split file into lines first before running any regexps, it will lower memmory usage(or you will need to hold only a part of your text in memory) and simplify your regexp.
I'd also note that since you will need to parse csv anyhow(for any properly formed input, which are prevailing in your data I guess), it might be much faster to do all the tests while actually parsing.
Edit: the answer to question "why it eats so much memory"
Well, it is not a short topic, and that's why I proposed to you to write your own regexp engine. There are some differences in implementations, but generally imagine regexp engine works like that:
parses your regular expression and builds a graph of all possible states(state is basically a symbol value and a number of links to other symbols which can follow it).
sets up a list of read pointer and state pointer pairs, current state list, consisting of regexp initial state and a pointer to matched string first letter
sets up read pointer to the first symbol of symbol string
sets up state poiter to initial state of regexp
takes up one pair from current state list and stores it as current state and current read pointer
reads symbol under current read pointer
matches it with symbols in states which current state have links to, and makes a list of states that matched.
if there is a final regexp state in this list, goes to 12
for each item in this list adds a pair of next read pointer(which is current+1) and item to the current state list
if the current state list is empty, returns false, as string didn't match the regexp
goes to 6
here it is, in a final state of matched regexp, returns true, string matches regexp.
Of course, there are some differences between regexp engines, and some of them eliminate some problems afaik. And of course they also have pseudosymbols, groupings, they need to store the positions regexp and groups matched, they have lookahead and lookbehind and also grouping references which makes it a bit(quite a humble measure) more complex and forces to use a bit more complex data structures, but the main idea is the same. So, here we are and your problem is clearly seen from algorithm. The less specific you are about what you want to match and the more there chances for engine to match the same substring as different paths in state graph, the more memory and processor time it will consume, exponentionally.
Try to model how regexp engine matches regexp (a+a+)+b on strings aaaaaab, ab, aa, aaaaaaaaaa, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (Don't try the last one, it would take hours or days to compute on a modern PC.)
Also, it worth to note that some regexp engines do things in a bit different way so they can handle this situations properly, but there always are ways to make regexp extremely slow.
And another thing to note is that I may hav ebeen wrong about the exact memory problem. This case it may be processor too, and before that it may be engine limits on memory/processor kicking in, not exactly system starving of memory.

Creating a simple parser in (V)C++ (2010) similar to PEG

For an school project, I need to parse a text/source file containing a simplified "fake" programming language to build an AST. I've looked at boost::spirit, however since this is a group project and most seems reluctant to learn extra libraries, plus the lecturer/TA recommended leaning to create a simple one on C++. I thought of going that route. Is there some examples out there or ideas on how to start? I have a few attempts but not really successful yet ...
parsing line by line
Test each line with a bunch of regex (1 for procedure/function declaration), one for assignment, one for while etc...
But I will need to assume there are no multiple statements in one line: eg. a=b;x=1;
When I reach a container statement, procedures, whiles etc, I will increase the indent. So all nested statements will go under this
When I reach a } I will decrement indent
Any better ideas or suggestions? Example code I need to parse (very simplified here ...)
procedure Hello {
a = 1;
while a {
b = a + 1 + z;
}
}
Another idea was to read whole file into a string, and go top down. Match all procedures, then capture everything in { ... } then start matching statements (end with ;) or containers while { ... }. This is similar to how PEG does things? But I will need to read entire file

Multipass makes things easier. On a first pass, split things into tokens, like "=", or "abababa", or a quote-delimited string, or a block of whitespace. Don't be destructive (keep the original data), but break things down to simple chunks, and maybe have a little struct or enum that describes what the token is (ie, whitespace, a string literal, an identifier type thing, etc).
So your sample code gets turned into:
identifier(procedure) whitespace( ) identifier(Hello) whitespace( ) operation({) whitespace(\n\t) identifier(a) whitespace( ) operation(=) whitespace( ) number(1) operation(;) whitespace(\n\t) etc.
In those tokens, you might also want to store line number and offset on the line (this will help with error message generation later).
A quick test would be to turn this back into the original text. Another quick test might be to dump out pretty-printed version in html or something (where you color whitespace to have a pink background, identifiers as light blue, operations as light green, numbers as light orange), and see if your tokenizer is making sense.
Now, your language may be whitespace insensitive. So discard the whitespace if that is the case! (C++ isn't, because you need newlines to learn when // comments end)
(Note: a professional language parser will be as close to one-pass as possible, because it is faster. But you are a student, and your goal should be to get it to work.)
So now you have a stream of such tokens. There are a bunch of approaches at this point. You could pull out some serious parsing chops and build a CFG to parse them. (Do you know what a CFG is? LR(1)? LL(1)?)
An easier method might be to do it a bit more ad-hoc. Look for operator({) and find the matching operator(}) by counting up and down. Look for language keywords (like procedure), which then expects a name (the next token), then a block (a {). An ad-hoc parser for a really simple language may work fine.
I've done exactly this for a ridiculously simple language, where the parser consisted of a really simple PDA. It might work for you guys. Or it might not.

Since you mentioned PEG i'll like to throw in my open source project : https://github.com/leblancmeneses/NPEG/tree/master/Languages/npeg_c++
Here is a visual tool that can export C++ version: http://www.robusthaven.com/blog/parsing-expression-grammar/npeg-language-workbench
Documentation for rule grammar: http://www.robusthaven.com/blog/parsing-expression-grammar/npeg-dsl-documentation
If i was writing my own language I would probably look at the terminals/non-terminals found in System.Linq.Expressions as these would be a great start for your grammar rules.
http://msdn.microsoft.com/en-us/library/system.linq.expressions.aspx
System.Linq.Expressions.Expression
System.Linq.Expressions.BinaryExpression
System.Linq.Expressions.BlockExpression
System.Linq.Expressions.ConditionalExpression
System.Linq.Expressions.ConstantExpression
System.Linq.Expressions.DebugInfoExpression
System.Linq.Expressions.DefaultExpression
System.Linq.Expressions.DynamicExpression
System.Linq.Expressions.GotoExpression
System.Linq.Expressions.IndexExpression
System.Linq.Expressions.InvocationExpression
System.Linq.Expressions.LabelExpression
System.Linq.Expressions.LambdaExpression
System.Linq.Expressions.ListInitExpression
System.Linq.Expressions.LoopExpression
System.Linq.Expressions.MemberExpression
System.Linq.Expressions.MemberInitExpression
System.Linq.Expressions.MethodCallExpression
System.Linq.Expressions.NewArrayExpression
System.Linq.Expressions.NewExpression
System.Linq.Expressions.ParameterExpression
System.Linq.Expressions.RuntimeVariablesExpression
System.Linq.Expressions.SwitchExpression
System.Linq.Expressions.TryExpression
System.Linq.Expressions.TypeBinaryExpression
System.Linq.Expressions.UnaryExpression

RegEx for VB.net

I have a txt file with content
$NETS
P3V3_AUX_LGATE; PQ6.8 PU37.2
U335_PIN1; R3328.1 U335.1
$END
need to be updated in this format, and save back to another txt file
$NETS
'P3V3_AUX_LGATE'; PQ6.8 PU37.2
'U335_PIN1'; R3328.1 U335.1
$END
NOTE: number of lines may go up to 10,000 lines
My current solution is to read the txt file line by line, detect the presence of the ";" and newline character and do the changes.
Right now i have a variable that holds ALL the lines, is there other way something like Replace via RegEx to do the changes without looping thru each line, this way i can readily print the result
and follow up question, which one is more efficient?

Try
ResultString = Regex.Replace(SubjectString, "^([^;\r\n]+);", "'$1';", RegexOptions.Multiline)
on your multiline string.
This will find any string (length one or more) at the start of a line up until the first semicolon if there is one and replace it with its quoted equivalent.
It should be more efficient than looping through the string line by line as you're doing now, but if you're in doubt, you'd have to profile it.

You could probably find all the matches using something like \w+; but I don't know how you'd be able to do a replace on that using Regex.Replace to add the 's but keep the original match.
However, if you already have it as one variable, you don't have to read the file again, either you could make your code find all ;s and then find the previous newline for each, or you could use a String.Split on newlines to split the variable you've already got into lines.
And if you want to get it back to one variable you can just use String.Join.
Personally I'd normally use the String.Split (and possibly the String.Join if needed) method, since I think that would make the code easy to read.

I would say Yes! this can be done with Regular expressions. Make sure you got the "multiline" option turned on and craft your regular expression using some capture groups to ease the work.
I can however say this will NOT be the optimal one. Since you mention the amount of lines you could be processing, it seems 'resource wise' smarter to use a streaming approach instead of the in memory approach.
Taking the Regex approach (and this took 15 mins so please don't think this is an optimal solution, just prove it would work)
private static Regex matcher = new Regex(#"^\$NETS\r\n(?<entrytitle>.[^;]*);\s*(?<entryrest>.*)\r\n(?<entrytitle2>.[^;]*);\s*(?<entryrest2>.*)\r\n\$END\r\n", RegexOptions.Compiled | RegexOptions.Multiline);
static void Main(string[] args)
{
string newString = matcher.Replace(ExampleFileContent, new MatchEvaluator(evaluator));
}
static string evaluator(Match m)
{
return String.Format("$NETS\r\n'{0}'; {1}\r\n'{2}'; {3}\r\n$END\r\n",
m.Groups["entrytitle"].Value,
m.Groups["entryrest"].Value,
m.Groups["entrytitle2"].Value,
m.Groups["entryrest2"].Value);
}
Hope this helps,

Use cases for regular expression find/replace

I recently discussed editors with a co-worker. He uses one of the less popular editors and I use another (I won't say which ones since it's not relevant and I want to avoid an editor flame war). I was saying that I didn't like his editor as much because it doesn't let you do find/replace with regular expressions.
He said he's never wanted to do that, which was surprising since it's something I find myself doing all the time. However, off the top of my head I wasn't able to come up with more than one or two examples. Can anyone here offer some examples of times when they've found regex find/replace useful in their editor? Here's what I've been able to come up with since then as examples of things that I've actually had to do:
Strip the beginning of a line off of every line in a file that looks like:
Line 25634 :
Line 632157 :
Taking a few dozen files with a standard header which is slightly different for each file and stripping the first 19 lines from all of them all at once.
Piping the result of a MySQL select statement into a text file, then removing all of the formatting junk and reformatting it as a Python dictionary for use in a simple script.
In a CSV file with no escaped commas, replace the first character of the 8th column of each row with a capital A.
Given a bunch of GDB stack traces with lines like
#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) at ../../mvl/src/mvl_serv.c:850
strip out everything from each line except the function names.
Does anyone else have any real-life examples? The next time this comes up, I'd like to be more prepared to list good examples of why this feature is useful.

Just last week, I used regex find/replace to convert a CSV file to an XML file.
Simple enough to do really, just chop up each field (luckily it didn't have any escaped commas) and push it back out with the appropriate tags in place of the commas.

Regex make it easy to replace whole words using word boundaries.
(\b\w+\b)
So you can replace unwanted words in your file without disturbing words like Scunthorpe

Yesterday I took a create table statement I made for an Oracle table and converted the fields to setString() method calls using JDBC and PreparedStatements. The table's field names were mapped to my class properties, so regex search and replace was the perfect fit.
Create Table text:
...
field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,
....
My Regex Search:
/([a-z_])+ .*?,?/
My Replacement:
pstmt.setString(1, \1);
The result:
...
pstmt.setString(1, field_1);
pstmt.setString(1, field_2);
pstmt.setString(1, field_3);
pstmt.setString(1, field_4);
....
I then went through and manually set the position int for each call and changed the method to setInt() (and others) where necessary, but that worked handy for me. I actually used it three or four times for similar field to method call conversions.

I like to use regexps to reformat lists of items like this:
int item1
double item2
to
public void item1(int item1){
}
public void item2(double item2){
}
This can be a big time saver.

I use it all the time when someone sends me a list of patient visit numbers in a column (say 100-200) and I need them in a '0000000444','000000004445' format. works wonders for me!
I also use it to pull out email addresses in an email. I send out group emails often and all the bounced returns come back in one email. So, I regex to pull them all out and then drop them into a string var to remove from the database.
I even wrote a little dialog prog to apply regex to my clipboard. It grabs the contents applies the regex and then loads it back into the clipboard.

One thing I use it for in web development all the time is stripping some text of its HTML tags. This might need to be done to sanitize user input for security, or for displaying a preview of a news article. For example, if you have an article with lots of HTML tags for formatting, you can't just do LEFT(article_text,100) + '...' (plus a "read more" link) and render that on a page at the risk of breaking the page by splitting apart an HTML tag.
Also, I've had to strip img tags in database records that link to images that no longer exist. And let's not forget web form validation. If you want to make a user has entered a correct email address (syntactically speaking) into a web form this is about the only way of checking it thoroughly.

I've just pasted a long character sequence into a string literal, and now I want to break it up into a concatenation of shorter string literals so it doesn't wrap. I also want it to be readable, so I want to break only after spaces. I select the whole string (minus the quotation marks) and do an in-selection-only replace-all with this regex:
/.{20,60} /
...and this replacement:
/$0"¶ + "/
...where the pilcrow is an actual newline, and the number of spaces varies from one incident to the next. Result:
String s = "I recently discussed editors with a co-worker. He uses one "
+ "of the less popular editors and I use another (I won't say "
+ "which ones since it's not relevant and I want to avoid an "
+ "editor flame war). I was saying that I didn't like his "
+ "editor as much because it doesn't let you do find/replace "
+ "with regular expressions.";

The first thing I do with any editor is try to figure out it's Regex oddities. I use it all the time. Nothing really crazy, but it's handy when you've got to copy/paste stuff between different types of text - SQL <-> PHP is the one I do most often - and you don't want to fart around making the same change 500 times.

Regex is very handy any time I am trying to replace a value that spans multiple lines. Or when I want to replace a value with something that contains a line break.
I also like that you can match things in a regular expression and not replace the full match using the $# syntax to output the portion of the match you want to maintain.

I agree with you on points 3, 4, and 5 but not necessarily points 1 and 2.
In some cases 1 and 2 are easier to achieve using a anonymous keyboard macro.
By this I mean doing the following:
Position the cursor on the first line
Start a keyboard macro recording
Modify the first line
Position the cursor on the next line
Stop record.
Now all that is needed to modify the next line is to repeat the macro.
I could live with out support for regex but could not live without anonymous keyboard macros.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js