I need to split a string on commas that are not inside quotes, for example:
foo, bar, "hello, user", baz
to get:
foo
bar
hello, user
baz
Using std.csv:
import std.csv;
import std.stdio;

void main()
{
    auto str = `foo,bar,"hello, user",baz`;
    foreach (row; csvReader(str))
    {
        writeln(row);
    }
}
Application output:
["foo", "bar", "hello, user", "baz"]
Note that I modified your CSV example data: std.csv wouldn't parse it correctly because of the space (" ") before the first quote (").
You can use the following snippet to complete this task:
import std.regex;
import std.stdio;

void main()
{
    string fileFullName = `D:\code\test\example.csv`;
    auto fileContent = File(fileFullName, "r");
    // Split on commas that are not inside a double-quoted field.
    auto r = regex(`(?!\B"[^"]*),(?![^"]*"\B)`);
    foreach (line; fileContent.byLine)
    {
        auto result = split(line, r);
        writeln(result);
    }
}
If you are parsing a specific file format, splitting by line and using regex often isn't correct, though it will work in many cases. I prefer to read it in character by character and keep a few flags for state (or use someone else's function where appropriate that does it for you for this format). D has std.csv: http://dlang.org/phobos/std_csv.html or my old old csv.d which is minimal but basically works too: https://github.com/adamdruppe/arsd/blob/master/csv.d (haha 5 years ago was my last change to it, but hey, it still works)
Similarly, you can kinda sorta "parse" html with regex... sometimes, but it breaks pretty quickly outside of simple cases and you are better off using an actual html parser (which probably is written to read char by char!)
Back to quoted commas, reading csv, for example, has a few rules with quoted content: first, of course, commas can appear inside quotes without going to the next field. Second, newlines can also appear inside quotes without going to the next row! Third, two quote characters in a row is an escaped quote that is in the content, not a closing quote.
foo,bar
"this item has
two lines, a comma, and a "" mark!",this is just bar
I'm not sure how to read that with regex (eyeballing, I'm pretty sure yours gets the escaped quote wrong at least), but it isn't too hard to do when reading one character at a time (my little csv reader is about fifty lines, doing it by hand). Splitting the lines ahead of time also complicates things compared to just reading the characters, because you might then have to recombine lines later when you find that a quoted field runs across a line break! And then your beautiful byLine loop suddenly isn't so beautiful.
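To make the three quoting rules concrete, here is a minimal character-by-character reader sketched in Python. It is not csv.d or std.csv; the function name `parse_csv` and its handling of details such as `\r` are assumptions made for illustration only.

```python
def parse_csv(text):
    """Parse CSV text one character at a time, keeping a single state flag."""
    rows, row, field = [], [], []
    in_quotes = False
    i = 0
    while i < len(text):
        c = text[i]
        if in_quotes:
            if c == '"':
                if i + 1 < len(text) and text[i + 1] == '"':
                    field.append('"')  # rule 3: doubled quote is an escaped quote
                    i += 1
                else:
                    in_quotes = False  # a lone quote closes the quoted section
            else:
                field.append(c)        # rules 1 and 2: commas and newlines are content
        elif c == '"':
            in_quotes = True
        elif c == ',':
            row.append(''.join(field))
            field = []
        elif c == '\n':
            row.append(''.join(field))
            rows.append(row)
            row, field = [], []
        elif c != '\r':                # assumption: drop carriage returns outside quotes
            field.append(c)
        i += 1
    if field or row:                   # last row without a trailing newline
        row.append(''.join(field))
        rows.append(row)
    return rows
```

Fed the two-row example above, this keeps the embedded newline, the comma, and a single quote mark inside the first field of the second row, which is exactly where a line-by-line regex split goes wrong.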
Besides, when looking back later, I find simple character readers and named functions to be more understandable than a regex anyway.
So, your answer is correct for the limited scope you asked about, but might be missing the big picture of other cases in the file format you are actually trying to read.
edit: one last thing I want to pontificate on, these corner cases in CSV are an example of why people often say "don't reinvent the wheel". It isn't that they are really hard to handle - look at my csv.d code, it is short, pretty simple, and works at everything I've thrown at it - but that's the rub, isn't it? "Everything I've thrown at it". To handle a file format, you need to be aware of what the corner cases are so you can handle them, at least if you want it to be generic and take arbitrary user input. Knowing these edge cases tends to come more from real world experience than just taking a quick glance. Once you know them though, writing the code again isn't terribly hard, you know what to test for! But if you don't know it, you can write beautiful code with hundreds of unittests... but miss the real world case your user just happens to try that one time it matters.
So I'm working on some cleanup in haxeflixel, and I need to validate a csv map, so I'm using a regex to check if it's OK (don't mention the ending commas, I know that's not valid csv but I want to allow it). I think I have a decent regex for doing that, and it seems to work well on flash, but c++ crashes, and neko gives me this error: An error occured while running pcre_exec....
Here is my regex. I'm sorry it's long, but I have no idea where the problem is...
^(([ ]*-?[0-9]+[ ]*,?)+\r?\n?)+$
if anyone knows what might be going on I'd appreciate it,
Thanks,
Nico
PS: there are probably errors in my regex for checking csv, but I can figure those out, it's kind of enjoyable. I'd rather just know what specifically could be causing this :)
edit: ah, I've just noticed this doesn't happen on all strings. Once I narrow it down to which strings, I will post one... As for what I'm checking for, it's basically just to make sure there's no weird xml header or any non-integer value in the map file. Basically it should validate this:
1,1,1,1
1,1,1,1
1,1,1,1
or this:
1,1,1,1,
1,1,1,1,
1,1,1,1,
but not:
<xml blahh blahh>
1,m,1,1
1,1,b,1
1,1,1,1
</xml>
(and yes I know that's not valid xml ;))
edit: it gets stranger:
So I'm trying to determine which strings crash it, and while this still wouldn't explain a normal map crashing, it's definitely weird, and has the same result:
what happens is:
this will fail a .match() test, but not crash:
a
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
while this will crash the program:
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,*a*,1,1,1,1,1,1,1,1,1,1,1,1,1
To be honest, this is one of the worst regexps I have ever seen. It actually looks like it was written specifically to be as slow as possible. I say this not to offend you, but to express how much there is to learn about writing regexps (hint: writing your own regexp engine is a good exercise).
As for your problem, I guess it just runs out of memory (it is extremely memory intensive). I am not sure why it happens only on the PCRE targets (both the neko and cpp targets use PCRE), but I guess it comes down to per-run memory limits in PCRE, or to heuristics in other targets that correct such miswritten regexps.
I'd suggest something along the lines of
~/^(( *-?[0-9]+ *,)* *-?[0-9]+ *,?\r?\n)*(( *-?[0-9]+ *,)* *-?[0-9]+ *,?\r?\n?)$/
Here, the leading "~/" and the trailing "/" are Haxe regexp delimiters.
I haven't tested it extensively, just a run on your samples, but it should do the job (probably with a bit of tweaking).
Also, just as a hint, I'd suggest splitting the file into lines before running any regexps. It will lower memory usage (you will only need to hold part of your text in memory) and simplify your regexp.
I'd also note that since you will need to parse the csv anyhow (for any properly formed input, which I guess is most of your data), it might be much faster to do all the tests while actually parsing.
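As a rough illustration of validating while parsing, here is a minimal Python sketch (not Haxe). The name `parse_int_map` and the choice to raise `ValueError` are assumptions for this example. The per-cell pattern is deliberately tiny and has no nested repetition, so it cannot blow up the way the original regexp does.

```python
import re

# One cell: optional spaces, optional minus sign, digits, optional spaces.
CELL = re.compile(r" *-?[0-9]+ *")

def parse_int_map(text):
    """Return the map as a list of integer rows, or raise ValueError."""
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        cells = line.split(",")
        # Tolerate a single trailing comma, as the question asks for.
        if len(cells) > 1 and cells[-1].strip() == "":
            cells.pop()
        row = []
        for cell in cells:
            if CELL.fullmatch(cell) is None:
                raise ValueError(f"line {lineno}: not an integer: {cell!r}")
            row.append(int(cell))
        rows.append(row)
    return rows
```

Both valid samples from the question parse to rows of integers, while the xml header and the rows containing letters are rejected with the offending line number.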
Edit: the answer to question "why it eats so much memory"
Well, it is not a short topic, which is why I suggested writing your own regexp engine. Implementations differ, but in general imagine that a regexp engine works like this:
1. It parses your regular expression and builds a graph of all possible states (a state is basically a symbol value plus a number of links to other symbols which can follow it).
2. It sets up a list of (read pointer, state pointer) pairs, the current state list, initially consisting of the regexp's initial state and a pointer to the first letter of the string being matched.
3. It sets the read pointer to the first symbol of the string.
4. It sets the state pointer to the initial state of the regexp.
5. It takes one pair from the current state list and stores it as the current state and current read pointer.
6. It reads the symbol under the current read pointer.
7. It matches that symbol against the symbols in the states which the current state has links to, and makes a list of the states that matched.
8. If there is a final regexp state in this list, it goes to step 12.
9. For each item in this list, it adds a pair of the next read pointer (which is current+1) and the item to the current state list.
10. If the current state list is empty, it returns false, as the string didn't match the regexp.
11. It goes to step 5.
12. Here it is in a final state of the matched regexp, so it returns true: the string matches the regexp.
Of course, there are some differences between regexp engines, and some of them eliminate some of these problems afaik. And of course they also have pseudosymbols and groupings, they need to store the positions that the regexp and its groups matched, and they have lookahead, lookbehind and group references, which makes things a bit (quite a humble measure) more complex and forces them to use somewhat more complex data structures. But the main idea is the same. So, here we are, and your problem can be clearly seen from the algorithm. The less specific you are about what you want to match, and the more chances the engine has to match the same substring along different paths in the state graph, the more memory and processor time it will consume, exponentially.
Try to model how a regexp engine matches the regexp (a+a+)+b against the strings aaaaaab, ab, aa, aaaaaaaaaa, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (Don't try the last one, it would take hours or days to compute on a modern PC.)
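To get a feel for the growth without running a real engine, one can simply count how many different ways the (a+a+)+ part can carve up a run of n letters "a". The Python sketch below does only that counting. The function name `count_parses` is made up for this illustration, and a real engine's exact number of steps will differ, but each of these ways is a distinct path that a backtracking engine may have to try before it can report a failed match.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_parses(n):
    """Count the ways to cut n letters 'a' into consecutive a+a+ groups.

    Counts zero or more groups so the recursion has a base case;
    for n >= 2 this equals the number of ways (a+a+)+ can match a^n.
    """
    if n == 0:
        return 1
    total = 0
    # The next group takes group_len letters (at least two, one per a+).
    for group_len in range(2, n + 1):
        splits = group_len - 1  # places to cut the group between its two a+ parts
        total += splits * count_parses(n - group_len)
    return total

for n in (2, 6, 10, 30):
    print(n, count_parses(n))
```

The count doubles with every extra letter: 16 ways for six letters, 256 for ten, and 268435456 for thirty.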
Also, it is worth noting that some regexp engines do things a bit differently so that they can handle these situations properly, but there are always ways to make a regexp extremely slow.
Another thing to note is that I may have been wrong about this being exactly a memory problem. In this case it may be the processor too, and before that it may be engine limits on memory/processor kicking in, not the system actually starving of memory.
I'm developing a small Python-like language using flex and byacc (for lexing and parsing) and C++, but I have a few questions regarding scope control.
Just like Python, it uses whitespace (spaces or tabs) for indentation. On top of that, I want to implement indexed breaking: for instance, if you type "break 2" inside a while loop that's inside another while loop, it would break not only out of the inner loop but out of the outer loop as well (hence the number 2 after break), and so on.
example:
while 1
    while 1
        break 2
        'hello world'!! #will never reach this. "!!" outputs with a newline
    end
    'hello world again'!! #also will never reach this. again "!!" used for cout
end
#after break 2 it would jump right here
But since I don't have an "anti-tab" character to mark where a scope ends (in C, for example, I would just use the '}' char), I was wondering if this method would be the best:
I would define a global variable, like "int tabIndex", in my yacc file and access it from my lex file using extern. Every time I find a tab character in my lex file, I would increment that variable by 1. When parsing in my yacc file, if I find a "break" keyword, I would decrement tabIndex by the amount typed after it, and if I reach EOF with tabIndex != 0, I would output a compilation error.
Now the problem is: what's the best way to see that the indentation got reduced? Should I read \b (backspace) chars from lex and then reduce the tabIndex variable (when the user doesn't use break)?
Is there another method to achieve this?
Also, just another small question: I want every executable to have its starting point in a function called start(). Should I hardcode this into my yacc file?
Sorry for the long question; any help is greatly appreciated. Also, if someone can provide a yacc file for Python, that would be nice as a guideline (I tried looking on Google and had no luck).
Thanks in advance.
I am currently implementing a programming language rather similar to this (including the multilevel break, oddly enough). My solution was to have the tokenizer emit indent and dedent tokens based on indentation. E.g.:
while 1: # colons help :)
    print('foo')
    break 1
becomes:
["while", "1", ":",
indent,
"print", "(", "'foo'", ")",
"break", "1",
dedent]
It makes the tokenizer's handling of '\n' somewhat complicated though. Also, I wrote the tokenizer and parser from scratch, so I'm not sure whether this is feasible in lex and yacc.
Edit:
Semi-working pseudocode example:
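level = 0      # current indentation width
levels = []    # stack of enclosing indentation widths
for c = getc():
    if c=='\n':
        emit('\n')
        n = 0
        while (c=getc())==' ':
            n += 1
        if n > level:
            emit(indent)
            push(levels,level)   # remember the level we came from
            level = n
        while n < level:
            emit(dedent)
            level = pop(levels)
            if level < n:
                error tokenize   # dedent does not match any outer level
        # fall through: c now holds the first non-space character
    emit(c) #lazy example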
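For comparison, here is a small runnable Python sketch of the same indent/dedent idea. It is not the tokenizer described above. It works line by line rather than character by character, splits tokens on whitespace only, and the token names INDENT, DEDENT and NEWLINE are placeholders chosen for this example.

```python
def tokenize(source):
    """Return a flat token list with INDENT/DEDENT markers for indentation changes."""
    tokens = []
    levels = [0]  # stack of indentation widths currently open
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comment-only lines do not affect nesting
        width = len(line) - len(stripped)
        if width > levels[-1]:
            levels.append(width)
            tokens.append("INDENT")
        while width < levels[-1]:
            levels.pop()
            tokens.append("DEDENT")
        if width != levels[-1]:
            raise ValueError("dedent does not match any outer indentation level")
        tokens.extend(stripped.split())  # crude word split, enough for the sketch
        tokens.append("NEWLINE")
    while len(levels) > 1:  # close any blocks still open at end of input
        levels.pop()
        tokens.append("DEDENT")
    return tokens
```

With tokens like these, a yacc-style grammar can treat INDENT and DEDENT just as it would treat '{' and '}', so the parser itself never needs to look at whitespace.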
Very interesting exercise. Can't you use the end keyword to check when the scope ends?
On a different note, I have never seen a language that allows you to break out of several nested loops at once. There may be a good reason for that...
I am using vim 7.x
I am using alternate file.
I have a mapping of *.hpp <--> *.cpp
Suppose I'm in
class Foo {
void some_me#mber_func(); // # = my cursor
}
in Foo.hpp
is there a way to tell vim to do the following:
1. Grab the word under # (easy: expand("<cword>"))
2. Look up the class I'm inside of ("Foo") <-- I have no idea how to do this
3. Append 1 & 2 (easy: using ".") --> "Foo::some_member_func"
4. Switch files (easy: :A)
5. Do a / search for the result of step 3
So basically, I can script all of this together, except the "find the name of the enclosing class I'm in" part (especially if classes are nested).
I know about ctags. I know about cscope. I'm choosing to not use them -- I prefer solutions where I understand where they break.
This is relatively easy to do crudely and very difficult to do well. C and C++ are rather complex languages to parse reliably. At the risk of being downvoted, I'd personally recommend parsing the tags file generated by ctags, but if you really want to do it in Vim, there are a few options for the "crude" method.
1. Make some assumptions. The assumptions you make depend on how complicated you want it to be. At the simplest level: assume you're in a class definition and there are no other nearby braces. Based on your coding style, assume that the opening brace of the class definition is on the same line as "class".
let classlineRE = '^class\s\+\(\k\+\)\s\+{.*'
let match = search(classlineRE, 'bnW')
if match != 0
    let classline = getline(match)
    let classname = substitute(classline, classlineRE, '\1', '')
    " Now do something with classname
endif
The assumptions model can obviously be extended/generalised as much as you see fit. You can just search back for the brace and then search back for class and take what's in between (to handle braces on a separate line to "class"). You can filter out comments. If you want to be really clever, you can start looking at what level of braces you're in and make sure it's a top level one (go to the start of the file, add 1 every time you see '{' and subtract one every time you see '}' etc). Your vim script will get very very very complicated.
2. Another one risking the downvote: you could use one of the various C parsers written in Python and use the vim-python interface to make it act like a vim script. To be honest, if you're thinking of doing this, I'd stick with ctags/cscope.
3. Use rainbow.vim. This does highlighting based on nesting depth, so you could be a little clever and search back (using search('{', 'bW') or similar) for opening braces, then interrogate the syntax highlighting of those braces (using synIDattr(synID(line("."), col("."),1), "name")) and if it's hlLevel0, you know it's a top-level brace. You can then search back for class and parse as per item 1.
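The brace-counting idea described above (add one for every '{', subtract one for every '}') could be sketched in Python along these lines, for instance as something to call through the vim-python interface. This is a crude illustration only: the name `enclosing_classes` and the `CLASS_RE` pattern are made up for the example, and it ignores comments, string literals, templates and preprocessor tricks, all of which can fool it.

```python
import re

# Crude class header detector; will also fire on e.g. "template <class T>".
CLASS_RE = re.compile(r"\b(?:class|struct)\s+(\w+)")

def enclosing_classes(source, cursor_offset):
    """Return the names of the classes enclosing cursor_offset, outermost first."""
    head = source[:cursor_offset]
    stack = []           # one entry per open brace: a class name or None
    statement_start = 0  # where the text leading up to the next brace begins
    for i, ch in enumerate(head):
        if ch == "{":
            match = CLASS_RE.search(head[statement_start:i])
            stack.append(match.group(1) if match else None)
            statement_start = i + 1
        elif ch == "}":
            if stack:
                stack.pop()
            statement_start = i + 1
        elif ch == ";":
            statement_start = i + 1
    return [name for name in stack if name is not None]
```

Joining the result with "::" and appending the word under the cursor gives a search string such as Foo::some_member_func, including the nested case where more than one class name is returned.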
I hope that all of the above gives you some food for thought...