Extracting values from comma separated lists - regex

When given a list of comma separated values like 3, asdf, *#, 1212.3, I would like to extract each of these values, not including the comma, so I would have a value list like [3, "asdf", "*#", 1212.3] (not as a textual representation like that, but as an array of 'hits'). How would I do this?

I see you're using the D programming language. Here is a link to a CSV parser for D.

First off, if you are dealing with CSV files, don't use regex or your own parser. Basically, when you think things are simple, they really aren't: Stop Rolling Your Own CSV Parser.
Next up, you say that you would like to have an array ([3, "asdf", "*#", 1212.3]). This looks to be mixing types, which cannot be done in a statically typed language, and ultimately it is very inefficient even using std.variant. For each parsed value you'd have code like:
import std.conv, std.variant;

try {
    auto data = to!double(parsedValue); // try to read the value as a number
    auto data2 = to!int(data);
    if (data == data2)
        returnThis = Variant(data2); // whole number: store as int
    else
        returnThis = Variant(data);  // fractional: store as double
} catch (ConvException ce) {
    returnThis = Variant(parsedValue); // not numeric: keep the string
}
Now if your data is truly separated by some defined set of characters, and isn't broken into records with new lines, then you can use split(", ") from std.array (or the lazy splitter from std.algorithm). Otherwise use a CSV parser. If you don't want to follow the standard, wrap the parser so the data is what you desire. In your example you have spaces, which are not to be ignored by the CSV format, so call strip() on the output.
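For a quick illustration, the split-then-strip approach looks like this in Python (the same idea carries over to D's split and strip):

values = [piece.strip() for piece in "3, asdf, *#, 1212.3".split(",")]
print(values)  # ['3', 'asdf', '*#', '1212.3']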
In the article I linked, it mentions that what commonly happens is that people write a parser in its simplest form and don't handle the more complicated cases. So when you look for a CSV parser you'll find many that just don't cut it. The question of writing your own parser comes up, which I say is fine; just handle all valid CSV files.
Luckily you don't need to write your own, as I recently made a CSV parser for D. Error checking isn't done currently, as I don't know the best way to report issues such that parsing can be corrected and continued. Usage examples are found in the unittest blocks. You can parse into a struct too:
struct MyData {
    int a;
    string b;
    string c;
    double d;
}
foreach (data; csv.csv!MyData(str)) // I think I'll need to change the module/function name
{
    // ...
}

In Perl you could do something like:
my @anArray = split(',', "A,B,C,D,E,F,G");

(?:,|\s+)?([^ ,]+) should do.
It skips a comma or space, then selects anything but a comma or space. Modify to taste.
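For example, applying it with Python's re module to the sample input from the question:

import re

values = re.findall(r'(?:,|\s+)?([^ ,]+)', "3, asdf, *#, 1212.3")
print(values)  # ['3', 'asdf', '*#', '1212.3']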

Related

How to get the first digit on the left side of a string with python and regex?

I want to get a specific digit based on the string to the right of it.
This stretch of string is in body2.txt:
string = "<li>3 <span class='text-info'>quartos</span></li><li>1 <span class='text-info'>suíte</span></li><li>96<span class='text-info'>Área Útil (m²)</span></li>"
with open("body2.txt", 'r') as f:
area = re.compile(r'</span></li><li>(\d+)<span class="text-info">Área Útil')
area = area.findall(f.read())
print(area)
output: []
expected output: 96
You have a quote mismatch. Note carefully the difference between 'text-info' and "text-info" in your example string and in your compiled regex. IIRC escaping quotes in raw strings is a bit of a pain in Python (if it's even possible?), but string concatenation sidesteps the issue handily.
area = re.compile(r'</span></li><li>(\d+)<span class='"'"'text-info'"'"'>Área Útil')
Focusing on the quotes, this is concatenating the strings '...class', "'", 'text-info', "'", and '>.... The rule is that if you want a single quote ' inside a single-quoted raw string, you instead write '"'"' and try to ignore Turing turning in his grave. Note that Python concatenates adjacent string literals at compile time, so there is no runtime copying here; the quadratic-runtime worry only applies to concatenating many strings at runtime with +, where you'd be better off with nearly any other strategy (such as ''.join(...)). If the literal juggling offends you, triple-quoted raw strings like r'''...''' also sidestep the quoting problem. Benchmark your solution and see if it's good enough before messing with alternatives.
As one of the comments mentioned, you probably want to be parsing the HTML with something more powerful than regex. Regex cannot properly parse arbitrary HTML since it can't parse arbitrarily nested structures. There are plenty of libraries to make the job easier though and handle all of the bracket matching and string munging for you so that you can focus on a high-level description of exactly the data you want. I'm a fan of lxml. Without putting a ton of time into it, something like the following would be roughly equivalent to what you're doing.
from lxml import html

with open("body2.txt", 'r') as f:
    tree = html.fromstring(f.read())

area = tree.xpath("//li[contains(span/text(), 'Área Útil')]/text()")
print(area)
The html.fromstring() method parses your data as html. The tree.xpath method uses xpath syntax to query that parsed tree. Roughly speaking it means the following:
- // : arbitrarily far down in the tree
- li : a list item node
- [*] : satisfying whatever property is in the square brackets
- contains(span/text(), 'Área Útil') : the li node needs to have a span whose text contains 'Área Útil'
- /text() : we want any text that is an immediate child of the li we're describing.
I'm working on a pretty small amount of text here and don't know what your document structure is in the general case. You could add or change any of those properties to better describe the exact document you're parsing. When you inspect an element, any modern browser is able to generate a decent xpath expression to pick out exactly the element you're inspecting. Supposing this snippet came from a larger document I would imagine that functionality would be a time saver for you.
This will get the right digits no matter what form the target is in.
Capture group 1 contains the digits.
r"(\d*)\s*<span(?=\s)(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?\sclass\s*=\s*(?:(['\"])\s*text-info\s*\2))\s+(?=((?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]?)+>))\3\s*Área\s+Útil"
https://regex101.com/r/pMATkj/1

C++ trying to read in malformed CSV with erroneous commas

I am trying to make a simple CSV file parser to transfer a large number of orders from an order system to an invoicing system. The issue is that the CSV which I am downloading has erroneous commas which are sometimes present in the name field, and this throws the whole process off.
The company INSISTS, which is really starting to piss me off, that they are simply copying data they receive into the CSV and so it's valid data.
Excel mostly seems to interpret this correctly, or at least puts the data in the right field; my program, however, doesn't. I opened the CSV in Notepad++ and there are no quotes around strings, just raw strings separated by commas.
This is currently how I am reading the file.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using namespace std;
using vstring = vector<string>;

vstring explode(string const & s, char delim);

int main()
{
    string t;
    getline(cin, t);
    string Output;
    string path = "in.csv";
    ifstream input(path);
    vstring readout;
    vstring contact, InvoiceNumber, InvoiceDate, DueDate, Description, Quantity, UnitAmount, AccountCode, TaxType, Currency, Allocator, test, Backup, AllocatorBackup;
    vector<int> read, add, total;
    if (input.is_open()) {
        for (string line; getline(input, line); ) {
            auto arr = explode(line, ',');
            contact.push_back(arr[7]); // Source site is the customer in this instance.
            InvoiceNumber.push_back(arr[0]); // OrderID will be invoice number
            InvoiceDate.push_back(arr[1]); // Purchase date
            DueDate.push_back(arr[1]); // Same as order date
            Description.push_back(arr[0]);
            Quantity.push_back(arr[0]);
            UnitAmount.push_back(arr[10]); // The Total
            AccountCode.push_back(arr[7]); // Will be set depending on other factors - But contains the site of purchase
            Currency.push_back(arr[11]); // EUR/GBP
            Allocator.push_back(arr[6]); // This will decide the VAT treatment normally.
            AllocatorBackup.push_back(arr[5]); // This will decide VAT treatment if the column is off by one.
            Backup.push_back(arr[12]);
            TaxType = Currency;
        }
    }
    return 0;
}

vstring explode(string const & s, char delim) {
    vstring result;
    istringstream q(s);
    for (string token; getline(q, token, delim); ) {
        result.push_back(move(token));
    }
    return result;
}
vstring is an alias I created to save me typing vector<string> so often, so it's the same thing.
The issue is when I come across one of the fields with the comma in it (normally the name field, which is [3]) it of course pushes everything back by one, so account code becomes [8] etc. This is extremely troublesome as it's difficult to tell whether or not I am dealing with correct data in the next field in some cases.
So two questions:
1) Is there any simple way in which I could detect this anomaly and correct for it that I've missed? I do of course try to check in my loop, where I can, whether valid data is where it's expected to be, but this is becoming messy and does not cope with more than one comma.
2) Is the company correct in telling me that it's "expected behavior" to allow commas entered by a customer to creep into this CSV without being processed, or have they completely misunderstood the CSV "standard"?
Retired Ninja mentioned in the comments that one approach would be to parse all the fields on either side of the 'problem field' first, and then put the remaining data into the problem field. This is the best approach if you know which field might contain corruption. If you don't know which field could be corrupted, you still have options though!
You know:
The number of fields that should be present
Something about the type of data in each of those fields.
If you codify the types of the fields (implement classes for different data types, so your vectors of strings would become vectors of OrderIDs or Dates or Counts or....), you can test different concatenations (joining adjacent fields that are separated by a comma) and score them according to how many of the fields pass some data validation. You then choose the best scoring interpretation of the data. This would build some data validation into the process, and make everything a bit more robust.
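Here is a rough sketch of that scoring idea in Python (the column layout and the validators are invented for illustration; real ones would check dates, invoice IDs, currencies and so on):

from itertools import combinations

def is_int(s):
    return s.strip().isdigit()

def is_price(s):
    try:
        float(s.strip())
        return True
    except ValueError:
        return False

VALIDATORS = [is_int, lambda s: True, is_price]  # one validator per expected column

def best_interpretation(fields):
    extra = len(fields) - len(VALIDATORS)
    if extra <= 0:
        return fields
    best, best_score = fields, -1
    # Choose which boundaries were bogus, i.e. commas that were data, not separators.
    for merges in combinations(range(len(fields) - 1), extra):
        row, buf = [], fields[0]
        for i in range(1, len(fields)):
            if i - 1 in merges:
                buf += "," + fields[i]  # undo the bogus split
            else:
                row.append(buf)
                buf = fields[i]
        row.append(buf)
        score = sum(ok(f) for ok, f in zip(VALIDATORS, row))
        if score > best_score:
            best, best_score = row, score
    return best

print(best_interpretation(["12", "Doe", " John", "9.99"]))  # ['12', 'Doe, John', '9.99']

Each interpretation that puts plausible data in every column scores higher, so the merge that reunites "Doe, John" wins.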
'CSV' is not that well defined. There is the standard way, where ',' separates the columns and '\n' the rows. Sometimes quotes (") are used to handle these symbols inside a field, but Excel includes them only if a control character is involved.
Here is the definition from Wikipedia:
RFC 4180 formalized CSV. It defines the MIME type "text/csv", and CSV files that follow its rules should be very widely portable. Among its requirements:
- MS-DOS-style lines that end with (CR/LF) characters (optional for the last line).
- An optional header record (there is no sure way to detect whether it is present, so care is required when importing).
- Each record "should" contain the same number of comma-separated fields.
- Any field may be quoted (with double quotes).
- Fields containing a line-break, double-quote or commas should be quoted. (If they are not, the file will likely be impossible to process correctly.)
- A (double) quote character in a field must be represented by two (double) quote characters.
Comma-separated values
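As a quick illustration of those quoting rules, here is how Python's standard csv module (which follows RFC 4180 closely) round-trips a record whose fields contain a comma, a quote and a line break:

import csv, io

buf = io.StringIO()
csv.writer(buf).writerow(['a,b', 'he said "hi"', 'two\nlines'])
print(buf.getvalue())
# "a,b","he said ""hi""","two
# lines"
print(next(csv.reader(io.StringIO(buf.getvalue()))))
# ['a,b', 'he said "hi"', 'two\nlines']

Fields that need it get quoted, embedded quotes are doubled, and reading the file back recovers the original three fields.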
Keep in mind that Excel has different settings depending on the system/system language settings. It might be that their Excel parses it correctly, but somewhere else it doesn't.
For example, in countries like Germany ';' is used to separate the columns. The decimal separators differ as well.
1.5 << english
1,5 << german
Same goes for the thousands separator.
1,000,000 << english
1.000.000 << german
or
1 000 000 << also german
Now, Excel also has different CSV export settings like .csv (Separated values), .csv (Macintosh) and .csv (MS-DOS), so I guess there can be differences there too.
Now for your questions: in my opinion they are not clearly wrong in what they are doing with their files. But you should think about discussing an (E)BNF with them. Here are some links:
BNF
EBNF
It is a grammar which you both decide on, and with clear definitions the code should be no problem. I know customers can block something like this because they don't want to have extra work, but it is simply the best solution. If you want quotes (") in your file, they should provide them somehow. I don't know how they copy their data, but it should also be done by some kind of program (I don't think they do this by hand?), so your code and their code should use the same (E)BNF, which you decide on together with them.

HTML tokenizer algorithm

I'm trying to write a basic HTML parser which doesn't tolerate errors, and was reading the HTML5 parsing algorithm, but it's just too much information for a simple parser. I was wondering if someone had an idea on the logic for a basic tokenizer which would simply turn a small piece of HTML into a list of significant tokens. I'm more interested in the logic than the code..
std::string html = "<div id='test'> Hello <span>World</span></div>";
Tokenizer t;
t.tokenize(html);
So for the above html, I want to convert it to a list of something like this:
["<","div","id", "=", "test", ">", "Hello", "<", "span", ">", "world", "</", "span", ">", "<", "div", ">"]
I don't have anything for the tokenize method but was wondering if iterating over the html character by character is the best way to build the list..
void Tokenizer::tokenize(std::string html){
    std::list<std::string> tokens;
    for(int i = 0; i < html.length(); i++){
        char c = html[i];
        if(...){
            ...
        }
    }
}
I think what you are looking for is a lexical analyzer. Its goal is to get all the tokens that are defined in your language, in this case HTML. As @IraBaxter said, you can use a lexical tool, like Lex, which is found on Linux or OS X; but you must define the rules and, for this, you need to use regular expressions.
But if you want to know about an algorithm for this issue, you can check the book by Keith D. Cooper & Linda Torczon (Engineering a Compiler), chapter 2, Scanners. This chapter talks about automata and how they can be used to create a scanner, using a table-driven scanner to get tokens, like you want.
The idea is that you define a DFA where you have:
A finite set of states in the recognizer, including a start state, accepting states, and an error state.
An alphabet.
A transition function which determines whether a transition is valid, using the table of transitions or, if you don't want to use a table, by coding the automaton directly.
Take some time to study this chapter.
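To make the table-driven idea concrete, here is a minimal sketch in Python. The states and character classes are invented for illustration; it recognizes just identifiers and numbers, using the longest match:

ERROR = -1

def char_class(c):
    if c.isalpha(): return "alpha"
    if c.isdigit(): return "digit"
    return "other"

# Transition table: state -> character class -> next state.
TABLE = {
    0: {"alpha": 1, "digit": 2},  # start state
    1: {"alpha": 1},              # inside an identifier
    2: {"digit": 2},              # inside a number
}
ACCEPTING = {1: "IDENT", 2: "NUMBER"}

def next_token(text, pos):
    state, start, last_accept = 0, pos, None
    while pos < len(text):
        state = TABLE.get(state, {}).get(char_class(text[pos]), ERROR)
        if state == ERROR:
            break
        pos += 1
        if state in ACCEPTING:  # remember the longest accepted prefix so far
            last_accept = (ACCEPTING[state], text[start:pos], pos)
    return last_accept  # None means no token starts here

print(next_token("div42", 0))  # ('IDENT', 'div', 3)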
The other answers here are great, and you should definitely use a lexical-analyzer-generator like flex for the job. The input to such a generator is a list of rules that identify the different token types. An input file might look like this:
WHITE_SPACE \s+
IDENTIFIER  [a-zA-Z0-9_]+
LEFT_ANGLE  <
The algorithm that flex uses is essentially:
Find the rule that matches the most text.
If two rules match the same length of text, choose the one that occurs earlier in the list of rules provided.
You could write this algorithm quite easily yourself using regular expressions. However, do remember that this will not be as fast as flex, since flex compiles the regular expressions away into a very fast DFA.
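That loop is easy to sketch in Python with regular expressions (the rule set below is illustrative, loosely following the example input file above):

import re

RULES = [  # earlier rules win ties
    ("WHITE_SPACE", re.compile(r"\s+")),
    ("IDENTIFIER",  re.compile(r"[a-zA-Z0-9_]+")),
    ("LEFT_ANGLE",  re.compile(r"<")),
    ("RIGHT_ANGLE", re.compile(r">")),
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, rx in RULES:
            m = rx.match(text, pos)
            # Keep the longest match; strict '>' preserves rule order on ties.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError("no rule matches at position %d" % pos)
        if best[0] != "WHITE_SPACE":  # discard whitespace tokens
            tokens.append(best)
        pos += len(best[1])
    return tokens

print(tokenize("<div> Hello"))
# [('LEFT_ANGLE', '<'), ('IDENTIFIER', 'div'), ('RIGHT_ANGLE', '>'), ('IDENTIFIER', 'Hello')]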

D: split string by comma, but not quoted string

I need to split a string by commas that are not quoted, like:
foo, bar, "hello, user", baz
to get:
foo
bar
hello, user
baz
Using std.csv:
import std.csv;
import std.stdio;

void main()
{
    auto str = `foo,bar,"hello, user",baz`;
    foreach (row; csvReader(str))
    {
        writeln(row);
    }
}
Application output:
["foo", "bar", "hello, user", "baz"]
Note that I modified your CSV example data, as std.csv wouldn't correctly parse the original because of the space before the opening quote (").
You can use the following snippet to complete this task:

import std.regex;
import std.stdio;

string fileFullName = `D:\code\test\example.csv`;
File fileContent = File(fileFullName, "r");
// matches commas only when they are outside of double quotes
auto r = regex(`(?!\B"[^"]*),(?![^"]*"\B)`);
foreach (line; fileContent.byLine)
{
    auto result = split(line, r);
    writeln(result);
}
If you are parsing a specific file format, splitting by line and using regex often isn't correct, though it will work in many cases. I prefer to read it in character by character and keep a few flags for state (or use someone else's function where appropriate that does it for you for this format). D has std.csv: http://dlang.org/phobos/std_csv.html or my old old csv.d which is minimal but basically works too: https://github.com/adamdruppe/arsd/blob/master/csv.d (haha 5 years ago was my last change to it, but hey, it still works)
Similarly, you can kinda sorta "parse" html with regex... sometimes, but it breaks pretty quickly outside of simple cases and you are better off using an actual html parser (which probably is written to read char by char!)
Back to quoted commas, reading csv, for example, has a few rules with quoted content: first, of course, commas can appear inside quotes without going to the next field. Second, newlines can also appear inside quotes without going to the next row! Third, two quote characters in a row is an escaped quote that is in the content, not a closing quote.
foo,bar
"this item has
two lines, a comma, and a "" mark!",this is just bar
I'm not sure how to read that with regex (eyeballing it, I'm pretty sure yours gets the escaped quote wrong at least), but it isn't too hard to do when reading one character at a time (my little csv reader is about fifty lines, doing it by hand). Splitting the lines ahead of time also complicates things compared to just reading the characters, because you might then have to recombine lines later when you find one whose quoted field doesn't close! And then your beautiful byLine loop suddenly isn't so beautiful.
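For illustration, here is roughly what reading one character at a time looks like, sketched in Python rather than D (it handles the three rules above; this is a sketch, not the csv.d code):

def read_csv(text):
    rows, row, field = [], [], []
    in_quotes, i = False, 0
    while i < len(text):
        c = text[i]
        if in_quotes:
            if c == '"':
                if i + 1 < len(text) and text[i + 1] == '"':
                    field.append('"')  # rule three: "" is an escaped quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                field.append(c)  # rules one and two: commas and newlines pass through
        elif c == '"':
            in_quotes = True
        elif c == ',':
            row.append(''.join(field)); field = []
        elif c == '\n':
            row.append(''.join(field)); field = []
            rows.append(row); row = []
        else:
            field.append(c)
        i += 1
    if field or row:  # flush a last line that has no trailing newline
        row.append(''.join(field)); rows.append(row)
    return rows

print(read_csv('foo,bar\n"this item has\ntwo lines, a comma, and a "" mark!",this is just bar\n'))
# [['foo', 'bar'], ['this item has\ntwo lines, a comma, and a " mark!', 'this is just bar']]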
Besides, when looking back later, I find simple character readers and named functions to be more understandable than a regex anyway.
So, your answer is correct for the limited scope you asked about, but might be missing the big picture of other cases in the file format you are actually trying to read.
edit: one last thing I want to pontificate on, these corner cases in CSV are an example of why people often say "don't reinvent the wheel". It isn't that they are really hard to handle - look at my csv.d code, it is short, pretty simple, and works at everything I've thrown at it - but that's the rub, isn't it? "Everything I've thrown at it". To handle a file format, you need to be aware of what the corner cases are so you can handle them, at least if you want it to be generic and take arbitrary user input. Knowing these edge cases tends to come more from real world experience than just taking a quick glance. Once you know them though, writing the code again isn't terribly hard, you know what to test for! But if you don't know it, you can write beautiful code with hundreds of unittests... but miss the real world case your user just happens to try that one time it matters.

Parse tab delimited file with Boost.Spirit where entries may contain whitespace

I want to parse a tab delimited file using Boost.Spirit (Qi). My file looks something like this:
John Doe\tAge 23\tMember
Jane Doe\tAge 25\tMember
...
Is it possible to parse this with a skip parser? The problem I have right now is that boost::spirit::ascii::space also skips the whitespace within the name of the person. How would the phrase_parse(...) call look?
I am also using the Boost.Fusion tuples for convenient storage of the results in a struct:
struct Person
{
    string name;
    int age;
    string status;
};
This seems to work for the name:
String %= lexeme[+(char_-'\t')];
It matches every char that is not a tab. It is then used as part of the bigger rule:
Start %= Name >> Age >> Status;
Q. Is it possible to parse this with a skip parser?
A. No, it's not possible to parse anything with the skip parser. Skippers achieve the opposite: they disregard certain input information.
However, what you seem to be looking for is something like this hack (I don't recommend it):
Read empty values with boost::spirit
Now, you could look at my other answers for proper ways to parse CSV/TSV, dealing with embedded whitespace, quoted values, escaped quotes, etc. (I believe one even shows line-continuation characters):
How to parse csv using boost::spirit
Parse quoted strings with boost::spirit
How to make my split work only on one real line and be capable to skip quoted parts of string?