Parsing a CSV string while ignoring commas inside the individual columns - regex

I am trying to split a CSV string with a comma as the delimiter.
val string = "A,B,\"Hi,There\",C,D"
I cannot use string.split(",") because it would split "Hi,There" into two different columns. Can I use a regex to solve this? I came across the scala-csv parser, which I don't want to use. I hope there is a better way to solve this problem. I know this is not a trivial problem; it would be helpful if people could share their approaches to solving it.

I agree with Jeronimo Backes: CSV parsing is not trivial, and it is much better to use a library than to reinvent the wheel.
Besides uniVocity-parsers, there are some other, more Scala-oriented libraries available (underlying parser indicated):
product-collections (native / scala)
scala-csv (native / java-scala)
mighty-csv (opencsv)
PureCSV (opencsv)
object-csv (scala-csv)
product-collections, my own project, is well tested against the same data as uniVocity and also against csv-spectrum. It is strongly typed, reflection-free, and compatible with Scala.js. It has been benchmarked against most of the Java equivalents.
The other projects listed all have their strengths. scala-csv is very difficult to call from Java without shims, so although I've tested it internally I was not able to make a pull request against csv-parsers-comparison.
product-collections used to leverage opencsv, but in order to make it Scala.js compatible it now contains a native parser. The parser performs better than opencsv (speed, correctness) in all the scenarios I tested.

Use uniVocity-parsers' CsvParser for that instead of parsing it by hand. CSV is much harder than you think and there are many corner cases to cover; you just found one. In short, you NEED a library to read CSV reliably. uniVocity-parsers is used by other Scala projects (e.g. spark-csv).
I'll put an example using plain Java here, because I don't know Scala, but you'll get the idea:
public static void main(String ... args) {
    CsvParserSettings settings = new CsvParserSettings(); // many options here, check the documentation
    CsvParser parser = new CsvParser(settings);
    String[] row = parser.parseLine("A,B,\"Hi,There\",C,D");
    for (String value : row) {
        System.out.println(value);
    }
}
Output:
A
B
Hi,There
C
D
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

This regex covers your example, and possibly others, but it is certainly not robust:
(?:,?(".+?"))|(?:,?(.+?),?)
Here's a demo on regex101: https://regex101.com/r/wM7uW4/1
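For illustration only, here is a minimal sketch of applying that regex, written in C++ with std::regex (the pattern is plain ECMAScript syntax, so the same idea carries over to Scala's Regex.findAllMatchIn or Java's Matcher). Group 1 captures quoted fields and group 2 unquoted ones; note that the surrounding quotes are kept and escaped quotes ("") inside a field are not handled, which is exactly why the library-based answers above are the safer route.

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string line = "A,B,\"Hi,There\",C,D";
    // The regex from the answer above: quoted field in group 1, unquoted field in group 2.
    std::regex field("(?:,?(\".+?\"))|(?:,?(.+?),?)");

    for (std::sregex_iterator it(line.begin(), line.end(), field), end; it != end; ++it) {
        const std::smatch& m = *it;
        std::cout << (m[1].matched ? m[1].str() : m[2].str()) << '\n';
    }
    // Prints A, B, "Hi,There", C, D on separate lines, quotes still attached.
    return 0;
}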

Related

How to parse the java comments of a groovy file to html format?

I have a set of .groovy files (Java). All of these files have the same comment format.
I developed a tool with which I can read those files and apply a regex to get all the comments into a list. (In the end I just have to copy and paste these comments into an .html file.)
I would like to know if this is a correct practice for generating an HTML page with the comments (a kind of documentation). If not, what would you recommend?
I read about Doxygen and Javadoc, but I'm not sure about using them (whether they can really be useful in my case, since the comments are already written).
If you can suggest a library that makes it easy to generate an HTML page, or offer any other advice, please do.
Any help is appreciated.
There exists Groovydoc, which is roughly the equivalent of Javadoc, just for Groovy.
As your setup is different (you already have comments, probably not in Groovydoc format, and you already have half the tooling), there are still multiple ways open to you. Since you already extract the documentation from the Groovy sources, if I were you I would do minimal post-formatting, if necessary, and output the documentation as Markdown (e.g. GitHub Markdown) or AsciiDoc (e.g. Asciidoctor). Then you can use any preferred tool to convert the post-formatted documentation into HTML.
To answer the question "How to parse the java comments": you shouldn't. If possible, especially in a new project, stick with the standard tooling; in the case of Groovy that's Groovydoc. The normal (non-Javadoc/Groovydoc-style) comments you should never need to extract from the source code. They should be so context-specific that without the corresponding code they are useless anyway.

Line by line parsing a huge XML file by a light weight parser

I'm writing a small standalone tool for Linux which needs to read a huge XML file.
The XML file has a simple structure, so a progressive or streaming (line-by-line) parser is suitable for it.
I want to use a lightweight class library such as TinyXML, but I don't know whether it supports progressive parsing.
If the answer is "yes", do you have a sample? And if the answer is "no", do you know an alternative that is a small, header-only class library?
Update: How about RapidXML or pugixml?
Sounds like libxml's XmlReader interface is just what you want. Fast, simple, and streaming. Light-weight and XML don't mix, unfortunately. I prefer XmlReader's pull model to SAX's push model, but they'll both do what you want.
In the pull model, you call a function and get a new node, then check yourself if it matches. In the push model, you supply callbacks and SAX calls them as it finds nodes matching them.
TinyXML, last I checked, is not standards-compliant -- I would avoid it.
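To make the pull model concrete, here is a minimal sketch using libxml2's xmlreader interface. The file name big.xml is an assumption, error handling is minimal, and you would compile against libxml2 (e.g. with the flags from xml2-config --cflags --libs):

#include <libxml/xmlreader.h>
#include <iostream>

int main() {
    // Open a reader over the file; nothing beyond the current node is kept in memory.
    xmlTextReaderPtr reader = xmlReaderForFile("big.xml", nullptr, 0);
    if (reader == nullptr) return 1;

    while (xmlTextReaderRead(reader) == 1) {                      // pull the next node
        if (xmlTextReaderNodeType(reader) == XML_READER_TYPE_ELEMENT) {
            const xmlChar* name = xmlTextReaderConstName(reader);
            std::cout << reinterpret_cast<const char*>(name) << '\n';
        }
    }

    xmlFreeTextReader(reader);
    xmlCleanupParser();
    return 0;
}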

Extracting key words from HTML to C++ under linux

I am working on a simple client-server project. The client is written in Java; it sends keywords to a C++ server running under Linux and receives a list of URLs with the best ranks (depending on the number of occurrences of the keywords). The server's job is to go through some URLs in search of the keywords and return the best-fitting URLs. The problem is that I have to parse HTML pages to find occurrences of the keywords, plus I need to extract links from each visited page to search them as well. My question is: what library can I use to do that? Remember, only C++ Linux libraries are suitable for me. There were some similar topics, so I tried to go through most of them, but some of the libraries parse only HTML files, and I don't want to download every site I visit, but rather parse it on the fly and just store its rank and URL. Some of them look a bit complicated to me, for instance first parsing the HTML to XML or something else and then finally working on the results with C++. Is there something simple and sufficient to do what I need? Any advice will be appreciated.
I don't think regular expressions are appropriate for HTML parsing. I'm using libxml2, and I enjoy it very much - easy to use, portable and lightning fast.
To get URLs from the web using C/C++ you could use the libcurl library. To parse URLs and other not too easy stuff from the site you can use a regex library.
Separating the HTML tags from the real content can also be done without the use of a library.
For more advanced stuff one could use Qt, which offers classes such as QWebPage (which uses WebKit) that allow one to access the DOM model of the page and extract individual HTML objects (e.g. single cells of a table) rather easily.
You can try xerces-c. It's a powerful library for XML parsing. It supports reading XML on the fly, as well as DOM and SAX parsing.
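As a rough sketch of the libxml2 approach suggested above (the page content and URL below are placeholders; in practice the buffer would come from libcurl), libxml2's HTML parser plus XPath can pull out the links and text without writing any HTML to disk:

#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#include <iostream>
#include <string>

int main() {
    // Placeholder page; in the real tool this buffer would be filled by libcurl.
    std::string html = "<html><body><a href=\"http://example.com\">example</a> some text</body></html>";

    htmlDocPtr doc = htmlReadMemory(html.c_str(), static_cast<int>(html.size()),
                                    "http://example.com", nullptr,
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == nullptr) return 1;

    // Collect every href attribute of every <a> element.
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    xmlXPathObjectPtr links = xmlXPathEvalExpression(BAD_CAST "//a/@href", ctx);

    if (links != nullptr && links->nodesetval != nullptr) {
        for (int i = 0; i < links->nodesetval->nodeNr; ++i) {
            xmlChar* url = xmlNodeGetContent(links->nodesetval->nodeTab[i]);
            std::cout << reinterpret_cast<const char*>(url) << '\n';
            xmlFree(url);
        }
    }

    xmlXPathFreeObject(links);
    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
    return 0;
}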

How to start using xml with C++

(Not sure if this should be CW or not, you're welcome to comment if you think it should be).
At my workplace, we have many different file formats for all kinds of purposes. Most, if not all, of these file formats are just written in plain text, with no consistency. I'm only a student working part-time, and I have no experience with using XML in production, but it seems to me that using XML would improve productivity, as we often need to parse, check and compare these outputs.
So my questions are: given that I can only control one small application and its output (only the output; the inputs are formats that are used in other applications as well), is it worth trying to change the output to be XML-based? If so, what are the best-known ways to do that in C++ (i.e., XML parsers/writers, etc.)? Also, should I provide a plain-text output as well, to make it easy for the users (who are also programmers) to get used to XML? Should I provide a script to translate between XML and plain text? What are your experiences with this subject?
Thanks.
Don't just use XML because it's XML.
Use XML because:
other applications (that only accept XML) are going to read your output
you have a hierarchical data structure that lends itself perfectly to XML
you want to transform the data to other formats using XSL (e.g. to HTML)
EDIT:
A nice personal experience:
Customer: your application MUST be able to read XML.
Me: Er, OK, I will adapt my application so it can read XML.
Same customer (a few days later): your application MUST be able to read fixed width files, because we just realized our mainframe cannot generate XML.
Amir, to parse XML you can use TinyXML, which is incredibly easy to use and get started with. Check its documentation for a quick overview, and read the "what it does not do" section carefully. I've been using it for reading, and all I can say is that this tiny library does the job very well.
As for writing: if your XML files aren't complex, you might build them manually with a string object (a minimal sketch follows below). "Aren't complex" for me means that you're only going to store text at most.
For more complex XML reading/writing you'd better check Xerces, which is heavier than TinyXML. I haven't used it myself, but I've seen it in production and it does deliver.
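As a hypothetical illustration of the "build it with a string" suggestion above: this does no escaping at all, so it only holds up while the values are plain text.

#include <fstream>
#include <sstream>

int main() {
    std::ostringstream xml;
    xml << "<?xml version=\"1.0\"?>\n"
        << "<person>\n"
        << "  <name>John</name>\n"
        << "  <age>30</age>\n"
        << "</person>\n";

    std::ofstream out("person.xml");
    out << xml.str();   // note: <, >, & and quotes in values are NOT escaped here
    return 0;
}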
You can try using the boost::property_tree class.
http://www.boost.org/doc/libs/1_43_0/doc/html/property_tree.html
http://www.boost.org/doc/libs/1_43_0/doc/html/boost_propertytree/tutorial.html
http://www.boost.org/doc/libs/1_43_0/doc/html/boost_propertytree/parsers.html#boost_propertytree.parsers.xml_parser
It's pretty easy to use, but the page does warn that it doesn't support the XML format completely. If you do use this though, it gives you the freedom to easily use XML, INI, JSON, or INFO files without changing more than just the read_xml line.
If you want that ability, though, you should avoid XML attributes. To use an attribute, you have to look at the special <xmlattr> key, which won't transfer between file types (although you can manually create your own subnodes).
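A minimal sketch of the property_tree approach, assuming a hypothetical config.xml of the form <config><name attr="x">value</name></config>; swapping read_xml for read_json or read_ini is the portability point made above, and the <xmlattr> subtree is the XML-only part:

#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/xml_parser.hpp>
#include <iostream>
#include <string>

int main() {
    boost::property_tree::ptree tree;
    boost::property_tree::read_xml("config.xml", tree);

    // Element text is addressed by a dotted path.
    std::string name = tree.get<std::string>("config.name");
    // Attributes live under the special <xmlattr> key, which has no equivalent in INI/JSON.
    std::string attr = tree.get<std::string>("config.name.<xmlattr>.attr");

    std::cout << name << " " << attr << '\n';
    return 0;
}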
Although using TinyXML is probably better; I've seen it used in a couple of projects I've worked on, but I don't have any experience with it myself.
Another approach to handling XML in your application is to use a data binding tool, such as CodeSynthesis XSD. Such a tool will generate C++ classes that hide all the gory details of parsing/serializing XML -- all that you see are objects corresponding to your XML vocabulary and functions that you can call to get/set the data, for example:
Person p = person ("person.xml");   // parse person.xml into an object model
cout << p.name ();                  // read a value
p.name ("John");                    // modify the data
p.age (30);
ofstream ofs ("person.xml");
person (ofs, p);                    // serialize the object model back to XML
Here's what previous SO threads have said on the topic. Please add others you know of that are relevant:
What is the best open XML parser for C++?
What is XML good for and when should i be using it?
What are good alternative data formats to XML?
BTW, before you decide on an XML parser, you may want to make sure that it will actually be able to parse all XML documents instead of just the "simple" ones, as discussed in this article:
Are you using a real XML parser?

C++ Transformer scripting

I'm looking to see if there are any pre-existing projects that do this.
Generally, I need something that will load a C++ file, parse it, and then, based on a set of scripted rules, transform it: say, add headers, reformat, or remove coding quirks (for example, turning const int parameters in functions into int parameters), etc. Or perhaps something that would generate a DOM of some sort from the C++ file fed in, which could be manipulated and written out again.
Are there any such projects/products out there free or commercial?
Taras Glek of Mozilla has been working on the dehydra tool, based on Elkhound and scripted using JavaScript to transform the Mozilla codebase to fit with XPCOM and garbage collector changes.
The parser from Eclipse CDT seems to be pretty complete by now, as some refactoring methods have already been contributed to CDT.