Can I include cppcheck suppression within a function header? - c++

I have added an inline comment to suppress a cppcheck unusedFunction warning for a function, but I would like to include this within the function header so that Doxygen can document all of the unused functions (I am implementing an API, so I have many functions that will not be used in my source). I would prefer not to suppress all unusedFunction errors, but rather on a per-function basis.
I would like to do something like this:
/**
* API function description
*
* @param p1 function pointer to the ...
* @return 0 if successful, -1 otherwise.
* // cppcheck-suppress unusedFunction
*/
int CreateTask(Task_FuncPtr p1)
{
return doSomething();
}
But when I do this, cppcheck does not "see" the inline suppression. If I move it outside the header, but just before the function declaration, then the suppression works. The cppcheck documentation seems to imply that the suppression needs to be directly before the line generating the error.
Has anyone had success with this?

Taking a look at cppcheck sources (file preprocessor.cpp function RemoveComments()), it seems that you cannot do it.
The code to identify comments is:
if (str.compare(i, 2, "//") == 0) { /* ... */ }
and
else if (str.compare(i, 2, "/*") == 0) { /* ... */ }
When a comment is found, the code that manages warning suppression is:
if (_settings && _settings->_inlineSuppressions) {
std::istringstream iss(comment);
std::string word;
iss >> word;
if (word == "cppcheck-suppress") {
iss >> word;
if (iss)
suppressionIDs.push_back(word);
}
}
So cppcheck will skip spaces and check the first token immediately after // or /*.
Unfortunately Doxygen's special comment blocks start with /**, ///, /*! or //! and the third character prevents the "correct match".
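To make the mismatch concrete, here is a minimal sketch of the first-token check in isolation. The function name firstTokenIsSuppress is hypothetical (it is not a cppcheck function), and the sketch assumes the comment text handed to the check is whatever follows the two-character opener (// or /*):

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for the check quoted above; not cppcheck's actual API.
// Assumption: `comment` is the text that follows the two-character
// comment opener ("//" or "/*").
bool firstTokenIsSuppress(const std::string& comment) {
    std::istringstream iss(comment);
    std::string word;
    iss >> word;  // operator>> skips leading whitespace, then reads one token
    return word == "cppcheck-suppress";
}
```

Under that assumption, " cppcheck-suppress unusedFunction" (from a plain // comment) matches, while "/ cppcheck-suppress unusedFunction" (from ///) and "* cppcheck-suppress unusedFunction */" (from /**) do not, because the first token is "/" or "*".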
Changing:
if (word == "cppcheck-suppress") { /* ... */ }
into:
if (contains(word, "cppcheck-suppress")) { /* ... */ }
// or if (ends_with(word, "cppcheck-suppress"))
should allow what you want:
/**
* API function description
*
* @param p1 function pointer to the ...
* @return 0 if successful, -1 otherwise.
*/
/** cppcheck-suppress unusedFunction */
or
/// API function description
///
/// @param p1 function pointer to the ...
/// @return 0 if successful, -1 otherwise.
///
/// cppcheck-suppress unusedFunction
You can probably open a ticket on http://trac.cppcheck.net/


Error compiling the first example code of FreeLing (C++)

I've installed 'freeling-4.2-focal-amd64.deb' on 'Linux Mint 20.3 Cinnamon' and also all the dependencies mentioned here:
https://freeling-user-manual.readthedocs.io/en/v4.2/installation/requirements-linux/#install-dependencies
I tried to compile the first example code from here:
Explanation: https://freeling-tutorial.readthedocs.io/en/latest/example01/
Code: https://freeling-tutorial.readthedocs.io/en/latest/code/example01.cc/
The following Error message appeared while compiling:
FAILED: CMakeFiles/test_freeling.dir/main.cpp.o
/usr/bin/c++ -g -std=gnu++14 -MD -MT CMakeFiles/test_freeling.dir/main.cpp.o -MF CMakeFiles/test_freeling.dir/main.cpp.o.d -o CMakeFiles/test_freeling.dir/main.cpp.o -c /home/ben/Schreibtisch/Bachelor_Projekt/test_freeling/main.cpp
In file included from /usr/include/freeling/morfo/idioma.h:45,
from /usr/include/freeling/morfo/lang_ident.h:44,
from /usr/include/freeling.h:35,
from /home/ben/desktop/Projekt/test_freeling/main.cpp:2:
/usr/include/freeling/morfo/smoothingLD.h: In constructor ‘freeling::smoothingLD<G, E>::smoothingLD(const wstring&, const std::map<std::__cxx11::basic_string<wchar_t>, E>&)’:
/usr/include/freeling/morfo/smoothingLD.h:129:73: error: there are no arguments to ‘log’ that depend on a template parameter, so a declaration of ‘log’ must be available [-fpermissive]
129 | if (name==L"LinearDiscountAlpha") { double a; sin>>a; alpha = log(a); notalpha=log(1-a); }
| ^~~
/usr/include/freeling/morfo/smoothingLD.h:129:73: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
/usr/include/freeling/morfo/smoothingLD.h:129:90: error: there are no arguments to ‘log’ that depend on a template parameter, so a declaration of ‘log’ must be available [-fpermissive]
129 | if (name==L"LinearDiscountAlpha") { double a; sin>>a; alpha = log(a); notalpha=log(1-a); }
| ^~~
/usr/include/freeling/morfo/smoothingLD.h:179:18: error: there are no arguments to ‘log’ that depend on a template parameter, so a declaration of ‘log’ must be available [-fpermissive]
179 | pUnseen = -log(vsize-ntypes); // log version of 1/(vsize-ntypes)
| ^~~
/usr/include/freeling/morfo/smoothingLD.h:180:14: error: there are no arguments to ‘log’ that depend on a template parameter, so a declaration of ‘log’ must be available [-fpermissive]
180 | nobs = log(nobs);
| ^~~
Here is the corresponding source code:
////////////////////////////////////////////////////////////////
//
// FreeLing - Open Source Language Analyzers
//
// Copyright (C) 2014 TALP Research Center
// Universitat Politecnica de Catalunya
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Affero General Public
// License as published by the Free Software Foundation; either
// version 3 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Affero General Public License for more details.
//
// You should have received a copy of the GNU Affero General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//
// contact: Lluis Padro (padro@lsi.upc.es)
// TALP Research Center
// despatx C6.212 - Campus Nord UPC
// 08034 Barcelona. SPAIN
//
////////////////////////////////////////////////////////////////
#ifndef _SMOOTHING_LD
#define _SMOOTHING_LD
#include "freeling/morfo/traces.h"
#include "freeling/morfo/util.h"
#include "freeling/morfo/configfile.h"
#undef MOD_TRACENAME
#define MOD_TRACENAME L"SMOOTHING"
#undef MOD_TRACEMODULE
#define MOD_TRACEMODULE LANGIDENT_TRACE
namespace freeling {
///////////////////////////////////////////////////////////////
/// Class smoothingLD computes linear-discount smoothed conditional
/// probabilities P(z|w1...wn) for n-gram transitions.
///
/// template parameters:
/// - E is the type of the ngram elements (e.g. char, int, string, etc.)
/// - G is the type containing the ngram (e.g. string, list<int>, vector<string>, etc.)
/// Elements of G must be of type E.
/// G must have operations push_back(E), erase(G::iterator), and begin()
///////////////////////////////////////////////////////////////
template <class G,class E>
class smoothingLD {
private:
/// log alpha and 1-alpha parameter for linear discount
double alpha;
double notalpha;
/// map to store ngram counts (for any size of ngram)
std::map<G,double> counts;
// probability of unseen unigrams
double pUnseen;
// total number of observations
double nobs;
/// order of the n-gram model. Order=3 => trigram model P(z|xy)
size_t order;
/// translation table for escaped symbols in input (e.g. \s, \n, \t...)
std::map<std::wstring,E> escapes;
//////////////////////////////////////////
// Get observed counts for given ngram
double count(const G &ngram) const {
typename std::map<G, double>::const_iterator p = counts.find(ngram);
if (p!=counts.end()) return p->second;
else return -1;
}
public:
//////////////////////////////////////////
/// Constructor, load data from config file
smoothingLD(const std::wstring &cfgFile,
const std::map<std::wstring,E> &esc=std::map<std::wstring,E>()) : escapes(esc) {
double ntypes=0;
double vsize=0;
nobs=0;
order=0;
enum sections {ORDER,NGRAMS,SMOOTHING};
config_file cfg(true);
cfg.add_section(L"Order",ORDER,true);
cfg.add_section(L"NGrams",NGRAMS,true);
cfg.add_section(L"Smoothing",SMOOTHING,true);
if (not cfg.open(cfgFile))
ERROR_CRASH(L"Error opening file "+cfgFile);
std::wstring line;
while (cfg.get_content_line(line)) {
std::wistringstream sin;
sin.str(line);
// process each content line according to the section where it is found
switch (cfg.get_section()) {
case ORDER: {// read order of ngram model
std::wistringstream sin;
sin.str(line);
size_t x; sin>>x;
if (order!=0 and order!=x)
ERROR_CRASH(L"ERROR - Specified model order does not match ngram size");
order = x;
break;
}
case SMOOTHING: { // reading general parameters
std::wstring name;
sin>>name;
if (name==L"LinearDiscountAlpha") { double a; sin>>a; alpha = log(a); notalpha=log(1-a); }
else if (name==L"VocabularySize") sin>>vsize;
else ERROR_CRASH(L"Unexpected smoothing option '"+name+L"'");
break;
}
case NGRAMS: { // reading ngram counts
// read counts for this ngram to table
double c; sin >> c;
// read ngram components into a G object.
G ngram;
std::wstring w;
while (sin >> w) {
typename std::map<std::wstring,E>::const_iterator p=escapes.find(w);
if (p!=escapes.end())
ngram.push_back(p->second);
else
ngram.push_back(util::wstring_to<E>(w));
}
if (order==0) order = ngram.size();
else if (order != ngram.size())
ERROR_CRASH(L"ERROR - Mixed order ngrams in input file, or specified model order does not match ngram size");
// add ngram (and n-i gram) counts to the model
while (ngram.size()>1) {
// insert ngram count, or increase if it already existed
std::pair<typename std::map<G,double>::iterator,bool> x = counts.insert(make_pair(ngram,c));
if (not x.second) x.first->second += c;
// shorten n gram and loop to insert n-1 gram
ngram.erase(std::prev(ngram.end()));
}
// unigram is left. Add it
std::pair<typename std::map<G,double>::iterator,bool> x = counts.insert(make_pair(ngram,c));
if (x.second) ntypes++; // new unigram inserted, increase type count
else x.first->second += c; // existing unigram, increase count
// update total observation counts
nobs += c;
break;
}
default: break;
}
}
cfg.close();
// precompute logs needed for logprob
if (vsize<=ntypes)
ERROR_CRASH(L"VocabularySize can not be smaller than number of different observed unigrams.");
pUnseen = -log(vsize-ntypes); // log version of 1/(vsize-ntypes)
nobs = log(nobs);
for (typename std::map<G,double>::iterator c=counts.begin(); c!=counts.end(); c++)
c->second = log(c->second);
}
//////////////////////////////////////////
/// destructor
~smoothingLD() {}
//////////////////////////////////////////
/// Compute smoothed conditional log prob of seeing
/// symbol z following given ngram P(z|ngram)
double Prob(const G &ngram, const E &z) const {
// seq = n+1 gram ( ngram + z )
G seq = ngram;
seq.push_back(z);
// log count of complete ngram
double c = count(seq);
if (ngram.size()==0) {
// no conditioning history, use unigrams (seq = [z])
if (c>=0) return notalpha + c - nobs; // log version of (1-alpha)*count(c)/nobs
else return alpha + pUnseen; // log version of alpha * punseen
}
else {
// conditioning history, use LD smoothing
if (c>=0) return notalpha + c - count(ngram); // log version of (1-alpha)*count(c)/count(ngram)
else {
// shorten history and recurse
G sht = ngram; sht.erase(sht.begin());
return alpha + Prob(sht,z); // log version of alpha * Prob(sht,z)
}
}
}
};
} // namespace
#undef MOD_TRACENAME
#undef MOD_TRACECODE
#endif
Why isn't log recognized in this file of the lib and what can I do about it?
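It seems the library has not included the proper header file (<cmath>) and is using std::log as if it were guaranteed to be available as an unqualified log. Because none of the arguments to log depend on a template parameter, the compiler has to find a declaration of log when it parses the template, not when it instantiates it, so the missing declaration is an error.
I suggest that you ask the FreeLing maintainers to fix the library. While waiting for a fix, you can try this: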
#include <cmath>
using std::log;
before including any of the FreeLing headers in your own code.
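As a rough illustration of why this helps, here is a minimal sketch with a hypothetical template, LogHolder, standing in for freeling::smoothingLD. The call log(a) has a plain double argument, so it is a non-dependent name and must already be declared at the point where the template is defined:

```cpp
#include <cmath>   // declares std::log
using std::log;    // makes the unqualified name `log` visible at global scope

// Hypothetical template standing in for freeling::smoothingLD.
template <class T>
struct LogHolder {
    double value;
    // `a` is a double, so log(a) does not depend on T: the compiler
    // looks the name up here, when the template is parsed.
    explicit LogHolder(double a) : value(log(a)) {}
};
```

If the first two lines are removed and nothing else declares log, GCC is expected to reject the template with the same "no arguments to 'log' that depend on a template parameter" diagnostic, even if the template is never instantiated.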

GCC Plugin: Is it possible to move a sequence of a basic block of one function to a basic block of another function?

I'm trying to create a plugin for gcc that allows you to instrument the prologue and the epilogue of a function.
The instrumentation code is inserted in two functions "instrument_entry" and "instrument_exit". These two functions are written in a file called instrumentation.h, which is included in the source code of the software I want to instrument.
In a nutshell, it is very similar to GCC's -finstrument-functions option.
Unlike the option provided by gcc, through the plugin, I would like to take the code present in the function "instrument_entry" (I checked and this function contains only one basic block) and insert it in all other prologues of the functions present in the software.
I thought to take the sequence present in the basic block of the function "instrument_entry" and insert it in the first basic block of the function to be instrumented.
This is the code:
static basic_block bb_entry_instr;
static unsigned int instrument_functions(function *fun){
std::string func_name = function_name(fun);
if(func_name == "instrument_entry"){
basic_block bb;
bb = ENTRY_BLOCK_PTR_FOR_FN(fun);
bb = bb->next_bb;
bb_entry_instr = bb;
std::cout << "Instrumentation code found! " << std::endl;
}
if(func_name != "instrument_entry" && func_name != "instrument_exit"){
basic_block bb;
bb = ENTRY_BLOCK_PTR_FOR_FN(fun);
bb = bb->next_bb;
gimple_stmt_iterator gsi = gsi_start_bb(bb);
gsi_insert_seq_before(&gsi, bb_seq(bb_entry_instr), GSI_NEW_STMT);
}
return 0;
} // end of instrument_functions()
bb_entry_instr is the basic block of the "instrument_entry" function
The pass I created for gcc is called after the "cfg" pass:
namespace {
const pass_data instrumentation_pass_data = {
GIMPLE_PASS,
"instr_pass2", /* name */
OPTGROUP_NONE, /* optinfo_flags */
TV_NONE, /* tv_id */
PROP_gimple_any, /* properties_required */
0, /* properties_provided */
0, /* properties_destroyed */
0, /* todo_flags_start */
0 /* todo_flags_finish */
};
struct instrumentation_pass : gimple_opt_pass {
instrumentation_pass(gcc::context *ctx) : gimple_opt_pass(instrumentation_pass_data, ctx){}
unsigned int execute(function *fun) {
return instrument_functions(fun);
}
};
}
int plugin_init(struct plugin_name_args *plugin_info, struct plugin_gcc_version *version){
... // plugin_default_version_check
struct register_pass_info instr_pass_info;
instr_pass_info.pass = new instrumentation_pass(g);
instr_pass_info.reference_pass_name = "cfg";
instr_pass_info.ref_pass_instance_number = 1;
instr_pass_info.pos_op = PASS_POS_INSERT_AFTER;
register_callback(plugin_info->base_name, PLUGIN_PASS_MANAGER_SETUP, NULL, &instr_pass_info);
return 0;
}
When I try to compile a test program with the plugin, I get this error:
during RTL pass: expand
test_01.c: In function ‘instrument_entry’:
test_01.c:13:9: internal compiler error: in get_rtx_for_ssa_name, at tree-outof-ssa.h:62
13 | fgets(name, 0xff, stdin);
| ^~~~~~~~~~~~~~~~~~~~~~~~
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-9/README.Bugs> for instructions.
make: *** [Makefile:14: test] Error 1
Can somebody help me?

Is there an equivalent for Glob in D Phobos?

In python I can use glob to search path patterns. This for instance:
import glob
for entry in glob.glob("/usr/*/python*"):
print(entry)
Would print this:
/usr/share/python3
/usr/share/python3-plainbox
/usr/share/python
/usr/share/python-apt
/usr/include/python3.5m
/usr/bin/python3
/usr/bin/python3m
/usr/bin/python2.7
/usr/bin/python
/usr/bin/python3.5
/usr/bin/python3.5m
/usr/bin/python2
/usr/lib/python3
/usr/lib/python2.7
/usr/lib/python3.5
How would I glob or make a glob equivalent in D?
------Updated on Sep 12 2017------
I wrote a small D module to do Glob in D: https://github.com/workhorsy/d-glob
If you only work on a Posix system, you can directly call glob.h. Here's a simple example that shows how easy it is to interface with the Posix API:
void main()
{
import std.stdio;
import glob : glob;
foreach(entry; glob("/usr/*/python*"))
writeln(entry);
}
You can compile this e.g. with rdmd main.d (rdmd does simple dependency management) or dmd main.d glob.d and it yields a similar output as yours on my machine.
glob.d was generated by dstep and is enhanced with a convenience D-style wrapper (first function). Please note that this isn't perfect and a better way would be to expose a range API instead of allocating the entire array.
/* Copyright (C) 1991-2016 Free Software Foundation, Inc.
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, see
<http://www.gnu.org/licenses/>. */
string[] glob(string pattern)
{
import std.string;
string[] results;
glob_t glob_result;
glob(pattern.toStringz, 0, null, &glob_result);
for (uint i = 0; i < glob_result.gl_pathc; i++)
{
results ~= glob_result.gl_pathv[i].fromStringz().idup;
}
globfree(&glob_result);
return results;
}
import core.stdc.config;
extern (C):
enum _GLOB_H = 1;
/* We need `size_t' for the following definitions. */
alias c_ulong __size_t;
alias c_ulong size_t;
/* The GNU CC stddef.h version defines __size_t as empty. We need a real
definition. */
/* Bits set in the FLAGS argument to `glob'. */
enum GLOB_ERR = 1 << 0; /* Return on read errors. */
enum GLOB_MARK = 1 << 1; /* Append a slash to each name. */
enum GLOB_NOSORT = 1 << 2; /* Don't sort the names. */
enum GLOB_DOOFFS = 1 << 3; /* Insert PGLOB->gl_offs NULLs. */
enum GLOB_NOCHECK = 1 << 4; /* If nothing matches, return the pattern. */
enum GLOB_APPEND = 1 << 5; /* Append to results of a previous call. */
enum GLOB_NOESCAPE = 1 << 6; /* Backslashes don't quote metacharacters. */
enum GLOB_PERIOD = 1 << 7; /* Leading `.' can be matched by metachars. */
enum GLOB_MAGCHAR = 1 << 8; /* Set in gl_flags if any metachars seen. */
enum GLOB_ALTDIRFUNC = 1 << 9; /* Use gl_opendir et al functions. */
enum GLOB_BRACE = 1 << 10; /* Expand "{a,b}" to "a" "b". */
enum GLOB_NOMAGIC = 1 << 11; /* If no magic chars, return the pattern. */
enum GLOB_TILDE = 1 << 12; /* Expand ~user and ~ to home directories. */
enum GLOB_ONLYDIR = 1 << 13; /* Match only directories. */
enum GLOB_TILDE_CHECK = 1 << 14; /* Like GLOB_TILDE but return an error
if the user name is not available. */
enum __GLOB_FLAGS = GLOB_ERR | GLOB_MARK | GLOB_NOSORT | GLOB_DOOFFS | GLOB_NOESCAPE | GLOB_NOCHECK | GLOB_APPEND | GLOB_PERIOD | GLOB_ALTDIRFUNC | GLOB_BRACE | GLOB_NOMAGIC | GLOB_TILDE | GLOB_ONLYDIR | GLOB_TILDE_CHECK;
/* Error returns from `glob'. */
enum GLOB_NOSPACE = 1; /* Ran out of memory. */
enum GLOB_ABORTED = 2; /* Read error. */
enum GLOB_NOMATCH = 3; /* No matches found. */
enum GLOB_NOSYS = 4; /* Not implemented. */
/* Previous versions of this file defined GLOB_ABEND instead of
GLOB_ABORTED. Provide a compatibility definition here. */
/* Structure describing a globbing run. */
struct glob_t
{
__size_t gl_pathc; /* Count of paths matched by the pattern. */
char** gl_pathv; /* List of matched pathnames. */
__size_t gl_offs; /* Slots to reserve in `gl_pathv'. */
int gl_flags; /* Set to FLAGS, maybe | GLOB_MAGCHAR. */
/* If the GLOB_ALTDIRFUNC flag is set, the following functions
are used instead of the normal file access functions. */
void function (void*) gl_closedir;
void* function (void*) gl_readdir;
void* function (const(char)*) gl_opendir;
int function (const(char)*, void*) gl_lstat;
int function (const(char)*, void*) gl_stat;
}
/* If the GLOB_ALTDIRFUNC flag is set, the following functions
are used instead of the normal file access functions. */
/* Do glob searching for PATTERN, placing results in PGLOB.
The bits defined above may be set in FLAGS.
If a directory cannot be opened or read and ERRFUNC is not nil,
it is called with the pathname that caused the error, and the
`errno' value from the failing call; if it returns non-zero
`glob' returns GLOB_ABEND; if it returns zero, the error is ignored.
If memory cannot be allocated for PGLOB, GLOB_NOSPACE is returned.
Otherwise, `glob' returns zero. */
int glob (
const(char)* __pattern,
int __flags,
int function (const(char)*, int) __errfunc,
glob_t* __pglob);
/* Free storage allocated in PGLOB by a previous `glob' call. */
void globfree (glob_t* __pglob);
None of the answers above worked 100% the same as Glob on Windows and Linux. So I made a small D module that does Glob the right way. Hopefully people find it useful:
https://github.com/workhorsy/d-glob
import std.stdio : stdout;
import glob : glob;
foreach (entry ; glob("/usr/*/python*")) {
stdout.writefln("%s", entry);
}
Maybe you are looking for std.file.dirEntries().
Here's an example from the documentation:
// Iterate over all D source files in current directory and all its
// subdirectories
auto dFiles = dirEntries("","*.{d,di}",SpanMode.depth);
foreach(d; dFiles)
writeln(d.name);
Basically you don't need to do all this complicated stuff with headers, C and so on, and so forth. This should do the thing:
auto dirIter = dirEntries("/usr", "*/python*", SpanMode.shallow);
foreach(dirFile; dirIter) {
// Process the result as needed
}

Import CSV into Vertica using Rfc4180CsvParser and exclude header row

Is there a way to exclude the header row when importing data via the Rfc4180CsvParser? The COPY command has a SKIP option but the option doesn't seem to work when using the CSV parsers provided in the Vertica SDK.
Background
As background, the COPY command does not read CSV files by itself. For simple CSV files, one can say COPY schema.table FROM '/data/myfile.csv' DELIMITER ',' ENCLOSED BY '"'; but this will fail with data files which have string values with embedded quotes.
Adding ESCAPE AS '"' will generate an error ERROR 3169: ENCLOSED BY and ESCAPE AS can not be the same value . This is a problem as CSV values are enclosed and escaped by ".
Vertica SDK CsvParser extensions to the rescue
Vertica provides an SDK under /opt/vertica/sdk/examples with C++ programs that can be compiled into extensions. One of these is /opt/vertica/sdk/examples/ParserFunctions/Rfc4180CsvParser.cpp.
This works great as follows:
cd /opt/vertica/sdk/examples
make clean
vsql
==> CREATE LIBRARY Rfc4180CsvParserLib AS '/opt/vertica/sdk/examples/build/Rfc4180CsvParser.so';
==> COPY myschema.mytable FROM '/data/myfile.csv' WITH PARSER Rfc4180CsvParser();
Problem
The above works great except that it imports the first row of the data file as a row. The COPY command has a SKIP 1 option but this does not work with the parser.
Question
Is it possible to edit Rfc4180CsvParser.cpp to skip the first row, or better yet, take some parameter to specify the number of rows to skip?
The program is just 135 lines but I don't see where/how to make this incision. Hints?
Copying the entire program below as I don't see a public repo to link to...
Rfc4180CsvParser.cpp
/* Copyright (c) 2005 - 2012 Vertica, an HP company -*- C++ -*- */
#include "Vertica.h"
#include "StringParsers.h"
#include "csv.h"
using namespace Vertica;
// Note, the class template is mostly for demonstration purposes,
// so that the same class can use each of two string-parsers.
// Custom parsers can also just pick a string-parser to use.
/**
* A parser that parses something approximating the "official" CSV format
* as defined in IETF RFC-4180: <http://tools.ietf.org/html/rfc4180>
* Oddly enough, many "CSV" files don't actually conform to this standard
* for one reason or another. But for sources that do, this parser should
* be able to handle the data.
* Note that the CSV format does not specify how to handle different
* data types; it is entirely a string-based format.
* So we just use standard parsers based on the corresponding column type.
*/
template <class StringParsersImpl>
class LibCSVParser : public UDParser {
public:
LibCSVParser() : colNum(0) {}
// Keep a copy of the information about each column.
// Note that Vertica doesn't let us safely keep a reference to
// the internal copy of this data structure that it shows us.
// But keeping a copy is fine.
SizedColumnTypes colInfo;
// An instance of the class containing the methods that we're
// using to parse strings to the various relevant data types
StringParsersImpl sp;
/// Current column index
size_t colNum;
/// Parsing state for libcsv
struct csv_parser parser;
// Format strings
std::vector<std::string> formatStrings;
/**
* Given a field in string form (a pointer to the first character and
* a length), submit that field to Vertica.
* `colNum` is the column number from the input file; how many fields
* it is into the current record.
*/
bool handleField(size_t colNum, char* start, size_t len) {
if (colNum >= colInfo.getColumnCount()) {
// Ignore column overflow
return false;
}
// Empty columns are null.
if (len==0) {
writer->setNull(colNum);
return true;
} else {
return parseStringToType(start, len, colNum, colInfo.getColumnType(colNum), writer, sp);
}
}
static void handle_record(void *data, size_t len, void *p) {
static_cast<LibCSVParser*>(p)->handleField(static_cast<LibCSVParser*>(p)->colNum++, (char*)data, len);
}
static void handle_end_of_row(int c, void *p) {
// Ignore 'c' (the terminating character); trust that it's correct
static_cast<LibCSVParser*>(p)->colNum = 0;
static_cast<LibCSVParser*>(p)->writer->next();
}
virtual StreamState process(ServerInterface &srvInterface, DataBuffer &input, InputState input_state) {
size_t processed;
while ((processed = csv_parse(&parser, input.buf + input.offset, input.size - input.offset,
handle_record, handle_end_of_row, this)) > 0) {
input.offset += processed;
}
if (input_state == END_OF_FILE && input.size == input.offset) {
csv_fini(&parser, handle_record, handle_end_of_row, this);
return DONE;
}
return INPUT_NEEDED;
}
virtual void setup(ServerInterface &srvInterface, SizedColumnTypes &returnType);
virtual void destroy(ServerInterface &srvInterface, SizedColumnTypes &returnType) {
csv_free(&parser);
}
};
template <class StringParsersImpl>
void LibCSVParser<StringParsersImpl>::setup(ServerInterface &srvInterface, SizedColumnTypes &returnType) {
csv_init(&parser, CSV_APPEND_NULL);
colInfo = returnType;
}
template <>
void LibCSVParser<FormattedStringParsers>::setup(ServerInterface &srvInterface,
SizedColumnTypes &returnType) {
csv_init(&parser, CSV_APPEND_NULL);
colInfo = returnType;
if (formatStrings.size() != returnType.getColumnCount()) {
formatStrings.resize(returnType.getColumnCount(), "");
}
sp.setFormats(formatStrings);
}
template <class StringParsersImpl>
class LibCSVParserFactoryTmpl : public ParserFactory {
public:
virtual void plan(ServerInterface &srvInterface,
PerColumnParamReader &perColumnParamReader,
PlanContext &planCtxt) {}
virtual UDParser* prepare(ServerInterface &srvInterface,
PerColumnParamReader &perColumnParamReader,
PlanContext &planCtxt,
const SizedColumnTypes &returnType)
{
return vt_createFuncObj(srvInterface.allocator,
LibCSVParser<StringParsersImpl>);
}
};
typedef LibCSVParserFactoryTmpl<StringParsers> LibCSVParserFactory;
RegisterFactory(LibCSVParserFactory);
typedef LibCSVParserFactoryTmpl<FormattedStringParsers> FormattedLibCSVParserFactory;
RegisterFactory(FormattedLibCSVParserFactory);
The quick and dirty way would be to just hardcode it. The parser uses a callback, handle_end_of_row, at the end of each row. Track the row number and just don't process the first row. Something like:
static void handle_end_of_row(int c, void *ptr) {
// Ignore 'c' (the terminating character); trust that it's correct
LibCSVParser *p = static_cast<LibCSVParser*>(ptr);
p->colNum = 0;
if (rowcnt <= 0) {
p->bad_field = "";
rowcnt++;
} else if (p->bad_field.empty()) {
p->writer->next();
} else {
// libcsv doesn't give us the whole row to reject.
// So just write to the log.
// TODO: Come up with something more clever.
if (p->currSrvInterface) {
p->currSrvInterface->log("Invalid CSV field value: '%s' Row skipped.",
p->bad_field.c_str());
}
p->bad_field = "";
}
}
Also, it is best to initialize rowcnt = 0 in process, since I think it will be called for each file in your COPY statement. There might be more clever ways of doing this. Basically, this will just process the record and then discard it.
As for supporting SKIP generically... look at TraditionalCSVParser for how to handle parameter passing. You'd have to add it to the parser factory's prepare, pass the value to the LibCSVParser class, and override getParameterType. Then in LibCSVParser you need to accept the parameter in the constructor and modify process to skip the first skip rows. Then use that value instead of the hardcoded 0 above.
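The row-skipping idea can be sketched on its own, without the Vertica SDK or libcsv. Everything below is hypothetical: SkippingRowSink, onField and onEndOfRow only mimic the shape of handle_record and handle_end_of_row, and the skip count is passed to the constructor the way a parser parameter might be. Unlike the snippet above, this sketch drops the skipped fields outright rather than letting the next row overwrite them.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the parser's row handling; none of these
// names come from the Vertica SDK or libcsv.
class SkippingRowSink {
public:
    explicit SkippingRowSink(std::size_t rowsToSkip) : rowsToSkip_(rowsToSkip) {}

    // Called once per field, in the role of handle_record.
    void onField(const std::string& field) { current_.push_back(field); }

    // Called once per row, in the role of handle_end_of_row.
    void onEndOfRow() {
        if (rowsSeen_ >= rowsToSkip_) {
            rows_.push_back(current_);  // stands in for writer->next()
        }
        current_.clear();  // reset column state for the next row
        ++rowsSeen_;
    }

    const std::vector<std::vector<std::string>>& rows() const { return rows_; }

private:
    std::size_t rowsToSkip_;
    std::size_t rowsSeen_ = 0;
    std::vector<std::string> current_;
    std::vector<std::vector<std::string>> rows_;
};
```

Constructing the sink with 1 and feeding it a header row followed by a data row leaves only the data row in rows().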

C++ scanner.h scan content between double-quotes as a token: not skipping spaces inside quotes

I'm trying to get the content between a double-quote to count as one token for an assignment.
For example:
"hello world" = 1 token
"hello" "world" = 3 tokens (because space counts as 1 token)
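To pin down the counting rule, here is a minimal standalone sketch of the behaviour I am after. It does not use the course Scanner class; countTokensWithQuotes is a hypothetical helper, and it assumes that an unterminated quote simply runs to the end of the input:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, independent of the Scanner class.
// Rules: a quoted run is one token, a run of letters/digits is one token,
// and any other single character (including a space) is one token.
int countTokensWithQuotes(const std::string& input) {
    int count = 0;
    std::size_t cp = 0;
    const std::size_t len = input.size();
    while (cp < len) {
        if (input[cp] == '"') {
            ++cp;                                        // opening quote
            while (cp < len && input[cp] != '"') ++cp;   // body, spaces included
            if (cp < len) ++cp;                          // closing quote, if any
        } else if (std::isalnum(static_cast<unsigned char>(input[cp]))) {
            while (cp < len &&
                   std::isalnum(static_cast<unsigned char>(input[cp]))) {
                ++cp;
            }
        } else {
            ++cp;                                        // single-character token
        }
        ++count;
    }
    return count;
}
```

With these rules, the two inputs above give 1 and 3 tokens, and unquoted hello world gives 3 because the space is preserved as its own token.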
I created main.cpp and I added "scanQuotesAsString" code to 3 modules given:
scanner.cpp
scanner.h
scanpriv.h
Right now, "hello world" scans as 2 tokens, not skipping the space. If I add a call to skipSpaces, then regular input such as |hello world| without quotes skips spaces as well.
I think my issue is in scanner.cpp, where the last couple functions are:
/*
* Private method: scanToEndOfIdentifier
* Usage: finish = scanToEndOfIdentifier();
* ----------------------------------------
* This function advances the position of the scanner until it
* reaches the end of a sequence of letters or digits that make
* up an identifier. The return value is the index of the last
* character in the identifier; the value of the stored index
* cp is the first character after that.
*/
int Scanner::scanToEndOfIdentifier() {
while (cp < len && isalnum(buffer[cp])) {
if ((stringOption == ScanQuotesAsStrings) && (buffer[cp] == '"'))
break;
cp++;
}
return cp - 1;
}
/* Private functions */
/*
* Private method: scanQuotedString
* Usage: scanQuotedString();
* -------------------
* This function advances the position of the scanner until the
* current character is a double quotation mark
*/
void Scanner::scanQuotedString() {
while ((cp < len && (buffer[cp] == '"')) || (cp < len && (buffer[cp] == '"'))){
cp++;
}
}
Here is main.cc
#include "genlib.h"
#include "simpio.h"
#include "scanner.h"
#include <iostream>
/* Private function prototypes */
int CountTokens(string str);
int main() {
cout << "Please enter a sentence: ";
string str = GetLine();
int num = CountTokens(str);
cout << "You entered " << num << " tokens." << endl;
return 0;
}
int CountTokens(string str) {
int count = 0;
Scanner scanner; // create new scanner object
scanner.setInput(str); // initialize the input to be scanned
//scanner.setSpaceOption(Scanner::PreserveSpaces);
scanner.setStringOption(Scanner::ScanQuotesAsStrings);
while (scanner.hasMoreTokens()) { // read tokens from the scanner
scanner.nextToken();
count++;
}
return count;
}
Here's scanner.cpp
/*
* File: scanner.cpp
* -----------------
* Implementation for the simplified Scanner class.
*/
#include "genlib.h"
#include "scanner.h"
#include <cctype>
#include <iostream>
/*
* The details of the representation are inaccessible to the client,
* but consist of the following fields:
*
* buffer -- String passed to setInput
* len -- Length of buffer, saved for efficiency
* cp -- Current character position in the buffer
* spaceOption -- Setting of the space option extension
*/
Scanner::Scanner() {
buffer = "";
spaceOption = PreserveSpaces;
}
Scanner::~Scanner() {
/* Empty */
}
void Scanner::setInput(string str) {
buffer = str;
len = buffer.length();
cp = 0;
}
/*
* Implementation notes: nextToken
* -------------------------------
* The code for nextToken follows from the definition of a token.
*/
string Scanner::nextToken() {
if (cp == -1) {
Error("setInput has not been called");
}
if (stringOption == ScanQuotesAsStrings) scanQuotedString();
if (spaceOption == IgnoreSpaces) skipSpaces();
int start = cp;
if (start >= len) return "";
if (isalnum(buffer[cp])) {
int finish = scanToEndOfIdentifier();
return buffer.substr(start, finish - start + 1);
}
cp++;
return buffer.substr(start, 1);
}
bool Scanner::hasMoreTokens() {
if (cp == -1) {
Error("setInput has not been called");
}
if (stringOption == ScanQuotesAsStrings) scanQuotedString();
if (spaceOption == IgnoreSpaces) skipSpaces();
return (cp < len);
}
void Scanner::setSpaceOption(spaceOptionT option) {
spaceOption = option;
}
Scanner::spaceOptionT Scanner::getSpaceOption() {
return spaceOption;
}
void Scanner::setStringOption(stringOptionT option) {
stringOption = option;
}
Scanner::stringOptionT Scanner::getStringOption() {
return stringOption;
}
/* Private functions */
/*
* Private method: skipSpaces
* Usage: skipSpaces();
* -------------------
* This function advances the position of the scanner until the
* current character is not a whitespace character.
*/
void Scanner::skipSpaces() {
while (cp < len && isspace(buffer[cp])) {
cp++;
}
}
/*
* Private method: scanToEndOfIdentifier
* Usage: finish = scanToEndOfIdentifier();
* ----------------------------------------
* This function advances the position of the scanner until it
* reaches the end of a sequence of letters or digits that make
* up an identifier. The return value is the index of the last
* character in the identifier; the value of the stored index
* cp is the first character after that.
*/
int Scanner::scanToEndOfIdentifier() {
while (cp < len && isalnum(buffer[cp])) {
if ((stringOption == ScanQuotesAsStrings) && (buffer[cp] == '"'))
break;
cp++;
}
return cp - 1;
}
/* Private functions */
/*
* Private method: scanQuotedString
* Usage: scanQuotedString();
* -------------------
* This function advances the position of the scanner until the
* current character is a double quotation mark
*/
void Scanner::scanQuotedString() {
while ((cp < len && (buffer[cp] == '"')) || (cp < len && (buffer[cp] == '"'))){
cp++;
}
scanner.h
/*
* File: scanner.h
* ---------------
* This file is the interface for a class that facilitates dividing
* a string into logical units called "tokens", which are either
*
* 1. Strings of consecutive letters and digits representing words
* 2. One-character strings representing punctuation or separators
*
* To use this class, you must first create an instance of a
* Scanner object by declaring
*
* Scanner scanner;
*
* You initialize the scanner's input stream by calling
*
* scanner.setInput(str);
*
* where str is the string from which tokens should be read.
* Once you have done so, you can then retrieve the next token
* by making the following call:
*
* token = scanner.nextToken();
*
* To determine whether any tokens remain to be read, you can call
* the predicate method scanner.hasMoreTokens(). The nextToken
* method returns the empty string after the last token is read.
*
* The following code fragment serves as an idiom for processing
* each token in the string inputString:
*
* Scanner scanner;
* scanner.setInput(inputString);
* while (scanner.hasMoreTokens()) {
* string token = scanner.nextToken();
* . . . process the token . . .
* }
*
* This version of the Scanner class includes an option for skipping
* whitespace characters, which is described in the comments for the
* setSpaceOption method.
*/
#ifndef _scanner_h
#define _scanner_h
#include "genlib.h"
/*
* Class: Scanner
* --------------
* This class is used to represent a single instance of a scanner.
*/
class Scanner {
public:
/*
* Constructor: Scanner
* Usage: Scanner scanner;
* -----------------------
* The constructor initializes a new scanner object. The scanner
* starts empty, with no input to scan.
*/
Scanner();
/*
* Destructor: ~Scanner
* Usage: usually implicit
* -----------------------
* The destructor deallocates any memory associated with this scanner.
*/
~Scanner();
/*
* Method: setInput
* Usage: scanner.setInput(str);
* -----------------------------
* This method configures this scanner to start extracting
* tokens from the input string str. Any previous input string is
* discarded.
*/
void setInput(string str);
/*
* Method: nextToken
* Usage: token = scanner.nextToken();
* -----------------------------------
* This method returns the next token from this scanner. If
* nextToken is called when no tokens are available, it returns the
* empty string.
*/
string nextToken();
/*
* Method: hasMoreTokens
* Usage: if (scanner.hasMoreTokens()) . . .
* ------------------------------------------
* This method returns true as long as there are additional
* tokens for this scanner to read.
*/
bool hasMoreTokens();
/*
* Methods: setSpaceOption, getSpaceOption
* Usage: scanner.setSpaceOption(option);
* option = scanner.getSpaceOption();
* ------------------------------------------
* This method controls whether this scanner
* ignores whitespace characters or treats them as valid tokens.
* By default, the nextToken function treats whitespace characters,
* such as spaces and tabs, just like any other punctuation mark.
* If, however, you call
*
* scanner.setSpaceOption(Scanner::IgnoreSpaces);
*
* the scanner will skip over any white space before reading a
* token. You can restore the original behavior by calling
*
* scanner.setSpaceOption(Scanner::PreserveSpaces);
*
* The getSpaceOption function returns the current setting
* of this option.
*/
enum spaceOptionT { PreserveSpaces, IgnoreSpaces };
void setSpaceOption(spaceOptionT option);
spaceOptionT getSpaceOption();
/*
* Methods: setStringOption, getStringOption
* Usage: scanner.setStringOption(option);
* option = scanner.getStringOption();
* --------------------------------------------------
* This method controls how the scanner reads double quotation marks
* as input. The default is set to treat quotes just like any other
* punctuation character:
* scanner.setStringOption(Scanner::ScanQuotesAsPunctuation);
*
* Otherwise, the option:
* scanner.setStringOption(Scanner::ScanQuotesAsStrings);
*
* the token starting with a quotation mark will be scanned until
* another quotation mark is found (closing quotation). Therefore
* the entire string within the quotation, including both quotation
* marks counts as 1 token.
*/
enum stringOptionT { ScanQuotesAsPunctuation, ScanQuotesAsStrings };
void setStringOption(stringOptionT option);
stringOptionT getStringOption();
private:
#include "scanpriv.h"
};
#endif
And finally scanpriv.h
/*
* File: scanpriv.h
* ----------------
* This file contains the private data for the simplified version
* of the Scanner class.
*/
/* Instance variables */
string buffer; /* The string containing the tokens */
int len; /* The buffer length, for efficiency */
int cp; /* The current index in the buffer */
spaceOptionT spaceOption; /* Setting of the space option */
stringOptionT stringOption;
/* Private method prototypes */
void skipSpaces();
int scanToEndOfIdentifier();
void scanQuotedString();
Too long to read.
Two ways of parsing quoted text:
0) State
A simple switch that tells whether you are in quotes right now, and which activates some special quotation handling. This would basically be equivalent to #1), just inline.
1) Sub-Rule in Recursive Descent Scanner
Put the state away and write a separate rule for scanning quoted text. The code would actually be quite simple (C++ inspired p-code):
// assume we are one behind the opening quotation mark
for (c : text) {
if (is_escape (*c)) { // to support stuff like "foo's name is \"bar\""
p = peek(c); // p refers to the character after the escape character
if (!is_valid_escape_character (*p)) error;
else {
make the peeked character (*p) part of the result;
++c;
}
}
else if (is_quotation_mark (*c))
{
return the result; // we approached the end of the string
}
else if (!is_valid_character (*c))
{
error; // maybe you want to forbid literal control characters
}
else
{
make *c part of the result
}
}
error; // reached end of input before closing quotation mark
If you do not want to support escape characters, the code gets simpler:
// assume we are one behind the opening quotation mark
for (c : text) {
if (is_quotation_mark (*c))
return the result;
else if (!is_valid_character (*c))
error;
else
make *c part of the result
}
error; // reached end of input before closing quotation mark
You should not omit the check for invalid characters, as this would invite users to exploit your code and possibly trigger undefined behavior in your program.
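For reference, the escape-aware rule above could look roughly like this in plain C++. This is only a sketch under a few assumptions: scanQuotedBody is a hypothetical free function (it is not part of the Scanner class from the question), only \" and \\ are treated as valid escapes, and "invalid character" is taken to mean an ASCII control character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the Scanner class in the question.
// `pos` must index the first character after the opening quotation mark.
// On success, stores the unescaped contents in `result`, leaves `pos` one
// past the closing quotation mark, and returns true. Returns false on error.
bool scanQuotedBody(const std::string& text, std::size_t& pos, std::string& result) {
    result.clear();
    while (pos < text.size()) {
        const char c = text[pos];
        if (c == '\\') {
            // Escape: peek at the next character; only \" and \\ are accepted here.
            if (pos + 1 >= text.size()) return false;
            const char next = text[pos + 1];
            if (next != '"' && next != '\\') return false;
            result += next;
            pos += 2;
        } else if (c == '"') {
            ++pos;           // consume the closing quotation mark
            return true;
        } else if (static_cast<unsigned char>(c) < 0x20) {
            return false;    // assumption: literal control characters are forbidden
        } else {
            result += c;
            ++pos;
        }
    }
    return false;            // reached end of input before the closing quotation mark
}
```

The caller is expected to have already consumed the opening quotation mark. If the token should include both quotation marks, as the comment in scanner.h describes, the caller can take the substring from the opening mark up to the final value of pos instead of using result.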
From a quick glance over the code: in ScanQuotesAsStrings mode, your scanner seems to expect no tokens other than quoted strings. Rather, the difference should be that when you see a token that begins with '"', you hand off to a separate sub-scanner.
In pseudocode (using the C++ "end iterator is one-past-the-end" idiom):
current_token.begin = cursor;
current_token.end = current_token.begin + 1;
if(scan_quotes_as_strings && *current_token.begin == '"') {
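while(*current_token.end && *current_token.end != '"')
++current_token.end;
if(*current_token.end) ++current_token.end; // step past the closing quote so it is part of the token
return;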
}
while(*current_token.end && *current_token.end != ' ')
++current_token.end;
You can combine these two loops to a single one by introducing a state variable rather than expressing the scanner state with different code paths.
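A minimal sketch of the single-loop variant, assuming the same simplified token rule as the pseudocode (tokens are separated by whitespace, and a quoted string including both quotation marks is one token). The function name tokenize and the flag inQuotes are made up for illustration; this does not reproduce the letters-and-digits versus punctuation rule of the Scanner class in the question.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical free function: splits on whitespace, but keeps a quoted
// string (including both quotation marks) together as a single token.
std::vector<std::string> tokenize(const std::string& input) {
    std::vector<std::string> tokens;
    std::string current;
    bool inQuotes = false;  // the state variable replacing the second loop

    for (char c : input) {
        if (inQuotes) {
            current += c;
            if (c == '"') {              // closing quote ends the token
                tokens.push_back(current);
                current.clear();
                inQuotes = false;
            }
        } else if (c == '"') {
            if (!current.empty()) {      // a quote also ends a pending word
                tokens.push_back(current);
                current.clear();
            }
            current += c;
            inQuotes = true;
        } else if (std::isspace(static_cast<unsigned char>(c))) {
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current += c;
        }
    }
    // Trailing word, or an unterminated quoted string kept as-is.
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

Whether an unterminated quoted string should be returned as a token or reported as an error is a design choice; the sketch simply returns what it has collected.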
Also,
while ((cp < len && (buffer[cp] == '"')) || (cp < len && (buffer[cp] == '"'))) ...
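just looks fishy: both sides of the || are identical, and the loop advances while the current character is a quotation mark, which is the opposite of what the comment on scanQuotedString describes.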
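To make the sub-scanner idea concrete for the token rules in the question (runs of letters and digits, one-character punctuation, and optionally quoted strings), here is a rough sketch. It is written as a hypothetical free function, nextTokenSketch, rather than as a change to Scanner::nextToken, and it assumes that an unterminated quoted string simply runs to the end of the input.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for Scanner::nextToken, written as a free function.
// `cp` plays the role of the Scanner's current position and is advanced
// past the returned token. Returns "" when no input remains.
std::string nextTokenSketch(const std::string& buffer, std::size_t& cp,
                            bool scanQuotesAsStrings) {
    if (cp >= buffer.size()) return "";
    const std::size_t start = cp;

    if (scanQuotesAsStrings && buffer[cp] == '"') {
        ++cp;                                   // step past the opening quotation mark
        while (cp < buffer.size() && buffer[cp] != '"') ++cp;
        if (cp < buffer.size()) ++cp;           // include the closing mark if present
        return buffer.substr(start, cp - start);
    }
    if (std::isalnum(static_cast<unsigned char>(buffer[cp]))) {
        while (cp < buffer.size() &&
               std::isalnum(static_cast<unsigned char>(buffer[cp]))) ++cp;
        return buffer.substr(start, cp - start);
    }
    ++cp;                                       // any other character is a one-character token
    return buffer.substr(start, 1);
}
```

The point of the sketch is that the quote check happens once, at the start of a token, and the loop runs until it finds a quotation mark instead of while it sees one.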