Strings from Excel to utf-8 mysql - c++

I am writing some software that takes rows from an XLS file and inserts them into a database.
In OpenOffice, a cell looks like this :
Brunner Straße, Parzelle
I am using the ExcelFormat library from CodeProject.
int type = cell->Type();
cout << "Cell contains " << type << endl;
const char* cellCharPtr = cell->GetString();
if (cellCharPtr != 0) {
value.assign(cellCharPtr);
cout << "normal string -> " << value << endl;
}
When fetched with the library, the string is returned as a char* (so cell->Type() returns STRING, not WSTRING) and now looks like this (on the console) :
normal string -> Brunner Stra�e, Parzelle
hex string -> 42 72 75 6e 6e 65 72 20 53 74 72 61 ffffffdf 65 2c 20 50 61 72 7a 65 6c 6c 65
I insert it into the database using the mysql cpp connector like so :
prep_stmt = con -> prepareStatement ("INSERT INTO "
+ tablename
+ "(crdate, jobid, imprownum, impid, impname, imppostcode, impcity, impstreet, imprest, imperror, imperrorstate)"
+ " VALUES(?,?,?,?,?,?,?,?,?,?,?)");
<...snip...>
prep_stmt->setString(8,vals["street"]);
<...snip...>
prep_stmt->execute();
Having inserted it into the database, which has a utf8_general_ci collation, it looks like this :
Brunner Stra
which is annoying.
How do I make sure that whatever locale the file is in gets transformed to utf-8 when the string is retrieved from the xls file?
This is going to be running as a backend for a web service, where clients can upload their own excel files, so "Change the encoding of the file in Libre Office" can't work, I am afraid.

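Your input seems to be encoded in latin1 (the byte 0xdf in your hex dump is ß in latin1, but on its own it is not a valid UTF-8 sequence, which is why the stored value is cut off at that point), so you need to set the MySQL "connection charset" to latin1.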
I'm not familiar with the API you are using to connect to MySQL. In other APIs you'd add charset=latin1 to the connection URL or call an API function to set the connection encoding.
Alternatively you can recode the input before feeding it to MySQL.
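For the recoding route, here is a minimal sketch. It assumes the bytes really are ISO-8859-1 (latin1); files written by Excel on Windows may be Windows-1252 instead, which differs in the 0x80-0x9F range and would need a lookup table. The function name latin1ToUtf8 is made up for this example and is not part of ExcelFormat or the MySQL connector.

```cpp
#include <string>

// Recode ISO-8859-1 (latin1) bytes as UTF-8.
// Every latin1 byte maps directly to the Unicode code point of the same value,
// so bytes >= 0x80 become a two-byte UTF-8 sequence.
std::string latin1ToUtf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte expands to two
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));                  // plain ASCII
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation byte
        }
    }
    return out;
}
```

You would then pass latin1ToUtf8(vals["street"]) to setString instead of the raw value, and leave the connection charset as utf8.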

Related

Printing elements of a tuple

I am trying to print the elements of a tuple returned by a function where I am comparing the elements of a vector of addresses to those in a database. The fields are: 32-bit int representing the address, int for prefix matching, string containing ASN, string containing matching address, string containing the original address being queried.
for (auto itr = IPs.begin(); itr != IPs.end(); itr++) {
tuple<int,int,string,string,string> entry = Compare(*itr, database);
string out = get<3>(entry) + "/" + to_string(get<1>(entry)) + " " + get<2>(entry) + " " + get<4>(entry) + "\n";
cout << out;
}
I want each line of the output to look like this:
"{prefix}/{# bits of prefix} {ASN} {address}\n"
However, the output looks like this:
12.105.69.1528 15314
12.125.142.190 6402
57.0.208.2450 6085
208.148.84.30 4293
208.148.84.16 4293
208.152.160.797 5003
192.65.205.2509 5400
194.191.154.806 2686
199.14.71.79 1239
199.14.70.79 1239
The expected output is:
12.105.69.144/28 15314 12.105.69.152
12.125.142.16/30 6402 12.125.142.19
57.0.208.244/30 6085 57.0.208.245
208.148.84.0/30 4293 208.148.84.3
208.148.84.0/24 4293 208.148.84.16
208.152.160.64/27 5003 208.152.160.79
192.65.205.248/29 5400 192.65.205.250
194.191.154.64/26 2686 194.191.154.80
199.14.71.0/24 1239 199.14.71.79
199.14.70.0/24 1239 199.14.70.79
The part that confuses me the most is the fact that when I print each element on separate lines by replacing each separator with line breaks, it prints the elements correctly:
12.105.69.144
28
15314
12.105.69.152
12.125.142.16
30
6402
12.125.142.19
57.0.208.244
30
6085
57.0.208.245
208.148.84.0
30
4293
208.148.84.3
208.148.84.0
24
4293
208.148.84.16
208.152.160.64
27
5003
208.152.160.79
192.65.205.248
29
5400
192.65.205.250
194.191.154.64
26
2686
194.191.154.80
199.14.71.0
24
1239
199.14.71.79
199.14.70.0
24
1239
199.14.70.79
I suppose that I could just write another function that formats the line breaks into the correct format afterwards, but I am curious about what is causing this. Any ideas?
Could you provide a little more code, so it can be debugged to precisely track the problem?
I think tuple and get are used correctly.
I guess the problem is in the content of the strings, or at least in the string returned by get<2>(entry).
Here is a little example which shows what might be wrong
std::string aa = "AAAAA\r"; //"\r" is extra character in aa string
std::string bb = "bbb";
std::cout << aa + " " + bb; //output is " bbbA" not "AAAAA bbb"
The problem obviously doesn't occur when each string is printed on its own line.
Double-check whether the strings returned by get<X> contain any special characters, or an old Mac-style end of line (\r) mixed with Linux (\n) or Windows (\r\n) ones.
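If a stray "\r" does turn out to be the cause (for example because the database file has Windows line endings), one way to deal with it is to trim the strings right after reading them. This is only a sketch: the helper name stripTrailingEol is made up, and it assumes the unwanted characters sit at the end of the string.

```cpp
#include <string>

// Remove any trailing carriage returns or line feeds, e.g. the "\r" that
// std::getline leaves behind when reading a CRLF file on Linux.
std::string stripTrailingEol(std::string s) {
    while (!s.empty() && (s.back() == '\r' || s.back() == '\n')) {
        s.pop_back();
    }
    return s;
}
```

Applying it to each field before building the output line (or once per line when loading the database) should keep the cursor from jumping back to the start of the line.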

Getting full name and values from string with regex and c++

I have a project where I am reading data from a text file in C++ which contains a person's name and up to 4 numbers, like this (each line has one entry):
Dave Light 89 71 91 89
Hua Tran Du 81 79 80
I am wondering if regex would be an efficient way of splitting the name and numerical values or if I should find an alternative method.
I would also like to be able to pick up any errors in the text file when reading each entry, such as a letter instead of a number, as in an entry like this:
Andrew Van Den J 88 95 85
You would be better off using a separator other than a space. The separator could be :, |, ^ or anything that cannot be part of your data. With this approach, your data would be stored as:
Dave Light:89:71:91:89
Hua Tran Du:81:79:80
And then you can use find, find_first_of, strchr or strstr or any other searching (and re-searching) to find relevant data.
This non-regex solution:
std::string str = "Dave Light 89 71 91 89";
std::size_t firstDig = str.find_first_of("0123456789");
std::string str1 = str.substr (0,firstDig);
std::string str2 = str.substr (firstDig);
would give you the letter part in str1 and the number part in str2.
Check this code at ideone.com.
It sounds like it's something like this you want...(?) I'm not quite sure what kind of errors you mean to pick. As paxdiablo pointed out, a name could be quite complex, so getting the letter part probably would be the safest.
Try this code.
#include <iostream>
#include <regex>
#include <string>
#include <vector>
int main(){
std::vector<std::string> data {"Dave Light 89 71 91 ","Hua Tran Du 81 79 80","zyx 1 2 3 4","zyx 1 2"};
std::regex pat {R"((^[A-Za-z\s]*)(\d+)\s*(\d+)\s*(\d+)(\s*)$)"};
for(auto& line : data) {
std::cout<<line<<std::endl;
std::smatch matches; // matched strings go here
if (regex_search(line, matches, pat)) {
//std::cout<<"size:"<<matches.size()<<std::endl;
if (matches.size()==6)
std::cout<<"Name:"<<matches[1].str()<<"\t"<<"data1:"<<matches[2].str()<<"\tdata2:"<<matches[3].str()<<"\tdata3:"<<matches[4].str()<<std::endl;
}
}
}
With a regex, the amount of code is greatly reduced. The main trick is choosing the right pattern.
Hope this will help you.
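For comparison, here is a sketch of a non-regex parser that also reports malformed entries. It rests on assumptions that are not in the question: names contain no all-digit words, a valid entry has between 1 and 4 numbers, and any non-numeric token after the first number is an error. The Entry struct and parseEntry are names invented for this example. Note that a stray letter before the first number (as in "Andrew Van Den J 88 95 85") cannot be told apart from part of the name without a stricter rule, such as requiring exactly 4 numbers.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

struct Entry {
    std::string name;
    std::vector<int> scores;
};

// Returns true when the line looks like "<name words> <1 to 4 numbers>".
bool parseEntry(const std::string& line, Entry& entry) {
    std::istringstream in(line);
    std::string token;
    entry = Entry{};
    while (in >> token) {
        bool numeric = true;
        for (unsigned char c : token) {
            if (!std::isdigit(c)) { numeric = false; break; }
        }
        if (numeric) {
            if (token.size() > 9) return false;  // too long to fit an int safely
            entry.scores.push_back(std::stoi(token));
        } else if (!entry.scores.empty()) {
            return false;  // letters (or mixed junk) where a number was expected
        } else {
            if (!entry.name.empty()) entry.name += ' ';
            entry.name += token;
        }
    }
    return !entry.name.empty() && !entry.scores.empty() && entry.scores.size() <= 4;
}
```

Calling parseEntry on each line read with std::getline lets you log or skip the lines for which it returns false.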

display list with new line in a cell

I use JXLS with templates to generate excels files.
It worked pretty well.
However, I would like to know if there is a way to display a list of Strings in a single cell, with a new line after each element, instead of displaying a new cell for each element.
Example: I have a list of employees
${employees.name} would give me :
employee 01
employee 02
employee 03
instead of :
employee 01
employee 02
employee 03
You should create a StringBuilder with "\n" (break line) in your string:
//In your Object
public String getAsString() {
StringBuilder sb = new StringBuilder();
sb.append("line1").append("\n").append("line2").append("\n");
return sb.toString();
}
//In your Sheet
${object.asString}

Regex to detect ASCII art on a single line.

Basically I want to find ASCII art on one line. For me this is any run of 2 or more characters that are not alphanumeric, ignoring whitespace. So a line might look like :
This is a !# Test of --> ASCII art detection ### <--
So the matches I should get are :
!#
-->
###
<--
I came up with this which still selects spaces :(
\b\W{2,}
I'm using the following website for testing :
http://gskinner.com/RegExr/
Thanks for the help, it's much appreciated!!
I'd suggest something like this:
[^\w\s]{2,}
This will match any sequence of two or more characters that are not word characters (which include alphanumeric characters and underscores) or whitespace characters.
If you would also like to match underscores as part of your 'ASCII art', you'd have to be more specific:
[^a-zA-Z0-9\s]{2,}
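If you want to try the first pattern outside an online tester, a small C++ sketch with std::regex (default ECMAScript grammar) could look like this. The function name findAsciiArt is just for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every run of two or more characters that are neither
// word characters nor whitespace.
std::vector<std::string> findAsciiArt(const std::string& line) {
    static const std::regex pattern(R"([^\w\s]{2,})");
    std::vector<std::string> found;
    for (std::sregex_iterator it(line.begin(), line.end(), pattern), end; it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

For the sample line in the question this should return the four runs listed there, in order.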
I think this
((?=[\x21-\x7e])[\W_]){2,}
is probably equivalent to this
[[:punct:]]{2,}
Using POSIX, the supported punctuation is:
(to add more, just add it to the class: [[:punct:]<add here>]{2,})
33 = !
34 = "
35 = #
36 = $
37 = %
38 = &
39 = '
40 = (
41 = )
42 = *
43 = +
44 = ,
45 = -
46 = .
47 = /
58 = :
59 = ;
60 = <
61 = =
62 = >
63 = ?
64 = @
91 = [
92 = \
93 = ]
94 = ^
95 = _
96 = `
123 = {
124 = |
125 = }
126 = ~

How to extract a string that is present between two brackets?

For example if the string is:
XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]
The output should be:
20 BB EC 45 40 C8 97 20 84 8B 10
int main()
{
char input = "XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]";
char output[500];
// what to write here so that i can get the desired output as:
// output = "20 BB EC 45 40 C8 97 20 84 8B 10"
return 0;
}
In C, you could do this with a scanset conversion (though it's a bit RE-like, so the syntax gets a bit strange):
sscanf(input, "[%*[^]]][%[^]]]", second_string);
In case you're wondering how that works, the first [ matches an open bracket literally. Then you have a scanset, which looks like %[allowed_chars] or %[^not_allowed_chars]. In this case, you're scanning up to the first ], so it's %[^]]. In the first one, we have a * between the % and the rest of the conversion specification, which means sscanf will try to match that pattern, but ignore it -- not assign the result to anything. That's followed by a ] that gets matched literally.
Then we repeat essentially the same thing over again, but without the *, so the second data that's matched by this conversion gets assigned to second_string.
With the typo fixed and a bit of extra code added to skip over the initial XYZ ::, working (tested) code looks like this:
#include <stdio.h>
int main() {
    const char *input = "XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]";
    char second_string[64];
    /* %63[^]] caps the copy at 63 characters plus the terminating NUL */
    if (sscanf(input, "%*[^[][%*[^]]][%63[^]]]", second_string) == 1)
        printf("content: %s\n", second_string);
    return 0;
}
Just find the second [ and start extracting (or just printing) until next ]....
You can use string::substr if you are willing to convert to std::string
If you don't know the location of brackets, you can use string::find_last_of for the last bracket and again string::find_last_of to find the open bracket.
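As a rough sketch of that idea, assuming the text you want is always the last bracketed group on the line (lastBracketed is a made-up name):

```cpp
#include <string>

// Returns the text inside the last "[...]" pair, or an empty string
// when no such pair exists.
std::string lastBracketed(const std::string& line) {
    const std::size_t close = line.find_last_of(']');
    if (close == std::string::npos) return std::string();
    // Search backwards from the closing bracket for the matching '['.
    const std::size_t open = line.find_last_of('[', close);
    if (open == std::string::npos) return std::string();
    return line.substr(open + 1, close - open - 1);
}
```

If you need a char array afterwards, you can copy the result out with strncpy, keeping the size of your output buffer in mind.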
Well, say, your file looks like this:
XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]
XYZ ::[1][Maybe some other text]
XYZ ::[1][Some numbers maybe: 123 98345 123 9-834 ]
XYZ ::[1][blah-blah-blah]
The code that will extract the data will look something like this:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main()
{
//opening the file to read from
std::ifstream file( "in.txt" );
if( !file.is_open() )
{
cout << "Cannot open the file";
return -1;
}
std::string in, out;
int blockNumber = 1;//Which bracket block we are looking for. We are currently looking for the second one.
while( getline( file, in ) )
{
int n = 0;//Variable for storing index in the string (where our target text starts)
int i = 0;//Counter for [] blocks we have encountered.
while( i <= blockNumber )
{
//What we are doing here is searching for the position of [ symbol, starting
//from the n + 1'st symbol of the string.
n = in.find_first_of('[', n + 1);
i++;
}
//Getting our data and printing it.
out = in.substr( n + 1, ( in.find_first_of(']', n) - n - 1) );
std::cout << out << std::endl;
}
return 0;
}
The output after executing this will be:
20 BB EC 45 40 C8 97 20 84 8B 10
Maybe some other text
Some numbers maybe: 123 98345 123 9-834
blah-blah-blah
The simplest solution is something along the lines of:
std::string
match( std::string const& input )
{
static boost::regex const matcher( ".*\\[[^]]*\\]\\[(.*)\\]" );
boost::smatch matched;
return regex_match( input, matched, matcher )
? matched[1]
: std::string();
}
The regular expression looks a bit complicated because you need to match
meta-characters, and because the compiler I use doesn't support raw
strings yet. (With raw strings, I think the expression would be
R"^(.*\[[^]]*\]\[(.*)\])^". But I can't verify that.)
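With a compiler that does have C++11 raw strings and std::regex, a version without Boost might look like the sketch below. Two things here are my own choices rather than part of the Boost code: the function name secondBracketGroup, and escaping the ] inside the character class ([^\]] instead of [^]]), because the default ECMAScript grammar of std::regex may not treat a leading ] as a literal the way Boost's Perl syntax does.

```cpp
#include <regex>
#include <string>

// Returns the contents of the second "[...]" group, or an empty string
// when the whole input does not match the expected shape.
std::string secondBracketGroup(const std::string& input) {
    static const std::regex matcher(R"(.*\[[^\]]*\]\[(.*)\])");
    std::smatch matched;
    return std::regex_match(input, matched, matcher)
        ? matched[1].str()
        : std::string();
}
```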
This returns an empty string in case there is no match; if you're sure
about the format, you might prefer to throw an exception. You can also
extend it to do as much error checking as necessary: in general, the
more you validate a text input, the better it is, but you didn't give
precise enough information about what was legal for me to fill it out
completely. (For your example string, for example, you might replace
the ".*" at the beginning of the regular expression with
"\\u{3}\\s*::": three upper case characters followed by zero or more
whitespace, then two ':'. Or the first [] group might be
"\\[\\d\\]", if you're certain it's always a single digit.)
This could work for you in a very specific sense:
std::string str(input);
std::size_t open = str.find_last_of('[');
std::size_t close = str.find_last_of(']');
// assumes both brackets are present and '[' comes before ']'
std::string output = str.substr(open + 1, close - open - 1);
This only works if the bracketed string you want is the last one on the line, so you probably need to define your question a little better as well.
Using the C string library. I'll give a code snippet that processes a single line, which can be used in a loop that reads the file line by line. NOTE: string.h should be included
int length = strlen( input );
char* output = 0;
// Search
char* firstBr = strchr( input, '[' );
if( 0 != firstBr++ ) // check for null pointer
{
char* secondBr = strchr( firstBr, '[' );
// we don't need '['
if( 0 != secondBr++ )
{
int nOutLen = strlen( secondBr ) - 1;
if( 0 < nOutLen )
{
output = new char[nOutLen+1];
strncpy( output, secondBr, nOutLen );
output[ nOutLen ] = '\0';
}
}
}
if( 0 != output )
{
cout << output;
delete[] output;
output = 0;
}
else
{
cout << "Error!";
}
You could use this scanf format (a scanset rather than a true regex) to get what is inside "<" and ">":
// Format: "<%999[^>]>" (reads at most 999 bytes, so dest needs room for 1000)
int n1 = sscanf(source, "<%999[^>]>", dest);