Norwegian characters in C++ strings

I'm working with strings in C++ and have a question about how Norwegian characters are treated.
If I run the following code:
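#include <iostream>
#include <string>
using namespace std;
int main()
{
    string norwegian = "BLÅBÆRSYLTETØY";
    // Print each byte of the string next to its numeric value
    for (auto &c : norwegian)
        cout << c << " => " << static_cast<int>(c) << endl;
    return 0;
}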
The output at cmd becomes:
B => 66
L => 76
┼ => -59
B => 66
ã => -58
R => 82
S => 83
Y => 89
L => 76
T => 84
E => 69
T => 84
Ï => -40
Y => 89
Notice that the three Norwegian characters are not printed correctly, and that their numeric values are negative.
Is there any way to treat the string so that it uses the correct character map?
EDIT
The solution is to change the code page from ANSI to UTF-7, which can be done by adding this before the code that does the string handling:
system("chcp 65000");
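On the negative numbers: the output above suggests that char is signed with this compiler, so bytes above 127 print as negative values. A minimal sketch, assuming you only want to inspect the raw byte values, is to go through unsigned char first. The helper name byte_values is made up for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns each byte of the string as a value in 0..255,
// regardless of whether plain char is signed on the current platform.
std::vector<int> byte_values(const std::string& text) {
    std::vector<int> values;
    values.reserve(text.size());
    for (char c : text) {
        // Converting via unsigned char avoids the sign extension seen above.
        values.push_back(static_cast<int>(static_cast<unsigned char>(c)));
    }
    return values;
}
```

With this, the three bytes shown as -59, -58 and -40 come out as 197, 198 and 216, which are the Windows-1252 / Latin-1 codes for Å, Æ and Ø.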

Related

Extract number from a text after symbol "X"

I have the following text in a column and need to extract the number that follows the second "X" or "x".
In the text below, it is 54.
40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY
Some other sample texts:
21sOE X 12sFL 56 X 36 63" PLAIN # Result: 36
40sC X 40sC_100 X 91_63" PLAIN # Result: 91
16sOE x 12sLY 84 x 48 71" 3/1 DRILL # Result: 48

Given:
40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY # Result: 54
21sOE X 12sFL 56 X 36 63" PLAIN # Result: 36
40sC X 40sC_100 X 91_63" PLAIN # Result: 91
16sOE x 12sLY 84 x 48 71" 3/1 DRILL # Result: 48
Use:
[Xx]\s?(\d+)(?:.(?![Xx]))*$
Demo and Explanation:
https://regex101.com/r/KshMUE/1
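Since the question does not name a language, here is a rough sketch of the same pattern in C++ with std::regex, whose default ECMAScript grammar supports the negative lookahead used here. The function name number_after_last_x is an assumption for illustration. Note that the pattern really picks the number after the last X or x on the line, which is the second one in all the samples.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the digits that follow the last "X"/"x"
// on the line, or an empty string if the pattern does not match.
std::string number_after_last_x(const std::string& text) {
    // [Xx]\s?(\d+)    : an X, an optional space, then the digits to capture
    // (?:.(?![Xx]))*$ : the rest of the line must not contain another X
    static const std::regex pattern(R"([Xx]\s?(\d+)(?:.(?![Xx]))*$)");
    std::smatch match;
    if (std::regex_search(text, match, pattern)) {
        return match[1].str();
    }
    return std::string();
}
```

Calling number_after_last_x on the first sample line should give "54".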

You didn't state which tool/language this is using, so it's hard to know for sure what to suggest.
However, if possible, I would consider splitting the string on the letter "x" (or "X"), as this makes the regex part much easier to follow. For example, in Ruby:
input = '40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY'
input.split(/x/i)[2][/\d+/]
By doing this split, we first extract only the desired section of the string (in this case, ' 54 54" AWM/C129-DOBY'), so the regex (/\d+/) becomes trivial.

Try this:
(?i)(?<=x[^x]{1,100}x.)\d+
(?i): case-insensitive
(?<=: start of positive look-behind
x[^x]{1,100}x.: an x followed by 1 to 100 characters other than x, followed by another x and any single character
): end of look-behind
\d+: one or more digits

Match EOL character?

I'm trying to capture the commands from a string of RTTTL commands like this:
2a4, 2e, 2d#, 2b4, 2a4, 2c, 2d, 2a#4, 2e., e, 1f4, 1a4, 1d#, 2e., d, 2c., b4, 1a4, 1p, 2a4, 2e, 2d#, 2b4, 2a4, 2c, 2d, 2a#4, 2e., e, 1f4, 1a4, 1d#, 2e., d, 2c., b4, 1a4
The regex I'm using is (\S+),|$ with global and multiline on, as I read that $ matches EOL when multiline mode is on, however this does not happen, and thus I cannot capture the last command 1a4, which ends the line. All the other commands are captured from the group.
What's the regex I should be using to capture the last command?

Just add a non-capturing group or a lookahead as shown below, and take the string you want from group index 1.
(\S+)(?:,|$)
OR
(\S+)(?=,|$)
With the lookahead you don't need a capturing group at all:
\S+(?=,|$)
(?=,|$) is a positive lookahead asserting that the match must be followed by a , or by the end-of-line anchor. \S+ matches one or more non-space characters.
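The original pattern fails because alternation has the lowest precedence: (\S+),|$ reads as "(\S+), or $", so the end-of-line branch matches an empty string and captures nothing. As a sketch of the grouped version in C++ (the question does not say which language is in use, and the function name rtttl_commands is made up here):

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every command captured by group 1 of
// (\S+)(?:,|$), including the last one that is followed by end of input.
std::vector<std::string> rtttl_commands(const std::string& line) {
    static const std::regex pattern(R"((\S+)(?:,|$))");
    std::vector<std::string> commands;
    for (std::sregex_iterator it(line.begin(), line.end(), pattern), end;
         it != end; ++it) {
        commands.push_back((*it)[1].str());
    }
    return commands;
}
```

For the input "2a4, 2e., 1a4" this should yield the three commands 2a4, 2e. and 1a4.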

Another solution (PHP): strip the whitespace, then split on commas.
$a = " 2a4, 2e, 2d#, 2b4, 2a4, 2c, 2d, 2a#4, 2e., e, 1f4, 1a4, 1d#, 2e., d, 2c., b4, 1a4, 1p, 2a4, 2e, 2d#, 2b4, 2a4, 2c, 2d, 2a#4, 2e., e, 1f4, 1a4, 1d#, 2e., d, 2c., b4, 1a4";
$r=explode(",",preg_replace("/\\s+/","",$a));
var_dump($r);
Output:
array (size=37)
0 => string '2a4' (length=3)
1 => string '2e' (length=2)
2 => string '2d#' (length=3)
3 => string '2b4' (length=3)
4 => string '2a4' (length=3)
5 => string '2c' (length=2)
6 => string '2d' (length=2)
7 => string '2a#4' (length=4)
8 => string '2e.' (length=3)
9 => string 'e' (length=1)
10 => string '1f4' (length=3)
11 => string '1a4' (length=3)
12 => string '1d#' (length=3)
13 => string '2e.' (length=3)
14 => string 'd' (length=1)
15 => string '2c.' (length=3)
16 => string 'b4' (length=2)
17 => string '1a4' (length=3)
18 => string '1p' (length=2)
19 => string '2a4' (length=3)
20 => string '2e' (length=2)
21 => string '2d#' (length=3)
22 => string '2b4' (length=3)
23 => string '2a4' (length=3)
24 => string '2c' (length=2)
25 => string '2d' (length=2)
26 => string '2a#4' (length=4)
27 => string '2e.' (length=3)
28 => string 'e' (length=1)
29 => string '1f4' (length=3)
30 => string '1a4' (length=3)
31 => string '1d#' (length=3)
32 => string '2e.' (length=3)
33 => string 'd' (length=1)
34 => string '2c.' (length=3)
35 => string 'b4' (length=2)
36 => string '1a4' (length=3)

Regex to detect ASCII art on a single line.

Basically I want to find ASCII art on one line. For me, this is any run of two or more characters that are not alphanumeric, ignoring whitespace. So a line might look like:
This is a !# Test of --> ASCII art detection ### <--
So the matches I should get are:
!#
-->
###
<--
I came up with this, which still selects spaces :(
\b\W{2,}
I'm using the following website for testing:
http://gskinner.com/RegExr/
Thanks for the help, it's much appreciated!

I'd suggest something like this:
[^\w\s]{2,}
This will match any sequence of two or more characters that are not word characters (which include alphanumeric characters and underscores) or whitespace characters.
If you would also like to match underscores as part of your 'ASCII art', you'd have to be more specific:
[^a-zA-Z0-9\s]{2,}
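For anyone wanting to try the first pattern outside an online tester, here is a small sketch in C++ with std::regex. The choice of C++ and the function name find_ascii_art are assumptions, as the question only mentions a regex testing site.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every run of two or more characters that are
// neither word characters nor whitespace.
std::vector<std::string> find_ascii_art(const std::string& line) {
    static const std::regex pattern(R"([^\w\s]{2,})");
    std::vector<std::string> runs;
    for (std::sregex_iterator it(line.begin(), line.end(), pattern), end;
         it != end; ++it) {
        runs.push_back(it->str());
    }
    return runs;
}
```

On the sample line from the question this should return the four runs !#, -->, ### and <--.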

I think this
((?=[\x21-\x7e])[\W_]){2,}
is probably equivalent to this
[[:punct:]]{2,}
Using POSIX, the supported punctuation is listed below
(to add more, just add it to the class: [[:punct:]<add here>]{2,})
33 = !
34 = "
35 = #
36 = $
37 = %
38 = &
39 = '
40 = (
41 = )
42 = *
43 = +
44 = ,
45 = -
46 = .
47 = /
58 = :
59 = ;
60 = <
61 = =
62 = >
63 = ?
64 = @
91 = [
92 = \
93 = ]
94 = ^
95 = _
96 = `
123 = {
124 = |
125 = }
126 = ~

Strings from Excel to utf-8 mysql

I am writing some software that takes rows from an XLS file and inserts them into a database.
In OpenOffice, a cell looks like this :
Brunner Straße, Parzelle
I am using the ExcelFormat library from CodeProject.
int type = cell->Type();
cout << "Cell contains " << type << endl;
const char* cellCharPtr = cell->GetString();
if (cellCharPtr != 0) {
value.assign(cellCharPtr);
cout << "normal string -> " << value << endl;
}
When fetched with the library, the string is returned as a char* (so cell->Type() returns STRING, not WSTRING) and now looks like this on the console:
normal string -> Brunner Stra�e, Parzelle
hex string -> 42 72 75 6e 6e 65 72 20 53 74 72 61 ffffffdf 65 2c 20 50 61 72 7a 65 6c 6c 65
I insert it into the database using the mysql cpp connector like so :
prep_stmt = con -> prepareStatement ("INSERT INTO "
+ tablename
+ "(crdate, jobid, imprownum, impid, impname, imppostcode, impcity, impstreet, imprest, imperror, imperrorstate)"
+ " VALUES(?,?,?,?,?,?,?,?,?,?,?)");
<...snip...>
prep_stmt->setString(8,vals["street"]);
<...snip...>
prep_stmt->execute();
Having inserted it into the database, which has a utf8_general_ci collation, it looks like this :
Brunner Stra
which is annoying.
How do I make sure that whatever locale the file is in gets transformed to utf-8 when the string is retrieved from the xls file?
This is going to be running as a backend for a web service, where clients can upload their own excel files, so "Change the encoding of the file in Libre Office" can't work, I am afraid.

Your input seems to be encoded in latin1, so you need to set the mysql "connection charset" to latin1.
I'm not familiar with the API you are using to connect to MySQL. In other APIs you'd add charset=latin1 to the connection URL or call an API function to set the connection encoding.
Alternatively you can recode the input before feeding it to MySQL.
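As a rough sketch of the recoding option, assuming the bytes really are ISO-8859-1 (the 0xDF for ß in the hex dump is consistent with that, but Windows-1252 files would differ in the 0x80 to 0x9F range), each byte can be expanded to UTF-8 by hand. The function name latin1_to_utf8 is made up for illustration.

```cpp
#include <string>

// Hypothetical helper: converts an ISO-8859-1 (Latin-1) byte string to UTF-8.
// Assumes every input byte is a Latin-1 code point (U+0000..U+00FF).
std::string latin1_to_utf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size() * 2);
    for (unsigned char byte : latin1) {
        if (byte < 0x80) {
            // ASCII range is identical in both encodings.
            utf8.push_back(static_cast<char>(byte));
        } else {
            // U+0080..U+00FF becomes a two-byte sequence: 110xxxxx 10xxxxxx.
            utf8.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return utf8;
}
```

The converted string could then be passed to setString in place of the raw cell value, with the connection charset left at utf8.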

How to extract a string that is present between two brackets?

For example if the string is:
XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]
The output should be:
20 BB EC 45 40 C8 97 20 84 8B 10
int main()
{
char input[] = "XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]";
char output[500];
// what to write here so that i can get the desired output as:
// output = "20 BB EC 45 40 C8 97 20 84 8B 10"
return 0;
}

In C, you could do this with a scanset conversion (though it's a bit RE-like, so the syntax gets a bit strange):
sscanf(input, "[%*[^]]][%[^]]]", second_string);
In case you're wondering how that works, the first [ matches an open bracket literally. Then you have a scanset, which looks like %[allowed_chars] or %[^not_allowed_chars]. In this case, you're scanning up to the first ], so it's %[^]]. In the first one, we have a * between the % and the rest of the conversion specification, which means sscanf will try to match that pattern, but ignore it -- not assign the result to anything. That's followed by a ] that gets matched literally.
Then we repeat essentially the same thing over again, but without the *, so the second data that's matched by this conversion gets assigned to second_string.
With the typo fixed and a bit of extra code added to skip over the initial XYZ ::, working (tested) code looks like this:
#include <stdio.h>
int main() {
const char *input = "XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]";
char second_string[64];
/* %63 limits the copy so it cannot overflow second_string */
sscanf(input, "%*[^[][%*[^]]][%63[^]]]", second_string);
printf("content: %s\n", second_string);
return 0;
}

Just find the second [ and start extracting (or just printing) until the next ].

You can use string::substr if you are willing to convert to std::string.
If you don't know the location of the brackets, you can use string::find_last_of to find the last ] and string::find_last_of again to find the [ that opens it.
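A minimal sketch of that idea, assuming the text you want is always the last bracketed group on the line; the function name last_bracket_content is hypothetical:

```cpp
#include <string>

// Hypothetical helper: returns the text inside the last [...] pair,
// or an empty string if no such pair is found.
std::string last_bracket_content(const std::string& text) {
    const std::size_t close = text.find_last_of(']');
    if (close == std::string::npos) {
        return std::string();
    }
    // Search backwards from the closing bracket for its opening bracket.
    const std::size_t open = text.find_last_of('[', close);
    if (open == std::string::npos) {
        return std::string();
    }
    return text.substr(open + 1, close - open - 1);
}
```

For the example string this should give "20 BB EC 45 40 C8 97 20 84 8B 10".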

Well, say, your file looks like this:
XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]
XYZ ::[1][Maybe some other text]
XYZ ::[1][Some numbers maybe: 123 98345 123 9-834 ]
XYZ ::[1][blah-blah-blah]
The code that will extract the data will look something like this:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main()
{
//opening the file to read from
std::ifstream file( "in.txt" );
if( !file.is_open() )
{
cout << "Cannot open the file";
return -1;
}
std::string in, out;
int blockNumber = 1;//Which bracket block we are looking for. We are currently looking for the second one.
while( getline( file, in ) )
{
int n = 0;//Variable for storing index in the string (where our target text starts)
int i = 0;//Counter for [] blocks we have encountered.
while( i <= blockNumber )
{
//What we are doing here is searching for the position of [ symbol, starting
//from the n + 1'st symbol of the string.
n = in.find_first_of('[', n + 1);
i++;
}
//Getting our data and printing it.
out = in.substr( n + 1, ( in.find_first_of(']', n) - n - 1) );
std::cout << out << std::endl;
}
return 0;
}
The output after executing this will be:
20 BB EC 45 40 C8 97 20 84 8B 10
Maybe some other text
Some numbers maybe: 123 98345 123 9-834
blah-blah-blah

The simplest solution is something along the lines of:
std::string
match( std::string const& input )
{
static boost::regex const matcher( ".*\\[[^]]*\\]\\[(.*)\\]" );
boost::smatch matched;
return regex_match( input, matched, matcher )
? matched[1]
: std::string();
}
The regular expression looks a bit complicated because you need to match
meta-characters, and because the compiler I use doesn't support raw
strings yet. (With raw strings, I think the expression would be
R"^(.*\[[^]]*\]\[(.*)\])^". But I can't verify that.)
This returns an empty string in case there is no match; if you're sure
about the format, you might prefer to throw an exception. You can also
extend it to do as much error checking as necessary: in general, the
more you validate a text input, the better it is, but you didn't give
precise enough information about what was legal for me to fill it out
completely. (For your example string, for example, you might replace
the ".*" at the beginning of the regular expression with
"\\u{3}\\s*::": three upper case characters followed by zero or more
whitespace, then two ':'. Or the first [] group might be
"\\[\\d\\]", if you're certain it's always a single digit.)
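If Boost is not available, roughly the same thing can be sketched with std::regex from C++11. This is an adaptation rather than the code above: the ] inside the bracket expression is escaped so that the pattern does not depend on how the default ECMAScript grammar parses an unescaped ] right after [^, and the function name second_bracket_content is made up.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the content of the second [...] group when the
// whole input has the form <anything>[...][...], else an empty string.
std::string second_bracket_content(const std::string& input) {
    static const std::regex matcher(R"(.*\[[^\]]*\]\[(.*)\])");
    std::smatch matched;
    return std::regex_match(input, matched, matcher) ? matched[1].str()
                                                     : std::string();
}
```

As with the Boost version, an empty result means the input did not match the expected shape.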

This could work for you in a very specific sense:
std::string str(input);
std::size_t open = str.find_last_of('[');
std::size_t close = str.find_last_of(']');
std::string output = str.substr(open + 1, close - open - 1); // assumes both brackets are present
This only works if the bracketed string you want is the last one on the line, so you will probably need to define your question a little more precisely.

Using the C string library. Here is a code snippet that processes a single line; it can be used in a loop that reads the file line by line. NOTE: string.h must be included.
int length = strlen( input );
char* output = 0;
// Search
char* firstBr = strchr( input, '[' );
if( 0 != firstBr++ ) // check for null pointer
{
char* secondBr = strchr( firstBr, '[' );
// we don't need '['
if( 0 != secondBr++ )
{
int nOutLen = strlen( secondBr ) - 1;
if( 0 < nOutLen )
{
output = new char[nOutLen+1];
strncpy( output, secondBr, nOutLen );
output[ nOutLen ] = '\0';
}
}
}
if( 0 != output )
{
cout << output;
delete[] output;
output = 0;
}
else
{
cout << "Error!";
}

You could use this sscanf scanset (regex-like, though not a true regex) to get what is inside "<" and ">":
// Format: "<%999[^>]>" reads at most 999 bytes, so dest needs room for 1000
int n1 = sscanf(source, "<%999[^>]>", dest);
