Visual Studio Character Sets 'Not set' vs 'Multi byte character set'


I'm working with a legacy application and I'm trying to work out the difference between applications compiled with Multi byte character set and Not Set under the Character Set option.
I understand that compiling with Multi byte character set defines _MBCS which allows multi byte character set code pages to be used, and using Not set doesn't define _MBCS, in which case only single byte character set code pages are allowed.
In the case that Not Set is used, I'm assuming then that we can only use the single byte character set code pages found on this page: http://msdn.microsoft.com/en-gb/goglobal/bb964654.aspx
Therefore, am I correct in thinking that if Not Set is used, the application won't be able to encode and write or read far eastern languages since they are defined in double byte character set code pages (and of course Unicode)?
Following on from this, if Multi byte character set is defined, are both single and multi byte character set code pages available, or only multi byte character set code pages? I'm guessing it must be both for European languages to be supported.
Thanks,
Andy
Further Reading
The answers on these pages didn't answer my question, but helped in my understanding:
About the "Character set" option in visual studio 2010
Research
So, just as working research... With my locale set as Japanese
Effect on hard coded strings
char *foo = "Jap text: テスト";
wchar_t *bar = L"Jap text: テスト";
Compiling with Unicode
*foo = 4a 61 70 20 74 65 78 74 3a 20 83 65 83 58 83 67 == Shift-Jis (Code page 932)
*bar = 4a 00 61 00 70 00 20 00 74 00 65 00 78 00 74 00 3a 00 20 00 c6 30 b9 30 c8 30 == UTF-16 or UCS-2
Compiling with Multi byte character set
*foo = 4a 61 70 20 74 65 78 74 3a 20 83 65 83 58 83 67 == Shift-Jis (Code page 932)
*bar = 4a 00 61 00 70 00 20 00 74 00 65 00 78 00 74 00 3a 00 20 00 c6 30 b9 30 c8 30 == UTF-16 or UCS-2
Compiling with Not Set
*foo = 4a 61 70 20 74 65 78 74 3a 20 83 65 83 58 83 67 == Shift-Jis (Code page 932)
*bar = 4a 00 61 00 70 00 20 00 74 00 65 00 78 00 74 00 3a 00 20 00 c6 30 b9 30 c8 30 == UTF-16 or UCS-2
Conclusion:
The Character Set option doesn't have any effect on hard coded strings. Defining chars as above seems to use the code page defined by the locale, and wchar_t seems to use either UCS-2 or UTF-16.
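The wide-string dumps above can be reproduced without a debugger. As a minimal sketch (the helper name hex_dump_utf16le is made up for illustration, and char16_t literals stand in for wchar_t, which holds 16-bit UTF-16 code units on Windows), this formats each code unit low byte first:

```cpp
#include <cstdio>
#include <string>

// Format every UTF-16 code unit as two hex bytes, low byte first
// (the little-endian layout seen in the *bar dumps).
std::string hex_dump_utf16le(const std::u16string& text) {
    std::string out;
    char buf[8];
    for (char16_t unit : text) {
        const unsigned low = static_cast<unsigned>(unit) & 0xFFu;
        const unsigned high = (static_cast<unsigned>(unit) >> 8) & 0xFFu;
        std::snprintf(buf, sizeof buf, "%02x %02x ", low, high);
        out += buf;
    }
    if (!out.empty()) out.pop_back();  // drop the trailing space
    return out;
}
```

For the three katakana characters U+30C6, U+30B9, U+30C8 (テスト) it returns c6 30 b9 30 c8 30, which matches the tail of *bar in every configuration.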
Using encoded strings in W/A versions of Win32 APIs
So, using the following code:
char *foo = "C:\\Temp\\テスト\\テa.txt";
wchar_t *bar = L"C:\\Temp\\テスト\\テw.txt";
CreateFileA(foo, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
CreateFileW(bar, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
Compiling with Unicode
Result: Both files are created
Compiling with Multi byte character set
Result: Both files are created
Compiling with Not set
Result: Both files are created
Conclusion:
Both the A and W version of the API expect the same encoding regardless of the character set chosen. From this, perhaps we can assume that all the Character Set option does is switch between the version of the API. So the A version always expects strings in the encoding of the current code page and the W version always expects UTF-16 or UCS-2.
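If that assumption holds, the mechanism is plain preprocessor substitution: the Windows headers map a generic name such as CreateFile to CreateFileW when UNICODE is defined and to CreateFileA otherwise. A portable sketch of the same pattern, using invented stand-ins (DemoOpenA, DemoOpenW, DemoOpen, DEMO_UNICODE and DEMO_TEXT are hypothetical, not Win32 names):

```cpp
#include <string>

// Hypothetical stand-ins for an A/W pair of API functions.
inline std::string DemoOpenA(const char*)    { return "A"; }
inline std::string DemoOpenW(const wchar_t*) { return "W"; }

// The generic name is only a macro; the build setting decides
// which concrete function it expands to.
#ifdef DEMO_UNICODE
#define DemoOpen DemoOpenW
#define DEMO_TEXT(s) L##s
#else
#define DemoOpen DemoOpenA
#define DEMO_TEXT(s) s
#endif
```

Calling DemoOpenA or DemoOpenW by name behaves the same in every configuration, which is consistent with the results above; only the generic DemoOpen changes meaning.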
Opening files using W and A Win32 APIs
So using the following code:
char filea[MAX_PATH] = {0};
OPENFILENAMEA ofna = {0};
ofna.lStructSize = sizeof ( ofna );
ofna.hwndOwner = NULL ;
ofna.lpstrFile = filea ;
ofna.nMaxFile = MAX_PATH;
ofna.lpstrFilter = "All\0*.*\0Text\0*.TXT\0";
ofna.nFilterIndex =1;
ofna.lpstrFileTitle = NULL ;
ofna.nMaxFileTitle = 0 ;
ofna.lpstrInitialDir=NULL ;
ofna.Flags = OFN_PATHMUSTEXIST|OFN_FILEMUSTEXIST ;
wchar_t filew[MAX_PATH] = {0};
OPENFILENAMEW ofnw = {0};
ofnw.lStructSize = sizeof ( ofnw );
ofnw.hwndOwner = NULL ;
ofnw.lpstrFile = filew ;
ofnw.nMaxFile = MAX_PATH;
ofnw.lpstrFilter = L"All\0*.*\0Text\0*.TXT\0";
ofnw.nFilterIndex =1;
ofnw.lpstrFileTitle = NULL;
ofnw.nMaxFileTitle = 0 ;
ofnw.lpstrInitialDir=NULL ;
ofnw.Flags = OFN_PATHMUSTEXIST|OFN_FILEMUSTEXIST ;
GetOpenFileNameA(&ofna);
GetOpenFileNameW(&ofnw);
and selecting either:
C:\Temp\テスト\テopena.txt
C:\Temp\テスト\テopenw.txt
Yields:
When compiled with Unicode
*filea = 43 3a 5c 54 65 6d 70 5c 83 65 83 58 83 67 5c 83 65 6f 70 65 6e 61 2e 74 78 74 == Shift-Jis (Code page 932)
*filew = 43 00 3a 00 5c 00 54 00 65 00 6d 00 70 00 5c 00 c6 30 b9 30 c8 30 5c 00 c6 30 6f 00 70 00 65 00 6e 00 77 00 2e 00 74 00 78 00 74 00 == UTF-16 or UCS-2
When compiled with Multi byte character set
*filea = 43 3a 5c 54 65 6d 70 5c 83 65 83 58 83 67 5c 83 65 6f 70 65 6e 61 2e 74 78 74 == Shift-Jis (Code page 932)
*filew = 43 00 3a 00 5c 00 54 00 65 00 6d 00 70 00 5c 00 c6 30 b9 30 c8 30 5c 00 c6 30 6f 00 70 00 65 00 6e 00 77 00 2e 00 74 00 78 00 74 00 == UTF-16 or UCS-2
When compiled with Not Set
*filea = 43 3a 5c 54 65 6d 70 5c 83 65 83 58 83 67 5c 83 65 6f 70 65 6e 61 2e 74 78 74 == Shift-Jis (Code page 932)
*filew = 43 00 3a 00 5c 00 54 00 65 00 6d 00 70 00 5c 00 c6 30 b9 30 c8 30 5c 00 c6 30 6f 00 70 00 65 00 6e 00 77 00 2e 00 74 00 78 00 74 00 == UTF-16 or UCS-2
Conclusion:
Again, the Character Set setting doesn't have a bearing on the behaviour of the Win32 API. The A version always seems to return a string with the encoding of the active code page and the W one always returns UTF-16 or UCS-2. I can actually see this explained a bit in this great answer: https://stackoverflow.com/a/3299860/187100.
Ultimate Conclusion
Hans appears to be correct when he says that the define doesn't really have any magic to it, beyond changing the Win32 APIs to use either W or A. Therefore, I can't really see any difference between Not Set and Multi byte character set.

No, that's not really the way it works. The only thing that happens is that the macro gets defined, it doesn't otherwise have a magic effect on the compiler. It is very rare to actually write code that uses #ifdef _MBCS to test this macro.
You almost always leave it up to a helper function to make the conversion, like WideCharToMultiByte(), OLE2A() or wcstombs(). These are conversion functions that always consider multi-byte encodings, as guided by the code page. _MBCS is an historical accident, relevant only 25+ years ago when multi-byte encodings were not common yet. Much like using a non-Unicode encoding is a historical artifact these days as well.

In the reference it is stated that:
By definition, the ASCII character set is a subset of all
multibyte-character sets. In many multibyte character sets, each
character in the range 0x00 – 0x7F is identical to the character that
has the same value in the ASCII character set. For example, in both
ASCII and MBCS character strings, the 1-byte NULL character ('\0') has
value 0x00 and indicates the terminating null character.
As you guessed, with _MBCS enabled Visual Studio also supports the single-byte ASCII character set.
In a second reference, single-byte character sets appear to be supported even when _MBCS is enabled:
MBCS/Unicode portability: Using the Tchar.h header file, you can build
single-byte, MBCS, and Unicode applications from the same sources.
Tchar.h defines macros prefixed with _tcs , which map to str, _mbs, or
wcs functions, as appropriate. To build MBCS, define the symbol _MBCS.
To build Unicode, define the symbol _UNICODE. By default, _MBCS is
defined for MFC applications. For more information, see Generic-Text
Mappings in Tchar.h.

Related

Regex to match part of a hex

So I need to use a regex to match part of a hexadecimal string, but that part is random. Let me try to explain more:
So I have this hex data:
70 75 62 71 00 7e 00 01 4c 00 06 72 61 6e 64 6f 6d 74 00 1c 4c 6a 2f 73 2f 6e 64 6f 6d 3b 78 70 77 25 00 00 00 20 f2 90 c2 91 c4 c4 ca 91 c0 c0 ca 91 94 cb c5 97 90 c5 90 c2 90 96 c7 ca 91 91 93 94 c6 c5 c6 cb c0 78
I need to match only the f2 in that case. But that is not always the case. Each data will be different. The only thing that is always the same is the '00 00 00' part and the '78' at the end. All the rest is random.
I managed to make the following regex:
/(?=00 00 00).+?(?=78)/
The output is:
00 00 00 20 f2 90 c2 91 c4 c4 ca 91 c0 c0 ca 91 94 cb c5 97 90 c5 90 c2 90 96 c7 ca 91 91 93 94 c6 c5 c6 cb c0
But I don't know how to build a regex that takes only the 'f2' (reminder: it is not always going to be f2).
Any thoughts?

Given the explanation in this comment, the regex that you need is:
(?<=00 00 00 [0-9a-f]{2} )[0-9a-f]{2}
Providing the first input string from the question, this regex matches f2 (no spaces around it).
How it works:
(?<= # start of a positive lookbehind
00 00 00 # match the exact string ("00 00 00 ")
[0-9a-f] # match one hex digit (lowercase only)
{2} # match the previous twice (i.e. two hex digits)
# there is a space after ")"
) # end of the lookbehind
[0-9a-f]{2} # match two hex digits
The positive lookbehind works like a non-capturing group but it is not part of the match. Basically it says that the matching part ([0-9a-f]{2}) matches only if it is preceded by a match of the lookbehind expression.
The matching part of the expression is [0-9a-f]{2} (i.e. two hex digits).
You need to add i, or whatever flag your regex engine uses to denote "ignore case" (so that the a-f part of the regex also matches A-F). If you cannot (or do not want to) provide this flag, you can put [0-9A-Fa-f] everywhere and it works.
If your regex engine does not support lookbehind you can get the same result using capturing groups:
00 00 00 [0-9a-f]{2} ([0-9a-f]{2})
Applied on the same input, this regex matches 00 00 00 20 f2 and its first (and only) capturing group matches f2.
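The capturing-group variant carries over to engines without lookbehind, such as C++ std::regex with its default ECMAScript grammar. A minimal sketch, assuming a hypothetical helper name second_byte_after_zeros and case-insensitive matching:

```cpp
#include <regex>
#include <string>

// Return the second two-digit hex value after the first "00 00 00",
// or an empty string when the pattern does not occur.
std::string second_byte_after_zeros(const std::string& hex) {
    static const std::regex pattern(
        "00 00 00 [0-9a-f]{2} ([0-9a-f]{2})", std::regex::icase);
    std::smatch m;
    if (std::regex_search(hex, m, pattern)) {
        return m[1].str();  // group 1 holds the target value
    }
    return std::string();
}
```

Returning an empty string for "no match" keeps the sketch short; a caller that needs to distinguish the cases more strictly could return a bool alongside the value instead.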
Update
If it is important that the input string contains 78 somewhere after the matching part, add (?=(?: [0-9a-f]{2})* 78) to the first regex:
(?<=00 00 00 [0-9a-f]{2} )[0-9a-f]{2}(?=(?: [0-9a-f]{2})* 78)
(?= introduces a positive lookahead. It behaves like a lookbehind, but it must come after the matching part of the regex and it is verified against the part of the string located after the match.
(?: starts a non-capturing group.
The [0-9a-f]{2} followed or preceded by a space in the lookahead and lookbehind ensures that the entire matching string is composed only of two-digit hex numbers separated by spaces. You can use .* instead, but that will match anything, even text that does not follow the format of two-digit hex numbers.
For the version without lookaheads or lookbehinds add (?: [0-9a-f]{2})* 78 at the end of the regex:
00 00 00 [0-9a-f]{2} ([0-9a-f]{2})(?: [0-9a-f]{2})* 78
The regex matches the entire string starting with 00 00 00 and ending with 78, and the first capturing group matches the second number after 00 00 00 (your target).

Is the f2 surrounded by asterisks?
Without asterisks:
00 00 00 [a-f0-9]+ (?<hexits>[a-f0-9]+).+78
With asterisks:
\*(?<hexits>[a-f0-9]+)\*

You can use the following regex to match the second hexadecimal value after "00 00 00": /00 00 00 [0-9A-Fa-f]{2} ([0-9A-Fa-f]{2})/. The first value after the zeros (20 here) is skipped, and the value you want is in the first capturing group.
Here is a demo:
import re
s = '70 75 62 71 00 7e 00 01 4c 00 06 72 61 6e 64 6f 6d 74 00 1c 4c 6a 2f 73 2f 6e 64 6f 6d 3b 78 70 77 25 00 00 00 20 f2 90 c2 91 c4 c4 ca 91 c0 c0 ca 91 94 cb c5 97 90 c5 90 c2 90 96 c7 ca 91 91 93 94 c6 c5 c6 cb c0 78'
match = re.search(r'00 00 00 [0-9A-Fa-f]{2} ([0-9A-Fa-f]{2})', s)
if match:
    print(match.group(1))
The output will be:
f2

You don't really need a regex for that. Find the offset of three zero bytes in a row and take the byte four positions after that offset (the second byte after the zeros):
s = '70 75 62 71 00 7e 00 01 4c 00 06 72 61 6e 64 6f 6d 74 00 1c 4c 6a 2f 73 2f 6e 64 6f 6d 3b 78 70 77 25 00 00 00 20 f2 90 c2 91 c4 c4 ca 91 c0 c0 ca 91 94 cb c5 97 90 c5 90 c2 90 96 c7 ca 91 91 93 94 c6 c5 c6 cb c0 78'
s2 = '01 02 03 00 00 00 05 06 07'
def locate(s):
    data = bytes.fromhex(s)
    offset = data.find(bytes([0,0,0]))
    return data[offset + 4]
print(f'{locate(s):02X}')
print(f'{locate(s2):02X}')
Output:
F2
06
You could also extract the "f2" string directly from the string:
offset = s.index('00 00 00')
print(s[offset + 12 : offset + 14]) # 'f2'

Intel OpenVINO memory leaks with VS2017

I am using OpenVINO 2020.3.194 with Windows 10 x64 and VS2017. I can run the Intel C++ examples, but when I use the inference in my application, I get lots of memory leaks at exit. Here are some of them:
{2395} normal block at 0x00000260E6A3BCF0, 48 bytes long.
Data: <google.protobuf.> 67 6F 6F 67 6C 65 2E 70 72 6F 74 6F 62 75 66 2E
{2394} normal block at 0x00000260E5958350, 16 bytes long.
Data: < ` > 20 01 A4 E6 60 02 00 00 00 00 00 00 00 00 00 00
{2393} normal block at 0x00000260E6A40100, 88 bytes long.
Data: < ` ` > B0 E4 A2 E6 60 02 00 00 10 E4 A2 E6 60 02 00 00
{2388} normal block at 0x00000260E6A41FB0, 32 bytes long.
Data: <google.protobuf.> 67 6F 6F 67 6C 65 2E 70 72 6F 74 6F 62 75 66 2E
{2387} normal block at 0x00000260E5958170, 16 bytes long.
Data: < ` > 80 00 A4 E6 60 02 00 00 00 00 00 00 00 00 00 00
{2386} normal block at 0x00000260E6A40060, 88 bytes long.
Data: < w ` ` > A0 77 BC E5 60 02 00 00 20 09 A4 E6 60 02 00 00
{2381} normal block at 0x00000260E6A3BC80, 48 bytes long.
Data: <google.protobuf.> 67 6F 6F 67 6C 65 2E 70 72 6F 74 6F 62 75 66 2E
I suspect that this happens because I am using the Unicode character set and the shared MFC DLL, while the examples compile with MBCS. How can I solve it?

I suggest you try running inference using OpenVINO 2020.4.
Do let us know if this works for you.

gRPC memory leaks

My Visual Studio project, which uses the gRPC library, has memory leaks. After some R&D I made a little project to reproduce the problem and found that I don't even need to use any gRPC object in my code.
My steps:
1) Get helloworld.proto from examples
2) Generate C++ files
3) Create a C++ project with the following code:
#include <crtdbg.h>
#include "helloworld.grpc.pb.h"
void f(){
    helloworld::HelloRequest request;
}
int main(){
    _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
    return 0;
}
Part of the output (the full dump has 240 lines):
Detected memory leaks!
Dumping objects ->
{1450} normal block at 0x00FD77A0, 16 bytes long.
Data: <`{ t C | > 60 7B FD 00 20 74 FD 00 84 43 CA 00 88 7C CA 00
{1449} normal block at 0x00FECA30, 48 bytes long.
Data: <google.protobuf.> 67 6F 6F 67 6C 65 2E 70 72 6F 74 6F 62 75 66 2E
{1448} normal block at 0x00FEA048, 8 bytes long.
Data: < > 20 C6 FE 00 00 00 00 00
{1447} normal block at 0x00FEC610, 52 bytes long.
Data: < v p" v > B8 76 FC 00 70 22 FE 00 B8 76 FC 00 00 00 CD CD
{1441} normal block at 0x00FEA610, 32 bytes long.
Data: <google.protobuf.> 67 6F 6F 67 6C 65 2E 70 72 6F 74 6F 62 75 66 2E
{1440} normal block at 0x00FE9B78, 8 bytes long.
If I add the line google::protobuf::ShutdownProtobufLibrary(); before return 0;, I get much less output, only this:
Detected memory leaks!
Dumping objects ->
{160} normal block at 0x00FCD858, 4 bytes long.
Data: < > 18 D6 B9 00
{159} normal block at 0x00FCD618, 4 bytes long.
Data: < > > C8 3E B9 00
{158} normal block at 0x00FCD678, 4 bytes long.
Data: < ? > D0 3F B9 00
Object dump complete.
But if I include some additional generated sources with many large services and messages, the memory dump is much bigger.
Since I don't use any gRPC objects directly, the only thing I can imagine is that some static objects are still alive when the VS memory dumper starts its work.
Is there a way to fix this, or a suggestion for what I can do about it?
UPD:
I did some additional work on this problem and opened a new issue on the grpc repository bug tracker: https://github.com/grpc/grpc/issues/22506
The problem description in that issue contains screenshots with the call stacks of the leaked allocations and gRPC debug traces.
UPD2:
I found all of them (version 1.23.0). I left a detailed comment there: https://github.com/grpc/grpc/issues/22506#issuecomment-618406755

C++ Converting string packet to iphdr - what should be the format of string packet?

So I have this line of code:
struct iphdr *ip_header = (struct iphdr*) packet.c_str();
from ip.h:
struct iphdr
{
#if __BYTE_ORDER == __LITTLE_ENDIAN
    unsigned int ihl:4;
    unsigned int version:4;
#elif __BYTE_ORDER == __BIG_ENDIAN
    unsigned int version:4;
    unsigned int ihl:4;
#else
# error "Please fix <bits/endian.h>"
#endif
    u_int8_t tos;
    u_int16_t tot_len;
    u_int16_t id;
    u_int16_t frag_off;
    u_int8_t ttl;
    u_int8_t protocol;
    u_int16_t check;
    u_int32_t saddr;
    u_int32_t daddr;
    /*The options start here. */
};
I captured a DNS packet using wireshark and I got this sample packet:
0000 e0 8e 3c 1c c0 07 ac bc 32 83 84 d9 08 00 45 00
0010 00 3f 51 45 00 00 40 11 aa b3 c0 a8 fe 65 c0 a8
0020 fe fe 0e 76 00 35 00 2b d5 1c 9c 0a 01 00 00 01
0030 00 00 00 00 00 00 03 77 77 77 06 67 6f 6f 67 6c
0040 65 03 63 6f 6d 02 70 68 00 00 01 00 01
I removed the eth header and so I'm left with this:
0000 45 00
0010 00 3f 51 45 00 00 40 11 aa b3 c0 a8 fe 65 c0 a8
0020 fe fe 0e 76 00 35 00 2b d5 1c 9c 0a 01 00 00 01
0030 00 00 00 00 00 00 03 77 77 77 06 67 6f 6f 67 6c
0040 65 03 63 6f 6d 02 70 68 00 00 01 00 01
The first part (45 00 00 3f 51 45 00 00 40 11) translates to this:
45 0100 .... = Version: 4
.... 0101 = Header Length: 20 bytes (5)
00 Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
00 3f Total Length: 63
51 45 Identification: 0x5145 (20805)
00 00 Flags: 0x00
Fragment offset: 0
40 Time to live: 64
11 Protocol: UDP (17)
My question is: what should be the format of the string variable packet? I have tried this:
std::string packet = "45 00 00 3f 51 45 00 00 40 11";
but for ip_header->protocol I get 48 '0' instead of 17.
Also I'm wondering, why is the protocol not on the 9th byte? I was assuming it should be on the 9th based on the structure of iphdr.
Would highly appreciate anyone's help. Thanks a lot!

Your basic assumption has some problems. You're using a string and you assume that casting it to a structure type will automatically (and auto-magically) convert its text to the binary representation of that structure. This is not the case. Let's say you have a structure 'struct Test { unsigned int t; }' and a string 'std::string st = "12"', and you do 'struct Test *pt = (struct Test*) st.c_str();'. The ASCII representation of "12" is 0x31 0x32, so pt now points to a memory location starting with 31 32. Reading that as an integer (assuming a big-endian system and a two-byte unsigned int) gives 0x3132 (decimal 12594), not 12.
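One way to get a usable buffer is to convert the hex text into raw bytes first. The sketch below is an illustration under assumptions, not the asker's code: bytes_from_hex, ipv4_total_length and ipv4_protocol are hypothetical helpers, and the fields are read by offset instead of through struct iphdr so the example does not depend on <netinet/ip.h>. Counting from zero, the protocol field sits at offset 9, which makes it the tenth byte of the header:

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Turn text such as "45 00 00 3f" into the raw bytes 0x45 0x00 0x00 0x3f.
std::string bytes_from_hex(const std::string& text) {
    std::istringstream in(text);
    std::string out;
    unsigned value = 0;
    while (in >> std::hex >> value) {
        out.push_back(static_cast<char>(value & 0xFFu));
    }
    return out;
}

// Total length is a big-endian 16-bit field at offsets 2 and 3.
std::uint16_t ipv4_total_length(const std::string& packet) {
    assert(packet.size() >= 20);  // minimum IPv4 header size
    const unsigned hi = static_cast<unsigned char>(packet[2]);
    const unsigned lo = static_cast<unsigned char>(packet[3]);
    return static_cast<std::uint16_t>((hi << 8) | lo);
}

// Protocol is a single byte at zero-based offset 9.
std::uint8_t ipv4_protocol(const std::string& packet) {
    assert(packet.size() >= 20);
    return static_cast<std::uint8_t>(packet[9]);
}
```

A std::string can hold the embedded zero bytes because push_back does not treat 0x00 as a terminator. Building the same buffer from a string literal would need the constructor that takes an explicit length.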

How to read the model of monitor from the EDID?

In the registry there is one key (or more, depending on how many monitors you have): HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\DISPLAY\DEL404C{Some Unique ID}\Device Parameters\EDID, which is a REG_BINARY value. In my case it contains:
00 ff ff ff ff ff ff 00 4c 2d 6f 03 39 31 59 4d
07 12 01 03 0e 29 1a 78 2a 80 c5 a6 57 49 9b 23
12 50 54 bf ef 80 95 00 95 0f 81 80 81 40 71 4f
01 01 01 01 01 01 9a 29 a0 d0 51 84 22 30 50 98
36 00 ac ff 10 00 00 1c 00 00 00 fd 00 38 4b 1e
51 0e 00 0a 20 20 20 20 20 20 00 00 00 fc 00 53
79 6e 63 4d 61 73 74 65 72 0a 20 20 00 00 00 ff
00 48 56 44 51 32 30 36 37 37 37 0a 20 20 00 ef
My question is: how can I read only the model of the monitor ("SyncMaster" for example), and not all of the information, using C or C++?
The format of EDID is described here: http://en.wikipedia.org/wiki/Extended_display_identification_data

What you're interested in here is the descriptor blocks of the EDID, which are found in the byte ranges 54-71, 72-89, 90-107, and 108-125. Here are those four 18-byte blocks in your EDID:
#1: 9a29 a0d0 5184 2230 5098 3600 acff 1000 001c
#2: 0000 00fd 0038 4b1e 510e 000a 2020 2020 2020
#3: 0000 00fc 0053 796e 634d 6173 7465 720a 2020
#4: 0000 00ff 0048 5644 5132 3036 3737 370a 2020
You can identify the descriptor containing the monitor name because the first three bytes are all zero (so it isn't a detailed timing descriptor), and the fourth byte is FC (indicating the type). The fifth byte is a reserved 00, and the sixth byte onward contains the name, terminated by 0A and padded with 20:
5379 6e63 4d61 7374 6572 0a20 20 SyncMaster.
So, in short: check at offsets 54, 72, 90, and 108 for the sequence 00 00 00 FC; if you find a match, skip one more byte and read the monitor name from the remaining 13 bytes, stopping at the 0A terminator.
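The lookup can be sketched in C++ as follows. This is a minimal illustration, assuming the EDID bytes have already been read from the registry into a vector; the function name edid_monitor_name is made up here, and the layout it relies on (type byte FC at descriptor offset 3, text in descriptor bytes 5 to 17, 0A as terminator) follows the EDID descriptor format described on the linked page:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Scan the four 18-byte descriptors of a base EDID block and return the
// text of the first monitor-name descriptor, or "" if there is none.
std::string edid_monitor_name(const std::vector<std::uint8_t>& edid) {
    const std::size_t offsets[] = {54, 72, 90, 108};
    for (std::size_t base : offsets) {
        if (base + 18 > edid.size()) break;  // truncated EDID
        const bool is_name = edid[base] == 0x00 && edid[base + 1] == 0x00 &&
                             edid[base + 2] == 0x00 && edid[base + 3] == 0xFC;
        if (!is_name) continue;
        std::string name;
        // Byte 4 of the descriptor is reserved; the text starts at byte 5.
        for (std::size_t i = base + 5; i < base + 18; ++i) {
            if (edid[i] == 0x0A) break;  // line feed ends the name
            name.push_back(static_cast<char>(edid[i]));
        }
        return name;
    }
    return std::string();
}
```

Reading the REG_BINARY value itself (for example with the registry API) is left out on purpose so the sketch stays portable.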
