I'm trying to figure out how to capture specific text in a log file, but only when it appears within the first 25 characters of a line of text. This is using the Analyse Plugin in Notepad++.
Example:
0.469132 CANFD 1 Rx 122f1 1 0 d 32 05 d3 07 ca 00 1f 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 09 a0 00 00 00 00 00 00 00 00
For lines like the example above, I have written the following regex:
(x|rx\s+(...))\s+\d\s+\d\s+(\d|\D)\s+(\d|\D|\d\d|\D\D)\s+.*?(?:(02\s(11|51)\s01))
This code will return the line if it sees 02 11 01 or 02 51 01, but I don't want to search the entire line. I only want to search the next 25 characters after the \d\s+\d\s+(\d|\D)\s+(\d|\D|\d\d|\D\D) part.
Does anyone have any suggestions on how this can be done?
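One way such a limit could be expressed is to replace the open-ended .*? with a bounded quantifier such as .{0,25}?, a form most regex flavours accept. Below is a minimal sketch using C++ std::regex rather than the Analyse Plugin itself. The helper name has_code_near_header, the max_gap parameter and the simplified header pattern are assumptions made for illustration, not the original expression.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns true if "02 11 01" or "02 51 01" starts within
// max_gap characters after the header fields that follow "Rx".
// The header pattern is a simplified stand-in for the one in the question.
bool has_code_near_header(const std::string& line, unsigned max_gap = 25) {
    const std::string pattern =
        std::string(R"(rx\s+\S+\s+\d\s+\d\s+\S\s+\S{1,2}\s+)") +  // Rx, id, four short fields
        ".{0," + std::to_string(max_gap) + "}?" +                  // bounded gap instead of .*?
        R"(02\s(?:11|51)\s01)";                                    // the bytes being looked for
    const std::regex re(pattern, std::regex::ECMAScript | std::regex::icase);
    return std::regex_search(line, re);
}
```

Here max_gap limits how far after the header the target may begin. If the whole target has to fit inside the 25-character window, the bound would need to be reduced by the target's length.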
Background:
I'm working on the legacy code of a web application, and I'm currently converting some of the ASCII parts of the code to Unicode. I've run into the following bug in the logger: it seems that string literals are either created corrupted or for some reason get corrupted along the way.
For example, take the following string: "%s::%s - Started with success." In memory it looks like this:
02AF9BFC 25 00 73 00 3A 00 3A 00 %.s.:.:.
02AF9C04 25 00 73 00 20 00 2D 00 %.s. .-.
02AF9C0C 20 00 53 00 74 00 61 00 .S.t.a.
02AF9C14 72 00 74 00 65 00 64 00 r.t.e.d.
02AF9C1C 20 00 77 00 69 00 74 00 .w.i.t.
02AF9C24 68 00 20 00 73 00 75 00 h. .s.u.
02AF9C2C 63 00 63 00 65 00 73 00 c.c.e.s.
02AF9C34 73 00 2E 00 00 00 00 00 s.......
02AF9C3C 00 00 00 00 00 00 00 00 ........
In the log, the string looks like this: -_S_t_a_r_t_e_d_ _w_i_t_h _s_u_c_c_e_s_s
Here a space is shown as usual and the NULL char is represented by _ (the _ is only an example; different text editors will show it differently).
I do use the _T macro, which, from what I learned here, makes the string literal Unicode.
Why do I get the byte 0 prefix?
In Microsoft's terminology, "Unicode" means UTF-16, i.e. each character is represented by either one or two 16-bit code units. When an ASCII character is converted to UTF-16, it is represented as a single code unit with the high byte zero and the low byte containing the ASCII character. Windows stores the low byte first (little-endian), which is why the dump shows 25 00 rather than 00 25.
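To illustrate the byte layout, here is a small sketch that serializes UTF-16 code units low byte first, as in the memory dump from the question. The function name utf16le_bytes is made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Serialize UTF-16 code units as little-endian bytes (low byte first).
// For ASCII characters the second byte of each pair is always 0.
std::vector<std::uint8_t> utf16le_bytes(const std::u16string& text) {
    std::vector<std::uint8_t> out;
    out.reserve(text.size() * 2);
    for (char16_t unit : text) {
        out.push_back(static_cast<std::uint8_t>(unit & 0xFF));  // low byte
        out.push_back(static_cast<std::uint8_t>(unit >> 8));    // high byte
    }
    return out;
}
```

Calling utf16le_bytes(u"%s::") gives 25 00 73 00 3A 00 3A 00, which is the first row of the dump.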
If you want your log file to be readable as ASCII you need to convert your text to UTF-8 when writing it out. Otherwise, make sure that all text in the log file is UTF-16 and use a log file reader that understands UTF-16, but note that you'll waste up to 50% space if most of your text is ASCII (since every second byte will be 0).
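As a rough sketch of the first option, the conversion to UTF-8 can be written by hand as below. On Windows you would more likely call WideCharToMultiByte with CP_UTF8. The function name utf16_to_utf8 and the choice to replace unpaired surrogates with U+FFFD are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Convert UTF-16 code units to UTF-8. ASCII text comes out one byte per
// character, so the log stays readable in any ASCII-aware viewer.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (high || low) {
            cp = 0xFFFD;  // unpaired surrogate: substitute the replacement character
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```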
First - my apologies if this has been answered a hundred times over! D'oh!
But my search-fu apparently sucks, as I'm having no luck answering this basic question:
How are resources stored in the EXE/DLL? As UNICODE (UCS-2, Windows native internal character format), or as multibyte characters using the code-page of the resources block?
How does one embed UNICODE strings into one's resources (.rc)?
Can UNICODE (UCS-2) text be inserted into the language strings from within VS 2012?
Is Windows still using UCS-2, or is it using UTF16 internally?
I'm just looking for general answers, or links to details, rather than a detailed how-to for putting a UNICODE string into an .rc string table. Thanks!
All resource strings in WIN32 are compiled as Unicode. See here for more info. The .rc script itself can be ANSI (using the local codepage) or UCS-2 with the appropriate BOM (reference).
If in doubt, take a look at the hex. Here is the start of notepad.exe's version resource, in UTF-16:
0002ed60 01 00 53 00 74 00 72 00 69 00 6e 00 67 00 46 00 |..S.t.r.i.n.g.F.|
0002ed70 69 00 6c 00 65 00 49 00 6e 00 66 00 6f 00 00 00 |i.l.e.I.n.f.o...|
0002ed80 a6 02 00 00 01 00 30 00 34 00 30 00 39 00 30 00 |......0.4.0.9.0.|
0002ed90 34 00 42 00 30 00 00 00 4c 00 16 00 01 00 43 00 |4.B.0...L.....C.|
0002eda0 6f 00 6d 00 70 00 61 00 6e 00 79 00 4e 00 61 00 |o.m.p.a.n.y.N.a.|
0002edb0 6d 00 65 00 00 00 00 00 4d 00 69 00 63 00 72 00 |m.e.....M.i.c.r.|
0002edc0 6f 00 73 00 6f 00 66 00 74 00 20 00 43 00 6f 00 |o.s.o.f.t. .C.o.|
0002edd0 72 00 70 00 6f 00 72 00 61 00 74 00 69 00 6f 00 |r.p.o.r.a.t.i.o.|
There is a good writeup of the issue here.
The Resource Compiler defaults to CP_ACP, even in the face of subtle hints that the file is UTF-8
https://devblogs.microsoft.com/oldnewthing/20190607-00/?p=102569
What kind of regex is needed to format some lines from
0 0 0
00 00 00
000 000 00
0 00 000
000 00 0
000 0 000
to
0   0   0
00  00  00
000 000 000
0   00  000
000 00  0
000 0   000
?
TIA
Watch this:
There are times when you can improve the readability of your code by lining up the elements on neighbouring lines. In this episode, I demonstrate how this can be achieved using the Tabular plugin...
For such tasks, I'd take a look at "DrChips Alignment Tool for Vim"
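For reference, what these alignment plugins do is pad each field to the width of the widest entry in its column. This is hard to do with a single substitution, because the padding depends on other lines. A minimal sketch of the idea outside Vim, where the function name align_columns is made up for this example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split each line on whitespace, then left-align every field to the width
// of the widest field in the same column, with one space between columns.
std::vector<std::string> align_columns(const std::vector<std::string>& lines) {
    std::vector<std::vector<std::string>> rows;
    std::vector<std::size_t> widths;
    for (const auto& line : lines) {
        std::istringstream in(line);
        std::vector<std::string> fields;
        for (std::string field; in >> field;) {
            if (fields.size() == widths.size()) widths.push_back(0);
            widths[fields.size()] = std::max(widths[fields.size()], field.size());
            fields.push_back(field);
        }
        rows.push_back(fields);
    }
    std::vector<std::string> out;
    for (const auto& fields : rows) {
        std::string line;
        for (std::size_t i = 0; i < fields.size(); ++i) {
            line += fields[i];
            // Pad every field except the last one in the row.
            if (i + 1 < fields.size()) line.append(widths[i] - fields[i].size() + 1, ' ');
        }
        out.push_back(line);
    }
    return out;
}
```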
I need a file I/O library that can give my program a UTF-16 (little-endian) interface but can handle files in other encodings, mainly ASCII (input only), UTF-8, UTF-16, and UTF-32/UCS-4, in both little- and big-endian byte orders.
Having looked around, the only library I found was ICU's ustdio.h.
I did try it; however, I couldn't even get it to work with a very simple bit of text, and there is pretty much zero documentation on its usage, only the ICU file reference page, which provides no examples and very little detail (e.g. having made a UFILE from an existing FILE, is it safe to use other functions that take the FILE*? Along with several other questions...).
Also, I'd far rather have a C++ library that can give me a wide-stream interface than a C-style interface...
std::wstring str = L"Hello World in UTF-16!\nAnother line.\n";
UFILE *ufile = u_fopen("out2.txt", "w", 0, "utf-16");
u_file_write(str.c_str(), str.size(), ufile);
u_fclose(ufile);
output
Hello World in UTF-16!䄀渀漀琀栀攀爀 氀椀渀攀⸀ഀ
hex
FF FE 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00
6F 00 72 00 6C 00 64 00 20 00 69 00 6E 00 20 00
55 00 54 00 46 00 2D 00 31 00 36 00 21 00 0D 0A
00 41 00 6E 00 6F 00 74 00 68 00 65 00 72 00 20
00 6C 00 69 00 6E 00 65 00 2E 00 0D 0A 00
EDIT: The correct output on Windows would be:
FF FE 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00
6F 00 72 00 6C 00 64 00 20 00 69 00 6E 00 20 00
55 00 54 00 46 00 2D 00 31 00 36 00 21 00 0D 00
0A 00 41 00 6E 00 6F 00 74 00 68 00 65 00 72 00
20 00 6C 00 69 00 6E 00 65 00 2E 00 0D 00 0A 00
The problem you see comes from the linefeed conversion. Sadly, it is done at the byte level (after the encoding conversion) and is not aware of the encoding. In other words, you have to disable the automatic conversion (by opening the file in binary mode, with the "b" flag) and, if you want 0A 00 to be expanded to 0D 00 0A 00, you'll have to do it yourself.
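A minimal sketch of doing it yourself with only the standard library (not ICU): expand the newlines on 16-bit code units first, then write the bytes through a binary-mode stream so nothing is translated afterwards. The names expand_newlines and write_utf16le are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Insert a CR before every LF that is not already preceded by one,
// working on whole UTF-16 code units rather than on bytes.
std::u16string expand_newlines(const std::u16string& text) {
    std::u16string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == u'\n' && (i == 0 || text[i - 1] != u'\r')) out += u'\r';
        out += text[i];
    }
    return out;
}

// Write a BOM followed by little-endian UTF-16. Binary mode keeps the
// runtime from touching the 0A bytes.
bool write_utf16le(const std::string& path, const std::u16string& text) {
    std::ofstream file(path, std::ios::binary);
    if (!file) return false;
    const char bom[2] = {'\xFF', '\xFE'};
    file.write(bom, 2);
    for (char16_t unit : expand_newlines(text)) {
        const char bytes[2] = {static_cast<char>(unit & 0xFF),
                               static_cast<char>(unit >> 8)};
        file.write(bytes, 2);
    }
    file.close();
    return !file.fail();
}
```

With the text from the question, write_utf16le("out2.txt", u"Hello World in UTF-16!\nAnother line.\n") should produce the byte sequence shown in the EDIT, with each line ending in 0D 00 0A 00.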
You mention that you'd prefer a C++ wide-stream interface, so I'll outline what I did to achieve that in our software:
Write a std::codecvt facet using an ICU UConverter to perform the conversions.
Use an std::wfstream to open the file
imbue() your custom codecvt in the wfstream
Open the wfstream with the binary flag, to turn off the automatic (and erroneous) linefeed conversion.
Write a "WNewlineFilter" to perform linefeed conversion on wchars. Use inspiration from boost::iostreams::newline_filter
Use a boost::iostreams::filtering_wstream to tie the wfstream and the WNewlineFilter together as a stream.
I successfully worked with the EZUTF library posted on CodeProject:
High Performance Unicode Text File I/O Routines for C++
UTF8-CPP gives you conversion between UTF-8, 16 and 32. Very nice and light library.
About ICU, some comments by the UTF8-CPP creator:
ICU Library. It is very powerful, complete, feature-rich, mature, and widely used. Also big, intrusive, non-generic, and doesn't play well with the Standard Library. I definitely recommend looking at ICU even if you don't plan to use it. :)
I think the problems come from the 0D 0A 00 line breaks. You could check whether other line breaks work, such as \r\n, or LF or CR alone (the best bet would be \r, I suppose).
EDIT: It seems 0D 00 0A 00 is what you want, so you can try
std::wstring str = L"Hello World in UTF-16!\15\12Another line.\15\12";
You can try the iconv (libiconv) library.
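A rough sketch of what that might look like, using the POSIX iconv_open / iconv / iconv_close calls. The wrapper name convert_encoding is made up here, encoding names such as "UTF-16LE" are assumed to be supported by your iconv build, and on some platforms you need to link with -liconv.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert a byte string from one named encoding to another with iconv.
std::string convert_encoding(const std::string& input, const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);  // note the order: target first, then source
    if (cd == reinterpret_cast<iconv_t>(-1)) throw std::runtime_error("unsupported conversion");

    std::string output;
    char buffer[256];
    char* in_ptr = const_cast<char*>(input.data());  // iconv does not modify the input
    std::size_t in_left = input.size();
    while (in_left > 0) {
        char* out_ptr = buffer;
        std::size_t out_left = sizeof buffer;
        const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        output.append(buffer, sizeof buffer - out_left);
        // E2BIG only means the output buffer filled up, so loop again.
        if (rc == static_cast<std::size_t>(-1) && errno != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("invalid or incomplete input");
        }
    }
    iconv_close(cd);
    return output;
}
```

Because the conversion runs on your own buffers, newline handling stays under your control: convert first, then write the result in binary mode.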