UTF-32, why is it taking up 8 bytes?

I have been reading all about Unicode lately, because it's pretty interesting how it all works.
So I've read that UTF-32 is a fixed 4 bytes per character. I thought it was odd, then, that on both of my MacBook Airs, when I saved a simple file with one letter (t) in it, it was saved as 8 bytes. The same thing happened with UTF-16, which took up 4 bytes (not as odd, though). Does anyone know why?
Note: I did check; there's no whitespace in it.

There is most likely a UTF BOM being saved at the beginning of the file, in front of the t character. A BOM is used to specify which UTF encoding was used to encode the file and, in the case of UTF-16 and UTF-32, which endianness is being used.
UTF-16LE: BOM (2 bytes) + t (2 bytes) = 4 bytes
FF FE 74 00
UTF-16BE: BOM (2 bytes) + t (2 bytes) = 4 bytes
FE FF 00 74
UTF-32LE: BOM (4 bytes) + t (4 bytes) = 8 bytes
FF FE 00 00 74 00 00 00
UTF-32BE: BOM (4 bytes) + t (4 bytes) = 8 bytes
00 00 FE FF 00 00 00 74
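To see this for yourself, here is a minimal sketch (the filename is arbitrary) that writes a single 't' as UTF-32LE with a BOM; on a little-endian machine such as an Intel MacBook Air it produces exactly the 8 bytes shown in the UTF-32LE layout above:
#include <cstdint>
#include <fstream>

int main()
{
    std::ofstream out("t-utf32le.txt", std::ios::binary);
    const std::uint32_t bom = 0x0000FEFF;  // U+FEFF byte order mark
    const std::uint32_t t   = 0x00000074;  // 't'
    // On a little-endian CPU the in-memory byte order is already
    // FF FE 00 00 followed by 74 00 00 00, i.e. 8 bytes total.
    out.write(reinterpret_cast<const char*>(&bom), sizeof bom);
    out.write(reinterpret_cast<const char*>(&t), sizeof t);
}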

Related

How do you search for specific text in an entire string of text using RegEx in Notepad++?

I'm trying to figure out how to capture specific text in a log file, matching only within the first 25 characters of a line of text. This is using the Analyse Plugin in Notepad++.
Example:
0.469132 CANFD 1 Rx 122f1 1 0 d 32 05 d3 07 ca 00 1f 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 09 a0 00 00 00 00 00 00 00 00
In the example above, I have written the following regex code
RegEx code:
(x|rx\s+(...))\s+\d\s+\d\s+(\d|\D)\s+(\d|\D|\d\d|\D\D)\s+.*?(?:(02\s(11|51)\s01))
This code will return the line if it sees 11 01 or 51 01, but I don't want to search the entire line; I only want to search the next 25 characters after the \d\s+\d\s+(\d|\D)\s+(\d|\D|\d\d|\D\D).
Does anyone have any suggestions on how this can be done?
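One possible sketch (assuming the Analyse Plugin accepts PCRE-style quantifiers, which I have not verified): replace the unbounded .*? with a bounded lazy quantifier such as .{0,25}? so the match cannot extend more than about 25 characters past that prefix:
(x|rx\s+(...))\s+\d\s+\d\s+(\d|\D)\s+(\d|\D|\d\d|\D\D)\s+.{0,25}?(?:(02\s(11|51)\s01))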

LZW encoding and the GIF file format

I'm trying to understand how to create a .gif file in C++. So far I think I understand everything except how the LZW encoding works. This is the file that I have generated with labels:
47 49 46 38 39 61 -header
0A 00 01 00 91 00 -logical screen descriptor
00 00 FF 00 FF 00 -color table [green,red,yellow,black]
00 FF FF 00 00 00
00 21 F9 04 00 00 -graphics control extension
00 00 00 2C 00 00 -image descriptor
00 00 0A 00 01 00 -(10 pixels wide x 1 pixel tall)
00 02 04 8A 05 00 -encoded image
3B -terminator
And here it is again without labels for copy/paste purposes: 47 49 46 38 39 61 05 00 04 00 91 00 00 00 FF 00 FF 00 00 FF FF 00 00 00 00 21 F9 04 00 00 00 00 00 2C 00 00 00 00 0A 00 01 00 00 02 04 8A 05 00 3B
I am having a lot of trouble understanding how 02 04 8A 05 translates to the image yryryggyry. I know that 02 is the minimum code size and 04 is the length of the image block, and I think I have identified the clear and EOI codes, but I don't understand the codes in between.
8A 05
10001010 00000101
100 | 01010 00000 | 101
 ^        ????        ^
clear code         EOI code
So far I have gotten the most information from the .gif specification:
http://www.w3.org/Graphics/GIF/spec-gif89a.txt
And this website was also helpful:
http://www.matthewflickinger.com/lab/whatsinagif/lzw_image_data.asp
Thanks
EDIT:
I watched the YouTube video linked in the comments and encoded the image manually for the color stream "yryryggyry":
Color table: 0 = g, 1 = r, 2 = y
2 1 2 1 2 0 0 2 1 2
010 001 010 001 010 000 000 010 001 010
current next output dict
010 001 010 21 6
001 010 001 12 7
010 001 - -
001 010 110 121 8
010 000 010 212 9
000 000 000 00 10
000 010 1010 002 11
010 001 - -
001 010 110 -
010 - 010 -
outputs-100 010 001 110 010 000 1010 110 010 101
01010101 4th 55
10101000 3rd A8
00101100 2nd 2C
01010100 1st 54
Code-54 2C A8 55
I must have made a mistake because this code produces the image "yr" instead of "yryryggyry"
I am going to try to redo the work to see if I get a different answer
Perhaps you've made a mistake at line 4:
001 010 110 121 8
At line 3, "010" is ignored, so you have to add it to line 4 first.
And at line 4, it comes to:
current next output dict
010 001 010 010 001 212 8
Here is my solution (also manually created):
LZW for yryryggyry
Update:
Finally figured out why:
When you are encoding the data, you increase your code size as soon as you write out the code equal to 2^(current code size)-1. If you are decoding from codes to indexes, you need to increase your code size as soon as you add the code value that is equal to 2^(current code size)-1 to your code table. That is, the next time you grab the next section of bits, you grab one more.
The writer means that you should increase your word size when you are about to output the code 2^(current code size) - 1, but there is a different reading which also seems reasonable:
When you add item #(2^current code size) to the code table, the next output should increase its word size.
This is also consistent with the writer's example, and it is the explanation I prefer.
Here's your example ("yryryggyry"):
output sequence:
#4 #2 #1 #6 #2 #0 #0 #8 #5
When you are about to output #6, you add "yry" to your code table, and it gets the index #8.
Since 8 = 2^(current word size)
(current word size = 2 (the minimum code size) + 1 (for the reserved clear/EOI codes) = 3),
the next output should use the larger word size, so #2 becomes a word of 4 bits.
And the final output sequence is:
4 100
2 010
1 001
6 110
2 0010
0 0000
0 0000
8 1000
5 0101
After encoding they become
54 2C 00 58
So the data block is
02 -minimum word size
04 -data length
54 2c 00 58 -data
00 -data block terminator
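To double-check the packing step, here is a small sketch (the helper name pack_codes is made up) that packs variable-width LZW codes into bytes least significant bit first, which is the bit order GIF image data uses; fed the code widths from the worked example above, it prints 54 2C 00 58:
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Pack variable-width LZW codes into bytes, LSB first.
std::vector<std::uint8_t> pack_codes(const std::vector<std::pair<std::uint16_t, int>>& codes)
{
    std::vector<std::uint8_t> out;
    std::uint32_t buffer = 0; // bit accumulator
    int bits = 0;             // number of valid bits in the accumulator
    for (const auto& cw : codes) {
        buffer |= static_cast<std::uint32_t>(cw.first) << bits;
        bits += cw.second;
        while (bits >= 8) {   // flush every completed byte
            out.push_back(static_cast<std::uint8_t>(buffer & 0xFF));
            buffer >>= 8;
            bits -= 8;
        }
    }
    if (bits > 0)             // flush the final partial byte, zero-padded
        out.push_back(static_cast<std::uint8_t>(buffer & 0xFF));
    return out;
}

int main()
{
    // Code stream for "yryryggyry": clear (#4), 2, 1, 6 at 3 bits,
    // then 2, 0, 0, 8, EOI (#5) at 4 bits, as derived above.
    const std::vector<std::pair<std::uint16_t, int>> codes = {
        {4, 3}, {2, 3}, {1, 3}, {6, 3},
        {2, 4}, {0, 4}, {0, 4}, {8, 4}, {5, 4}
    };
    for (std::uint8_t b : pack_codes(codes))
        std::printf("%02X ", b);  // prints: 54 2C 00 58
    std::printf("\n");
}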

Very strange behavior writing binary file

I've run into an extremely perplexing issue when outputting binary data. I am writing a model converter that takes an ASCII file format and converts the values to binary. After parsing the file, I write the values out in binary. Everything works fine, up to a point, then things get weird.
The block of data that is being output looks like this:
struct vertex_t
{
    glm::vec2 texcoord;
    unsigned short weight_index_start;
    unsigned char weight_count;
};
Here is the binary block of data in question, with a human readable value following the # symbol.
00 00 80 3F # 1.000
AA AA AA 3E # 0.333
00 00 # 0
01 # 1
00 00 40 3F # 0.750
AA AA AA 3E # 0.333
01 00 # 1
01 # 1
00 00 40 3F # 0.750
00 00 00 00 # 0.000
02 00 # 2
01 # 1
...
Everything looks swell, up until the 11th element, where something weird happens...
00 00 00 3F # 0.500
AA AA AA 3E # 0.333
09 00 # 9
01 # 1
FE FF 7F 39 # 2.4x10^-4
AA AA 2A 3F # 0.666
0D # 13 (why is this here?!)
0A 00 # 10
01 # 1
00 00 40 3F # 0.75
00 00 80 3F # 1.0
0B 00 # 11
01 # 1
As you can see, a 0D is written in the middle of the struct for no apparent reason. Here is the relevant block of code that exports this struct:
for (const auto& vertex : mesh.vertices)
{
    ostream.write(reinterpret_cast<const char*>(&vertex.texcoord.x), sizeof(vertex.texcoord.x));
    ostream.write(reinterpret_cast<const char*>(&vertex.texcoord.y), sizeof(vertex.texcoord.y));
    ostream.write(reinterpret_cast<const char*>(&vertex.weight_index_start), sizeof(vertex.weight_index_start));
    ostream.write(reinterpret_cast<const char*>(&vertex.weight_count), sizeof(vertex.weight_count));
}
I have no idea how this could happen, but perhaps I'm missing something. Any help would be appreciated!
It seems like you are pushing a carriage return along with the newline into the file.
0D : carriage return
0A : line feed (new line)
See the ASCII table at http://www.asciitable.com/
Try opening the file with the ios::binary parameter,
example:
fstream output("myfile.data", ios::out | ios::binary);
Hope this helps!
You don't show how you're writing the data. Generally speaking, however, if the output format is not a text format, you must open the file in binary mode; offhand, it looks as though you're on a Windows system and haven't done this. The binary value 0x0A corresponds to '\n', and if the file is not opened in binary mode, this will be converted to the system-dependent newline indicator, which under Windows (and probably most other non-Unix systems) is the two-byte sequence 0x0D, 0x0A.
Unless your output format is text, you must open the file in binary mode, both when reading and when writing. And (often forgotten) it should be imbued with the "C" locale; otherwise, there may be code translation.
RE your update, with the write methods: this does not output anything that you're guaranteed to be able to read later. (The need for a reinterpret_cast should tip you off to that.) If you're just spilling a too-large data set to disk, to be reread later by the same process, fine (provided that the structures contain nothing but integral, floating point and enum types). In all other cases, you need to define a binary format (or use an already defined one, like XDR), format the output to it, and parse it on input. (Doing this in a totally portable fashion for floating point is decidedly non-trivial. On the other hand, most applications don't need total portability, and if the binary format is based on IEEE, and all of the targeted systems use IEEE, you can usually get away with interpreting the bit pattern of the floating point as an appropriately sized unsigned int, and reading and writing this.)
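As a rough sketch of that last suggestion (the helper name and the choice of little-endian output are mine, not the answerer's), writing a float by copying its IEEE-754 bit pattern into a fixed-width unsigned integer and emitting the bytes in an explicit order might look like this:
#include <cstdint>
#include <cstring>
#include <ostream>

// Sketch only: serialize a float as its IEEE-754 bit pattern,
// always written least-significant byte first.
void write_float_le(std::ostream& os, float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined way to reinterpret the bits
    const char bytes[4] = {
        static_cast<char>(bits & 0xFF),
        static_cast<char>((bits >> 8) & 0xFF),
        static_cast<char>((bits >> 16) & 0xFF),
        static_cast<char>((bits >> 24) & 0xFF),
    };
    os.write(bytes, sizeof bytes);  // the stream must be opened with std::ios::binary
}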

String literals are created with prefix NULL (0)

Background:
I'm working on the legacy code of a web application and I'm currently converting some of the ASCII parts of the code to Unicode. I've run into the following bug in the logger; it seems that string literals are either created this way or are for some reason corrupted along the way.
Take the following string, for example: "%s::%s - Started with success." In memory it looks like this:
02AF9BFC 25 00 73 00 3A 00 3A 00 %.s.:.:.
02AF9C04 25 00 73 00 20 00 2D 00 %.s. .-.
02AF9C0C 20 00 53 00 74 00 61 00 .S.t.a.
02AF9C14 72 00 74 00 65 00 64 00 r.t.e.d.
02AF9C1C 20 00 77 00 69 00 74 00 .w.i.t.
02AF9C24 68 00 20 00 73 00 75 00 h. .s.u.
02AF9C2C 63 00 63 00 65 00 73 00 c.c.e.s.
02AF9C34 73 00 2E 00 00 00 00 00 s.......
02AF9C3C 00 00 00 00 00 00 00 00 ........
In the log the string will look as follows: -_S_t_a_r_t_e_d_ _w_i_t_h _s_u_c_c_e_s_s
where space is represented as usual and the NULL char is represented by _ (the _ is only an example; different text editors will show it in different ways).
I do use the _T macro which, from what I have learned, makes the string literal Unicode.
Why do I get the byte 0 prefix?
In Microsoft's terminology, "Unicode" means UTF-16, i.e. each character is represented by either one or two 16-bit code units. When an ASCII character is converted to UTF-16, it is represented as a single code unit with the high byte zero and the low byte containing the ASCII character.
If you want your log file to be readable as ASCII, you need to convert your text to UTF-8 when writing it out. Otherwise, make sure that all text in the log file is UTF-16 and use a log file reader that understands UTF-16, but note that you'll waste up to 50% of the space if most of your text is ASCII (since every second byte will be 0).
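A minimal sketch of the UTF-8 route on Windows, using the Win32 WideCharToMultiByte call (the helper name to_utf8 is made up, and error handling is omitted for brevity):
#include <windows.h>
#include <string>

// Sketch: convert a wide (UTF-16) string to UTF-8 before handing it
// to an ASCII-oriented logger.
std::string to_utf8(const std::wstring& wide)
{
    if (wide.empty()) return std::string();
    const int len = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(),
                                        static_cast<int>(wide.size()),
                                        nullptr, 0, nullptr, nullptr);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), static_cast<int>(wide.size()),
                        &utf8[0], len, nullptr, nullptr);
    return utf8;
}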

C++ unicode file io

I need a file I/O library that can give my program a UTF-16 (little-endian) interface, but can handle files in other encodings, mainly ASCII (input only), UTF-8, UTF-16, and UTF-32/UCS-4, in both little- and big-endian byte orders.
Having looked around the only library I found was the ICU ustdio.h library.
I did try it; however, I couldn't even get it to work with a very simple bit of text, and there is pretty much zero documentation on its usage, only the ICU file reference page, which provides no examples and very little detail (e.g. having made a UFILE from an existing FILE, is it safe to use other functions that take the FILE*? along with several others...).
Also, I'd far rather have a C++ library that can give me a wide-stream interface than a C-style interface...
std::wstring str = L"Hello World in UTF-16!\nAnother line.\n";
UFILE *ufile = u_fopen("out2.txt", "w", 0, "utf-16");
u_file_write(str.c_str(), str.size(), ufile);
u_fclose(ufile);
output
Hello World in UTF-16!਍䄀渀漀琀栀攀爀 氀椀渀攀⸀ഀ
hex
FF FE 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00
6F 00 72 00 6C 00 64 00 20 00 69 00 6E 00 20 00
55 00 54 00 46 00 2D 00 31 00 36 00 21 00 0D 0A
00 41 00 6E 00 6F 00 74 00 68 00 65 00 72 00 20
00 6C 00 69 00 6E 00 65 00 2E 00 0D 0A 00
EDIT: The correct output on Windows would be:
FF FE 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00
6F 00 72 00 6C 00 64 00 20 00 69 00 6E 00 20 00
55 00 54 00 46 00 2D 00 31 00 36 00 21 00 0D 00
0A 00 41 00 6E 00 6F 00 74 00 68 00 65 00 72 00
20 00 6C 00 69 00 6E 00 65 00 2E 00 0D 00 0A 00
The problem you see comes from the linefeed conversion. Sadly, it is made at the byte level (after the code conversion) and is not aware of the encoding. In other words, you have to disable the automatic conversion (by opening the file in binary mode, with the "b" flag) and, if you want 0A 00 to be expanded to 0D 00 0A 00, you'll have to do it yourself.
You mention that you'd prefer a C++ wide-stream interface, so I'll outline what I did to achieve that in our software:
Write a std::codecvt facet using an ICU UConverter to perform the conversions.
Use an std::wfstream to open the file
imbue() your custom codecvt in the wfstream
Open the wfstream with the binary flag, to turn off the automatic (and erroneous) linefeed conversion.
Write a "WNewlineFilter" to perform linefeed conversion on wchars. Use inspiration from boost::iostreams::newline_filter
Use a boost::iostreams::filtering_wstream to tie the wfstream and the WNewlineFilter together as a stream.
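As a much smaller sketch of the same general idea, using only the standard library rather than ICU and Boost (std::codecvt_utf16 exists since C++11, though it was deprecated in C++17; the filename here is arbitrary), you could write UTF-16LE with a BOM like this:
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    // Binary mode turns off the automatic 0A -> 0D 0A expansion,
    // so the CR LF pair is written explicitly in the string.
    std::wofstream out("out2.txt", std::ios::binary);
    out.imbue(std::locale(out.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF,
            std::codecvt_mode(std::little_endian | std::generate_header)>));
    out << L"Hello World in UTF-16!\r\nAnother line.\r\n";
}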
I successfully worked with the EZUTF library posted on CodeProject:
High Performance Unicode Text File I/O Routines for C++
UTF8-CPP gives you conversion between UTF-8, 16 and 32. Very nice and light library.
About ICU, some comments by the UTF8-CPP creator:
ICU Library. It is very powerful, complete, feature-rich, mature, and widely used. Also big, intrusive, non-generic, and doesn't play well with the Standard Library. I definitely recommend looking at ICU even if you don't plan to use it.
:)
I think the problems come from the 0D 0A 00 line breaks. You could try whether other line breaks, like \r\n, or using LF or CR alone, work (the best bet would be using \r, I suppose).
EDIT: It seems 0D 00 0A 00 is what you want, so you can try
std::wstring str = L"Hello World in UTF-16!\15\12Another line.\15\12";
You can try the iconv (libiconv) library.
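For completeness, a hedged sketch of using iconv for one of the required conversions (the helper name utf8_to_utf16le is made up; note that on some platforms the second parameter of iconv() is declared const char** rather than char**):
#include <iconv.h>
#include <stdexcept>
#include <string>

// Sketch: convert a UTF-8 string to UTF-16LE bytes with iconv.
std::string utf8_to_utf16le(const std::string& in)
{
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out(in.size() * 4 + 4, '\0');  // generous worst-case buffer
    char* inp  = const_cast<char*>(in.data());
    char* outp = &out[0];
    std::size_t inleft = in.size(), outleft = out.size();

    const std::size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - outleft);          // keep only the converted bytes
    return out;
}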