How do you skip a file's BOM before parsing it? - d

A Unicode file can contain a BOM at the start of the file. std.file.readText() will verify that this BOM is appropriate for the target string type (string, wstring, dstring), but it leaves the BOM as part of the range.
Parsers generally expect to be handed a plain string rather than a file, and a string needs no BOM because its encoding is already known from its type.
How do I read a file and remove the BOM if it exists?

The simplest way I've identified is to utilize std.encoding to get the BOM and skip over it.
import std.file;
auto fileContent = readText(file);
As Jonathan mentioned, that wouldn't work for non-UTF-8 encodings, so here is a tested function which works with string, wstring, and dstring.
import std.traits : isSomeString;
STR skipBom(STR)(STR fileContent) if (isSomeString!STR) {
    import std.encoding : getBOM, BOM;
    import std.algorithm : skipOver;
    import std.traits : CopyTypeQualifiers;
    auto byteArray = cast(CopyTypeQualifiers!(STR, ubyte[]))fileContent;
    if (getBOM(byteArray).schema != BOM.none)
        byteArray.skipOver(getBOM(byteArray).sequence);
    return cast(STR)byteArray;
} unittest {
    string s = "\xEF\xBB\xBFTesting UTF8";
    assert(skipBom(s) == "Testing UTF8");
} unittest {
    wstring s = [0xFEFF, 'T', 'e', 's', 't', 'i', 'n', 'g', ' ', 'U', 'T', 'F', '1', '6'];
    assert(skipBom(s) == "Testing UTF16");
} unittest {
    dstring s = [0x0000FEFF, 'T', 'e', 's', 't', 'i', 'n', 'g', ' ', 'U', 'T', 'F', '3', '2'];
    assert(skipBom(s) == "Testing UTF32");
}

Related

C++ string checks substring with find_first_of function from std lib

I am trying to use the find_first_of function from the C++ standard library to check whether a string contains a certain substring, but the result is not quite what I expected.
My code looks like this:
const std::wstring_view expected{ L"abc-1" };
const std::wstring_view result = GetResult(); // result = L"abc-2-1" from function return
if (result.find_first_of(expected) == 0) {
.....
}
When I debug this, the code enters the if block, which means it found a match at position 0. Is this how this API is expected to work? I think I might be missing something here.
std::basic_string_view::find_first_of returns the position of the first occurrence of any of the characters within the string (or std::basic_string_view::npos if none are found).
In other words, it gives the position of the first 'a', 'b', 'c', '-' or '1' in result. Since result starts with 'a', that position is 0.
Use std::basic_string_view::find to get the position of the first whole substring.
if (result.find(expected) == 0) {
// ...
}
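The difference between the two calls can be sketched as follows. The helper names (startsWithSubstring, startsWithAnyChar, checkQuestionValues) are made up for this illustration; only find and find_first_of come from the standard library.

```cpp
#include <cassert>
#include <string_view>

// True when `text` begins with the whole sequence `prefix`.
bool startsWithSubstring(std::wstring_view text, std::wstring_view prefix) {
    return text.find(prefix) == 0;
}

// True when the first character of `text` is any single character from `chars`.
bool startsWithAnyChar(std::wstring_view text, std::wstring_view chars) {
    return text.find_first_of(chars) == 0;
}

// Checks both helpers against the values from the question.
void checkQuestionValues() {
    const std::wstring_view expected{L"abc-1"};
    const std::wstring_view result{L"abc-2-1"};
    assert(startsWithAnyChar(result, expected));     // 'a' is one of the characters
    assert(!startsWithSubstring(result, expected));  // "abc-1" never occurs as a whole
}
```

With the question's values, the character-set check succeeds while the whole-substring check does not, which matches the behaviour seen in the debugger.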

find_last_of not working for xml strings

I have a templated XML document, shown below, in which I need to find the last ChildTag string.
<?xml version="1.0" encoding="UTF-8"?>
<Test xmlns:Test="http://www.w3.org/TR/html4/">
<TestID>1</TestID>
<TestData>
<ParentTag1>A</ParentTag1>
<ParentTag2>B</ParentTag2>
{{ChildTag}}
</TestData>
</Test>
ChildTag
<Tag1>E</Tag1>
<Tag2>F</Tag2>
So the approach I followed is to find the last occurrence of ChildTag in that string and take the substring from that position. Below is the sample code for this. Note that I am reading the XML from a file:
#include <iostream>
#include <fstream>
#include <algorithm>
#include <iterator>
using namespace std;
int main()
{
    std::ifstream fin("abc.xml");
    fin.unsetf(ios_base::skipws);
    std::string fileData = std::string(std::istream_iterator<char>(fin),std::istream_iterator<char>());
    std::cout<<fileData<<std::endl;
    auto childxmlindex = fileData.find_last_of("ChildTag");
    std::cout<<childxmlindex<<std::endl;
    std::cout<<"Child XML : "<<fileData.substr(childxmlindex)<<std::endl;
    return 0;
}
The issue is with the line fileData.find_last_of("ChildTag"), because it returns a seemingly random number that has nothing to do with the actual index. Is there anything in this string that causes find_last_of to fail?
Your expectation of find_last_of is wrong:
Searches the string for the last character that matches any of the characters specified in its arguments
So fileData.find_last_of("ChildTag") returns the position of the last character that matches any of the letters 'C', 'h', 'i', 'l', 'd', 'T', 'a', 'g' ('g' in your case).
You are looking for rfind, which searches for the last occurrence of the whole substring.
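A minimal sketch of the rfind approach, using a shortened stand-in for the template from the question. The helper names tailFromLast and checkTemplateSnippet are invented for the example.

```cpp
#include <cassert>
#include <string>

// Returns everything from the last occurrence of the whole word `marker`
// to the end of `data`, or an empty string when `marker` does not occur.
std::string tailFromLast(const std::string& data, const std::string& marker) {
    const std::string::size_type pos = data.rfind(marker);  // whole-substring search
    return pos == std::string::npos ? std::string() : data.substr(pos);
}

// Compares find_last_of and rfind on a shortened version of the template.
void checkTemplateSnippet() {
    const std::string data = "<TestData>\n{{ChildTag}}\n</TestData>\n";
    // find_last_of matches any single character of "ChildTag"; here that is
    // the last 'a' of the closing </TestData> tag.
    assert(data[data.find_last_of("ChildTag")] == 'a');
    // rfind locates the word itself.
    assert(tailFromLast(data, "ChildTag") == "ChildTag}}\n</TestData>\n");
}
```

Checking the result of rfind against std::string::npos before calling substr avoids an out_of_range exception when the marker is missing.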

Extracting characters from window and Linux

Basically, a Windows text editor puts "\r\n" at the end of a line. So when I have the word "compile" in a Windows file, it is actually "compile\r\n".
When I extract characters by using
char letter;
fin.get(letter);
from the file and put them into my linked list of char
list<char> myList;
I would get {'c', 'o', 'm', 'p', 'i', 'l', 'e', '\r', '\n'} in my list.
Then when I call
itr = myList.end();
it will give me the iterator containing the value '\n', is that right? So if I want to access the 'e', I have to do "--itr" twice. Is that right?
On Linux, I would have {'c', 'o', 'm', 'p', 'i', 'l', 'e', '\n'}, and calling "itr = myList.end()" would give me the iterator containing the value '\n', so I would have to do "--itr" once to get to the character 'e'. Is my understanding correct?
Basically, I am using Notepad for my text file. When I have the word "compile" with no space and call "itr = myList.end()", it gives me an iterator containing some space, and I don't know what it is. When I then do "--itr", it gives me an iterator to the last letter, while I expected to reach the last letter only after doing "--itr" twice, because it is a Windows text file.
Could anyone explain what is going on?
First of all, as NathanOliver pointed out, std::list::end returns an iterator to the element following the last element of the container, so it must not be dereferenced.
In your case, with Windows CRLF line endings, after auto it = myList.end(); it--; the iterator it refers to LF (0x0A). With Linux LF line endings, it also refers to LF.
A second it-- makes it point to CR (0x0D) for a Windows CRLF file, or, for a Linux LF file, to the e from compile that you are looking for in your example.
So you can use a simple conditional after the second --it to check whether it is 0x0D. If it is, you know the file is in Windows format and you need to decrement the iterator once more to get to the last character.
To illustrate this, look at the following code. It's very limited: no error checking, no bounds checking, etc.
Please note that there are much better ways to handle a file with unknown line endings than the sample code below.
#include <fstream>
#include <iostream>
#include <list>
using namespace std;
int main(int argc, char** argv)
{
    char buffer;
    list<char> l;
    ifstream f;
    // f.open("crlf.txt");
    f.open("lf.txt");
    while (f.get(buffer).good())
        l.push_back(buffer);
    auto it = l.end();
    it--; // step back from the past-the-end position to the last element
    if (*it == 0x0A) // if true it's LF or CRLF file
    {
        --it;
        if (*it == 0x0D) // if true it's CRLF
        {
            cout << "File is CRLF / Windows" << endl;
            --it; // get to the char before the newline
        }
        else
            cout << "File is LF / Linux" << endl;
    }
    // 'it' here always refers to the last character
    cout << "'it' points to " << *it << endl;
    return 0;
}
To test the code above I did the following:
echo "compile" > lf.txt
cp lf.txt crlf.txt
unix2dos crlf.txt
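The same stepping logic can be sketched with the bounds checks that the sample leaves out. This assumes C++17 for std::optional; the function names lastCharBeforeNewline and checkLineEndings are made up for the example.

```cpp
#include <cassert>
#include <list>
#include <optional>

// Returns the last character before a trailing "\n" or "\r\n".
// Returns std::nullopt when the list is empty or holds only the line ending.
std::optional<char> lastCharBeforeNewline(const std::list<char>& chars) {
    auto it = chars.end();                       // one past the last element
    if (it == chars.begin()) return std::nullopt;
    --it;                                        // now the last element
    if (*it == '\n') {                           // LF or CRLF ending
        if (it == chars.begin()) return std::nullopt;
        --it;
        if (*it == '\r') {                       // CRLF ending
            if (it == chars.begin()) return std::nullopt;
            --it;
        }
    }
    return *it;
}

// Exercises the LF, CRLF, and "only a line ending" cases.
void checkLineEndings() {
    assert(lastCharBeforeNewline({'e', '\n'}) == 'e');
    assert(lastCharBeforeNewline({'e', '\r', '\n'}) == 'e');
    assert(!lastCharBeforeNewline({'\r', '\n'}).has_value());
}
```

Every decrement is guarded by a comparison against begin(), so neither an empty list nor a file containing only a newline leads to undefined behaviour.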

Generalizing Keyboard Input

I'm currently writing an engine for game development and I'm having a hard time figuring out how best to generalize keyboard input.
I want to do something like:
void onKeyDown(int key)
{
    // Set key down state to true.
}
void onKeyUp(int key)
{
    // Set key down state to false.
}
void update()
{
    if (/* is key down */)
    {
        // Do something.
    }
}
Now I need to know which data structure to store my key states in. At first I thought a bool[256] array would be sufficient, but what about the control characters?
Then I thought about using a map like map<int, bool>. Would this work for different languages? I would want to check the key state with key_states['a'], or with defined constants for unprintable characters like key_states[Key::Enter]. What if someone from Asia entered an Asian character? Would that work too?
On my Linux machine there are 3 different 'representations' of a key. First there is a key code, which seems to be impractical for what I'm trying to do. The next representations are the keysym and a string representation. Because I also need to know when a non-printable key is pressed (e.g. the Enter key), I guess the string representation is not usable either.
So that leaves me with the keysym. This is a little test output I generated on my Linux machine (I pressed 'a', 'Shift-a', 'Ctrl-a' and 'Alt-a'):
DOWN: KeyCode: 38 '&', KeySym: 97 'a', String: 'a'
DOWN: KeyCode: 50 '2', KeySym: 65505 '<E1>', String: ''
DOWN: KeyCode: 38 '&', KeySym: 65 'A', String: 'A'
DOWN: KeyCode: 37 '%', KeySym: 65507 '<E3>', String: ''
DOWN: KeyCode: 38 '&', KeySym: 97 'a', String: '^A'
DOWN: KeyCode: 64 '#', KeySym: 65513 '<E9>', String: ''
DOWN: KeyCode: 38 '&', KeySym: 97 'a', String: 'a'
It seems like the keysym is always the same as the string representation for printable characters. Is this guaranteed for all languages? If it is I could always take the keysym as key into the key_states map. Then the above use case would look like:
void onKeyDown(int keysym)
{
    key_states[keysym] = true;
}
void onKeyUp(int keysym)
{
    key_states[keysym] = false;
}
void update()
{
    // Check if Ctrl-a was pressed.
    if (key_states[Key::Ctrl] && key_states['a'])
    {
        // Do something.
    }
}
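For reference, a self-contained sketch of that use case. The KeyStates class and the Key::Ctrl constant are hypothetical; the value 65507 is simply the keysym printed for the Ctrl key in the test output above and is an assumption for anything other than that machine.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical named constant, taken from the test output above.
namespace Key { constexpr int Ctrl = 65507; }

class KeyStates {
public:
    void onKeyDown(int keysym) { down_[keysym] = true; }
    void onKeyUp(int keysym)   { down_[keysym] = false; }
    // Const lookup: keys never seen are reported as "up" without inserting.
    bool isDown(int keysym) const {
        const auto it = down_.find(keysym);
        return it != down_.end() && it->second;
    }
private:
    std::unordered_map<int, bool> down_;
};

// Simulates pressing Ctrl-a and then releasing 'a'.
void checkCtrlA() {
    KeyStates keys;
    keys.onKeyDown(Key::Ctrl);
    keys.onKeyDown('a');
    assert(keys.isDown(Key::Ctrl) && keys.isDown('a'));
    keys.onKeyUp('a');
    assert(!keys.isDown('a'));
}
```

A map keyed by int accepts any keysym value, so it covers the storage side only; whether the keysym values themselves are stable across platforms and layouts is the open part of the question.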
Would this work platform and language independently?
Thanks in advance.

getting information from standard output ( C++ )?

How can I read the below information from standard output?
Fmail#yasar.com\0Tketo#keeto.com\0Tmail#lima.com\0\0
I want to have the entire information, including the \0 characters.
With such code:
string s;
fstream fs("/dev/stdout", fstream::in);
fs >> s;
If I write s to a file I get this output:
Ftest555#itap.gov.trTislam.yasar#inforcept.comTaa#test.comTbb#test.com
All \0 and \0\0 are lost.
How can I do that?
This is just a matter of processing the output correctly in your shell.
Imagine this:
cat file_with_nulls
This will recklessly print the content of file_with_nulls to the console, and of course the console may not be equipped to display non-printable characters. However, the following works:
cat file_with_nulls > otherfile
This will create a perfect copy of file_with_nulls.
The same works with your program. You can write anything you want to the standard output. But don't expect your terminal or console to do anything useful with it! Rather, redirect the output to a file and all is well:
./myprog > output.bin
Note that the C string operations don't usually work with null bytes, so in C you should use fwrite(). In C++, strings can contain any character, so std::cout << str; always works. However, constructing an std::string from a C character array stops at the null byte, so you have to use a different constructor:
char cstr[] = { 'H', 'e', 0, 'l', 'l', 'o', 0 };
std::string s1(cstr); // wrong, gives you "He"
std::string s2(cstr, sizeof(cstr)); // correct
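Just specify binary mode:
std::string result;
std::fstream fs( "/dev/stdout", std::fstream::in|std::fstream::binary );
std::string b;
// operator>> stops at whitespace but not at '\0', so null bytes are kept
// while the whitespace between tokens is dropped.
while ( fs >> b ) {
    result += b;
}
fs.close();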
I tested it with a file created by:
std::fstream out( "C:\\tmp\\test1.txt", std::fstream::out );
out.write( "aaa\n\0\0bbb\0ccc", 13 );
out.close();
But then you'll have to access the data through iterators (result.begin(), result.end()), or through data() together with size(), because any code that treats c_str() as a C string will stop at the first '\0'.
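If whitespace has to survive as well as the null bytes, one option is to read unformatted bytes through std::istreambuf_iterator. The sketch below uses an in-memory stream so that it stays self-contained; readAllBytes and checkRoundTrip are names invented for the example, and a real file would additionally be opened with std::ios::binary.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Reads every remaining byte of `in`, including '\0' and whitespace.
std::string readAllBytes(std::istream& in) {
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Round-trips the same 13 bytes that the test file above contains.
void checkRoundTrip() {
    const std::string original("aaa\n\0\0bbb\0ccc", 13);  // explicit length keeps the nulls
    std::istringstream in(original);
    const std::string copy = readAllBytes(in);
    assert(copy.size() == 13);
    assert(copy == original);
}
```

Because std::string tracks its own length, the copy compares equal to the original byte for byte, embedded nulls and the newline included.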