Generate string by switching ASCII code (extended) - c++

I have a list of strings that were "encrypted" with a simple character-code shift. To make them readable I need to increase each character code by 7. The problem is that those codes can be above 127 (extended ASCII). I'm iterating over each seed string and trying to produce a new string like this:
string result = "";
for(std::string::iterator it = myEncryptedWord.begin(); it != myEncryptedWord.end(); ++it)
{
unsigned char myC = *it;
result+= char(myC+7);
}
It does not work for codes above 127 (characters like Ñóáã...). Needless to say that I can't control the original string list and change this poor "encryption".
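A minimal sketch of the shift described above, assuming the strings are stored in a single-byte encoding such as Latin-1 or Windows-1252 (so one byte is one character) and that codes should wrap around modulo 256. The name shift_bytes is made up for illustration. If the strings are actually UTF-8, a character like Ñ occupies two bytes and a per-byte shift will not give the intended result.

```cpp
#include <string>

// Hypothetical helper: add `shift` to every byte of `encrypted`, wrapping modulo 256.
// Assumption: single-byte encoding, so each byte is one character.
std::string shift_bytes(const std::string& encrypted, int shift)
{
    std::string result;
    result.reserve(encrypted.size());
    for (char c : encrypted) {
        // Work on unsigned char so codes above 127 are not treated as negative.
        const unsigned char byte = static_cast<unsigned char>(c);
        result += static_cast<char>(static_cast<unsigned char>(byte + shift));
    }
    return result;
}
```

With this, shift_bytes("A^eeh", 7) yields "Hello", and the byte 0xCA becomes 0xD1 (Ñ in Latin-1). If the output still looks wrong, the console or file encoding used to display the result is worth checking as well.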

Related

How to split a string by emojis in C++

I'm trying to take a string of emojis and split it into a vector with one emoji per element. Given the string:
std::string emojis = "😀🐔🦑😁🐔🎉😂🤣";
I'm trying to get:
std::vector<std::string> splitted_emojis = {"😀", "🐔", "🦑", "😁", "🐔", "🎉", "😂", "🤣"};
Edit
I've tried to do:
std::string emojis = "😀🐔🦑😁🐔🎉😂🤣";
std::vector<std::string> splitted_emojis;
size_t pos = 0;
std::string token;
while ((pos = emojis.find("")) != std::string::npos)
{
token = emojis.substr(0, pos);
splitted_emojis.push_back(token);
emojis.erase(0, pos);
}
But it seems like it throws terminate called after throwing an instance of 'std::bad_alloc' after a couple of seconds.
When trying to check how many emojis are in a string using:
std::string emojis = "😀🐔🦑😁🐔🎉😂🤣";
std::cout << emojis.size() << std::endl; // returns 32
it returns a bigger number, which I assume is the Unicode data. I don't know much about Unicode, but I'm trying to figure out how to detect where the data of each emoji begins and ends, so that I can split the string into individual emojis.
I would definitely recommend that you use a library with better unicode support (all large frameworks do), but in a pinch you can get by with knowing that the UTF-8 encoding spreads Unicode characters over multiple bytes, and that the first bits of the first byte determine how many bytes a character is made up of.
I stole a function from boost. The split_by_codepoint function uses an iterator over the input string and constructs a new string using the first N bytes (where N is determined by the byte count function) and pushes it to the ret vector.
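#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Taken from boost internals
inline unsigned utf8_byte_count(std::uint8_t c)
{
    // if the most significant bit with a zero in it is in position
    // 8-N then there are N bytes in this UTF-8 sequence:
    std::uint8_t mask = 0x80u;
    unsigned result = 0;
    while(c & mask)
    {
        ++result;
        mask >>= 1;
    }
    return (result == 0) ? 1 : ((result > 4) ? 4 : result);
}

std::vector<std::string> split_by_codepoint(const std::string& input) {
    std::vector<std::string> ret;
    auto it = input.cbegin();
    while (it != input.cend()) {
        // Cast through uint8_t so bytes >= 0x80 are not treated as negative.
        auto count = static_cast<std::string::difference_type>(
            utf8_byte_count(static_cast<std::uint8_t>(*it)));
        // Clamp so a truncated sequence at the end cannot run past the string.
        const auto remaining = input.cend() - it;
        if (count > remaining) count = remaining;
        ret.emplace_back(it, it + count);
        it += count;
    }
    return ret;
}

int main() {
    // Assumes this source file is saved as UTF-8. A u8"" literal no longer
    // converts to std::string in C++20, so a plain literal is used here.
    std::string emojis = "😀🐔🦑😁🐔🎉😂🤣";
    auto split = split_by_codepoint(emojis);
    std::cout << split.size() << std::endl; // 8 when the literal is stored as UTF-8
}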
Note that this function simply splits a string into UTF-8 strings containing one code point each. Determining if the character is an emoji is left as an exercise: UTF-8-decode any 4-byte characters and see if they are in the proper range.
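The exercise mentioned above could be approached roughly like this. It is only a sketch: decode_utf8_4 and looks_like_emoji are made-up names, and the range U+1F300 to U+1FAFF is a deliberately rough assumption that covers many emoji but not all of them (and says nothing about multi-code-point sequences such as flags or skin-tone modifiers).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode one 4-byte UTF-8 sequence into a code point.
// Returns 0 if `s` is not exactly one well-formed 4-byte sequence.
inline std::uint32_t decode_utf8_4(const std::string& s)
{
    if (s.size() != 4) return 0;
    const auto b = [&s](std::size_t i) { return static_cast<std::uint8_t>(s[i]); };
    if ((b(0) & 0xF8u) != 0xF0u) return 0;        // lead byte must be 11110xxx
    for (std::size_t i = 1; i < 4; ++i)
        if ((b(i) & 0xC0u) != 0x80u) return 0;    // continuation bytes are 10xxxxxx
    return (static_cast<std::uint32_t>(b(0) & 0x07u) << 18) |
           (static_cast<std::uint32_t>(b(1) & 0x3Fu) << 12) |
           (static_cast<std::uint32_t>(b(2) & 0x3Fu) << 6)  |
            static_cast<std::uint32_t>(b(3) & 0x3Fu);
}

// Rough check (assumed range, not an official emoji definition).
inline bool looks_like_emoji(const std::string& s)
{
    const std::uint32_t cp = decode_utf8_4(s);
    return cp >= 0x1F300u && cp <= 0x1FAFFu;
}
```

Each element returned by split_by_codepoint can be passed to looks_like_emoji. For a reliable answer, the emoji property data from a Unicode library such as ICU is the safer route.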

C++ tolower/toupper char pointer

Do you know why the following code crashes at runtime?
char* word;
word = new char[20];
word = "HeLlo";
for (auto it = word; it != NULL; it++){
*it = (char) tolower(*it);
I'm trying to lowercase a char* (string). I'm using Visual Studio.
Thanks
You cannot compare it to NULL. Instead you should be comparing *it to '\0'. Or better yet, use std::string and never worry about it :-)
In summary, when looping over a C-style string, you should loop until the character you see is '\0'. The iterator itself will never be NULL, since it simply points to a place in the string. The fact that the iterator has a type which can be compared to NULL is an implementation detail that you shouldn't touch directly.
Additionally, you are trying to write to a string literal, which is a no-no :-).
EDIT:
As noted by @Cheers and hth. - Alf, tolower can break if given negative values. So sadly, we need to add a cast to make sure this won't break if you feed it Latin-1 encoded data or similar.
This should work:
char word[] = "HeLlo";
for (auto it = word; *it != '\0'; ++it) {
*it = tolower(static_cast<unsigned char>(*it));
}
You're setting word to point to the string literal, but literals are read-only, so this results in undefined behavior when you assign to *it. You need to make a copy of it in the dynamically-allocated memory.
char *word = new char[20];
strcpy(word, "HeLlo");
Also in your loop you should compare *it != '\0'. The end of a string is indicated by the character being the null byte, not the pointer being null.
Given code (as I'm writing this):
char* word;
word = new char[20];
word = "HeLlo";
for (auto it = word; it != NULL; it++){
*it = (char) tolower(*it);
This code has Undefined Behavior in 2 distinct ways, and would have UB also in a third way if only the text data was slightly different:
Buffer overrun: the continuation condition it != NULL will not become false until the pointer it has wrapped around at the end of the address range, if it ever does.
Modifying read-only memory: the pointer word is set to point to the first char of a string literal, and then the loop iterates over that string and assigns to each char.
Passing a possibly negative value to tolower: the char classification functions require a non-negative argument, or else the special value EOF. This works fine with the string "HeLlo" under an assumption of ASCII or an unsigned char type. But in general, e.g. with the string "Blåbærsyltetøy", directly passing each char value to tolower will result in negative values being passed; a correct invocation with ch of type char is (char) tolower( (unsigned char)ch ).
Additionally the code has a memory leak, by allocating some memory with new and then just forgetting about it.
A correct way to code the apparent intent:
using Byte = unsigned char;
auto to_lower( char const c )
-> char
{ return Byte( tolower( Byte( c ) ) ); }
// ...
string word = "Hello";
for( char& ch : word ) { ch = to_lower( ch ); }
There are already two nice answers on how to solve your issues using null-terminated C strings and pointers. For the sake of completeness, here is an approach using C++ strings:
string word; // instead of char*
//word = new char[20]; // no longer needed: strings take care of themselves
word = "HeLlo"; // no need to worry about deallocating previous values: strings take care of themselves
for (auto &it : word) // range-based for, to iterate through all the string elements
    it = (char) tolower((unsigned char) it); // the cast avoids passing a negative value to tolower
It's crashing because you are modifying a string literal.
There are dedicated functions for this: strupr makes a string uppercase and strlwr makes it lowercase. Note that both are non-standard extensions (Visual C++ provides them as _strupr and _strlwr), so they are not available on every compiler.
Here is a usage example:
char upper[] = "make me upper";
printf("%s\n", strupr(upper));
char lower[] = "make me lower";
printf("%s\n", strlwr(lower));

How do I detect white space or numbers when using UTF8CPP?

This is my code:
std::vector<std::string> InverseIndex::getWords(std::string line)
{
std::vector<std::string> words;
char* str = (char*)line.c_str();
char* end = str + strlen(str) + 1;
unsigned char symbol[5] = {0,0,0,0,0};
while( str < end ){
utf8::uint32_t code = utf8::next(str, end);
if(code == 0) continue;
utf8::append(code, symbol);
// TODO detect white spaces or numbers.
std::string word = (const char*)symbol;
words.push_back(word);
}
return words;
}
Input : "你 好 啊 哈哈 1234"
Output :
你
??
好
??
啊
??
哈
哈
??
1??
2??
3??
4??
Expected output :
你
好
啊
哈
哈
Is there any way to skip the white space or numbers? Thanks.
UTF8-CPP is nothing more than a tool for encoding and decoding strings to and from UTF-8. Classification of Unicode code points is well outside the scope of that tool. You'll need to use a serious localization tool like Boost.Locale or ICU for that.
UTF-8 is "ASCII compatible" in the following sense:
If one of the bytes of the encoded string is equal to an ASCII value, such as a space, a newline, or a digit 0-9, then it is not part of an encoded sequence longer than one byte. It is that very character.
This means that you can call isdigit() on a byte in a UTF-8 string as if it were an ASCII string (cast the byte to unsigned char first), and it is guaranteed to work correctly.
For more information, see the section on search at http://utf8everywhere.org.
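Here is a small sketch of that idea which does not depend on UTF8-CPP. The function name get_symbols is made up for illustration, and the input is assumed to be well-formed UTF-8. It only recognises ASCII white space and ASCII digits; Unicode spaces such as U+3000 or full-width digits would still need a library like ICU.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split UTF-8 text into one string per code point,
// dropping ASCII white space and ASCII digits.
std::vector<std::string> get_symbols(const std::string& line)
{
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < line.size()) {
        const unsigned char lead = static_cast<unsigned char>(line[i]);
        if (lead < 0x80) {
            // A byte below 0x80 is always a complete ASCII character in UTF-8.
            if (!std::isspace(lead) && !std::isdigit(lead))
                out.push_back(std::string(1, static_cast<char>(lead)));
            ++i;
            continue;
        }
        // Sequence length from the lead byte: 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4.
        std::size_t len = (lead >= 0xF0) ? 4 : (lead >= 0xE0) ? 3 : (lead >= 0xC0) ? 2 : 1;
        if (i + len > line.size()) len = line.size() - i;  // clamp a truncated tail
        out.push_back(line.substr(i, len));
        i += len;
    }
    return out;
}
```

For the input from the question, this returns the five Chinese characters and skips the spaces and "1234".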

How to read in only a particular number of characters

I have a small question about reading a set of characters from a structure. For example, a particular variable of type char contains the value "3242C976*32". How can I get only the first 8 bits of this variable? Kindly help.
Thanks.
Edit:
I'm trying to read in a signal:
For Ex: $ASWEER,2,X:3242C976*32
into this structure:
struct pg
{
char command[7]; // saves as $ASWEER,2,X:3242C976*32
char comma1[1]; // saves as ,2,X:3242C976*32
char groupID[1]; // saves as 2,X:3242C976*32
char comma2[1]; // etc
char handle[2]; // this is the problem, need it to save specifically each part, buts its not
char canID[8];
char checksum[3];
}m_pg;
...
When memcopying buffer into a structure, it works but because there is no carriage returns it saves the rest of the signal in each char variable. So, there is always garbage at the end.
You could convert the hex value in canID to a float (depending on how you want to display it), e.g.
float value1 = HexToFloat(m_pg.canID); // HexToFloat is not a library function; find or write a conversion routine
CString val;
val.Format("%0.3f", value1);
The garbage values aren't actually stored in the structure; they only appear when the fields are displayed, because nothing terminates each field. So format the message however you want and display it using the CString val.
If "3242C976*3F" is a C string or std::string, you can just do:
const char* str = "3242C976*3F";
char first_byte = str[0];
Or with an arbitrary memory block you can do:
SomeStruct memoryBlock;
char firstByte;
memcpy(&firstByte, &memoryBlock, 1);
Both copy the first 8 bits (1 byte) from the string or arbitrary memory block.
After the edit (original answer below)
Just copy by parts. In C, something like this should work (it could also work in C++ but may not be idiomatic):
strncpy(m_pg.command, value, 7); // m_pg.command[7] = 0; // oops
strncpy(m_pg.comma1, value+7, 1); // m_pg.comma1[1] = 0; // oops
strncpy(m_pg.groupID, value+8, 1); // m_pg.groupID[1] = 0; // oops
strncpy(m_pg.comma2, value+9, 1); // m_pg.comma2[1] = 0; // oops
// etc
Also, you don't have space for the string terminator in the members of the structure (hence the "oops" comments above). They are NOT strings. Do not printf them!
Don't read more than 8 characters. In C, something like
char value[9]; /* 8 characters and a 0 terminator */
int ch;
scanf("%8s", value);
/* optionally ignore further input */
while (((ch = getchar()) != '\n') && (ch != EOF)) /* void */;
/* input terminated with ch (either '\n' or EOF) */
I believe the above code also "works" in C++, but it may not be idiomatic in that language.
If you have a char pointer, you can just set str[8] = '\0'; Be careful though, because if the buffer is less than 8 (EDIT: 9) bytes, this could cause problems.
(I'm just assuming that the name of the variable that already is holding the string is called str. Substitute the name of your variable.)
It looks to me like you want to split at the comma, and save up to there. This can be done with strtok(), to split the string into tokens based on the comma, or strchr() to find the comma, and strcpy() to copy the string up to the comma.
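As a rough C++ sketch of the split-at-delimiter idea, the following uses std::string::find and substr instead of strtok() or strchr(). The struct ParsedSignal, the function parse_signal, and the interpretation of the fields (delimiters ',', ',', ':' and '*' are dropped rather than stored) are assumptions made for illustration, based only on the example signal in the question.

```cpp
#include <cstddef>
#include <string>

// Hypothetical holder for the fields of "$ASWEER,2,X:3242C976*32".
struct ParsedSignal {
    std::string command;   // "$ASWEER"
    std::string groupID;   // "2"
    std::string handle;    // "X"
    std::string canID;     // "3242C976"
    std::string checksum;  // "32"
};

// Split on the delimiters ',', ',', ':' and '*'. Returns false if one is missing.
bool parse_signal(const std::string& s, ParsedSignal& out)
{
    const std::size_t npos = std::string::npos;
    const std::size_t c1 = s.find(',');
    if (c1 == npos) return false;
    const std::size_t c2 = s.find(',', c1 + 1);
    if (c2 == npos) return false;
    const std::size_t colon = s.find(':', c2 + 1);
    if (colon == npos) return false;
    const std::size_t star = s.find('*', colon + 1);
    if (star == npos) return false;

    out.command  = s.substr(0, c1);
    out.groupID  = s.substr(c1 + 1, c2 - c1 - 1);
    out.handle   = s.substr(c2 + 1, colon - c2 - 1);
    out.canID    = s.substr(colon + 1, star - colon - 1);
    out.checksum = s.substr(star + 1);
    return true;
}
```

Because each field is a std::string, every part carries its own length, so no trailing garbage shows up when a field is printed.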

Using Poco XMLWriter with UTF8 strings in C++

I have a problem trying to get my head around using UTF8 with Poco::XML::XMLWriter. In the following code example, everything works fine when the input contains ASCII characters. However, occasionally the string in wordmapIt->first contains a non-ASCII value, such as a character -105 occurring in the middle of a string. When this happens the XML stream seems to terminate on the -105 char even though there are many other words after this one. I want to save whatever string was there, so just stripping the char out isn't the right answer. There's got to be some kind of encoding I can apply (I think), but what?
I'm clearly missing something conceptually, but for the life of me I can't figure out the right way to do this.
Poco::XML::XMLString EDocument::makeXMLString()
{
std::stringstream xmlstream;
Poco::UTF8Encoding utf8encoding;
Poco::XML::XMLWriter writer(xmlstream, 0, "UTF-8", &utf8encoding);
writer.startDocument();
std::map<std::string, std::string>::iterator wordmapIt;
for ( wordmapIt = nodeinfo->wordmap.begin(); wordmapIt != nodeinfo->wordmap.end(); wordmapIt++ )
{
writer.startElement("", "", "word");
writer.characters(Poco::XML::toXMLString(wordmapIt->first));
writer.endElement("", "", "word");
}
writer.endDocument();
return xmlstream.str();
}
Edit:
Solution based on answer below.
Poco::XML::XMLString EDocument::makeXMLString()
{
std::stringstream xmlstream;
Poco::UTF8Encoding utf8encoding;
Poco::XML::XMLWriter writer(xmlstream, 0, "UTF-8", &utf8encoding);
Poco::Windows1252Encoding windows1252encoding;
Poco::TextConverter textconverter(windows1252encoding, utf8encoding);
writer.startDocument();
std::map<std::string, std::string>::iterator wordmapIt;
for ( wordmapIt = nodeinfo->wordmap.begin(); wordmapIt != nodeinfo->wordmap.end(); wordmapIt++ )
{
std::string strword;
textconverter.convert(wordmapIt->first, strword);
writer.startElement("", "", "word");
writer.characters(strword);
writer.endElement("", "", "word");
}
writer.endDocument();
return xmlstream.str();
}
It sounds like you have a byte string in Windows code page 1252 encoding. “Character -105” presumably really means byte 0x97, which would map to Unicode character U+2014 Em Dash (—) in cp1252.
I'm not familiar with Poco, but I would guess you're expected to convert your cp1252 strings to UTF-8 output encoding using a TextConverter with Windows1252Encoding and UTF8Encoding.
Although if what you really have is an “ANSI string” (a byte string in the default code page for the current machine's locale), 1252 might not be the right answer and you might have to use a function from another library to do the conversion properly.
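To make the byte-level picture concrete, here is a library-independent sketch of what such a conversion does for the character in question. The names encode_utf8 and cp1252_byte_to_utf8 are made up for illustration and are not part of Poco. Only the 0x97 entry of the cp1252 table is filled in; the other bytes in the range 0x80 to 0x9F are replaced with U+FFFD here, which a real converter would not do.

```cpp
#include <cstdint>
#include <string>

// Encode a Unicode code point (up to U+10FFFF) as UTF-8.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical, incomplete cp1252 mapping for a single byte.
std::string cp1252_byte_to_utf8(char c)
{
    // A char holding -105 is the byte 0x97 (151) once viewed as unsigned.
    const unsigned char byte = static_cast<unsigned char>(c);
    if (byte < 0x80 || byte >= 0xA0)
        return encode_utf8(byte);    // these bytes equal their Unicode code points
    if (byte == 0x97)
        return encode_utf8(0x2014);  // Em Dash
    return encode_utf8(0xFFFD);      // other 0x80..0x9F bytes: not mapped in this sketch
}
```

The byte 0x97 comes out as the three bytes E2 80 94, which is valid UTF-8 that an XML writer can emit. Poco's TextConverter with Windows1252Encoding and UTF8Encoding performs the same kind of conversion with the full table.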