C++: remove non-UTF-8 data from a string

I am trying to validate that a string is UTF-8.
I have found the function g_utf8_validate from GLib, which returns:
true/false
the location of the end of the valid data that was read from the string
Is there a possibility to go beyond this and also get the valid data after the non-UTF-8 portion? Example:
std::string invalid = "okdata\xa0\xa1morevalid";
Currently I am able to save "okdata", but I would like to get "okdatamorevalid".
Any ideas? Thank you.

You could keep calling g_utf8_validate on the remaining string (skipping one invalid byte each time it stops) to find more valid sections:
#include <iostream>
#include <string>
#include <glib.h>
int main() {
char const *data = "okdata\xa0\xa1morevalid";
std::string s;
// Under the assumption that the string is null-terminated.
// Otherwise you'll have to know the length in advance, pass it to
// g_utf8_validate and reduce it by (pend - p) every iteration. The
// loop condition would then be remaining_size > 0 instead of *pend != '\0'.
for(char const *p = data, *pend = data; *pend != '\0'; p = pend + 1) {
g_utf8_validate(p, -1, &pend);
s.append(p, pend);
}
std::cout << s << std::endl; // prints "okdatamorevalid"
}

You can call it in a loop. Something like this:
std::string sanitize_utf8(const std::string &in) {
std::string result;
const char *ptr = in.data(), *end = ptr + in.size();
while (true) {
const char *ptr2;
g_utf8_validate(ptr, end - ptr, &ptr2);
result.append(ptr, ptr2);
if (ptr2 == end)
break;
ptr = ptr2 + 1;
}
return result;
}
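If linking against GLib is not an option, the same skip-one-byte loop can be sketched with a hand-written check. This is only an illustration under stated assumptions: utf8_sequence_length and strip_invalid_utf8 are hypothetical names, not GLib functions, and the checks (continuation bytes, overlong forms, surrogates, the U+10FFFF limit) are an approximation that may not agree with g_utf8_validate in every corner case.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of GLib): returns the length in bytes (1-4)
// of a well-formed UTF-8 sequence starting at s[i], or 0 if there is none.
// Assumes i < s.size().
std::size_t utf8_sequence_length(const std::string &s, std::size_t i) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    unsigned long code_point = 0;
    unsigned long smallest = 0;  // smallest code point allowed for this length

    if (lead < 0x80) {
        return 1;  // plain ASCII
    } else if ((lead & 0xE0) == 0xC0) {
        len = 2; code_point = lead & 0x1F; smallest = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
        len = 3; code_point = lead & 0x0F; smallest = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
        len = 4; code_point = lead & 0x07; smallest = 0x10000;
    } else {
        return 0;  // stray continuation byte or invalid lead byte
    }

    if (len > s.size() - i) return 0;  // sequence is cut off by the end

    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char next = static_cast<unsigned char>(s[i + k]);
        if ((next & 0xC0) != 0x80) return 0;  // not a continuation byte
        code_point = (code_point << 6) | (next & 0x3F);
    }
    if (code_point < smallest) return 0;                         // overlong form
    if (code_point > 0x10FFFF) return 0;                         // out of range
    if (code_point >= 0xD800 && code_point <= 0xDFFF) return 0;  // surrogate
    return len;
}

// Copies every well-formed sequence and skips one byte whenever the bytes at
// the current position are not valid, like the loops above.
std::string strip_invalid_utf8(const std::string &in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const std::size_t len = utf8_sequence_length(in, i);
        if (len == 0) {
            ++i;  // drop the offending byte and try again
            continue;
        }
        out.append(in, i, len);
        i += len;
    }
    return out;
}
```

Called on the string from the question, strip_invalid_utf8("okdata\xa0\xa1morevalid") drops the two stray bytes and keeps the rest.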

Related

Longest palindromic substring: AddressSanitizer reports a heap overflow

#include<string>
#include<cstring>
class Solution {
void shift_left(char* c, const short unsigned int bits) {
const unsigned short int size = sizeof(c);
memmove(c, c+bits, size - bits);
memset(c+size-bits, 0, bits);
}
public:
string longestPalindrome(string s) {
char* output = new char[s.length()];
output[0] = s[0];
string res = "";
char* n = output;
auto e = s.begin() + 1;
while(e != s.end()) {
char letter = *e;
char* c = n;
(*++n) = letter;
if((letter != *c) && (c == &output[0] || letter != (*--c)) ) {
++e;
continue;
}
while((++e) != s.end() && c != &output[0]) {
if((letter = *e) != (*--c)) {
const unsigned short int bits = c - output + 1;
shift_left(output, bits);
n -= bits;
break;
}
(*++n) = letter;
}
string temp(output);
res = temp.length() > res.length()? temp : res;
shift_left(output, 1);
--n;
}
return res;
}
};
The input is longestPalindrome("babad");
The program works fine and prints "bab" as the longest palindrome, but there is a heap overflow somewhere. An error like this appears:
Read of size 6 at ...memory address... thread T0
"babad" has size 5, and after going over this for an hour I don't see where the iteration ever exceeds 5.
There are three pointers here that iterate:
e, the current element of string s.
n, the pointer to the next char of output.
c, a copy of n that is decremented until it reaches the address &output[0].
Maybe it's something with memmove or memset, since I've never used them before.
I'm completely lost.
TL;DR: mixing char* and std::string is not a good idea unless you understand exactly how each of them works.
If you want the length of a string, you can't write const unsigned short int size = sizeof(c);. sizeof returns the size of the pointer (commonly 4 on a 32-bit machine and 8 on a 64-bit machine), not the length of the string. Use this instead: const size_t size = strlen(c);
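The difference can be checked with a small sketch. The two function names are made up for illustration; the only point is that sizeof applied to a pointer parameter does not depend on the string it points to.

```cpp
#include <cstddef>
#include <cstring>

// sizeof on a pointer parameter yields the size of the pointer itself,
// no matter how long the string behind it is.
std::size_t size_via_sizeof(const char *c) {
    return sizeof(c);
}

// strlen walks the bytes up to the terminating '\0' and counts them.
std::size_t size_via_strlen(const char *c) {
    return std::strlen(c);
}
```

With a 5-character string such as "babad", size_via_strlen returns 5, while size_via_sizeof returns sizeof(char*) for any input.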
AddressSanitizer is right: you are (indirectly) reading memory that does not belong to you.
How does the std::string constructor that takes a char* work?
Answer: the char* is treated as a C-style string, which means it must be terminated by a null character '\0'.
More details: the constructor calls a strlen-like function that looks roughly like this:
https://en.cppreference.com/w/cpp/string/byte/strlen
int strlen(char *begin){
int k = 0;
while (*begin != '\0'){
++k;
++begin;
}
return k;
}
If the C-style string does not contain a '\0', this loop runs past the end of the buffer and accesses memory that does not belong to you.
How to fix?
Answer (two options):
do not mix char* and std::string
replace char* output = new char[s.length()]; with char* output = new char[s.length() + 1]; memset(output, 0, s.length() + 1);
You must also delete all memory that you allocated with new, so add delete[] output; before return res;
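As an illustration of the first option, here is a minimal sketch that stays entirely within std::string. Note that it uses the common expand-around-center technique, which is not the algorithm from the question, and longest_palindrome is a hypothetical free function rather than the Solution method.

```cpp
#include <cstddef>
#include <string>

// Expand-around-center sketch: every index is tried as the middle of an
// odd-length palindrome (width 0) and of an even-length one (width 1).
std::string longest_palindrome(const std::string &s) {
    const std::size_t n = s.size();
    std::size_t best_start = 0;
    std::size_t best_len = 0;
    for (std::size_t center = 0; center < n; ++center) {
        for (std::size_t width = 0; width < 2; ++width) {
            std::size_t left = center;
            std::size_t right = center + width;
            // Grow outwards while both ends match and stay inside the string.
            while (right < n && s[left] == s[right]) {
                if (right - left + 1 > best_len) {
                    best_start = left;
                    best_len = right - left + 1;
                }
                if (left == 0) break;  // cannot move further left
                --left;
                ++right;
            }
        }
    }
    return s.substr(best_start, best_len);
}
```

Because no raw buffer is allocated, there is no terminator to forget and nothing to delete[].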

How to split a string by another string in Arduino?

I have a character array like below:
char array[] = "AAAA... A1... 3. B1.";
How can I split this array by the string "..." in Arduino? I have tried:
ptr = strtok(array, "...");
and the output is the following:
AAAA,
A1,
3,
B1
But I actually want output to be
AAAA,
A1,
3.B1.
How to get this output?
edit:
My full code is this:
char array[] = "AAAA... A1... 3. B1.";
char *strings[10];
char *ptr = NULL;
void setup()
{
Serial.begin(9600);
byte index = 0;
ptr = strtok(array, "..."); // takes a list of delimiters
while(ptr != NULL)
{
strings[index] = ptr;
index++;
ptr = strtok(NULL, "..."); // takes a list of delimiters
}
for(int n = 0; n < index; n++)
{
Serial.println(strings[n]);
}
}
The main problem is that strtok does not find a string inside another string. strtok looks for a character in a string. When you give multiple characters to strtok it looks for any of these. Consequently, writing strtok(array, "..."); is exactly the same as writing strtok(array, ".");. That is why you get a split after "3."
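That equivalence can be checked with a short sketch. tokenize_with_strtok is a hypothetical helper written for this illustration; it takes the text by value because strtok overwrites delimiters in the buffer it scans.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Collects the tokens that strtok produces for the given delimiter set.
// The text is taken by value so that strtok can modify the local copy.
std::vector<std::string> tokenize_with_strtok(std::string text, const char *delims) {
    std::vector<std::string> tokens;
    for (char *tok = std::strtok(&text[0], delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        tokens.push_back(tok);
    }
    return tokens;
}
```

For the string from the question, passing "..." and passing "." give the same four tokens, which is why the unwanted split after "3" appears.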
There are multiple ways of doing what you want. Below is an example using strstr. Unlike strtok, the strstr function does find a substring inside a string, which is just what you are looking for. However, strstr is not a tokenizer, so some extra code is required to print the substrings.
Something like this should do:
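#include <stdio.h>
#include <string.h>
int main()
{
char array[] = "AAAA... A1... 3. B1.";
char* ps = array;
char* pf = strstr(ps, "..."); // Find first delimiter
while(pf)
{
int len = (int)(pf - ps); // Number of chars to print
printf("%.*s\n", len, ps);
ps = pf + 3; // Skip the 3-character delimiter
pf = strstr(ps, "..."); // Find next delimiter
}
if(*ps != '\0')
printf("%s\n", ps); // Print what is left after the last delimiter
return 0;
}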
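The same search-for-the-whole-delimiter idea can be sketched in C++ with std::string::find, collecting the pieces in a vector instead of printing them. split_on_substring is a hypothetical name, and on an Arduino the String class would need a different implementation; treat this as a desktop C++ illustration only.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits text at every occurrence of the whole delimiter string.
// An empty delimiter yields the text unchanged as a single piece.
std::vector<std::string> split_on_substring(const std::string &text,
                                            const std::string &delim) {
    std::vector<std::string> pieces;
    if (delim.empty()) {
        pieces.push_back(text);
        return pieces;
    }
    std::size_t start = 0;
    std::size_t found = text.find(delim, start);
    while (found != std::string::npos) {
        pieces.push_back(text.substr(start, found - start));
        start = found + delim.size();  // continue after the delimiter
        found = text.find(delim, start);
    }
    pieces.push_back(text.substr(start));  // tail after the last delimiter
    return pieces;
}
```

For "AAAA... A1... 3. B1." and the delimiter "...", this gives three pieces, and the single dots in the last piece are left alone. The leading spaces are kept, so they would still have to be trimmed to get exactly the output asked for.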
You can implement your own split that works like strtok, except that the second argument is a whole delimiter string rather than a set of delimiter characters:
#include <stdio.h>
#include <string.h>
char * split(char *str, const char * delim)
{
static char * s;
char * p, * r;
if (str != NULL)
s = str;
p = strstr(s, delim);
if (p == NULL) {
if (*s == 0)
return NULL;
r = s;
s += strlen(s);
return r;
}
r = s;
*p = 0;
s = p + strlen(delim);
return r;
}
int main()
{
char s[] = "AAAA... A1... 3. B1.";
char * p = s;
char * t;
while ((t = split(p, "...")) != NULL) {
printf("'%s'\n", t);
p = NULL;
}
return 0;
}
Compilation and execution:
/tmp % gcc -g -pedantic -Wextra s.c
/tmp % ./a.out
'AAAA'
' A1'
' 3. B1.'
/tmp %
I print the results between '' to show the spaces that are returned. I am not sure you want them; if you do not, the delimiter is not just "..." in that case.
Because you tagged this as c++, here is a c++ 'version' of your code:
#include <iostream>
using std::cout;
using std::endl;
#include <vector>
using std::vector;
#include <string>
using std::string;
class T965_t
{
string array;
vector<string> strings;
public:
T965_t() : array("AAAA... A1... 3. B1.")
{
strings.reserve(10);
}
~T965_t() = default;
int operator()() { return setup(); } // functor entry
private: // methods
int setup()
{
cout << endl;
const string pat1 ("... ");
string s1 = array; // working copy
size_t indx = s1.find(pat1, 0); // find first ... pattern
// start search at ---------^
do
{
if (string::npos == indx) // pattern not found
{
strings.push_back (s1); // capture 'remainder' of s1
break; // not found, kick out
}
// else
// extract --------vvvvvvvvvvvvvvvvv
strings.push_back (s1.substr(0, indx)); // capture
// capture to vector
indx += pat1.size(); // i.e. 4
s1.erase(0, indx); // erase previous capture
indx = s1.find(pat1, 0); // find next
} while(true);
for(size_t n = 0; n < strings.size(); n++)
cout << strings[n] << "\n";
cout << endl;
return 0;
}
}; // class T965_t
int main(int , char**) { return T965_t()(); } // call functor
With output:
AAAA
A1
3. B1.
Note: I leave changing "3. B1." to "3.B1.", and adding commas at end of each line (except the last) as an exercise for the OP if required.
I looked for a split function and did not find one that met my requirements, so I wrote my own. It works for me so far; I will probably improve it in the future, but it got me out of trouble.
There is also the strtok function, which may be the better choice:
https://www.delftstack.com/es/howto/arduino/arduino-strtok/
Here is my split function.
Arduino code:
void split(String * vecSplit, int dimArray,String content,char separator){
if(content.length()==0)
return;
content = content + separator;
int countVec = 0;
int posSep = 0;
int posInit = 0;
while(countVec<dimArray){
posSep = content.indexOf(separator,posSep);
if(posSep<0){
return;
}
String splitStr = content.substring(posInit,posSep);
posSep = posSep+1;
posInit = posSep;
vecSplit[countVec] = splitStr; // store at the current index, then advance once
countVec++;
}
}
Function call:
String smsContent = "APN:4g.entel;DOMAIN:domolin.com;DELAY_GPS:60";
String vecSplit[10];
split(vecSplit,10,smsContent,';');
for(int i = 0;i<10;i++){
Serial.println(vecSplit[i]);
}
String input:
APN:4g.entel;DOMAIN:domolin.com;DELAY_GPS:60
Output:
APN:4g.entel
DOMAIN:domolin.com
DELAY_GPS:60
RESET:true

Read CString from buffer with unknown length?

Let's say I have a file. I read all the bytes into an unsigned char buffer. From there I'm trying to read a C string (null-terminated) without knowing its length.
I tried the following:
char* Stream::ReadCString()
{
char str[0x10000];
int len = 0;
char* pos = (char*)(this->buffer[this->position]);
while(*pos != 0)
str[len++] = *pos++;
this->position += len+ 1;
return str;
}
I thought I could fill up each char in the str array as I went through, checking if the char was null terminated or not. This is not working. Any help?
this->buffer = array of bytes
this->position = position in the array
Are there any other methods to do this? I guess I could run it by the address of the actual buffer:
str[len++] = *(char*)(this->buffer[this->position++]) ?
Update:
My new function:
char* Stream::ReadCString()
{
this->AdvPosition(strlen((char*)&(this->buffer[this->position])) + 1);
return (char*)&(this->buffer[this->position]);
}
and calling it with:
printf( "String: %s\n", s.ReadCString()); //tried casting to char* as well just outputs blank string
Check this:
#include <cctype>
#include <cstring>
#include <iostream>
class A
{
unsigned char buffer[4096];
int position;
public:
A() : position(0)
{
memset(buffer, 0, 4096);
char *pos = reinterpret_cast<char*>(&(this->buffer[50]));
strcpy(pos, "String");
pos = reinterpret_cast<char*>(&(this->buffer[100]));
strcpy(pos, "An other string");
}
const char *ReadString()
{
if (this->position != 4096)
{
// Check the bound first so that buffer[4096] is never read.
while (this->position != 4096 && !std::isalpha(this->buffer[this->position]))
this->position++;
if (this->position == 4096)
return 0;
void *tmp = &(this->buffer[this->position]);
char *str = static_cast<char *>(tmp);
this->position += strlen(str);
return (str);
}
return 0;
}
};
The reinterpret_casts are only for the initialization; you will not need them, since you are reading from a file.
int main()
{
A test;
// ReadString() returns a null pointer once no more strings are found,
// and streaming a null const char* is undefined behavior, so stop there.
while (const char *str = test.ReadString())
std::cout << str << std::endl;
}
http://ideone.com/LcPdFD
Edit: I have changed the end of ReadString().
str is a local array. Using a pointer to str outside the function is undefined behavior (see "Undefined, unspecified and implementation-defined behavior"); it might or might not cause a noticeable problem.
Null termination is probably the best way to go as long as you're careful, but the reason it's not working for you is most likely that you are returning memory that was allocated on the stack. That memory is released as soon as the function returns, so using the returned pointer is undefined behaviour. Instead, allocate your chars on the heap:
char* str = new char[0x10000];
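and free the memory with delete[] when the caller doesn't need it anymore. Also remember to write the terminating '\0' to str[len] after the loop, because the loop stops before copying it.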
It can be fixed as follows. The problem was that I was advancing the position first and only then taking the address, so the returned pointer was already past the string.
char* Stream::ReadCString()
{
u64 str_len = strlen((char*)&(this->buffer[this->position])) + 1;
this->AdvPosition(str_len);
return (char*)&(this->buffer[this->position - str_len]);
}
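A bounds-checked variant can be sketched as follows. ByteStream and read_cstring are hypothetical stand-ins for the Stream class from the question, and the sketch copies the text into a std::string instead of returning a pointer into the buffer; it also refuses to read when no terminator is found before the end of the data, which strlen cannot detect.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the question's Stream class.
struct ByteStream {
    std::vector<unsigned char> buffer;
    std::size_t position = 0;

    // Reads one null-terminated string starting at `position` into `out`
    // and moves `position` past the terminator. Returns false when the
    // buffer is exhausted or no terminator exists in the remaining bytes.
    bool read_cstring(std::string &out) {
        if (position >= buffer.size()) return false;
        const unsigned char *start = buffer.data() + position;
        const std::size_t remaining = buffer.size() - position;
        const void *nul = std::memchr(start, 0, remaining);
        if (nul == nullptr) return false;
        const std::size_t len = static_cast<std::size_t>(
            static_cast<const unsigned char *>(nul) - start);
        out.assign(reinterpret_cast<const char *>(start), len);
        position += len + 1;  // skip the string and its terminator
        return true;
    }
};
```

The length is computed before the position is advanced, which avoids the ordering mistake described above.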
Hope this helps anyone.

I get a number 2 when I reverse my string

I wrote this code to reverse strings. It works well, but when I enter short strings like "american beauty," it actually prints "ytuaeb nacirema2." This is my code. I would like to know what is wrong with my code that prints a random 2 at the end of the string. Thanks
// This program prompts the user to enter a string and displays it backwards.
#include <iostream>
#include <cstdlib>
using namespace std;
void printBackwards(char *strPtr); // Function prototype
int main() {
const int SIZE = 50;
char userString[SIZE];
char *strPtr;
cout << "Please enter a string (up to 49 characters)";
cin.getline(userString, SIZE);
printBackwards(userString);
}
//**************************************************************
// Definition of printBackwards. This function receives a *
// pointer to character and inverts the order of the characters*
// within it. *
//**************************************************************
void printBackwards(char *strPtr) {
const int SIZE = 50;
int length = 0;
char stringInverted[SIZE];
int count = 0;
char *strPtr1 = 0;
int stringSize;
int i = 0;
int sum = 0;
while (*strPtr != '\0') {
strPtr++; // Set the pointer at the end of the string.
sum++; // Add to sum.
}
strPtr--;
// Save the contents of strPtr on stringInverted on inverted order
while (count < sum) {
stringInverted[count] = *strPtr;
strPtr--;
count++;
}
// Add '\0' at the end of stringSize
stringInverted[count] == '\0';
cout << stringInverted << endl;
}
Thanks.
Your null termination is wrong. You're using == instead of =. You need to change:
stringInverted[count] == '\0';
into
stringInverted[count] = '\0';
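With == the line only compares the array element to '\0' and discards the result, so the buffer keeps whatever bytes happened to follow the copied characters. A minimal sketch of the corrected copy, using a hypothetical helper reverse_into that writes into a caller-provided buffer:

```cpp
#include <cstddef>
#include <cstring>

// Writes the reverse of src into dst and terminates it.
// Returns false if dst (of capacity dst_size) is too small.
bool reverse_into(const char *src, char *dst, std::size_t dst_size) {
    const std::size_t len = std::strlen(src);
    if (len + 1 > dst_size) return false;  // need room for the '\0'
    for (std::size_t i = 0; i < len; ++i) {
        dst[i] = src[len - 1 - i];
    }
    dst[len] = '\0';  // assignment (=), not comparison (==)
    return true;
}
```

Because the terminator is actually stored, the output stops after the last copied character even if the destination buffer held other data before.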
// Add '\0' at the end of stringSize
stringInverted[count] == '\0';
Should use = here.
What is wrong with your code is that you do not use strlen to get the length of the string, and that you use fixed-size arrays instead of dynamic allocation (malloc or, gasp, new[]) or std::string (this is C++!). Even in plain C, prefer strlen over a hand-written loop, because library implementations are usually hand-optimized for the processor. What is worse, the result (stringInverted) lives in the function's stack frame: that is fine as long as you only print it inside the function, but a pointer to it would become invalid as soon as the function returns.
To reverse a string in C++ you do this:
#include <iostream>
#include <string>
int main() {
std::string s = "asdfasdf";
std::string reversed (s.rbegin(), s.rend());
std::cout << reversed << std::endl;
}
To reverse a string in C99 you do this:
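#include <stdlib.h>
#include <string.h>
char *reverse(const char *string) {
size_t length = strlen(string);
char *rv = (char*)malloc(length + 1);
if (rv == NULL)
return NULL; // allocation failed
// Fill rv from the back while reading string from the front.
for (size_t i = 0; i < length; i++) {
rv[length - 1 - i] = string[i];
}
rv[length] = '\0';
return rv;
}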
and remember to free the returned pointer after use. All other answers so far are blatantly wrong :)

Char array sorting and removing duplicates

I am trying to do some array manipulations.
I am doing char array sorting and duplicates removal here.
Your comments are welcome. I haven't done much testing or error handling here, though.
#include<stdafx.h>
#include<stdlib.h>
#include<stdio.h>
#include<string>
using namespace std;
void sort(char *& arr)
{
char temp;
for(int i=0;i<strlen(arr);i++)
{
for(int j=i+1;j<strlen(arr);j++)
{
if(arr[i] > arr[j])
{
temp = arr[i];
arr[i] = arr[j];
arr[j] = temp;
}
}
}
}
bool ispresent(char *uniqueArr, char * arr)
{
bool isfound = false;
for(int i=0;i<strlen(arr);i++)
{
for(int j=0;j<=strlen(uniqueArr);j++)
{
if(arr[i]== uniqueArr[j])
{
isfound = true;
return isfound;
}
else
isfound = false;
}
}
return isfound;
}
char * removeduplicates(char *&arr)
{
char * uniqqueArr = strdup(""); // To make this char array modifiable
int index = 0;
bool dup = false;
while(*arr!=NULL)
{
dup = ispresent(uniqqueArr, arr);
if(dup == true)
{}//do nothing
else// copy the char to new char array.
{
uniqqueArr[index] = *arr;
index++;
}
arr++;
}
return uniqqueArr;
}
int main()
{
char *arr = strdup("saaangeetha");
// if strdup() is not used , access violation writing to
//location occurs at arr[i] = arr[j].
//This makes the constant string modifiable
sort(arr);
char * uniqueArr = removeduplicates(arr);
}
If you use std::string, your code (which is actually C-style) can be written in C++ style in just these lines:
#include <iostream>
#include <string>
#include <algorithm>
int main() {
std::string s= "saaangeetha";
std::sort(s.begin(), s.end());
std::string::iterator it = std::unique (s.begin(), s.end());
s.resize( it - s.begin());
std::cout << s ;
return 0;
}
Output: (all duplicates removed)
aeghnst
Demo : http://ideone.com/pHpPh
If you want char* at the end, then you can do this:
const char *uniqueChars = s.c_str(); //after removing the duplicates!
If I were doing it, I think I'd do the job quite a bit differently. If you can afford to ignore IBM mainframes, I'd do something like this:
unsigned long bitset = 0;
const char *arr = "saaangeetha";
const char *pos;
for (pos=arr; *pos; ++pos)
if (isalpha((unsigned char)*pos))
bitset |= 1UL << (tolower((unsigned char)*pos)-'a');
This associates one bit in bitset with each possible letter. It then walks through the string and for each letter in the string, sets the associated bit in bitset. To print out the letters once you're done, you'd walk through bitset and print out the associated letter if that bit was set.
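The printing step described above can be sketched as a complete function. unique_letters is a hypothetical name, and the sketch keeps the same assumption as the snippet: the letters 'a' to 'z' have consecutive character codes, which is exactly what does not hold on the IBM mainframes mentioned.

```cpp
#include <cctype>
#include <string>

// Returns each letter that occurs in text once, lower-cased, in
// alphabetical order. Assumes 'a'..'z' are consecutive (e.g. ASCII).
std::string unique_letters(const std::string &text) {
    unsigned long seen = 0;
    for (unsigned char ch : text) {
        if (std::isalpha(ch)) {
            seen |= 1UL << (std::tolower(ch) - 'a');  // mark this letter
        }
    }
    std::string result;
    for (int bit = 0; bit < 26; ++bit) {
        if (seen & (1UL << bit)) {
            result += static_cast<char>('a' + bit);  // letter for this bit
        }
    }
    return result;
}
```

Walking the bits from lowest to highest is what produces the sorted order, so no separate sort is needed.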
If you do care about IBM mainframes, you can add a small lookup table:
static char const *letters = "abcdefghijklmnopqrstuvwxyz";
and use strchr to find the correct position for each letter.
Edit: If you're using C++ rather than C (as the tag said when I wrote what's above), you can simplify the code a bit at the expense of using some extra storage (and probably being minutely slower):
std::string arr = "saaangeetha";
std::set<char> letters((arr.begin()), arr.end());
std::copy(letters.begin(), letters.end(), std::ostream_iterator<char>(std::cout, " "));
Note, however, that while these appear the same for the test input, they can behave differently -- the previous version screens out anything but letters (and converts them all to lower case), but this distinguishes upper from lower case, and shows all non-alphabetic characters in the output as well.
char *arr = "saaangeetha";
Here arr points to a read-only section where the string literal is stored, so it cannot be modified; that is the reason for the access violation. Instead, you need to write:
char arr[] = "saaangeetha"; // The array holds a modifiable copy of the string literal.