Trimming UTF8 buffer - c++

I have a buffer with UTF8 data. I need to remove the leading and trailing spaces.
Here is the C code which does it (in place) for ASCII buffer:
char *trim(char *s)
{
while( isspace(*s) )
memmove( s, s+1, strlen(s) );
while( *s && isspace(s[strlen(s)-1]) )
s[strlen(s)-1] = 0;
return s;
}
How to do the same for UTF8 buffer in C/C++?
P.S.
Thanks for the performance tip regarding strlen(). Back to the UTF8 specifics: what if I need to remove all spaces altogether, not only at the beginning and at the tail? I may also need to remove all characters with an ASCII code < 32. Is there anything specific to the UTF8 case here, like using mbstowcs()?

Do you want to remove all of the various Unicode spaces too, or just ASCII spaces? In the latter case you don't need to modify the code at all.
In any case, the method you're using that repeatedly calls strlen is extremely inefficient. It turns a simple O(n) operation into at least O(n^2).
Edit: Here's some code for your updated problem, assuming you only want to strip ASCII spaces and control characters:
unsigned char *in, *out;
for (out = in; *in; in++) if (*in > 32) *out++ = *in;
*out = 0;
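The byte-wise loop above is safe for UTF8 because every byte of a multi-byte UTF8 sequence has a value of 0x80 or higher, so it can never be mistaken for an ASCII space or control character. As a rough sketch, the same filter can be written for a std::string; the function name strip_ascii_space_and_control is made up for this example:

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: removes every byte <= 0x20 (ASCII space and control
// characters) in place. Bytes >= 0x80, i.e. all parts of multi-byte UTF-8
// sequences, are left untouched.
void strip_ascii_space_and_control(std::string& s) {
    s.erase(std::remove_if(s.begin(), s.end(),
                           [](unsigned char b) { return b <= 0x20; }),
            s.end());
}
```

Like the loop it mirrors, this keeps 0x7F (DEL) and does not touch non-ASCII Unicode spaces.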

strlen() scans to the end of the string, so calling it multiple times, as in your code, is very inefficient.
Try looking for the first non-space and the last non-space and then memmove the substring:
char *trim(char *s)
{
char *first;
char *last;
first = s;
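while(isspace((unsigned char)*first)) /* cast: UTF-8 bytes >= 0x80 may be negative as plain char */
++first;
last = first + strlen(first); /* one past the last character */
while(last > first && isspace((unsigned char)last[-1]))
--last;
memmove(s, first, last - first);
s[last - first] = '\0';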
return s;
}
Also remember that the code modifies its argument.
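If the non-ASCII Unicode spaces also need trimming, one possible approach is to compare the ends of the buffer against the UTF8 encodings of the characters you care about. The sketch below assumes valid UTF8 input and C++17; the names trim_utf8 and kUtf8Spaces are invented for this example, and the list of spaces is deliberately incomplete:

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Assumed, non-exhaustive set of whitespace characters as UTF-8 byte sequences.
static const std::string_view kUtf8Spaces[] = {
    " ", "\t", "\n", "\r", "\v", "\f",
    "\xC2\xA0",      // U+00A0 NO-BREAK SPACE
    "\xE2\x80\x83",  // U+2003 EM SPACE
    "\xE3\x80\x80",  // U+3000 IDEOGRAPHIC SPACE
};

// Length in bytes of the whitespace character at the start of sv, or 0.
static std::size_t leading_space_len(std::string_view sv) {
    for (std::string_view sp : kUtf8Spaces)
        if (sv.substr(0, sp.size()) == sp) return sp.size();
    return 0;
}

// Length in bytes of the whitespace character at the end of sv, or 0.
static std::size_t trailing_space_len(std::string_view sv) {
    for (std::string_view sp : kUtf8Spaces)
        if (sv.size() >= sp.size() && sv.substr(sv.size() - sp.size()) == sp)
            return sp.size();
    return 0;
}

// Hypothetical helper: returns a copy of sv without leading/trailing spaces.
std::string trim_utf8(std::string_view sv) {
    while (std::size_t n = leading_space_len(sv)) sv.remove_prefix(n);
    while (std::size_t n = trailing_space_len(sv)) sv.remove_suffix(n);
    return std::string(sv);
}
```

Matching from the end works on valid UTF8 because a lead byte such as 0xC2 or 0xE2 can only appear at the start of a character, so a matching suffix is always a whole character.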

Related

Use C++, How to change the above multiple \n into only one \n?

CString str = _T("111\n\n\n222");
How to change the above multiple \n into only one \n?
Cannot use Replace directly, because the number of \n is not fixed

while (str.Replace(_T("\n\n"), _T("\n")) > 0)
;

You can use CString::GetBuffer to obtain a buffer that you can modify. The corresponding CString::ReleaseBuffer allows you to specify a new length for the string.
If you want to remove consecutive characters, you can do this easily by simply walking through the string and rewriting its characters. Any time you see a character that you wish to remove, simply don't write it and don't update the end-position of the string.
Here's a general-purpose function to remove some number of consecutive characters from a CString:
void LimitConsecutiveCharacters(CString& str, TCHAR ch, int maxConsecutive = 1)
{
LPTSTR begin = str.GetBuffer(0); // LPTSTR is already a pointer to TCHAR
LPTSTR end = begin;
int consecutive = 0;
for (LPTSTR pos = begin; *pos != _T('\0'); ++pos)
{
if (*pos == ch)
{
if (consecutive >= maxConsecutive)
continue;
++consecutive;
}
else
{
consecutive = 0;
}
*end++ = *pos;
}
int newLength = static_cast<int>(end - begin);
str.ReleaseBuffer(newLength);
}
As you can see above, it keeps a count of how many consecutive values it has seen for the target character. If the maximum number of consecutive characters is reached, then it simply moves to the next loop iteration. Any time it sees some other character, the "consecutive" count resets.
The end tracks the position that is being written to, which might even be the same position you're reading from, if you've not removed any characters. At the end, some simple pointer arithmetic calculates the new string length and calls CString::ReleaseBuffer.
An example invocation would be:
CString str = _T("111\n\n\n222");
LimitConsecutiveCharacters(str, _T('\n'));
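For readers without MFC, the same read/write-position idea can be sketched on a std::string. This is only an illustration of the algorithm described above; the name limit_consecutive is made up here:

```cpp
#include <cstddef>
#include <string>

// Hypothetical portable variant: keeps at most max_run consecutive
// occurrences of ch, compacting the string in place.
void limit_consecutive(std::string& s, char ch, std::size_t max_run = 1) {
    std::size_t write = 0;  // next position to write to
    std::size_t run = 0;    // length of the current run of ch
    for (std::size_t read = 0; read < s.size(); ++read) {
        if (s[read] == ch) {
            if (run >= max_run) continue;  // drop the surplus character
            ++run;
        } else {
            run = 0;
        }
        s[write++] = s[read];
    }
    s.resize(write);  // plays the role of ReleaseBuffer(newLength)
}
```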

You can convert your CString into a std::wstring, use regex_replace and then convert back to CString.
The patterns for the regular expression would be something like:
find what: L"\n+"
replace by: L"\n"
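A minimal sketch of the regex step, using std::wregex from the standard library. The helper name collapse_newlines is an assumption for this example, and the conversion to and from CString is left out:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces every run of one or more '\n' with a single '\n'.
std::wstring collapse_newlines(const std::wstring& text) {
    static const std::wregex runs(L"\n+");
    return std::regex_replace(text, runs, L"\n");
}
```

For the example string this turns L"111\n\n\n222" into L"111\n222".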

How to pad char array with empty spaces on left and right hand side of the text

I am fairly new to C++, so for some people the answer to my question might seem quite obvious.
What I want to achieve is a method which returns the given char array padded with spaces before and after it in order to reach a certain length. The effect would be as if the given char array sat in the middle of another, bigger char array.
Let's say we have a char array containing HelloWorld.
I want the method to return a new char array with a length specified beforehand and the given char array "positioned" in the middle of the returned char array.
char ch[] = "HelloWorld";
char ret[20]; // lets say we want to have the resulting char array the length of 20 chars
char ret[20] = "    HelloWorld     "; // this is the result to be expected as return of the method
If the number of padding spaces is odd, I would like the text to sit one space to the left of the middle.
I would also like to avoid memory-hungry strings or any methods that are not in the standard library; keep it as plain as possible.
What would be the best way to tackle this issue? Thanks!

There are mainly two ways of doing this: either using char arrays (C-style strings), as you would in C, or using the std::string type (or a similar type), which is the usual choice when programming in C++, although there are exceptions.
I'm providing one example of each.
First, using arrays: you will need to include the cstring header to use the standard C string manipulation functions. Keep in mind that a C string stored in a char array always ends with the null terminator character '\0' (ASCII code 0), which counts towards the array size, so a DIM-sized array can hold at most DIM - 1 characters of text. Here is the code with comments.
constexpr int DIM = 20;
char ch[] = "HelloWorld";
char ret[DIM] = "";
auto len_ch = std::strlen(ch); // length of ch without '\0' char
auto n_blanks = DIM - len_ch - 1; // number of blank chars needed
auto half_n_blanks = n_blanks / 2; // half of that
// fill in from begin and end of ret with blanks
for (auto i = 0u; i < half_n_blanks; i++)
ret[i] = ret[DIM - i - 2] = ' ';
// copy ch content into ret starting from half_n_blanks position
std::memcpy( // portable; memcpy_s is only available on some toolchains (e.g. MSVC)
ret + half_n_blanks, // start inserting from here
ch, // string we need to copy
len_ch); // length of ch
// if odd, after ch copied chars
// there will be a space left to insert a blank in
if (n_blanks % 2 == 1)
*(ret + half_n_blanks + len_ch) = ' ';
I chose first to insert blank spaces both to the begin and to the end of the string and then to copy the content of ch.
The second approach is far easier (to code and to understand). The maximum number of characters a std::string (defined in the header string) can hold is given by its max_size() member, which is far larger than anything needed here; std::string::npos is a separate constant, the largest value of std::size_t (an unsigned integer type, typically 64 bits wide on 64-bit platforms), used to mean "no position". Basically, you don't have to worry about a std::string's maximum length.
std::string ch = "HelloWorld", ret;
auto ret_max_length = 20;
auto n_blanks = ret_max_length - ch.size();
// insert blanks at the beginning
ret.append(n_blanks / 2, ' ');
// append ch
ret += ch;
// insert blanks after ch
// if odd, simply add 1 to the number of blanks
ret.append(n_blanks / 2 + n_blanks % 2, ' ');
The approach I took here is different, as you can see.
Notice that, because of '\0', the results of these two methods are NOT the same: the array version holds DIM - 1 = 19 visible characters, while the std::string version holds 20. If you want the same behaviour, you may either add 1 to DIM or subtract 1 from ret_max_length.

Assuming that we know the size, size, of the array ret, and knowing that a C string stored in a char array ends with '\0', we first find the length, l, of the input char array ch.
int l = 0;
int i;
for(i=0; ch[i]!='\0'; i++){
l++;
}
Then we compute how many spaces we need on either side. If total_spaces is even, there are equal numbers of spaces on both sides. Otherwise, we can choose which side gets the extra space; in this case, the right side, so the text sits one position to the left of the middle.
int total_spaces = size-l-1; // subtract by 1 to adjust for '\0' character
int spaces_right = 0, spaces_left = 0;
if((total_spaces%2) == 0){
spaces_left = total_spaces/2;
spaces_right = total_spaces/2;
}
else{
spaces_left = total_spaces/2;
spaces_right = (total_spaces/2)+1;
}
Then first add the spaces_left spaces, then the input array, ch, and then the spaces_right spaces to ret.
i=0;
while(spaces_left > 0){
ret[i] = ' ';
spaces_left--;
i++;
} // add spaces
ret[i] = '\0';
strcat(ret, ch); // concatenate ch to ret
i += l; // move i past the copied text so the right-hand spaces go after it
while(spaces_right){
ret[i] = ' ';
spaces_right--;
i++;
}
ret[i] = '\0';
Make sure to include <cstring> to use strcat().
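Put together, the three steps (measure, split the spaces, write left padding, text and right padding) can be sketched as one function. The name center_pad and its signature are assumptions made for this example, not part of the answers above:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: writes src centered in dst, where dst_size is the size
// of dst including the terminating '\0'. Returns false if src does not fit.
bool center_pad(char* dst, std::size_t dst_size, const char* src) {
    const std::size_t len = std::strlen(src);
    if (dst_size == 0 || len > dst_size - 1) return false;
    const std::size_t total_spaces = dst_size - 1 - len;
    const std::size_t spaces_left = total_spaces / 2;  // extra space goes to the right
    std::memset(dst, ' ', dst_size - 1);               // fill everything with spaces
    std::memcpy(dst + spaces_left, src, len);          // drop the text into place
    dst[dst_size - 1] = '\0';
    return true;
}
```

With char ret[20] and "HelloWorld" this leaves 4 spaces on the left and 5 on the right.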

garbage characters in buffer

I have this function.
void cast(char *buf)
{
string str(buf);
string s=str.substr(0,5);
std::transform(s.begin(), s.end(), s.begin(),::toupper);
DemoInput=s;
}
buf is a message that the client sends. I'm trying to take that message and, no matter how long it is, cut it down to five characters and make it uppercase. This works if the message is longer than 5 characters, but if it is shorter there are garbage characters at the end of it.
ex: if buf is "long" then DemoInput becomes "LONG\r"
I thought about using regex ("[:upper:]") but think there must be an easier way to do this.
I find POSIX regex a bit more complicated than Python regex, for example.

If you only need the first 5 characters, don't copy the whole of buf. That just wastes space and time. Also, you shouldn't copy anything past the telnet control character \r.
void cast(char *buf)
{
size_t len = 0;
while (len < 5 && buf[len] != '\0' && buf[len] != '\r') {
++len;
}
string s(buf, len);
std::transform(s.begin(), s.end(), s.begin(),::toupper);
DemoInput=s;
}
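A variant of the same idea that returns the result instead of writing to DemoInput, which makes it easier to try in isolation. The name first_five_upper is made up for this sketch; it also stops at '\n' and casts to unsigned char before calling toupper, both of which are additions over the code above:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: upper-cases at most the first five characters of buf,
// stopping early at '\0', '\r' or '\n'.
std::string first_five_upper(const char* buf) {
    std::string out;
    for (std::size_t i = 0;
         i < 5 && buf[i] != '\0' && buf[i] != '\r' && buf[i] != '\n'; ++i) {
        // toupper expects a value representable as unsigned char
        out += static_cast<char>(std::toupper(static_cast<unsigned char>(buf[i])));
    }
    return out;
}
```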

Why don't you change the code that supplies buf to the cast function? Have it append '\0' to mark the end of the string, as it sounds as though buf may not be null terminated.

Parsing a character array with several null terminated characters into different strings - C++

I asked this question before but with less information than I have now.
What I essentially have is a data block of type char. That block contains filenames that I need to format and put into a vector. I initially thought this char block had three spaces between each filename. Now I realize they are '\0' null characters. So the solution that was provided was fantastic for the example I gave when I thought that there were spaces rather than null chars.
Here is what the structure looks like. Also, I should point out I DO have the size of the character data block.
filename1.bmp\0\0\0brick.bmp\0\0\0toothpaste.gif\0\0\0
The way the best solution did it was this:
// The stringstream will do the dirty work and deal with the spaces.
std::istringstream iss(s);
// Your filenames will be put into this vector.
std::vector<std::string> v;
// Copy every filename to a vector.
std::copy(std::istream_iterator<std::string>(iss),
std::istream_iterator<std::string>(),
std::back_inserter(v));
// They are now in the vector, print them or do whatever you want with them!
for(int i = 0; i < v.size(); ++i)
std::cout << v[i] << "\n";
This works fantastically for my original question, but not now that the separators are null chars instead of spaces. Is there any way to make the above example work? I tried replacing the null chars in the array with spaces but that didn't work.
Any ideas on the best way to format this char block into a vector of strings?
Thanks.

If you know your filenames don't have embedded "\0" characters in them, then this should work. (untested)
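// Example data: every filename is followed by three '\0' bytes.
const char data[] = "filename1.bmp\0\0\0brick.bmp\0\0\0toothpaste.gif\0\0\0";
const char * buffer = data;
size_t size_of_buffer = sizeof(data) - 1; //Or whatever the real value is; -1 drops the literal's implicit terminator
const char * end_of_buffer = buffer + size_of_buffer;
std::vector<std::string> v;
while( buffer < end_of_buffer )
{
v.push_back( std::string(buffer) ); //Copies up to the first '\0'
buffer = buffer+v.back().size()+3; //Skip the filename and its three terminators
}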
If they do have embedded null characters in the filename you'll need to be a little cleverer.
Something like this should work. (untested)
const char * start_of_filename = buffer;
while( start_of_filename < end_of_buffer )
{
//Create a cursor at the start of the current filename and move it until we hit three nulls in a row
//(assumes the buffer ends with a three-null terminator)
const char * scan_cursor = start_of_filename;
while( scan_cursor + 3 <= end_of_buffer &&
!( scan_cursor[0]=='\0' && scan_cursor[1]=='\0' && scan_cursor[2]=='\0' ) )
{
++scan_cursor;
}
//From our start to the cursor is our word.
v.push_back( std::string(start_of_filename,scan_cursor) );
//Move on to the next word
start_of_filename = scan_cursor+3;
}

If spaces would be a suitable separator, you could just replace the null characters by spaces:
std::replace(s.begin(), s.end(), '\0', ' ');
... and go from there. However, I'd suspect that you really need to use the null characters as separators as file names typically can include spaces. In this case, you could either use std::getline() with '\0' as the end of line or use the find() and substr() members of the string itself. The latter would look something like this:
std::vector<std::string> v;
std::string const null(1, '\0');
for (std::string::size_type pos(0); (pos = s.find_first_not_of(null, pos)) != s.npos; )
{
std::string::size_type end = s.find(null, pos);
v.push_back(s.substr(pos, end - pos));
pos = end;
}
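The std::getline() alternative mentioned above could look roughly like this. The function name split_on_nulls is invented for the sketch, and empty fields (produced by the second and third null of each group) are simply skipped:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits a block of size bytes on '\0' and returns the
// non-empty pieces.
std::vector<std::string> split_on_nulls(const char* data, std::size_t size) {
    const std::string block(data, size);  // keeps the embedded '\0' bytes
    std::istringstream iss(block);
    std::vector<std::string> names;
    std::string item;
    while (std::getline(iss, item, '\0')) {
        if (!item.empty()) names.push_back(item);
    }
    return names;
}
```

Because the string is constructed from a pointer and a length, the embedded null characters are preserved instead of ending the string early.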

reading buffer C++

I'm trying to read a buffer in C++ one character at a time until '\n', and fill a char array with these characters using a do-while loop. I know I could use cin.getline(), but I want to try it on my own.
int main()
{
char buffer [1024];
int index = 0;
char temp;
do
{
cin.get( temp );
buffer [ index ] = temp;
index ++;
}
while ( temp != '\n' );
cout << buffer << endl;
return 0;
}
It gives me an incorrect result: the proper text followed by a couple of lines of square brackets mixed with other weird symbols.

First, you have to append '\0' after the text to mark the end of the string.
Adding buffer[ index ] = 0; after the loop does that; use buffer[ index - 1 ] = 0; instead if you also want to overwrite the '\n' that the loop stored.
There are other things you should check as well, although they are not your main problem:
the length of your input, because the buffer is limited: the maximum is 1023 characters plus the null byte
the end of standard input, cin.eof()

You're not null-terminating your buffer.
Try to change the first line to
char buffer[1024] = "";
This will set all characters in buffer to 0. Or, alternatively, terminate the string yourself by doing
buffer[index] = 0;
after the loop.
Also (as correctly pointed out by others), if the text is longer than 1023 characters you'll have a buffer overrun, one of the most often exploited causes of security issues in software.

Two things:
(1) If the length of the line you are reading exceeds 1024, you write past the buffer, which is bad.
(2) If the length is within the limit, you are not terminating the string with a null char.
You can try doing it the following way. If a line exceeds the buffer size, we truncate it, and we add the null char at the end, outside the loop.
#define MAX 1024
int main()
{
char buffer [MAX];
int index = 0;
char temp;
do
{
// buffer full.
if(index == MAX-1)
break;
// stop at end of input or on a read error.
if(!cin.get( temp ))
break;
buffer [ index ] = temp;
index ++;
}
while ( temp != '\n' );
// add null char at the end.
buffer[index] = '\0';
cout << buffer << endl;
return 0;
}

Several issues I noted:
(1) What character encoding is the input in? You could be reading 8, 16, or 32 bit characters. Are you sure you're reading ASCII?
(2) You are searching for '\n', but the end-of-line sequence could be "\r\n", '\r' or '\n' depending on your platform. Perhaps the \r character by itself is your square bracket?

You stop filling the buffer when you get to a newline, so the rest is uninitialised. You can zero-initialise your buffer by defining it with: char buffer[1024] = {0}; This will fix your problem.

You are not putting a '\0' at the end of the string. Additionally, you should really check for buffer overflow conditions: stop reading when index gets to 1023, so that there is still room for the '\0'.
