How to find the length of a string in C++ - c++

I am writing a program and I need to write a function that returns the number of characters and spaces in a string. I have a string (mystring) that the user writes, and I want the function to return the exact number of letters and spaces in the string. For example, "Hello World" should return 11, since there are 10 letters and 1 space. I know string::size exists, but this returns the size in bytes, which is of no use to me.

I'm not sure if you want the length of the string in characters or you just want to count the number of letters and spaces.
There is no specific function that lets you count just letters and spaces, however you can get the amount of letters and spaces (and ignore all other types of characters) quite simply:
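#include <string>
#include <algorithm>
#include <cctype>
#include <iostream>
int main() {
std::string mystring = "Hello 123 World";
// The lambda takes unsigned char: passing a negative char to isspace/isalpha is undefined behavior
auto l = std::count_if(mystring.begin(), mystring.end(), [](unsigned char c){ return std::isspace(c) || std::isalpha(c); });
std::cout << l << '\n'; // 10 letters + 2 spaces = 12
return 0;
}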
Otherwise, unless you use non-ASCII strings, std::string::length should work for you.
In general, it's not so simple, and you're quite right if you assumed that one byte doesn't necessarily mean one character. However, if you're just learning, you don't have to deal with Unicode and the accompanying nastiness yet. For now you can assume 1 byte is 1 character; just know that it's not generally true.
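For readers who do need a character count for non-ASCII text, here is a minimal sketch, assuming the string is valid UTF-8. The function name utf8_length is made up for this illustration. It counts code points by skipping continuation bytes, so a letter followed by a combining accent still counts as two.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 encoded string.
// Continuation bytes always have the bit pattern 10xxxxxx, so every byte
// that does NOT match that pattern starts a new code point.
// Assumption: the input is valid UTF-8 (no validation is done here).
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For plain ASCII input this returns the same value as std::string::length; for "h\xC3\xA9llo" (héllo in UTF-8) length() reports 6 bytes while the sketch reports 5 code points.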

Your first aim should be to figure out whether the string is ASCII encoded or encoded with a multi-byte format.
For ASCII, string::size would suffice. You could use the length() member function of string as well; it returns the same value.
In the latter case you need to find the number of bytes per character.

string::size already returns the number of char elements in the string, so for a single-byte encoding such as ASCII it is the character count.
Dividing by the size of an element, as in int len = mystring.size() / sizeof(char);, changes nothing, because sizeof(char) is 1 by definition.
sizeof is a built-in operator, not a library function, so no header is needed for it.

You can make your own function to get the length of string in C++ (For std::string)
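#include <iostream>
#include <string>
using namespace std;
// Counts characters up to the first '\0'; for a string without embedded
// zeros this equals str.length().
size_t get_len(const string& str){
const char *ptr = str.c_str(); // c_str() is always null-terminated
size_t len = 0;
while(ptr[len] != '\0')
{
len++;
}
return len;
}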
To use this function, simply use:
get_len("str");

Related

Forcing format_to_n to use terminating zero

Beside most common (format) function C++20 also comes with format_to_n that takes output iterator and count.
What I am looking for is the way to make sure that in case I ran out of space that my string is still zero terminated.
For example I want the following program to output 4 instead of 42.
#include<string>
#include<iostream>
#define FMT_HEADER_ONLY
#include <fmt/format.h>
void f(char* in){
fmt::format_to_n(in, 2,"{}{}", 42,'\0');
std::cout << in;
}
int main(){
char arr[]= "ABI";
f(arr);
}
Is this possible without me manually doing the comparison of number of written chars and max len I provided to function?
If you are wondering why I use '\0' as an argument:
I have no idea how to put terminating char in format string.
note: I know that for one argument I can specify max len with :. but I would like a solution that works for multiple arguments.
format_to_n returns a result. You can use that struct:
void f(char* in){
auto [out, size] = fmt::format_to_n(in, 2, "{}", 42);
*out = '\0';
std::cout << in;
}
Note that this might write "42\0" into in, so adjust your capacity as appropriate (2 for a buffer of size 3 is correct).
format_to_n returns a struct containing, among other things, the iterator past the last character written. So it's quite easy to simply check the difference between that iterator and the original iterator against the maximum number of characters, and insert a \0 where appropriate:
void f(char* in)
{
const int max_chars = 2;
auto fmt_ret = fmt::format_to_n(in, max_chars,"{}", 42);
char *last = fmt_ret.out;
if(last - in == max_chars)
--last;
*last = '\0';
std::cout << in;
}
Note that this assumes that the array only holds exactly the number of characters (including the NUL terminator) as the number you attempted to pass to format_to_n. The above code will therefore overwrite the last character written with a NUL terminator, essentially doing further truncation.
If instead you pass to format_to_n the number of characters in the array - 1, then you can simply always write the NUL terminator to fmt_ret.out itself.

UTF-8, sprintf, strlen, etc

I try to understand how to handle basic UTF-8 operations in C++.
Let's say we have this scenario: User inputs a name, it's limited to 10 letters (symbols in user's language, not bytes), it's being stored.
It can be done this way in ASCII.
// ASCII
char * input; // user's input
char buf[11]; // 10 letters + zero
snprintf(buf,11,"%s",input); buf[10]=0;
int len= strlen(buf); // return 10 (correct)
Now, how to do it in UTF-8? Let's assume it's up to 4 bytes charset (like Chinese).
// UTF-8
char * input; // user's input
char buf[41]; // 10 letters * 4 bytes + zero
snprintf(buf,41,"%s",input); //?? makes no sense, it limits by number of bytes not letters
int len= strlen(buf); // return number of bytes not letters (incorrect)
Can it be done with standard sprintf/strlen? Are there any replacements of those function to use with UTF-8 (in PHP there was mb_ prefix of such functions IIRC)? If not, do I need to write those myself? Or maybe do I need to approach it another way?
Note: I would prefer to avoid wide characters solution...
EDIT: Let's limit it to Basic Multilingual Plane only.
I would prefer to avoid wide characters solution...
Wide characters are just not enough, because if you need 4 bytes for a single glyph, then that glyph is likely to be outside the Basic Multilingual Plane, and it will not be represented by a single 16 bits wchar_t character (assuming wchar_t is 16 bits wide which is just the common size).
You will have to use a true Unicode library to convert the input to a list of Unicode characters in Normalization Form C (canonical composition, NFC) or its compatibility equivalent (NFKC)(*), depending on whether, for example, you want to count one or two characters for the ligature ff (U+FB00). AFAIK, your best bet is ICU.
(*) Unicode allows multiple representations of the same glyph, notably the normal composed form (NFC) and the normal decomposed form (NFD). For example, the French character é can be represented in NFC as U+00E9 (LATIN SMALL LETTER E WITH ACUTE) or in NFD as U+0065 U+0301 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT), which is also displayed as é.
References and other examples on Unicode equivalence
strlen only counts the bytes in the input string, until the terminating NUL.
On the other hand, you seem interested in the glyph count (what you called "symbols in user's language").
The process is complicated by UTF-8 being a variable length encoding (as is, in a kind of lesser extent, also UTF-16), so code points can be encoded using one up to four bytes. And there are also Unicode combining characters to consider.
To my knowledge, there's nothing like that in the standard C++ library. However, you may have better luck using third party libraries like ICU.
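If a third-party library is not an option and code points are an acceptable approximation of "letters", the limit can be applied by scanning lead bytes. This is a minimal sketch, assuming valid UTF-8 input. The name utf8_truncate is hypothetical, and the function is not normalization-aware, so it may cut between a base letter and its combining mark.

```cpp
#include <cstddef>
#include <string>

// Returns the longest prefix of `in` that holds at most `max_points`
// code points. Assumption: `in` is valid UTF-8.
std::string utf8_truncate(const std::string& in, std::size_t max_points) {
    std::size_t points = 0;
    std::size_t i = 0;
    for (; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if ((c & 0xC0) != 0x80) {        // not a continuation byte: a new code point starts here
            if (points == max_points) {
                break;                   // cut right before this code point
            }
            ++points;
        }
    }
    return in.substr(0, i);
}
```

With this approach the buffer can stay a std::string, so no fixed 41-byte array or wide-character conversion is needed.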
std::strlen counts bytes, so it is only correct for single-byte characters. For a NUL-terminated wide string, one can use std::wcslen instead; note that it counts wchar_t units, which match code points only where wchar_t is 32 bits wide.
Example:
#include <iostream>
#include <cwchar>
#include <clocale>
#include <locale>
int main()
{
const wchar_t* str = L"爆ぜろリアル!弾けろシナプス!パニッシュメントディス、ワールド!";
std::setlocale(LC_ALL, "en_US.utf8");
std::wcout.imbue(std::locale("en_US.utf8"));
std::wcout << "The length of \"" << str << "\" is " << std::wcslen(str) << '\n';
}
If you do not want to count UTF-8 characters yourself, you can temporarily convert the input to wide characters to cut the string; the intermediate value does not need to be stored. Note that std::wstring_convert and <codecvt> are deprecated since C++17.
#include <iostream>
#include <codecvt>
#include <string>
#include <locale>
std::string cutString(const std::string& in, size_t len)
{
std::wstring_convert<std::codecvt_utf8<wchar_t>> cvt;
auto wstring = cvt.from_bytes(in);
if(len < wstring.length())
{
wstring = wstring.substr(0,len);
return cvt.to_bytes(wstring);
}
return in;
}
int main(){
std::string test = "你好世界這是演示樣本";
std::string res = cutString(test,5);
std::cout << test << '\n' << res << '\n';
return 0;
}
/****************
Output
$ ./test
你好世界這是演示樣本
你好世界這
*/

How to convert a std::string which contains '\0' to a char* array?

I have a string like,
string str="aaa\0bbb";
and I want to copy the value of this string to a char* variable. I tried the following methods but none of them worked.
char *c=new char[7];
memcpy(c,&str[0],7); // c="aaa"
memcpy(c,str.data(),7); // c="aaa"
strcpy(c,str.data()); // c="aaa"
str.copy(c,7); // c="aaa"
How can I copy that string to a char* variable without losing any data?
You can do it the following way
#include <iostream>
#include <string>
#include <cstring>
int main()
{
std::string s( "aaa\0bbb", 7 );
char *p = new char[s.size() + 1];
std::memcpy( p, s.c_str(), s.size() );
p[s.size()] = '\0';
size_t n = std::strlen( p );
std::cout << p << std::endl;
std::cout << p + n + 1 << std::endl;
}
The program output is
aaa
bbb
You need to keep the allocated size of the character array, s.size() + 1, somewhere in the program.
If there is no need to treat the "second part" of the array as a string, you may allocate only s.size() characters and omit the terminating zero.
In fact these methods used by you
memcpy(c,&str[0],7); // c="aaa"
memcpy(c,str.data(),7); // c="aaa"
str.copy(c,7); // c="aaa"
are correct. They copy exactly 7 characters, provided that you do not need to append a terminating zero to the resulting array. The problem is that you are trying to output the resulting character array as a string, and the output operators print only the characters before the embedded zero character.
Your string consists of 3 characters. You may try to use
using namespace std::literals;
string str="aaa\0bbb"s;
to create string with \0 inside, it will consist of 7 characters
It still won't help if you use it as a C string ((const) char*). C strings can't contain a zero character.
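The difference between the constructors is easy to check by comparing sizes. The sketch below uses three hypothetical helper functions, named only for this illustration, that each build a std::string from the same literal and report its size.

```cpp
#include <cstddef>
#include <string>

// Built through the const char* constructor, which stops at the first '\0'.
std::size_t size_from_c_string() {
    std::string s = "aaa\0bbb";
    return s.size();
}

// Built from a pointer plus an explicit count, so all 7 characters are kept.
std::size_t size_from_pointer_and_count() {
    std::string s("aaa\0bbb", 7);
    return s.size();
}

// Built with the operator""s literal (C++14), which knows the literal's length.
std::size_t size_from_string_literal() {
    using namespace std::literals;
    std::string s = "aaa\0bbb"s;
    return s.size();
}
```

The first helper yields 3 and the other two yield 7, which is why copying 7 bytes only makes sense after one of the latter two constructions.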
There are two things to consider: (1) make sure that str already contains the complete literal (the constructor taking only a char* parameter might truncate at the string terminator char). (2) Provided that str actually contains the complete literal, statement memcpy(c,str.data(),7) should work. The only thing then is how you "view" the result, because if you pass c to printf or cout, then they will stop printing once the first string terminating character is reached.
So: To make sure that your string literal "aaa\0bbb" gets completely copied into str, use std::string str("aaa\0bbb",7); Then, try to print the contents of c in a loop, for example:
std::string str("aaa\0bbb",7);
const char *c = str.data();
for (int i=0; i<7; i++) {
printf("%c", c[i] ? c[i] : '0');
}
You already did (not really, see edit below). The problem however, is that whatever you are using to print the string (printf?), is using the c string convention of ending strings with a '\0'. So it starts reading your data, but when it gets to the 0 it will assume it is done (because it has no other way).
If you want to simply write the buffer to the output, you will have to do this with something like
write(STDOUT_FILENO, c, 7); // POSIX, declared in <unistd.h>; fwrite(c, 1, 7, stdout) is the portable equivalent
Now write knows where the data ends, so it can write all of it.
Note however that your terminal cannot really show a \0 character, so it might show some weird symbol or nothing at all. If you are on linux you can pipe into hexdump to see what the binary output is.
EDIT:
Just realized that your string is also initialized from a const char* by reading until the zero. So you will also have to use a constructor that tells it to read past the zero:
std::string("data\0afterzero", 14);
(there are prettier solutions probably)
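To inspect a buffer with embedded zeros from inside the program, without depending on how the terminal renders \0, the bytes can be formatted as hex. This is a minimal sketch; to_hex is a hypothetical helper, not a standard function.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Formats `len` bytes starting at `data` as space-separated two-digit hex
// values, so embedded '\0' bytes show up as "00" instead of ending the output.
std::string to_hex(const char* data, std::size_t len) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < len; ++i) {
        unsigned value = static_cast<unsigned char>(data[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

Calling to_hex(s.data(), s.size()) on std::string s("aaa\0bbb", 7) gives 61 61 61 00 62 62 62, which confirms that the zero byte and the trailing "bbb" were copied.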

My program is giving different output on different machines..!

#include<iostream>
#include<string.h>
#include<stdio.h>
int main()
{
char left[4];
for(int i=0; i<4; i++)
{
left[i]='0';
}
char str[10];
gets(str);
strcat(left,str);
puts(left);
return 0;
}
For any input it should concatenate 0000 with that string, but on one PC it's showing a diamond sign between "0000" and the input string...!
You append a string of up to nine characters (or more, since gets has no bounds checking) to a four-element array that holds four characters and no string terminator. So strcat has to search past the end of the array for a terminator, and when you print using puts it will continue to print until it finds a string termination character, which may be anywhere in memory. This is, in short, a textbook example of a buffer overflow, and buffer overflows lead to undefined behavior, which is what you're seeing.
In C and C++ all C-style strings must be terminated. They are terminated by a special character: '\0' (or plain ASCII zero). You also need to provide enough space for destination string in your strcat call.
Proper, working program:
#include <stdio.h>
#include <string.h>
#include <errno.h>
int main(void)
{
/* Size is 4 + 10 + 1, the last +1 for the string terminator */
char left[15] = "0000";
/* The initialization above sets the four first characters to '0'
* and properly terminates it by adding the (invisible) '\0' terminator
* which is included in the literal string.
*/
/* Space for ten characters, plus terminator */
char str[11];
/* Read string from user, with bounds-checking.
* Also check that something was truly read, as `fgets` returns
* `NULL` on error or other failure to read.
*/
if (fgets(str, sizeof(str), stdin) == NULL)
{
/* There might be an error */
if (ferror(stdin))
printf("Error reading input: %s\n", strerror(errno));
return 1;
}
/* Unfortunately `fgets` may leave the newline in the input string
* so we have to remove it.
* This is done by changing the newline to the string terminator.
*
* First check that the newline really is there though. This is done
* by first making sure there is something in the string (using `strlen`)
* and then to check if the last character is a newline. The use of `-1`
* is because strings like arrays starts their indexing at zero.
*/
if (strlen(str) > 0 && str[strlen(str) - 1] == '\n')
str[strlen(str) - 1] = '\0';
/* Here we know that `left` is currently four characters, and that `str`
* is at most ten characters (not including zero termination). Since the
* total length allocated for `left` is 15, we know that there is enough
* space in `left` to have `str` added to it.
*/
strcat(left, str);
/* Print the string */
printf("%s\n", left);
return 0;
}
There are two problems in the code.
First, left is not nul-terminated, so strcat will end up looking beyond the end of the array for the appropriate place to append characters. Put a '\0' at the end of the array.
Second, left is not large enough to hold the result of the call to strcat. There has to be enough room for the resulting string, including the nul terminator. So the size of left should be at least 4 + 9 + 1: the four characters that left starts out with, up to 9 characters coming from str (assuming that gets hasn't caused an overflow), and the nul terminator.
Each of these errors results in undefined behavior, which accounts for the different results on different platforms.
I do not know why you are bothering to include <iostream> as you aren't using any C++ features in your code. Your entire program would be much shorter if you had:
#include <iostream>
#include <string>
int main()
{
std::string line;
std::cin >> line;
std::cout << "You entered: " << line;
return 0;
}
Since std::string is going to be null-terminated, there is no reason to force it to be 4-null-terminated.
Problem #1 - not a legal string:
char left[4];
for(int i=0; i<4; i++)
{
left[i]='0';
}
String must end with a zero char, '\0' not '0'.
This causes what you describe.
Problem #2 - gets. It performs no bounds checking, and you use it on a small buffer. Very dangerous.
Problem #3 - strcat. Yet again trying to fill a super small buffer, which is already full, with an extra string.
This code looks like an invitation to a buffer overflow attack.
In C, what we call a string is a null-terminated character array. All the functions in the string.h library rely on this null at the end of the character array. Your character array is not null-terminated and thus is not a string, so you cannot use the string library function strcat here.
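Since the question is tagged C++, the same task can be sketched with std::string, which manages its own storage and termination. The helper name prepend_zeros is made up for this example, and it assumes the input was read safely beforehand, for instance with std::getline(std::cin, input).

```cpp
#include <string>

// Returns "0000" followed by `input`. std::string grows as needed, so there
// is no fixed-size buffer to overflow and no terminator to forget.
std::string prepend_zeros(const std::string& input) {
    std::string left(4, '0');   // four '0' characters
    left += input;
    return left;
}
```

This removes both sources of undefined behavior discussed above: the missing terminator and the undersized destination array.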

C++ Convert char array to int representation

What is the best way to convert a char array (containing bytes from a file) into an decimal representation so that it can be converted back later?
E.g "test" -> 18951210 -> "test".
EDITED
It can't be done without a bignum class, since there are more possible letter combinations than values in an unsigned long long. (A 64-bit unsigned long long holds at most 8 characters.)
If you have some sort of bignum class:
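// biguint stands for whatever bignum class you have; the conversion of
// u % 256 to unsigned is assumed to exist in that class.
biguint string_to_biguint(const std::string& s) {
biguint result(0);
for(std::size_t i=0; i<s.length(); ++i) {
result *= (UCHAR_MAX + 1); // base 256: a byte can take 256 values, not 255
result += (unsigned char)s[i];
}
return result;
}
std::string biguint_to_string(biguint u) {
std::string result;
do {
// the least significant byte comes out first, so prepend it
result.insert(result.begin(), (char)(unsigned)(u % (UCHAR_MAX + 1)));
u /= (UCHAR_MAX + 1);
} while (u>0);
return result;
}
Note: leading NUL bytes are lost in the round trip, since they contribute nothing to the number; store the original length separately if they matter.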
I'm not sure what exactly you mean, but characters are stored in memory as their "representation", so you don't need to convert anything. If you still want to, you have to be more specific.
EDIT: You can
Try to read byte by byte shifting the result 8 bits left and oring it
with the next byte.
Try to use mpz_inp_raw
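The shift-and-OR suggestion can be sketched for short inputs as follows. This is only an illustration under the assumption that the data is at most 8 bytes, so it fits a 64-bit integer. pack_bytes and unpack_bytes are hypothetical names, and anything longer needs a bignum type.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Packs the bytes of `s` into one integer, first byte most significant.
// Assumption: s.size() <= 8; longer inputs would overflow the 64-bit result.
std::uint64_t pack_bytes(const std::string& s) {
    std::uint64_t value = 0;
    for (unsigned char c : s) {
        value = (value << 8) | c;   // shift left by one byte, then OR in the next byte
    }
    return value;
}

// Reverses pack_bytes. The length is passed in because leading zero bytes
// cannot be recovered from the number alone.
std::string unpack_bytes(std::uint64_t value, std::size_t length) {
    std::string s(length, '\0');
    for (std::size_t i = length; i > 0; --i) {
        s[i - 1] = static_cast<char>(value & 0xFF);   // lowest byte is the last character
        value >>= 8;
    }
    return s;
}
```

For "test" the packed value is 0x74657374, which is 1952805748 in decimal, and unpack_bytes with length 4 restores the original text.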
You can use a tree similar to Huffman compression algorithm, and then represent the path in the tree as numbers.
You'll have to keep the dictionary somewhere, but you can just create a constant dictionary that covers the whole ASCII table, since the compression is not the goal here.
There is no conversion needed. You can just use pointers.
Example:
char array[4 * NUMBER];
int *pointer;
Keep in mind that the "length" of pointer is NUMBER.
As mentioned, character strings are already ranges of bytes (and hence easily rendered as decimal numbers) to start with. Number your bytes from 000 to 255 and string them together and you've got a decimal number, for whatever that is worth. It would help if you explained exactly why you would want to be using decimal numbers, specifically, as hex would be easier.
If you care about compression of the underlying arrays forming these numbers for Unicode Strings, you might be interested in:
http://en.wikipedia.org/wiki/Standard_Compression_Scheme_for_Unicode
If you want some benefits of compression but still want fast random-access reads and writes within a "packed" number, you might find my "NSTATE" library to be interesting:
http://hostilefork.com/nstate/
For instance, if you just wanted a representation that only accommodated 26 English letters...you could store "test" in:
NstateArray<26> myString (4);
You could read and write the letters without going through a compression or decompression process, in a smaller range of numbers than a conventional string. Works with any radix.
Assuming you want to store the integers (I'm reading them as ASCII codes) in a string: this adds the leading zeros you need to get back to the original string. A character is a byte with a maximum value of 255, so it needs three digits in numeric form. It can be done without the STL fairly easily too, but why not use the tools you have?
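#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
using namespace std;
char text[] = "test"; // renamed from array to avoid a clash with std::array
int main()
{
stringstream out;
string s=text;
out.fill('0');
for (size_t i = 0; i < s.size(); ++i)
{
// the width resets after every insertion, so set it for each byte
out << setw(3) << (int)(unsigned char)s[i];
}
cout << s << " -> " << out.str();
return 0;
}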
output:
test -> 116101115116
Added:
change line to
out << (int)s[i] << ",";
output
test -> 116,101,115,116,
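The question also asks for the way back. With exactly three digits per byte, decoding means reading the digits in groups of three. The sketch below pairs a hypothetical encode_decimal with a decode_decimal. It assumes the digit string was produced by the encoder, so its length is a multiple of three and every group is in the range 000 to 255.

```cpp
#include <cstddef>
#include <string>

// Encodes each byte as exactly three decimal digits, e.g. 't' (116) -> "116"
// and 'a' (97) -> "097".
std::string encode_decimal(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        out += static_cast<char>('0' + c / 100);
        out += static_cast<char>('0' + (c / 10) % 10);
        out += static_cast<char>('0' + c % 10);
    }
    return out;
}

// Decodes the output of encode_decimal.
// Assumption: only digits, length a multiple of three, groups within 000..255.
std::string decode_decimal(const std::string& digits) {
    std::string out;
    for (std::size_t i = 0; i + 2 < digits.size(); i += 3) {
        int value = (digits[i] - '0') * 100
                  + (digits[i + 1] - '0') * 10
                  + (digits[i + 2] - '0');
        out += static_cast<char>(value);
    }
    return out;
}
```

The fixed width is what makes the format reversible: without the zero padding, "97101" could not be split unambiguously.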