Why do you need to add one in new char[str.length() + 1]? - c++

Code:
string str = "Whats up";
char *c = new char[str.length() + 1];
I can still write char *c = new char[str.length()];
What is the point of adding +1 on length?

Your code:
string str = "Whats up";
char *c = new char[str.length() + 1];
Your question:
What is the point of adding +1 on length?
The real question should be: What is the point of using C-style strings at all in your C++ program? Are you sure you need them?
Let me explain what exactly happens in your two code lines:
"Whats up" is a string literal, i.e. a constant series of characters, a char const[9] to be precise. The 9th character is the null character, '\0', automatically added by the compiler. So the array actually looks like this:
{ 'W', 'h', 'a', 't', 's', ' ', 'u', 'p', '\0' }
In fact, you could as well write:
char const array[9] = { 'W', 'h', 'a', 't', 's', ' ', 'u', 'p', '\0' };
std::string s = array;
So you have a char const[9] array which is used to initialize a std::string. Which constructor of std::string is actually used here? If you take a look at http://en.cppreference.com/w/cpp/string/basic_string/basic_string, you will find this one:
basic_string( const CharT* s,
const Allocator& alloc = Allocator() );
Remember, std::string is actually a typedef for std::basic_string<char>, so your CharT in this case is a char, and the constructor reads as:
string( const char* s,
const Allocator& alloc = Allocator() );
Also ignore the alloc parameter. It's too complicated to explain to a beginner, and it has a default argument precisely so that you can ignore it almost all the time. Which means that you end up with:
string( const char* s);
Which is itself another way of writing:
string(char const *s);
So you can initialize std::string with a char const *, and your code passes the constructor a char const[9]. This works because the array is automatically converted to a pointer to its first element.
So std::string takes your array, treats it as a pointer and copies the characters up to the terminating '\0'. The array size information, 9, is lost, but it doesn't matter, because you have the terminating '\0', so the std::string knows where to stop.
So far, so good. You have a std::string object which contains a copy of "Whats up". Your next line goes like this:
char *c = new char[str.length() + 1];
First of all, consider str.length(). The length function returns string size, not array size. So although you passed 9 characters to construct the string, length returns 8. This makes sense, because std::string is designed to let you forget about pointers, arrays and memory operations. It's text, and the text here has 8 characters.
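The difference between the array size and the string size can be checked directly. This is a minimal sketch; the names literal_array_size, text_length and check_sizes are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Size of the literal's array, including the '\0' the compiler appends.
constexpr std::size_t literal_array_size = sizeof("Whats up");

// Size of the text as std::string reports it.
inline std::size_t text_length()
{
    std::string str = "Whats up";
    return str.length();
}

inline void check_sizes()
{
    assert(literal_array_size == 9); // 8 visible characters + '\0'
    assert(text_length() == 8);      // the terminator is not counted
}
```

Calling check_sizes() passes silently, which confirms the 9 versus 8 split described above.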
Thus, str.length() + 1 equals 8 + 1 = 9, so your line of code is equivalent to:
char *c = new char[9];
You have created a pointer named c, initialised to point to a memory location where there is enough room for 9 characters, although what's currently stored there is undefined, so you must not try to read from there yet:
c
|
|
+------+
|
v
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
...| | | | | | | | | | | | ...
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
0 1 2 3 4 5 6 7 8
And there is no relationship between the std::string you created and the memory c points to. They live in completely different places:
c
|
|
+------+
|
v 0 1 2 3 4 5 6 7 8
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
... | | | | | | | | | | | | ... |W |h |a |t |s | |u |p |\0| ...
+-+-+-+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
0 1 2 3 4 5 6 7 8 ^
|
|
str -------( c_str() )-----------+
But if you use a C function like strcpy to copy the contents of the std::string to those 9 characters, then it becomes clear why you need space for 9 characters:
strcpy(c, str.c_str());
strcpy looks at the source (str.c_str()) and copies one character after the other to c until it finds '\0'. str internally ends with \0, so all is good. The function goes from 0 to 8 on the right of this picture and copies everything to 0 to 8 on the left.
And this finally answers your question: There must be space for 9 characters on the left. Otherwise, strcpy will attempt to write the final character (\0) to a memory location you are not allowed to touch. This results in undefined behaviour and may cause, for example, crashes or silent memory corruption.
With room for 9 characters, strcpy finishes successfully:
c
|
|
+------+
|
v 0 1 2 3 4 5 6 7 8
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
... | |W |h |a |t |s | |u |p |\0| | ... |W |h |a |t |s | |u |p |\0| ...
+-+-+-+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
0 1 2 3 4 5 6 7 8 ^
|
|
str -------( c_str() )-----------+
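Put together, the allocation and the copy could look like the sketch below. The helper name copy_to_c_string is hypothetical, and std::unique_ptr is used only so that the sketch does not leak; the original code uses a raw pointer.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical helper: copies str into a freshly allocated char array.
std::unique_ptr<char[]> copy_to_c_string(const std::string& str)
{
    // length() excludes the terminator, so reserve one extra char for '\0'.
    std::unique_ptr<char[]> c(new char[str.length() + 1]);
    std::strcpy(c.get(), str.c_str()); // copies length() chars plus the '\0'
    return c;
}

inline void check_copy()
{
    std::unique_ptr<char[]> c = copy_to_c_string("Whats up");
    assert(std::strlen(c.get()) == 8); // text length
    assert(c[8] == '\0');              // the 9th char holds the terminator
}
```

Without the "+ 1", the write to c[8] inside strcpy would land outside the allocated block.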
Moral of the story:
Use std::string. Copying a std::string may use a very similar mechanism inside but frees you (among other annoying things) from having to remember the "+ 1" rule:
std::string s1 = "Whats up";
std::string s2 = "...";
s2 = s1;

Unlike std::string, a C-style string uses a special character to mark its end: the null character '\0'. The one extra char is used to store that terminating '\0'.

There is a flaw in your code.
It should be
char *c = new char[str.length() + 1];
s.length()+1 wouldn't do anything.
The compiler only sizes a C string for you when you initialize an array from a literal (char c[] = "Whats up";). With new char[...] you have to specify the exact size yourself, which also lets you see the mechanics of everything.
C strings always need one more char than the std::string length, because C strings are character arrays that have a terminating null character at the end of the array. That is why you always leave room for the '\0' at the end.

Related

Convert string to portable filename with <filesystem> or Boost.Filesystem

Is there a simple way, with <filesystem> or <boost/filesystem.hpp> to convert a sequence of bytes, perhaps represented by std::vector<char> into a portable filename string such that the result can be converted back to the input sequence?
As an example, suppose a platform only permits filenames made of characters in the ranges [a,f] and [0,9]. A conversion function that suits the above constraint might be one that simply outputs each character as its two-digit hex equivalent, so {'a', 'b'} would become "6162", as 'a' -> 97 -> 0x61 and 'b' -> 98 -> 0x62.
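The hex scheme described above can be sketched as follows. Both function names are hypothetical, and the decoder assumes its input was produced by the encoder (even length, lowercase hex digits only).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical encoder: every byte becomes two lowercase hex digits.
std::string bytes_to_hex_name(const std::vector<char>& bytes)
{
    static const char digits[] = "0123456789abcdef";
    std::string name;
    name.reserve(bytes.size() * 2);
    for (char c : bytes)
    {
        unsigned char b = static_cast<unsigned char>(c);
        name += digits[b >> 4];   // high nibble
        name += digits[b & 0x0f]; // low nibble
    }
    return name;
}

// Hypothetical decoder: assumes `name` came from bytes_to_hex_name.
std::vector<char> hex_name_to_bytes(const std::string& name)
{
    auto value = [](char d) { return d <= '9' ? d - '0' : d - 'a' + 10; };
    std::vector<char> bytes;
    for (std::size_t i = 0; i + 1 < name.size(); i += 2)
        bytes.push_back(static_cast<char>(value(name[i]) * 16 + value(name[i + 1])));
    return bytes;
}
```

For the input {'a', 'b'} the encoder yields "6162", since 'a' is 97 (0x61) and 'b' is 98 (0x62). The price of this simplicity is that the name is twice as long as the input and not human readable.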
This is very simple to do with filesystem::path. The first step is to construct a path object from the sequence of characters. There are two-iterator constructors available, as well as constructors that take any of C++'s character types for encoding purposes.
Then, just call generic_u8string on that path object; you will get a std::u8string (in C++20; in C++17, you get a std::string) containing the path formatted in a platform-neutral generic format. This string can later be used to reconstitute the path object as well.
Now, full round-tripping from the platform-specific format through path back to the platform-specific format is not really permitted. You can get the native string version of the path (path::u8string returns this), but there's no guarantee of a byte-for-byte identical string. There is a guarantee that the two strings will identify the same filesystem resource. So the differences, if they exist, are unimportant.
It took me several days to write this :/.
My personal objectives here were the following:
Each unique input string must result in a unique filename. In other words, the conversion must be "one-to-one" which implies that it is also reversible, as was requested by the OP.
As many characters as possible must be kept the same; at least, the filename should be mostly human readable and look much like the original string.
When nothing else goes, characters should be escaped as is usual for url-encoding: a percentage followed by the byte value in hexadecimal.
The escape character (%) itself is escaped with two of them (%%), unless another translation is requested.
I wanted simple things, like the ability to have spaces replaced by underscores; and since underscores might also occur frequently, I don't want those escaped with %5F, but with something neat, like a (multi-byte) unicode character.
I decided to only support UTF8 strings therefore, and be able to treat any utf8 glyph as a translatable 'character'; that is: you can translate single glyphs into different glyphs (including all 1-byte ASCII values).
It is not possible to translate multi-glyph sequences with my implementation, but since it is based on a Dictionary class, most of the code should be reusable if anyone wants to add support for that.
Making sure that every string, under any possible translation is reversible turned out to be non-trivial to say the least.
// From
// |
// v
// .----------------------------------------.
// | |
// | j--------.
// | | |
// | i k | |
// | .--------|-----|---+----|----------------.
// | |1 a v | b--->B v 2|<-- To
// | .--->M | I | | J |
// | | | E v | | |
// | | | ^ A d c | | |
// | .-------|--+--|-----|--|--|---+---------. |
// | | m |3 | g | v v | | |
// | | | e | | C K | | |
// | | p | v | | | |
// | l | | n o-->O G h | f------------->F |
// '----|-----+-|---|----+------|-|---------' | |
// | | | | |4 v v | |
// | | | | | H D | |
// | | | `----------------------------------->N |
// | | `--------->P | |
// `----------------->L | |
// | | | |
// | '----------------------------+-----------'
// | |
// | q | r
// | |
// | |<-- Illegal
// '---------------------------------------'
(and that doesn't even include escape characters)
But I think I succeeded. If anyone manages to find arguments to u8string_to_filename that do not convert back with filename_to_u8string, let me know!
First of all I needed a function that returns the number of bytes of a glyph:
// Returns the length of the UTF8 encoded glyph, which is highly
// recommended to be either guaranteed correct UTF8, or points
// inside a zero terminated string.
//
// If the pointer does not point to a legal UTF8 glyph then 1 is returned.
// The zero termination is necessary to detect the end of the string
// in the case that the apparent encoded glyph length goes beyond the string.
//
int utf8_glyph_length(char8_t const* glyph)
{
// The length of a glyph is determined by the first byte.
// This magic formula returns 1 for 110xxxxx, 2 for 1110xxxx,
// 3 for 11110xxx and 0 otherwise.
int extra = (0x3a55000000000000 >> ((*glyph >> 2) & 0x3e)) & 0x3;
// Detect if there are indeed `extra` bytes that follow the first
// one, each of which must begin with 10xxxxxx to be legal UTF8.
int i = 0;
while (++i <= extra)
if (glyph[i] >> 6 != 2)
return 1; // Not legal UTF8 encoding.
return 1 + extra;
}
You can find this file here.
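To see how the magic constant works, the lookup can be isolated and checked against a few lead bytes. This is a sketch, not the author's code: utf8_extra_bytes is a hypothetical name, and it takes a plain unsigned char instead of char8_t so that it also compiles before C++20.

```cpp
#include <cassert>

// Same lookup as in utf8_glyph_length. The constant is a table of 2-bit
// entries; the top five bits of the lead byte, times two, select the entry.
int utf8_extra_bytes(unsigned char lead)
{
    return static_cast<int>((0x3a55000000000000ULL >> ((lead >> 2) & 0x3e)) & 0x3);
}

inline void check_lead_bytes()
{
    assert(utf8_extra_bytes(0x61) == 0); // 0xxxxxxx: ASCII, no extra bytes
    assert(utf8_extra_bytes(0x80) == 0); // 10xxxxxx: continuation byte
    assert(utf8_extra_bytes(0xC3) == 1); // 110xxxxx: 2-byte glyph
    assert(utf8_extra_bytes(0xE2) == 2); // 1110xxxx: 3-byte glyph
    assert(utf8_extra_bytes(0xF0) == 3); // 11110xxx: 4-byte glyph
    assert(utf8_extra_bytes(0xF8) == 0); // 11111xxx: not a valid lead byte
}
```

Only the entries for the top-five-bit values 24 to 30 are non-zero, which is why all the set bits of the constant sit in its upper 16 bits.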
Next we need a simple Dictionary class:
class Dictionary
{
private:
std::vector<std::u8string_view> m_words;
public:
Dictionary(std::u8string const&);
size_t size() const { return m_words.size(); }
void add(std::u8string_view glyph);
int find(std::u8string_view glyph) const;
std::u8string_view operator[](int index) const { return m_words[index]; }
};
with its definition
Dictionary::Dictionary(std::u8string const& in)
{
// Run over each glyph in the input.
int glen; // The number of bytes of the current glyph.
for (char8_t const* glyph = in.data(); *glyph; glyph += glen)
{
glen = utf8_glyph_length(glyph);
m_words.emplace_back(glyph, glen);
}
}
void Dictionary::add(std::u8string_view glyph)
{
if (find(glyph) == -1)
m_words.push_back(glyph);
}
int Dictionary::find(std::u8string_view glyph) const
{
for (int index = 0; index < m_words.size(); ++index)
if (m_words[index] == glyph)
return index;
return -1;
}
I also used the following two helper functions
char8_t to_hex_digit(int d)
{
if (d < 10)
return '0' + d;
return 'A' + d - 10;
}
std::u8string to_hex_string(char8_t c)
{
std::u8string hex_string;
hex_string += to_hex_digit(c / 16);
hex_string += to_hex_digit(c % 16);
return hex_string;
}
Finally, here is the encoder function
// Copy str to the returned filename, replacing every occurrence of
// the utf8 glyphs in `from` with the corresponding one in `to`.
//
// All glyphs in `illegal` will be escaped with a percentage sign (%)
// followed by two hexadecimal characters for each byte of
// the glyph.
//
// If `from` does not contain the escape character, then each '%' will
// be replaced with "%%".
//
// All glyphs in `to` that are not in `from` are considered illegal
// and will also be escaped.
//
std::filesystem::path u8string_to_filename(std::u8string const& str,
std::u8string const& illegal, std::u8string const& from, std::u8string const& to)
{
using namespace detail::us2f;
// All glyphs are found by their first byte.
// Build a dictionary for each of the three strings.
Dictionary from_dictionary(from);
Dictionary to_dictionary(to);
Dictionary illegal_dictionary(illegal);
// The escape character is always illegal (is not allowed to appear on its own
// in the output).
illegal_dictionary.add({ &escape, 1 });
// For each `from` entry there must exist one `to` entry.
ASSERT(from_dictionary.size() == to_dictionary.size());
std::filesystem::path filename;
// Run over all glyphs in the input string.
int glen; // The number of bytes of the current glyph.
for (char8_t const* gp = str.data(); *gp; gp += glen)
{
glen = utf8_glyph_length(gp);
std::u8string_view glyph(gp, glen);
// Perform translation.
int from_index = from_dictionary.find(glyph);
if (from_index != -1)
glyph = to_dictionary[from_index];
else if (*gp == escape)
{
filename += escape;
filename += escape;
continue;
}
// What is in illegal is *always* illegal - even when it is the result
// of a translation.
if (illegal_dictionary.find(glyph) != -1 ||
// If an input glyph is not in the from_dictionary (aka, it
// wasn't just translated) but it is in the to_dictionary -
// then also escape it. This is necessary to make sure that
// each unique input str results in a unique filename (and
// consequently is reversible).
(from_index == -1 && to_dictionary.find(glyph) != -1))
{
// Escape illegal glyphs.
// Always escape the original input (not a possible translation),
// otherwise we can't know what the input was when decoding:
// the input could have been translated first or not.
for (int j = 0; j < glen; ++j)
{
filename += escape;
filename += to_hex_string(gp[j]);
}
continue;
}
// Append the glyph to the filename.
filename += glyph;
}
return filename;
}
And the decoder function
std::u8string filename_to_u8string(std::filesystem::path const& filename,
std::u8string const& from, std::u8string const& to)
{
using namespace detail::us2f;
std::u8string input = filename.u8string();
std::u8string result;
Dictionary from_dictionary(from);
Dictionary to_dictionary(to);
// First unescape all bytes in the filename.
int glen; // The number of bytes of the current glyph.
for (char8_t const* gp = input.c_str(); *gp; gp += glen)
{
glen = utf8_glyph_length(gp);
std::u8string_view glyph(gp, glen);
// First translate escape sequences back - those are then always
// original input.
if (*gp == escape)
{
if (gp[1] == escape)
{
glen = 2; // Skip the second escape character too.
result += escape;
}
else
{
char8_t val = 0;
for (int d = 1; d <= 2; ++d)
{
val <<= 4;
val |= ('0' <= gp[d] && gp[d] <= '9') ? gp[d] - '0'
: gp[d] - 'A' + 10;
}
result += val;
glen = 3; // Skip the two hex digits too.
}
continue;
}
else
{
// Otherwise - if the character is in the from dictionary, it must have
// been translated - otherwise it would have been escaped.
int from_index = from_dictionary.find(glyph);
if (from_index != -1)
glyph = to_dictionary[from_index];
}
result += glyph;
}
return result;
}
You can find all of this (and the latest version) on GitHub.

Order of precedence in C++: & or ()?

Provided that texts is an array of 3 strings, what's the difference between &texts[3] and (&texts)[3]?
The [] subscript operator has a higher precedence than the & address-of operator.
&texts[3] is the same as &(texts[3]), meaning the 4th element of the array is accessed and then the address of that element is taken. Assuming the array is like string texts[3], that will produce a string* pointer that is pointing at the 1-past-the-end element of the array, ie similar to an end iterator in a std::array or std::vector.
----------------------------
| string | string | string |
----------------------------
^
&texts[3]
(&texts)[3], on the other hand, takes the address of the array itself, producing a string(*)[3] pointer, and then indexes that pointer, which advances it by 3 whole string[3] arrays. So, again assuming string texts[3], the expression refers to a string[3] array that lies WAY beyond the end boundary of texts, and actually using it is undefined behavior.
---------------------------- ---------------------------- ----------------------------
| string | string | string | | string | string | string | | string | string | string |
---------------------------- ---------------------------- ----------------------------
^ ^
&texts[3] (&texts)[3]
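The types involved can be confirmed at compile time. This is a minimal sketch assuming the declaration std::string texts[3]; the two helper functions are hypothetical. Everything inside decltype is unevaluated, so no out-of-bounds access actually happens.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

std::string texts[3];

// &texts[3] parses as &(texts[3]): a pointer to a single string.
static_assert(std::is_same<decltype(&texts[3]), std::string*>::value,
              "&texts[3] is a string*");
// &texts is a pointer to the whole array.
static_assert(std::is_same<decltype(&texts), std::string (*)[3]>::value,
              "&texts is a string(*)[3]");
// Indexing that pointer names an entire array of 3 strings.
static_assert(std::is_same<decltype((&texts)[3]), std::string (&)[3]>::value,
              "(&texts)[3] is a string[3] lvalue");

// Offset of the one-past-the-end position, counted in strings.
inline std::ptrdiff_t one_past_end_offset() { return (texts + 3) - texts; }

// How many strings a pointer to the whole array advances per step.
inline std::size_t array_stride_in_strings()
{
    return sizeof(*(&texts)) / sizeof(std::string);
}
```

Since each step of a string(*)[3] pointer covers 3 strings, (&texts)[3] would start 9 strings after texts[0], while &texts[3] is only 3 strings after it.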

Memory map of what happens when we use command line arguments? [duplicate]

What I understand is that argc holds the total number of arguments. Suppose my program takes 1 argument apart from the program name. Now what does argv hold? Two pointers, e.g. 123 and 130, or ./hello\0 and 5? If it holds 123, how does it know it has read one argument? Does it know because of the \0?
If all the above is wrong, can someone help me understand using a memory map?
The argv array is an array of strings (where each entry in the array is of type char*). Each of those strings is, itself, NUL-terminated. The argv array itself is also terminated: the C and C++ standards guarantee that argv[argc] is a null pointer, so argc is a convenient way to know the length of the argv array up front rather than the only way to find its end.
In terms of those arrays being constructed to begin with, this is dependent on the calling program. Typically, the calling program is a shell program (such as BASH), where arguments are separated via whitespace (with various quoting options available to allow arguments to include whitespace). Regardless of how the argc, argv parameters are constructed, the operating system provides routines for executing a program with this as the program inputs (e.g. on UNIX, that method is one of the various variations of exec, often paired with a call to fork).
To make this a bit more concrete, suppose you ran:
./myprog "arg"
Here is an example of how this might look in memory (using completely fake addresses):
Address | Value | Comment
========================
0058 | 2 | argc
0060 | 02100 | argv (value is the memory address of "argv[0]")
...
02100 | 02116 | argv[0] (value is the memory address of "argv[0][0]")
02104 | 02300 | argv[1] (value is the memory address of "argv[1][0]")
...
02116 | '.' | argv[0][0]
02117 | '/' | argv[0][1]
02118 | 'm' | argv[0][2]
02119 | 'y' | argv[0][3]
02120 | 'p' | argv[0][4]
02121 | 'r' | argv[0][5]
02122 | 'o' | argv[0][6]
02123 | 'g' | argv[0][7]
02124 | '\0' | argv[0][8]
...
02300 | 'a' | argv[1][0]
02301 | 'r' | argv[1][1]
02302 | 'g' | argv[1][2]
02303 | '\0' | argv[1][3]
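The same layout can be walked in code. This is a sketch with a hypothetical helper, collect_args, that follows the pointers in argv until it reaches the null pointer that the C and C++ standards place at argv[argc].

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: copies every argument into a std::string.
// It does not need argc, because argv[argc] is a null pointer.
std::vector<std::string> collect_args(char* argv[])
{
    std::vector<std::string> args;
    for (char** p = argv; *p != nullptr; ++p)
        args.emplace_back(*p); // each *p is a NUL-terminated string
    return args;
}

inline void check_collect_args()
{
    // Hand-built stand-in for what the OS passes for: ./myprog "arg"
    char prog[] = "./myprog";
    char arg[] = "arg";
    char* argv[] = { prog, arg, nullptr };
    std::vector<std::string> args = collect_args(argv);
    assert(args.size() == 2);
    assert(args[0] == "./myprog");
    assert(args[1] == "arg");
}
```

In a real program you would call collect_args(argv) from main(int argc, char* argv[]) and could additionally check that the result has argc entries.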

Will delete[] after strcpy cause memory leak?

char* myChar=new char[20];
char* myChar2="123";
strcpy(myChar, myChar2);
...
delete[] myChar;
My question is if strcpy puts a '\0' at the end of "123", then will delete[] myChar only delete the first 3 chars and fail to delete the rest of myChar?
Thank you...
No, delete [] deallocates all the memory allocated by new [] as long as you pass the same address to delete [] that was returned by new [].
It just correctly remembers how much memory was allocated irrespective of what is placed at that memory.
Your delete[] deallocates all of 20 chars, not only 3+1 that you really did use.
delete[] doesn't look for '\0' when deleting a character array.
Rather, the size of the memory chunk is recorded when new[] allocates it, no matter what is later stored in it.
Hence, delete[] myChar frees the whole chunk that was actually allocated for that pointer. This implies no memory leak in your situation. (myChar2 points to a string literal that was not allocated with new[], so it must not be deleted at all.)
This is a fundamental aspect of C++ that needs understanding, and the confusion around it is understandable. Look at the example:
char* myChar1 = new char[20];
char* myChar2 = (char*)malloc(20);
In spite of the fact that both pointers have the same type, you should use different methods to release objects that they are pointing at:
delete [] myChar1;
free(myChar2);
Note that if you do:
char *tmp = myChar1;
myChar1 = myChar2;
myChar2 = tmp;
After that you need:
delete [] myChar2;
free(myChar1);
You need to track the object itself (i.e. how it was allocated), not the place where you keep a pointer to this object. And release the object that you want to release, not the place that stores info about this object.
char* myChar=new char[20]; // you allocate space for 20 chars
+-----------------+
myChar -> | x | x | ... | x | // x = uninitialized char
+-----------------+
char* myChar2="123";
+----------------+
myChar2 -> | 1 | 2 | 3 | \0 | // myChar2 points to string
+----------------+
strcpy(myChar, myChar2); // copy string to 20 char block
// strcpy copies char by char until it finds a \0 i.e. 4 chars
// in this case
+----------------------------------+
myChar -> | 1 | 2 | 3 | \0 | x | x | ... | x |
+----------------------------------+
// note that characters after the string 123\0 are
// still uninitialized
delete[] myChar;
// the whole 20 chars has been freed
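The point that the string's contents and the size of the allocation are unrelated can be sketched as follows. The function name used_chars_after_copy is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Allocates 20 chars, copies "123" into them and reports how many of the
// 20 chars the string occupies, terminator included.
inline std::size_t used_chars_after_copy()
{
    char* myChar = new char[20];
    std::strcpy(myChar, "123");                 // writes '1', '2', '3', '\0'
    std::size_t used = std::strlen(myChar) + 1; // 3 characters + '\0'
    delete[] myChar; // releases all 20 chars; the '\0' plays no role here
    return used;
}

inline void check_used_chars()
{
    assert(used_chars_after_copy() == 4);
}
```

strlen depends on where the '\0' sits, while delete[] depends only on the pointer that new[] returned.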

Does a char* when assigned a value terminates the String With a null?

Suppose I have written the following in my code:
char *abc = " Who cares";
int len= strlen(abc);
This gives me the length of abc. My question is how strlen determines the length of abc here. Presumably it looks for the null terminator and returns the count. But does that mean that a null is appended to abc at the point where I initialize it with " Who cares"?
char *abc = " Who cares";
declares a pointer abc to a string literal " Who cares" located somewhere in a read-only (implementation-defined) location. Yes, it is null-terminated.
Do not try to modify this string literal though, because that leads to undefined behavior.
Also, in C++ the correct way to declare this is:
const char *abc = " Who cares";
Yes, strlen walks through the memory pointed to by abc until it finds a null termination character.
abc is not initialized with null. The compiler places the string somewhere in memory (including an implicit null termination character); abc is then initialized with the address of the first character in the string.
So:
0x1234 0x123E (example addresses)
+--+--+--+--+--+--+--+--+--+--+--+
| |W |h |o | |c |a |r |e |s |\0|
+--+--+--+--+--+--+--+--+--+--+--+
^
|
|
abc = 0x1234
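The implicit terminator can be observed directly. This is a minimal sketch; literal_size and check_terminator are names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

const char* abc = " Who cares";

// Size of the literal's array: 10 visible characters plus the implicit '\0'.
constexpr std::size_t literal_size = sizeof(" Who cares");

inline void check_terminator()
{
    assert(literal_size == 11);
    assert(std::strlen(abc) == 10); // strlen stops at the '\0' and does not count it
    assert(abc[10] == '\0');        // the terminator sits right after the 's'
}
```

The pointer abc itself only stores the address of the first character; the '\0' is part of the array the compiler emitted for the literal.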