ASCII Value for Nothing - C++

Is there an ASCII value I can put into a char in C++ that represents nothing? I tried 0, but it ends up screwing up my file so I can't read it.

ASCII 0 is null. Other than that, there are no "nothing" characters in traditional ASCII. If appropriate, you could use a control character like SOH (start of heading), STX (start of text), or ETX (end of text). Their ASCII values are 1, 2, and 3 respectively.
For the full list of ASCII codes that I used for this explanation, see this site

Sure. Use any character value that won't appear in your regular data. This is commonly referred to as a delimited text file. Popular choices for delimiters include spaces, tabs, commas, semicolons, vertical bars, and tildes.
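For instance, here's a minimal sketch of reading such a file with std::getline's delimiter overload (the file name and the '|' delimiter are illustrative choices, not from the question):

#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream in("records.txt");        // hypothetical file name
    std::string field;
    while (std::getline(in, field, '|')) {  // '|' as the field delimiter
        std::cout << "field: " << field << '\n';
    }
}

Note that with this overload the newline at the end of each record ends up inside a field, so real parsers usually split lines first and fields second.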

In a C++ source file, '\0' represents a 0 byte. However, C-style strings are null-terminated, which means that '\0' is treated as the end of the string - and that may be what is messing up your file.
If you really want to store a 0 byte in a data file, you need to use some other encoding. A simplistic one would use some other byte - 0xFF, for example - that doesn't appear in your data, or a length/data format, or something similar.
Whatever encoding you choose, the application writing the file and the one reading it need to agree on what the encoding is. And that is a whole new nightmare.
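A minimal sketch of the length/data idea (all names here are illustrative, and it assumes both streams are opened with std::ios::binary and share endianness):

#include <cstdint>
#include <fstream>
#include <string>

// Write a record as a 4-byte length followed by the raw bytes.
void write_record(std::ofstream &out, const std::string &data) {
    std::uint32_t len = static_cast<std::uint32_t>(data.size());
    out.write(reinterpret_cast<const char *>(&len), sizeof len);
    out.write(data.data(), len);            // payload may legally contain '\0'
}

// Read one record back; returns false at end of file.
bool read_record(std::ifstream &in, std::string &data) {
    std::uint32_t len = 0;
    if (!in.read(reinterpret_cast<char *>(&len), sizeof len)) return false;
    data.resize(len);
    return static_cast<bool>(in.read(&data[0], len));
}

Because the length lives outside the payload, no in-band "nothing" character is needed at all.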

The null character '\0' still takes up a byte.
Does your software recognize the null character as an end-of-file character?
If your software is reading in this file, you can define a placeholder character (one that can't be confused with data), but you'll also need to handle that character. Say '*' is your placeholder: you read the character in but don't add it to the structure that stores your data. It will still take up space in your file, but it won't take up space in your data structure.
Am I answering your question or missing it?
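Something like this sketch, assuming '*' as the placeholder and a hypothetical data.txt:

#include <fstream>
#include <vector>

int main() {
    std::ifstream in("data.txt");
    std::vector<char> data;
    char c;
    while (in.get(c)) {
        if (c == '*') continue;   // placeholder: read it, but don't store it
        data.push_back(c);
    }
}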

Do you mean a value you can write which won't actually change the file? The answer is no.
Maybe post a little more about what you're trying to accomplish.

It would depend on what kind of file it is and who is parsing it.

Related

C++ get the size (in bytes) of EOL

I am reading an ASCII text file. It is defined by the size of each field, in bytes. E.g. each row consists of 10 bytes for some string, 8 bytes for a floating-point value, 5 bytes for an integer, and so on.
My problem is reading the newline character, which has a variable size depending on the OS (usually 2 bytes for Windows and 1 byte for Linux, I believe).
How can I get the size of the EOL character in C++?
For example, in Python I can do:
len(os.linesep)
The time-honored way to do this is to read a line.
Now, the last char should be \n. Strip it. Then look at the previous character. It will either be \r or something else. If it's \r, strip it too.
For Windows [ASCII] text files, there aren't any other possibilities.
This works even if the file is mixed (e.g. some lines are \r\n and some are just \n).
You can tentatively do this on a few lines, just to be sure you're not dealing with something weird.
After that, you now know what to expect for most of the file. But, the strip method is the general reliable way. On Windows, you could have a file imported from Unix (or vice versa).
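A sketch of that strip method (the file name is hypothetical; binary mode keeps the '\r' visible instead of letting the library translate it away):

#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream in("input.txt", std::ios::binary);
    std::string line;
    while (std::getline(in, line)) {        // getline consumes the '\n'
        std::size_t eol_size = 1;           // at least the '\n'
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();                // strip the '\r'
            eol_size = 2;                   // a "\r\n" ending on this line
        }
        std::cout << line << " (EOL: " << eol_size << " byte(s))\n";
    }
}

This handles mixed files too, since each line is measured on its own.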
I'm not sure that the translation occurs where you think it is. Look at the following code:
#include <cstring>
#include <sstream>
#include <string>
std::ostringstream buf;
buf << std::endl;                // std::endl inserts '\n', then flushes
std::string s = buf.str();
int i = std::strlen(s.c_str());
After this, running on Windows, i == 1. So the end-of-line definition inside the standard library is one character: as others have commented, the '\n' character. The translation of '\n' into "\r\n" happens later, when a file stream opened in text mode actually writes the byte out.

Unrecognizable character in C++

I'm programming an application that converts .txt files to bags of words for text mining. However, I keep getting non-alphabetic characters (like ¾ and =) even though my application filters out non-alphabetic characters:
My vector passes through a loop which erases strings that begin with a char whose ASCII value falls outside [65, 90] (that is, 'A' to 'Z'). These characters also pass the isalpha test. It seems like these characters can't be distinguished from alphabetic characters.
I don't see how I can remove these weird strings dynamically from my vector of strings. I need help.
I won't post my full code because it is quite long for a forum post.
This part of my code fails to get rid of the strings beginning with non-alphabetic characters:
for (unsigned int i = 0; i < token24.size(); i++) {
    string temp = token24[i];
    char c = temp[0];
    if (c > 90 || c < 65) {
        token24.erase(token24.begin() + i);
        i--;
    }
}
I also tried with the condition
(c > 'Z' || c < 'A')
You could always do a string replace, substituting whitespace for those characters, but that just handles the specific cases of specific characters, not the larger problem.
I don't think we can do anything for you until we see the code.
The most important part in programs like yours is handling the content of the .txt file. Such a file can be Unicode text, which in turn can be encoded, for example, with UTF-8. Then a single byte can be only part of a character, not a character itself. Are you sure you load (and, possibly, decode) the file in a proper way?
Also, don't you think that lowercase letters are also valid alpha characters?
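If the file really is single-byte text (an assumption), a filter built on std::isalpha with an unsigned char cast handles both cases and sidesteps the signed-char problem that lets bytes like ¾ slip through. A sketch using the erase-remove idiom:

#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Remove every token whose first byte is not alphabetic. The cast to
// unsigned char matters: passing a negative char value (e.g. ¾ as a
// signed byte) to std::isalpha is undefined behavior.
void remove_nonalpha(std::vector<std::string> &tokens) {
    tokens.erase(
        std::remove_if(tokens.begin(), tokens.end(),
                       [](const std::string &s) {
                           return s.empty() ||
                                  !std::isalpha(static_cast<unsigned char>(s[0]));
                       }),
        tokens.end());
}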

How to detect a tab in a text file?

Is detecting tabs the same as detecting spaces? I.e., for detecting a space, I would just compare the character with the space's ASCII number.
For a tab, do I have to search for the '\t' character in the file, or is there some other way?
if ('\t' == myChar)
This would work, and it is better than checking against the literal 9, since 9 is not a guaranteed value in every character set.
Assuming you are working with ASCII data, you can just search for a byte with value '\t' (9) in the text file. Tabs are represented as a single byte in text files and most libraries for reading files don't do anything special with those bytes.
A tab is just another character so you can check for the ASCII value if you want.
Although a tab may appear as 4 or 8 spaces in an editor, it is actually represented as a single character ('\t', like you mentioned) inside a file. Both the space character and the tab character take up one byte. So basically, you are correct in your assumption.
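A minimal sketch of that byte-wise check (the file name is hypothetical, and ASCII/UTF-8 input is assumed, where a tab is always the single byte '\t'):

#include <cstddef>
#include <fstream>
#include <iostream>

int main() {
    std::ifstream in("input.txt");
    std::size_t tabs = 0;
    char c;
    while (in.get(c)) {
        if (c == '\t') ++tabs;   // the same test works for ' ' or any char
    }
    std::cout << "tabs: " << tabs << '\n';
}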

Delimiting Character

We are loading a fixed-width text file into a SAS dataset.
The character we are using to delimit multi valued field values is being interpreted as 2 characters by SAS. This breaks things, because the fields are of a fixed width.
We can use characters that appear on the keyboard, but obviously this isn't as safe, because our data could actually contain those characters.
The character we would like to use is '§'.
I'm guessing this may be an encoding issue, but don't know what to do about it.
Could you use the keycode for the character like DLM='09'x and change 09 to the right keycode?
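One possible explanation, assuming the file or session encoding is UTF-8: '§' is U+00A7, which UTF-8 encodes as the two bytes 0xC2 0xA7, so a byte-oriented reader counts it as two characters. A quick C++ check (this demo assumes the source file itself is saved as UTF-8):

#include <cstdio>
#include <cstring>

int main() {
    const char *delim = "§";   // two bytes (0xC2 0xA7) under UTF-8
    std::printf("delimiter is %zu byte(s)\n", std::strlen(delim));
    return 0;
}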

Reading a UTF-8 Unicode file through non-unicode code

I have to read a text file which is Unicode with UTF-8 encoding and have to write this data to another text file. The file has tab-separated data in lines.
My reading code is C++ without Unicode support. What I am doing is reading the file line by line into a string/char* and writing that string as-is to the destination file. I can't change the code, so code-change suggestions are not welcome.
What I want to know is: while reading line by line, can I encounter a null terminator ('\0') within a line, since the file is Unicode and one character can span multiple bytes?
My thinking was that it is quite possible that a NULL terminating character could be encountered within a line. Your thoughts?
UTF-8 uses 1 byte for all ASCII characters, which keep the same code values as in standard ASCII, and 2 to 4 bytes for other characters. The upper bits of each byte act as control bits: for any code point encoded in more than one byte, every one of its bytes has the high bit set.
Thus there will be no 0 bytes in your UTF-8 file (unless the text itself contains the character U+0000).
Check Wikipedia for UTF-8
No, unless the text itself contains U+0000: all the bytes in a UTF-8 multi-byte sequence have the high bit set to 1.
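A quick demonstration of that claim, with the UTF-8 bytes spelled out by hand so no particular source-file encoding is assumed:

#include <cstdio>

int main() {
    // "Aé" in UTF-8: 'A' is 0x41; 'é' (U+00E9) is the two bytes 0xC3 0xA9.
    const unsigned char utf8[] = { 0x41, 0xC3, 0xA9, 0x00 };
    for (const unsigned char *p = utf8; *p; ++p)
        std::printf("0x%02X high bit set: %d\n", *p, (*p & 0x80) != 0);
    return 0;
}

The ASCII byte prints 0; both bytes of the multi-byte sequence print 1, so a '\0' can only come from an actual U+0000 in the text.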