How many bytes does the std::istream::peek() function read? [closed] - c++

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 5 years ago.
I have tried to read the next character from a file containing only characters and from a file containing only integers. The function returns the next value (int or char). My question is: how many bytes does peek() read? For the first file it seems to read one byte, while for the second file it seems to read four bytes. How is that possible?

[H]ow many byte[s are] read [by] peek()?
std::ifstream::peek() reads one character (i.e. one byte) from the file, and returns it inside an int (the use of int is so that there is sufficient range to conditionally return EOF).
(Other std::basic_istream specialisations may have different char_type and int_type aliases, so the exact types and numbers given may differ if you use them. But the key is still that you extract one character, even if you think your ASCII file contains "numbers".)
It doesn't matter what the bytes of your file are: human-readable ASCII text, human-readable ASCII numbers, an encoded ZIP, random values… std::ifstream::peek() is an "unformatted input function", which is a type of IOStream function that works on the characters in a file.
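A minimal sketch of that behaviour, using an in-memory std::istringstream in place of a file (an assumption made only to keep the example self-contained; `first_peek`, `peek_consumes_nothing` and `peek_demo` are hypothetical helper names). It suggests that peeking at the text "1234" yields the character code of '1', not the number 1234, and that the read position does not move:

```cpp
#include <cassert>
#include <ios>
#include <sstream>
#include <string>

// Hypothetical helper: the value peek() reports for the start of `text`.
// The result is an int so that EOF can be told apart from every valid char.
int first_peek(const std::string& text) {
    std::istringstream in(text);
    return in.peek();
}

// Hypothetical helper: checks that peek() leaves the read position alone
// and that the following get() extracts the very character that was peeked.
bool peek_consumes_nothing(const std::string& text) {
    std::istringstream in(text);
    const std::streamoff before = in.tellg();
    const int peeked = in.peek();
    const std::streamoff after = in.tellg();
    return before == after && peeked == in.get();
}

// Usage: a "file of numbers" is still just a sequence of characters.
void peek_demo() {
    assert(first_peek("1234") == '1');   // one character, not the integer 1234
    assert(first_peek("abc") == 'a');
    assert(peek_consumes_nothing("1234"));
}
```

If you see a four-digit number come back, some formatted extraction (operator>>) is probably involved somewhere, not peek().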
For first file it seems read one byte while for the second file it seems read four byte.
How it's possible?
It's not. You did something wrong.

It depends entirely on the type of stream.
peek() will read one character; however, streams have a character type and traits associated with them that dictate what is considered a character. The stream's char_type and char_traits control this.
Here's a handy guide:
char : always 1 byte
wchar_t : platform defined, dictated by its *underlying type*; typically 4 bytes on Linux and macOS, 2 bytes on Windows
char32_t: at least 4 bytes
char16_t: at least 2 bytes
fstream is actually a basic_fstream<char>, meaning that it reads sizeof(char) bytes per character, which is 1 byte. It returns an int_type, which is also controlled through the traits.
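These relationships can be checked at compile time. The following is only a sketch of such checks, not anything taken from the standard library documentation verbatim; it relies on the member aliases char_type and int_type that every basic_istream specialisation exposes:

```cpp
#include <fstream>
#include <type_traits>

// std::ifstream is std::basic_ifstream<char>, so its character type is char...
static_assert(std::is_same<std::ifstream::char_type, char>::value,
              "ifstream works on char");
// ...and sizeof(char) is 1 by definition.
static_assert(sizeof(std::ifstream::char_type) == 1,
              "one byte per character");
// peek() and get() return int_type, which for char streams is int.
static_assert(std::is_same<std::ifstream::int_type, int>::value,
              "int_type of a char stream is int");
// The wide stream extracts wchar_t instead; its size is platform defined.
static_assert(std::is_same<std::wifstream::char_type, wchar_t>::value,
              "wifstream works on wchar_t");
```

If any of these assumptions did not hold on your implementation, the translation unit would simply fail to compile.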

Related

Returning to a specific place in a file after extraction of characters [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 5 years ago.
I have a question: after I loop over a file and extract several values, keeping a counter of how many characters have been extracted, how can I reposition the file pointer back to the first one extracted? Here is what I have tried so far:
int get_length(ifstream &inp, int &length) {
    int columns = 0;
    inp >> columns;
    length++;
    while (columns != 0)
    {
        inp >> columns;
        length++;
    }
    if (!inp.good())
        inp.clear();
    inp.seekg(-length, std::ios::cur);
    return length;
}
For some reason it's not going back the same length; it's off by one. I've tried adding one to length before calling the seek function, but I don't know what's wrong here. I'm wondering whether I'm using the seek function incorrectly.
I think that the problem is this:
You are incrementing 'length' once each time an integer value is read from the fstream 'inp'. You need to increment it by the number of characters in that integer's text representation instead, and also account for newline characters and any other whitespace in the fstream.
If your test data contains:
10
11
12
13
Then by the time you have read 13 you will have consumed 11 bytes of file data (eight digits plus three newlines, assuming one-byte line endings).
Your counter will have only incremented 4 times.
You could do this more easily and accurately by placing a call to
auto const position_start = inp.tellg();
at the start of your function and once you read the data you're interested in 'rewind' to the start position with a call to
inp.seekg(position_start);
'ifstream' is a specialisation of 'basic_ifstream' for char, so 'ifstream::seekg()' takes an offset in bytes (chars). However, formatted input (to an int) advances the current position by some number of bytes (0 or more) as it converts the input to an integer value. Use 'ifstream::tellg()' at the top of the function to get the current file position, and call 'tellg()' again before calling 'seekg()' to get the new file position. The difference between the two values is 'length'.
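Putting the tellg()/seekg() advice together, here is a minimal sketch. It is not the asker's function: `consumed_then_rewind` and `rewind_demo` are hypothetical names, and a std::istringstream stands in for the file so the example is self-contained.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Hypothetical helper: read integers until a 0 or until a read fails,
// report how many bytes were consumed, then rewind to where we started.
std::streamoff consumed_then_rewind(std::istream& in) {
    const std::istream::pos_type start = in.tellg();  // remember where we began
    int value = 0;
    while (in >> value && value != 0) {
        // keep extracting; the stream tracks the position for us
    }
    in.clear();                                       // drop eof/fail flags so tellg() works
    const std::istream::pos_type end = in.tellg();    // position after the last read
    in.seekg(start);                                  // absolute seek back to the start
    return end - start;                               // bytes consumed, whitespace included
}

// Usage with the sample data from the answer above.
void rewind_demo() {
    std::istringstream in("10\n11\n12\n13");
    assert(consumed_then_rewind(in) == 11);  // eight digits plus three newlines
    int first = 0;
    in >> first;
    assert(first == 10);                     // we really are back at the start
}
```

The point of the design is that no manual counter is needed at all: the stream already knows its own position.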

The C++ equivalent of C's format string [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I have a C program that reads from keyboard, like this:
scanf("%*[ \t\n]\"%[^A-Za-z]%[^\"]\"", ps1, ps2);
For a better understanding of what this instruction does, let's split the format string as follows:
%*[ \t\n]\" => read all spaces, tabs and newlines ([ \t\n]) but do not store them in any variable (hence the '*'), and keep reading until a double quote (\") is encountered; the double quote itself is not stored.
Once scanf() has found the double quote, it reads all characters that are not letters into ps1. This is accomplished with...
%[^A-Za-z] => input anything that is not an uppercase letter 'A' through 'Z' or a lowercase letter 'a' through 'z'.
%[^\"]\" => read all remaining characters up to, but not including, a double quote into ps2 ([^\"]); the string must end with a double quote (\"), but the double quote itself is not stored.
Can someone show me how to do the same thing in C++?
Thank you
C++ supports the scanf function. There is no simple alternative, especially if you want to replicate the exact semantics of scanf() with all the quirks.
Note however that your code has several issues:
You do not pass the maximum number of characters to read into ps1 and ps2. Any sufficiently long input sequence will cause a buffer overflow with dire consequences.
You could simplify the first format %*[ \t\n] to just a space in the format string. This would also allow for the case where no whitespace characters are present. As currently written, scanf() fails and returns 0 if no whitespace characters are present before the ".
Similarly, if there are no non-letters, or if no other characters follow before the second ", scanf() returns a short count of 0 or 1 and leaves one or both destination arrays in an indeterminate state.
For all these reasons, it would be much safer and more predictable in C to first read a line of input with fgets() and then use sscanf() or parse the line by hand.
In C++, you definitely want to use the std::regex facilities defined in the <regex> header.
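As a rough sketch of the std::regex approach (not an exact replica of the scanf() semantics: unlike %[...], this pattern lets either captured part be empty and tolerates missing leading whitespace; `parse_quoted` and `parse_demo` are hypothetical names):

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: from a line such as   "123 abc def"
// extract the leading run of non-letters and then the rest up to the
// closing double quote. Returns false if the line does not match.
bool parse_quoted(const std::string& line,
                  std::string& non_letters,
                  std::string& rest) {
    // ^\s*"            optional whitespace, then the opening quote
    // ([^A-Za-z"]*)    characters that are neither letters nor a quote
    // ([^"]*)"         everything else up to the closing quote
    static const std::regex pattern(R"re(^\s*"([^A-Za-z"]*)([^"]*)")re");
    std::smatch match;
    if (!std::regex_search(line, match, pattern)) {
        return false;
    }
    non_letters = match[1].str();
    rest = match[2].str();
    return true;
}

// Usage: std::string grows as needed, so there is no buffer to overflow.
void parse_demo() {
    std::string ps1, ps2;
    assert(parse_quoted("  \"123 abc def\"", ps1, ps2));
    assert(ps1 == "123 ");
    assert(ps2 == "abc def");
    assert(!parse_quoted("no quotes here", ps1, ps2));
}
```

Reading a whole line first with std::getline() and then matching it mirrors the fgets() plus sscanf() advice for C.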

How do Binary Files works? (From c++'s point of view) [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I have some misunderstandings about binary files. I don't understand what a binary file is. I know text files are also binary files, but a binary file needs to be parsed in order to extract information, and a binary file with the same contents as a text file looks different. For example, when I store my name, "Rishabh", in a binary file, the file contains not only Rishabh but also some extra unreadable characters. What are they? Why doesn't it store only the characters, like a text file? And what are binary file formats, e.g. .3d, .zip, .mp3, etc.? As far as I know, for text files the extension specifies what the format is or how to process the file, like .dae, .xml, .htm, etc. These contain tags to store data. But what about binary files? They don't need any tags, because the data is stored like a variable in the file, from which we copy the contents into the program's variables (I mean it is stored as it would be in memory). So why are these binary file formats different from each other? Why can't a single program read the contents of any file whose format is unknown to the world and to me? And what is binary file format cracking?
All files have some kind of pre-determined encoding since computers can't store anything but bit-patterns in bytes on disk. A text file contains only the encoding for printable characters plus space, and few other encodings to end-a-line, tab, and maybe form-feed and a few others related to character display on a device. Because the encoding in a text file is a well-known standard, and is quite common, there are functions in most, if not all languages, to deal specifically with that type of file. Most importantly, they know how to read a line at a time - they recognize line-terminator character(s).
If however, you type the characters of your name in some other program besides a text editor - say you write using the text tool in Gimp or Microsoft Paint, and then save it. The program has to save more information than just your name. Your name has a position on a canvas that must be saved. It also has a font and a size and whether it is bold or italic or underlined, that need to be saved. The size of the canvas needs to be saved. The color being used, even if white and black, needs to be saved. This encoding will be different than the encoding used to save the letters of your name. So if you edit the file with a text editor, you will see some gibberish since the text editor is expecting character encoding and knows nothing about the encoding Gimp uses for fonts, font sizes, x,y positions, etc.
C++ compilers are not written with routines to understand any binary file encodings. The routines for reading/writing binary files in C++ will just read and write sequences of bytes. Although, since the fundamental type that holds a byte of data in C++ is a char (or unsigned char), you will see binary prototypes like
write ( char * buffer, streamsize size );
read ( char * buffer, streamsize size );
But the char pointer in this case should be considered as a "byte *" since the read/write functions are just moving bytes of data from/to disk or memory without any regard for character encodings.
C++ read/write routines don't know, or care what the format or encoding is for the bytes they are moving. So it is left up to the programmer to write code to process or handle these bytes according to the pre-defined format for the file. However, the routines written to process a specific format of binary file can be compiled into a library that can then be shared or sold, and used by many C++ programmers. For example, LibXL can be used to read the binary format of Excel files from a C++ program.
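To make the "just moving bytes" point concrete, here is a small sketch that stores the same number once as text and once as raw bytes. The helper names are hypothetical, and an in-memory string stream stands in for a file opened in binary mode:

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Text form: formatted output produces the decimal digits as characters.
std::string as_text(std::uint32_t value) {
    std::ostringstream out;
    out << value;
    return out.str();
}

// Binary form: write() copies the object's bytes exactly as they sit in memory.
std::string as_binary(std::uint32_t value) {
    std::ostringstream out;
    out.write(reinterpret_cast<const char*>(&value), sizeof value);
    return out.str();
}

// Reading it back only works because the reader knows the "format":
// four raw bytes that make up one std::uint32_t in this machine's byte order.
std::uint32_t from_binary(const std::string& bytes) {
    std::istringstream in(bytes);
    std::uint32_t value = 0;
    in.read(reinterpret_cast<char*>(&value), sizeof value);
    return value;
}

void binary_demo() {
    assert(as_text(123456) == "123456");                 // six readable characters
    assert(as_binary(123456).size() == sizeof(std::uint32_t));
    assert(from_binary(as_binary(123456)) == 123456);    // round trip
}
```

Opening the binary form in a text editor would show a few odd characters, which is the "gibberish" described above.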
From the perspective of C/C++, the only difference between text and binary files is how line endings are handled.
If you open a file in binary mode, then read reads exactly the bytes in the file, and write writes exactly the bytes which are in memory.
If you open a file in text mode, then whatever character or character sequence is conventionally used to represent the end of a line in a file is transformed into some single character (which is written in the source code as '\n', although it is only one character) when the file is read, and the \n is transformed into the conventional end-of-line character or sequence when the file is written to. Also, it is not technically legal for the file to not end with an end-of-line sequence, and there may be a limit to the length of a line.
In Unix, the two modes are identical, because \n is a representation of the character code 10 (0A in hex), and that is precisely the conventional line-ending character. In Windows, by contrast, the conventional line-ending sequence is two bytes long -- {13,10} or {0D,0A}. \n is still 0A, so effectively the 0D preceding the 0A is deleted from the data read from the file, and an 0D is inserted before every 0A when data is written to the file.
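The translation can be imitated by hand. This is only a simulation of what a Windows text-mode stream effectively does, written as ordinary string functions with hypothetical names; it is not how any particular C library implements it:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// What writing in text mode on Windows amounts to:
// an 0D (CR) is inserted before every 0A (LF).
std::string simulate_text_write(const std::string& in_memory) {
    std::string on_disk;
    for (char c : in_memory) {
        if (c == '\n') {
            on_disk += '\r';
        }
        on_disk += c;
    }
    return on_disk;
}

// What reading in text mode on Windows amounts to:
// an 0D immediately preceding an 0A is dropped.
std::string simulate_text_read(const std::string& on_disk) {
    std::string in_memory;
    for (std::size_t i = 0; i < on_disk.size(); ++i) {
        const bool cr_before_lf = on_disk[i] == '\r' &&
                                  i + 1 < on_disk.size() &&
                                  on_disk[i + 1] == '\n';
        if (!cr_before_lf) {
            in_memory += on_disk[i];
        }
    }
    return in_memory;
}

void line_ending_demo() {
    assert(simulate_text_write("a\nb\n") == "a\r\nb\r\n");
    assert(simulate_text_read("a\r\nb\r\n") == "a\nb\n");
}
```

In binary mode neither step happens, which is why byte counts and seek offsets are only reliable there.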
Some (much) older operating systems had no conventional line-ending character. Instead, all lines were padded with space characters to the exact same length, making it possible to directly seek to a specific line number. C libraries working in text mode would typically read exactly the line length, and then delete the trailing spaces (if any) and finally add the code corresponding to \n (some such systems used EBCDIC instead of ASCII, so \n was a different integer value). Writing the data out, the \n would be deleted and replaced with exactly the correct number of spaces to bring the line to the standard length. Fortunately, those of us who don't work in a computing museum don't have to deal with that stuff any more, and Apple abandoned its use of 0D as the line-end character with the advent of OSX, so the text/binary difference is now limited to Windows.
Technically text files are binary, as all files are really binary files. Text files tend to store only text characters, whereas binary files store any conceivable value: numbers, images, text, etc. Numbers, for example, are not stored in decimal notation like "1234"; they are stored in binary, using 0s and 1s only. There are a few ways to do this (depending, for example, on the byte order of the machine), so the same number could look like a different set of 0s and 1s, e.g. 0001110101011. If you open a binary file in Notepad, it tries to display everything as text, and what you see is garbage, which is the other data represented in binary.
Cracking a binary file format means working out exactly what information is stored in each byte of the file: sometimes text, numbers, arrays, classes, structures... anything really. Given experience one could slowly work out what is what, but that's pretty advanced stuff!
Sometimes the information (format) is freely available and easy to follow, or a nightmare to follow like the format for a MS Word document. (MS Word format is freely available, but reputed to be insanely complicated due to backwards compatibility ...Nonetheless, having the format documentation allows you to 'crack' the binary file format and know exactly what all the binary represents)
It's one of the fundamentals of a computer system.
There is probably a great explanation at this link:
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/asciiBin.html
Some text quoted:
Although ASCII files are binary files, some people treat them as
different kinds of files. I like to think of ASCII files as special
kinds of binary files. They're binary files where each byte is written
in ASCII code.
A full, general binary file has no such restrictions. Any of the 256
bit patterns can be used in any byte of a binary file.
We work with binary files all the time. Executables, object files,
image files, sound files, and many file formats are binary files. What
makes them binary is merely the fact that each byte of a binary file
can be one of 256 bit patterns. They're not restricted to the ASCII
codes.

Word Length as defined in the .ZIP format specification

So I've been reading through PKWARE's specification of the .zip file format and have noticed that in several places they give block sizes in terms of words (the processor word, not the dictionary word :-) ).
Now, the way I understand it, the byte size of a word is specific to a certain processor family. So if a file was zipped on an i386 and then unzipped on an x86-64, the two architectures would have different definitions of a word (4 bytes vs. 8 bytes) and would therefore interpret the block data differently.
Am I missing something here? Or do the folks at PKWARE simply assume that 1 word = 4 bytes? That seems like the most likely option to me - I've checked some zip files with a hex editor and the 4-byte definition would fit, but I'd like some confirmation because it's not like I have a whole bunch of different processors to test with :)
Thanks in advance, and sorry if the question already exists - I did try searching but it's a little difficult because the word "word" is so ambiguous (see what I mean?)
Where the specification says "word" for a stored block in the deflate format, it means two bytes (in LSB order).
For zip decryption (where said encryption should not be used since it's so weak), again a word means two bytes.
When it talks about a general purpose flag word under imploding, it again means two bytes.
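Reading such a two-byte word in a way that does not depend on the host processor could look like the sketch below. `read_word_le` is a hypothetical helper, and it assumes, as stated above, that the low-order byte is stored first:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: assemble a 16-bit word from two bytes stored
// least-significant byte first, regardless of the host's own byte order.
std::uint16_t read_word_le(const unsigned char* bytes) {
    return static_cast<std::uint16_t>(bytes[0] | (bytes[1] << 8));
}

void word_demo() {
    const unsigned char stored[2] = {0x34, 0x12};  // low byte first
    assert(read_word_le(stored) == 0x1234);
}
```

Because the value is built from individual bytes, the same code gives the same answer on any architecture, whatever that processor calls a "word".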

Using C++, how do I read a string of a specific length, from a non-binary file?

The cplusplus.com example for reading text files shows that a line can be read using the getline function. However, I don't want to get an entire line; I want to get only a certain number of characters. How can this be done in a way that preserves character encoding?
I need a function that does something like this:
ifstream fileStream;
fileStream.open("file.txt", ios::in);
resultStream << getstring(fileStream, 10); // read first 10 chars
file.ftell(10); // move to the next item
resultStream << getstring(fileStream, 10); // read 10 more chars
I thought about reading to a char buffer, but wouldn't this change the character encoding?
I really suspect that there's some confusion here regarding the term "character." Judging from the OP's question, he is using the term "character" to refer to a char (as opposed to a logical "character", like a multi-byte UTF-8 character), and thus for the purpose of reading from a text-file the term "character" is interchangeable with "byte."
If that is the case, you can read a certain number of bytes from disk using ifstream::read(), e.g.
ifstream fileStream;
fileStream.open("file.txt", ios::in);
char buffer[1024];
fileStream.read(buffer, sizeof(buffer));
Reading into a char buffer won't affect the character encoding at all. The exact sequence of bytes stored on disk will be copied into the buffer.
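For the fixed-size case, something like the getstring() the question asks for can be sketched on top of read() and gcount(). `read_chars` is a hypothetical name, and a std::istringstream replaces the file so the example stands alone:

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: read up to `count` chars (bytes) from the stream.
// The returned string is shorter than `count` if the input ran out.
std::string read_chars(std::istream& in, std::size_t count) {
    std::string out(count, '\0');
    in.read(&out[0], static_cast<std::streamsize>(count));
    out.resize(static_cast<std::size_t>(in.gcount()));  // keep only what was really read
    return out;
}

void read_chars_demo() {
    std::istringstream in("abcdefghijKLMNOPQRSTxyz");
    assert(read_chars(in, 10) == "abcdefghij");  // first 10 chars
    assert(read_chars(in, 10) == "KLMNOPQRST");  // the stream position advanced by itself
    assert(read_chars(in, 10) == "xyz");         // short read at the end
}
```

There is no need to move the position manually between calls; each read() continues where the previous one stopped.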
However, it is a different story if you are using a multi-byte character set where each character is variable-length. If characters are not fixed-size, there's no way to read exactly N characters from disk with a single disk read. This is not a limitation of C++, this is simply the reality of dealing with block devices (disks). At the lowest levels of your OS, block devices are addressed in terms of blocks, which in turn are made up of bytes. So you can always read an exact number of bytes from disk, but you can't read an exact number of logical characters from disk, unless each character is a fixed number of bytes. For character-sets like UTF-8 where each character is variable length, you'll have to either read in the entire file, or else perform speculative reads and parse the read buffer after each read to determine if you need to read more.
C++ itself doesn't have a concept of character encoding. chars are always the same size, as are wchar_ts. So if you need to read X characters of a multibyte character set (such as UTF-8), then you'll either have to read one (single-byte) char at a time (e.g. using istream::get(), or X chars speculatively using istream::getline()) and test the MBCS signals yourself, or use a third-party library to do it.
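A sketch of the read-one-char-at-a-time approach for UTF-8 follows. It assumes the input is well-formed UTF-8 and does no validation; `read_utf8_chars` is a hypothetical name. The lead byte of each sequence tells you how many continuation bytes follow:

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: read up to `count` UTF-8 encoded characters
// (code points) and return their bytes unchanged. Assumes valid UTF-8.
std::string read_utf8_chars(std::istream& in, std::size_t count) {
    const int eof = std::char_traits<char>::eof();
    std::string out;
    for (std::size_t i = 0; i < count; ++i) {
        const int lead = in.get();
        if (lead == eof) {
            break;
        }
        out.push_back(static_cast<char>(lead));
        // Work out the number of continuation bytes from the lead byte.
        std::size_t extra = 0;
        if ((lead & 0xE0) == 0xC0) {
            extra = 1;        // 110xxxxx: two-byte sequence
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2;        // 1110xxxx: three-byte sequence
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3;        // 11110xxx: four-byte sequence
        }
        for (; extra > 0; --extra) {
            const int next = in.get();
            if (next == eof) {
                return out;   // truncated input: return what we have
            }
            out.push_back(static_cast<char>(next));
        }
    }
    return out;
}

void utf8_demo() {
    std::istringstream in("h\xC3\xA9llo");             // 'e' with acute accent is 2 bytes
    assert(read_utf8_chars(in, 2) == "h\xC3\xA9");     // 2 characters, 3 bytes
    assert(read_utf8_chars(in, 3) == "llo");
}
```

For anything beyond a quick experiment, a dedicated library is probably the safer route, since real input may contain malformed sequences.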
If the charset is a fixed width encoding, and you don't mind stopping when you get to a newline, then getline(), which allows you to specify the maximum number of chars to read, is probably what you want.
As a few people have mentioned, the C/C++ Standard Libraries don't really provide anything that operates above essentially byte level. So if you're wanting to do this using only the core libraries you don't have a ready made option.
Which leaves either checking if your chosen platform(s) provide another library that implements this capability, writing your own parser for handling character encodings, or punching something like "c++ utf8 library" or "posix unicode" into Google and taking a look at what turns up.
Possible interesting hits:
UTF-8 and Unicode FAQ
UTF-CPP
I'll leave further investigation to the reader.
I think you can use the sgetn member function of the stream's associated streambuf...
char buf[32];
streamsize i = fileStream.rdbuf()->sgetn( &buf[0], 10 );
Which will read 10 chars into buf (if there are 10 available to read), returning the number of chars read.