I have a program where I need to operate on different types of files.
I want the input and output files of the following program to be the same.
#include<iostream>
#include<string>
#include<fstream>
#include<sstream>
typedef unsigned char u8;
using namespace std;
char* readFileBytes(string name)
{
ifstream fl(name);
fl.seekg( 0, ios::end );
size_t len = fl.tellg();
char *ret = new char[len];
fl.seekg(0, ios::beg);
fl.read(ret, len);
fl.close();
return ret;
}
int main(int argc, char *argv[]){
string name = "file.pdf";
u8* file = (u8*) readFileBytes(name);
// cout<<str<<endl;
int len = 0;
while(file[len] != '\0')
len++;
cout<<"FILESIZE : "<<len<<endl;
string filename = "file2.pdf";
ofstream outfile(filename,ios::out | ios::binary);
outfile.write((char*) file,len);
outfile.close();
exit(0);
}
The difference between the output and input files is checked using diff
diff file.pdf file2.pdf
What should I do to make file2.pdf the same as file.pdf?
I have tried using xxd to change the binary into hexadecimal but the disadvantage is that the overall size doubles. So therefore I want to operate in binary only.
size_t len = fl.tellg();
char *ret = new char[len];
In this manner the shown code determines the number of characters in the file. This is fine. The only problem with it is that after this number of characters is read, this very important information is completely forgotten and thrown away. This function returns only this ret pointer, and the actual number of characters in it is now an unsolvable mystery.
But then, main() attempts to solve this mystery as follows:
int len = 0;
while(file[len] != '\0')
len++;
This attempts to reverse-engineer the number of characters by looking for the first 0 byte in the buffer.
Which has absolutely nothing to do with anything. The first character in the file may be a 0 byte, so this will calculate that the file is empty, and not ten gigabytes in size.
Or the file can contain just a string "Hello world", which this for loop will happily blow past, then start rooting around in some random memory after this buffer, resulting in undefined behavior.
That's the fatal logical flaw in the shown code: the actual size of the file is thrown away, and instead reverse-engineered in a flawed way.
You will need to rework the code so that the number of characters in the file, the original len, is also returned to main(), and it uses that, instead of attempting to guess what it originally was.
P.S. delete-ing the ret buffer, after you're done with it, would also be a good idea too. An even better idea is to avoid using new, using vector instead, which will happily give you its size() any time you ask for it, and you won't have to worry about deleting the allocated memory.
In order to correctly process binary data, the size must be stored and cannot be computed from a sentinel null byte, because null bytes can be legimate bytes in a binary file. So you should return the read lenght in addition to the buffer, or even better copy each buffer to the new file until you have exhausted the input file:
int main(int argc, char *argv[]){
constexpr size_t sz = 10240; // size of buffer
char buffer[sz];
string name = "file.pdf";
string filename = "file2.pdf";
ifstream fl(name);
ofstream outfile(filename,ios::out | ios::binary);
int len = 0, buflen;
for (;;) {
buflen = fl.read(buf, len);
if (buflen == 0) break; // reached EOF
len += buflen;
if (buflen != outfile.write(buf, buflen)) {
// display an error message
return 1;
}
}
fl.close();
outfile.close()
cout<<"FILESIZE : "<<len<<endl;
exit(0);
}
Related
I am trying to write a simple program that reads a file by encapsulating functions like open, lseek, pread.
My file for test contains:
first second third forth fifth sixth
seventh eighth
my main function that tries to read 20 bytes with offset 10 from the file:
#include <iostream>
#include "CacheFS.h"
using namespace std;
int main(int argc, const char * argv[]) {
char * filename1 = "/Users/Desktop/File";
int fd1 = CacheFS_open(filename1);
//read from file and print it
void* buf[20];
CacheFS_pread(fd1, &buf, 20, 10);
cout << (char*)buf << endl;
}
implementation of the functions that the main is using:
int CacheFS_open(const char *pathname)
{
mode_t modes = O_SYNC | 0 | O_RDONLY;
int fd = open(pathname, modes);
return fd;
}
int CacheFS_pread(int file_id, void *buf, size_t count, off_t offset)
{
off_t seek = lseek(file_id, offset, SEEK_SET);
off_t fileLength = lseek(file_id, 0, SEEK_END);
if (count + seek <= fileLength) //this case we are not getting to the file end when readin this chunk
{
pread(file_id, &buf, count, seek);
} else { //count is too big so we can only read a part of the chunk
off_t size = fileLength - seek;
pread(file_id, &buf, size, seek);
}
return 0;
}
My main function prints this to the console:
\350\366\277_\377
I would expect it to print some values from the file itself, and not some numbers and slashes that represenet something I do not really understand.
Why does this happen?
The following changes will make your program work:
Your buffer has to be an existent char array and your CacheFS_pread function is called without the address operator & then. Also use the buffer size minus 1 because the pread function will override the terminating \0 because it's just read n bytes of the file. I use a zero initialized char array here so that there will be a null terminating \0 at least at the end.
char buf[20] = { '\0' }; // declare and initialize with zeros
CacheFS_pread(fd1, buf, sizeof(buf) - 1, 10);
Your function header should accept only a char pointer for typesafety reasons.
int CacheFS_pread(int file_id, char* buf, size_t count, off_t offset)
Your pread call is then without the address operator &:
pread(file_id, buf, count, seek);
Output: nd third forth fift because buffer is just 20!
Also I would check if your calculations and your if statements are right. I have the feeling that it's not exactly right. I would also recomment to use the return value of pread.
Noobie Alert.
Ugh. I'm having some real trouble getting some basic file I/O stuff done using <stdio.h> or <fstream>. They both seem so clunky and non-intuitive to use. I mean, why couldn't C++ just provide a way to get a char* pointer to the first char in the file? That's all I'd ever want.
I'm doing Project Euler Question 13 and need to play with 50-digit numbers. I have the 150 numbers stored in the file 13.txt and I'm trying to create a 150x50 array so I can play with the digits of each number directly. But I'm having tons of trouble. I've tried using the C++ <fstream> library and recently straight <stdio.h> to get it done, but something must not be clicking for me. Here's what I have;
#include <iostream>
#include <stdio.h>
int main() {
const unsigned N = 100;
const unsigned D = 50;
unsigned short nums[N][D];
FILE* f = fopen("13.txt", "r");
//error-checking for NULL return
unsigned short *d_ptr = &nums[0][0];
int c = 0;
while ((c = fgetc(f)) != EOF) {
if (c == '\n' || c == '\t' || c == ' ') {
continue;
}
*d_ptr = (short)(c-0x30);
++d_ptr;
}
fclose(f);
//do stuff
return 0;
}
Can someone offer some advice? Perhaps a C++ guy on which I/O library they prefer?
Here's a nice efficient solution (but doesn't work with pipes):
std::vector<char> content;
FILE* f = fopen("13.txt", "r");
// error-checking goes here
fseek(f, 0, SEEK_END);
content.resize(ftell(f));
fseek(f, 0, SEEK_BEGIN);
fread(&content[0], 1, content.size(), f);
fclose(f);
Here's another:
std::vector<char> content;
struct stat fileinfo;
stat("13.txt", &fileinfo);
// error-checking goes here
content.resize(fileinfo.st_size);
FILE* f = fopen("13.txt", "r");
// error-checking goes here
fread(&content[0], 1, content.size(), f);
// error-checking goes here
fclose(f);
I would use an fstream. The one problem you have is that you obviously can't fit the numbers in the file into any of C++'s native numeric types (double, long long, etc.)
Reading them into strings is pretty easy though:
std::fstream in("13.txt");
std::vector<std::string> numbers((std::istream_iterator<std::string>(in)),
std::istream_iterator<std::string>());
That will read in each number into a string, so the number that was on the first line will be in numbers[0], the second line in numbers[1], and so on.
If you really want to do the job in C, it can still be quite a lot easier than what you have above:
char *dupe(char const *in) {
char *ret;
if (NULL != (ret=malloc(strlen(in)+1))
strcpy(ret, in);
return ret;
}
// read the data:
char buffer[256];
char *strings[256];
size_t pos = 0;
while (fgets(buffer, sizeof(buffer), stdin)
strings[pos++] = dupe(buffer);
Rather than reading the one hundred 50 digit numbers from a file, why not read them directly in from a character constant?
You could start your code out with:
static const char numbers[] =
"37107287533902102798797998220837590246510135740250"
"46376937677490009712648124896970078050417018260538"...
With a semicolon at the last line.
Like always, problems with pointers. This time I am trying to read a file (opened in binary mode) and store some portion of it in a std::string object.
Let's see:
FILE* myfile = fopen("myfile.bin", "rb");
if (myfile != NULL) {
short stringlength = 6;
string mystring;
fseek(myfile , 0, SEEK_SET);
fread((char*)mystring.c_str(), sizeof(char), (size_t)stringlength, myfile);
cout << mystring;
fclose(myfile );
}
Is this possible? I don't get any message. I am sure the file is O.K. When I try with char* it does work but I want to store it directly into the string. Thanks for your help!
Set the string to be large enough first to avoid buffer overrun, and access the byte array as &mystring[0] to satisfy const and other requirements of std::string.
FILE* myfile = fopen("myfile.bin", "rb");
if (myfile != NULL) {
short stringlength = 6;
string mystring( stringlength, '\0' );
fseek(myfile , 0, SEEK_SET);
fread(&mystring[0], sizeof(char), (size_t)stringlength, myfile);
cout << mystring;
fclose(myfile );
}
There are many, many issues in this code but that is a minimal adjustment to properly use std::string.
I would recommend this as the best way to do such a thing. Also you should check to make sure that all the bytes were read.
FILE* sFile = fopen(this->file.c_str(), "r");
// if unable to open file
if (sFile == nullptr)
{
return false;
}
// seek to end of file
fseek(sFile, 0, SEEK_END);
// get current file position which is end from seek
size_t size = ftell(sFile);
std::string ss;
// allocate string space and set length
ss.resize(size);
// go back to beginning of file for read
rewind(sFile);
// read 1*size bytes from sfile into ss
fread(&ss[0], 1, size, sFile);
// close the file
fclose(sFile);
string::c_str() returns const char* which you can not modify.
One way to do this would be use a char* first and construct a string from it.
Example
char buffer = malloc(stringlength * sizeof(char));
fread(buffer, sizeof(char), (size_t)stringlength, myfile);
string mystring(buffer);
free(buffer);
But then again, if you want a string, you should perhaps ask yourself Why am I using fopen and fread in the first place??
fstream would be a much much better option.
You can read more about it here
Please check out the following regarding c_str to see some things that are wrong with your program. A few issues include the c_str not being modifiable, but also that it returns a pointer to your string contents, but you never initialized the string.
http://www.cplusplus.com/reference/string/string/c_str/
As for resolving it... you could try reading into a char* and then initializing your string from that.
No it is not. std::string::c_str() method does not return a modifiable character sequence as you can validate from here. A better solution would be using a buffer char array. Here is an example:
FILE* myfile = fopen("myfile.bin", "rb");
if (myfile != NULL) {
char buffer[7]; //Or you can use malloc() / new instead.
short stringlength = 6;
fseek(myfile , 0, SEEK_SET);
fread(buffer, sizeof(char), (size_t)stringlength, myfile);
string mystring(buffer);
cout << mystring;
fclose(myfile );
//use free() or delete if buffer is allocated dynamically
}
I am currently doing some testing with a new addition to the ICU dictionary-based break iterator.
I have code that allows me to test the word-breaking on a text document but when the text document is too large it gives the error: bash: ./a.out: Argument list too long
I am not sure how to edit the code to break-up the argument list when it gets too long so that a file of any size can be run through the code. The original code author is quite busy, would someone be willing to help out?
I tried removing the printing of what is being examined to see if that would help, but I still get the error on large files (printing what is being examined isn't necessary - I just need the result).
If the code could be modified to read the source text file line by line and export the results line by line to another text file (ending up with all the lines when it is done), that would be perfect.
The code is as follows:
/*
Written by George Rhoten to test how word segmentation works.
Code inspired by the break ICU sample.
Here is an example to run this code under Cygwin.
PATH=$PATH:icu-test/source/lib ./a.exe "`cat input.txt`" > output.txt
Encode input.txt as UTF-8.
The output text is UTF-8.
*/
#include <stdio.h>
#include <unicode/brkiter.h>
#include <unicode/ucnv.h>
#define ZW_SPACE "\xE2\x80\x8B"
void printUnicodeString(const UnicodeString &s) {
int32_t len = s.length() * U8_MAX_LENGTH + 1;
char *charBuf = new char[len];
len = s.extract(0, s.length(), charBuf, len, NULL);
charBuf[len] = 0;
printf("%s", charBuf);
delete charBuf;
}
/* Creating and using text boundaries */
int main(int argc, char **argv)
{
ucnv_setDefaultName("UTF-8");
UnicodeString stringToExamine("Aaa bbb ccc. Ddd eee fff.");
printf("Examining: ");
if (argc > 1) {
// Override the default charset.
stringToExamine = UnicodeString(argv[1]);
if (stringToExamine.charAt(0) == 0xFEFF) {
// Remove the BOM
stringToExamine = UnicodeString(stringToExamine, 1);
}
}
printUnicodeString(stringToExamine);
puts("");
//print each sentence in forward and reverse order
UErrorCode status = U_ZERO_ERROR;
BreakIterator* boundary = BreakIterator::createWordInstance(NULL, status);
if (U_FAILURE(status)) {
printf("Failed to create sentence break iterator. status = %s",
u_errorName(status));
exit(1);
}
printf("Result: ");
//print each word in order
boundary->setText(stringToExamine);
int32_t start = boundary->first();
int32_t end = boundary->next();
while (end != BreakIterator::DONE) {
if (start != 0) {
printf(ZW_SPACE);
}
printUnicodeString(UnicodeString(stringToExamine, start, end-start));
start = end;
end = boundary->next();
}
delete boundary;
return 0;
}
Thanks so much!
-Nathan
The Argument list too long error message is coming from the bash shell and is happening before your code even gets started executing.
The only code you can fix to eliminate this problem is the bash source code (or maybe it is in the kernel) and then, you're always going to run into a limit. If you increase from 2048 files on command line to 10,000, then some day you'll need to process 10,001 files ;-)
There are numerous solutions to managing 'too big' argument lists.
The standardized solution is the xargs utility.
find / -print | xargs echo
is a un-helpful, but working example.
See How to use "xargs" properly when argument list is too long for more info.
Even xargs has problems, because file names can contain spaces, new-line chars, and other unfriendly stuff.
I hope this helps.
The code below reads the content of a file whos name is given as the first parameter on the command-line and places it in a str::buffer. Then, instead of calling the function UnicodeString with argv[1], use that buffer instead.
#include<iostream>
#include<fstream>
using namespace std;
int main(int argc, char **argv)
{
std::string buffer;
if(argc > 1) {
std::ifstream t;
t.open(argv[1]);
std::string line;
while(t){
std::getline(t, line);
buffer += line + '\n';
}
}
cout << buffer;
return 0;
}
Update:
Input to UnicodeString should be char*. The function GetFileIntoCharPointer does that.
Note that only the most rudimentary error checking is implemented below!
#include<iostream>
#include<fstream>
using namespace std;
char * GetFileIntoCharPointer(char *pFile, long &lRet)
{
FILE * fp = fopen(pFile,"rb");
if (fp == NULL) return 0;
fseek(fp, 0, SEEK_END);
long size = ftell(fp);
fseek(fp, 0, SEEK_SET);
char *pData = new char[size + 1];
lRet = fread(pData, sizeof(char), size, fp);
fclose(fp);
return pData;
}
int main(int argc, char **argv)
{
long Len;
char * Data = GetFileIntoCharPointer(argv[1], Len);
std::cout << Data << std::endl;
if (Data != NULL)
delete [] Data;
return 0;
}
Using the readlink function used as a solution to How do I find the location of the executable in C?, how would I get the path into a char array? Also, what do the variables buf and bufsize represent and how do I initialize them?
EDIT: I am trying to get the path of the currently running program, just like the question linked above. The answer to that question said to use readlink("proc/self/exe"). I do not know how to implement that into my program. I tried:
char buf[1024];
string var = readlink("/proc/self/exe", buf, bufsize);
This is obviously incorrect.
This Use the readlink() function properly for the correct uses of the readlink function.
If you have your path in a std::string, you could do something like this:
#include <unistd.h>
#include <limits.h>
std::string do_readlink(std::string const& path) {
char buff[PATH_MAX];
ssize_t len = ::readlink(path.c_str(), buff, sizeof(buff)-1);
if (len != -1) {
buff[len] = '\0';
return std::string(buff);
}
/* handle error condition */
}
If you're only after a fixed path:
std::string get_selfpath() {
char buff[PATH_MAX];
ssize_t len = ::readlink("/proc/self/exe", buff, sizeof(buff)-1);
if (len != -1) {
buff[len] = '\0';
return std::string(buff);
}
/* handle error condition */
}
To use it:
int main()
{
std::string selfpath = get_selfpath();
std::cout << selfpath << std::endl;
return 0;
}
Accepted answer is almost correct, except you can't rely on PATH_MAX because it is
not guaranteed to be defined per POSIX if the system does not have such
limit.
(From readlink(2) manpage)
Also, when it's defined it doesn't always represent the "true" limit. (See http://insanecoding.blogspot.fr/2007/11/pathmax-simply-isnt.html )
The readlink's manpage also give a way to do that on symlink :
Using a statically sized buffer might not provide enough room for the
symbolic link contents. The required size for the buffer can be
obtained from the stat.st_size value returned by a call to lstat(2) on
the link. However, the number of bytes written by readlink() and read‐
linkat() should be checked to make sure that the size of the symbolic
link did not increase between the calls.
However in the case of /proc/self/exe/ as for most of /proc files, stat.st_size would be 0. The only remaining solution I see is to resize buffer while it doesn't fit.
I suggest the use of vector<char> as follow for this purpose:
std::string get_selfpath()
{
std::vector<char> buf(400);
ssize_t len;
do
{
buf.resize(buf.size() + 100);
len = ::readlink("/proc/self/exe", &(buf[0]), buf.size());
} while (buf.size() == len);
if (len > 0)
{
buf[len] = '\0';
return (std::string(&(buf[0])));
}
/* handle error */
return "";
}
Let's look at what the manpage says:
readlink() places the contents of the symbolic link path in the buffer
buf, which has size bufsiz. readlink does not append a NUL character to
buf.
OK. Should be simple enough. Given your buffer of 1024 chars:
char buf[1024];
/* The manpage says it won't null terminate. Let's zero the buffer. */
memset(buf, 0, sizeof(buf));
/* Note we use sizeof(buf)-1 since we may need an extra char for NUL. */
if (readlink("/proc/self/exe", buf, sizeof(buf)-1) < 0)
{
/* There was an error... Perhaps the path does not exist
* or the buffer is not big enough. errno has the details. */
perror("readlink");
return -1;
}
char *
readlink_malloc (const char *filename)
{
int size = 100;
char *buffer = NULL;
while (1)
{
buffer = (char *) xrealloc (buffer, size);
int nchars = readlink (filename, buffer, size);
if (nchars < 0)
{
free (buffer);
return NULL;
}
if (nchars < size)
return buffer;
size *= 2;
}
}
Taken from: http://www.delorie.com/gnu/docs/glibc/libc_279.html
#include <stdlib.h>
#include <unistd.h>
static char *exename(void)
{
char *buf;
char *newbuf;
size_t cap;
ssize_t len;
buf = NULL;
for (cap = 64; cap <= 16384; cap *= 2) {
newbuf = realloc(buf, cap);
if (newbuf == NULL) {
break;
}
buf = newbuf;
len = readlink("/proc/self/exe", buf, cap);
if (len < 0) {
break;
}
if ((size_t)len < cap) {
buf[len] = 0;
return buf;
}
}
free(buf);
return NULL;
}
#include <stdio.h>
int main(void)
{
char *e = exename();
printf("%s\n", e ? e : "unknown");
free(e);
return 0;
}
This uses the traditional "when you don't know the right buffer size, reallocate increasing powers of two" trick. We assume that allocating less than 64 bytes for a pathname is not worth the effort. We also assume that an executable pathname as long as 16384 (2**14) bytes has to indicate some kind of anomaly in how the program was installed, and it's not useful to know the pathname as we'll soon encounter bigger problems to worry about.
There is no need to bother with constants like PATH_MAX. Reserving so much memory is overkill for almost all pathnames, and as noted in another answer, it's not guaranteed to be the actual upper limit anyway. For this application, we can pick a common-sense upper limit such as 16384. Even for applications with no common-sense upper limit, reallocating increasing powers of two is a good approach. You only need log n calls for a n-byte result, and the amount of memory capacity you waste is proportional to the length of the result. It also avoids race conditions where the length of the string changes between the realloc() and the readlink().