I got the following question from a past exam paper:
Consider the following source code:
using namespace std;
int main()
{
char dest[20];
printf( "%s\n", altcopy( dest, "demo" ) );
printf( "%s\n", altcopy( dest, "demo2" ) );
printf( "%s\n", altcopy( dest, "longer" ) );
printf( "%s\n", altcopy( dest, "even longer" ) );
printf( "%s\n", altcopy( dest, "and a really long string" ) );
}
Provide an implementation for the function called altcopy() which uses pointer arithmetic to copy alternate characters of a C-type string to the destination (i.e. the first, third, fifth etc character). Your answer must not use the [] operator to access an array index. The above code would then output the following:
dm
dm2
lne
ee ogr
adaral ogsrn
And I have attempted it as follows:
#include <cstdio>
using namespace std;
char* altcopy (char* dest, const char* str)
{
char* p = dest;
const char* q = str;
for ( int n=0; n<=20; n++ )
{
*p = *q;
p++;
q++;
q++;
}
return dest;
}
int main()
{
char dest[20];
printf( "%s\n", altcopy( dest, "demo" ) );
printf( "%s\n", altcopy( dest, "demo2" ) );
printf( "%s\n", altcopy( dest, "longer" ) );
printf( "%s\n", altcopy( dest, "even longer" ) );
printf( "%s\n", altcopy( dest, "and a really long string" ) );
}
And the results are:
dm
dm2lne
lne
ee ogradaral ogsrn
adaral ogsrn
I'm not sure why some of the outputs have the next call's result appended to them instead of just the string the question asks for. Any help here?
Your function is invalid, not least because it uses the magic number 20.
The function should behave like the standard function strcpy. That is, it has to copy from the source string until the terminating zero is encountered.
Here is a simple implementation:
#include <iostream>
char * altcopy( char *dest, const char *str )
{
char *p = dest;
while ( *p++ = *str++ ) // copy one character; the loop ends after the terminating zero has been copied
{
if ( *str ) ++str; // skip the alternate character, unless we are already at the terminator
}
return dest;
}
int main()
{
char dest[20];
std::cout << altcopy( dest, "demo" ) << std::endl;
std::cout << altcopy( dest, "demo2" ) << std::endl;
std::cout << altcopy( dest, "longer" ) << std::endl;
std::cout << altcopy( dest, "even longer" ) << std::endl;
std::cout << altcopy( dest, "and a really long string" ) << std::endl;
return 0;
}
The output is
dm
dm2
lne
ee ogr
adaral ogsrn
Enjoy!:)
Since in your loop:
for ( int n=0; n<=20; n++ )
{
*p = *q;
p++;
q++;
q++;
}
you are just looping 20 times regardless of the string length, so you read random memory from past the end of the string. In most cases you probably read a 0 right after the real characters and write that next, which terminates the string as far as printf is concerned, but sometimes, because of the way the string literals happen to be stored in memory, you pick up some characters from the next string.
It's because q++ is done twice, so it skips over the terminating null and into the next string.
In fact, the loop doesn't even check for the terminating null.
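For reference, here is a minimal sketch of how your own loop could be bounded by the source string's terminating null instead of the magic number 20 (same pointer-arithmetic style as your attempt, with the missing checks added; only tried against the strings above):
char* altcopy( char* dest, const char* str )
{
    char* p = dest;
    const char* q = str;
    while ( *q )        // stop at the terminating null of the source
    {
        *p = *q;        // copy the current character
        p++;
        q++;            // advance to the character that should be skipped
        if ( *q )       // only skip it if we are not already at the terminator
            q++;
    }
    *p = '\0';          // always terminate the destination
    return dest;
}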
I get strange behavior with a char pointer initialized from the return value of a function and printed with cout.
All my code is for an Arduino application, which is why I use char pointers, char arrays and string.h.
I created a class named FrameManager, with a function getDataFromFrame to extract data from a string (in fact a char array). See below:
char * FrameManager::getDataFromFrame ( const char frame[], char key[] )
{
char *pValue = nullptr;
int frameLength = strlen ( frame );
int previousStartIndex = 0;
for ( int i=0; i<frameLength; i++ ) {
char c = frame[i];
if ( c == ',' ) {
int buffSize = i-previousStartIndex+1;
char subbuff[buffSize];
memset ( subbuff, 0, buffSize ); //clear buffer
memcpy ( subbuff, &frame[previousStartIndex], i-previousStartIndex );
subbuff[buffSize]='\0';
previousStartIndex = i+1;
int buffLength = strlen ( subbuff );
const char *ptr = strchr ( subbuff, ':' );
if ( ptr ) {
int index = ptr-subbuff;
char buffKey[index+1];
memset ( buffKey, 0, index+1 );
memcpy ( buffKey, &subbuff[0], index );
buffKey[index+1]='\0';
char buffValue[buffLength-index];
memset ( buffValue, 0, buffLength-index );
memcpy ( buffValue, &subbuff[index+1], buffLength-index );
buffValue[buffLength-index]='\0';
if ( strcmp ( key,buffKey ) == 0 ) {
pValue = &buffValue[0];
break;
}
}
} else if ( i+1 == frameLength ) {
int buffSize = i-previousStartIndex+1;
char subbuff[buffSize];
memcpy ( subbuff, &frame[previousStartIndex], frameLength-1 );
subbuff[buffSize]='\0';
int buffLength = strlen ( subbuff );
const char *ptr = strchr ( subbuff, ':' );
if ( ptr ) {
int index = ptr-subbuff;
char buffKey[index+1];
memset ( buffKey, 0, index+1 );
memcpy ( buffKey, &subbuff[0], index );
buffKey[index+1]='\0';
char buffValue[buffLength-index];
memset ( buffValue, 0, buffLength-index );
memcpy ( buffValue, &subbuff[index+1], buffLength-index );
buffValue[buffLength-index]='\0';
if ( strcmp ( key,buffKey ) == 0 ) {
pValue = &buffValue[0];
break;
}
}
}
}
return pValue;
}
In the main(), I created just a little bit of code to test the returned value:
int main(int argc, char **argv) {
const char frame[] = "DEVICE:ARM,FUNC:MOVE_F,PARAM:12,SERVO_S:1";
FrameManager frameManager;
char key[] = "DEVICE";
char *value;
value = frameManager.getDataFromFrame(frame, &key[0]);
cout << "Retrieved value: " << value << endl;
cout << "Retrieved value: " << frameManager.getDataFromFrame(frame, &key[0]) << endl;
printf("%s",value);
return 0;
}
and here is the result:
Retrieved value: y%R
Retrieved value: ARM
ARM
The first "cout" doesn't display the expected value.
The second "cout" display the expected value and the printf too.
I don't understand what is the problem with the first "cout".
Thanks
Jocelyn
pValue points into local arrays, which go out of scope when the function returns. That's undefined behavior: it might work, but your program might also crash, return wrong values (that's what you experience), corrupt your data or do any other arbitrary thing.
Given that you're already using C++, consider using std::string as the result instead, or point into the original frame (if possible).
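For illustration, here is a minimal sketch of the std::string approach (this is not the original FrameManager API; the parsing is reduced to the key idea, namely that returning std::string by value copies the data out, so nothing dangles):
#include <iostream>
#include <string>

// Sketch only: parse "KEY:VALUE" pairs separated by ',' and return the value
// for the requested key by value, so the result owns its own storage.
std::string getDataFromFrame( const std::string& frame, const std::string& key )
{
    size_t start = 0;
    while ( start <= frame.size() ) {
        size_t comma = frame.find( ',', start );
        std::string token = frame.substr( start,
            comma == std::string::npos ? std::string::npos : comma - start );
        size_t colon = token.find( ':' );
        if ( colon != std::string::npos && token.compare( 0, colon, key ) == 0 )
            return token.substr( colon + 1 );   // a copy of the value, safe to use after return
        if ( comma == std::string::npos )
            break;
        start = comma + 1;
    }
    return std::string();                       // key not found
}

int main()
{
    const char frame[] = "DEVICE:ARM,FUNC:MOVE_F,PARAM:12,SERVO_S:1";
    std::cout << "Retrieved value: " << getDataFromFrame( frame, "DEVICE" ) << std::endl;
    return 0;
}
On a real Arduino board the toolchain usually does not ship std::string, but the Arduino String class can play the same role there.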
In Python, there is the errors='ignore' option for the built-in open function:
open( '/filepath.txt', 'r', encoding='UTF-8', errors='ignore' )
With this, reading a file with invalid UTF-8 characters will replace them with nothing, i.e., they are ignored. For example, a file with the characters Føö»BÃ¥r is going to be read as FøöBår.
If a line such as Føö»BÃ¥r is read with getline() from stdio.h, it will be read as Føö�Bår:
FILE* cfilestream = fopen( "/filepath.txt", "r" );
size_t linebuffersize = 131072;
char* readline = (char*) malloc( linebuffersize );
while( true )
{
if( getline( &readline, &linebuffersize, cfilestream ) != -1 ) {
std::cerr << "readline=" readline << std::endl;
}
else {
break;
}
}
How can I make stdio.h getline() read it as FøöBår instead of Føö�Bår, i.e., ignoring invalid UTF-8 characters?
One brute-force solution I can think of is to iterate through all the characters on each line read and build a new fixedreadline without any of these characters. For example:
FILE* cfilestream = fopen( "/filepath.txt", "r" );
size_t linebuffersize = 131072;
char* readline = (char*) malloc( linebuffersize );
char* fixedreadline = (char*) malloc( linebuffersize );
int index;
int charsread;
int invalidcharsoffset;
while( true )
{
if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
{
invalidcharsoffset = 0;
for( index = 0; index < charsread; ++index )
{
if( readline[index] != '�' ) {
fixedreadline[index-invalidcharsoffset] = readline[index];
}
else {
++invalidcharsoffset;
}
}
std::cerr << "fixedreadline=" << fixedreadline << std::endl;
}
else {
break;
}
}
Related questions:
Fixing invalid UTF8 characters
Replacing non UTF8 characters
python replace unicode characters
Python unicode: how to replace character that cannot be decoded using utf8 with whitespace?
You are confusing what you see with what is really going on. The getline function does not do any replacement of characters. [Note 1]
You are seeing a replacement character (U+FFFD) because your console outputs that character when it is asked to render an invalid UTF-8 code. Most consoles will do that if they are in UTF-8 mode; that is, the current locale is UTF-8.
Also, saying that a file contains the "characters Føö»BÃ¥r" is at best imprecise. A file does not really contain characters. It contains byte sequences which may be interpreted as characters -- for example, by a console or other user presentation software which renders them into glyphs -- according to some encoding. Different encodings produce different results; in this particular case, you have a file which was created by software using the Windows-1252 encoding (or, roughly equivalently, ISO 8859-15), and you are rendering it on a console using UTF-8.
What that means is that the data read by getline contains an invalid UTF-8 sequence, but it (probably) does not contain the replacement character code. Based on the character string you present, it contains the hex character \xbb, which is a guillemot (») in Windows code page 1252.
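If you want to convince yourself of that, a quick diagnostic sketch (assuming the POSIX getline and the /filepath.txt path from the question) is to hex-dump the bytes of the first line: the Windows-1252 guillemot shows up as the single byte bb, whereas an actual replacement character would be the three-byte UTF-8 sequence ef bf bd.
#include <cstdio>
#include <cstdlib>

// Diagnostic only: print each byte of one line read by getline() in hex,
// so you see the raw encoding instead of what the console renders.
int main()
{
    FILE* f = fopen( "/filepath.txt", "r" );
    if( f == NULL ) { perror( "fopen" ); return -1; }
    char* line = NULL;
    size_t capacity = 0;
    long length = getline( &line, &capacity, f );    // POSIX getline, as in the question
    for( long i = 0; i < length; ++i )
        printf( "%02x ", (unsigned char) line[i] );  // a '»' stored as Windows-1252 prints as a lone bb
    printf( "\n" );
    free( line );
    fclose( f );
    return 0;
}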
Finding all the invalid UTF-8 sequences in a string read by getline (or any other C library function which reads files) requires scanning the string, but not for a particular code sequence. Rather, you need to decode UTF-8 sequences one at a time, looking for the ones which are not valid. That's not a simple task, but the mbtowc function can help (if you have enabled a UTF-8 locale). As you'll see in the linked manpage, mbtowc returns the number of bytes contained in a valid "multibyte sequence" (which is UTF-8 in a UTF-8 locale), or -1 to indicate an invalid or incomplete sequence. In the scan, you should pass through the bytes in a valid sequence, or remove/ignore the single byte starting an invalid sequence, and then continue the scan until you reach the end of the string.
Here's some lightly-tested example code (in C):
#include <stdlib.h>
#include <string.h>
/* Removes in place any invalid UTF-8 sequences from at most 'len' characters of the
* string pointed to by 's'. (If a NUL byte is encountered, conversion stops.)
* If the length of the converted string is less than 'len', a NUL byte is
* inserted.
* Returns the length of the possibly modified string (with a maximum of 'len'),
* not including the NUL terminator (if any).
* Requires that a UTF-8 locale be active; since there is no way to test for
* this condition, no attempt is made to do so. If the current locale is not UTF-8,
* behaviour is undefined.
*/
size_t remove_bad_utf8(char* s, size_t len) {
char* in = s;
/* Skip over the initial correct sequence. Avoid relying on mbtowc returning
* zero if n is 0, since Posix is not clear whether mbtowc returns 0 or -1.
*/
int seqlen;
while (len && (seqlen = mbtowc(NULL, in, len)) > 0) { len -= seqlen; in += seqlen; }
char* out = in;
if (len && seqlen < 0) {
++in;
--len;
/* If we find an invalid sequence, we need to start shifting correct sequences. */
for (; len; in += seqlen, len -= seqlen) {
seqlen = mbtowc(NULL, in, len);
if (seqlen > 0) {
/* Shift the valid sequence (if one was found) */
memmove(out, in, seqlen);
out += seqlen;
}
else if (seqlen < 0) seqlen = 1;
else /* (seqlen == 0) */ break;
}
*out++ = 0;
}
return out - s;
}
Notes
Aside from the possible line-end transformation of the underlying I/O library, which will replace CR-LF with a single \n on systems like Windows where the two character CR-LF sequence is used as a line-end indication.
As @rici well explains in his answer, there can be several invalid UTF-8 sequences in a byte sequence.
Possibly iconv(3) could be worth a look, e.g. see https://linux.die.net/man/3/iconv_open.
When the string "//IGNORE" is appended to tocode, characters that cannot be represented in the target character set will be silently discarded.
Example
This byte sequence, if interpreted as UTF-8, contains some invalid UTF-8:
"some invalid\xFE\xFE\xFF\xFF stuff"
If you display this you would see something like
some invalid���� stuff
When this string passes through the remove_invalid_utf8 function in the following C program, the invalid UTF-8 bytes are removed using the iconv function mentioned above.
So the result is then:
some invalid stuff
C Program
#include <stdio.h>
#include <iconv.h>
#include <string.h>
#include <stdlib.h>
#include <stdbool.h>
#include <errno.h>
char *remove_invalid_utf8(char *utf8, size_t len) {
size_t inbytes_len = len;
char *inbuf = utf8;
size_t outbytes_len = len;
char *result = calloc(outbytes_len + 1, sizeof(char));
char *outbuf = result;
iconv_t cd = iconv_open("UTF-8//IGNORE", "UTF-8");
if(cd == (iconv_t)-1) {
perror("iconv_open");
}
if(iconv(cd, &inbuf, &inbytes_len, &outbuf, &outbytes_len)) {
perror("iconv");
}
iconv_close(cd);
return result;
}
int main() {
char *utf8 = "some invalid\xFE\xFE\xFF\xFF stuff";
char *converted = remove_invalid_utf8(utf8, strlen(utf8));
printf("converted: %s to %s\n", utf8, converted);
free(converted);
return 0;
}
I also managed to fix it by trimming/cutting out all non-ASCII characters.
This one takes about 2.6 seconds to parse 319MB:
#include <stdio.h>
#include <stdlib.h>
#include <clocale>
#include <iostream>
int main(int argc, char const *argv[])
{
FILE* cfilestream = fopen( "./test.txt", "r" );
size_t linebuffersize = 131072;
if( cfilestream == NULL ) {
perror( "fopen cfilestream" );
return -1;
}
char* readline = (char*) malloc( linebuffersize );
char* fixedreadline = (char*) malloc( linebuffersize );
if( readline == NULL ) {
perror( "malloc readline" );
return -1;
}
if( fixedreadline == NULL ) {
perror( "malloc fixedreadline" );
return -1;
}
char* source;
if( ( source = std::setlocale( LC_ALL, "en_US.utf8" ) ) == NULL ) {
perror( "setlocale" );
}
else {
std::cerr << "locale='" << source << "'" << std::endl;
}
int index;
int charsread;
int invalidcharsoffset;
unsigned int fixedchar;
while( true )
{
if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
{
invalidcharsoffset = 0;
for( index = 0; index < charsread; ++index )
{
fixedchar = static_cast<unsigned int>( readline[index] );
// std::cerr << "index " << std::setw(3) << index
// << " readline " << std::setw(10) << fixedchar
// << " -> '" << readline[index] << "'" << std::endl;
if( 31 < fixedchar && fixedchar < 128 ) {
fixedreadline[index-invalidcharsoffset] = readline[index];
}
else {
++invalidcharsoffset;
}
}
fixedreadline[index-invalidcharsoffset] = '\0';
// std::cerr << "fixedreadline=" << fixedreadline << std::endl;
}
else {
break;
}
}
std::cerr << "fixedreadline=" << fixedreadline << std::endl;
free( readline );
free( fixedreadline );
fclose( cfilestream );
return 0;
}
Alternative and slower version using memcpy
Using memcpy does not improve the speed much, so you could use either one.
This one takes about 3.1 seconds to parse 319MB:
#include <stdio.h>
#include <stdlib.h>
#include <clocale>
#include <iostream>
#include <cstring>
#include <iomanip>
int main(int argc, char const *argv[])
{
FILE* cfilestream = fopen( "./test.txt", "r" );
size_t linebuffersize = 131072;
if( cfilestream == NULL ) {
perror( "fopen cfilestream" );
return -1;
}
char* readline = (char*) malloc( linebuffersize );
char* fixedreadline = (char*) malloc( linebuffersize );
if( readline == NULL ) {
perror( "malloc readline" );
return -1;
}
if( fixedreadline == NULL ) {
perror( "malloc fixedreadline" );
return -1;
}
char* source;
char* destination;
char* finalresult;
int index;
int lastcopy;
int charsread;
int charstocopy;
int invalidcharsoffset;
bool hasignoredbytes;
unsigned int fixedchar;
if( ( source = std::setlocale( LC_ALL, "en_US.utf8" ) ) == NULL ) {
perror( "setlocale" );
}
else {
std::cerr << "locale='" << source << "'" << std::endl;
}
while( true )
{
if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
{
hasignoredbytes = false;
source = readline;
destination = fixedreadline;
lastcopy = 0;
invalidcharsoffset = 0;
for( index = 0; index < charsread; ++index )
{
fixedchar = static_cast<unsigned int>( readline[index] );
// std::cerr << "fixedchar " << std::setw(10)
// << fixedchar << " -> '"
// << readline[index] << "'" << std::endl;
if( 31 < fixedchar && fixedchar < 128 ) {
if( hasignoredbytes ) {
charstocopy = index - lastcopy - invalidcharsoffset;
memcpy( destination, source, charstocopy );
source += index - lastcopy;
lastcopy = index;
destination += charstocopy;
invalidcharsoffset = 0;
hasignoredbytes = false;
}
}
else {
++invalidcharsoffset;
hasignoredbytes = true;
}
}
if( destination != fixedreadline ) {
charstocopy = charsread - static_cast<int>( source - readline )
- invalidcharsoffset;
memcpy( destination, source, charstocopy );
destination += charstocopy - 1;
if( *destination == '\n' ) {
*destination = '\0';
}
else {
*++destination = '\0';
}
finalresult = fixedreadline;
}
else {
finalresult = readline;
}
// std::cerr << "finalresult=" << finalresult << std::endl;
}
else {
break;
}
}
std::cerr << "finalresult=" << finalresult << std::endl;
free( readline );
free( fixedreadline );
fclose( cfilestream );
return 0;
}
Optimized solution using iconv
This takes about 4.6 seconds to parse 319MB of text.
#include <iconv.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <clocale>
#include <iostream>
// Compile it with:
// g++ -o main test.cpp -O3 -liconv
int main(int argc, char const *argv[])
{
FILE* cfilestream = fopen( "./test.txt", "r" );
size_t linebuffersize = 131072;
if( cfilestream == NULL ) {
perror( "fopen cfilestream" );
return -1;
}
char* readline = (char*) malloc( linebuffersize );
char* fixedreadline = (char*) malloc( linebuffersize );
if( readline == NULL ) {
perror( "malloc readline" );
return -1;
}
if( fixedreadline == NULL ) {
perror( "malloc fixedreadline" );
return -1;
}
char* source;
char* destination;
int charsread;
size_t inchars;
size_t outchars;
if( ( source = std::setlocale( LC_ALL, "en_US.utf8" ) ) == NULL ) {
perror( "setlocale" );
}
else {
std::cerr << "locale='" << source << "'" << std::endl;
}
iconv_t conversiondescriptor = iconv_open("UTF-8//IGNORE", "UTF-8");
if( conversiondescriptor == (iconv_t)-1 ) {
perror( "iconv_open conversiondescriptor" );
}
while( true )
{
if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
{
source = readline;
inchars = charsread;
destination = fixedreadline;
outchars = charsread;
if( iconv( conversiondescriptor, &source, &inchars, &destination, &outchars ) )
{
perror( "iconv" );
}
// Trim out the new line character
if( *--destination == '\n' ) {
*--destination = '\0';
}
else {
*destination = '\0';
}
// std::cerr << "fixedreadline='" << fixedreadline << "'" << std::endl;
}
else {
break;
}
}
std::cerr << "fixedreadline='" << fixedreadline << "'" << std::endl;
free( readline );
free( fixedreadline );
if( fclose( cfilestream ) ) {
perror( "fclose cfilestream" );
}
if( iconv_close( conversiondescriptor ) ) {
perror( "iconv_close conversiondescriptor" );
}
return 0;
}
Slowest solution ever using mbtowc
This takes about 24.2 seconds to parse 319MB of text.
If you comment out the line fixedchar = mbtowc(NULL, source, charsread); and uncomment the line fixedchar = 1; (breaking the invalid-character removal), this takes 1.9 seconds instead of 24.2 seconds (also compiled with the -O3 optimization level).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <clocale>
#include <iostream>
#include <cstring>
#include <iomanip>
int main(int argc, char const *argv[])
{
FILE* cfilestream = fopen( "./test.txt", "r" );
size_t linebuffersize = 131072;
if( cfilestream == NULL ) {
perror( "fopen cfilestream" );
return -1;
}
char* readline = (char*) malloc( linebuffersize );
if( readline == NULL ) {
perror( "malloc readline" );
return -1;
}
char* source;
char* lineend;
char* destination;
int charsread;
int fixedchar;
if( ( source = std::setlocale( LC_ALL, "en_US.utf8" ) ) == NULL ) {
perror( "setlocale" );
}
else {
std::cerr << "locale='" << source << "'" << std::endl;
}
while( true )
{
if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
{
lineend = readline + charsread;
destination = readline;
for( source = readline; source != lineend; )
{
// fixedchar = 1;
fixedchar = mbtowc(NULL, source, charsread);
charsread -= fixedchar;
// std::ostringstream contents;
// for( int index = 0; index < fixedchar; ++index )
// contents << source[index];
// std::cerr << "fixedchar=" << std::setw(10)
// << fixedchar << " -> '"
// << contents.str().c_str() << "'" << std::endl;
if( fixedchar > 0 ) {
memmove( destination, source, fixedchar );
source += fixedchar;
destination += fixedchar;
}
else if( fixedchar < 0 ) {
source += 1;
// std::cerr << "errno=" << strerror( errno ) << std::endl;
}
else {
break;
}
}
// Trim out the new line character
if( *--destination == '\n' ) {
*--destination = '\0';
}
else {
*destination = '\0';
}
// std::cerr << "readline='" << readline << "'" << std::endl;
}
else {
break;
}
}
std::cerr << "readline='" << readline << "'" << std::endl;
if( fclose( cfilestream ) ) {
perror( "fclose cfilestream" );
}
free( readline );
return 0;
}
Fastest version from all my others above using memmove
You cannot use memcpy here because the memory regions overlap!
This takes about 2.4 seconds to parse 319MB.
If you comment out the lines *destination = *source and memmove( destination, source, 1 ) (breaking the invalid-character removal), the performance is still almost the same as when memmove is being called. Here, calling memmove( destination, source, 1 ) is a little slower than directly doing *destination = *source;
#include <stdio.h>
#include <stdlib.h>
#include <clocale>
#include <iostream>
#include <cstring>
#include <iomanip>
int main(int argc, char const *argv[])
{
FILE* cfilestream = fopen( "./test.txt", "r" );
size_t linebuffersize = 131072;
if( cfilestream == NULL ) {
perror( "fopen cfilestream" );
return -1;
}
char* readline = (char*) malloc( linebuffersize );
if( readline == NULL ) {
perror( "malloc readline" );
return -1;
}
char* source;
char* lineend;
char* destination;
int charsread;
unsigned int fixedchar;
if( ( source = std::setlocale( LC_ALL, "en_US.utf8" ) ) == NULL ) {
perror( "setlocale" );
}
else {
std::cerr << "locale='" << source << "'" << std::endl;
}
while( true )
{
if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
{
lineend = readline + charsread;
destination = readline;
for( source = readline; source != lineend; ++source )
{
fixedchar = static_cast<unsigned int>( *source );
// std::cerr << "fixedchar=" << std::setw(10)
// << fixedchar << " -> '" << *source << "'" << std::endl;
if( 31 < fixedchar && fixedchar < 128 ) {
*destination = *source;
++destination;
}
}
// Trim out the new line character
if( *source == '\n' ) {
*--destination = '\0';
}
else {
*destination = '\0';
}
// std::cerr << "readline='" << readline << "'" << std::endl;
}
else {
break;
}
}
std::cerr << "readline='" << readline << "'" << std::endl;
if( fclose( cfilestream ) ) {
perror( "fclose cfilestream" );
}
free( readline );
return 0;
}
Bonus
You can also use Python C Extensions (API).
It takes about 2.3 seconds to parse 319MB without converting the lines to a cached UTF-8 char*.
It takes about 3.2 seconds to parse 319MB converting them to a cached UTF-8 char*.
And it also takes about 3.2 seconds to parse 319MB converting them to a cached ASCII char*.
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <iostream>
typedef struct
{
PyObject_HEAD
}
PyFastFile;
static PyModuleDef fastfilepackagemodule =
{
// https://docs.python.org/3/c-api/module.html#c.PyModuleDef
PyModuleDef_HEAD_INIT,
"fastfilepackage", /* name of module */
"Example module that wrapped a C++ object", /* module documentation, may be NULL */
-1, /* size of per-interpreter state of the module, or
-1 if the module keeps state in global variables. */
NULL, /* PyMethodDef* m_methods */
NULL, /* inquiry m_reload */
NULL, /* traverseproc m_traverse */
NULL, /* inquiry m_clear */
NULL, /* freefunc m_free */
};
// initialize PyFastFile Object
static int PyFastFile_init(PyFastFile* self, PyObject* args, PyObject* kwargs) {
char* filepath;
if( !PyArg_ParseTuple( args, "s", &filepath ) ) {
return -1;
}
int linecount = 0;
PyObject* iomodule;
PyObject* openfile;
PyObject* fileiterator;
iomodule = PyImport_ImportModule( "builtins" );
if( iomodule == NULL ) {
std::cerr << "ERROR: FastFile failed to import the io module '"
"(and open the file " << filepath << "')!" << std::endl;
PyErr_PrintEx(100);
return -1;
}
PyObject* openfunction = PyObject_GetAttrString( iomodule, "open" );
if( openfunction == NULL ) {
std::cerr << "ERROR: FastFile failed get the io module open "
<< "function (and open the file '" << filepath << "')!" << std::endl;
PyErr_PrintEx(100);
return -1;
}
openfile = PyObject_CallFunction(
openfunction, "ssiss", filepath, "r", -1, "ASCII", "ignore" );
if( openfile == NULL ) {
std::cerr << "ERROR: FastFile failed to open the file'"
<< filepath << "'!" << std::endl;
PyErr_PrintEx(100);
return -1;
}
PyObject* iterfunction = PyObject_GetAttrString( openfile, "__iter__" );
Py_DECREF( openfunction );
if( iterfunction == NULL ) {
std::cerr << "ERROR: FastFile failed get the io module iterator"
<< "function (and open the file '" << filepath << "')!" << std::endl;
PyErr_PrintEx(100);
return -1;
}
PyObject* openiteratorobject = PyObject_CallObject( iterfunction, NULL );
Py_DECREF( iterfunction );
if( openiteratorobject == NULL ) {
std::cerr << "ERROR: FastFile failed get the io module iterator object"
<< " (and open the file '" << filepath << "')!" << std::endl;
PyErr_PrintEx(100);
return -1;
}
fileiterator = PyObject_GetAttrString( openfile, "__next__" );
Py_DECREF( openiteratorobject );
if( fileiterator == NULL ) {
std::cerr << "ERROR: FastFile failed get the io module iterator "
<< "object (and open the file '" << filepath << "')!" << std::endl;
PyErr_PrintEx(100);
return -1;
}
PyObject* readline;
while( ( readline = PyObject_CallObject( fileiterator, NULL ) ) != NULL ) {
linecount += 1;
PyUnicode_AsUTF8( readline );
Py_DECREF( readline );
// std::cerr << "linecount " << linecount << " readline '" << readline
// << "' '" << PyUnicode_AsUTF8( readline ) << "'" << std::endl;
}
std::cerr << "linecount " << linecount << std::endl;
// PyErr_PrintEx(100);
PyErr_Clear();
PyObject* closefunction = PyObject_GetAttrString( openfile, "close" );
if( closefunction == NULL ) {
std::cerr << "ERROR: FastFile failed get the close file function for '"
<< filepath << "')!" << std::endl;
PyErr_PrintEx(100);
return -1;
}
PyObject* closefileresult = PyObject_CallObject( closefunction, NULL );
Py_DECREF( closefunction );
if( closefileresult == NULL ) {
std::cerr << "ERROR: FastFile failed close open file '"
<< filepath << "')!" << std::endl;
PyErr_PrintEx(100);
return -1;
}
Py_DECREF( closefileresult );
Py_XDECREF( iomodule );
Py_XDECREF( openfile );
Py_XDECREF( fileiterator );
return 0;
}
// destruct the object
static void PyFastFile_dealloc(PyFastFile* self) {
Py_TYPE(self)->tp_free( (PyObject*) self );
}
static PyTypeObject PyFastFileType =
{
PyVarObject_HEAD_INIT( NULL, 0 )
"fastfilepackage.FastFile" /* tp_name */
};
// create the module
PyMODINIT_FUNC PyInit_fastfilepackage(void)
{
PyObject* thismodule;
// https://docs.python.org/3/c-api/typeobj.html
PyFastFileType.tp_new = PyType_GenericNew;
PyFastFileType.tp_basicsize = sizeof(PyFastFile);
PyFastFileType.tp_dealloc = (destructor) PyFastFile_dealloc;
PyFastFileType.tp_flags = Py_TPFLAGS_DEFAULT;
PyFastFileType.tp_doc = "FastFile objects";
PyFastFileType.tp_init = (initproc) PyFastFile_init;
if( PyType_Ready( &PyFastFileType) < 0 ) {
return NULL;
}
thismodule = PyModule_Create(&fastfilepackagemodule);
if( thismodule == NULL ) {
return NULL;
}
// Add FastFile class to thismodule allowing the use to create objects
Py_INCREF( &PyFastFileType );
PyModule_AddObject( thismodule, "FastFile", (PyObject*) &PyFastFileType );
return thismodule;
}
To build it, create the file source/fastfilewrapper.cpp with the contents of the above file, and a setup.py with the following contents:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
from setuptools import setup, Extension
myextension = Extension(
language = "c++",
extra_link_args = ["-std=c++11"],
extra_compile_args = ["-std=c++11"],
name = 'fastfilepackage',
sources = [
'source/fastfilewrapper.cpp'
],
include_dirs = [ 'source' ],
)
setup(
name = 'fastfilepackage',
ext_modules= [ myextension ],
)
To run an example, use the following Python script:
import time
import datetime
import fastfilepackage
testfile = './test.txt'
timenow = time.time()
iterable = fastfilepackage.FastFile( testfile )
fastfile_time = time.time() - timenow
timedifference = datetime.timedelta( seconds=fastfile_time )
print( 'FastFile timedifference', timedifference, flush=True )
Example:
user@user-pc$ /usr/bin/pip3.6 install .
Processing /fastfilepackage
Building wheels for collected packages: fastfilepackage
Building wheel for fastfilepackage (setup.py) ... done
Stored in directory: /pip-ephem-wheel-cache-j313cpzc/wheels/e5/5f/bc/52c820
Successfully built fastfilepackage
Installing collected packages: fastfilepackage
Found existing installation: fastfilepackage 0.0.0
Uninstalling fastfilepackage-0.0.0:
Successfully uninstalled fastfilepackage-0.0.0
Successfully installed fastfilepackage-0.0.0
user@user-pc$ /usr/bin/python3.6 fastfileperformance.py
linecount 820800
FastFile timedifference 0:00:03.204614
Using std::getline
This takes about 4.7 seconds to parse 319MB.
If you remove the UTF-8 removal algorithm borrowed from the fastest benchmark using stdlib.h getline(), it takes 1.7 seconds to run.
#include <stdlib.h>
#include <locale.h>
#include <iostream>
#include <locale>
#include <fstream>
#include <iomanip>
int main(int argc, char const *argv[])
{
unsigned int fixedchar;
int linecount = -1;
char* source;
char* lineend;
char* destination;
if( ( source = setlocale( LC_ALL, "en_US.ascii" ) ) == NULL ) {
perror( "setlocale" );
return -1;
}
else {
std::cerr << "locale='" << source << "'" << std::endl;
}
std::ifstream fileifstream{ "./test.txt" };
if( fileifstream.fail() ) {
std::cerr << "ERROR: FastFile failed to open the file!" << std::endl;
return -1;
}
size_t linebuffersize = 131072;
char* readline = (char*) malloc( linebuffersize );
if( readline == NULL ) {
perror( "malloc readline" );
return -1;
}
while( true )
{
if( !fileifstream.eof() )
{
linecount += 1;
fileifstream.getline( readline, linebuffersize );
lineend = readline + fileifstream.gcount();
destination = readline;
for( source = readline; source != lineend; ++source )
{
fixedchar = static_cast<unsigned int>( *source );
// std::cerr << "fixedchar=" << std::setw(10)
// << fixedchar << " -> '" << *source << "'" << std::endl;
if( 31 < fixedchar && fixedchar < 128 ) {
*destination = *source;
++destination;
}
}
// Trim out the new line character
if( *source == '\n' ) {
*--destination = '\0';
}
else {
*destination = '\0';
}
// std::cerr << "readline='" << readline << "'" << std::endl;
}
else {
break;
}
}
std::cerr << "linecount='" << linecount << "'" << std::endl;
if( fileifstream.is_open() ) {
fileifstream.close();
}
free( readline );
return 0;
}
Summary
2.6 seconds trimming UTF-8 using two buffers with indexing
3.1 seconds trimming UTF-8 using two buffers with memcpy
4.6 seconds removing invalid UTF-8 with iconv
24.2 seconds removing invalid UTF-8 with mbtowc
2.4 seconds trimming UTF-8 using one buffer with pointer direct assigning
Bonus
2.3 seconds removing invalid UTF-8 without converting them to a cached UTF-8 char*
3.2 seconds removing invalid UTF-8 converting them to a cached UTF-8 char*
3.2 seconds trimming UTF-8 and caching as ASCII char*
4.7 seconds trimming UTF-8 with std::getline() using one buffer with pointer direct assigning
The file ./test.txt used had 820,800 lines, where each line was equal to:
id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char\r\n
And all versions were compiled with
g++ (GCC) 7.4.0
iconv (GNU libiconv 1.14)
g++ -o main test.cpp -O3 -liconv && time ./main
Okay, so I know that disk writing errors are very rare, so please just look past that, because the data I am working with is incredibly important (like SSIDs kind of important). I want to copy a file in the most robust way possible while using the absolute minimal amount of memory to do so. So far, this is as far as I have got. It sucks up a lot of memory, but I can't find the source of it. The way it works is by rechecking tons of times until it gets a confirmed result (that may increase the number of false positives for errors by a lot, but it might reduce the chance of an actual error by a big margin). Also, the sleep at the bottom is there so you have time to analyze the program's overall performance using the Windows task manager.
#include <cstdio> // fopen, fclose, fread, fwrite, BUFSIZ
#include <cstdlib>
#include <unistd.h>
#include <iostream>
using namespace std;
__inline__ bool copy_file(const char* From, const char* To)
{
FILE infile = (*fopen(From, "rb"));
FILE outfile = (*fopen(To, "rwb+"));
setvbuf( &infile, nullptr, _IONBF, 0);
setvbuf( &outfile, nullptr, _IONBF, 0);
fseek(&infile,0,SEEK_END);
long int size = ftell(&infile);
fseek(&infile,0,SEEK_SET);
unsigned short error_amount;
bool success;
char c;
char w;
char l;
for ( fpos_t i=0; (i != size); ++i ) {
error_amount=0;
fsetpos( &infile, &i );
c = fgetc(&infile);
fsetpos( &infile, &i );
success=true;
for ( l=0; (l != 126); ++l ) {
fsetpos( &infile, &i );
success = ( success == ( fgetc(&infile)==c ) );
}
while (success==false) {
fsetpos( &infile, &i );
if (error_amount==32767) {
cerr << "There were 32768 failed attemps at accessing a part of the file! exiting the program...";
return false;
}
++error_amount;
//cout << "an error has occured at position ";
//printf("%d in the file.\n", (int)i);
c = fgetc(&infile);
fsetpos( &infile, &i );
success=true;
for ( l=0; (l != 126); ++l ) {
fsetpos( &infile, &i );
success = ( success == ( fgetc(&infile)==c ) );
}
}
fsetpos( &infile, &i );
fputc( c, &outfile);
fsetpos( &outfile, &i );
error_amount=0;
w = fgetc(&infile);
fsetpos( &outfile, &i );
success=true;
for ( l=0; (l != 126); ++l ) {
fsetpos( &outfile, &i );
success = ( success == ( fgetc(&outfile)==w ) );
}
while (success==false) {
fsetpos( &outfile, &i );
fputc( c, &outfile);
if (error_amount==32767) {
cerr << "There were 32768 failed attemps at writing to a part of the file! exiting the program...";
return false;
}
++error_amount;
w = fgetc(&infile);
fsetpos( &infile, &i );
success=true;
for ( l=0; (l != 126); ++l ) {
fsetpos( &outfile, &i );
success = ( success == ( fgetc(&outfile)==w ) );
}
}
fsetpos( &infile, &i );
}
fclose(&infile);
fclose(&outfile);
return true;
}
int main( void )
{
int CopyResult = copy_file("C:\\Users\\Admin\\Desktop\\example file.txt","C:\\Users\\Admin\\Desktop\\example copy.txt");
std::cout << "Could it copy the file? " << CopyResult << '\n';
sleep(65535);
return 1;
}
So, if my code is on the right track to the best way, what can be done to improve it? But if my code is totally off from the best solution, then what is the best solution? Please note that this question is essentially about detecting rare disk-writing errors for the purpose of copying very, very, very (etc.) important data.
I would just copy the file without any special checks, and at the end I would read the copy back and compare its hash value to the expected one. For a hash function, I would use MD5 or SHA-1.
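For illustration, here is a rough sketch of that read-back-and-verify idea. It compares the copy byte for byte instead of hashing it (so it needs no extra hashing library); the paths are the ones from your question:
#include <fstream>
#include <iostream>
#include <iterator>

// Compare two files byte for byte; a hash comparison would work the same
// way, just with less data held at once.
static bool files_identical( const char* a, const char* b )
{
    std::ifstream fa( a, std::ios::binary ), fb( b, std::ios::binary );
    if( !fa || !fb )
        return false;
    std::istreambuf_iterator<char> ia( fa ), ib( fb ), end;
    while( ia != end && ib != end ) {
        if( *ia != *ib )
            return false;
        ++ia;
        ++ib;
    }
    return ia == end && ib == end;   // same length and same contents
}

int main()
{
    const char* from = "C:\\Users\\Admin\\Desktop\\example file.txt";
    const char* to   = "C:\\Users\\Admin\\Desktop\\example copy.txt";
    {
        std::ifstream in( from, std::ios::binary );
        std::ofstream out( to, std::ios::binary );
        if( !in || !out ) { std::cerr << "could not open the files\n"; return 1; }
        out << in.rdbuf();           // let the standard library do the actual copy
        if( !out ) { std::cerr << "copy failed\n"; return 1; }   // note: an empty source also trips this check
    }
    std::cout << ( files_identical( from, to ) ? "verified\n" : "MISMATCH\n" );
    return 0;
}
Keep in mind that a re-read immediately after writing is usually served from the OS cache rather than the physical disk, so this (like a hash check) mainly guards against logical errors, not media faults.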
#include <boost/filesystem.hpp>
#include <iostream>
int main()
{
try
{
boost::filesystem::copy_file( "C:\\Users\\Admin\\Desktop\\example file.txt",
"C:\\Users\\Admin\\Desktop\\example copy.txt" );
}
catch ( boost::filesystem::filesystem_error const & ex )
{
std::cerr << "Copy failed: " << ex.what();
}
}
This will call the arguably most robust implementation available -- the one provided by the operating system -- and report any failure.
My point being:
The chance of having your saved data end up corrupted is astronomically small to begin with.
Any application where this might actually be an issue should be running on redundant storage, a.k.a. RAID arrays, filesystems doing checksums (like Btrfs or ZFS), etc., again reducing the chance of failure significantly.
Doing complex things in home-grown I/O functions, on the other hand, increases the probability of mistakes and / or false negatives immensely.
I am going through and comparing a bunch of DNA sequences to find out whether any of them is a subset of another. I remove those that are subsets of another.
I'm using a linked list and I keep getting a segmentation fault somewhere around the output of the data back to the output file.
I'd also greatly appreciate feedback on the overall code structure. I know it's rather messy, so I figured someone could point out some things that should be improved.
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
using namespace std;
/*
* Step 1. Load all sequences and their metadata into structures.
*
* Step 2. Start n^2 operation to compare sequences.
*
* Step 3. Output file back to a different fasta file.
*/
typedef struct sequence_structure sequence_structure;
struct sequence_structure
{
char *sequence;
char *id;
char *header;
sequence_structure *next_sequence_structure;
sequence_structure *previous_sequence_structure;
int length;
};
int main(int argc, char *argv[])
{
FILE *input_file;
ofstream output_file;
/* this is the TAIL of the linked list. This is a reversed linked list. */
sequence_structure *sequences;
int first_sequence = 0;
char *line = (char*) malloc( sizeof( char ) * 1024 );
if( argc != 3 )
{
printf("This program requires a input file and output file as its argument!\n");
return 0;
}
else
{
/* let's read the input file. */
input_file = fopen( argv[1], "r" );
}
while( !feof(input_file) )
{
string string_line;
fgets( line, 2048, input_file );
string_line = line;
if( string_line.length() <= 2 )
break;
if( string_line.at( 0 ) == '>' )
{
sequence_structure *new_sequence = (sequence_structure *) malloc( sizeof( sequence_structure ) );
new_sequence->id = (char *) malloc( sizeof( char ) * ( 14 + 1 ) );
string_line.copy( new_sequence->id, 14, 1 );
(new_sequence->id)[14] = '\0';
stringstream ss ( string_line.substr( 23, 4 ) );
ss >> new_sequence->length;
new_sequence->header = (char *) malloc( sizeof(char) * ( string_line.length() + 1 ) );
string_line.copy( new_sequence->header, string_line.length(), 0 );
(new_sequence->header)[string_line.length()] = '\0';
fgets( line, 2048, input_file );
string_line = line;
new_sequence->sequence = (char *) malloc( sizeof(char) * ( string_line.length() + 1 ) );
string_line.copy( new_sequence->sequence, string_line.length(), 0 );
(new_sequence->sequence)[string_line.length()] = '\0';
if( first_sequence == 0 )
{
sequences = new_sequence;
sequences->previous_sequence_structure = NULL;
first_sequence = 1;
}
else
{
sequences->next_sequence_structure = new_sequence;
new_sequence->previous_sequence_structure = sequences;
sequences = new_sequence;
}
}
else
{
cout << "Error: input file reading error." << endl;
}
}
fclose( input_file );
free( line );
sequence_structure *outer_sequence_node = sequences;
while( outer_sequence_node != NULL )
{
sequence_structure *inner_sequence_node = sequences;
string outer_sequence ( outer_sequence_node->sequence );
while( inner_sequence_node != NULL )
{
string inner_sequence ( inner_sequence_node->sequence );
if( outer_sequence_node->length > inner_sequence_node->length )
{
if( outer_sequence.find( inner_sequence ) != std::string::npos )
{
cout << "Deleting the sequence with id: " << inner_sequence_node->id << endl;
cout << inner_sequence_node->sequence << endl;
cout << "Found within the sequence with id: " << outer_sequence_node->id << endl;
cout << outer_sequence_node->sequence << endl;
sequence_structure *previous_sequence = inner_sequence_node->previous_sequence_structure;
sequence_structure *next_sequence = inner_sequence_node->next_sequence_structure;
free( inner_sequence_node->id );
free( inner_sequence_node->sequence );
free( inner_sequence_node->header );
if( next_sequence != NULL )
next_sequence->previous_sequence_structure = previous_sequence;
if( previous_sequence != NULL )
{
inner_sequence_node = previous_sequence;
free( previous_sequence->next_sequence_structure );
previous_sequence->next_sequence_structure = next_sequence;
}
}
}
inner_sequence_node = inner_sequence_node->previous_sequence_structure;
}
outer_sequence_node = outer_sequence_node->previous_sequence_structure;
}
output_file.open( argv[2], ios::out );
while( sequences->previous_sequence_structure != NULL )
{
sequences = sequences->previous_sequence_structure;
}
sequence_structure *current_sequence = sequences;
while( current_sequence->next_sequence_structure != NULL )
{
output_file << current_sequence->header;
output_file << current_sequence->sequence;
current_sequence = current_sequence->next_sequence_structure;
}
output_file << current_sequence->header;
output_file << current_sequence->sequence;
output_file.close();
while( sequences != NULL )
{
cout << "Freeing sequence with this id: " << sequences->id << endl;
free( sequences->id );
free( sequences->header );
free( sequences->sequence );
if( sequences->next_sequence_structure != NULL )
{
sequences = sequences->next_sequence_structure;
free( sequences->previous_sequence_structure );
}
else
{
sequences = NULL;
}
}
return 0;
}
I am implementing a function that acts like getline( .. ). My initial approach is:
#include <cstdio>
#include <cstdlib>
#include <cstring>
void getstr( char*& str, unsigned len ) {
char c;
size_t i = 0;
while( true ) {
c = getchar(); // get a character from keyboard
if( '\n' == c || EOF == c ) { // if encountering 'enter' or 'eof'
*( str + i ) = '\0'; // put the null terminate
break; // end while
}
*( str + i ) = c;
if( i == len - 1 ) { // buffer full
len = len + len; // double the len
str = ( char* )realloc( str, len ); // reallocate memory
}
++i;
}
}
int main() {
const unsigned DEFAULT_SIZE = 4;
char* str = ( char* )malloc( DEFAULT_SIZE * sizeof( char ) );
getstr( str, DEFAULT_SIZE );
printf( str );
free( str );
return 0;
}
Then, I think I should switch to pure C instead of using half C/C++. So I change char*& to char**:
Pointer-to-pointer version (crashed)
#include <cstdio>
#include <cstdlib>
#include <cstring>
void getstr( char** str, unsigned len ) {
char c;
size_t i = 0;
while( true ) {
c = getchar(); // get a character from keyboard
if( '\n' == c || EOF == c ) { // if encountering 'enter' or 'eof'
*( *str + i ) = '\0'; // put the null terminate
break; // done input end while
}
*( *str + i ) = c;
if( i == len - 1 ) { // buffer full
len = len + len; // double the len
*str = ( char* )realloc( str, len ); // reallocate memory
}
++i;
}
}
int main() {
const unsigned DEFAULT_SIZE = 4;
char* str = ( char* )malloc( DEFAULT_SIZE * sizeof( char ) );
getstr( &str, DEFAULT_SIZE );
printf( str );
free( str );
return 0;
}
But this version crashed (access violation). I tried running the debugger, but I could not find where it crashed. I'm running Visual Studio 2010, so could you guys show me how to fix it?
Another weird thing I've encountered is that if I leave the "&" out, it only works with Visual Studio, but not with g++. That is:
void getstr( char* str, unsigned len )
From my understanding, whenever we use a pointer to allocate or deallocate a block of memory, we actually modify where that pointer is pointing. So I think we have to use either ** or *& to modify the pointer. However, because it runs correctly in Visual Studio, is it just luck, or should it be OK either way?
Then, I think I should switch to pure C instead of using half C/C++.
I suggest the other direction. Go full-blown C++.
Your pointer crash is probably in the realloc
*str = ( char* )realloc( str, len )
Should be
*str = ( char* )realloc( *str, len )
As Steve points out, your code leaks the original if realloc fails, so maybe change it to something like:
char* tmp = (char*) realloc(*str, len);
if (tmp) {
*str = tmp;
} else {
// realloc failed.. sigh
}
Well, running it in a debugger highlights this line
*str = ( char* )realloc( str, len ); // reallocate memory
where there is a mismatch between str - the pointer to the variable - and *str - the pointer to the memory.
I'd be tempted to rewrite it so that it returns the string, or zero on error, rather than having a void return and an in/out parameter (like fgets does, which seems to be the function whose behaviour you're more or less copying). Or wrap such a function. That style doesn't let you get confused, as you're only ever dealing with a pointer to char rather than a pointer to pointer to char.
char* getstr_impl ( char* str, unsigned len ) {...}
void getstr( char** str, unsigned len ) {
*str = getstr_impl ( *str, len );
}