I need to read a file in C++ that has this specific format:
10 5
1 2 3 4 1 5 1 5 2 1
All the values are separated with a space. The first two values on the first line are the variables N and M respectively, and the N values on the second line need to go into an array called S of size N. The code I have written has no problem with files like this one, but it does not work with the really big files (millions of values and more) that I need it to handle. Here is the code:
int N, M;
FILE *read = fopen("file.in", "r");
fscanf(read, "%d %d ", &N, &M);
int S[N];
for (int i = 0; i < N; i++) {
    fscanf(read, "%d ", &S[i]);
}
What should I change?
There are multiple potential issues when getting into the range of millions of integers:
int is most often 32 bits; a 32-bit signed integer has a range of -2^31 to 2^31 - 1, and thus a maximum of 2,147,483,647. You should switch to a 64-bit integral type.
You are using int S[N], a Variable Length Array (VLA), which is not standard C++ (it is standard C99, but... there are discussions as to whether it was a good idea or not). The important detail, though, is that a VLA is stored on the stack: 1 million 32-bit ints is 4 MB, 2 million is 8 MB, etc. Check your default stack size; it is likely no more than 8 MB, and thus you get a stack overflow (you're on the right site for help!).
So, let's switch to C++ and do away with those issues:
#include <cstdint> // for int64_t
#include <fstream>
#include <vector>

int main(int argc, char* argv[]) {
    std::ifstream stream("data.txt");

    int64_t n = 0, m = 0;
    stream >> n >> m;

    std::vector<int> data;
    data.reserve(n); // avoid repeated reallocations for large n

    for (int64_t c = 0; c != n; ++c) {
        int i = 0;
        stream >> i;
        data.push_back(i);
    }

    // do your best :)
}
First of all, we use int64_t from <cstdint> to do away with the integer overflow issue. Second, we use a stream (an input file stream, ifstream) to avoid having to learn which format specifier is associated with each and every integral type (it's a pain). Third, we use a vector to store the data we read, and do away with the stack overflow issue.
You are using variable-length arrays. This is not standard C++ and is not supported by all compilers. Even if your compiler supports it, once you get into the millions you'll run out of stack space (stack overflow).
Alternatively, you could define S as being a vector with vector<int> S(N);
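For instance, here is a minimal sketch of that change, keeping the question's fscanf approach but replacing the VLA with a vector (assuming the same file.in layout):

#include <cstdio>
#include <vector>

int main() {
    int N = 0, M = 0;
    FILE *read = fopen("file.in", "r");
    if (!read || fscanf(read, "%d %d", &N, &M) != 2)
        return 1;                   // missing file or malformed header
    std::vector<int> S(N);          // heap-allocated: no stack overflow
    for (int i = 0; i < N; i++)
        fscanf(read, "%d", &S[i]);
    fclose(read);
}

With N in the millions the data now lives on the heap, so the stack is no longer an issue.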
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main()
{
    std::string qaz{};
    vector<size_t> index;
    cout << "qaz: " << qaz << " length: " << qaz.length() << "\n";
    for (size_t i{0}; i <= (qaz.length() - 2); i++)
    {
        cout << "Entered" << i << "\n";
        cout << "Exited" << i << "\n";
    }
    return 0;
}
Here qaz is an empty string, so qaz.length() == 0 (and so, I assumed, qaz.length() - 2 == -2), and i is initialized to 0, so I expected that we would not enter the loop. But on running it I find that it goes into an infinite loop. Why? Please help me with it.
See docs for size_t:
std::size_t is the *unsigned* integer type of the result of the sizeof operator
(Emphasis mine.)
Furthermore, string::length returns a size_t too¹.
But even if that were not the case, when comparing signed values to unsigned values, the signed value is converted to unsigned before the comparison, as explained in this answer.
(size_t)0 - 2 will underflow, as size_t is unsigned and therefore its minimum value is zero, resulting in a large number which is usually² either 2^32 - 2 or 2^64 - 2 depending on the processor architecture. Let's go with the latter; then you will get 18,446,744,073,709,551,614 as the result.
Now, looking at the result of 0 <= 18446744073709551614 you can see that zero is clearly less than or equal to 18.4 quintillion, so the loop condition is fulfilled. In fact the loop is not infinite; it will loop exactly 18,446,744,073,709,551,615 times, but it's true you will probably not want to wait for it to finally reach its finishing point.
The solution is to avoid the underflow by comparing i + y <= x instead of i <= x - y³, i.e. i + 2 <= qaz.length(). You will then have 2 <= 0, which is false.
¹: Technically, it returns an std::allocator<char>::size_type, but that is defined as std::size_t.
²: To be exact, it is SIZE_MAX - (2 - 1), i.e. SIZE_MAX - 1 (see limits). In terms of numeric value, it could also be 2^16 - 2 - such as on an ATmega328P microcontroller - or some other value, but on the architectures you get on desktop computers at the current point in time it's most likely one of the two I mentioned. It depends on the width of the std::size_t type: if it is X bits wide, you get 2^X - n for (size_t)0 - n, for 0 < n < 2^X. Since C++11 it is guaranteed that std::size_t is no less than 16 bits wide.
³: However, in the unlikely case that your length is very large - specifically at least the number calculated above, 2^X - 2, or larger - this would result in an overflow instead. But in that case your whole logic would be flawed and you'd need a different approach. I think this can't be the case anyway, because std::ssize support means that string lengths would have to leave one unused bit to be repurposed as a sign bit, but I think this answer has gone down enough rabbit holes already.
length() returns an unsigned value, which cannot be below zero. 0u - 2 wraps around and becomes a very large number.
Use i + 2 <= qaz.length() instead.
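A minimal sketch of the question's loop with that fix applied; with an empty string the condition is 2 <= 0, so the body never runs:

#include <iostream>
#include <string>
using namespace std;

int main()
{
    std::string qaz{};
    for (size_t i = 0; i + 2 <= qaz.length(); i++)
    {
        cout << "Entered" << i << "\n";
        cout << "Exited" << i << "\n";
    }
}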
The issue is that size_t is unsigned. length() returns the string's size_type, which is unsigned and most likely also size_t. When the string's size is < 2, length() - 2 wraps around to yield a large unsigned value.
Since C++20 there is std::ssize, which returns a signed value. Though you also have to adjust the type of i to a signed type, so that the comparison stays signed and the loop runs the correct number of iterations when the right-hand side is negative:
#include <iostream>
#include <iterator> // for std::ssize
#include <string>
#include <vector>
using namespace std;

int main()
{
    std::string qaz{};
    vector<size_t> index;
    cout << "qaz: " << qaz << " length: " << qaz.length() << "\n";
    for (int i{0}; i <= (std::ssize(qaz) - 2); i++)
    {
        cout << "Entered" << i << "\n";
        cout << "Exited" << i << "\n";
    }
}
Alternatively, stay with unsigned types and use i + 2 <= qaz.length().
Each datatype has a certain range, based on the hardware. For example, on a 32-bit machine an int has the range -2147483648 to 2147483647.
C++ compilers 'pad' object memory to fit into certain sizes. I'm pretty sure it's 2, 4, 8, 16, 32, 64 etc. This also probably depends on the machine.
I want to manually align my objects to meet padding requirements. Is there a way to:
Determine what machine a program is running on
Determine padding sizes
Set custom data-type based on bitsize
I've used bitsets before in Java, but I'm not familiar with C++. As for machine requirements, I know programs for different hardware sets are usually compiled differently in C++, so I'm wondering if it's even possible at all.
Example:
/* getHardwarePackSize obviously doesn't exist; it's just here to explain.
   What I'm trying to get here would be the minimum alignment size for the
   machine the program is running on. */
#define PACK_SIZE getHardwarePackSize()
#define MONTHS 12

class date {
private:
    // Pseudo code that represents making a custom type
    customType monthType = MONTHS / PACK_SIZE;
    monthType.remainder = MONTHS % PACK_SIZE;
    monthType months = 12;
};
The idea is to be able to fit every variable into the minimum bit size and track how many bits are left over.
Theoretically, it would be possible to make use of every unused bit and improve memory efficiency. Obviously nothing would ever work exactly like this, but the example is just to explain the concept.
This is a lot more complex than what you are trying to describe, as there are requirements for alignment on objects and items within objects. For example, if the compiler decides that an integer item is 16 bytes into a struct or class, it may well decide that "ah, I can use an aligned SSE instruction to load this data, because it is aligned at 16 bytes" (or something similar in ARM, PowerPC, etc). So if you do not satisfy AT LEAST that alignment in your code, you will cause the program to go wrong (crash or misread the data, depending on the architecture).
Typically, the alignment used and given by the compiler will be "right" for whatever architecture the compiler is targeting. Changing it will often lead to worse performance. Not always, of course, but you'd better know exactly what you are doing before you fiddle with it. And measure the performance before/after, and test thoroughly that nothing has been broken.
The padding is typically just to the next "minimum alignment for the largest type" - e.g. if a struct contains only int and a couple of char variables, it will be padded to a multiple of 4 bytes [inside the struct and at the end, as required]. For double, padding to 8 bytes is done for the same reason, but three doubles will, typically, take up 8 * 3 bytes with no further padding.
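A quick way to see those rules in action is to print a few sizes and alignments; a sketch (S1 and S2 are made-up names, and the commented numbers are what you would typically see on x86-64):

#include <cstdio>

struct S1 { int a; char b; char c; };  // 6 bytes used, padded to 8
struct S2 { double a, b, c; };         // 3 * 8 = 24 bytes, no extra padding

int main() {
    printf("S1: size %zu, align %zu\n", sizeof(S1), alignof(S1));  // 8, 4
    printf("S2: size %zu, align %zu\n", sizeof(S2), alignof(S2));  // 24, 8
}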
Also, determining what hardware you are executing on (or will execute on) is probably better done during compilation, than during runtime. At runtime, your code will have been generated, and the code is already loaded. You can't really change the offsets and alignments of things at this point.
If you are using the gcc or clang compilers, you can use __attribute__((aligned(n))); e.g. int x[4] __attribute__((aligned(32))); would create a 16-byte array that is aligned to 32 bytes. This can be done inside structures or classes as well as for any variable you are using. But this is a compile-time option and cannot be changed at runtime.
It is also possible, from C++11 onwards, to find out the alignment of a type or variable with alignof.
Note that it gives the alignment required for the type, so if you do something daft like:
int x;
char buf[4 * sizeof(int)];
int *p = (int *)(buf + 7);   // deliberately misaligned pointer
std::cout << alignof(*p) << std::endl;
the code will print 4, even though the address buf + 7 is misaligned for int (7 modulo 4 is 3, not 0).
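Since C++11 you can also request a specific alignment portably with alignas, which covers much the same ground as the GCC attribute; a minimal sketch:

#include <cstdint>
#include <cstdio>

alignas(32) int x[4];  // 16-byte array placed on a 32-byte boundary

int main() {
    // The address of x is a multiple of 32, so this prints 0.
    printf("%u\n", (unsigned)(reinterpret_cast<std::uintptr_t>(&x) % 32));
}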
Types cannot be chosen at runtime. C++ is a statically typed language: the type of something is determined at compile time. Sure, objects of classes that derive from a base class can be created at runtime, but any given object has ONE TYPE, always and forever, until it is no longer allocated.
It is better to make such choices at compile time, as it makes the code much more straightforward for the compiler and will allow better optimisation than if the choices are made at runtime, since you then have to make a runtime decision to use branch A or branch B of some piece of code.
As an example of aligned vs. unaligned access:
#include <cstdio>
#include <cstdlib>
#include <vector>

#define LOOP_COUNT 1000

unsigned long long rdtscl(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

struct A
{
    long a;
    long b;
    long d;
    char c;
};

struct B
{
    long a;
    long b;
    long d;
    char c;
} __attribute__((packed));

std::vector<A> arr1(LOOP_COUNT);
std::vector<B> arr2(LOOP_COUNT);

int main()
{
    for (int i = 0; i < LOOP_COUNT; i++)
    {
        arr1[i].a = arr2[i].a = rand();
        arr1[i].b = arr2[i].b = rand();
        arr1[i].c = arr2[i].c = rand();
        arr1[i].d = arr2[i].d = rand();
    }

    printf("align A %zu, size %zu\n", alignof(A), sizeof(A));
    printf("align B %zu, size %zu\n", alignof(B), sizeof(B));

    for (int loops = 0; loops < 10; loops++)
    {
        printf("Run %d\n", loops);
        size_t sum = 0;
        size_t sum2 = 0;

        unsigned long long before = rdtscl();
        for (int i = 0; i < LOOP_COUNT; i++)
            sum += arr1[i].a + arr1[i].b + arr1[i].c + arr1[i].d;
        unsigned long long after = rdtscl();
        printf("ARR1 %llu sum=%zu\n", (after - before), sum);

        before = rdtscl();
        for (int i = 0; i < LOOP_COUNT; i++)
            sum2 += arr2[i].a + arr2[i].b + arr2[i].c + arr2[i].d;
        after = rdtscl();
        printf("ARR2 %llu sum=%zu\n", (after - before), sum2);
    }
}
[Part of that code is taken from another project, so it's perhaps not the neatest C++ code ever written, but it saved me writing code from scratch that isn't relevant to the project]
Then the results:
$ ./a.out
align A 8, size 32
align B 1, size 25
Run 0
ARR1 5091 sum=3218410893518
ARR2 5051 sum=3218410893518
Run 1
ARR1 3922 sum=3218410893518
ARR2 4258 sum=3218410893518
Run 2
ARR1 3898 sum=3218410893518
ARR2 4241 sum=3218410893518
Run 3
ARR1 3876 sum=3218410893518
ARR2 4184 sum=3218410893518
Run 4
ARR1 3875 sum=3218410893518
ARR2 4191 sum=3218410893518
Run 5
ARR1 3876 sum=3218410893518
ARR2 4186 sum=3218410893518
Run 6
ARR1 3875 sum=3218410893518
ARR2 4189 sum=3218410893518
Run 7
ARR1 3925 sum=3218410893518
ARR2 4229 sum=3218410893518
Run 8
ARR1 3884 sum=3218410893518
ARR2 4210 sum=3218410893518
Run 9
ARR1 3876 sum=3218410893518
ARR2 4186 sum=3218410893518
As you can see, the code that is aligned, using arr1, takes around 3900 clock cycles, and the one using arr2 takes around 4200 cycles. So 300 cycles out of roughly 4000 cycles, some 7.5%, if my mental arithmetic works correctly.
Of course, like so many different things, it really depends on the exact situation: how the objects are used, what the cache size is, exactly what processor it is, and how much other code and data in other places around it is also using cache space. The only way to be certain is to experiment with YOUR code.
[I ran the code several times, and although I didn't always get the same results, I always got similar proportional results]
I have the question of the title, but if not, how could I get away with using only 4 bits to represent an integer?
EDIT: Really my question is how. I am aware that there are 1-byte data types in a language like C, but how could I use something like a char to store two integers?
In C or C++ you can use a struct to allocate the required number of bits to a variable, as given below:
#include <stdio.h>

struct packed {
    unsigned char a:4, b:4;
};

int main() {
    struct packed p;
    p.a = 10;
    p.b = 20;
    printf("p.a %d p.b %d size %zu\n", p.a, p.b, sizeof(struct packed));
    return 0;
}
The output is p.a 10 p.b 4 size 1, showing that p takes only 1 byte to store, and that numbers with more than 4 bits (larger than 15) get truncated, so 20 (0x14) becomes 4. This is simpler to use than the manual bitshifting and masking used in the other answer, but it is probably not any faster.
You can store two 4-bit numbers in one byte (call it b, an unsigned char).
Using hex it is easy to see: in b = 0xAE the two numbers are A and E.
Use a mask to isolate them:
a = (b & 0xF0) >> 4
and
e = b & 0x0F
You can easily define functions to set/get both numbers in the proper portion of the byte.
Note: if the 4-bit numbers need to have a sign, things can become a tad more complicated since the sign must be extended correctly when packing/unpacking.
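A sketch of such set/get helpers for unsigned 4-bit values (the function names are made up for illustration):

#include <cstdio>

unsigned char get_high(unsigned char b) { return (b & 0xF0) >> 4; }
unsigned char get_low(unsigned char b)  { return b & 0x0F; }
unsigned char set_high(unsigned char b, unsigned char v) {
    return (b & 0x0F) | ((v & 0x0F) << 4);   // replace the high nibble
}
unsigned char set_low(unsigned char b, unsigned char v) {
    return (b & 0xF0) | (v & 0x0F);          // replace the low nibble
}

int main() {
    unsigned char b = 0;
    b = set_high(b, 0xA);
    b = set_low(b, 0xE);
    printf("%X %X\n", get_high(b), get_low(b));  // prints A E
}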
I am learning C/C++ programming and have encountered the usage of 'Bit arrays' or 'Bit Vectors'. I am not able to understand their purpose. Here are my doubts:
Are they used as boolean flags?
Can one use int arrays instead? (more memory of course, but..)
What's this concept of Bit-Masking?
If bit-masking is simple bit operations to get an appropriate flag, how does one program with them? Is it not difficult to do this operation in one's head to see what the flag would be, as opposed to decimal numbers?
I am looking for applications, so that I can understand better. For example:
Q. You are given a file containing integers in the range 1 to 1 million. There are some duplicates, and hence some numbers are missing. Find the fastest way of finding the missing numbers.
For the above question, I have read solutions telling me to use bit arrays. How would one store each integer in a bit?
I think you've got yourself confused between arrays and numbers, specifically what it means to manipulate binary numbers.
I'll go about this by example. Say you have a number of error messages and you want to return them in a return value from a function. Now, you might label your errors 1, 2, 3, 4..., which makes sense to your mind, but then how do you, given just one number, work out which errors have occurred?
Now, try labelling the errors 1, 2, 4, 8, 16... increasing powers of two, basically. Why does this work? Well, when you work in base 2 you are manipulating a number like 00000000, where each digit corresponds to a power of 2 given by its position from the right. So let's say errors 1, 4 and 8 occur. Well, then that could be represented as 00001101. In reverse, the first digit = 1*2^0, the third digit = 1*2^2 and the fourth digit = 1*2^3. Adding them all up gives you 13.
Now, we are able to test if such an error has occurred by applying a bitmask. For example, if you want to work out whether error 8 has occurred, use the bit representation of 8 = 00001000. Now, in order to extract whether or not that error has occurred, use a binary AND like so:
  00001101
& 00001000
= 00001000
I'm sure you know how an AND works or can deduce it from the above - working digit-wise, if any two digits are both 1, the result is 1; otherwise it is 0.
Now, in C:
int func(...)
{
    int retval = 0;
    if (some_test_that_means_an_error)
    {
        retval += 1;
    }
    if (some_other_test_that_means_an_error)
    {
        retval += 2;
    }
    return retval;
}

int anotherfunc(...)
{
    uint8_t x = func(...);
    /* binary AND with 8 and shift 3 places to the right
     * so that the resultant expression is either 1 or 0 */
    if (((x & 0x08) >> 3) == 1)
    {
        /* that error occurred */
    }
}
Now, to practicalities. When memory was sparse and protocols didn't have the luxury of verbose XML etc., it was common to delimit a field as being so many bits wide. In that field, you assign various bits (flags, powers of 2) a certain meaning and apply binary operations to deduce if they are set, then operate on these.
I should also add that binary operations are close in idea to the underlying electronics of a computer. Imagine if the bit fields corresponded to the output of various circuits (carrying current or not). By using enough combinations of said circuits, you make... a computer.
Regarding the usage of the bit array:
If you know there are "only" 1 million numbers, you use an array of 1 million bits. In the beginning all bits are zero, and every time you read a number you use it as an index and set the bit at that index to one (if it is not one already).
After reading all the numbers, the missing numbers are the indices of the zeros in the array.
For example, if we had only the numbers between 0 and 4, the array would look like this in the beginning: 0 0 0 0 0.
If we read the numbers 3, 2, 2:
read 3 --> 0 0 0 1 0. read 2 --> 0 0 1 1 0. read 2 (again) --> 0 0 1 1 0. Check the indices of the zeros: 0, 1, 4 - those are the missing numbers.
BTW, of course you can use integers instead of bits, but it may take (depending on the system) 32 times the memory.
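A compact sketch of that walkthrough using std::bitset, which packs one bit per possible value:

#include <bitset>
#include <cstdio>

int main() {
    std::bitset<5> seen;            // values 0..4, all bits start at zero
    int input[] = {3, 2, 2};        // duplicates just set the same bit again
    for (int x : input) seen.set(x);
    for (int i = 0; i < 5; ++i)
        if (!seen.test(i)) printf("missing: %d\n", i);  // prints 0, 1, 4
}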
Bit Arrays or Bit Vectors can be thought of as arrays of boolean values. Normally a boolean variable needs at least one byte of storage, but in a bit array/vector only one bit is needed per value.
This comes in handy if you have lots of such data, as you save a lot of memory.
Another usage is if you have numbers which do not exactly fit in the standard variable sizes of 8, 16, 32 or 64 bits. You could, this way, store into a 16-bit bit vector one number which is 4 bits, one that is 2 bits and one that is 10 bits in size. Normally you would have to use 3 variables with sizes of 8, 8 and 16 bits, so 50% of the storage would be wasted.
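A sketch of that packing with bit-fields (Packed is a made-up name, and the exact layout is implementation-defined):

#include <cstdio>
#include <cstdint>

struct Packed {
    std::uint16_t a : 4;   // 4-bit field
    std::uint16_t b : 2;   // 2-bit field
    std::uint16_t c : 10;  // 10-bit field
};

int main() {
    Packed p{5, 3, 700};
    // Typically 2 bytes: all three fields share one 16-bit storage unit.
    printf("sizeof(Packed) = %zu, a=%u b=%u c=%u\n",
           sizeof(Packed), (unsigned)p.a, (unsigned)p.b, (unsigned)p.c);
}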
But all these uses are very rarely needed in business applications; they come into use often when interfacing with drivers through pinvoke/interop functions and when doing low-level programming.
Bit Arrays or Bit Vectors are used as a mapping from position to some bit value. Yes, it's basically the same thing as an array of bool, but a typical bool implementation is one to four bytes long and uses too much space.
We can store the same amount of data much more efficiently by using arrays of words, with binary masking operations and shifts to store and retrieve it (less overall memory used, fewer accesses to memory, fewer cache misses, fewer memory page swaps). The code to access individual bits is still quite straightforward.
There is also some bit-field support built into the C language (you write things like int i:1; to say "only consume one bit"), but it is not available for arrays and you have less control over the overall result (details of implementation depend on the compiler and alignment issues).
Below is a possible way to answer your "search missing numbers" question. I fixed the int size to 32 bits to keep things simple, but it could be written using sizeof(int) to make it portable. And (depending on the compiler and target processor) the code could be made faster using >> 5 instead of / 32 and & 31 instead of % 32, but that is just to give the idea.
#include <stdio.h>
#include <stdint.h>

int main(){
    /* put all numbers from 0 to 999999 in a file,
       except 765, 777760 and 777791 */
    {
        printf("writing test file\n");
        int x = 0;
        FILE * f = fopen("testfile.txt", "w");
        for (x = 0; x < 1000000; ++x){
            if (x == 765 || x == 777760 || x == 777791){
                continue;
            }
            fprintf(f, "%d\n", x);
        }
        fprintf(f, "%d\n", 57768); /* this one is a duplicate */
        fclose(f);
    }

    uint32_t bitarray[1000000 / 32] = {0}; /* all bits start unset */

    /* read file containing integers in the range [0,1000000) */
    /* any non-number is considered a separator */
    /* the goal is to find the missing numbers */
    printf("Reading test file\n");
    {
        unsigned int x = 0;
        FILE * f = fopen("testfile.txt", "r");
        while (1 == fscanf(f, " %u", &x)){
            bitarray[x / 32] |= 1u << (x % 32);
        }
        fclose(f);
    }

    /* find missing numbers in bitarray */
    {
        int x = 0;
        for (x = 0; x < (1000000 / 32); ++x){
            uint32_t n = bitarray[x];
            if (n != (uint32_t)-1){
                printf("Missing number(s) between %d and %d [%x]\n",
                       x * 32, (x + 1) * 32, bitarray[x]);
                int b;
                for (b = 0; b < 32; ++b){
                    if (0 == (n & (1u << b))){
                        printf("missing number is %d\n", x * 32 + b);
                    }
                }
            }
        }
    }
}
That is used for bit-flag storage, as well as for parsing the fields of various binary protocols, where one byte is divided into a number of bit-fields. This is widely used in protocols like TCP/IP, up through ASN.1 encodings, OpenPGP packets, and so on.
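As an illustration, the first byte of an IPv4 header packs two 4-bit fields: the version in the high nibble and the header length (IHL, counted in 32-bit words) in the low nibble. A sketch of pulling them apart:

#include <cstdio>

int main() {
    unsigned char first = 0x45;               // a typical IPv4 first byte
    unsigned version = (first & 0xF0) >> 4;   // high nibble: 4
    unsigned ihl = first & 0x0F;              // low nibble: 5
    printf("IPv4 version %u, header length %u bytes\n", version, ihl * 4);
}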
In 'Thinking in C++' by Bruce Eckel, there is a program given to print a double value
in binary. (Chapter 3, page no. 189)
int main(int argc, char* argv[])
{
    if(argc != 2)
    {
        cout << "Must provide a number" << endl;
        exit(1);
    }
    double d = atof(argv[1]);
    unsigned char* cp = reinterpret_cast<unsigned char*>(&d);
    for(int i = sizeof(double); i > 0; i -= 2)
    {
        printBinary(cp[i-1]);
        printBinary(cp[i]);
    }
}
Here, while printing cp[i] when i = 8 (assuming double is 8 bytes), wouldn't it be undefined behaviour?
I mean, this code doesn't work, as it never prints cp[0].
A1: Yes, it would be undefined behaviour when it accesses cp[8].
A2: Yes, it also does not print cp[0].
As shown, it prints bytes 7, 8, 5, 6, 3, 4, 1, 2 of the valid indices 0..7. So, if you have copied the code correctly from the book, there is a bug in the book's code. Check the errata page for the book, if there is one.
It is also odd that it partially unrolls the loop; a simpler formulation is:
for (int i = sizeof(double); i-- > 0; )
printBinary(cp[i]);
There is also, presumably, a good reason for printing the bytes in reverse order; it is not obvious what that would be.
It looks like a typo in the book's code. The second call should probably be printBinary(cp[i-2]).
This is a bit weird though, because it reverses the byte order compared to what's actually in memory (IEEE 754 floating-point numbers have no rules about endianness, so I guess it's valid on his platform), and because it counts by 2 instead of just 1.
It would be simpler to write
for(int i = 0; i != sizeof(double) ; ++i) printBinary(cp[i]);
or (if reversing the bytes is important) use the standard idiom for a loop that counts down
for(int i = sizeof(double); (i--) > 0;) printBinary(cp[i]);
You can do this in an endian-independent way, by casting the double to an unsigned long long. Then you can use simple bit shifting on the integer to access and print the bits, from bit 0 to bit 63.
(I've written a C function called print_raw_double_binary() that does this - see my article Displaying the Raw Fields of a Floating-Point Number for details.)
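A sketch of that approach, using memcpy for the type pun to stay clear of aliasing problems (print_double_bits is a made-up name, and a 64-bit double is assumed):

#include <cstdint>
#include <cstdio>
#include <cstring>

void print_double_bits(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // copy the raw representation
    for (int i = 63; i >= 0; --i)         // most significant bit first
        putchar('0' + (int)((bits >> i) & 1));
    putchar('\n');
}

int main() {
    print_double_bits(1.0);  // sign 0, exponent 01111111111, mantissa zeros
}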