Data lost on MapReduce? - mapreduce

My job processes 10000 rows.
Mapper pseudo-code:
int rows=0;
map()
{rows++}
cleanup(Context c)
{print(rows)}
This code prints:
2669
3354
3353
621
(sum=9997)
Why is the sum 9997?
Reducer pseudo-code:
int rows=0;
reduce()
{rows++}
cleanup(Context c)
{print(rows)}
The reducer prints:
3354
Where is all the other data?
Edit 1
I have found the main problem.
My mistake is that the key I emit is the row number. When a mapper calls the cleanup() function, the row counter (held in the driver of the application) is reset, so the key isn't unique. Can I resolve this by emitting the key that is passed as a parameter to the map function? I don't think cleanup() resets that parameter.
If I instead use a global variable in the driver of the application, will there be a synchronization problem?
Edit 2
My code executes 10000 rows (and 1 header line)
Driver pseudo-code:
public static enum COUNTER {ROW};
Mapper pseudo-code:
map()
{row=context.getCounter(RWDriver.COUNTER.ROW).increment(1);
context.write(row,new Text(...))
}
cleanup(Context c)
{print(c.getCounter(RWDriver.COUNTER.ROW).getValue());}
This code prints:
2670
3355
3354
622
(sum=10001 correct)
After 2670 and 3355, the buffer is full and MapReduce automatically resets the counter ROW to 0. I need the actual number of rows, but this method doesn't work.

Your interpretation of the data might be wrong.
You should use either the MapReduce framework counters or user-defined counters:
Map-Reduce Framework Counters
Map input records
Map output records
Map output bytes
Reduce input groups
Reduce input records
Reduce output records
User Defined Counter
class mapper
{
static enum Counters { INPUT_LINES }
map()
{
context.getCounter(Counters.INPUT_LINES).increment(1);
}
}
Do the same in the reducer.
Read the counter values in the driver after the job completes:
Configuration conf = new Configuration();
Cluster cluster = new Cluster(conf);
Job job = Job.getInstance(cluster,conf);
boolean result = job.waitForCompletion(true);
...
Counters counters = job.getCounters();
for (CounterGroup group : counters) {
System.out.println("* Counter Group: " + group.getDisplayName() + " (" + group.getName() + ")");
System.out.println(" number of counters in this group: " + group.size());
for (Counter counter : group) {
System.out.println(" - " + counter.getDisplayName() + ": " + counter.getName() + ": "+counter.getValue());
}
}

Related

Why is My Program not Working [closed]

Closed 7 years ago.
I am a noob programmer who has just started with C++. I wrote a program to answer a puzzle. When I try to run it from cmd.exe, Windows tells me "a problem has caused this program to stop working, we'll close the program and notify you when a solution is available".
I have included a link to the well documented source code. Please take a look at the code, and help me out.
link: http://mibpaste.com/ZRevGf
I believe that figuring out the error in my code may help several other noob programmers out there who use methods similar to mine.
Code from link:
//This is the source code for a puzzle,well kind of that I saw on the internet. I will include the puzzle's question below.
//Well, I commented it so I hope you understand.
//ALAFIN OLUWATOBI 100L DEPARTMENT OF COMPUTER SCIENCE BABCOCK UNIVERSITY.
//Future CEO of VERI Technologies inc.
/*
* In a corridor, there are 100 doors. All the doors are initially closed.
* You walk along the corridor back and forth. As you walk along the corridor, you reverse the state of each door.
* I.e if the door is open, you close it, and if it is closed, you open it.
* You walk along the corrdor, a total of 200 times.
* On your nth trip, You stop at every nth door, that you come across.
* I.e on your first trip, you stop at every door. On your second trip, every second door, on your third trip every third door and so on and so forth
* Write a program to display, the final states of the doors.
*/
#include <iostream>
#include <cstdlib>
#include <cmath>
using namespace std;
inline void inverse(bool args[]); //The prototype of the function. I made the function inline in the declaration, to increase efficiency and speed of execution.
bool doors [200]; //Declaring a global array, for the doors.
int main ()
{
inverse(doors); //A call to the inverse function
cout << "This is the state of the 100 doors...\n";
for (int i = 0 ; i<200 ; i++) //Loop, to display the final states of the doors.
{
cout << "DOOR " << (i+1) << "\t|" << doors[i] << endl;
}
cout << "Thank you, for using this program designed by VERI Technologies. :)"; //VERI Technologies, is the name of the I.T company that I hope to establish.
return 0;
}
void inverse(bool args [])
{
for (int n = 1 ; n<= 200 ; n++) //This loop, is for the control of every nth trip. It executes 100 times
{
if (n%2 != 0) //This is to control the reversal of the doors going forward, I.e on odd numbers
{
for (int b = n, a = 1 ; b<=200 ;b = n*++a) //This is the control loop, for every odd trip, going forwards. It executes 100 times
args [b] = !args[b] ; //The reversal operation. It reverses the boolean value of the door.
}
/*
* The two variables, are declared. They will be used in controlling the program. b represents the number of the door to be operated on.
* a is a variable, which we shall use to control the value of b.
* n remains constant for the duration, of the loop, as does (200-n)
* the pre increment of a {++a} multiplied by n or (200-n) is used to calculate the value of b in the update.
* Thus, we have the scenario, of b increasing in multiples of n. Achieving what is desired for the program. Through this construct, only every nth door is considered.
*/
else if((n%2) == 0) //This is to control the reversal of the doors going backwards, I.e on even numbers
{
for (int b = (200-n), a = 1 ; b>=1 ; b = (200-n)*++a) //This is the control loop for every even trip, going backwards. It executes 100 times.
args [b] = !args[b] ; //The reversal operation. It reverses the boolean value of the door.
}
}
}
I believe the exception is due to the line:
for (int b = (200 - n), a = 1; b >= 1; b = (200 - n)*++a)
When the exception occurs the following values are assigned to the variables:
b = 3366
n = 2
a = 17
From what I can see, b is calculated by (200 - n) * a.
If we substitute the values given we have: 198 * 17
This gives us the value 3366, which is far beyond the last valid index of doors (199) and triggers the exception when the line
args[b] = !args[b];
is executed.
I have created the following solution that should provide the desired results if you wish to use it.
void inverse(bool args[])
{
//n represents what trip you are taking down the hallway
//i.e. n = 1 is the first trip, n = 2 the second, and so on
for (int n = 1; n <= 200; n++){
//We are on trip n, so now we must change the state of all the doors for the trip
//The current door is represented by i
//i.e. i = 1 is the first door, i = 2 the second, and so on
for (int i = 1; i <= 200; i++){
//If the current door mod the trip is 0 then we must change the state of the door
//Only the nth door will be changed which occurs when i mod n equals 0
//We modify the state of doors[i - 1] as the array of doors is 0 - 199 but we are counting doors from 1 to 200
//So door 1 mod trip 1 will equal 0 so we must change the state of door 1, which is really doors[0]
if (i % n == 0){
args[i - 1] = !args[i - 1];
}
}
}
}
EUREKA!!!!!!
I finally came up with a working solution. No more errors. I'm calling it version 2.0.0
I've uploaded it online, and here's the link
[version 2.0.0] http://mibpaste.com/3NADgl
All that's left is to go to Excel, derive the final states of the doors, and make sure that it's working perfectly. Please take a look at my solution and comment on any errors that I may have made, or any way you think that I could optimize the code. Thank you for your help; it allowed me to redesign a working solution to the program. I'm starting to think that an out-of-bounds error might have caused my version 1 to crash, but the logic was flawed anyway, so I'm scrapping it.
This is the code:
/**********************************************************************************************
200 DOOR PROGRAM
Version 2.0.0
Author: Alafin OluwaTobi Department of Computer Science, Babcock University
New Additions: I redrew the algorithm, to generate a more logically viable solution,
I additionally, expanded the size of the array, to prevent a potential out of bounds error.
**********************************************************************************************/
//Hello. This a program,I've written to solve a fun mental problem.
//I'll include a full explanation of the problem, below.
/**********************************************************************************************
*You are in a Hallway, filled with 200 doors .
*ALL the doors are initially closed .
*You walk along the corridor, *BACK* and *FORTH* reversing the state of every door which you stop at .
*I.e if it is open, you close it .
*If it is closed, you open it .
*On every nth trip, you stop at every nth door .
*I.e on your first trip, you stop at every door. On your second trip every second door, On your third trip every third door, etc .
*Write a program to display the final state of the doors .
**********************************************************************************************/
/**********************************************************************************************
SOLUTION
*NOTE: on even trips, you're coming back, while on odd trips you're going forwards .
*2 Imaginary doors, door 0 and 201, delimit the corridor .
*On odd trips, the doors stopped at will be (0+n) doors .
*I.e you will be counting forward, in (0+n) e.g say, n = 5: 5, 10, 15, 20, 25
*On even trips, the doors stopped at will be (201-n) doors.
*I.e you will be counting backwards in (201-n) say n = 4: 197, 193, 189, 185, 181
**********************************************************************************************/
#include <iostream>
#include <cstdlib> //Including the basic libraries
bool HALLWAY [202] ;
/*
*Declaring the array, for the Hallway, as global in order to initialise all the elements at zero.
*In addition,the size is set at 202 to make provision for the delimiting imaginary doors,
*This also serves to prevent potential out of bound errors, that may occur, in the use of the for loop later on.
*/
inline void inverse (bool args []) ;
/*
*Prototyping the function, which will be used to reverse the states of the door.
*The function, has been declared as inline in order to allow faster compilation, and generate a faster executable program.
*/
using namespace std ; //Using the standard namespace
int main ()
{
inverse (HALLWAY) ; //Calling the inverse function, to act on the Hallway, reversing the doors.
cout << "\t\t\t\t\t\t\t\t\t\t200 DOOR TABLE\n" ;
for(int i = 1 ; i <= 200 ; i++ )
//A loop to display the states of the doors.
{
if (HALLWAY [i] == 0)
//The if construct allows us to print out the state of the door as closed, when the corresponding element of the Array has a value of zero.
{
cout << "DOOR " << i << " is\tCLOSED" << endl ;
for (int z = 0 ; z <= 300 ; z++)
cout << "_" ;
cout << "\n" ;
}
else if (HALLWAY [i] == 1)
//The else if construct allows us to print out the state of the door as open, when the corresponding element of the Array has a value of one.
{
cout << "DOOR " << i << " is\tOPEN" << endl ;
for (int z = 0 ; z <= 300 ; z++)
cout << "_" ;
cout << "\n" ;
}
}
return 0 ; //Returns the value of zero, to show that the program executed properly
}
void inverse (bool args[])
{
for ( int n = 1; n <= 200 ; n++)
//This loop, is to control the individual trips, i.e trip 1, 2, 3, etc..
{
if (n%2 == 0)
//This if construct, is to ensure that on even numbers(i,e n%2 = 0), that you are coming down the hallway and counting backwards
{
for (int b = (201-n) ; b <= 200 && b >= 1 ; b -= n)
/*
*This loop, is for the doors that you stop at on your nth trip.
*The door is represented by the variable b.
*Because you are coming back, b will be reducing proportionally, in n.
*The Starting value for b on your nth trip, will be (201-n)
* {b -= n} takes care of this. On the second turn for example. First value of b will be 199, 197, 195, 193, ..., 1
*/
args [b] = !(args [b]) ;
//This is the actual reversal operation, which reverses the state of the door.
}
else if (n%2 != 0)
//This else if construct, is to ensure that on odd numbers(i.e n%2 != 0), that you are going up the hallway and counting forwards
{
for (int b = n ; b <= 200 && b >= 1 ; b += n)
/*
*This loop, is for the doors that you stop at on your nth trip.
*The door is represented by the variable b.
*Because you are going forwards, b will be increasing proportionally, in n.
*The starting value of b will be (0+n) which is equal to n
* {b += n} takes care of this. On the third turn for example. First value of b will be 3, 6, 9, 12, ...., 198
*/
args [b] = !(args [b]) ;
//This is the actual reversal operation, which reverses the state of the door
}
}
}

Fixed Length Flat File Table Algorithms in c++

I am doing a simple project for table processing using flat files in C++. I have two types of files for accessing the table data.
1) Index File. ( employee.idx )
2) Table File. ( employee.tbl )
In the index file, I have the table details in tab-delimited format, i.e.,
Column-name Column-Type Column-Offset Column-Size
for example, employee.idx
ename string 0 10
eage number 10 2
ecity string 12 10
In the table file, I have the data in fixed-length format.
for example, employee.tbl
first 25Address0001
second 31Address0002
Here is the algorithm I used in my program.
1) First I load the index file data into a 2D vector of strings ( Index Vector ) using fstream.
2) This is my code to load the table file data into a 2D vector:
while (fsMain)
{
if (!getline( fsMain, s )) break;
string s_str;
for(size_t i=0;i<idxTable.size();i++)
{
int fieldSize=stoi(idxTable[i].at(3));
string data (s,stoi(idxTable[i].at(2)),fieldSize);
string tmp=trim_right_inplace(data);
recordVec.push_back( tmp );
}
mainTable.push_back(recordVec);
recordVec.clear();
s="";
}
OK. Now my question is: is there a better way to load the fixed-length data into memory? I tested this process on 60 tables with 200 records each, and it takes nearly 20 seconds. I want to load 100 tables with 200 records each within one second. How can I improve the efficiency of this task?

MongoDB record missing after insertion

I am using MongoDB 2.4.5 64-bit on Linux, using the C++ API to insert 1 M records.
I turned on the write concern after connecting:
mongo.setWriteConcern(mongo::W_NORMAL);
for (int i=0; i<RECORDS; i++) {
mongo::BSONObj record = BSON (
"_id" << i <<
"mystring" << "hello world" );
bulk_data.push_back(record);
if (i % 10000 == 0) {
mongo.insert("insert_test.col1", bulk_data);
}
}
Surprisingly, when I do a count at the end (via count()), it only shows 990001 records in collection 'insert_test.col1'.
What did I do wrong? Thanks for your help.
You're missing mongo.insert("insert_test.col1", bulk_data); immediately after your loop. Unless RECORDS is one more than a multiple of 10000 (you said it was 1000000, which isn't), the records pushed after the last insert are never sent, because they're still in bulk_data!
In other words, i is only 999999 on the last iteration through the loop, so the if isn't entered, and the last 9999 records that were put in bulk_data are not inserted.
Also, bulk_data needs to be cleared after being inserted:
if (i % 10000 == 0) {
mongo.insert("insert_test.col1", bulk_data);
bulk_data.clear(); // <-----
}

Filter strange C++ multimap values

I have this multimap in my code:
multimap<long, Note> noteList;
// notes are added with this method. measureNumber is minimum `1` and doesn't go very high
void Track::addNote(Note &note) {
long key = note.measureNumber * 1000000 + note.startTime;
this->noteList.insert(make_pair(key, note));
}
I'm encountering problems when I try to read the notes from the last measure. In this case the song has only 8 measures and it's measure number 8 that causes problems. If I go up to 16 measures it's measure 16 that causes the problem and so on.
// (when adding notes I use as key the measureNumber * 1000000. This searches for notes within the same measure)
for(noteIT = trackIT->noteList.lower_bound(this->curMsr * 1000000); noteIT->first < (this->curMsr + 1) * 1000000; noteIT++){
if(this->curMsr == 8){
cout << "_______________________________________________________" << endl;
cout << "ID:" << noteIT->first << endl;
noteIT->second.toString();
int blah = 0;
}
// code left out here that processes the notes
}
I have only added one note to the 8th measure and yet this is the result I'm getting in console:
_______________________________________________________
ID:8000001
note toString()
Duration: 8
Start Time: 1
Frequency: 880
_______________________________________________________
ID:1
note toString()
Duration: 112103488
Start Time: 44
Frequency: 0
_______________________________________________________
ID:8000001
note toString()
Duration: 8
Start Time: 1
Frequency: 880
_______________________________________________________
ID:1
note toString()
Duration: 112103488
Start Time: 44
Frequency: 0
This keeps repeating. The first result is a correct note which I've added myself but I have no idea where the note with ID: 1 is coming from.
Any ideas how to avoid this? The loop gets stuck repeating the same two results and I can't get out of it. Even if there are several notes within measure 8 (that is, several keys in the multimap of the form 8xxxxxx), it only repeats the first note and the non-existent one.
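You aren't checking for the end of your loop correctly. Specifically, there is no guarantee that noteIT does not equal trackIT->noteList.end(). Dereferencing the end iterator is undefined behavior, which would explain the bogus note with ID 1. Try this instead: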
for (noteIT = trackIT->noteList.lower_bound(this->curMsr * 1000000);
noteIT != trackIT->noteList.end() &&
noteIT->first < (this->curMsr + 1) * 1000000;
++noteIT)
{
From the look of it, it might be better to use a call to upper_bound as the limit of your loop. That would handle the end case automatically.

C++ and MPI how to write part of code as parallel?

I've been writing some code using the PETSc library, and now I want to change part of it to run in parallel. Most of what I want to parallelize is matrix initialization and the parts where I generate and calculate a large number of values. My problem is the following: if I run the code with more than one core, every part of the code is run as many times as the number of cores I use.
This is just simple sample code where I tested PETSc and MPI
int main(int argc, char** argv)
{
time_t rawtime;
time ( &rawtime );
string sta = ctime (&rawtime);
cout << "Solving began..." << endl;
PetscInitialize(&argc, &argv, 0, 0);
Mat A; /* linear system matrix */
PetscInt i,j,Ii,J,Istart,Iend,m = 120000,n = 3,its;
PetscErrorCode ierr;
PetscBool flg = PETSC_FALSE;
PetscScalar v;
#if defined(PETSC_USE_LOG)
PetscLogStage stage;
#endif
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Compute the matrix and right-hand-side vector that define
the linear system, Ax = b.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Create parallel matrix, specifying only its global dimensions.
When using MatCreate(), the matrix format can be specified at
runtime. Also, the parallel partitioning of the matrix is
determined by PETSc at runtime.
Performance tuning note: For problems of substantial size,
preallocation of matrix memory is crucial for attaining good
performance. See the matrix chapter of the users manual for details.
*/
ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m,n);CHKERRQ(ierr);
ierr = MatSetFromOptions(A);CHKERRQ(ierr);
ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
ierr = MatSetUp(A);CHKERRQ(ierr);
/*
Currently, all PETSc parallel matrix formats are partitioned by
contiguous chunks of rows across the processors. Determine which
rows of the matrix are locally owned.
*/
ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);
/*
Set matrix elements for the 2-D, five-point stencil in parallel.
- Each processor needs to insert only elements that it owns
locally (but any non-local elements will be sent to the
appropriate processor during matrix assembly).
- Always specify global rows and columns of matrix entries.
Note: this uses the less common natural ordering that orders first
all the unknowns for x = h then for x = 2h etc; Hence you see J = Ii +- n
instead of J = I +- m as you might expect. The more standard ordering
would first do all variables for y = h, then y = 2h etc.
*/
PetscMPIInt rank; // processor rank
PetscMPIInt size; // size of communicator
MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
MPI_Comm_size(PETSC_COMM_WORLD,&size);
cout << "Rank = " << rank << endl;
cout << "Size = " << size << endl;
cout << "Generating 2D-Array" << endl;
double temp2D[120000][3];
for (Ii=Istart; Ii<Iend; Ii++) {
for(J=0; J<n;J++){
temp2D[Ii][J] = 1;
}
}
cout << "Processor " << rank << " set values : " << Istart << " - " << Iend << " into 2D-Array" << endl;
v = -1.0;
for (Ii=Istart; Ii<Iend; Ii++) {
for(J=0; J<n;J++){
ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
}
}
cout << "Ii = " << Ii << " processor " << rank << " and it owns: " << Istart << " - " << Iend << endl;
/*
Assemble matrix, using the 2-step process:
MatAssemblyBegin(), MatAssemblyEnd()
Computations can be done while messages are in transition
by placing code between these two statements.
*/
ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
MPI_Finalize();
cout << "No more MPI" << endl;
return 0;
}
My real program has a couple of different .cpp files. I initialize MPI in the main program, which calls a function in another .cpp file where I implemented the same kind of matrix filling, but everything the program prints with cout before filling the matrices is printed as many times as I have cores.
I can run my test program as mpiexec -n 4 test and it runs successfully, but for some reason I have to run my real program as mpiexec -n 4 ./myprog
Output of my test program is following
Solving began...
Solving began...
Solving began...
Solving began...
Rank = 0
Size = 4
Generating 2D-Array
Processor 0 set values : 0 - 30000 into 2D-Array
Rank = 2
Size = 4
Generating 2D-Array
Processor 2 set values : 60000 - 90000 into 2D-Array
Rank = 3
Size = 4
Generating 2D-Array
Processor 3 set values : 90000 - 120000 into 2D-Array
Rank = 1
Size = 4
Generating 2D-Array
Processor 1 set values : 30000 - 60000 into 2D-Array
Ii = 30000 processor 0 and it owns: 0 - 30000
Ii = 90000 processor 2 and it owns: 60000 - 90000
Ii = 120000 processor 3 and it owns: 90000 - 120000
Ii = 60000 processor 1 and it owns: 30000 - 60000
no more MPI
no more MPI
no more MPI
no more MPI
Edit after two comments:
So my goal is to run this on a small cluster which has 20 nodes, each with 2 cores. Later on this should run on a supercomputer, so MPI is definitely the way I need to go. I'm currently testing this on two different machines: one has 1 processor / 4 cores and the other has 4 processors / 16 cores.
MPI is an implementation of the SPMD/MPMD model (single program multiple data / multiple programs multiple data). An MPI job consists of concurrently running processes that exchange messages with each other in order to cooperate on solving a problem. You cannot run only part of the code in parallel. You can only have parts of the code that do not communicate with each other but still execute concurrently. And you ought to use mpirun or mpiexec to start your application in parallel mode.
If you'd like to make only parts of your code parallel and can live with the limitation that the code runs on a single machine only, then what you need is OpenMP and not MPI. You can also use low-level POSIX threads programming since, according to the PETSc web site, it supports pthreads. OpenMP is commonly implemented on top of pthreads, so using PETSc with OpenMP might be possible.
To add to Hristo's answer, MPI is built to run in a distributed fashion, i.e. completely separate processes. They have to be separate, because they are supposed to be on different physical machines. You can run multiple MPI processes on one machine, for example one per core. That's perfectly OK, but MPI does not have any tools to take advantage of that shared memory context. In other words, you cannot have some MPI ranks (processes) do work on a matrix that is owned by another MPI process because you have no way to share the matrix.
When you start x MPI processes you get x copies of the same exact program running. You need code like
if (rank == 0)
do something
else
do something else
to have the different processes do different things. The processes can communicate with each other by sending messages, but they all run the exact same binary.
If the code doesn't diverge, you'll just get x copies of the same program giving the same result x times.