I have a CSV file that must be read and have its duplicate rows removed before it is written back out.
A row counts as a duplicate when both the date and the price match an earlier row (an AND condition). In the example below, rows 1, 2, and 4 would be written to the CSV. Row 3 is a duplicate (its date and price match row 1) and would be excluded (not written to the CSV).
address      floor  date       price
40 B STREET  18     3/29/2015  2200000
40 B STREET  23     1/7/2015   999000
40 B STREET  18     3/29/2015  2200000
40 B STREET  18     4/29/2015  2200000
You can use a map of date and price to find duplicates. I am not sharing the complete code, but this should give you a pointer on how to do what you want.
1) Create a map of date and price.
std::map<std::string, long> dataMap;
2) Read a line from the CSV and look up its date in dataMap. If the key (date) is found, check the value (price); if both match, the record is a duplicate and should be ignored.
// Read a line from CSV and parse it to store
// different values in below variables. This should
// be in a loop where each loop will be fetching
// a line from the CSV. The loop should continue till
// you reach end of the input CSV file.
int floorValue;
std::string addressValue;
std::string dateValue;
long priceValue;
// The benefit of using map is observed here, where you
// can find a particular date in O(log n) time complexity
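auto it = dataMap.find(dateValue);
if (it != dataMap.end() && it->second == priceValue)
{
    // Same date and same price: duplicate, ignore this record
}
else
{
    // Remember this date and price for later lookups. Note that
    // a map keeps only one price per date.
    dataMap[dateValue] = priceValue;
    // Write this entry into another CSV file.
    // You can later rename this to the original
    // CSV file, which will give an impression that
    // your duplicate entries have been removed
    // from the original CSV file.
}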
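Because a std::map stores a single price per date, a variation worth considering is a std::set of (date, price) pairs, which remembers every combination seen so far. Below is a minimal sketch of that idea, not the original answer's code. It assumes the file is comma-separated with the fields in the order address,floor,date,price and a header on the first line; the function name copyWithoutDuplicates is made up for the example.

```cpp
#include <cstddef>
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <utility>

// Copies CSV rows from `in` to `out`, skipping every row whose
// (date, price) pair has already been seen.
// Assumed layout (not from the original post): address,floor,date,price
// with a header line first. Returns the number of data rows written.
std::size_t copyWithoutDuplicates(std::istream& in, std::ostream& out) {
    std::set<std::pair<std::string, long>> seen;
    std::string line;
    std::size_t written = 0;

    // Pass the header through unchanged.
    if (std::getline(in, line)) out << line << '\n';

    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string address, floor, date, price;
        if (!std::getline(fields, address, ',') ||
            !std::getline(fields, floor, ',') ||
            !std::getline(fields, date, ',') ||
            !std::getline(fields, price, ',')) {
            continue;  // skip rows that do not have four fields
        }
        // emplace() returns {iterator, bool}; the bool is true only
        // when this (date, price) pair was not in the set before.
        // std::stol throws if the price is not numeric.
        if (seen.emplace(date, std::stol(price)).second) {
            out << line << '\n';
            ++written;
        }
    }
    return written;
}
```

To use it on real files, open a std::ifstream for the input and a std::ofstream for a temporary output file, call copyWithoutDuplicates(input, output), and then rename the output over the original.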
I have a file that contains numbers:
23 899 234 12
12 366 100 14
10 256 500 23
14 888 564 30
How can I read this file by column using C++? I searched YouTube but they only read file by row. If I have to find the highest value from the first column, I need to read it by columns right?
Try initialising a variable called column to 0.
For that column, loop over every row (the number of rows is the number of lines in the file) and perform whatever operation you need on the value at that position. Then increment column to move on to the next column.
You can also keep the values as n vectors,
where n is the number of rows
and the dimension of each vector is the number of columns.
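As a rough sketch of the idea above: the file still has to be read row by row, but each value can be filed under its column index as it is read, so that afterwards a whole column is available as one vector. This variant keeps one vector per column rather than one per row. The names readColumns and columnMax are made up for the example, and the input is assumed to contain only whitespace-separated integers.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Reads whitespace-separated integers line by line and groups them by
// column: result[c] holds every value of column c, from top to bottom.
std::vector<std::vector<int>> readColumns(std::istream& in) {
    std::vector<std::vector<int>> columns;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream row(line);
        int value;
        for (std::size_t c = 0; row >> value; ++c) {
            if (c >= columns.size()) columns.resize(c + 1);
            columns[c].push_back(value);
        }
    }
    return columns;
}

// Largest value in one column. The column must not be empty.
int columnMax(const std::vector<int>& column) {
    return *std::max_element(column.begin(), column.end());
}
```

With the numbers from the question, readColumns gives four vectors of four values each, and columnMax(columns[0]) is the highest value of the first column.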
Files are storage with sequential access, so in general you have to read everything. Essentially it's like reading from a tape, and the format of your file doesn't offer any shortcuts.
However, you can fast-forward and rewind within a file using seekg() if it's a file on permanent storage. That is not efficient, and you have to know the position to jump to. If your records all have the same size, you can advance by a fixed number of bytes.
That is usually done with binary formats, and those formats are designed to have some kind of directory or other auxiliary data to help find the proper position.
Read each line, split it by spaces, and store the first value as the highest value. Repeat this for every row until you reach the end of the file, each time comparing the first value with the highest value stored so far.
I don't have a C++ compiler available to test this, but it should work for you conceptually.
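#include <fstream>
#include <iostream>
#include <limits>
#include <sstream>
#include <string>
using namespace std;
int main() {
    ifstream inFile("Read.txt");
    string line;
    // Start below any possible value so negative numbers work too
    int greatest = numeric_limits<int>::min();
    while (getline(inFile, line)) {
        stringstream ss(line);
        int firstColumn;
        // Extract only the first value on the line (the first column)
        if (ss >> firstColumn && firstColumn > greatest)
            greatest = firstColumn;
    }
    // Print the result once, after the whole file has been read
    cout << greatest << endl;
}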
I am creating a command-line minesweeper game which has a save and continue capability. My code generates a file called "save.txt" which stores the position of the mines and the cells that the player has opened. It is separated into two columns delimited by a space where the left column represents the row of the cell and the right column represents the column of the cell in the matrix generated by my code. The following is the contents of save.txt after a sample run:
3 7
3 9
5 7
6 7
8 4
Mine end
2 9
1 10
3 5
1 1
Cell open end
You may have noticed Mine end and Cell open end. These two lines separate the numbers into two groups: the first is the positions of the mines and the second is the positions of the cells opened by the player. I have written code that fills an array for each column, provided that the text file contains only integers:
int arrayRow[9];
int arrayCol[9];
ifstream infile("save.txt");
int a, b;
while(infile >> a >> b){
    for(int i = 0; i < 9; i++){
        arrayRow[i] = a;
        arrayCol[i] = b;
    }
}
As you can see, this won't quite work with my text file since it contains non-integer text. Basically, I want to create four arrays: mineRow, mineCol, openedRow, and openedCol, as described in the first paragraph.
Aside from parsing the strings yourself and doing string operations, you could redefine the file format to have a header. Then you parse the header once and keep everything else as numbers. Namely:
Let the Header be the first two rows
Row 1 = mineRowLen mineColLen
Row 2 = openedRowLen openedColLen
Row 3...N = data
save.txt:
40 30
20 10
// rest of the entries
Then you read 40 entries for mineRow, 30 for mineCol, 20 for openedRow, and 10 for openedCol, since you know their lengths up front. This is potentially harder to debug, but it makes the save state less obvious and therefore harder to modify casually.
You can read the file line by line.
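If the line matches "Mine end" or "Cell open end", skip it with continue (after "Mine end", start filling the opened-cell arrays instead of the mine arrays).
Otherwise, split the line by space (" ") and fill the arrays.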
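The line-by-line approach could look roughly like the sketch below. It uses std::vector instead of fixed-size arrays so the number of entries does not have to be known in advance; the SaveData struct and the loadSave function are names invented for the example, and the file layout is assumed to be exactly the one shown in the question.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// The four "arrays" from the question, kept as vectors.
struct SaveData {
    std::vector<int> mineRow, mineCol;
    std::vector<int> openedRow, openedCol;
};

// Reads a save file laid out as in the question: mine positions,
// the line "Mine end", opened-cell positions, the line "Cell open end".
SaveData loadSave(std::istream& in) {
    SaveData data;
    bool readingMines = true;  // the mine section comes first
    std::string line;
    while (std::getline(in, line)) {
        if (line == "Mine end") {
            readingMines = false;  // switch to the opened-cell section
            continue;
        }
        if (line == "Cell open end") break;

        std::istringstream fields(line);
        int row, col;
        if (!(fields >> row >> col)) continue;  // ignore anything else

        if (readingMines) {
            data.mineRow.push_back(row);
            data.mineCol.push_back(col);
        } else {
            data.openedRow.push_back(row);
            data.openedCol.push_back(col);
        }
    }
    return data;
}
```

Calling loadSave on a std::ifstream opened on "save.txt" returns all four lists in one go. Lines saved with Windows line endings would need the trailing '\r' stripped before the comparisons.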
I am doing a simple project for table processing using flat files in C++. I have two types of files for accessing the table data.
1) Index File. ( employee.idx )
2) Table File. ( employee.tbl )
In the index file, I have the table details in tab-delimited format, i.e.,
Column-name Column-Type Column-Offset Column-Size
for example, employee.idx
ename string 0 10
eage number 10 2
ecity string 12 10
In the table file, I have the data in fixed-length format.
for example, employee.tbl
first 25Address0001
second 31Address0002
Here is the algorithm I used in my program.
1) First I load the index file data into a 2D vector of strings (the index vector) using fstream.
2) This is my code to load the table file data into a 2D vector:
while (fsMain)
{
    if (!getline( fsMain, s )) break;
    string s_str;
    for(size_t i=0;i<idxTable.size();i++)
    {
        int fieldSize=stoi(idxTable[i].at(3));
        string data (s,stoi(idxTable[i].at(2)),fieldSize);
        string tmp=trim_right_inplace(data);
        recordVec.push_back( tmp );
    }
    mainTable.push_back(record);
    recordVec.clear();
    s="";
}
Now my question is: is there a better way to load the fixed-length data into memory? I tested this process with 60 tables of 200 records each, and it takes nearly 20 seconds. I want to load 100 tables of 200 records within one second. How can I improve the efficiency of this task?
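One thing that might be worth trying is converting the offsets and sizes from the index to numbers once, instead of calling stoi on them for every field of every line, and moving each finished record into the table instead of copying it. The sketch below shows that shape; I have not measured it against the code above, so treat it as something to try rather than a guaranteed fix. The Column struct and the readIndex and readTable functions are names invented for the example, and trailing padding is assumed to be spaces.

```cpp
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One row of the index file: name, type, offset and size of a column.
struct Column {
    std::string name;
    std::string type;
    std::size_t offset;
    std::size_t size;
};

// Parses the index once, so offsets and sizes are already numbers
// by the time the table file is read.
std::vector<Column> readIndex(std::istream& in) {
    std::vector<Column> columns;
    Column c;
    while (in >> c.name >> c.type >> c.offset >> c.size)
        columns.push_back(c);
    return columns;
}

// Cuts every line of the table into fields using the parsed index.
std::vector<std::vector<std::string>> readTable(
        std::istream& in, const std::vector<Column>& columns) {
    std::vector<std::vector<std::string>> table;
    std::string line;
    while (std::getline(in, line)) {
        std::vector<std::string> record;
        record.reserve(columns.size());
        for (const Column& c : columns) {
            // Guard against short lines: substr throws if the offset
            // lies beyond the end of the string.
            std::string field = c.offset < line.size()
                                    ? line.substr(c.offset, c.size)
                                    : std::string();
            // Trim trailing padding spaces.
            field.erase(field.find_last_not_of(' ') + 1);
            record.push_back(std::move(field));
        }
        table.push_back(std::move(record));
    }
    return table;
}
```

If loading is still slow after a change like this, it is probably worth profiling the file opening and the rest of the program, since a few thousand short records is a small amount of data.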
This is quite a primitive problem, so I guess the solution shouldn't be hard, but I haven't found a simple way to do it, nor have I managed to phrase it well enough to find it on the internet. So, on to the question. I have a file of information like this:
1988 Godfather 3 33 42
1991 Dance with Wolves 3 35 43
1992 Silence of the lambs 3 33 44
I need to put all of this information into a data structure: say an int year, a string name, and three more ints for the numbers. But how do I know whether the next thing I read is a number or not? I never know how many words the title has. Thank you in advance to anyone who takes the time for such a primitive problem. :)
EDIT: Don't consider movies with numbers in their title.
You're going to have some major issues when you go to try to parse other movies, like, Free Willy 2.
You might try instead to treat it as a std::stringstream and rely on the last three chunks being the data you're looking for rather than generalizing with a Regular Expression.
Your best bet would be to use C++ regex.
That would give you more fine-grained control over what you want to parse.
Examples:
year -> \d{4}
word -> \w+
number -> \d+
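As a sketch of how those pieces might be combined with std::regex: one pattern can capture the four-digit year, then the title, then the three trailing numbers. Anchoring the three numbers to the end of the line means a title such as "Free Willy 2" still ends up in the name. The Movie struct, its field names, and parseMovie are assumptions made for the example.

```cpp
#include <regex>
#include <string>

// Field names are made up for this example.
struct Movie {
    int year;
    std::string name;
    int first, second, third;
};

// Parses lines such as "1991 Dance with Wolves 3 35 43".
// Returns false when the line does not have the expected shape.
bool parseMovie(const std::string& line, Movie& movie) {
    // year, then the title (as short as possible), then exactly three
    // numbers that must be the last things on the line.
    static const std::regex pattern(
        R"(\s*(\d{4})\s+(.+?)\s+(\d+)\s+(\d+)\s+(\d+)\s*)");
    std::smatch match;
    // regex_match succeeds only if the whole line fits the pattern.
    if (!std::regex_match(line, match, pattern)) return false;
    movie.year = std::stoi(match[1].str());
    movie.name = match[2].str();
    movie.first = std::stoi(match[3].str());
    movie.second = std::stoi(match[4].str());
    movie.third = std::stoi(match[5].str());
    return true;
}
```

Call parseMovie on each line obtained with getline and store the Movie whenever it returns true.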
If you do not have control over the file format, you may want to do something along these lines (pseudo-process):
1) read in the line from the file
2) reverse the order of the "words" in the line
3) read in the 3 ints first
4) read in the rest of the stream as a string
5) reverse the "words" in the new string
6) read in the year
7) the remainder will be the movie title
Read every field as a string and then convert the appropriate strings to integers.
1) Initially
1983
GodFather
3
33
45
are all strings, stored in a vector of strings (vector<string>).
2) Then the first string (1983) is converted to an integer using atoi, and the last three strings are converted to integers as well. The remaining strings make up the movie_name.
The following code assumes that the format of the input file has already been validated.
// open the input file for reading
ifstream ifile(argv[1]);
string input_str;
//Read each line
while(getline(ifile,input_str)) {
    stringstream sstr(input_str);
    vector<string> strs;
    string str;
    while(sstr>>str)
        strs.push_back(str);
    //use the vector of strings to initialize the variables
    // year, movie name and last three integers
    unsigned int num_of_strs = strs.size();
    //a usable line has a year, at least one title word and three numbers
    if(num_of_strs<5)
        continue;
    //first string is year
    int year = atoi(strs[0].c_str());
    //last three strings are numbers
    int one_num = atoi(strs[num_of_strs-3].c_str());
    int two_num = atoi(strs[num_of_strs-2].c_str());
    int three_num = atoi(strs[num_of_strs-1].c_str());
    //rest correspond to movie name
    string movie_name("");
    //append the strings to form the movie_name;
    //i is declared outside the loop because it is used after it
    unsigned int i=1;
    for(;i<num_of_strs-4;i++)
        movie_name+=(strs[i]+string(" "));
    //last word of the title, without a trailing space
    movie_name+=strs[i];
}
IMHO, changing the delimiter in the file from a space to some other character such as , or ; or : would simplify the parsing significantly.
For example, if the data specification later changes so that either the last three or the last four fields can be integers, the code above will need major refactoring.
I'm trying to parse a text file whose output looks like the example below. My example has only a few entries, but the actual file has over 15000 lines, so I can't read them in individually:
ID IC TIME
15:23:43.867 /g/mydata/dataoutputfile.txt identifier
0003 1233 abcd
0043 eb54 abf3
000f 0bb4 ac24
000a a325 ac75
0023 0043 ac91
15:23:44.000 /g/mydata/dataoutputfile.txt identifier
0003 1233 abcd
0043 eb54 abf3
000f 0bb4 ac24
000a a325 ac75
0023 0043 ac91
That is roughly the output I have. The time column resets every so often.
What I am doing now is adding 2 columns to the 3 in my example. The first additional column translates the ID column into an understandable message. The second additional column holds the difference between consecutive time codes, except where the time code resets.
My plan is to read each column into an array so I can perform the necessary translations and operations.
I am focusing on getting the timecode differential first, as I think the translation will be a bit simpler.
The problem I'm having is getting the entries read into their arrays.
My code looks a bit like this:
while(readOK && getline(myfile,line))
{
    stringstream ss(line);
    string ident,IC,timehex,time,filelocation;
    string junk1,junk2;
    int ID[count];
    int timecode[count2];
    int idx=0;
    if(line.find("ID") !=string::npos)
    {
        readOK=ss>>ident>>IC>>timehex;
        myfile2<<ident<<"\t\t"<<IC<<"\t\t"<<timehex<<"\t\t"<<"ID Decoded"<<"\t\t"<<"DT"<<endl;
        myfile3<<"headers read"<<endl;
    }
    else if(line.find("identifier") != string::npos)
    {
        readOK=ss>>time>>filelocation;
        myfile3<<"time and location read";
        myfile2<<time<<"\t\t"<<filelocation<<endl;
    }
    else //this is for the hex code lines
    {
        readOK=ss>>hex>>ID[idx]>>IC>>timecode[idx];
        if (readOK)
        {
            myfile2<<setw(4)<<setfill('0')<<hex<<ID[1000]<<"\t\t"<<IC<<"\t\t"<<timecode[1000]<<endl;
            myfile3<<"success reading info into arrays"<<endl;
        }
        else
            myfile3<<"error reading hex codes"<<endl;
    }
    idx++;
}
This code doesn't work correctly, though. I can't read every line the same way because of the intervening time and file location entries, which are inserted to help keep track of where I am in the data.
My gut tells me that I'm accessing the array entries too early, before they have been filled, because if I cout entry number 1000, I get a 0 (I have well over 15000 lines in my input file, and the bounds of my arrays are set dynamically in another part of my program).
I can't seem to figure out how to get the entries assigned correctly as I am having some inheritance issues with the count variable resetting to 0 every time through the loop.
Define int idx outside the scope of the while loop (before the while). As it is now, it is reset to 0 each time through the loop.
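The same applies to anything else that has to survive from one line to the next, such as the containers holding the values. Here is a minimal sketch of that structure, with the storage declared once before the loop. It uses std::vector with push_back in place of the arrays and the idx counter, which is a substitution on my part, and the HexRecords struct and readHexRecords function are names invented for the example.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Values collected from the hex lines of the file.
struct HexRecords {
    std::vector<int> id;
    std::vector<int> timecode;
};

HexRecords readHexRecords(std::istream& in) {
    HexRecords records;  // declared once, before the loop, so it persists
    std::string line;
    while (std::getline(in, line)) {
        // Skip the column header and the time/file-location lines.
        if (line.find("ID") != std::string::npos ||
            line.find("identifier") != std::string::npos) {
            continue;
        }
        std::istringstream ss(line);
        int id = 0, ic = 0, timecode = 0;
        // std::hex makes the stream parse all three fields as hexadecimal.
        if (ss >> std::hex >> id >> ic >> timecode) {
            records.id.push_back(id);
            records.timecode.push_back(timecode);
        }
    }
    return records;
}
```

After the call, records.id.size() is the number of hex lines that were read, and the differences between neighbouring records.timecode entries can be computed in a second loop.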