Deleting duplicate entries in a log file in C++

I've written a program to parse through a log file. The file in question has around a million entries, and my intent is to remove all duplicate entries by date. So if there are 100 log-ins on a date, it will only show one log-in per name. The log output I've created is in the form:
AA 01/Jan/2013
AA 01/Jan/2013
BB 01/Jan/2013
etc. etc. all through the month of January.
This is what I've written so far. The constant i in the for loop is the number of entries to be sorted through, and namearr and datearr are the arrays used for names and dates. My end goal is to have no repeated values in the first field that correspond to each date. I'm trying to follow proper etiquette and protocols, so if I'm off base with this question I apologize.
My first thought in solving this myself was to nest a for loop to compare each name against all previous names for that date, but since I'm learning about data structures and algorithm analysis, I don't want the running time to creep up.
if (inFile.is_open())
{
    for (int a = 0; a < i; a++)
    {
        inFile >> name;                 // Read a name from the input file
        namearr[a] = name;              // Store the name in the array
        // If names are duplicates, erase them
        if (namearr[a] == temp)
        {
            inFile.ignore(1000, '\n');  // If duplicate, skip to next line
        }
        else
        {
            temp = name;
            inFile.ignore(1, ' ');
            inFile >> date;             // Read the date
            datearr[a] = date;          // Put the date into the array
            inFile.ignore(1000, '\n');  // Skip to next line
            cout << namearr[a] << " " << datearr[a] << endl;  // Output to window
            oFile << namearr[a] << " " << datearr[a] << endl; // Output to file
        }
    }
}

You'd be better off using a regular expression library to deal easily with a file of that size. Check Boost.Regex:
http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/index.html

You can construct a key composed of the name and the date with simple string concatenation. That string becomes the index to a map. As you are processing the file line by line, check to see if that string is already in the map. If it is, then you have encountered the name on that day once before. If you've seen it already do one thing, if it's new do another.
This is efficient because you're constructing a string that will only be found a second time if the name has already been seen on that date and maps efficiently search the space of keys to find if a key exists in the map or not.
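A minimal sketch of that approach using std::unordered_set (a map from key to anything would work the same way; the file names here are hypothetical):

#include <fstream>
#include <string>
#include <unordered_set>

int main()
{
    std::ifstream inFile("log.txt");       // hypothetical input file name
    std::ofstream oFile("deduped.txt");    // hypothetical output file name
    std::unordered_set<std::string> seen;  // name+date keys already written
    std::string name, date;
    while (inFile >> name >> date)
    {
        // insert() returns {iterator, bool}; the bool is true only for a new
        // key, so each name appears at most once per date in the output.
        if (seen.insert(name + '|' + date).second)
            oFile << name << ' ' << date << '\n';
    }
}

Each lookup and insert is O(1) on average, so the whole file is processed in a single linear pass instead of the nested loop the question was worried about.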

Related

How to read a specific line from a text file in C++?

I have a C++ program that displays on the screen item codes with corresponding item descriptions and prices. It asks the user to enter the code of the item purchased by a customer, and it looks for a match of the item code stored in items.txt. How can I output only a specific line from a text file after the user inputs the item code?
You need to read the file line by line (std::getline), extract the code (depending on the exact format, e.g. by searching for a whitespace character in the string), compare it, and then return the corresponding line on a match.
It is not possible to access lines from a text file directly by index or content.
This is assuming that you mean the file contains lines in the form
code1 item1
code2 item2
//...
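A minimal sketch of that line-by-line search under the format assumption above (the function name and the empty-string "not found" convention are my own):

#include <fstream>
#include <iostream>
#include <string>

// Return the full line whose first whitespace-separated field equals `code`,
// or an empty string if no line matches.
std::string findLineByCode(const std::string& path, const std::string& code)
{
    std::ifstream in(path);
    for (std::string line; std::getline(in, line); )
    {
        std::size_t space = line.find(' ');     // end of the code field
        if (line.compare(0, space, code) == 0)  // compare only that field
            return line;
    }
    return "";
}

int main()
{
    std::string code;
    std::cin >> code;  // code entered by the user
    std::cout << findLineByCode("items.txt", code) << '\n';
}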
If the code is just the index of the line, then you only need to call std::getline in a loop with a loop counter for the current index of the line.
If you do this multiple times on the same file, you should probably parse the whole content first line-by-line into a std::vector<std::string> or a std::(unordered_)map<std::string, std::string> or something similar to avoid the costly repeated iteration.
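And a sketch of that preloading idea for the repeated-lookup case (the names are mine again):

#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>

// Read the whole file once, keyed by the first whitespace-separated field,
// so each later lookup is an O(1) hash probe instead of a full file scan.
std::unordered_map<std::string, std::string> loadItems(const std::string& path)
{
    std::unordered_map<std::string, std::string> items;
    std::ifstream in(path);
    for (std::string line; std::getline(in, line); )
    {
        std::istringstream iss(line);
        std::string code;
        iss >> code;
        items[code] = line;
    }
    return items;
}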
Depending on the use case, maybe it would be even better to parse the data into a database first and then query the database, even if it is only e.g. sqlite or something like that.

C++ trying to read in malformed CSV with erroneous commas

I am trying to make a simple CSV file parser to transfer a large number of orders from an order system to an invoicing system. The issue is that the CSV I am downloading has erroneous commas which are sometimes present in the name field, and this throws the whole process off.
The company INSISTS, which is really starting to piss me off, that they are simply copying data they receive into the CSV, so it's valid data.
Excel mostly seems to interpret this correctly, or at least puts the data in the right field; my program, however, doesn't. I opened the CSV in Notepad++ and there are no quotes around strings, just raw strings separated by commas.
This is currently how I am reading the file.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

typedef vector<string> vstring; // see note below

vstring explode(string const& s, char delim);

int main()
{
    string t;
    getline(cin, t);
    string Output;
    string path = "in.csv";
    ifstream input(path);
    vstring readout;
    vstring contact, InvoiceNumber, InvoiceDate, DueDate, Description, Quantity,
        UnitAmount, AccountCode, TaxType, Currency, Allocator, test, Backup,
        AllocatorBackup;
    vector<int> read, add, total;
    if (input.is_open()) {
        for (string line; getline(input, line); ) {
            auto arr = explode(line, ',');
            contact.push_back(arr[7]);         // Source site is the customer in this instance.
            InvoiceNumber.push_back(arr[0]);   // OrderID will be the invoice number
            InvoiceDate.push_back(arr[1]);     // Purchase date
            DueDate.push_back(arr[1]);         // Same as order date
            Description.push_back(arr[0]);
            Quantity.push_back(arr[0]);
            UnitAmount.push_back(arr[10]);     // The total
            AccountCode.push_back(arr[7]);     // Will be set depending on other factors - but contains the site of purchase
            Currency.push_back(arr[11]);       // EUR/GBP
            Allocator.push_back(arr[6]);       // This will decide the VAT treatment normally.
            AllocatorBackup.push_back(arr[5]); // This will decide VAT treatment if the column is off by one.
            Backup.push_back(arr[12]);
            TaxType = Currency;
        }
    }
    return 0;
}

vstring explode(string const& s, char delim)
{
    vstring result;
    istringstream q(s);
    for (string token; getline(q, token, delim); ) {
        result.push_back(move(token));
    }
    return result;
}
vstring is an alias I created to save me typing vector<string> so often, so it's the same thing.
The issue is when I come across one of the fields with a comma in it (normally the name field, which is [3]): it of course pushes everything back by one, so account code becomes [8], etc. This is extremely troublesome, as in some cases it's difficult to tell whether or not I am dealing with correct data in the next field.
So two questions:
1) Is there any simple way I could detect this anomaly and correct for it that I've missed? I do of course try to check in my loop, where I can, whether valid data is where it's expected to be, but this is becoming messy and does not cope with more than one comma.
2) Is the company correct in telling me that it's "expected behavior" to allow commas entered by a customer to creep into this CSV without being processed, or have they completely misunderstood the CSV "standard"?
Retired Ninja mentioned in the comments that one approach would be to parse all the fields on either side of the 'problem field' first, and then put the remaining data into the problem field. This is the best approach if you know which field might contain corruption. If you don't know which field could be corrupted, you still have options, though!
You know:
The number of fields that should be present
Something about the type of data in each of those fields.
If you codify the types of the fields (implement classes for different data types, so your vectors of strings would become vectors of OrderIDs or Dates or Counts or....), you can test different concatenations (joining adjacent fields that are separated by a comma) and score them according to how many of the fields pass some data validation. You then choose the best scoring interpretation of the data. This would build some data validation into the process, and make everything a bit more robust.
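A rough sketch of that scoring idea, assuming exactly one stray comma (so a bad row has exactly one token too many); kExpectedFields and the per-column validators are hypothetical stand-ins for whatever the real columns require:

#include <functional>
#include <string>
#include <vector>

using vstring = std::vector<std::string>;

const std::size_t kExpectedFields = 13;  // hypothetical column count
std::vector<std::function<bool(const std::string&)>> validators;  // fill with one predicate per column

// Count how many fields pass their column's validator.
int score(const vstring& row)
{
    int s = 0;
    for (std::size_t i = 0; i < row.size() && i < validators.size(); ++i)
        if (validators[i](row[i]))
            ++s;
    return s;
}

// Try re-joining each adjacent pair of tokens (restoring the comma that split
// them) and keep the interpretation with the highest validation score.
vstring repairRow(const vstring& tokens)
{
    vstring best;
    int bestScore = -1;
    for (std::size_t i = 0; i + 1 < tokens.size(); ++i)
    {
        vstring candidate(tokens.begin(), tokens.begin() + i);
        candidate.push_back(tokens[i] + ',' + tokens[i + 1]);
        candidate.insert(candidate.end(), tokens.begin() + i + 2, tokens.end());
        if (candidate.size() != kExpectedFields)
            continue;
        int s = score(candidate);
        if (s > bestScore)
        {
            bestScore = s;
            best = candidate;
        }
    }
    return best;
}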
'csv' is not that well defined. There is the standard way, where ',' separates the columns and '\n' the rows. Sometimes '"' is used to quote a field so that these symbols can appear inside it, but Excel adds the quotes only if a control character is involved.
Here is the definition from Wikipedia:
RFC 4180 formalized CSV. It defines the MIME type "text/csv", and CSV files that follow its rules should be very widely portable. Among its requirements:
- MS-DOS-style lines that end with (CR/LF) characters (optional for the last line).
- An optional header record (there is no sure way to detect whether it is present, so care is required when importing).
- Each record "should" contain the same number of comma-separated fields.
- Any field may be quoted (with double quotes).
- Fields containing a line-break, double-quote or commas should be quoted. (If they are not, the file will likely be impossible to process correctly.)
- A (double) quote character in a field must be represented by two (double) quote characters.
(Wikipedia: Comma-separated values)
Keep in mind that Excel has different settings depending on the system and the system's language settings. It might be that their Excel is parsing it correctly, but somewhere else it isn't.
For example, in countries like Germany, ';' is used to separate the columns. The decimal separators differ as well:
1.5 << English
1,5 << German
Same goes for the thousands separator:
1,000,000 << English
1.000.000 << German
or
1 000 000 << also German
Now, Excel also has different CSV export settings like .csv (separated values), .csv (Macintosh) and .csv (MS-DOS), so I guess there can be differences there too.
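As an aside, if you ever have to parse such localized numbers in C++, stream locales handle the separators; a small sketch (the locale name "de_DE.UTF-8" is an assumption and is platform-dependent):

#include <iostream>
#include <locale>
#include <sstream>

int main()
{
    std::istringstream german("1,5");
    try
    {
        // Named locales vary by platform; this one may not be installed.
        german.imbue(std::locale("de_DE.UTF-8"));
        double value = 0.0;
        german >> value;  // parsed as 1.5 under the German locale
        std::cout << value << '\n';
    }
    catch (const std::runtime_error&)
    {
        std::cout << "locale not available on this system\n";
    }
}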
Now, for your questions: in my opinion, they are not clearly wrong in what they are doing with their files. But you should think about agreeing on an (E)BNF with them. Here are some links:
BNF
EBNF
It is a grammar which you decide on together, and with clear definitions the code should be no problem. I know customers can block something like this because they don't want the extra work, but it is simply the best solution. If you want '"' quoting in your file, they should provide it somehow. I don't know how they copy their data, but it should also be done by some kind of program (I don't think they do this by hand?), so your code and their code should use the same (E)BNF, which you decide on together with them.

How to use regexp to find unique combinations of letters and use them as variables in Matlab?

I have the file names of four files stored in a cell array called F2000. These files are named:
L14N_2009_2000MHZ.txt
L8N_2009_2000MHZ.txt
L14N_2010_2000MHZ.txt
L8N_2010_2000MHZ.txt
Each file consists of an mxn matrix where m is the same but n varies from file to file. I'd like to store each of the L14N files and each of the L8N files in two separate cell arrays so I can use dlmread in a for loop to store each text file as a matrix in an element of the cell array. To do this, I wrote the following code:
idx2009 = ~cellfun('isempty', regexp(F2000, 'L\d{1,2}N_2009_2000MHZ\.txt'));
F2000_2009 = F2000(idx2009);
idx2010 = ~idx2009;
F2000_2010 = F2000(idx2010);
cell2009 = cell(size(F2000_2009));
cell2010 = cell(size(F2000_2010));
for k = 1:numel(F2000_2009)
    cell2009{k} = dlmread(F2000_2009{k});
end
and repeated a similar "for" loop to use on F2000_2010. So far so good. However.
My real data set is much larger than just four files. The total number of files will vary, although I know there will be five years of data for each L\d{1,2}N (so, for instance, L8N_2009, L8N_2010, L8N_2011, L8N_2012, L8N_2013). I won't know what the number of files is ahead of time (although I do know it will range between 50 and 100), and I won't know what the file names are, but they will always be in the same L\d{1,2}N format.
In addition to what's already working, I want to count the number of files that have unique combinations of numbers in the portion of the filename that says L\d{1,2}N so I can further break down F2000_2010 and F2000_2009 in the above example to F2000_2010_L8N and F2000_2009_L8N before I start the dlmread loop.
Can I use regexp to build a list of all of my unique L\d{1,2}N occurrences? Next, can I easily change these list elements to strings to parse the original file names and create a new file name to the effect of L14N_2009, where 14 comes from \d{1,2}? I am sure this is a beginner question, but I discovered regexp yesterday! Any help is much appreciated!
Here is some code which might help:
% Find all the files in your directory
files = dir('*2000MHZ.txt');
files = {files.name};
% Match identifiers
ids = unique(cellfun(@(x) x{1}, regexp(files, 'L\d{1,2}N', 'match'), ...
    'UniformOutput', false));
% Find all years
years = unique(cellfun(@(x) x{1}, regexp(files, '(?<=L\d{1,2}N_)\d{4,}', 'match'), ...
    'UniformOutput', false));
% Find the years for each identifier
for id_ix = 1:length(ids)
    % There is probably a better way to do this
    list = regexp(files, ['(?<=' ids{id_ix} '_)\d{4,}'], 'match');
    ids_years{id_ix} = cellfun(@(x) x{1}, list(cellfun( ...
        @(x) ~isempty(x), list)), 'UniformOutput', false);
end
% If you need dynamic naming, I would suggest dynamic struct names:
for ix_id = 1:length(ids)
    for ix_year = 1:length(ids_years{ix_id})
        % The 'Y' is in the dynamic name because all struct field names must start with a letter
        data.(ids{ix_id}).(['Y' ids_years{ix_id}{ix_year}]) = ...
            'read in my data here for each one';
    end
end
Also, if anyone is interested in mapping keys to values, try looking into the containers.Map class.

Splitting an ifstream in C++

I'm new to C++ and probably have a silly question. I have an ifstream which I'd like to split approximately in half.
The file in question is a sorted csv and I wish to search on the first value of each line of the file.
Eventually the file will be very large so I am trying to avoid having to read every line of the file.
e.g.
If the file contains 7 lines I'd like to split the ifstream to give 1 stream containing the first 3 lines and 1 stream containing the last 4 lines.
First, use the answer to this question to determine the size of your file. Then divide that number by two. Read the input line by line, and write it to the first output stream; check file.tellg() after each call. Once you're past the half-way point, switch the output to the second file.
This wouldn't split the strings evenly between the files, but the total number of characters in these strings should be close enough, and it wouldn't split your file in the middle of a string.
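A minimal sketch of that byte-midpoint split (all file names are hypothetical):

#include <fstream>
#include <string>

int main()
{
    std::ifstream in("data.csv", std::ios::binary);
    in.seekg(0, std::ios::end);
    const std::streamoff half = static_cast<std::streamoff>(in.tellg()) / 2;
    in.seekg(0, std::ios::beg);

    std::ofstream first("first_half.csv"), second("second_half.csv");
    bool pastHalf = false;
    for (std::string line; std::getline(in, line); )
    {
        (pastHalf ? second : first) << line << '\n';
        // tellg() reports the current read position; once it crosses the
        // midpoint, the remaining lines go to the second file.
        if (!pastHalf && static_cast<std::streamoff>(in.tellg()) >= half)
            pastHalf = true;
    }
}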
Think of it as a relational database with one huge table. In order to find a certain piece of data, you can either do a sequential scan over the entire table, or use an index (which must be usable for the type of query you want to perform).
A typical index for a text file would be a list of offsets inside the file, sorted by the index expression. If the csv file is sorted by a specific column already, then the offsets in the index would be ascending, which is useful to know when building the index.
So basically you have to read the file once anyway, to find out where lines end; this is the index for the sort column. To find a particular element, use a binary search, using the index to find individual elements in the data set.
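A minimal sketch of that offset index plus a binary search over it, assuming the file is sorted ascending by its first comma-separated column (all names here are my own):

#include <fstream>
#include <string>
#include <vector>

// Record the byte offset at which each line starts; this is the index.
std::vector<std::streamoff> buildIndex(std::ifstream& in)
{
    std::vector<std::streamoff> offsets;
    in.clear();
    in.seekg(0);
    std::string line;
    while (true)
    {
        std::streamoff pos = static_cast<std::streamoff>(in.tellg());
        if (!std::getline(in, line))
            break;
        offsets.push_back(pos);
    }
    return offsets;
}

// Re-read the single line starting at a given offset.
std::string lineAt(std::ifstream& in, std::streamoff off)
{
    in.clear();
    in.seekg(off);
    std::string line;
    std::getline(in, line);
    return line;
}

// The sort key: the first comma-separated field of a line.
std::string keyOf(const std::string& line)
{
    return line.substr(0, line.find(','));
}

// Binary search over the index; only O(log n) lines are actually read.
std::string findByKey(std::ifstream& in, const std::vector<std::streamoff>& idx,
                      const std::string& key)
{
    std::size_t lo = 0, hi = idx.size();
    while (lo < hi)
    {
        std::size_t mid = lo + (hi - lo) / 2;
        std::string line = lineAt(in, idx[mid]);
        if (keyOf(line) == key)
            return line;
        if (keyOf(line) < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return "";  // not found
}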
Depending on the data type, you can extend your index to allow for quick comparison without reading the actual data table. For example, in a word list you could keep the first four letters of the word next to the offset, which allows you to get into the right area quickly and only requires data reads for the last accesses (which you can then optimize to a sequential scan, as filesystems handle that a lot better).
The same technique can be applied to the other columns as well; the offsets stored in the index would no longer be ascending in file order, of course.
Since it is CSV data, a special case also applies: If the only index you have is in the same order as the file data itself and the end of record can be determined easily (that is, either you have a fixed record length, or there is a clear record separator, such as an EOL character), then building the actual index can be omitted and the values guessed (for fixed length records, offset is always equal to record length times offset in the index; for separated records you can just jump into the middle of a record and seek for the next terminator; be aware that there are nasty corner cases with binary search here). This does however mean that you will always be reading data pages here, which is less efficient than just reading the index.

How to parse a text-based table in C++

I am trying to parse a table in the form of a text file using ifstream, and to evaluate/manipulate each entry. However, I'm having trouble figuring out how to approach this because of omissions of particular items. Consider the following table:
NEW  VER  ID  NAME
1    2a   4   "ITEM ONE" (2001)
     1    7   "2 ITEM" (2002) {OCT}
     1.1  10  "SOME ITEM 3" (2003)
     1    12  "DIFFERENT ITEM 4" (2004)
1    a4   16  "ITEM5" (2005) {DEC}
As you can see, sometimes the "NEW" column has nothing in it. What I want to do is take note of the ID, the name, the year (in brackets), and note whether there are braces or not afterwards.
When I started doing this, I looked for a "split" function, but I realized that it would be a bit more complicated because of the aforementioned missing items and the titles becoming separated.
The one thing I can think of is reading each line word by word, keeping track of the latest number I saw. Once I hit a quotation mark, I would note that the latest number I saw was an ID (with something like a split, it would be the array position right before the quotation mark), then keep a record of everything until the next quote (the title), and finally start looking for brackets and braces for the other information. However, this seems really primitive, and I'm looking for a better way to do it.
I'm doing this to sharpen my C++ skills and work with larger, existing datasets, so I'd like to use C++ if possible, but if another language (I'm looking at Perl or Python) makes this trivially easy, I could just learn how to interface a different language with C++. What I'm trying to do now is just sifting data anyways which will eventually become objects in C++, so I still have chances to improve my C++ skills.
EDIT: I also realize that this is possible to complete using only regex, but I'd like to try using different methods of file/string manipulation if possible.
If the column offsets are truly fixed (no tabs, just true space chars a la 0x20), I would read it a line at a time (std::getline) and break each line down using the fixed offsets into a set of four strings (std::string::substr).
Then postprocess each 4-tuple of strings as required.
I would not hard-code the offsets, store them in a separate input file that describes the format of the input - like a table description in SQL Server or other DB.
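A short sketch of that approach; the offsets below are guesses from the sample table, and hard-coding them is exactly what the previous paragraph advises against, so treat them as placeholders for values read from a format-description file:

#include <fstream>
#include <string>
#include <utility>
#include <vector>

int main()
{
    // {start, length} for the NEW, VER, ID and NAME columns; hypothetical values.
    const std::vector<std::pair<std::size_t, std::size_t>> cols = {
        {0, 5}, {5, 5}, {10, 4}, {14, std::string::npos}};

    std::ifstream in("table.txt");  // hypothetical file name
    std::string line;
    std::getline(in, line);         // skip the header row
    while (std::getline(in, line))
    {
        std::vector<std::string> fields;
        for (const auto& c : cols)
            fields.push_back(c.first < line.size() ? line.substr(c.first, c.second)
                                                   : "");
        // Postprocess the 4-tuple here: trim blanks, parse the year, etc.
    }
}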
Something like this:
Read the first line, find "ID", and store its index.
Read each data line using std::getline().
Create a substring from each data line, starting at the index where you found "ID" in the header line, and use it to initialize a std::istringstream.
Read the ID using iss >> an_int.
Search for the first ", then for the second ". Search for the ( and remember its index; search for the ) and remember that index too. Create a substring from the characters between those indexes, initialize another std::istringstream with it, and read the number from that stream.
Search for the braces.
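A rough translation of those steps into code (the file name is hypothetical, and malformed lines are not defended against; this is a sketch, not a hardened parser):

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream in("table.txt");                // hypothetical file name
    std::string header;
    std::getline(in, header);
    const std::size_t idCol = header.find("ID");  // step 1: column of "ID"

    for (std::string line; std::getline(in, line); )  // step 2
    {
        std::istringstream iss(line.substr(idCol));   // step 3
        int id = 0;
        iss >> id;                                    // step 4

        // Step 5: the quoted name, then the parenthesized year.
        std::size_t q1 = line.find('"');
        std::size_t q2 = line.find('"', q1 + 1);
        std::string name = line.substr(q1 + 1, q2 - q1 - 1);
        std::size_t lp = line.find('(', q2);
        std::size_t rp = line.find(')', lp);
        int year = 0;
        std::istringstream(line.substr(lp + 1, rp - lp - 1)) >> year;

        // Step 6: note whether a brace annotation follows.
        bool braces = line.find('{', rp) != std::string::npos;

        std::cout << id << " | " << name << " | " << year
                  << (braces ? " | {...}" : "") << '\n';
    }
}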