Need to keep duplicates and delete all unique lines - regex

I have two large text files
Looking for something that compares the two files for lines containing the same string and deletes all the rest. Hope that makes sense.
Example:
list1.txt
number1:1010:1020:1030
number2:1010:1020:1030
number3:1010:1020:1030
number4:1010:1020:1040
list2.txt
number1
number2
number3
number100
output=
number1:1010:1020:1030
number2:1010:1020:1030
number3:1010:1020:1030
Anything that can do this? I would really appreciate help, thank you.

I'm going to sketch what you should program.
Read file1.txt into a List<String> line by line.
Same for file2.txt.
Loop over the lines of file 1, and check whether each line .contains() any of the lines you read from file 2 (by nesting another loop).
If it does contain a line from file 2, immediately go on to the next line of file 1 (by continuing the outer loop). If it does not, delete that element from the list (and don't forget to decrease the loop variable, since you deleted the item at the current index).
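A minimal sketch of that approach in C++ (the description above matches Java's List<String> and .contains(); here std::vector<std::string> and std::string::find play those roles, and the file names come from the question):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Read every line of a file into a vector of strings.
std::vector<std::string> readLines(const std::string& path) {
    std::vector<std::string> lines;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}

int main() {
    std::vector<std::string> list1 = readLines("list1.txt");
    std::vector<std::string> list2 = readLines("list2.txt");

    // Outer loop over file 1; inner loop checks whether the current
    // line contains any line of file 2.
    for (auto it = list1.begin(); it != list1.end(); ) {
        bool matched = false;
        for (const std::string& key : list2) {
            if (it->find(key) != std::string::npos) {
                matched = true;   // found a match, keep this line
                break;
            }
        }
        if (matched)
            ++it;
        else
            it = list1.erase(it);   // erase() returns the next iterator,
                                    // replacing the manual index decrement
    }

    for (const std::string& line : list1)
        std::cout << line << '\n';
}

Note that this is a plain substring match, just like .contains(): a key such as number1 would also match a line for number10 if one existed.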

Set Difference in Notepad++ with Regexes

Suppose I have two files main.txt and sub.txt. Suppose both files have unique lines, i.e. the same line of text does not occur twice in either file. Also suppose there are no empty lines in either file. Now, consider the files as sets of strings, with each member of the set occurring on a line. This is possible because of our uniqueness condition. Now suppose sub.txt is a subset of main.txt in this way. How do we compute the set difference of main.txt and sub.txt to produce a new file diff.txt? To be clear, the lines of diff.txt should be those that occur in main.txt but not sub.txt. There should be no empty lines in diff.txt. Order in diff.txt is irrelevant.
Example
main.txt:
Hello
World
How
You
Are
sub.txt:
World
Hello
diff.txt:
How
Are
You
Bonus Questions
How can I tell that one set is actually a subset of the other? This is an assumption in the question, but in practice we mightn't know this for sure and would want a way to check it automatically.
How can I tell if the lines in each file are truly unique?
How can I tell if there are no blank lines?
Bonus Answer
I'll answer the bonus questions first. Follow these steps in order to ensure the right conditions hold as stated in the question:
Open both files in Notepad++ and close any other files
Lexicographically sort each file: https://superuser.com/questions/762279/sorting-lines-in-notepad-without-the-textfx-plugin
Ensure that the following regex has no matches in either file, which will guarantee they're duplicate-free: ^(.+$\r\n)\1. If you want to remove duplicates, replace all occurrences of that regex with \1.
Ensure there are no blank lines in either file by searching for ^$. If any are found, you can delete them manually.
Create a third file and paste the contents of both sub.txt and main.txt into this file. Then lexicographically sort it. Count the number of occurrences of the regex ^(.+$)\r\n\1 to detect duplicate lines. If the count matches the number of lines in sub.txt, then it's a subset of main.txt. Keep this file for later.
Main Answer
In the third file you created in the last part, search for ^(.+$)\r\n\1\r?\n? and replace with the empty string. This will remove all elements of sub.txt from main.txt leaving you with diff.txt.
Note: This approach may leave a single blank line at the end of diff.txt, in the case where a duplicate was found at the very end of the file. If so, just delete it manually.
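To make this concrete with the example files above: after pasting sub.txt and main.txt together and sorting, the third file contains
Are
Hello
Hello
How
World
World
You
and replacing every match of ^(.+$)\r\n\1\r?\n? with the empty string removes both copies of each duplicated line, leaving
Are
How
You
which is exactly diff.txt (order is irrelevant).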

Incrementing Integer Inside Lines of a Text File

I have large text files with about 6 lines/instances of 3_xcalc_59 in which 59 is some 2-digit integer.
I am looking to increment these values in the text file by 1 every time I run the program.
I know I can increment a value defined in the code, but how can I increment an integer inside a line of text?
I was thinking the first part of the process would involve reading these lines and assigning them to string variables or a list, but I am not sure how to even do that.
I can find the lines by writing if line.startswith("3_xcalc"), but I'm not sure how to assign them to a list.
Simply writing
for line in open(inputfile, "w"):
    line.startswith("3_xcalc") = listoflinesstartingwith3xcalc
tells me "can't assign to function call", so that doesn't work, but I'm not sure what else to try.
Thank you.

Reading a line of a text file from a specific position in C++

I would like to read a text file in C++ in the following manner:
Ignore the entire first line as it is simply meant as an introduction.
Only read the following lines from a specific position.
That starting position for reading is a fixed one and remains the same for every line; however, the numbers after that may be of variable length. I need to save all of these numbers from line 2 to line n into an Array.
At the moment, I can read a regular 2D array with getline.
How can I work around these things?
An example for a line I want to read could be:
Person1: 25 988.3 0.0023 7
To move to a specific position in the file, use std::ifstream::seekg().
To position at the beginning of a particular line, however, you must read and count the line endings, because many text files have lines of variable length.
How can I work around these things?
You can't, unless you can ensure that all of the data lines after the first are the same length.
If you can't ensure that, then all you can do is read through all of the preceding lines.
An alternative I have employed in the past is to generate an 'index' of line start positions in a secondary file in binary format (so that I CAN jump directly to the right place in that file), and use that to jump to the right place in the text file. Of course that means that you need to regenerate that index file every time you replace/amend the data file.
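For the common case of simply reading through the preceding lines, here is a short C++ sketch (the file name data.txt is a placeholder, and the fixed start column of 9 comes from counting the characters in "Person1: "):

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("data.txt");            // file name is a placeholder
    std::string line;
    std::getline(in, line);                  // ignore the introductory first line

    const std::string::size_type start = 9;  // fixed column where the numbers
                                             // begin ("Person1: " is 9 chars)
    std::vector<std::vector<double>> rows;
    while (std::getline(in, line)) {
        if (line.size() <= start) continue;  // skip lines that are too short
        std::istringstream fields(line.substr(start));
        std::vector<double> row;
        double value;
        while (fields >> value)              // the numbers may vary in count
            row.push_back(value);            // and in length
        rows.push_back(row);
    }
    // rows[i][j] is now the j-th number on line i+2 of the file.
}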

Algorithm for writing limited number of lines to text file

I have a program where I need to write text lines to a log file very frequently. I would like to limit the number of lines in the log file to 1000. When I write lines to the file, it should append them normally. Once the file reaches 1000 lines, I'd like to get rid of the first line and then append the new one. Does anyone know if there is a way to do this without rewriting the entire file each time?
Generally it's a little bit better for a case like this to remove more than one line at a time from the beginning.
That is, if your limit is 1000 lines, and you hit 1000 lines, delete the first 300 or so, and then resume writing. That way, you're not performing the delete operation with every single line written thereafter, only every 300 times. If you need to persist 1000 lines, then instead keep up to 1300 and delete 300 when 1300 is reached.
All files have to be aligned to the filesystem's cluster size, so no, there's no way: you can append a line to a file, but you can't delete the first line without rewriting the file.
You can use two files, alternating between them.
Or use some buffer in memory and flush it periodically.
I think you still have to scan the file to find out how many lines it currently contains. Given that, you could put the contents into some sort of buffer that you can easily add to and delete from.
Then you do your logging, and when you are done you can rewrite the file from the buffer (or only its last 1000 lines).
Other alternatives are discussed above.
And yeah, try to avoid deleting line by line; it is generally a costly operation.
I've found some similar topics here and on CodeProject:
Small logger class;
Flexible logger class using standard streams in C++
http://www.codeproject.com/Articles/584794/Simple-logger-for-Cplusplus
Hope you find them useful :)
Any time you want to log, you can open the file, read your write index, seek to that position, and write the fixed-width log entry. When your index hits your upper threshold, simply wrap it back to 0.
There are a lot of caveats with this, though. First, each proper log entry (assuming you close the file in between) will require an open, a read, a seek, a write, a seek, a write, and a close: find your index, go to it, write the new entry, then update your index. You also have the inherent issues of writing a fixed-size data element. And a human reader will depend on your content to know where the "beginning" of the file is; most people expect line 1 to be the first line.
I'm a much bigger advocate for simply having a few files and "rolling" them, so that each file on its own is coherent, but if you want just one file with a fixed number of lines, the circular buffer idea can work.
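A minimal sketch of that circular scheme, assuming fixed 128-byte entries, 1000 slots, and (an assumption beyond the answer above) that the write index lives in the file's first slot; log.bin must be created and zero-filled once beforehand:

#include <fstream>
#include <string>

const std::size_t kEntrySize  = 128;   // fixed width per entry (assumption)
const std::size_t kMaxEntries = 1000;  // number of circular slots

void appendLogEntry(const std::string& message) {
    std::fstream f("log.bin", std::ios::in | std::ios::out | std::ios::binary);

    std::size_t index = 0;                                   // read write index
    f.read(reinterpret_cast<char*>(&index), sizeof index);   // from slot 0

    std::string entry = message.substr(0, kEntrySize - 1);
    entry.resize(kEntrySize - 1, ' ');                       // pad to fixed width
    entry += '\n';                                           // keep it readable

    f.seekp(static_cast<std::streamoff>((index + 1) * kEntrySize));
    f.write(entry.data(), static_cast<std::streamsize>(kEntrySize));

    index = (index + 1) % kMaxEntries;                       // wrap back to 0
    f.seekp(0);                                              // update the index
    f.write(reinterpret_cast<const char*>(&index), sizeof index);
}

Storing the index inside the file keeps everything in one place, at the cost of the extra seek/write pair the answer warns about.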
When you only want to use one file and the lines are not of constant length, there is no way to do this without rewriting the whole file.
Depending on how often you append to the file, that need not be a problem: 1000 lines of roughly 100 characters each is only about 100 KB, which is not much. Additionally, you may add some hysteresis.
However:
If the line length is constant (or you hard-limit it to some constant), you could just overwrite the oldest line. But then you have to keep track of the positions of the oldest and newest lines in the log file.
I would use two files: the first one is where you append lines. When the file gets full, rename it to the second one, and start filling the first one from the beginning again.
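A sketch of that two-file rotation (the file names, the 1000-line limit, and keeping the line counter in memory are all assumptions):

#include <cstdio>
#include <fstream>
#include <string>

const int kMaxLines = 1000;   // limit per file (assumption)
int gLineCount = 0;           // lines in the current file, tracked in memory

void logLine(const std::string& line) {
    if (gLineCount >= kMaxLines) {
        std::remove("log.old");              // drop the previous generation
        std::rename("log.txt", "log.old");   // the full file becomes the old one
        gLineCount = 0;                      // start the first file over
    }
    std::ofstream out("log.txt", std::ios::app);
    out << line << '\n';
    ++gLineCount;
}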

Append data columnwise in C/C++

I want to add columns of data to a text file, one column in each iteration (with one space between columns). If I open the file for appending, it adds the next column at the bottom of the first column. Is it possible to append sideways?
Not all of the data is available at the start. Only one column of data becomes available in each iteration, and it is lost in the next iteration.
Consider the file to be one long stream of characters, some of which just happen to be line breaks. Appending always starts at the end of the file. If I'm reading you right, you need to use seekp() (seek to the new position at which to put characters) on your fstream to get to the right position before writing.
You know the format of your file, so you can calculate how far to skip in each line.
Something like this might work:
read line
while line != "":
    skip forward the right number of " "
    write new column
    read new line
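One caveat: writing into the middle of a file overwrites characters rather than inserting them, so the seekp route only works if every line is already padded to a known width. A simple way to realize the pseudocode is to read the lines, extend them in memory, and rewrite the file. A sketch, where the file name and the value type are placeholders:

#include <fstream>
#include <string>
#include <vector>

// Append one value per line to "columns.txt" (a placeholder name).
// 'column' holds the data that is only available in the current iteration.
void appendColumn(const std::vector<double>& column) {
    std::vector<std::string> lines;
    std::ifstream in("columns.txt");
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    in.close();

    lines.resize(column.size());             // grow if the file was shorter

    std::ofstream out("columns.txt", std::ios::trunc);
    for (std::size_t i = 0; i < column.size(); ++i) {
        if (!lines[i].empty())
            lines[i] += ' ';                 // one space between columns
        out << lines[i] << column[i] << '\n';
    }
}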