Set Difference in Notepad++ with Regexes

Suppose I have two files, main.txt and sub.txt. Suppose both files have unique lines, i.e. the same line of text does not occur twice in either file. Also suppose there are no empty lines in either file. Now consider the files as sets of strings, with each member of the set occurring on its own line. This is possible because of our uniqueness condition. Now suppose sub.txt is a subset of main.txt in this sense. How do we compute the set difference of main.txt and sub.txt to produce a new file diff.txt? To be clear, the lines of diff.txt should be those that occur in main.txt but not in sub.txt. There should be no empty lines in diff.txt. Order in diff.txt is irrelevant.
Example
main.txt:
Hello
World
How
You
Are
sub.txt:
World
Hello
diff.txt:
How
Are
You
Bonus Questions
How can I tell that one set is actually a subset of the other? This is an assumption in the question, but in practice we mightn't know this for sure and would want a way to check it automatically.
How can I tell if the lines in each file are truly unique?
How can I tell if there are no blank lines?

Bonus Answer
I'll answer the bonus questions first. Follow these steps in order to ensure the right conditions hold as stated in the question:
Open both files in Notepad++ and close any other files
Lexicographically sort each file: https://superuser.com/questions/762279/sorting-lines-in-notepad-without-the-textfx-plugin
Ensure that the following regex has no matches in either file, which guarantees they are duplicate-free: ^(.+$\r\n)\1. If you want to remove duplicates, replace all occurrences of that regex with \1.
Ensure there are no blank lines in either file by searching for ^$. If any are found you can delete them manually.
Create a third file and paste the contents of both sub.txt and main.txt into it, then sort it lexicographically. Count the matches of the regex ^(.+$)\r\n\1$ to detect duplicate lines (the trailing $ stops a line from matching a longer line that merely begins with it). If the count equals the number of lines in sub.txt, then sub.txt is a subset of main.txt. Keep this file for later.
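For readers who would rather verify the same three conditions outside Notepad++, here is a minimal Python sketch. It is not part of the original answer; the function name check_preconditions and the in-memory line lists are assumptions made for illustration, using the example data from the question.

```python
from typing import Dict, List


def check_preconditions(main_lines: List[str], sub_lines: List[str]) -> Dict[str, bool]:
    """Report whether the question's assumptions hold for two lists of lines.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    return {
        # No line appears twice within the same file.
        "main_unique": len(set(main_lines)) == len(main_lines),
        "sub_unique": len(set(sub_lines)) == len(sub_lines),
        # No empty lines in either file.
        "no_blank_lines": all(line != "" for line in main_lines + sub_lines),
        # Every line of sub.txt also occurs in main.txt.
        "sub_is_subset": set(sub_lines) <= set(main_lines),
    }


# Example data taken from the question.
main_lines = ["Hello", "World", "How", "You", "Are"]
sub_lines = ["World", "Hello"]
report = check_preconditions(main_lines, sub_lines)
print(report)
```

If any value in the report is False, the corresponding cleanup step above should be applied before computing the difference.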
Main Answer
In the third file you created in the last part, search for ^(.+$)\r\n\1$\r?\n? and replace with the empty string. The $ after \1 ensures that only whole identical lines are treated as a pair. This removes every line that also occurs in sub.txt, leaving you with diff.txt.
Note: This approach may leave you with a single blank line at the end of diff.txt, in the case where there was a duplicate found there. In that case, just delete it manually.
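The same sort-then-delete-pairs idea can be sketched in Python, which may help when checking the result of the Notepad++ replacement. This is an illustration under assumptions, not the answer's own method: it normalizes line endings to \n (so the pattern uses \n instead of \r\n), and the function name set_difference is made up for the example.

```python
import re


def set_difference(main_text: str, sub_text: str) -> str:
    """Emulate the merge, sort, and remove-adjacent-duplicates procedure.

    Assumes both inputs have unique, non-empty lines and that sub_text
    is a subset of main_text, as stated in the question.
    """
    lines = main_text.splitlines() + sub_text.splitlines()
    # Sorting places each shared line directly next to its twin.
    merged = "\n".join(sorted(lines)) + "\n"
    # Delete every adjacent pair of identical whole lines.
    return re.sub(r"^(.+)\n\1\n", "", merged, flags=re.MULTILINE)


main_text = "Hello\nWorld\nHow\nYou\nAre\n"
sub_text = "World\nHello\n"
diff_text = set_difference(main_text, sub_text)
print(diff_text, end="")
```

Because every line in the merged text ends with \n, this version does not leave a stray blank line at the end.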

Related

How to ignore specific characters and new lines using regex

I am trying to validate a csv file using Apache-NiFi.
My CSV file has some defects.
id,name,address
1,sachith,{"Lane":"ABC.RTG.EED","No":"12"}
2,nalaka,{"Lane":"DEF",
"No":"23"}
3,muha,{"Lane":"GRF.FFF","No":"%$&%*^%"}
Here the second row has been split across two lines, and the third row contains some special characters.
I want to ignore both of these rows. For that I use \{("\w+":"\w+",)*[^%&*#]*\}, but this does not catch the split row and the new line.
I also tried \{("\w+":"\w+",)*[^%&*#]*\}$, but that does not give the right answer either.
This might be what you are looking for: ^[0-9]+,[a-z]+,\{("\w+":"[\w\.]+","\w+":"[a-zA-Z0-9]+")\}$
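To see which rows the suggested pattern accepts, it can be tried against the sample data with Python's re module. This is only a sketch: NiFi uses Java regular expressions, and the assumption here is that this particular pattern behaves the same in both engines. The helper name valid_rows is hypothetical.

```python
import re

# The pattern suggested in the answer, unchanged.
ROW_PATTERN = re.compile(
    r'^[0-9]+,[a-z]+,\{("\w+":"[\w\.]+","\w+":"[a-zA-Z0-9]+")\}$'
)


def valid_rows(lines):
    """Return only the lines that match the full-row pattern."""
    return [line for line in lines if ROW_PATTERN.match(line)]


# Sample file from the question, one physical line per list entry.
sample = [
    'id,name,address',
    '1,sachith,{"Lane":"ABC.RTG.EED","No":"12"}',
    '2,nalaka,{"Lane":"DEF",',
    '"No":"23"}',
    '3,muha,{"Lane":"GRF.FFF","No":"%$&%*^%"}',
]
accepted = valid_rows(sample)
print(accepted)
```

Note that the header line is rejected as well, since it does not start with a digit.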

Remove line entirely (not to leave it empty)

This is what I have in doc:
1;01878916;BC101;FALSE
16;01978916;BC101;FALSE
17;0195B4E5;BC101;FALSE
19;0197D016;BC101;FALSE
After I run find and replace with ^((1|17);.+?)$ and an empty replacement, it leaves:
blankrow
16;01978916;BC101;FALSE
blankrow
19;0197D016;BC101;FALSE
and then I have to run find and replace with \s+$ to remove the empty line(s), and manually remove the first empty line.
I'm weak with regex; I tried to combine those two commands into one.
What should the regex be to remove those rows entirely, without leaving empty rows behind?
To get
16;01978916;BC101;FALSE
19;0197D016;BC101;FALSE
Thanks in advance. I need to have regex commands in order to run FIND and Replace in all open files, because I'm doing this in 10 files at once. Line operations > Remove blank lines is not an option.
The regex:
^(1|17);.+?\s+
mentioned above works well here as there is no whitespace at the beginning of the lines you want to keep. If that's ever not the case, you can also do:
\s+^(1|17);.+?$
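The effect of the first regex can be reproduced with Python's re.sub as a quick check. This is a sketch, assuming \n line endings and the MULTILINE flag so that ^ matches at the start of every line, as it does in Notepad++; the function name remove_matching_lines is made up for the example.

```python
import re


def remove_matching_lines(text: str) -> str:
    """Delete lines starting with 1; or 17; together with their line break."""
    # \s+ consumes the trailing newline, so no empty row is left behind.
    return re.sub(r"^(1|17);.+?\s+", "", text, flags=re.MULTILINE)


doc = (
    "1;01878916;BC101;FALSE\n"
    "16;01978916;BC101;FALSE\n"
    "17;0195B4E5;BC101;FALSE\n"
    "19;0197D016;BC101;FALSE\n"
)
cleaned = remove_matching_lines(doc)
print(cleaned, end="")
```

The line starting with 16; survives because the alternation must be followed immediately by a semicolon.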

Getting Beyond Compare to Match Similar Lines Properly

I am using Beyond Compare 4.1.6 to diff text configuration files. There is one configuration parameter per line, and each line is formatted as follows:
:=
I would like to configure Beyond Compare such that it will align only lines when the : portion of the line is exactly the same in both files. Put differently, everything from the beginning of the line up to and including the colon must match exactly for the two lines to be aligned. Note that a colon cannot occur in , so the colon I want Beyond Compare to base its alignment decision on will always be the first colon in the line.
An example is:
# FILE 1
abcdefgh:string=5
# FILE 2
abcdefkh:string=5
Beyond Compare aligns these two lines even though I don't want it to.
I've been unable to coerce Beyond Compare to compare lines as desired by editing its grammar rules or by tweaking other features.
How may I get Beyond Compare to match lines as described above?
Thank you!
You can compare the files with a Table Compare session.
Set = as the field separator.
Once you have done this, you have two columns, and the first is the key column (if it is not, you can define it as such).
After this you get the result you want (if I understood your question correctly).
If you need this often, you can store the settings in a file format.
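The idea behind the table compare (treat everything before = as a key and only pair lines whose keys are identical) can be sketched in Python. This does not use or describe any Beyond Compare API; the function names are hypothetical, and the x:int lines are made-up sample data added next to the lines from the question.

```python
def split_key(line):
    """Split a line at the first '=' into (key, value)."""
    key, _, value = line.partition("=")
    return key, value


def compare_by_key(lines1, lines2):
    """Pair lines only when their keys match exactly."""
    map1 = dict(split_key(line) for line in lines1)
    map2 = dict(split_key(line) for line in lines2)
    only_in_1 = sorted(map1.keys() - map2.keys())
    only_in_2 = sorted(map2.keys() - map1.keys())
    # Keys present in both files whose values differ.
    changed = sorted(k for k in map1.keys() & map2.keys() if map1[k] != map2[k])
    return only_in_1, only_in_2, changed


file1 = ["abcdefgh:string=5", "x:int=1"]  # second line is illustrative
file2 = ["abcdefkh:string=5", "x:int=2"]  # second line is illustrative
only_in_1, only_in_2, changed = compare_by_key(file1, file2)
print(only_in_1, only_in_2, changed)
```

With this keying, abcdefgh:string and abcdefkh:string are reported as unrelated lines rather than being aligned.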

Notepad++ - Selecting or Highlighting multiple sections of repeated text IN 1 LINE

I have a text file in Notepad++ that contains about 66,000 words all in 1 line, and it is a set of 200 "lines" of output that are all unique and placed in 1 line in the basic JSON form {output:[{output1},{output2},...}]}.
There is a set of characters matching the RegEx expression "id":.........,"kind":"track" that occurs about 285 times in total, and I am trying to either single them out, or copy all of them at once.
Basically, without some super complicated RegEx terms, I am stuck because I can't figure out how to highlight all of them at once, and also the Remove Unbookmarked Lines feature does not apply because this is all in one line. I have only managed to be able to Mark every single occurrence.
So does this require a large number of steps to get the file into multiple lines and work from there, or is there something else I am missing?
Edit: I have come up with a set of Macro schemes that make the process of doing this manually work much faster. It's another alternative but still takes a few steps and quite some time.
Edit 2: I intended there to be an answer for actually just highlighting the different sections all at once, but I guess that is not possible. The answer here turns out to be more useful in my case, allowing me to have a list of IDs without everything else.
You seem to already have a regex which matches single instances of your pattern, so assuming it works and that we must use Notepad++ for this:
Replace .*?("id":.........,"kind":"track").*?(?="id":.........,"kind":"track"|$) with \1.
If this textfile is valid JSON, this opens you up to other, non-notepad++ options, like using Python with the json module.
Edited to remove unnecessary steps
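As a non-Notepad++ alternative along the lines the answer hints at, the matches can be pulled out directly with Python's re.findall, using the regex from the question unchanged. This is a sketch: the sample string below is invented for illustration and only imitates the shape described in the question.

```python
import re

# The question's pattern: "id": followed by any nine characters.
TRACK_PATTERN = re.compile(r'"id":.........,"kind":"track"')


def extract_tracks(text: str):
    """Return every occurrence of the pattern, in order of appearance."""
    return TRACK_PATTERN.findall(text)


# Made-up single-line sample; real data would be read from the file.
sample = (
    '{"output":[{"id":"abc1234","kind":"track","n":1},'
    '{"id":"zzz9999","kind":"track"}]}'
)
tracks = extract_tracks(sample)
print("\n".join(tracks))
```

Printing one match per line gives the list of IDs without the surrounding JSON.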

Compare files and return only the differences using Notepad++

Notepad++ has a Compare Plugin tool for comparing text files, which operates like this:
Launch Notepad++ and open the two files you wish to run a comparison check on.
Click the “Plugins” menu,
Select “Compare” and click “Compare.”
The plugin will run a comparison check and display the two files side by side, with any differences in the text highlighted.
This is a nice feature, which I have used happily for some time. Now I am looking for a way to go further and keep only the highlighted differing lines (e.g. by deleting the non-highlighted ones), or vice versa, i.e. expunge the highlighted lines.
Is there a straightforward way to achieve this?
To subtract two files in Notepad++ (file1 - file2) you may follow this procedure:
Recommended: If possible, remove duplicates in both files, especially if the files are big. To do this: Edit => Line operations => Sort Lines Lexicographically Ascending (do it on both files)
Add ---------------------------- as a footer on file1 (add at least 10 dashes). This is the marker line that separates file1 content from file2.
Then copy the contents of file2 to the end of file1 (after the marker)
Control + H
Search: (?m-s)^(?:-{10,}+\R[\s\S]*+|(.*+)\R(?=(?:(?!^-{10,}$)-++|[^-]*+)*+^-{10,}+\R(?:^.*+\R)*?\1(?:\R|\z))) note: use case sensitivity according to your needs
Replace by: (leave empty)
Select Regular expression radio button
Replace All
You can modify the marker if it is possible that file1 or file2 contains lines equal to the marker. In that case you will have to adapt the regular expression.
By the way, you could even record a macro to do all the steps (add the marker, switch to file2, copy its content to file1, apply the regex) with a single button press.
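For files too large for the regex approach, the same subtraction (file1 - file2) is straightforward to sketch in Python. This is not the answer's method, just an equivalent under the assumption that lines are compared exactly and case-sensitively; the function name subtract_lines is hypothetical.

```python
def subtract_lines(file1_lines, file2_lines):
    """Keep the lines of file1 that do not occur anywhere in file2.

    The original order of file1 is preserved.
    """
    exclude = set(file2_lines)  # set lookup keeps this fast on big inputs
    return [line for line in file1_lines if line not in exclude]


file1_lines = ["Hello", "World", "How", "You", "Are"]
file2_lines = ["World", "Hello"]
remaining = subtract_lines(file1_lines, file2_lines)
print(remaining)
```

Unlike the marker-based regex, this needs no separator line, so lines consisting of dashes are not a special case.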
Edited:
Changed the regex to add some improvements:
Speed related:
Avoid as much backtracking as possible
Avoid searching after the mark
Usability:
Dashes are allowed for the lines. But the separator is still ^-{10,}$
Works with other characters besides words
Speed comparison (new method vs. old method): 78 ms vs. 1.6 seconds. So a nice improvement! That makes comparing kilobyte-sized files possible.
Still, you may want to use a dedicated program for comparing or subtracting bigger files.
If the number of differences is not large, a quicker method might be to bookmark each differing line using keyboard shortcuts. Starting from the beginning of the file, press Alt+Page Down to jump to the first difference, and then press Ctrl+F2 to bookmark it. Continue alternating between Alt+Page Down and Ctrl+F2 until the last difference.
With all the differing lines bookmarked, you can use any of the operations under "Search -> Bookmarks" menu:
Cut Bookmarked Lines
Copy Bookmarked Lines
Paste to (Replace) Bookmarked Lines
Remove Bookmarked Lines
Remove Unmarked Lines
I have a dirty workaround for this. It saves some time compared to Control+C, Alt+Tab, Control+V; Control+C, Alt+Tab, Control+V; ... but it may not be worth it on big files or if the two files differ in many places. For bigger files you may prefer some other tool.
Typically this works best when comparing groups of 'words' and does not work with content that is indented with tabs (like source code).
So the workaround is:
Optional: (depends on the content that's being compared) Sort both files (it will make the future comparison easier) To do this: Edit => Line operations => Sort Lines Lexicographically Ascending (do it on both files)
Compare files with the plugin
Choose one file and inspect the lines you want to keep. Add one tab before each of those lines. Remember that you can select several lines and press Tab to indent them all at once. Alternatively, you may add tabs to the lines you want to remove.
Sort the file. The tab-indented lines will come first, so now you can copy and paste them (or copy and paste the unindented ones).
Move the files to a Linux box and then execute the diff command:
$ diff file1.txt file2.txt > file_diff.txt
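If no Linux box is at hand, Python's standard difflib module can produce a comparable report. This is a sketch rather than a drop-in replacement: it emits unified diff format, which differs from the default output of the diff command, and the function name diff_lines and the sample lines are assumptions for illustration.

```python
import difflib


def diff_lines(lines1, lines2):
    """Return a unified diff of two lists of lines as a list of strings."""
    return list(
        difflib.unified_diff(
            lines1,
            lines2,
            fromfile="file1.txt",
            tofile="file2.txt",
            lineterm="",  # inputs carry no trailing newlines
        )
    )


lines1 = ["Hello", "World", "How"]
lines2 = ["Hello", "How"]
report_lines = diff_lines(lines1, lines2)
print("\n".join(report_lines))
```

Lines prefixed with - occur only in the first file, and lines prefixed with + occur only in the second.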