How do I find duplicate values from two different documents in Notepad++?

So I have 2 files. One contains data in this format (file 1):
user:pass|IP:PORT|
user:pass|IP:PORT|
....
The other file contains the data (file 2):
IP:PORT
IP:PORT
....
I need to see how many times each IP:PORT combination from file 2 appears in file 1.
The results need to be highlighted in file 1, so that I can determine how many times each IP:PORT combination appeared, like so:
user:pass|123.123:4444|
user:pass|456.456:4444|
user:pass|123.123:4444|
So from this document, I would receive the info that "123.123:4444" appears twice and "456.456:4444" appears once, but only if they also exist in the other document I compared with.
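If scripting is an option, the counting itself is straightforward outside Notepad++. A minimal sketch in Python (the names file1.txt and file2.txt are stand-ins for the two documents):

from collections import Counter

# IP:PORT combinations to look for (file 2, one per line)
with open('file2.txt', 'r') as f:
    wanted = set(line.strip() for line in f if line.strip())

# Count how often each wanted IP:PORT appears in file 1,
# where every line looks like user:pass|IP:PORT|
counts = Counter()
with open('file1.txt', 'r') as f:
    for line in f:
        parts = line.strip().split('|')
        if len(parts) >= 2 and parts[1] in wanted:
            counts[parts[1]] += 1

for ip_port, n in counts.items():
    print(ip_port, 'appears', n, 'time(s)')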

Related

How to find string matches between a text file and CSV?

I am currently working with Python 2.7 on a stand-alone system. My preferred method for solving this problem would be to compare with pandas DataFrames. However, I do not have access to install the library on the system I'm working with. So my question is: how else could I take a text file and look for matches of its strings in a CSV?
I have a main CSV file with many fields (for relevance, the first one is a timestamp) and several other text files that each contain a list of timestamps. How can I compare each of the txt files with the main CSV and, when a match is found on that specific field, grab the entire row from the CSV and output the result to another CSV?
Example:
example.csv
timestamp,otherfield,otherfield2
1.2345,,
2.3456,,
3.4567,,
5.7867,,
8.3654,,
12.3434,,
32.4355,,
example1.txt
2.3456
3.4565
example2.txt
12.3434
32.4355
If there are any questions I'm happy to answer them.
You can load all the files into lists, then search the lists:

with open('example.csv', 'r') as file_handle:
    example_file_content = file_handle.read().split("\n")
with open('example1.txt', 'r') as file_handle:
    example1_file_content = file_handle.read().split("\n")
# Compare the first CSV field of each row against the timestamp list
for index, line in enumerate(example_file_content):
    if line.split(",")[0] in example1_file_content:
        print('match found; line is ' + line)
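Since the question also asks to grab the whole row and write it to another CSV, here is a minimal sketch using the csv module (the output name output.csv is an assumption; files are opened in binary mode because the question mentions Python 2.7):

import csv

# Collect the timestamps to look for; a set gives O(1) membership tests
with open('example1.txt', 'r') as f:
    wanted = set(line.strip() for line in f if line.strip())

# Copy every row of the main CSV whose first field matches a wanted timestamp
with open('example.csv', 'rb') as src, open('output.csv', 'wb') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if row and row[0] in wanted:
            writer.writerow(row)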

Regular Expression help to find letter in XML number field

I am getting an error importing an XML file into a custom program. Other files import correctly, but this one file produces an error from a float field. I am using Notepad++'s search function with regular expressions to try to find the issue in the XML file.
When I use <milepost>([a-zA-Z0-9.]+)</milepost> I get around 30,000 results, which is the correct number of records, but the field is supposed to be a DOUBLE. When I use <milepost>([0-9.]+)</milepost> I only get 29,994 records. This tells me that the import is most likely failing because there are letters in my number fields.
I have tried a number of variations like:
<milepost>([\S\D\d]+)</milepost>
<milepost>(.*?)</milepost>
<milepost>([\Sa-zA-Z]+)</milepost>
<milepost>([0-9.\w]+)</milepost>
etc.
Each of these returns the expected 30,000 records.
When I try to search for letters using:
<milepost>([a-zA-Z.]*)</milepost>
<milepost>([a-zA-Z]+)</milepost>
<milepost>(^[a-zA-Z]+$)</milepost>
<milepost>([a-zA-Z.a-zA-Z]+)</milepost>
I get 0 results (most likely because these patterns exclude numbers).
I did manage to find one of the records I am looking for using this method:
<milepost>173.811818181818a</milepost>
But I do not feel like scrolling through 30,000+ lines to look for 5 more records with a letter in them.
Is there a regular expression that will return to me ONLY the values that have a letter/letters in them while allowing numbers? (Fields with only numbers and a period should be excluded)
The 6 problem records presumably contain a mixture of letters and numbers, but your searches for records containing letters will only match records consisting exclusively of letters.
Try
<milepost>.*[a-zA-Z].*</milepost>
which matches any record containing an ASCII letter in its value, as well as allowing other characters such as digits.
What you want is a negative look-ahead. Something like
<milepost>(?![0-9.]+</milepost>)
should be very close.
In plain English: <milepost> not followed by exclusively digits and dots and then a closing </milepost>.
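To sanity-check either pattern outside Notepad++, here is a quick sketch with Python's re module (the sample values are made up):

import re

samples = [
    '<milepost>173.811818181818a</milepost>',  # stray letter: should match
    '<milepost>42.5</milepost>',               # clean number: should not match
]

# Any value containing at least one ASCII letter, digits allowed around it
has_letter = re.compile(r'<milepost>.*[a-zA-Z].*</milepost>')
# Opening tag NOT followed by only digits/dots and the closing tag
not_pure_number = re.compile(r'<milepost>(?![0-9.]+</milepost>)')

for s in samples:
    print(s, bool(has_letter.search(s)), bool(not_pure_number.search(s)))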

Notepad++ - Selecting or Highlighting multiple sections of repeated text IN 1 LINE

I have a text file in Notepad++ that contains about 66,000 words, all in one line: a set of 200 "lines" of output that are all unique and placed in one line in the basic JSON form {output:[{output1},{output2},...]}.
There is a set of character runs matching the regex "id":.........,"kind":"track" that occurs about 285 times in total, and I am trying to either single them out or copy all of them at once.
Basically, short of some super-complicated regex, I am stuck, because I can't figure out how to highlight all of them at once, and the Remove Unbookmarked Lines feature does not apply since this is all in one line. I have only managed to Mark every single occurrence.
So does this require a large number of steps to get the file into multiple lines and work from there, or is there something else I am missing?
Edit: I have come up with a set of macro schemes that make the process of doing this manually much faster. It's another alternative, but it still takes a few steps and quite some time.
Edit 2: I intended there to be an answer for actually highlighting the different sections all at once, but I guess that is not possible. The answer here turns out to be more useful in my case, allowing me to get a list of IDs without everything else.
You seem to already have a regex which matches single instances of your pattern, so assuming it works and that we must use Notepad++ for this:
Replace .*?("id":.........,"kind":"track").*?(?="id":.........,"kind":"track"|$) with \1.
If this text file is valid JSON, that opens you up to other, non-Notepad++ options, like using Python with the json module.
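For that JSON route, a minimal sketch (assuming the top-level key is output and each element carries id and kind fields, per the patterns above; the file name data.json is a placeholder):

import json

# Parse the single-line JSON document
with open('data.json', 'r') as f:
    doc = json.load(f)

# Collect the id of every entry whose kind is "track"
ids = [item['id'] for item in doc['output'] if item.get('kind') == 'track']
print('\n'.join(str(i) for i in ids))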

Obtain all the intervals from a list of numbers that sometime repeat

I would like to find a method to obtain all the intervals from a list of files. The files represent pages of scanned documents. Sometimes a document had several pages and its number appears several times, but most of the time a document had just one page. What I would like to find out is which documents were scanned and which were not, i.e. which numbers are missing from the list.
The list of files looks like this:
00001_DCT.jpeg
00002_DCT.jpeg
00003_1d2_DCT.jpeg
00003_2d2_DCT.jpeg
00004_1d3_DCT.jpeg
00004_2d3_DCT.jpeg
00004_3d3_DCT.jpeg
00005_1d9_DCT.jpeg
00005_2d9_DCT.jpeg
00005_3d9_DCT.jpeg
00005_4d9_DCT.jpeg
00005_5d9_DCT.jpeg
00005_6d9_DCT.jpeg
00005_7d9_DCT.jpeg
00005_8d9_DCT.jpeg
00005_9d9_DCT.jpeg
00006_1d4_DCT.jpeg
00006_2d4_DCT.jpeg
00006_3d4_DCT.jpeg
00006_4d4_DCT.jpeg
00007_DCT.jpeg
00008_DCT.jpeg
00009.jpeg
00010.jpeg
up to
24679.jpeg
24680_1d3.jpeg
24680_2d3.jpeg
24680_3d3.jpeg
24681_1d2.jpeg
24681_2d2_dct.jpeg
24682.jpeg
24683_1d2.jpeg
24683_2d2.jpeg
What is the easiest way to find the missing numbers?
I am assuming that if a document was scanned, it was scanned completely (i.e. not going from 1d3 to 3d3).
Loop through your file names, converting the first five characters into a number. Make sure the current file number is greater than the previous file number by only 1 or 0. If that condition isn't met, there is a break: all the numbers between the previous file number and the current one are missing.
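A minimal sketch of that loop in Python, assuming every file name starts with a five-digit number and the files sit in the current directory:

import os

# Unique numeric prefixes of all scanned files, in ascending order
numbers = sorted({int(name[:5]) for name in os.listdir('.') if name[:5].isdigit()})

# Report every gap between consecutive document numbers
missing = []
for prev, curr in zip(numbers, numbers[1:]):
    if curr - prev > 1:
        missing.extend(range(prev + 1, curr))
print(missing)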

C++ - Randomly access lines of several text files

I have 10 text files (named file0.txt to file9.txt) with arbitrary lengths and numbers of lines. I need to randomly pick a file, randomly access 1-3 lines from that file, process them, and repeat until all the lines of all the files have been processed. This only needs to be done once. For the sake of this question, let's say "process" means print the lines. Does anyone have any suggestions on how I can go about doing this without loading all the text files into memory?
There's not really any way to 'randomly access' (in the sense that you can randomly access a vector) lines in a text file since the only way to find the lines is to search the file linearly for newlines. This means you'll at least need to stream through the files once to access lines even if you don't load them fully into memory.
You could achieve what you're describing by passing over all the files once to count the number of lines in them and then passing over them again to pull out randomly selected lines. I'm not sure what the benefit of that would be though. What are you really trying to achieve?
You can scan each file once to build an index of where every line starts, and keep that index in memory (or even persist it if you need to process the same file more than once).
Once you have that, you can simply seek to the beginning of a line and read it up to the newline/EOF before processing.
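A sketch of that index-then-seek idea, shown in Python for brevity (in C++ the equivalent is recording tellg() positions during the scan and jumping back with seekg(); the bookkeeping of which lines remain unprocessed is left out):

import random

# One linear scan: record the byte offset at which each line starts,
# without keeping any line contents in memory
def index_lines(path):
    offsets = []
    with open(path, 'rb') as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

# Seek straight to a randomly chosen line and read just that line
def random_line(path, offsets):
    with open(path, 'rb') as f:
        f.seek(random.choice(offsets))
        return f.readline().decode()

offsets = index_lines('file0.txt')
print(random_line('file0.txt', offsets))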
Suggestion:
1/ Make a copy of the files.
2/ Erase a line once it has been read.
3/ Update the number of lines in the file.
That way you always randomly pick a line that exists and has not already been read.
Lots of reads and writes, though... not efficient.
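For what it's worth, a streaming sketch of that suggestion in Python (operating on a copy, per step 1/; each pick rewrites the file without the chosen line, which is exactly the heavy I/O noted above):

import os
import random

# One streaming pass to count lines without loading the file
def count_lines(path):
    with open(path, 'r') as f:
        return sum(1 for _ in f)

# Stream-copy the file, returning the randomly chosen line and
# dropping it from the copy so it can never be drawn twice
def pop_random_line(path):
    n = count_lines(path)
    if n == 0:
        return None
    target = random.randrange(n)
    picked = None
    with open(path, 'r') as src, open(path + '.tmp', 'w') as dst:
        for i, line in enumerate(src):
            if i == target:
                picked = line
            else:
                dst.write(line)
    os.replace(path + '.tmp', path)
    return picked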