Python: How to compare txt files

There are two files, a.txt and b.txt. Each string in a.txt should appear as part of a string in b.txt, but the two are not identical.
Contents of a.txt:
192.168.1.1
192.168.1.4
Contents of b.txt:
./192.168.1.1_Hello
./192.168.1.4_World
./tmp.txt
My final goal is to check that the entries in b.txt cover every IP in a.txt. If any IP is missing, I want the program to print it. Does anyone have ideas on how to do this? A diff-style comparison in Python does not seem like a good fit.
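One way to approach this, as a minimal sketch: read the IPs from a.txt, reduce each b.txt entry to the part before the first underscore, and report the IPs that never show up. It assumes every relevant b.txt entry looks like ./<IP>_<suffix>, as in the sample; the function name missing_ips and the extra address 192.168.1.7 are made up for illustration. Comparing the extracted prefix exactly, rather than using a substring test, avoids treating 192.168.1.1 as covered by ./192.168.1.10_Foo.

```python
def missing_ips(ip_lines, entry_lines):
    """Return the IPs from ip_lines that no entry in entry_lines covers."""
    ips = [line.strip() for line in ip_lines if line.strip()]

    covered = set()
    for line in entry_lines:
        name = line.strip()
        if name.startswith("./"):
            name = name[2:]
        # "192.168.1.1_Hello" -> "192.168.1.1"; names without "_" stay whole
        covered.add(name.split("_", 1)[0])

    return [ip for ip in ips if ip not in covered]


# Sample data from the question, plus one hypothetical uncovered IP.
a_lines = ["192.168.1.1\n", "192.168.1.4\n", "192.168.1.7\n"]
b_lines = ["./192.168.1.1_Hello\n", "./192.168.1.4_World\n", "./tmp.txt\n"]

print(missing_ips(a_lines, b_lines))  # ['192.168.1.7']
```

Because the function only iterates over lines, open file objects can be passed directly, for example missing_ips(open("a.txt"), open("b.txt")).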

Related

regex in notepad++ or sed to return two different strings

I have a report that has information about a list of servers. I want to search this list for servers with uptime over a certain amount, and also get the IP of each such server. I have been using notepad++ to do the searching, but sed syntax would be OK too. The report has data like this:
some.dns.com
up 720 days,
some version
several lines of disk space information, between 14 and 16 lines
Connection to 10.1.1.1 closed.
some.other.dns
up 132 days,
some version
several lines of disk space information, between 14 and 16 lines
Connection to 10.1.1.2 closed.
I've come up with the following so far, which gives me the uptime threshold I need:
up ([9-9]\d|\d{3,} days,)
But I also need the IP addresses to make sense of it, and haven't been able to figure out a way to get JUST the IPs related to the servers with high uptime.
I've found something like this to find IP addresses:
((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))
So, I was hoping to return something like the following:
up 720 days,
10.1.1.1
You may actually use awk:
awk -F"\n" -v RS="" '$0 ~ /up (9[0-9]|[0-9]{3,}) days/{gsub(/Connection to | closed\./, "", $NF); print $1 "\n" $NF}' file > newfile
The file is read paragraph by paragraph (RS="" makes blank lines the record separator), and fields are separated by newlines (-F"\n"). If a record matches the pattern up (9[0-9]|[0-9]{3,}) days (up and a space, then either 9 followed by any digit or three or more digits, then a space and days), the static text is stripped from the last field ($NF), and the first and last fields are printed.
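The same idea can be sketched in Python without relying on blank lines between records. This is only an illustration under stated assumptions: each server's record contains one "up N days," line and ends with a "Connection to <IP> closed." line, as in the sample report. The function name long_uptime_servers and the min_days parameter are hypothetical.

```python
import re

UPTIME_RE = re.compile(r"up (\d+) days,")
IP_RE = re.compile(r"Connection to (\d{1,3}(?:\.\d{1,3}){3}) closed\.")


def long_uptime_servers(lines, min_days=90):
    """Yield (uptime_line, ip) for every server up at least min_days days."""
    uptime_line = None
    for line in lines:
        line = line.strip()
        match = UPTIME_RE.search(line)
        if match:
            # Remember the uptime line only when it meets the threshold.
            uptime_line = line if int(match.group(1)) >= min_days else None
            continue
        match = IP_RE.search(line)
        if match:
            if uptime_line is not None:
                yield uptime_line, match.group(1)
            uptime_line = None  # the "Connection to" line closes the record


sample_report = """some.dns.com
up 720 days,
some version
Connection to 10.1.1.1 closed.
some.other.dns
up 132 days,
some version
Connection to 10.1.1.2 closed."""

for uptime, ip in long_uptime_servers(sample_report.splitlines(), min_days=200):
    print(uptime)
    print(ip)
```

With min_days=200 only the first server in the sample is printed; comparing the number as an integer avoids having to encode the threshold in the regex.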

Searching for an unknown IP using FINDSTR

I have text files with hundreds of entries like those below. The entries mostly contain 2 IPs; sometimes they contain 3. I am trying to find that third IP, which is always in the middle of the stack (syntax below). There are at most 3 different IPs in each file at all times. Some text files may not have that middle IP at all (its occurrence is quite rare). How do I write a search command that finds the middle IP in such stacks, if the text file contains one? OS: Win7.
Text file sample syntax:
- saving IP addresses
* 192.168.1.1
* 111.111.222.222
- over
- saving IP addresses
* 192.168.1.1
* 11.123.11.123
* 111.111.222.222
- over
- saving IP addresses
* 192.168.1.1
* 111.111.222.222
- over
I have tried findstr \-.*\*.*\*.*\- pathtofile.txt. This should return the block of 3 IPs if there is such a block in the file, but it didn't work.
Assuming your real file isn't double-spaced like your sample, the following will output the first line (saving...) and line number of matching blocks. Your real problem is findstr will only output one line even if you are matching across lines, so you will never get the whole block output. You need a better tool.
Note: I am using the JPSoft Take Command escape character to put in CR and LF, but you can create them in real batch files as well, though it isn't easy.
findstr /n /R saving.*^r^n.*\..*\..*\..*^r^n.*\..*\..*\..*^r^n.*\..*\..*\..*^r^n sampleIPinput.txt
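If a scripting language is available, a block-aware parser sidesteps the one-line limitation of findstr. The following is a rough Python sketch, not a findstr equivalent: it assumes blocks start with "- saving IP addresses", end with "- over", and list each IP on a line starting with "*". The function name middle_ips is hypothetical.

```python
def middle_ips(lines):
    """Return the set of middle IPs from blocks listing exactly three IPs."""
    found = set()
    block = None  # None means we are outside a "saving IP addresses" block
    for line in lines:
        line = line.strip()
        if line.startswith("- saving IP addresses"):
            block = []
        elif line.startswith("- over"):
            if block is not None and len(block) == 3:
                found.add(block[1])
            block = None
        elif line.startswith("*") and block is not None:
            block.append(line.lstrip("* ").strip())
    return found


sample = """- saving IP addresses
* 192.168.1.1
* 111.111.222.222
- over
- saving IP addresses
* 192.168.1.1
* 11.123.11.123
* 111.111.222.222
- over"""

print(middle_ips(sample.splitlines()))  # {'11.123.11.123'}
```

An empty set means the file contains no three-IP block. Blank lines between entries are skipped naturally, so a double-spaced file is handled the same way.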

How to list out specific files with dates in the file name in unix shell

I apologize if I'm asking a duplicate question. My google skills have temporarily left me and I don't know how to word my question to find an answer.
I'm working on a unix server and I have a bunch of files with this format as the file name:
alert_YYYY-MM-DD.tsv
I'm trying to list out a specific range of files according to date but for some reason, it's not working how I need it to work.
This works for some reason:
$ ls alert_2015-08-0[1-9].tsv
alert_2015-08-01.tsv alert_2015-08-04.tsv alert_2015-08-07.tsv
alert_2015-08-02.tsv alert_2015-08-05.tsv alert_2015-08-08.tsv
alert_2015-08-03.tsv alert_2015-08-06.tsv alert_2015-08-09.tsv
But this does not, even though the files are present:
$ ls alert_2015-08-[10-15].tsv
ls: alert_2015-08-[10-15].tsv: No such file or directory
$ ls alert_2015-08-1[0-5].tsv ### but the files exist in the directory!!
alert_2015-08-10.tsv alert_2015-08-12.tsv
alert_2015-08-11.tsv alert_2015-08-14.tsv
I'm also trying to span from the single digits to the teens or teens to 20s but it also isn't working, although, it might be more difficult than above:
e.g.
$ ls alert_2015-08-[09-15].tsv
ls: alert_2015-08-[09-15].tsv: No such file or directory
I'm not sure if this is a glob question or regex question but it's very annoying since logically what I wrote above should work but it doesn't and I don't understand why, other than I have the syntax wrong. Thanks in advance for any help.
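A bracket expression matches exactly one character, so [10-15] does not mean the numbers 10 to 15: it is the set made of the character 1, the range 0-1, and the character 5, and the glob then looks for a one-digit day. Ranges inside [] should therefore stay within 0-9, with one bracket per digit position: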
$ ls alert_2015-08-[0-1][0-9].tsv
alert_2015-08-01.tsv alert_2015-08-10.tsv alert_2015-08-14.tsv
alert_2015-08-04.tsv alert_2015-08-11.tsv
alert_2015-08-07.tsv alert_2015-08-12.tsv
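For ranges that cross a digit boundary, such as the 9th through the 15th, a single glob gets awkward. One alternative, sketched here in Python under the assumption that all names follow alert_YYYY-MM-DD.tsv, is to parse the date out of each name and compare real dates. The function name alerts_in_range and the sample file list are made up for illustration.

```python
import re
from datetime import date

NAME_RE = re.compile(r"^alert_(\d{4})-(\d{2})-(\d{2})\.tsv$")


def alerts_in_range(names, start, end):
    """Return names whose embedded date lies in [start, end], oldest first."""
    hits = []
    for name in names:
        match = NAME_RE.match(name)
        if not match:
            continue  # not an alert file
        try:
            day = date(*map(int, match.groups()))
        except ValueError:
            continue  # digits that do not form a real date, e.g. month 13
        if start <= day <= end:
            hits.append((day, name))
    return [name for _, name in sorted(hits)]


# Hypothetical directory listing.
names = [f"alert_2015-08-{d:02d}.tsv" for d in (7, 8, 9, 10, 11, 12, 14, 20)]

for name in alerts_in_range(names, date(2015, 8, 9), date(2015, 8, 15)):
    print(name)
```

On a real directory, os.listdir(".") could supply the names. The same function also handles ranges that span months or years, which bracket globs cannot express at all.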

Find matches between 2 files

I'm trying to output matching lines in 2 files using AWK. I made it easier by making 2 files with just one column, they're phone numbers. I found many people asking the same question and getting the answer to use :
awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
The problem I encountered was that it simply doesn't want to work. The first file is small (~5MB) and the second file is considerably larger (~250MB).
I have some general knowledge of AWK and know that the above script should work, yet I'm unable to figure out why it's not.
Is there any other way I can achieve the same result?
GREP is a nice tool, but it clogs up the RAM and dies within seconds due to the file size.
I ran some spot checks to confirm that there are matches: I grepped random numbers from the smaller file in the big one and did find them, so I'm sure matches exist.
Any help is appreciated!
[edit as requested by @Jaypal]
Sample code from both files :
File1:
01234567895
01234577896
01234556894
File2:
01234642784
02613467246
01234567895
Output:
01234567895
What I get:
xxx@xxx:~$ awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
xxx@xxx:~$
Update
The problem turned out to be the files themselves. Apparently they came from a DOS system and have \r characters at the line ends, so the keys never compare equal. To fix it, sanitize the files with:
dos2unix
Former answer
Your awk is pretty fine. However, you can also compare files with grep -f:
grep -f file1 file2
This prints the lines of file2 that match any of the patterns listed in file1.
You can add options to make a better matching:
grep -wFf file1 file2
-w matches whole words only
-F matches fixed strings (no regex).
Examples
$ cat a
hello
how are
you
I am fine areare
$ cat b
hel
are
$ grep -f b a
hello
how are
I am fine areare
$ grep -wf b a
how are
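For comparison, the awk approach (load the small file into a lookup table, then stream the big one) can be sketched in Python. This is an illustration rather than a replacement for the commands above; the function name common_lines is hypothetical. Stripping whitespace from every line also removes a trailing \r, so DOS line endings do not prevent matches.

```python
def common_lines(small_lines, big_lines):
    """Yield each line of big_lines that also occurs in small_lines."""
    # Only the small file is held in memory; strip() drops "\r\n" and "\n".
    wanted = {line.strip() for line in small_lines if line.strip()}
    for line in big_lines:
        key = line.strip()
        if key in wanted:
            yield key


# Sample data from the question, with DOS line endings in the first file.
file1 = ["01234567895\r\n", "01234577896\r\n", "01234556894\r\n"]
file2 = ["01234642784\n", "02613467246\n", "01234567895\n"]

print(list(common_lines(file1, file2)))  # ['01234567895']
```

Passing open file objects instead of lists keeps memory use proportional to the small file, since the large file is read one line at a time.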

Regexp pattern matching IP and UserAgent in an Huge File

I have a huge log file that has a structure like this:
ip=X.X.X.X
userAgent=Firefox
-----
Referer=hxxp://www.bla.org
I want to create a custom output like this:
ip:userAgent
for ex:
X.X.X.X:Firefox
The pattern should ignore lines that don't start with ip= or userAgent=. (These two must form a pair, as I mentioned above.)
I am a new administrator and our client needs the file immediately.
Any help will be wonderful.
Thanks.
^ip=(\d+(?:\.\d+){3})[\r\n]+userAgent=(.+)$
Apply in global + multiline mode.
Group 1 will contain the IP, group 2 will contain the user agent string.
Edit: The above expression can be simplified a bit by dropping the IP address format check, assuming that the log file contains nothing but real IP addresses. A character class keeps the whole address in group 1:
^ip=([\d.]+)[\r\n]+userAgent=(.+)$
You can use:
^ip=((?:[0-9]{1,3}\.){3}[0-9]{1,3})$
And
^userAgent=(.*)$
Get the group 1 for both and you will have the desired data.
Give this a try (it is in no way robust if some lines of your log file differ from the example snippet above):
sed -n -e '/^ip=/ {s///
N
s/\nuserAgent=/:/
p
}' HugeFile > customoutput
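The sed script above joins each ip= line with the line that follows it. A comparable streaming version in Python, which reads one line at a time and so should cope with a huge file, might look like the sketch below. It assumes, as the question states, that a userAgent= line directly follows its ip= line; the function name ip_agent_pairs and the sample address are hypothetical.

```python
def ip_agent_pairs(lines):
    """Yield 'ip:userAgent' for each ip= line directly followed by userAgent=."""
    ip = None
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("ip="):
            ip = line[len("ip="):]
        elif line.startswith("userAgent=") and ip is not None:
            yield f"{ip}:{line[len('userAgent='):]}"
            ip = None
        else:
            ip = None  # any other line breaks the pair


sample_log = [
    "ip=10.0.0.1\n",
    "userAgent=Firefox\n",
    "-----\n",
    "Referer=hxxp://www.bla.org\n",
]

for pair in ip_agent_pairs(sample_log):
    print(pair)  # 10.0.0.1:Firefox
```

Unlike the sed version, an ip= line that is not followed by userAgent= produces no output here. Sorting can then be done on the output, for example with sorted(ip_agent_pairs(open("HugeFile"))) if the pairs fit in memory.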