How do I extract only the IPs from the data? - regex

Trying to pull some logs and break them down. The following regex gives me a correct match for all 4 IPs: ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}+), but I'm not sure how to either delete the rest of the data or "extract" the IPs. I only need the IPs, as shown below.
June 3rd 2020, 21:18:02.193 [2020-06-03T21:18:02.781503+00:00,192.168.5.134,0,172.16.139.61,514,rslog1,imtcp,]<183>Jun 3 21:18:02 005-1attt01 atas_ssl: 1591219073.296175 CAspjq31LV8F0b 146.233.244.131 38530 104.16.148.244 443 - - - www.yahoo.com F - - F - - - - - - -
June 3rd 2020, 21:18:02.193 [2020-06-03T21:18:02.781503+00:00,192.168.5.134,0,172.16.139.61,514,rslog1,imtcp,]<183>Jun 3 21:18:02 005-1attt01 atas_ssl: 1591219073.296175 CAspjq31LV8F0b 146.233.244.131 38530 104.16.148.244 443 - - - www.yahoo.com F - - F - - - - - - -
Need this:
192.168.5.134 172.16.139.61 146.233.244.131 104.16.148.244
192.168.5.134 172.16.139.61 146.233.244.131 104.16.148.244
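No answer was posted for this one, but as a sketch, assuming the log lines are post-processed in Python, re.findall with a loose dotted-quad pattern pulls exactly those four addresses per line:

```python
import re

# Four dot-separated runs of 1-3 digits (a loose IPv4 pattern; it does
# not validate that each octet is <= 255, which is fine for log data).
IP_PATTERN = re.compile(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b')

line = ('June 3rd 2020, 21:18:02.193 [2020-06-03T21:18:02.781503+00:00,'
        '192.168.5.134,0,172.16.139.61,514,rslog1,imtcp,]<183>Jun 3 21:18:02 '
        '005-1attt01 atas_ssl: 1591219073.296175 CAspjq31LV8F0b '
        '146.233.244.131 38530 104.16.148.244 443 - - - www.yahoo.com F')

ips = IP_PATTERN.findall(line)
print(' '.join(ips))
# → 192.168.5.134 172.16.139.61 146.233.244.131 104.16.148.244
```

Timestamps like 21:18:02.193 don't match because they never contain three dots between short digit runs, so no extra filtering is needed for these lines.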

Python 2.7: read a txt file, split and group a few column count from right

Because the .txt file has a flaw, it needs to be split from the right. Below is part of the file. Notice that the first row has only 4 columns while the other rows have 5. I want the data from the 2nd, 3rd, and 4th columns counting from the right.
5123 - SENTRAL REIT - SENTA.KL - [$SENT]
KIPT - 5280 - KIP REAL EST - KIPRA.KL - [$KIPR]
ALIT - 5269 - AL-SALAM REAL - ALSAA.KL - [$ALSA]
KLCC - 5235SS - KLCC PROP - KLCCA.KL - [$KLCC]
IGBgggREIT - 5227 - IGB RT - IGREA.KL - [$IGRE]
SUNEIT - 5176 - SUNWAY RT - SUNWA.KL - [$SUNW]
ALA78QAR - 5116 - AL-AQAR HEA RT - ALQAA.KL - [$ALQA]
I want the file to be saved as .csv so it can be read by pandas later.
The desired output is
Code,Company,RIC
5123,SENTRAL REIT,SENTA.KL
5280,KIP REAL EST,KIPRA.KL
5269,AL-SALAM REAL,ALSAA.KL
5235SS,KLCC PROP,KLCCA.KL
5227,IGB RT,IGREA.KL
5176,SUNWAY RT,SUNWA.KL
5116,AL-AQAR HEA RT,ALQAA.KL
My code is below
>>> with open('abc.txt', 'r') as reader:
...     [x for x in reader.read().strip().split(' - ') if x]
It returns a flat list, and I am unable to group it into the right columns because of the flaw in the data (unequal column counts in some rows when counted from the left).
Please advise how to get the desired output
This should do the trick :)
import pandas as pd

with open('abc.txt', 'r') as reader:
    # Split on ' - ' and keep the 2nd, 3rd and 4th fields counting from
    # the right, so rows with an extra leading column still line up.
    data = [line.split(' - ')[-4:-1] for line in reader.readlines()]

df = pd.DataFrame(columns=['Code', 'Company', 'RIC'], data=data)
df.to_csv('abc.csv', sep=',', index=False)
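If pandas is only being used to write the file, a similar sketch works with just the standard library's csv module (same ' - ' delimiter and right-aligned slice; sample rows inlined here for illustration):

```python
import csv

lines = [
    '5123 - SENTRAL REIT - SENTA.KL - [$SENT]',
    'KIPT - 5280 - KIP REAL EST - KIPRA.KL - [$KIPR]',
]

with open('abc.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['Code', 'Company', 'RIC'])
    for line in lines:
        # Index from the right so the optional extra leading column is ignored.
        writer.writerow(line.split(' - ')[-4:-1])
```

The resulting abc.csv can still be loaded later with pandas.read_csv.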

QAbstractProxyModel: How to implement methods?

I have a TreeModel with a database-like structure:
+Table1
-"Table1_key" -"name"
- 1 -"John"
- 2 -"Peter"
-...
+Table2
-"Table2_key" -"Table1_key" -"value1" -"value2"
- 1 - 1 - 1000 - 20000
- 2 - 1 - 3000 - 4000
- 3 - 2 - 1000 -2000
-...
+...
So the data comes from an .xml file and is displayed in a TreeView, which works just fine.
However, I want to display some of the tables in different views and resolve the keys in those views.
For the example model, the required view could look like this:
+"John"
-"value1" -"value2"
- 1000 - 20000
- 3000 - 4000
+"Peter"
- 1000 -2000
I guess using a QAbstractProxyModel would be the proper way.
So my question is: how do I implement this?
I can't find any examples, and I have no idea how to map between the indexes of the source model and the proxy model in the mapToSource/mapFromSource methods.

Split string, extract and add to another column regex BIGQUERY

I have a table with an Equipment column containing strings. I want to split each string, take a part of it, and add that part to a new column (SerialNumber_Asset). The part I want to extract always has the same pattern: A followed by 7 digits. Example:
Equipment SerialNumber_Asset
1 AXION 920 - A2302888 - BG-ADM-82 -NK A2302888
2 Case IH Puma T4B 220 - BG-AEH-87 - NK null
3 ARION 650 - A7702047 - BG-ADZ-74 - MU A7702047
4 ARION 650 - A7702039 - BG-ADZ-72 - NK A7702039
My code:
select x, y, z,
regexp_extract(Equipment, r'([\A][\d]{7})') as SerialNumber_Asset
FROM `aa.bb.cc`
The message I got:
Cannot parse regular expression: invalid escape sequence: \A
Any suggestions on what could be wrong? Thanks
Just use A instead of [\A]: \A is not a valid escape sequence inside a character class, and a literal A needs no escaping. Check the example below:
select regexp_extract('AXION 920 - A2302888 - BG-ADM-82 -NK', r'(A[\d]{7})') as SerialNumber_Asset
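The same pattern can be sanity-checked locally with Python's re module (a rough stand-in: BigQuery actually uses RE2, but for this simple pattern the syntax is identical):

```python
import re

equipment = [
    'AXION 920 - A2302888 - BG-ADM-82 -NK',
    'Case IH Puma T4B 220 - BG-AEH-87 - NK',
    'ARION 650 - A7702047 - BG-ADZ-74 - MU',
]

# "A" followed by exactly 7 digits, mirroring the fixed BigQuery regex.
pattern = re.compile(r'A\d{7}')

for s in equipment:
    m = pattern.search(s)
    # Rows without a serial number yield None, matching the null in the table.
    print(m.group(0) if m else None)
# → A2302888, None, A7702047
```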

String split with spaces using regex

I am trying to split a string into groups using regex in NiFi. Could anyone help me split the string below using regex?
Alternatively, how can I specify which occurrence of the delimiter to split on? For example, in the string below, how can I specify that I want the substring after the 3rd occurrence of a space?
Suppose I have a string:
"6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)"
And I want result something like this :
group 1 - 6/19/2017 12:14:07 PM
group 2 - 0FA0
group 3 - PACKET 0000000DF5EC3D80
group 4 - UDP
group 5 - Snd
group 6 - 11.222.333.44
group 7 - 93c8
group 8 - R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)
Could anyone help me? Thanks in advance.
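On the narrower sub-question (everything after the 3rd occurrence of a space), a plain str.split with a maxsplit argument is enough; this is a sketch of the idea in Python, not something NiFi-specific:

```python
text = ('6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd '
        '11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR '
        '(2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)')

# maxsplit=3 splits on the first three spaces only, so the 4th element
# is the untouched remainder after the 3rd occurrence of the delimiter.
parts = text.split(' ', 3)
print(parts[3])
```

Here parts[3] starts with '0FA0 PACKET ...', i.e. the whole tail of the line after the timestamp.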
If it's really just certain spaces you want as delimiters, you can do something like this to avoid a fixed-width nightmare:
regex = r"(\S+\s\S+\s\S+)\s(\S+)\s(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(.*)"
Pretty much it's what it looks like: runs of non-spaces \S+ separated by spaces \s, with each run captured in parentheses. The .* at the end is just the rest of the line and can be adjusted as needed. If you wanted every non-spaced run as its own group you could do a split instead of a regex, but it looks like that isn't what is desired. I don't have access to NiFi to test, but here is an example in Python.
import re

text = "6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)"
# Raw string so the \S and \s escapes reach the regex engine intact.
regex = r"(\S+\s\S+\s\S+)\s(\S+)\s(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(.*)"
match = re.search(regex, text)
for i, group in enumerate(match.groups(), start=1):
    print("group %d - %s" % (i, group))
Output:
group 1 - 6/19/2017 12:14:07 PM
group 2 - 0FA0
group 3 - PACKET 0000000DF5EC3D80
group 4 - UDP
group 5 - Snd
group 6 - 11.222.333.44
group 7 - 93c8
group 8 - R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)
Are you trying to extract every group into a separate attribute? This is certainly possible in "pure" NiFi, but with lines this long, it may make more sense to use ExecuteScript processor to use Groovy or Python's more complex regular expression handling in conjunction with String#split() and provide a script like sniperd posted.
To perform this task using ExtractText, you'll configure it as follows:
Copyable patterns:
group 1: (^\S+\s\S+\s\S+)
group 2: (?i)(?<=\s)([a-f0-9]{4})(?=\s)
group 3: (?i)(?<=\s)(PACKET\s[a-f0-9]{4,16})(?=\s)
group 4: (?i)(?<=\s\S{16}\s)([\w]{3,})(?=\s)
group 5: (?i)(?<=\s.{3}\s)([\w]{3,})(?=\s)
group 6: (?i)(?<=\s.{3}\s)([\d\.]{7,15})(?=\s)
group 7: (?i)(?<=\d\s)([a-f0-9]{4})(?=\s)
group 8: (?i)(?<=\d\s[a-f0-9]{4}\s)(.*)$
It is important to note that Include Capture Group 0 is set to false. You will get duplicate groups (group 1 and group 1.1) due to the way regex expressions are validated in NiFi (currently all regexes must have at least one capture group -- this will be fixed with NIFI-4095 | ExtractText should not require a capture group in every regular expression).
The resulting flowfile has the attributes properly populated:
Full log output:
2017-06-20 14:45:57,050 INFO [Timer-Driven Process Thread-9] o.a.n.processors.standard.LogAttribute LogAttribute[id=c6b04310-015c-1000-b21e-c64aec5b035e] logging for flow file StandardFlowFileRecord[uuid=5209cc65-08fe-44a4-be96-9f9f58ed2490,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1497984255809-1, container=default, section=1], offset=444, length=148],offset=0,name=1920315756631364,size=148]
--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
Value: 'Tue Jun 20 14:45:10 EDT 2017'
Key: 'lineageStartDate'
Value: 'Tue Jun 20 14:45:10 EDT 2017'
Key: 'fileSize'
Value: '148'
FlowFile Attribute Map Content
Key: 'filename'
Value: '1920315756631364'
Key: 'group 1'
Value: '6/19/2017 12:14:07 PM'
Key: 'group 1.1'
Value: '6/19/2017 12:14:07 PM'
Key: 'group 2'
Value: '0FA0'
Key: 'group 2.1'
Value: '0FA0'
Key: 'group 3'
Value: 'PACKET 0000000DF5EC3D80'
Key: 'group 3.1'
Value: 'PACKET 0000000DF5EC3D80'
Key: 'group 4'
Value: 'UDP'
Key: 'group 4.1'
Value: 'UDP'
Key: 'group 5'
Value: 'Snd'
Key: 'group 5.1'
Value: 'Snd'
Key: 'group 6'
Value: '11.222.333.44'
Key: 'group 6.1'
Value: '11.222.333.44'
Key: 'group 7'
Value: '93c8'
Key: 'group 7.1'
Value: '93c8'
Key: 'group 8'
Value: 'R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)'
Key: 'group 8.1'
Value: 'R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)'
Key: 'path'
Value: './'
Key: 'uuid'
Value: '5209cc65-08fe-44a4-be96-9f9f58ed2490'
--------------------------------------------------
6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)
Another option with the release of NiFi 1.3.0 is to use the record processing capabilities. This is a new feature which allows arbitrary input formats (Avro, JSON, CSV, etc.) to be parsed and manipulated in a streaming manner. Mark Payne has written a very good tutorial here that introduces the feature and provides some simple walkthroughs.

How to write a program to rename mp4 files to match the name of srt files?

So I have downloaded the mp4 and srt files for the "Introduction to Computer Networks" course from Coursera, but there is a slight discrepancy between the names of the mp4 and srt files.
The file name samples are as follows:
1 - 1 - 1-1 Goals and Motivation (1253).mp4
1 - 1 - 1-1 Goals and Motivation (12_53).srt
1 - 2 - 1-2 Uses of Networks (1316).mp4
1 - 2 - 1-2 Uses of Networks (13_16).srt
1 - 3 - 1-3 Network Components (1330).mp4
1 - 3 - 1-3 Network Components (13_30).srt
1 - 4 - 1-4 Sockets (1407).mp4
1 - 4 - 1-4 Sockets (14_07).srt
1 - 5 - 1-5 Traceroute (0736).mp4
1 - 5 - 1-5 Traceroute (07_36).srt
1 - 6 - 1-6 Protocol Layers (2225).mp4
1 - 6 - 1-6 Protocol Layers (22_25).srt
1 - 7 - 1-7 Reference Models (1409).mp4
1 - 7 - 1-7 Reference Models (14_09).srt
1 - 8 - 1-8 Internet History (1239).mp4
1 - 8 - 1-8 Internet History (12_39).srt
1 - 9 - 1-9 Lecture Outline (0407).mp4
1 - 9 - 1-9 Lecture Outline (04_07).srt
2 - 1 - 2-1 Physical Layer Overview (09_27).mp4
2 - 1 - 2-1 Physical Layer Overview (09_27).srt
2 - 2 - 2-2 Media (856).mp4
2 - 2 - 2-2 Media (8_56).srt
2 - 3 - 2-3 Signals (1758).mp4
2 - 3 - 2-3 Signals (17_58).srt
2 - 4 - 2-4 Modulation (1100).mp4
2 - 4 - 2-4 Modulation (11_00).srt
2 - 5 - 2-5 Limits (1243).mp4
2 - 5 - 2-5 Limits (12_43).srt
2 - 6 - 2-6 Link Layer Overview (0414).mp4
2 - 6 - 2-6 Link Layer Overview (04_14).srt
2 - 7 - 2-7 Framing (1126).mp4
2 - 7 - 2-7 Framing (11_26).srt
2 - 8 - 2-8 Error Overview (1745).mp4
2 - 8 - 2-8 Error Overview (17_45).srt
2 - 9 - 2-9 Error Detection (2317).mp4
2 - 9 - 2-9 Error Detection (23_17).srt
2 - 10 - 2-10 Error Correction (1928).mp4
2 - 10 - 2-10 Error Correction (19_28).srt
I want to rename the mp4 files to match the srt files so that VLC can automatically load the subtitles when I play the videos. Would someone discuss algorithms to do this? You can also provide solution code in any language, as I am familiar with many programming languages, but Python and C++ are preferable.
Edit:
Thanks to everyone who replied. I know it is easier to rename the srt files than the other way around, but I think it would be more interesting to rename the mp4 files. Any suggestions?
Here's a quick solution in python.
The job is simple if you make the following assumptions:
all files are in the same folder
you have the same number of srt and mp4 files in the directory
all srt are ordered alphabetically, all mp4 are ordered alphabetically
Note I do not assume anything about the actual names (e.g. that you only need to remove underscores).
So you don't need any special logic for matching the files, just go one-by-one.
import os
import re
import sys
from glob import glob

def mv(src, dest):
    print('mv "%s" "%s"' % (src, dest))
    # os.rename(src, dest)  # uncomment this to actually rename the files

dir = sys.argv[1]
vid_files = sorted(glob(os.path.join(dir, '*.mp4')))
sub_files = sorted(glob(os.path.join(dir, '*.srt')))
assert len(sub_files) == len(vid_files), "lists of different lengths"

for vidf, subf in zip(vid_files, sub_files):
    # The target video name is the subtitle name with its extension swapped.
    new_vidf = re.sub(r'\.srt$', '.mp4', subf)
    if vidf == new_vidf:
        print('%s OK' % (vidf,))
        continue
    mv(vidf, new_vidf)
Again, this is just a quick script. Suggested improvements:
support different file extensions
use a better cli, e.g. argparse
support taking multiple directories
support test-mode (don't actually rename the files)
better error reporting (instead of using assert)
more advanced: support undoing
If all those files really follow this scheme, the Python implementation is almost trivial:
import glob, os

for subfile in glob.glob("*.srt"):
    os.rename(subfile, subfile.replace("_", ""))
If your mp4 files also contain underscores, you'll want an additional loop for them.
for f in *.srt; do mv "$f" "${f//_/}"; done
This is just an implementation of what Zeta said :)
import os

for filename in os.listdir('.'):
    extension = os.path.splitext(filename)[1][1:]
    if extension == 'srt':
        newName = filename.replace('_', '')
        print('Changing\t' + filename + ' to\t' + newName)
        os.rename(filename, newName)
print('Done!')