Regex to extract similar config blocks throughout ini file - regex

I want to extract the hostname and node id values from the [ndb_mgmd] blocks only, from the mysql ini config file. There are other blocks in the file which are not needed.
Is it possible to extract just the hostname and node id values from all the [ndb_mgmd] blocks?
Otherwise, how can I extract those whole blocks using regex (without the [ndb_mgmd] headers preferably)?
Config file example below.
Random unwanted text in config file
[unwanted_block]
hostname=0.0.0.0
NodeId=2
[ndb_mgmd]
hostname=2.2.2.2
NodeId=1
[mysql_unwanted]
hostname=3.3.3.3
NodeId=6
[ndb_mgmd]
hostname=2.2.2.2
NodeId=4
randomconfig
Thanks

You can extract what you need with capture groups:
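Hostnames: ^hostname=(\d+\.\d+\.\d+\.\d+)$
NodeIds: ^NodeId=(\d+)$
Note that these match the hostname and NodeId lines of every block, not only the [ndb_mgmd] ones.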

I've played around, and the following regex seems to work in a regex playground (it may not be the ideal solution); however, it's not working in ServiceNow, where I'm trying to use it.
/^\[ndb_mgmd\][\r\n]?hostname=(.*)[\r\n]?NodeId=(.*)/gm
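A minimal Python sketch of the same idea, restricted to the [ndb_mgmd] blocks. It assumes hostname comes directly before NodeId inside each block, as in the sample above, and it accepts both LF and CRLF line endings (the single optional [\r\n]? in the pattern above cannot consume a two-character CRLF). The names SAMPLE, BLOCK_RE and extract_mgmd_nodes are made up for this illustration.

```python
import re

# Sample taken from the question.
SAMPLE = """Random unwanted text in config file
[unwanted_block]
hostname=0.0.0.0
NodeId=2
[ndb_mgmd]
hostname=2.2.2.2
NodeId=1
[mysql_unwanted]
hostname=3.3.3.3
NodeId=6
[ndb_mgmd]
hostname=2.2.2.2
NodeId=4
randomconfig
"""

# Anchor on the [ndb_mgmd] header so other blocks are skipped.
# \r?\n accepts both Unix and Windows line endings.
BLOCK_RE = re.compile(
    r"^\[ndb_mgmd\]\r?\nhostname=(\S+)\r?\nNodeId=(\d+)",
    re.MULTILINE,
)


def extract_mgmd_nodes(text):
    """Return (hostname, node_id) pairs for every [ndb_mgmd] block."""
    return [(host, int(node_id)) for host, node_id in BLOCK_RE.findall(text)]


print(extract_mgmd_nodes(SAMPLE))  # [('2.2.2.2', 1), ('2.2.2.2', 4)]
```

If the keys can appear in a different order or with other keys between them, a line-by-line parser that tracks the current section header would be more robust than a single regex.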

Related

Regular Expression to select file paths from list of URL

I have a list of URLs in different format that were extracted from a random website:
http://www.w3.org/2000/svg http://www.w3.org/1999/xlink
/bg-images/png/search-magnifying-glass.png
http://www.boston.com/weather?p1=BGMenu_SubnavBostonGlobe.com
http://www.w3.org/2000/svg
http://www.w3.org/1999/xlink
/bg-images/png/search-magnifying-glass.png http://www.w3.org/2000/svg
http://www.w3.org/1999/xlink
/bg-images/png/bg-logo--full.png
http://www.w3.org/2000/svg
http://www.w3.org/1999/xlink
/bg-images/png/bg-logo--bug.png
https://www.bostonglobe.com
https://www.bostonglobe.com
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking
http://www.w3.org/1999/xlink /bg-images/png/bg-logo-large--full.png
http://www.boston.com/section/cars?s_campaign=bg:hp:mainnav:cars
http://realestate.boston.com?s_campaign=bg:hp:mainnav:realestate
http://www.w3.org/2000/svg http://www.w3.org/1999/xlink
They all are in different format (optional http/https/www). I need to filter it to get any kind of "downloadable" content such as *jpg, *png, *html, etc.
Expected output:
/bg-images/png/search-magnifying-glass.png
/bg-images/png/search-magnifying-glass.png
/bg-images/png/bg-logo--full.png
/bg-images/png/bg-logo--bug.png
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking (not sure about these yet just in case)
http://www.w3.org/1999/xlink /bg-images/png/bg-logo-large--full.png
This is my first time writing a regex, and I came up with something like this:
(https?/\/)?(www\.)?[-a-zA-Z0-9#:;%._\+~\/#=]{2,256}\.[a-z]{2,4}a{0,1}\b([-a-zA-Z0-9#:;!%_\+.,~#?&//=]*)
which outputs a lot of junk lines. Any advice?
Since the lines in your sample Input_file have trailing spaces, I am using sub to remove them; if the real file has no trailing spaces, you can drop that part. Could you please try the following and let me know if it helps?
awk '{ sub(/ *$/, "") }   # strip trailing spaces
# keep lines that start like a URL or path and end with a wanted suffix
/^(http|www|\/)/ && /(png|html|jpg|BGHeader_SmartBar_Breaking)$/
' Input_file
Instead of filtering questionable URLs from a questionable feed by name, you need to check each of them, because URLs in general DO NOT carry information about their content. Many storage services identify an image by ID, not by a name with an extension. But the response headers do contain this information:
How to get content type of a web address?
As for what is downloadable: everything. Literally everything you see is downloadable. For images, for example, the content types will be something like these:
image/gif, image/png, image/jpeg, image/bmp, image/webp
For audio/video:
audio/midi, audio/mpeg, audio/webm, audio/ogg, audio/wav
A partial list can be found here: http://htmlbook.ru/html/value/mime
As for a solution: just probe every link in multiple I/O threads. This way you will also be able to filter out links that need authentication, have expired, or were invalid in the first place. These requests are usually pretty cheap.
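The header-probing approach could be sketched in Python roughly as follows, using only the standard library. This is an illustration under assumptions, not a definitive implementation: the WANTED_PREFIXES tuple, the function names, and the base_url parameter (needed to resolve relative paths such as /bg-images/png/bg-logo--full.png) are all made up for this example, and some servers reject or mishandle HEAD requests.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Assumption: the media types treated as "downloadable"; adjust as needed.
WANTED_PREFIXES = ("image/", "audio/", "video/", "text/html")


def is_wanted(content_type):
    """True if a Content-Type header value names a wanted media type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith(WANTED_PREFIXES)


def fetch_content_type(url, timeout=5.0):
    """Send a HEAD request; return the Content-Type, or None on any failure."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.headers.get("Content-Type")
    except (OSError, ValueError):
        # Covers network errors, HTTP error statuses and malformed URLs.
        return None


def filter_links(links, base_url, workers=8):
    """Resolve links against base_url and keep those with a wanted type."""
    absolute = [urljoin(base_url, link) for link in links]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        types = list(pool.map(fetch_content_type, absolute))
    return [url for url, ctype in zip(absolute, types) if is_wanted(ctype)]
```

Calling filter_links(links, "https://www.bostonglobe.com") would issue one HEAD request per link, so links that fail, time out, or require authentication simply drop out of the result.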

Adapting Regular Expression in Django URL to match filepath

So I am currently working on a web application that takes as input the location of a malware file for one of the functions.
This is passed via the views file. However after some altering of the models section of the application I found it was unable to parse the full filepath.
The code below works for the following pcap as input:
8cdddcd3-35fa-468d-8647-816518a9836a435be1c6e904836ad65f97f3eac4cbe19ee7ba0da48178fc7f00206270469165.pcap
url(r'^analyse/(?P<pcap>[\w\-]+\.pcap)$', views.analyse, name='analyse'),
However this code no longer works when it is a pcap containing the full filepath.
/home/freddie/malwarepcaps/8cdddcd3-35fa-468d-8647-816518a9836a435be1c6e904836ad65f97f3eac4cbe19ee7ba0da48178fc7f00206270469165.pcap
Any suggestions or pointers on how I should alter the regular expression to accommodate the full filepath in the string being passed to the route would be very much appreciated.
regex: ((/\w+?)+/)?([\w-]+\.pcap)
django regex: ^analyse(?P<pcap>((/\w+?)+/)?([\w-]+\.pcap))$
Note that there is no slash after analyse, because the slash is now part of pcap.
So analyse/home/freddie/malwarepcaps/foo-bar.pcap should match this pattern, and pcap will be equal to /home/freddie/malwarepcaps/foo-bar.pcap
test:
https://pythex.org/?regex=((%2F%5Cw%2B%3F)%2B%2F)%3F(%5B%5Cw-%5D%2B%5C.pcap)&test_string=8cdddcd3-35fa-468d-8647-816518a9836a435be1c6e904836ad65f97f3eac4cbe19ee7ba0da48178fc7f00206270469165.pcap%20%0A%2Fhome%2Ffreddie%2Fmalwarepcaps%2F8cdddcd3-35fa-468d-8647-816518a9836a435be1c6e904836ad65f97f3eac4cbe19ee7ba0da48178fc7f00206270469165.pcap&ignorecase=0&multiline=0&dotall=0&verbose=0
PS: I think it's better to move such a parameter (the path, e.g. /home/f/m/f.pcap) into the query string (for a GET request) or into the HTTP body (for a POST request),
so that it is easier to obtain the parameter without URL matching.
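The Django pattern from this answer can be checked outside Django with the standard re module. The sketch below is only a quick sanity check; ANALYSE_RE and pcap_from_route are names invented for the illustration, and the route strings are written without a leading slash, which is how Django presents the path to URL patterns.

```python
import re

# Same pattern as the "django regex" line in the answer.
ANALYSE_RE = re.compile(r"^analyse(?P<pcap>((/\w+?)+/)?([\w-]+\.pcap))$")


def pcap_from_route(route):
    """Return the captured pcap argument, or None if the route does not match."""
    match = ANALYSE_RE.match(route)
    return match.group("pcap") if match else None


# Full path: the directories and the file name all land in "pcap".
print(pcap_from_route("analyse/home/freddie/malwarepcaps/foo-bar.pcap"))
# /home/freddie/malwarepcaps/foo-bar.pcap

# Bare file name after a slash: no match, because the optional directory
# group needs at least one "/dir" segment followed by another "/".
print(pcap_from_route("analyse/foo-bar.pcap"))
# None
```

Two limits of the pattern show up in such a check: the old analyse/<file>.pcap form no longer matches, and directory names may only contain \w characters, so a directory with a hyphen in its name would also be rejected.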

Batch rename URLs in XML file

I have a list with URLs and IPs for Office365 in XML format. Now I'd like to either write a script or use a text editor's search and replace function (regex) to automatically change some of these URLs.
Example:
These URLs
<address>scus-odc.officeapps.live.com</address>
<address>scus-roaming.officeapps.live.com</address>
<address>sea-odc.officeapps.live.com</address>
Should be changed to
<address>*.officeapps.live.com</address>
<address>*.officeapps.live.com</address>
<address>*.officeapps.live.com</address>
I would appreciate any input on this issue. Thanks in advance.
Here is what I have tried so far:
1) Search for ..(?=[^.].[^.]*$) and replace with an empty string. This does a good job, but unfortunately it removes the preceding as well...
2) As pointed out by Tim, the list consists of FQDNs with different domains. The list is available from https://go.microsoft.com/fwlink/?LinkId=533185 (this list includes all FQDNs; the IPs will get deleted).
3) Solved with the help of Sergio's input. The solution was to
search for (>)[^.\n\s]+ and substitute with \1\*
I will have to write another script to delete the duplicated domains, but that was not part of the question, so I consider this issue closed. Thank you for your input.
You can use the regex:
(>)[^.\n\s]+
and substitute with \1\*
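For the scripted route, the same search-and-replace can be sketched in Python. The pattern is the one given above; the replacement is written as \1* because the asterisk needs no escaping in a Python replacement string. The function name wildcard_first_label is made up for this example.

```python
import re

# ">" followed by the first host label (everything up to the first dot).
HOST_PREFIX_RE = re.compile(r"(>)[^.\n\s]+")


def wildcard_first_label(xml_text):
    """Replace the first label of each host name after '>' with '*'."""
    return HOST_PREFIX_RE.sub(r"\1*", xml_text)


sample = (
    "<address>scus-odc.officeapps.live.com</address>\n"
    "<address>scus-roaming.officeapps.live.com</address>\n"
    "<address>sea-odc.officeapps.live.com</address>\n"
)
print(wildcard_first_label(sample), end="")
# <address>*.officeapps.live.com</address>
# <address>*.officeapps.live.com</address>
# <address>*.officeapps.live.com</address>
```

The pattern is not specific to the address element: any text that directly follows a ">" on the same line is rewritten, including IP addresses, so it is safest to apply it after the IP entries have been removed.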

Applescript to extract the Digital Object Identifier (DOI) from a PDF file

I looked for an applescript to extract the DOI from a PDF file, but could not find it. There is enough information available on the actual format of the DOI (i.e. the regular expression), but how could I use this to get the identifier from the PDF file?
(It would be no problem if some external program were used, such as Hazel.)
If you're ok with using an app, I'd recommend Skim. Good AppleScript support. I'd probably structure it like this (especially if the document might be large):
set DOIFound to false
tell application "Skim"
    set pp to pages of document 1
    repeat with p in pp
        set t to text of p
        -- look for DOI and set DOIFound to true
        if DOIFound then exit repeat -- if it's not found then use url?
    end repeat
end tell
I'm assuming a DOI always sits on a single page (not split across two). It looks like they are invariably (?) on the first page of an article, which would make this quick, of course, even with a large document.
[edit]
Another way would be to get the Xpdf OSX binaries from http://www.foolabs.com/xpdf/download.html and use pdftotext in the command line (just tested this; it works well) and parse the text using AppleScript. If you want to stay in AppleScript, you can do something like:
do shell script "path/to/pdftotext 'path/to/pdf/file.pdf'"
which would output a file in the same directory with a txt file extension -- you parse that for DOI.
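Parsing the extracted text for a DOI could look like the sketch below. It is written in Python rather than AppleScript and could be called from do shell script in the same way as pdftotext. The regular expression is an assumption: it is a commonly used approximation for modern DOIs (prefix 10., a 4 to 9 digit registrant code, a slash, then a suffix), not an official grammar, and the names DOI_RE and find_doi are invented for this example.

```python
import re

# Assumption: approximate pattern for modern DOIs, not an official grammar.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text):
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_RE.search(text)
    if match is None:
        return None
    # Heuristic: trailing punctuation usually belongs to the sentence.
    return match.group(0).rstrip(".,;)")


print(find_doi("J. Example 12 (2010). doi:10.1000/xyz123. Received 3 May"))
# 10.1000/xyz123
```

To use it on the file written by pdftotext, read that .txt file and pass its contents to find_doi. Stripping trailing punctuation is a trade-off: it would also cut a DOI that legitimately ends with a closing parenthesis.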
Have you tried it with pdfgrep? It works really well on the command line:
pdfgrep -n --max-count 1 --include "*.pdf" "DOI"
I have no idea how to build an AppleScript for this, though, but I would be interested in one too, so that if I drop a PDF into that folder it automatically extracts the DOI and renames the file with the DOI in the filename.

Regexp pattern matching IP and UserAgent in an Huge File

I have a huge log file that has a structure like this:
ip=X.X.X.X
userAgent=Firefox
-----
Referer=hxxp://www.bla.org
I want to create a custom output like this:
ip:userAgent
for ex:
X.X.X.X:Firefox
The pattern should ignore lines which don't start with ip= or userAgent= (these two must form a pair, as I mentioned above).
I am a newbie administrator and our client needs a sorted file immediately.
Any help will be wonderful.
Thanks.
^ip=(\d+(?:\.\d+){3})[\r\n]+userAgent=(.+)$
Apply in global + multiline mode.
Group 1 will contain the IP, group 2 will contain the user agent string.
Edit: The above expression can be simplified a bit by dropping the strict IP address format check, assuming that there will be nothing but real IP addresses in the log file. The repetition is wrapped in a capturing group so that group 1 still holds the whole address:
^ip=((?:\d+\.?)+)[\r\n]+userAgent=(.+)$
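As an illustration of "global + multiline mode", here is a small Python sketch that applies the first, stricter expression with re.MULTILINE and joins the two groups into the requested ip:userAgent form. The sample addresses and the names PAIR_RE and ip_agent_pairs are made up for the example, and the sketch assumes Unix line endings.

```python
import re

# First expression from the answer; re.MULTILINE makes ^ and $ work per line,
# and findall plays the role of the "global" flag.
PAIR_RE = re.compile(
    r"^ip=(\d+(?:\.\d+){3})[\r\n]+userAgent=(.+)$",
    re.MULTILINE,
)


def ip_agent_pairs(log_text):
    """Return 'ip:userAgent' strings for every adjacent ip/userAgent pair."""
    return [f"{ip}:{agent}" for ip, agent in PAIR_RE.findall(log_text)]


sample = (
    "ip=10.0.0.1\n"
    "userAgent=Firefox\n"
    "-----\n"
    "Referer=hxxp://www.bla.org\n"
    "ip=10.0.0.2\n"
    "userAgent=Opera\n"
)
print(ip_agent_pairs(sample))  # ['10.0.0.1:Firefox', '10.0.0.2:Opera']
```

This reads the whole log into memory, which may not be practical for a very large file; a line-by-line approach avoids that.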
You can use:
^ip=((?:[0-9]{1,3}\.){3}[0-9]{1,3})$
And
^userAgent=(.*)$
Get the group 1 for both and you will have the desired data.
Give this a try (it is in no way robust if your log file contains lines that differ from the example snippet above):
sed -n -e '/^ip=/ {s///
N
s/\nuserAgent=/:/
p
}' HugeFile > customoutput
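The sed script works as a stream: on an ip= line it pulls in the next line and joins the two. A rough Python equivalent that also processes one line at a time is sketched below; unlike the sed version, it only emits output when the line right after ip= really is a userAgent= line. The function name pair_lines and the file names in the usage note are assumptions for the example.

```python
def pair_lines(lines):
    """Yield 'ip:userAgent' for each ip= line directly followed by userAgent=."""
    pending_ip = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("ip="):
            # Remember the address and wait for the next line.
            pending_ip = line[len("ip="):]
        elif line.startswith("userAgent=") and pending_ip is not None:
            yield f"{pending_ip}:{line[len('userAgent='):]}"
            pending_ip = None
        else:
            # Any other line breaks the pair.
            pending_ip = None
```

Because a file object yields its lines lazily, a call such as pair_lines(open("HugeFile")) never holds more than one line in memory, and the results can be written to customoutput as they are produced.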