I have a list of URLs in different formats that were extracted from a random website:
http://www.w3.org/2000/svg http://www.w3.org/1999/xlink
/bg-images/png/search-magnifying-glass.png
http://www.boston.com/weather?p1=BGMenu_SubnavBostonGlobe.com
http://www.w3.org/2000/svg
http://www.w3.org/1999/xlink
/bg-images/png/search-magnifying-glass.png http://www.w3.org/2000/svg
http://www.w3.org/1999/xlink
/bg-images/png/bg-logo--full.png
http://www.w3.org/2000/svg
http://www.w3.org/1999/xlink
/bg-images/png/bg-logo--bug.png
https://www.bostonglobe.com
https://www.bostonglobe.com
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking
http://www.w3.org/1999/xlink /bg-images/png/bg-logo-large--full.png
http://www.boston.com/section/cars?s_campaign=bg:hp:mainnav:cars
http://realestate.boston.com?s_campaign=bg:hp:mainnav:realestate
http://www.w3.org/2000/svg http://www.w3.org/1999/xlink
They are all in different formats (optional http/https/www). I need to filter the list to keep any kind of "downloadable" content such as *.jpg, *.png, *.html, etc.
Expected output:
/bg-images/png/search-magnifying-glass.png
/bg-images/png/search-magnifying-glass.png
/bg-images/png/bg-logo--full.png
/bg-images/png/bg-logo--bug.png
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking (not sure about these yet; including them just in case)
http://www.w3.org/1999/xlink /bg-images/png/bg-logo-large--full.png
This is my first time trying to write a regex, and I came up with something like this:
(https?/\/)?(www\.)?[-a-zA-Z0-9#:;%._\+~\/#=]{2,256}\.[a-z]{2,4}a{0,1}\b([-a-zA-Z0-9#:;!%_\+.,~#?&//=]*)
which outputs a lot of trash lines. Any advice?
Since your sample Input_file has trailing spaces at the end of some lines, I am using sub to remove them; if they are not there, you could drop that call. Could you please try the following and let me know if it helps?
awk '{ sub(/ *$/, "") }                    # strip trailing spaces first
     (/^http/ || /^www/ || /^\//) &&
     (/png$/ || /html$/ || /jpg$/ || /BGHeader_SmartBar_Breaking$/)
    ' Input_file
Instead of fetching some questionable URL from some questionable feed, you need to check them manually, because URLs in general DO NOT contain information about their content. Many storage services use an ID to identify an image, not a name with an extension. But headers do contain this information:
How to get content type of a web address?
As to what is downloadable? Everything. I mean literally everything you see is downloadable. For example, for images the content types will be something like these:
image/gif, image/png, image/jpeg, image/bmp, image/webp
For audio/video:
audio/midi, audio/mpeg, audio/webm, audio/ogg, audio/wav
A fairly complete list can be found here: http://htmlbook.ru/html/value/mime
As for a solution: just sniff every link in multiple I/O threads. This way you will also be able to filter out those which need authentication, have expired, or were invalid in the first place. These are usually pretty cheap requests.
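A minimal sketch of that approach in Python, using only the standard library; the URL list, worker count, and timeout here are placeholders to adapt:

import concurrent.futures
import urllib.request

def content_type(url):
    # HEAD the URL and report its Content-Type header
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return url, resp.headers.get("Content-Type", "unknown")
    except Exception as exc:  # auth-protected, expired, or invalid links land here
        return url, "error: %s" % exc

# placeholder list; the relative paths from the question would first need
# urljoin with the site root
urls = [
    "https://www.bostonglobe.com",
    "https://www.bostonglobe.com/bg-images/png/bg-logo--full.png",
]

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for url, ctype in pool.map(content_type, urls):
        print(url, "->", ctype)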
Related
I've been searching and I can't seem to find even a simple grep command to do what I want. I want to take a URL such as r2---sn-vgqs7nes.googlevideo.com, but not r3---sn-2xouxaxq5u5-5cxs.googlevideo.com, and put them into a separate file. Everything between r2---sn- and .googlevideo.com changes. A few examples of the variants:
r2---sn-vgqs7nes.googlevideo.com
r4---sn-ab5l6n67.googlevideo.com
r4---sn-5hnednes.googlevideo.com
r12---sn-ab5l6nsz.googlevideo.com
r6---sn-a5mlrn7d.googlevideo.com
r3---sn-vgqsrn76.googlevideo.com
r6---sn-p5qlsne7.googlevideo.com
r2---sn-qxo7snel.googlevideo.com
r4---sn-q4f7sn7z.googlevideo.com
r1---sn-o097znez.googlevideo.com
r6---sn-q4f7sn7e.googlevideo.com
The characters between sn- and .googlevideo.com are randomized.
Also, r(number) goes up to r20. Basically, I want to extract them from a log file which constantly updates and put them into one that doesn't, so I can use them later. From, let's say, /opt/var/log/messages to /opt/var/log/list. Another thing I'd like to do is check that the URL doesn't already exist in the output file before adding it. Thanks in advance for any help.
@john-goofy The URLs go from r1 to r20 for each variant. URLs such as r3---sn-(2xouxaxq5u5-5cxs).googlevideo.com don't need to be collected. It is important that these variants (the ones with the part in parentheses) are not collected, because blocking them blocks the videos entirely. They also go from r1 to r20, and the part in parentheses doesn't change except for one letter: sn-2xouxaxq5u5-(5cxs).googlevideo.com. So, my desired output would be this:
Not collected:
- (r1-20) ---sn-2xouxaxq5u5-5cxs.googlevideo.com
- (r1-20) ---sn-2xouxaxq5u5-5cxe.googlevideo.com
- (r1-20) ---sn-2xouxaxq5u5-5cx?.googlevideo.com
- (For the third one, I forget the letter.)
- manifest.googlevideo.com
Collected:
Everything else, such as the ones in my OP. I already have a few thousand collected, but doing each one manually takes way too long.
(Blocking all of these gets rid of YouTube ads for the most part. I think some ads are also served from the URLs above, but blocking those blocks everything.)
And it would all go from /opt/var/log/messages to /opt/var/log/list.
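For what it's worth, here is a rough sketch of that extraction in Python rather than grep. The r1-r20 range and the assumption that the sn- token is exactly 8 dash-free characters are taken from the examples above; adjust the pattern if the token length varies:

import re
from pathlib import Path

# r1..r20, then "---sn-" and a dash-free 8-character token, per the samples;
# hosts like sn-2xouxaxq5u5-5cxs carry a second dash and are deliberately not matched
HOST = re.compile(r"\br(?:[1-9]|1[0-9]|20)---sn-[a-z0-9]{8}\.googlevideo\.com\b")

messages = Path("/opt/var/log/messages")
collected = Path("/opt/var/log/list")

seen = set(collected.read_text().split()) if collected.exists() else set()
found = set(HOST.findall(messages.read_text()))

with collected.open("a") as out:
    for host in sorted(found - seen):  # append only hosts not already collected
        out.write(host + "\n")

Run periodically (from cron, say), this would keep /opt/var/log/list growing without duplicates.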
newbie asking first question :)
I'm running a mail server (Ubuntu/Postfix/Dovecot) with SpamAssassin. Most of the known spam is flagged (RBLs, and obvious UCE) except for this particular malspam in attached zip files like "order_info_654321.zip", "paymet_document_123456.zip", and so on, when it doesn't fit any other SA rules. I'd like to procure a rule which drops the matching offenders into oblivion.
After fiddling with regex101.com, I've come up with an expression that matches these patterns exclusively:
/\w+[_][0-9]{6}.zip$/img
Question is... How to format it all, get it to work, and where to put it? So far, I edited /etc/spamassassin/local.cf, added this to the bottom, and restarted:
mimeheader TROJAN_ATTACHED Content-Type =~ /\w+[_][0-9]{6}.zip$/img
describe ZIP_ATTACHED email contains a zip trojan attachment
score TROJAN_ATTACHED 99.
But it doesn't seem to do the magic. Where else can I look for this?
Thank you all,
Keijo.-
Your regex is wrong. You do not need a $ at the end, because filename strings are not necessarily at the end of the Content-Type header. Instead, you can use a word-boundary \b anchor. In my rules, I have the following, and it works perfectly:
mimeheader MIME_FAIL Content-Type =~ /\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh|reg)\b/i
describe MIME_FAIL Blacklisted file extension detected
score MIME_FAIL 5
First up, SA doesn't drop e-mails by default, but it can score them so high on spam content that they don't show up in anyone's inbox. Second, the "ingredients" I started with were incorrect, and they messed up SA's ability to function at all.
This actually did the trick when added into /etc/spamassassin/local.cf:
full TROJAN_ZIPUNDS /\w*[_][\d]{1,6}\.zip/img
score TROJAN_ZIPUNDS 99
describe TROJAN_ZIPUNDS RM zip attached trojan underscore
Even though these spammers switched from zip to rar and from underscores to dashes, changed filenames, and so on, creating rules to counter them became simple after the first one succeeded. Here's what I added too:
full TROJAN_RARDASH /\w*[-][\d]{1,6}\.rar/img
score TROJAN_RARDASH 99
describe TROJAN_RARDASH RM rar attached trojan dash
Also, as first described, I needed to specifically block certain zip file names, which soon morphed to rar and dashes; so morphing the regex and appending it as a rule triad to SpamAssassin's local.cf (and restarting) is holding for now, until the next spam wave :-)
Finally, this is a very, very blunt workaround, so anyone with expertise on the subject is more than welcome to chime in.
You are using the wrong MIME header to check for the filename. Use this instead:
mimeheader TROJAN_ATTACHED Content-Disposition =~ /\w+[_][0-9]{6}\.zip/img
Also make sure you have the MimeHeader plugin loaded.
loadplugin Mail::SpamAssassin::Plugin::MIMEHeader
Hi, I am pretty new to regex; I can do some basic things but am having trouble with this. I need to change a link in an RSS feed.
I have a url like this:
http://mysite.test/Search/PropDetail.aspx?id=38464&id=38464&listingid=129-2-6430678&searchID=250554873&ResultsType=SearchResult
and want to change it to updated site:
http://mysite.test/PropertyDetail/?id=38464&id=38464&listingid=129-2-6430678&searchID=250554873&ResultsType=SearchResult
The only thing changed is /Search/PropDetail.aspx becoming /PropertyDetail/.
I don't have access to the original RSS feed or I would change it there, so I have to use Pipes. Please help, thanks!
Use the regex control.
In it, specify the DOM address of the node containing your link (prefixed by "item.") within the "In" field. For the "replace" field type
(.*)\/Search\/PropDetail\.aspx(.*)
and in the "with" field use:
$1/PropertyDetail/$2
I've escaped the '/' characters in the pattern, which is harmless even where it isn't required; the literal dot in .aspx does need escaping. The second capture group keeps the query string intact. Some trial and error may be needed.
Hopefully this will achieve the result you want.
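If you want to sanity-check the pattern outside Pipes, the same substitution can be tried in Python, where $1 is spelled \1 (the URL is the sample from the question):

import re

url = ("http://mysite.test/Search/PropDetail.aspx"
       "?id=38464&id=38464&listingid=129-2-6430678"
       "&searchID=250554873&ResultsType=SearchResult")

# group 1 keeps the scheme and host, group 2 keeps the query string
print(re.sub(r"(.*)/Search/PropDetail\.aspx(.*)", r"\1/PropertyDetail/\2", url))
# http://mysite.test/PropertyDetail/?id=38464&id=38464&listingid=129-2-6430678&searchID=250554873&ResultsType=SearchResult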
I am using Fluentd, Elasticsearch, and Kibana to organize logs. Unfortunately, these logs are not written in any standard format like Apache's, so I had to come up with the regex for the format myself. I used this site to verify that it works: http://fluentular.herokuapp.com/ .
The logs have roughly this format here:
DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten.
the format regex I am using is as follows:
format /(?<pri>([INFO]|[DEBUG]|[ERROR])+)...(?<date>(\d{2}\.\d{2}\.\d{4})).(?<time>(\d{2}:\d{2}:\d{2})).\[(?<subject>(.*))\].(?<msg>(.*))/
Now, judging by that website that is supposed to test specifically fluentd's behaviour with regexes, the output SHOULD be this one:
Record
Key Value
pri DEBUG
date 24.04.2014
subject SingleActivityStrategy
msg Start Activitiy 'barbecue' zu verabeiten.
Instead, though, I have this "bug" where pri is always shortened to DEBU. The same goes for ERROR, which becomes ERRO; only INFO stays INFO. I am not very experienced with regular expressions and I find it hard to believe that this is a bug, but it confuses me, and any help is greatly appreciated.
I'm not sure I can post the complete config file, because I don't personally own these log files and I am trying to keep this at a level where my boss won't get mad at me for posting sensitive information; should it definitely be needed, I will post it later on after asking him how much I can reveal.
In general, the logs always look roughly like this:
First the priority, which is either DEBUG, ERROR, or INFO; next the date; next what we call the subject, which is always written in [ ]; and finally just a message.
Here is a link to Fluentular with the format I am using and a test string that produces the right result in Fluentular, but not in my config file:
Fluentular
Sorry, I couldn't make it work as a regular clickable link.
Another link to test out regex with my format and test string is this one:
http://rubular.com/r/dfXOkQYNXP
tl;dr version:
my td-agent format regex cuts off the last letter, although fluentular says it shouldn't. My fault or a bug?
How the regex would look if you're trying to match the data specifically:
(INFO|DEBUG|ERROR)\:\s+(\d{2}\.\d{2}\.\d{4})\s(\d{2}:\d{2}:\d{2})\s\[(.*)\](.*)
In your format string, you were using . and ... where your colon and spaces should be. That is also why the priority loses its last letter: [DEBUG] is just a character class, so (?<pri>([INFO]|[DEBUG]|[ERROR])+) matches one letter at a time, and the engine has to give one letter back (leaving DEBU) so that the three dots can consume "G: " before the date can match. I'm not too sure why this appears to work in Fluentular, but you should match the \: explicitly and each space between the values.
So you'd be looking at the following regular expression with the Fluentd fields (which are grouping names):
(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))
Meaning your td-agent.conf should look like:
<source>
type tail
path /var/log/foo/bar.log
pos_file /var/log/td-agent/foo-bar.log.pos
tag foo.bar
format /(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))/
</source>
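As a quick sanity check that the corrected pattern captures the full priority, here is the same regex tried in Python, which spells named groups (?P<name>...) instead of (?<name>...):

import re

pattern = re.compile(
    r"(?P<pri>INFO|ERROR|DEBUG):\s+(?P<date>\d{2}\.\d{2}\.\d{4})\s"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s\[(?P<subject>.*)\]\s(?P<msg>.*)"
)

line = ("DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] "
        "Start Activitiy 'barbecue' zu verabeiten.")
print(pattern.match(line).groupdict())
# 'pri' comes out as the full 'DEBUG', not 'DEBU'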
I would also take a look into comparing Logstash vs. Fluentd. I like Logstash far more because you create Grok filters to match the type of data you want, and it makes formatting your fields much easier because you are providing an abstraction layer, but you essentially will get the same data.
And I would watch out when you're using sites like Rubular, as they are fairly particular about multi-line matching and the like. I'd suggest something like Regexr which gives immediate feedback and you can set global and multiline matching as well.
I recently discussed editors with a co-worker. He uses one of the less popular editors and I use another (I won't say which ones since it's not relevant and I want to avoid an editor flame war). I was saying that I didn't like his editor as much because it doesn't let you do find/replace with regular expressions.
He said he's never wanted to do that, which was surprising since it's something I find myself doing all the time. However, off the top of my head I wasn't able to come up with more than one or two examples. Can anyone here offer some examples of times when they've found regex find/replace useful in their editor? Here's what I've been able to come up with since then as examples of things that I've actually had to do:
Strip the beginning of a line off of every line in a file that looks like:
Line 25634 :
Line 632157 :
Taking a few dozen files with a standard header which is slightly different for each file and stripping the first 19 lines from all of them at once.
Piping the result of a MySQL select statement into a text file, then removing all of the formatting junk and reformatting it as a Python dictionary for use in a simple script.
In a CSV file with no escaped commas, replace the first character of the 8th column of each row with a capital A.
Given a bunch of GDB stack traces with lines like
#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) at ../../mvl/src/mvl_serv.c:850
strip out everything from each line except the function names.
Does anyone else have any real-life examples? The next time this comes up, I'd like to be more prepared to list good examples of why this feature is useful.
Just last week, I used regex find/replace to convert a CSV file to an XML file.
Simple enough to do really: just chop up each field (luckily it didn't have any escaped commas) and push it back out with the appropriate tags in place of the commas.
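The details depend on the actual columns, but the idea looks roughly like this; the column names are made up, and an editor replace with \1-style groups works the same way:

import re

row = "42,Widget,9.99"  # invented columns: id, name, price
print(re.sub(r"(.*?),(.*?),(.*)",
             r"<item><id>\1</id><name>\2</name><price>\3</price></item>",
             row))
# <item><id>42</id><name>Widget</name><price>9.99</price></item>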
Regexes make it easy to replace whole words using word boundaries.
(\b\w+\b)
So you can replace unwanted words in your file without disturbing words like Scunthorpe
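For instance, with a made-up word to replace:

import re

text = "That thorp lies just outside Scunthorpe."
print(re.sub(r"\bthorp\b", "village", text))
# That village lies just outside Scunthorpe. -- the \b anchors leave Scunthorpe alone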
Yesterday I took a create table statement I made for an Oracle table and converted the fields to setString() method calls using JDBC and PreparedStatements. The table's field names were mapped to my class properties, so regex search and replace was the perfect fit.
Create Table text:
...
field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,
....
My Regex Search:
/([a-z_0-9]+) .*/
My Replacement:
pstmt.setString(1, \1);
The result:
...
pstmt.setString(1, field_1);
pstmt.setString(1, field_2);
pstmt.setString(1, field_3);
pstmt.setString(1, field_4);
....
I then went through and manually set the position int for each call and changed the method to setInt() (and others) where necessary, but it came in handy for me. I actually used it three or four times for similar field-to-method-call conversions.
I like to use regexps to reformat lists of items like this:
int item1
double item2
to
public void item1(int item1){
}
public void item2(double item2){
}
This can be a big time saver.
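One plausible pattern pair for that transformation, shown here in Python so the groups are visible (an editor that supports \n in the replacement works the same way):

import re

decls = "int item1\ndouble item2"
print(re.sub(r"(\w+) (\w+)", r"public void \2(\1 \2){\n}", decls))
# public void item1(int item1){
# }
# public void item2(double item2){
# }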
I use it all the time when someone sends me a list of patient visit numbers in a column (say 100-200) and I need them in a '0000000444','000000004445' format. Works wonders for me!
I also use it to pull out email addresses in an email. I send out group emails often and all the bounced returns come back in one email. So, I regex to pull them all out and then drop them into a string var to remove from the database.
I even wrote a little dialog prog to apply regex to my clipboard. It grabs the contents applies the regex and then loads it back into the clipboard.
One thing I use it for in web development all the time is stripping some text of its HTML tags. This might need to be done to sanitize user input for security, or to display a preview of a news article. For example, if you have an article with lots of HTML tags for formatting, you can't just do LEFT(article_text,100) + '...' (plus a "read more" link) and render that on a page, at the risk of splitting an HTML tag apart and breaking the page.
Also, I've had to strip img tags in database records that link to images that no longer exist. And let's not forget web form validation. If you want to make sure a user has entered a correct email address (syntactically speaking) into a web form, this is about the only way of checking it thoroughly.
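The crude version of tag stripping is a one-line substitution; it's fine for previews, though a real HTML parser is safer for sanitizing untrusted input (the article text here is invented):

import re

article = "<p>Breaking: the <b>millionaires tax</b> ruling lands Monday.</p>"
preview = re.sub(r"<[^>]+>", "", article)[:100] + "..."
print(preview)  # Breaking: the millionaires tax ruling lands Monday....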
I've just pasted a long character sequence into a string literal, and now I want to break it up into a concatenation of shorter string literals so it doesn't wrap. I also want it to be readable, so I want to break only after spaces. I select the whole string (minus the quotation marks) and do an in-selection-only replace-all with this regex:
/.{20,60} /
...and this replacement:
/$0"ΒΆ + "/
...where the pilcrow is an actual newline, and the number of spaces varies from one incident to the next. Result:
String s = "I recently discussed editors with a co-worker. He uses one "
+ "of the less popular editors and I use another (I won't say "
+ "which ones since it's not relevant and I want to avoid an "
+ "editor flame war). I was saying that I didn't like his "
+ "editor as much because it doesn't let you do find/replace "
+ "with regular expressions.";
The first thing I do with any editor is try to figure out its regex oddities. I use it all the time. Nothing really crazy, but it's handy when you've got to copy/paste stuff between different types of text - SQL <-> PHP is the one I do most often - and you don't want to fart around making the same change 500 times.
Regex is very handy any time I am trying to replace a value that spans multiple lines. Or when I want to replace a value with something that contains a line break.
I also like that you can match things in a regular expression and not replace the full match using the $# syntax to output the portion of the match you want to maintain.
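For example, matching a larger span but keeping only the captured piece; the field names are made up, and Python spells $1 as \1:

import re

log = "user=alice visit=123\nuser=bob visit=456"
# the whole 'visit=N' pair is matched, but only the captured digits are kept
print(re.sub(r"visit=(\d+)", r"\1", log))
# user=alice 123
# user=bob 456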
I agree with you on points 3, 4, and 5, but not necessarily points 1 and 2.
In some cases 1 and 2 are easier to achieve using an anonymous keyboard macro.
By this I mean doing the following:
Position the cursor on the first line
Start a keyboard macro recording
Modify the first line
Position the cursor on the next line
Stop the recording.
Now all that is needed to modify the next line is to repeat the macro.
I could live without support for regex, but could not live without anonymous keyboard macros.