youtube-dl - Playlist download - Is it possible to skip videos that contain a specific word?

I am trying to download a very long playlist, about 500+ VODs, from Twitch. I would like to skip certain files that contain a specific word in the title. If it's possible, how?

Yes.
--reject-title
You can use a string or a regex. Another, more flexible option is
--match-filter
It's worth typing youtube-dl -h now and then.
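For example, a minimal sketch (the word "rerun" and PLAYLIST_URL are placeholders; per youtube-dl -h, --reject-title takes a regex or caseless substring, while --match-filter compares whole strings or numbers):
# skip every video whose title contains "rerun"
youtube-dl --reject-title 'rerun' 'PLAYLIST_URL'
# a match-filter example: only download videos shorter than two hours
youtube-dl --match-filter "duration < 7200" 'PLAYLIST_URL'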

Related

Regular expression to select file paths from a list of URLs

I have a list of URLs in different formats that were extracted from a random website:
http://www.w3.org/2000/svg http://www.w3.org/1999/xlink
/bg-images/png/search-magnifying-glass.png
http://www.boston.com/weather?p1=BGMenu_SubnavBostonGlobe.com
http://www.w3.org/2000/svg
http://www.w3.org/1999/xlink
/bg-images/png/search-magnifying-glass.png http://www.w3.org/2000/svg
http://www.w3.org/1999/xlink
/bg-images/png/bg-logo--full.png
http://www.w3.org/2000/svg
http://www.w3.org/1999/xlink
/bg-images/png/bg-logo--bug.png
https://www.bostonglobe.com
https://www.bostonglobe.com
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking
http://www.w3.org/1999/xlink /bg-images/png/bg-logo-large--full.png
http://www.boston.com/section/cars?s_campaign=bg:hp:mainnav:cars
http://realestate.boston.com?s_campaign=bg:hp:mainnav:realestate
http://www.w3.org/2000/svg http://www.w3.org/1999/xlink
They are all in different formats (optional http/https/www). I need to filter the list to keep any kind of "downloadable" content such as *.jpg, *.png, *.html, etc.
Expected output:
/bg-images/png/search-magnifying-glass.png
/bg-images/png/search-magnifying-glass.png
/bg-images/png/bg-logo--full.png
/bg-images/png/bg-logo--bug.png
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking (not sure about these yet just in case)
http://www.w3.org/1999/xlink /bg-images/png/bg-logo-large--full.png
This is my first time trying to write a regex, and I came up with something like this:
(https?/\/)?(www\.)?[-a-zA-Z0-9#:;%._\+~\/#=]{2,256}\.[a-z]{2,4}a{0,1}\b([-a-zA-Z0-9#:;!%_\+.,~#?&//=]*)
which outputs a lot of junk lines. Any advice?
Since your sample Input_file has trailing spaces on some lines, I am using sub to remove them; if they are not there, you can drop that part. Could you please try the following and let me know if it helps you.
awk '{sub(/ *$/,"")}
(/^http/ || /^www/ || /^\//) &&
(/png$/ || /html$/ || /jpg$/ || /BGHeader_SmartBar_Breaking$/)
' Input_file
Instead of fetching some questionable URL from some questionable feed, you need to check each one, because URLs in general DO NOT contain information about their content. Many storage services use an ID to identify an image, not a name with an extension. But the response headers do contain this information:
How to get content type of a web address?
As to what is downloadable? Everything. I mean literally everything you see is downloadable. For example, for images the content types will be something like these:
image/gif, image/png, image/jpeg, image/bmp, image/webp
For audio/video:
audio/midi, audio/mpeg, audio/webm, audio/ogg, audio/wav
A partial list can be found here: http://htmlbook.ru/html/value/mime
As for a solution: just sniff every link in multiple I/O threads. That way you will also be able to filter out links that need authentication, have expired, or were invalid in the first place. These are usually pretty cheap requests.
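For instance, a minimal sketch with curl (urls.txt is a hypothetical file with one URL per line; note that a few servers answer HEAD requests incorrectly, in which case a ranged GET is the fallback):
while IFS= read -r url; do
  # -I sends a HEAD request; -w prints the parsed Content-Type
  ctype=$(curl -sI -o /dev/null -w '%{content_type}' "$url")
  printf '%s\t%s\n' "$ctype" "$url"
done < urls.txt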

How to create a script that will pull URLs from one file and put them into another?

I've been searching and I can't seem to find even a simple grep command to do what I want. I want to take URLs such as r2---sn-vgqs7nes.googlevideo.com, but not r3---sn-2xouxaxq5u5-5cxs.googlevideo.com, and put them into a separate file. Everything between r2---sn- and .googlevideo.com changes. A few examples of the variants:
r2---sn-vgqs7nes.googlevideo.com
r4---sn-ab5l6n67.googlevideo.com
r4---sn-5hnednes.googlevideo.com
r12---sn-ab5l6nsz.googlevideo.com
r6---sn-a5mlrn7d.googlevideo.com
r3---sn-vgqsrn76.googlevideo.com
r6---sn-p5qlsne7.googlevideo.com
r2---sn-qxo7snel.googlevideo.com
r4---sn-q4f7sn7z.googlevideo.com
r1---sn-o097znez.googlevideo.com
r6---sn-q4f7sn7e.googlevideo.com
The characters between sn- and .googlevideo.com are randomized.
Also, r(number) goes up to r20. Basically, I want to extract them from a log file which constantly updates into one that doesn't, so I can use them later; from, let's say, /opt/var/log/messages to /opt/var/log/list. Another thing I'd like to do is check that a URL doesn't already exist in the output file before adding it. Thanks in advance for any help.
@john-goofy The URLs go from r1 to r20 for each variant. URLs such as r3---sn-(2xouxaxq5u5-5cxs).googlevideo.com don't need to be collected. It's important that the variants with the part in parentheses are not collected, because blocking those blocks the videos entirely. They also go from r1 to r20, and the part in parentheses doesn't change except for one letter: sn-2xouxaxq5u5-(5cxs).googlevideo.com. So my desired output would be this:
Not collected:
- (r1-20) ---sn-2xouxaxq5u5-5cxs.googlevideo.com
- (r1-20) ---sn-2xouxaxq5u5-5cxe.googlevideo.com
- (r1-20) ---sn-2xouxaxq5u5-5cx?.googlevideo.com
- (a third one whose last letter I forget, hence the ? above)
- manifest.googlevideo.com
Collected:
Everything else, such as the ones in my OP. I already have a few thousand collected, but doing each one manually takes way too long.
(Blocking all of these gets rid of YouTube ads for the most part. There are some, I think, included in the above URLs, but blocking them blocks everything.)
And it would all go from /opt/var/log/messages to /opt/var/log/list.
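A minimal sketch of one way to do this (assuming, as in the examples above, that the wanted hostnames always have exactly eight alphanumeric characters after sn-, which excludes both the dashed sn-2xouxaxq5u5-5cx? variants and manifest.googlevideo.com):
touch /opt/var/log/list
grep -oE 'r[0-9]{1,2}---sn-[a-z0-9]{8}\.googlevideo\.com' /opt/var/log/messages |
sort -u |
while IFS= read -r host; do
  # append only hostnames not already in the list
  grep -qxF "$host" /opt/var/log/list || echo "$host" >> /opt/var/log/list
done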

Get SpamAssassin to drop emails whose attached filenames match a specific regex

Newbie here, asking my first question :)
I'm running a mail server (Ubuntu/Postfix/Dovecot) with SpamAssassin. Most of the known spam is flagged (RBLs and obvious UCE), except for this particular malspam with attached zip files like "order_info_654321.zip", "paymet_document_123456.zip", and so on, when it doesn't fit any other SA rules. I'd like to write a rule which drops the matching offenders into oblivion.
After fiddling with regex101.com, I've come up with an expression that matches these patterns exclusively:
/\w+[_][0-9]{6}.zip$/img
The question is: how do I format it all, how do I get it to work, and where do I put it? So far, I edited /etc/spamassassin/local.cf, added this at the bottom, and restarted:
mimeheader TROJAN_ATTACHED Content-Type =~ /\w+[_][0-9]{6}.zip$/img
describe ZIP_ATTACHED email contains a zip trojan attachment
score TROJAN_ATTACHED 99.
But it doesn't seem to do the magic. Where else can I look for this?
Thank you all,
Keijo.-
Your regex is wrong. You do not need a $ char at the end, because filename strings are not necessarily at the end of the Content-Type header. Instead, you can use a word-boundary \b anchor. In my rules, I have the following, and it works perfectly:
mimeheader MIME_FAIL Content-Type =~ /\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh|reg)\b/i
describe MIME_FAIL Blacklisted file extension detected
score MIME_FAIL 5
First up, SA doesn't drop e-mails by default, but it can score them so high on spam content that they don't show up in anyone's inbox. Second, the "ingredients" I started with were incorrect, and they messed up SA's ability to function at all.
This actually did the trick when added into /etc/spamassassin/local.cf:
full TROJAN_ZIPUNDS /\w*[_][\d]{1,6}\.zip/img
score TROJAN_ZIPUNDS 99
describe TROJAN_ZIPUNDS RM zip attached trojan underscore
Even though these spammers switched from zip to rar, from underscores to dashes, changed filenames, and so on, creating rules to counter them became simple after the first one succeeded. Here's what I added too:
full TROJAN_RARDASH /\w*[-][\d]{1,6}\.rar/img
score TROJAN_RARDASH 99
describe TROJAN_RARDASH RM rar attached trojan dash
Also, as first described, I needed to specifically block certain zip file names, which soon morphed to rar and dashes; so morphing the regex and appending it as a rule triad to SpamAssassin's local.cf (and restarting) is holding up for now, until the next spam wave :-)
Finally, this is a very very blunt workaround, so anyone with expertise on the subject is more than welcome to chime in.
You are using the wrong mime header to check for the filename. Use this instead:
mimeheader TROJAN_ATTACHED Content-Disposition =~ /\w+[_][0-9]{6}.zip/img
Also make sure you have the MimeHeader plugin loaded.
loadplugin Mail::SpamAssassin::Plugin::MIMEHeader
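Either way, it helps to confirm that the config parses and that the rule actually fires; a quick check from the shell (sample.eml is a hypothetical saved copy of one of the offending messages):
spamassassin --lint                           # syntax-check local.cf and loaded plugins
spamassassin -t < sample.eml | grep TROJAN    # test mode: the report lists the rules that hit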

Very basic image renaming with regex

I spent most of yesterday putting together a collection of regular expressions to convert all my image names and paths to lower case. Today, I processed a folder full of files and was surprised to discover that many image names are still capitalized.
So I decided to try it one step at a time, first renaming .jpg's, then .gif's, .png's, etc.
I'm working on a Mac, using Dreamweaver and TextWrangler as my text editors. The following regex works for jpg's, with one major flaw: it deletes the extension...
([\w/-]+)\.jpe?g
\L\1
In other words, it changes South-America.jpg to south-america.
How can I change it so that it retains the file extension? I assume I can then just change it to...
([\w/-]+)\.png
\L\1
...to process png's, etc.
([\w\/-]+)(\.jpe?g)
and replace with \L\1\2
It's deleting your extension because you never save it in a match group.
You could perhaps capture the extension too?
([\w/-]+)(\.jpe?g)
\L\1\2
And I think you should be able to use something like this for all the files:
([\w/-]+)(\.[^.]+$)
\L\1\2
Or if you specifically want to convert those jpegs, pngs and gifs:
([\w/-]+)(\.(?:jpe?g|gif|png))
\L\1\2
If it's okay for the extension to become lowercase as well, you could just do
^(.*)$
\L\1
As long as you're certain that all lines contain file names.
If you want to process only certain file formats, use
^(.*\.(jpe?g|png|gif))$
\L\1
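Outside a text editor, the same lowercasing can be done from the Mac Terminal. A minimal sketch (renames file names only, leaves directory names alone, and assumes no newlines in file names):
find . -depth \( -iname '*.jpg' -o -iname '*.jpeg' -o -iname '*.png' -o -iname '*.gif' \) |
while IFS= read -r f; do
  dir=$(dirname "$f")
  base=$(basename "$f")
  # lowercase the base name and rename only if it actually changes
  lower=$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]')
  [ "$base" = "$lower" ] || mv "$f" "$dir/$lower"
done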

Using regex to eliminate chunks in a file (categorized events in iCal file)

I have one .ics file from which I would like to create individual new .ics files depending on the event categories (I can't get eGroupware to export only events of one category, so I want to create new calendars by category). My intended approach is to repeatedly eliminate all events except those of one category and then save the file, using EditPad Lite 7 (Windows).
I am struggling to get the regular expression right. .+? is still too greedy, and negating the string (e.g. to eliminate all but events of one category) doesn't work either.
Sample
BEGIN:VEVENT
DESCRIPTION:Event 2
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:Event 3
CATEGORIES:Sports
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:Event 4
END:VEVENT
The regex BEGIN:VEVENT.+?CATEGORIES:Sports.+?END:VEVENT should only match sports events, but it catches everything from the first BEGIN to the first END following the category.
Edit: negating doesn't work either: BEGIN:VEVENT.+?((?!CATEGORIES:Sports).).+?END:VEVENT.
What am I missing? Any pointers are highly appreciated.
I guess newlines are removed or ignored, because your regex does not account for them.
I only have a correction to the match after CATEGORIES:
BEGIN:VEVENT.+?CATEGORIES:Sports.*?END:VEVENT
Note the .*? (zero or more) instead of .+? before END:VEVENT.
The first part of your regex looks good; maybe the regex engine in EditPad is not so good.
Try it in a different editor or scripting language (like Eclipse, Perl, Notepad++ or Notepad2).
You could split the input and then grep the matching Sports events:
my @sportevents = grep /Sports/, split /END:VEVENT/, $input;
$_ .= "END:VEVENT" for @sportevents;
This is Perl; maybe you can launch a script from EditPad to do it.
The second line just restores the END:VEVENT that was stripped during the split.
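The same split-and-filter idea works as a one-off shell command, assuming GNU awk (gawk treats a multi-character RS as a regex; calendar.ics and sports.ics are placeholder names):
gawk 'BEGIN { RS = "END:VEVENT\n"; ORS = "" }
      /CATEGORIES:Sports/ { print $0 "END:VEVENT\n" }' calendar.ics > sports.ics
Note that the first matching record may still carry the calendar header lines, and the BEGIN:VCALENDAR/END:VCALENDAR wrapper of the new file has to be re-added by hand.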
OK, solved it. I found something here which can be used to split ics files. I tweaked it to use the category rather than the summary in the file name, and then merged the individually generated files according to category. I added the usual ics header and footer to all files and, voilà, I had individual calendar files.