grep complete resource url within a file - regex

I have to search and extract within a file addresses like these:
http://deimos.apple.com/WebObjects/Core.woa/DownloadRedirectedTrackPreview/unina.it-dz.5373092572.05373092574.12739786322/enclosure.m4v
There are 38 links, and only the last series of digits changes.
I tried with this regexp:
grep -io 'http://ex[a-z.-]*/[a-z0-9+-]*/[a-z0-9.,-+]*[.m4v]'
It extracts all the URLs in the file that point to an m4v file, but not the complete URL. I get a partial URL like this:
http://deimos.apple.com/WebObjects/Core.woa/DownloadRedirectedTrackPreview/unina.
Where am I wrong?
I can't figure out why it happens.
Thanks a lot for your effort.

Your regex and your extracted URL do not match. The URL that you list does not begin with:
http://ex
which your regex requires. You could change your regex to something more like this, which would match your URL when used with grep -Eio (note the dot inside the character classes, since the host and several path components contain dots):
'http://([a-z0-9.+-]+/)*[a-z0-9.+-]+\.m4v'

Sorry Jonathan, it was a typing mistake when I posted. The regex I actually ran used dei, not ex as written.
But the problem persisted.
Marc opened my mind.
I knew how the address starts so I tried with
grep -io 'http://dei*/*.m4v'
no success :-(
fedorqui gave the last hint, maybe the problem was a dot
so I tried
grep -io 'http://deimos.*/*.m4v' :-D
and it did the trick!
Now I have the file to give to wget to automate multiple file downloads without soiling my system with emulators and proprietary software.
The files are podcasts of law lectures released free as in freedom, but easy to get only for those who buy Apple or Microsoft products (iTunes).
Thanks to all indeed!!
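For anyone who prefers a script to grep, here is a minimal Python sketch of the same extraction. The sample text and the second URL are made up for illustration; only the first URL comes from the question. One thing worth noting: [.m4v] in the original attempt is a character class that matches a single character (a dot, m, 4 or v), not the literal suffix .m4v.

```python
import re

# Hypothetical sample text; only the first URL is taken from the question.
text = (
    'see <a href="http://deimos.apple.com/WebObjects/Core.woa/'
    'DownloadRedirectedTrackPreview/unina.it-dz.5373092572.05373092574.'
    '12739786322/enclosure.m4v">lesson 1</a> and '
    'http://deimos.apple.com/WebObjects/Core.woa/other/page.html'
)

# [^\s"'<>]+? lazily consumes URL characters up to the first literal ".m4v".
# The escaped dots (\.) match a real dot instead of "any character".
M4V_URL = re.compile(r"""http://deimos\.[^\s"'<>]+?\.m4v""", re.IGNORECASE)

urls = M4V_URL.findall(text)
for url in urls:
    print(url)
```

Writing the printed lines to a file gives a list that can be handed to wget -i, as in the workflow described above.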

Related

Regex removing bold markdown from inside codeblock only

I'm editing in bulk some markdown files to be compliant with mkdocs syntax (material theme).
My previous documentation software accepted bold inside code blocks, but I have now discovered that this is far from standard.
I have more than 10k code blocks in this documentation, spread over more than 300 md files in nested directories, and most of them use ** to bold some words.
To be precise, I need to turn every code block from this:
this is a **code block** with some commands
```my_lexer
enable
configure **terminal**
interface **g0/0**
```
to this
this is a **code block** with some commands
```my_lexer
enable
configure terminal
interface g0/0
```
The fun parts:
there are bold words in the rest of the document that I would like to keep (outside code blocks)
not every row of a code block has bold in it
not every code block necessarily has bold in it
Right now I'm using Visual Studio Code with replace in files, and most of the easy regexes I wrote for the porting are working. Its regex syntax is not quite standard, though (for example, groups are referenced with $1 instead of \1, and there may be other differences I don't know about).
I would accept other software (and regex flavors) too, if it is more standard and supports 'replace in all files and subdirectories' (like Notepad++, Atom, etc.).
Sadly, I don't even know how to start something so complicated.
The most advanced thing I came up with is this: https://regex101.com/r/vRnkop/1 (the text I'm using to test it is there too)
(^```.*\n)(.*?\*\*(.*?)\*\*.*$\n)*
I doubt this is a good start!
Thanks
Visual Studio Code is not my forte, but I read that you should be able to use PCRE2 regex syntax. Therefore, try substituting the following pattern with an empty string:
\*\*(?=(((?!^```).)*^```)(?:(?1){2})*(?2)$)
See an online demo. The pattern seems a bit rocky, and maybe someone else knows a much simpler one. However, I did want to make sure this would both leave italic alone and turn bold+italic into italic. Note that . matches newline here.
If you have Unix tools like sed, it is quite easy:
sed '/^```my_lexer/,/^```/ s/\*\*//g' orig.md >new.md
/regex1/,/regex2/ cmd looks for a group of lines where the first line matches the first regex and the final line matches the second regex, and then runs cmd on each of them. This limits the replacements to the relevant sections of the file.
s/\*\*//g does the search and replace (I have assumed that every instance of ** should be deleted).
Some versions of sed allow "in-place" editing with -i. For example, to edit file.md and keep original version as file.md.orig:
sed -i.orig '...' file.md
and you can edit multiple files with something like:
find . -name '*.md' -exec sed -i.orig '...' {} +
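If sed is not available, the same range idea can be sketched in Python. This is only an illustration under a couple of assumptions: fences always start at the beginning of a line, and every fenced block should be cleaned, whatever its lexer (the sed command above only touches my_lexer blocks). The function name strip_bold_in_fences is made up for this example.

```python
FENCE = "`" * 3  # three backticks, the marker that opens and closes a code block


def strip_bold_in_fences(markdown: str) -> str:
    """Remove ** markers only on lines that sit inside fenced code blocks."""
    out = []
    in_fence = False
    for line in markdown.splitlines(keepends=True):
        if line.startswith(FENCE):
            in_fence = not in_fence  # an opening or closing fence line
            out.append(line)
        elif in_fence:
            out.append(line.replace("**", ""))
        else:
            out.append(line)  # text outside code blocks keeps its bold
    return "".join(out)


sample = (
    "this is a **code block** with some commands\n"
    + FENCE + "my_lexer\n"
    + "enable\n"
    + "configure **terminal**\n"
    + "interface **g0/0**\n"
    + FENCE + "\n"
)
print(strip_bold_in_fences(sample))
```

To process a whole tree, one option is to loop over pathlib.Path(".").rglob("*.md"), read each file, pass the text through the function and write it back, ideally after making a backup as with sed -i.orig.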

Use Dreamweaver regex to add file extension

I have a project where I've exported an HTML file to be sanitized in preparation for a language translation. The problem is that the internal links do not have the ".html" extension. I've solved the problem of erasing the long file paths, but appending the extension to the remaining file name is the problem.
The raw file path is:
href="https://oldsite.com/folder1/folder2/folder3/actualpage
I use this regex to find all instances of "https://oldsite.com" and its subfolders, adapting it to the number of subfolders I have:
(https://oldsite.com)+/[a-zA-Z0-9]+/[a-zA-Z0-9]\w+/[a-zA-Z0-9]\w+/[a-zA-Z0-9]\w+/[a-zA-Z0-9]\w+
This leaves me with:
href="actualpage"
The ideal result should be:
href="actualpage.html"
I've been researching this for hours and can't figure out how to append ".html" to the page.
I'm even open to an application or script that can automate this process.
Thanks in advance.
After some research and some tutorials, I found a regex that did the trick. After shortening the file paths to one level, I used the following:
In Dreamweaver:
Find:
href="(.*)" title=
Replace:
href="$1.html" title=
I performed a massive Find/Replace and was able to fix 1500 files in minutes. Regex is my jam!
I hope this helps other regex noobs like myself.
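For the "application or script" route mentioned in the question, here is a rough Python sketch that strips the path and appends the extension in one pass. The link markup is an assumed shape based on the question's example, and the title attribute is only there to mirror the Dreamweaver pattern.

```python
import re

# Assumed shape of a link, based on the example path in the question.
html = '<a href="https://oldsite.com/folder1/folder2/folder3/actualpage" title="Page">Page</a>'

# (?:[^"/]+/)* skips any number of folders; group 1 keeps only the last segment.
# [^"/]+ cannot cross a slash or a quote, so the match stays inside one href.
LINK = re.compile(r'href="https://oldsite\.com/(?:[^"/]+/)*([^"/]+)"')

fixed = LINK.sub(r'href="\1.html"', html)
print(fixed)
```

Compared with href="(.*)" title=, the negated character class avoids the greedy .* running on to a later title= on the same line.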

regex to remove hyperlinks

Input:
source http://www.emaxhealth.com/1275/misdiagnosing from here http://www.cancerresearchuk.org/about-cancer/type recounting her experiences and thoughts blog http://fty720.blogspot.com even carried the new name. She was far from home.
From the above input I want to remove the hyperlinks. Below is the regex that I am trying:
http://[\w|\W|\d|\s]*(?=[ ])
I expected this regex to match all characters, digits and whitespace after encountering 'http' and to continue until the first blank space.
Unfortunately, it is not working as expected. Please help me find my error. Thanks
Try this sed command
sed 's/http[^ ]\+//g' FileName
Output :
source from here recounting her experiences and thoughts blog even carried the new name. She was far from home.
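As a side note, here is a small Python sketch of the same idea, which also shows why the regex in the question overshoots: the class [\w|\W|\d|\s] matches any character at all, and * is greedy, so the match runs to the last position that is followed by a space. The variable names are just for illustration.

```python
import re

text = (
    "source http://www.emaxhealth.com/1275/misdiagnosing from here "
    "http://www.cancerresearchuk.org/about-cancer/type recounting her "
    "experiences and thoughts blog http://fty720.blogspot.com even carried "
    "the new name. She was far from home."
)

# The pattern from the question: one greedy match from the first URL
# up to the last space in the text.
greedy = re.sub(r"http://[\w|\W|\d|\s]*(?=[ ])", "", text)

# \S+ stops at the first whitespace, so each URL is matched on its own.
# The optional \s? also removes one following space to avoid double spaces.
cleaned = re.sub(r"https?://\S+\s?", "", text)

print(greedy)
print(cleaned)
```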
To find the hyperlink use:
\b(https?)://[A-Z0-9+&@#/%?=~_|$!:,.;-]*[A-Z0-9+&@#/%=~_|$]
Or, if you want to find the HTML a tag, use:
<a\b[^>]*>(.*?)</a>

Find file names using find command and regex, functioning improperly

We have a Samba server that is backing up to an S3 bucket. It turns out that a large number of file names contain inappropriate characters, and the AWS CLI won't allow the transfer of those files. Using the "worst offender", I built a quick regex check, tested in Rubular against another file name, to try to generate a list of files that need to be fixed:
([中文网页我们的团队孙é¹â€“¦]+)
The command I'm running is:
find . -regextype awk -regex ".*/([中文网页我们的团队孙é¹â€“¦]+)"
This brings back a small list of files whose names contain the above string in order, rather than files with the individual characters scattered through the name. This leads me to believe that either my regextype is incorrect or something is wrong with the formatting of the list of characters. I've tried the emacs and egrep types, as they seem most similar to the regex I've used outside of a Unix environment, with no luck.
My test file name is: this-is-my€™s'-test-_ folder-name. which, according to my rubular tests, should be returned but isn't. Any help would be greatly appreciated.
find -regex has to match the whole path, so your regex .*/([中文网页我们的团队孙é¹â€“¦]+) only matches when everything after a slash consists of these special characters, and your test file name contains many other characters as well.
You might try something more like .*[中文网页我们的团队孙é¹â€“¦]+.* instead.
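The whole-path behaviour is easy to reproduce with Python's re.fullmatch, which (like find -regex) only succeeds if the pattern covers the entire string. This is just a sketch: BAD_CHARS holds two of the offending characters (the euro and trademark signs) written as escapes, and the ./share/ prefix is a made-up directory.

```python
import re

# Two of the offending characters, written as escapes; the full list from
# the question would go inside the brackets.
BAD_CHARS = "[\u20ac\u2122]"

# Hypothetical path built around the test file name from the question.
path = "./share/this-is-my\u20ac\u2122s'-test-_ folder-name."

# Like find -regex: the pattern has to cover the whole path.
whole_original = re.fullmatch(".*/(" + BAD_CHARS + "+)", path)
whole_fixed = re.fullmatch(".*" + BAD_CHARS + "+.*", path)

print(whole_original is not None)  # original pattern
print(whole_fixed is not None)     # pattern with .* on both sides
```

The original pattern only accepts paths whose last component is made up entirely of the listed characters, while the second one accepts any path that contains at least one of them.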

Removing Different URLs with Regex

I am looking to remove a ton of bad spam URL links from my forums using regex in either grep or vim and subsequently using find/replace commands. I am looking for a way to select just the bad URLs to do that.
All of the URLs are different and are preceded by \n________\n. (That's 8 underscores.)
Here is an example of one of the URLs:
\n________\n[URL=http://boxvaporizers.com]Box Vaporizers[/URL]
So basically I was trying to use the \n... and the [/URL] as boundaries to select that and everything in between. What I came up with is this:
[\\]n[_][_][_][_][_][_][_][_][\\]n.*\[\/URL\]]
Using that does not correctly close the search and selects pretty much everything. I am very new at this and appreciate any insight. Thanks.
Assuming GNU ERE, this should work:
\\n_{8}\\n\s\[URL=(.*)].*\[/URL]
RegexBuddy seems to agree with me:
That said,
> grep -E \\n_{8}\\n\s\[URL=(.*)].*\[/URL] test.txt
doesn't work on my system (Cygwin with GNU grep 2.6.3; test.txt's contents are shown in the screenshot above).
If you want to give sed a chance, the following will do the job:
sed 's/^.*\(\[URL.*\)$/\1/' file.txt
PS: You can do the same with :s/^.*\(\[URL.*\)$/\1/ in your vi session as well.
OUTPUT
For the file.txt that contains:
\n________\n[URL=http://boxvaporizers.com]Box Vaporizers[/URL]
It produces:
[URL=http://boxvaporizers.com]Box Vaporizers[/URL]
In Vim this should remove all lines that match the pattern:
:g/\\n\%(\\_\)\{8}\\n \[URL=.\{-}\/URL\]/d
That pattern matches the sample text taken literally, all in one line.
I was actually able to do this in Microsoft Word using the following:
[\\]n_{8}[\\]n?*/URL\]
Thank you for all the input, couldn't have done it without the help!
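For completeness, here is a minimal Python sketch of the same removal, assuming (as in the sample) that \n appears in the data as a literal backslash followed by n. The surrounding post text is invented for the example.

```python
import re

# Raw string: the \n sequences are a literal backslash + n, as in the forum data.
post = r"Nice thread, thanks!\n________\n[URL=http://boxvaporizers.com]Box Vaporizers[/URL]"

# \\n matches a literal backslash followed by n, _{8} exactly eight underscores.
# .*? is lazy, so the match ends at the first [/URL] instead of the last one.
SPAM = re.compile(r"\\n_{8}\\n\s?\[URL=.*?\[/URL\]")

cleaned = SPAM.sub("", post)
print(cleaned)
```

The lazy .*? matters when a line holds more than one spam block, since a greedy .* would also swallow any legitimate text between the first and the last [/URL].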