Removing Different URLs with Regex - regex

I want to remove a large number of spam URL links from my forums using regex in either grep or vim, followed by find/replace commands. I am looking for a way to select just the bad URLs.
All of the URLs are different and are preceded by \n________\n (that's 8 underscores).
Here is an example of one of the URLs:
\n________\n[URL=http://boxvaporizers.com]Box Vaporizers[/URL]
So basically I was trying to use the \n... and the [/URL] as boundaries to select them and everything in between. What I came up with is this:
[\\]n[_][_][_][_][_][_][_][_][\\]n.*\[\/URL\]]
That does not close the match correctly and selects pretty much everything. I am very new at this and appreciate any insight. Thanks.

Assuming GNU ERE, this should work:
\\n_{8}\\n\s\[URL=(.*)].*\[/URL]
RegexBuddy seems to agree with me:
That said,
> grep -E \\n_{8}\\n\s\[URL=(.*)].*\[/URL] test.txt
doesn't work on my system (Cygwin with GNU grep 2.6.3; test.txt's contents are shown in the screenshot above).
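For comparison, here is a minimal Python sketch of the same idea, assuming the forum text really contains the two-character sequence backslash-n (not an actual newline) around the eight underscores. It uses a non-greedy `.*?` so the match stops at the first `[/URL]` instead of running to the last one on the line, and `\s*` so whitespace after the separator is optional. The names `SPAM_SIGNATURE` and `strip_spam_links` and the second sample link are made up for illustration.

```python
import re

# Literal "\n", eight underscores, literal "\n", then a single [URL=...]...[/URL] tag.
# ".*?" is non-greedy, so the match ends at the first closing [/URL].
SPAM_SIGNATURE = re.compile(r"\\n_{8}\\n\s*\[URL=[^\]]*\].*?\[/URL\]")


def strip_spam_links(text: str) -> str:
    """Remove every spam signature and leave other [URL] tags untouched."""
    return SPAM_SIGNATURE.sub("", text)


sample = (
    r"Nice post!\n________\n[URL=http://boxvaporizers.com]Box Vaporizers[/URL]"
    r" and more text [URL=http://ok.example]kept[/URL]"
)
print(strip_spam_links(sample))
```

Only links preceded by the underscore separator are removed; the second link in the sample survives because nothing marks it as spam.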

If you want to give sed a chance, the following will do the job:
sed 's/^.*\(\[URL.*\)$/\1/' file.txt
PS: You can do the same with :s/^.*\(\[URL.*\)$/\1/ in your vi session as well.
OUTPUT
For the file.txt that contains:
\n__\n[URL=http://boxvaporizers.com]Box Vaporizers[/URL]
It produces:
[URL=http://boxvaporizers.com]Box Vaporizers[/URL]

In Vim this should remove all lines that match the pattern:
:g/\\n\%(\\_\)\{8}\\n \[URL=.\{-}\/URL\]/d
That pattern matches the sample text taken literally, all in one line.

I was actually able to do this in Microsoft Word using the following:
[\\]n_{8}[\\]n?*/URL\]
Thank you for all the input, couldn't have done it without the help!

Related

Regex removing bold markdown from inside codeblock only

I'm editing in bulk some markdown files to be compliant with mkdocs syntax (material theme).
My previous documentation software accepted bold inside code blocks, but I have now discovered that this is far from standard.
I have more than 10k code blocks in this documentation, spread over more than 300 md files in nested directories, and most of them use ** to bold some words.
To be precise, I need to turn every code block from this:
this is a **code block** with some commands
```my_lexer
enable
configure **terminal**
interface **g0/0**
```
to this
this is a **code block** with some commands
```my_lexer
enable
configure terminal
interface g0/0
```
The fun parts:
there are bold words in the rest of the document I would like to maintain (outside code block)
not every row of the code block has bold in it
not even every code block has necessarily bold in it
Right now I'm using Visual Studio Code with its replace-in-files feature, and most of the easy regexes I wrote for the port are working. But its regex syntax is not quite standard (for example, groups are denoted with $1 instead of \1, and there may be other differences I don't know about).
I'm open to other software (and regex flavors) too, if it is more regex compliant and supports 'replace in all files and subdirectories' (like Notepad++, Atom, etc.).
Sadly, I don't even know how to start on something this complicated.
The most advanced I did is this: https://regex101.com/r/vRnkop/1 (there is also the text i'm using to test it)
(^```.*\n)(.*?\*\*(.*?)\*\*.*$\n)*
I hardly think this is a good start to do that!
Thanks
Visual Studio is not my forte, but I read that you should be able to use PCRE2 regex syntax. Therefore, try substituting the following pattern with an empty string:
\*\*(?=(((?!^```).)*^```)(?:(?1){2})*(?2)$)
See an online demo. The pattern seems a bit rocky, and maybe someone else knows a much simpler one. However, I wanted to make sure it would both leave italics alone and turn bold+italic into italic. Note that . matches newline here.
If you have Unix tools like sed, it is quite easy:
sed '/^```my_lexer/,/^```/ s/\*\*//g' orig.md >new.md
/regex1/,/regex2/ cmd looks for a group of lines where the first line matches the first regex and the final line matches the second regex, and then runs cmd on each of them. This limits the replacements to the relevant sections of the file.
s/\*\*//g does the search and replace (I have assumed any instance of ** should be deleted).
Some versions of sed allow "in-place" editing with -i. For example, to edit file.md and keep original version as file.md.orig:
sed -i.orig '...' file.md
and you can edit multiple files with something like:
find . -name '*.md' -exec sed -i.orig '...' {} +
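If sed is not available (for example on Windows), the same range idea can be sketched in Python. This is an assumption-laden sketch rather than a drop-in tool: it treats every line starting with three backticks as a fence toggle (so it covers any lexer, not only my_lexer), it assumes fences are never nested or indented, and the function names `strip_bold_in_fences` and `rewrite_tree` are made up here.

```python
from pathlib import Path


def strip_bold_in_fences(markdown: str) -> str:
    """Delete ** only on lines that sit between an opening and a closing fence."""
    out = []
    in_fence = False
    for line in markdown.splitlines(keepends=True):
        if line.startswith("```"):
            in_fence = not in_fence      # fence lines toggle the state and are kept as-is
            out.append(line)
        elif in_fence:
            out.append(line.replace("**", ""))
        else:
            out.append(line)             # bold outside code blocks is preserved
    return "".join(out)


def rewrite_tree(root: str) -> None:
    """Apply the fix to every .md file under root, keeping a .orig backup of changed files."""
    for path in Path(root).rglob("*.md"):
        original = path.read_text(encoding="utf-8")
        updated = strip_bold_in_fences(original)
        if updated != original:
            path.with_name(path.name + ".orig").write_text(original, encoding="utf-8")
            path.write_text(updated, encoding="utf-8")
```

Calling `rewrite_tree("docs")` (the directory name is just an example) would mirror the find/sed one-liner, including the `.orig` backups.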

Textmate Regex Issue

I have a very large .CSV document containing text I need to remove. The data looks like this:
774431994&images=774431994,774431996,774431998,774432000,774432003,774432006,774432009&formats=0,0,0,0,0 /1/6/9/5/2/6/8/webimg/774431994
774431996&images=774431994,774431996,774431998,774432000,774432003,774432006,774432009&formats=0,0,0,0,0 /1/6/9/5/2/6/8/webimg/774431996
774431998&images=774431994,774431996,774431998,774432000,774432003,774432006,774432009&formats=0,0,0,0,0 /1/6/9/5/2/6/8/webimg/774431998
774432000&images=774431994,774431996,774431998,774432000,774432003,774432006,774432009&formats=0,0,0,0,0 /1/6/9/5/2/6/8/webimg/774432000
774432003&images=774431994,774431996,774431998,774432000,774432003,774432006,774432009&formats=0,0,0,0,0 /1/6/9/5/2/6/8/webimg/774432003
774432006&images=774431994,774431996,774431998,774432000,774432003,774432006,774432009&formats=0,0,0,0,0 /1/6/9/5/2/6/8/webimg/774432006
774432009&images=774431994,774431996,774431998,774432000,774432003,774432006,774432009&formats=0,0,0,0,0 /1/6/9/5/2/6/8/webimg/774432009
I'm using the following Regex which is working on http://regexr.com/3a6oa
/.{128}(?=webimg).{10}/g
It just doesn't seem to work with Textmate Search. Does anyone know why? I need to select all of this junk and replace it with nothing, the numbers are unique each time.
Thanks very much
Why are you using a lookahead in your pattern? Just use: /.{128}webimg.{10}/g
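One caveat when dropping the lookahead: the two patterns do not consume the same amount of text. With the lookahead, `.{10}` starts counting at the `w` of webimg; without it, `.{10}` starts after webimg. The short Python check below illustrates this on a made-up line (128 filler characters followed by the webimg segment), so the counts in a real file may need adjusting.

```python
import re

with_lookahead = re.compile(r".{128}(?=webimg).{10}")
without_lookahead = re.compile(r".{128}webimg.{10}")

# Hypothetical line: 128 filler characters, then the webimg path segment.
line = "x" * 128 + "webimg/774431994,rest-of-row"

a = with_lookahead.match(line).group(0)
b = without_lookahead.match(line).group(0)

# The lookahead consumes nothing, so its match ends 10 characters after "w".
print(len(a), repr(a[128:]))
# The plain version consumes "webimg" and then 10 more characters.
print(len(b), repr(b[128:]))
```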
Why are you using Textmate search at all? I'd need to know more context of the problem to say for sure, but I bet a simple sed command could just be used instead:
sed -i "/webimg/d" ./filename.csv

Unix pattern matching using vim

I am dealing with files that contain patterns like
Jon Smith-db/his-wife.db/his-keeds.db
Jon Smith-db/his-wife.db/his-siblings/his-k1ds.db
....
....
I need to replace the last component, his-keeds.db, and similar typos with nothing, so my attempt is
:1,$s/\/.+\.db$//g
but it doesn't work. I was able to do it using awk and perl but failed to do so in vim. Can anyone help?
this would do the job:
:%s#/[^/]*\.db$#/#
if you don't want the ending slash:
:%s#/[^/]*\.db$##
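The same substitution can be sketched outside Vim, for instance in Python. The key is `[^/]*`, which cannot cross a slash, so only the last path component is matched. (In Vim's default 'magic' mode the one-or-more quantifier is written `\+`, which is probably why the original `.+` attempt matched nothing.) The name `drop_last_db` is made up for this example.

```python
import re

# A slash, then any run of non-slash characters, ending in ".db" at end of line.
TRAILING_DB = re.compile(r"/[^/]*\.db$")


def drop_last_db(line: str) -> str:
    """Remove the final '/name.db' component, if the line ends with one."""
    return TRAILING_DB.sub("", line)


for line in (
    "Jon Smith-db/his-wife.db/his-keeds.db",
    "Jon Smith-db/his-wife.db/his-siblings/his-k1ds.db",
):
    print(drop_last_db(line))
```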
Try
:g/his-keeds.db$/s///g
instead

underscore to camelCase RegEx

Our standards have changed and I want to do a 'find and replace' in, say, Dreamweaver (it allows for RegEx), or in Visual Studio 2010, which we just got (if it allows searching by RegEx), to find all the underscores and camelCase the names.
What would be the RegEx to do that?
RegEx baffles me. I definitely need to do some studying.
Thanks in advance!
Update: A little more info - I'm searching within my html,aspx,cfm or css documents for any string that contains an underscore and replacing it with the following letter capitalized.
I had this problem, but I also needed to handle converting fields like gap_in_cover_period_d_last_5_yr into gapInCoverPeriodDLast5Yr, and found that 9 out of 10 other sed expressions don't like one-letter words or numbers.
So to answer the question, use sed. sed's s command is the equivalent of the :s command in vim, so the expression below can be used in either one (the trailing c is vim's confirm flag; GNU sed does not accept it, so drop it there).
This seemed to work, although I did have to run it twice (word_a_word becomes wordA_word on the first pass and wordAWord on the second; sed backward commands are just too magical for my muggle blood):
s/\([A-Za-z0-9]\+\)_\([0-9a-z]\)/\1\U\2/gc
I recently had to approach a similar situation you asked about. Here is a regex I've been using in VIM which does the job for me:
%s/_\([a-zA-Z]\)/\u\1/g
As an example:
this_is_a_test becomes thisIsATest
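For anyone scripting this outside an editor, a single-pass version can be sketched in Python. Because the pattern matches only the underscore and the one character after it, nothing before the underscore is consumed, so one-letter words and digits are handled without a second run. The function name `to_camel_case` is made up for this example.

```python
import re


def to_camel_case(name: str) -> str:
    """Drop each underscore and uppercase the letter or digit that follows it."""
    return re.sub(r"_([A-Za-z0-9])", lambda m: m.group(1).upper(), name)


for name in ("this_is_a_test", "word_a_word", "gap_in_cover_period_d_last_5_yr"):
    print(name, "->", to_camel_case(name))
```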
I don't think there is a good way to do this purely with regex. Searching for _ characters is easy, something like ._. should work to find an _ with something on either side, but you need a more powerful scripting system to change the case of the character following the _. I suggest perl :)
I have a solution in PHP
preg_replace_callback('/_(.)/', function ($m) { return strtoupper($m[1]); }, $str)
There may be more elegant selector criteria, but I wanted to keep it simple.

perl regex problem -- &amp in yahoo finance page

I found an old perl hack on the O'Reilly site http://oreilly.com/pub/h/1041 and decided to check it out. After a little fiddling around it started to run but the regex are out of date.
Here is the question: with this
/<a href="\/q\/op\?s=(.*?)\&m=(.*?)">/
as the first line of regex, what needs to be modified to make the regex function again? The following are snippets from
http://finance.yahoo.com/q/op?s=FISV
<a href="/q/op?s=FISV&amp;k=55.000000">
and
<a href="/q/os?s=FISV&amp;m=2011-04-15">
.
The original hack is dated 2004, and option symbols looked like this (FQVAH or FQVFF) back then instead of fisv110416c00060000 for a call option and fisv110416p00090000 for a put option. The first thing I did to get it going was to change all instances of $url to $curl, because until the name was changed the symbol was not being passed to Yahoo for lookup. The &amp is giving me the most trouble. If this is found to run without modification I would be very surprised, and would very much like to know which system and perl -V output you have. SLES 10 and perl 5.8.0 is what I am currently using.
Any suggestions would be helpful. It could be a useful script to anyone who is serious about protecting themselves from a falling equity market.
Thanks,
robm
I'm not /100%/ sure what you're asking, but if I'm understanding, you want a regex that will capture "fisv110416c00060000" and tell you the first few letters, whether it's a call or a put, and the amount?
If so, you're looking for something like:
/([a-z]+)(\d+)([cp])(\d+)/
That should capture the following for the first example
$1 = "fisv"
$2 = 110416
$3 = c
$4 = 00060000
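The same capture groups can be checked quickly in Python. This is only a sketch: the field names (root, expiry, kind, strike) are my own labels for the four groups, not anything Yahoo documents, and `parse_option_symbol` is a made-up helper.

```python
import re

OPTION_SYMBOL = re.compile(r"([a-z]+)(\d+)([cp])(\d+)")


def parse_option_symbol(symbol: str):
    """Split a symbol such as fisv110416c00060000 into its four captured parts."""
    m = OPTION_SYMBOL.fullmatch(symbol)
    if m is None:
        return None
    root, expiry, kind, strike = m.groups()
    return {"root": root, "expiry": expiry, "kind": kind, "strike": strike}


print(parse_option_symbol("fisv110416c00060000"))
print(parse_option_symbol("fisv110416p00090000"))
```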
The original regex was very specific to that html string. You can include the beginning bits of it if you need to use it to check that the entire string is there as well. Of course, make your regex as tight as possible to avoid over-matches and wasted time pattern matching. I'm just not sure the exact pattern you're trying to match (ie: is it always "fisv"?).
You should either unescape the HTML first, which would turn the &amp; into a &, or just change the regex, like this:
/<a href="\/q\/os\?s=(.*?)\&(?:amp;)?m=(.*?)">/
To match both types of urls:
/<a href="\/q\/o[ps]\?s=(.*?)\&(?:amp;)?[mk]=(.*?)">/
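To sanity-check the combined pattern, here is a small Python sketch that runs it against both link styles, once with the entity-encoded `&amp;` as it appears in the page source and once with a bare `&` as it would look after unescaping. The name `LINK` is made up; the sample strings follow the two snippets from the question.

```python
import re

# Accepts /q/op or /q/os, "&" or "&amp;", and either the m= or k= parameter.
LINK = re.compile(r'<a href="/q/o[ps]\?s=(.*?)&(?:amp;)?[mk]=(.*?)">')

samples = [
    '<a href="/q/op?s=FISV&amp;k=55.000000">',  # raw page source
    '<a href="/q/os?s=FISV&m=2011-04-15">',     # after HTML unescaping
]

for html in samples:
    match = LINK.search(html)
    print(match.groups() if match else None)
```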