How to match and remove any line containing a specific string? - regex

I have a huge directory list of URLs from my Web site. Example:
/folder/folder2/folder3/page.htm
/folder/folder2/folder3/page2.htm
/folder/folder2/folder3/page3.htm
/folder/folder2/folder3/page4.htm
I want to clean this list of all items that have /folder2 in the path. I need a regular expression to perform a find and replace for everything that uses /folder2/ and delete those lines from my list. So find/replace it with blank.
Does anyone know what the proper regular expression for this would be? I should specify I am using Dreamweaver as my editor, which may use different regular expressions.

This expression matches any whole line in which "/folder2/" occurs with at least one character on either side (true of every path in your example):
^.+?\/folder2/.+$
HTH.

In Python that would be:
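import re

# Matches any line that contains "/folder2/"
regex = re.compile(r'/folder2/')

# Copy every line that does NOT contain "/folder2/" to the output file.
# (A plain loop is used because map() is lazy in Python 3 and would write nothing.)
with open("input.txt") as src, open("filtered_file.txt", "w") as dst:
    for line in src:
        if not regex.search(line):
            dst.write(line)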
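If the list is already in memory as one string, a single substitution can do the same job. This is only a sketch: the names LINE_WITH_FOLDER2 and drop_folder2_lines are made up for illustration, and it assumes lines end with "\n".

```python
import re

# Matches a whole line containing "/folder2/", plus its trailing newline if present.
# re.MULTILINE makes ^ anchor at the start of every line, not just the string.
LINE_WITH_FOLDER2 = re.compile(r'^.*/folder2/.*\n?', re.MULTILINE)


def drop_folder2_lines(text):
    """Return text with every line containing '/folder2/' removed."""
    return LINE_WITH_FOLDER2.sub('', text)


sample = (
    "/folder/folder2/folder3/page.htm\n"
    "/folder/other/page2.htm\n"
    "/folder2/page3.htm\n"
)
print(drop_folder2_lines(sample))
```

Matching the newline as part of the pattern avoids leaving blank lines behind, which is what happens when only the line's text is replaced with nothing.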

R Wildcard in the middle of an expression

I want to use the pattern expression in R to find files in my directory that match "ReportName*.HTML". Meaning that I only want to find files with certain file names and extensions, but there are dynamic characters between.
Here's an example: I want to find all reports that begin with "2016 Operations" but end with the extension ".HTML". Currently I am trying:
files.control <- dir(path, pattern="^2016 Operations*.HTML$")
Why doesn't this work? I like the one line of code; it's so simple.
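The "ReportName*.HTML" syntax is called a glob, not a regular expression. In a regular expression, * means "zero or more of the preceding character" and . means "any single character", so your pattern does not do what you expect. R supports globs directly via Sys.glob, which returns a character vector of the filenames in the current directory that start with ReportName and end with .HTML: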
Sys.glob("ReportName*.HTML")
The R function glob2rx will translate globs to regular expressions so this does the same thing:
dir(pattern = glob2rx("ReportName*.HTML"))
We can discover the regular expression associated with a glob like this:
glob2rx("ReportName*.HTML")
## [1] "^ReportName.*\\.HTML$"
You can find more information on regular expressions from within R via ?regex, and at the links near the bottom of this page: https://code.google.com/archive/p/gsubfn/
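The glob-versus-regex distinction is not specific to R. As a rough illustration in Python (the file names below are invented for the example), the standard library's fnmatch.translate plays a role similar to glob2rx:

```python
import fnmatch
import re

# Hypothetical file names, used only to illustrate the difference.
names = [
    "2016 Operations Summary.HTML",
    "2016 Operations.HTML",
    "2015 Operations Summary.HTML",
    "2016 Operations Summary.txt",
]

# The pattern from the question: "s*" means zero or more "s",
# and the unescaped "." matches exactly one arbitrary character.
broken = re.compile(r"^2016 Operations*.HTML$")

# The intended regex: ".*" for "anything", "\." for a literal dot.
fixed = re.compile(r"^2016 Operations.*\.HTML$")

# The same thing, derived from the glob instead of written by hand.
from_glob = re.compile(fnmatch.translate("2016 Operations*.HTML"))

broken_hits = [n for n in names if broken.match(n)]
fixed_hits = [n for n in names if fixed.match(n)]
glob_hits = [n for n in names if from_glob.match(n)]

print(broken_hits)
print(fixed_hits)
print(glob_hits)
```

The broken pattern only matches names where at most one arbitrary character sits between "Operation", any run of "s", and "HTML", which is why most report names are missed.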

Powershell with regex: Unable to find and replace ALL occurrences of specified string in a set of data

I am new to regular expressions and stackoverflow. Any help would be greatly appreciated.
I am trying to remove unwanted data from a data set. The data is contained in a .csv file column with multiple cells, each cell containing data similar to this:
OSVDB #109124,OSVDB #109125,OSVDB #109126,OSVDB #109127,OSVDB #109128,OSVDB #109129,OSVDB #109130,OSVDB #109131,OSVDB #109132,OSVDB #109133,OSVDB #109134,OSVDB #109135,OSVDB #109136,OSVDB #109137,OSVDB #109138,OSVDB #109139,OSVDB #109140,OSVDB #109141,OSVDB #109142,OSVDB #109143,VMSA #2014-0012,OSVDB #102715,OSVDB #104972,OSVDB #106710,OSVDB #115364,IAVA #2014-A-0191,IAVB #2014-B-0160,IAVB #2014-B-0162,IAVB #2015-B-0007
I want to replace the above data with each occurrence of the strings beginning "IAV...". So, the above cell would read:
IAVA #2014-A-0191,IAVB #2014-B-0160,IAVB #2014-B-0162,IAVB #2015-B-0007
Below is a snippet of the script that imports the .csv and gets the column containing the data.
My regex, within powershell is:
$reg1 = '$1'
$reg2 = '(IAV[A|B]\s#[0-9]{4}-[A|B]-[0-9]{4}){1,}'
ForEach-Object {$_.IAVM = [regex]::replace($_.IAVM,$reg2,$reg1); $_}
The result is:
The entire cell contents posted above.
From my understanding {1,} at the end of the regex should return each occurrence of the string pattern, but I'm returning all contents of every cell containing my regex string.
Maybe instead of trying to pick out your string you just delete the stuff you don't want? Try something like:
$reg1=''
$reg2='((OSVDB|VMSA)\s#[M-S0-9-]{6,9}[,]?)'
You have .* in that regex at the very beginning. This will capture everything up to the last match of the pattern that follows it. In your case I don't think you need that part anyway.
Also note that PowerShell has a handy -replace operator, so there's often no reason to use the static methods on the Regex type.
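The other direction, keeping only the entries you want rather than deleting the rest, can be sketched in Python as follows. The names IAV_ENTRY and keep_iav_entries are invented here, and the pattern assumes every wanted entry looks like "IAVA #2014-A-0191", as in the sample cell.

```python
import re

# One IAVA/IAVB entry, e.g. "IAVB #2015-B-0007".
# [AB] is a character class; writing [A|B] would also accept a literal "|".
IAV_ENTRY = re.compile(r'IAV[AB] #\d{4}-[AB]-\d{4}')


def keep_iav_entries(cell):
    """Return only the IAV entries of a cell, comma-separated."""
    return ",".join(IAV_ENTRY.findall(cell))


cell = (
    "OSVDB #109124,OSVDB #109125,VMSA #2014-0012,OSVDB #115364,"
    "IAVA #2014-A-0191,IAVB #2014-B-0160,IAVB #2014-B-0162,IAVB #2015-B-0007"
)
print(keep_iav_entries(cell))
```

Collecting every match and joining them sidesteps the problem in the question: a replace call only rewrites the parts that match and leaves everything else in the cell untouched.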

Finding a group of words using Regular Expressions

I am using python to get user input and then by using regular expressions I want to check for certain words. In this case I want to check how the user is feeling and then store it in a list. The problem is that when I print the list it is empty.
import re
phrase = raw_input("How are you feeling ")
phrase = phrase.lower()
feel=(re.findall(r'^(?=.*\bsad\b)(?=.*\bhappy\b)(?=.*\bjoyful\b)(?=.*\bmad\b)(?=.*\bsad\b)', phrase))
print feel
I'm not a python expert, but am fairly decent with regex. Why wouldn't you just use something like:
\b(happy|sad|joyful|mad)\b
Lookaheads consume no characters, so add something after them for the pattern to actually match:
...(?=.*\bsad\b).*
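A minimal sketch of the alternation approach in Python, with the user input replaced by a function argument so it can be tried directly (FEELINGS and find_feelings are illustrative names, not part of the original code):

```python
import re

# Any one of the feeling words, as a whole word.
FEELINGS = re.compile(r'\b(happy|sad|joyful|mad)\b')


def find_feelings(phrase):
    """Return the feeling words found in phrase, in order of appearance."""
    return FEELINGS.findall(phrase.lower())


print(find_feelings("I was Sad this morning but I am happy now"))
print(find_feelings("Just tired"))
```

Unlike the chained lookaheads in the question, which require every word to be present at once, the alternation matches whichever words actually occur.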

How to make a regex to replace the value of a key in a json file

I want to make a regex so I can do a "Search/Replace"
over a JSON file with many objects.
Every object has a key named "resource"
containing a URL.
Take a look at these examples:
"resource":"http://www.img/qwer/123/image.jpg"
"resource":"io.nl.info/221/elephant.gif"
"resource":"simgur.com/icon.png"
I want to make a regex to replace the whole url with
a string like this: img/filename.format.
This way, the result would be:
"resource":"img/image.jpg"
"resource":"img/elephant.gif"
"resource":"img/icon.png"
I'm just starting with regular expressions and I'm
completely lost. I was thinking that one valid idea would
be to write something starting with this pattern "resource":"
and ending with the last five characters. But I don't even know how to try
that.
How could I write the regular expression?
Thanks in advance!
Try this:
Find: "resource":\s*"[^"]+?([^\/"]+)"
Replace: "resource":"img/\1"
Using [^"]+? ensures the match won't roll off the end of the current entry and gobble up too much input, and it's reluctant (with the added ?) so the group captures the whole image file name (instead of just its last character). The find pattern consumes the closing quote, so the replacement has to put it back.
Edit:
I added optional whitespace after the key, which your pastebin has.
See a live demo of this regex with your pastebin.
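For anyone who wants to try the find/replace pair outside an editor, here is a rough Python equivalent. RESOURCE and shorten_resources are names made up for this sketch, and the closing quote is written explicitly in the replacement because the pattern consumes it.

```python
import re

# "resource": followed by a quoted value; the group captures the part
# after the last "/" (the file name).
RESOURCE = re.compile(r'"resource":\s*"[^"]+?([^/"]+)"')


def shorten_resources(text):
    """Rewrite every resource value to img/<file name>."""
    return RESOURCE.sub(r'"resource":"img/\1"', text)


sample = "\n".join([
    '"resource":"http://www.img/qwer/123/image.jpg"',
    '"resource":"io.nl.info/221/elephant.gif"',
    '"resource": "simgur.com/icon.png"',
])
print(shorten_resources(sample))
```

Note that the pattern needs at least one character before the file name, so a value with no "/" at all would lose its first character; the examples in the question all contain a path.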
Regex
.*\/
Debuggex Demo
This will find the text you want to replace. Replace it with img/. If you want to match the whole entry, including the key, use the following regex instead:
("resource":").*\/
Debuggex Demo
Then replace with $1img/. This gives you group 1 (the key) followed by the img/ part.
Let me know if there are any questions
Note: since you have JSON, I personally would parse it into an object, iterate over the objects, and change each resource independently rather than looking for a magic bullet.
If your JSON is an array of objects containing a resource field, I would do it in 3 steps: convert to object, find the resources and replace them, convert back to string (optional):
var tmp = JSON.parse('<your json>');
for (var i = 0; i < tmp.length; ++i) {
  if ('resource' in tmp[i]) {
    // Drop everything up to the last "/" and put "img/" in front of the file name
    tmp[i].resource = tmp[i].resource.replace(/^.*\//, 'img/');
  }
}
tmp = JSON.stringify(tmp);
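The same three steps (parse, rewrite, serialize) might look like this in Python. It is a sketch that assumes the file holds a JSON array of objects; rewrite_resources and its prefix parameter are hypothetical names.

```python
import json


def rewrite_resources(json_text, prefix="img/"):
    """Parse a JSON array, shorten each 'resource' URL, and serialize it again."""
    items = json.loads(json_text)
    for item in items:
        if "resource" in item:
            # Keep only what follows the last "/" (the file name).
            filename = item["resource"].rsplit("/", 1)[-1]
            item["resource"] = prefix + filename
    return json.dumps(items)


raw = json.dumps([
    {"resource": "http://www.img/qwer/123/image.jpg"},
    {"resource": "io.nl.info/221/elephant.gif"},
    {"resource": "simgur.com/icon.png", "title": "icon"},
])
print(rewrite_resources(raw))
```

Working on parsed objects means that quoting, whitespace and key order in the file cannot trip up the replacement, which is the main argument for this route over a regex.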

What is the regex required to find specific urls within content from a list of urls generated by a for loop?

As I write this I realise there are two parts to this question, however I think I am only really stuck on the first part and therefore the second is only provided for context:
Part A:
I need to search the contents of each value returned by a for loop (where each value is a url) for the following:
href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-n"
where:
the numerals 163472311232 could be any length (ie it could be 5478)
-text-text-text-text could be any number of different words
where page-n could be from page-2 up until any number
where matches are not returned more than once, ie only unique matches are returned and therefore only one of the following would be returned:
href="/dir/Sub_Dir/dir/5422-la-la/page-4
href="/dir/Sub_Dir/dir/5422-la-la/page-4
Part B:
So the logic would be something like:
list_of_urls = original_list
for url in list_of_urls:
    headers = {'User-Agent' : 'Mozilla 5.0'}
    request = urllib2.Request(url, None, headers)
    url_for_re = urllib2.urlopen(request).read()
    another_url = re.findall(r'href="(/dir/Sub_dir\/dir/[^"/]*)"', url_for_re, re.I)
    file.write(url)
    file.write('\n')
    file.write(another_url)
    file.write('\n')
Which I am hoping will give me output similar to:
a.html
a/page-2.html
a/page-3.html
a/page-4.html
b.html
b/page-2.html
b/page-3.html
b/page-4.html
So my question is (assuming the logic in part B is ok):
What is the required regex pattern to use for part A?
I am a newbie to python and regex so this will limit my understanding somewhat in regards to relatively complicated regex suggestions etc.
Update:
After suggestions I tried to test the following regex, which did not produce any results:
import re
content = 'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2"'
matches = re.findall(r'href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"', content, re.I)
prefix = 'http://www.test.com'
for match in matches:
    i = prefix + match + '\n'
    print i
Solution:
I think this is the regex that will work:
matches = re.findall(r'href="(/dir/Sub_Dir/dir/[^"/]*/page-[2-9])"', content, re.I)
You can have... most of what you want. Regexes don't really do the distinct thing, so I suggest you just use them to get all the URLs, and then remove duplicates yourself.
Off the top of my head it would be something like this:
href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"
Plus or minus escaping rules, specifics on what words are allowed, etc. Note that the alternation for the page number has to sit inside a single group; otherwise the | splits the whole expression in two. I'm a Windows guy, and there's a great tool called Expresso which is helpful for learning regexes. I hope there's an equivalent for whatever platform you're using; it comes in handy.
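Putting the two suggestions together (collect every match, then remove duplicates yourself) could look roughly like this. PAGE_LINK and unique_page_links are invented names, and the pattern assumes the words after the number contain only letters.

```python
import re

# href to a numbered page: an id of any length, one or more "-word" parts,
# and a page number of 2 or higher.
PAGE_LINK = re.compile(
    r'href="(/dir/Sub_Dir/dir/[0-9]+(?:-[a-z]+)+/page-(?:[2-9]|[1-9][0-9]+))"',
    re.I,
)


def unique_page_links(html):
    """Return each matching link once, in order of first appearance."""
    # dict.fromkeys keeps insertion order and drops repeated keys.
    return list(dict.fromkeys(PAGE_LINK.findall(html)))


html = (
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-4">4</a>'
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-4">4</a>'
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-1">1</a>'
    '<a href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-12">12</a>'
)
for link in unique_page_links(html):
    print("http://www.test.com" + link)
```

Using (?:-[a-z]+)+ instead of a fixed number of word groups is what lets the pattern accept "any number of different words", which is the reason the test in the update (five words against a four-word pattern) found nothing.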