regex to strip out image urls? - regex

I need to separate out a bunch of image urls from a document in which the images are associated with names like this:
bellpepper = "http://images.com/bellpepper.jpg"
cabbage = "http://images.com/cabbage.jpg"
lettuce = "http://images.com/lettuce.jpg"
pumpkin = "http://images.com/pumpkin.jpg"
I assume I can detect the start of a link with:
/http:[^ ,]+/i
But how can I get all of the links separated from the document?
EDIT: To clarify the question: I just want to strip out the URLs from the file minus the variable name, equals sign and double quotes so I have a new file that is just a list of URLs, one per line.

Try this...
(http://)([a-zA-Z0-9\/\\.])*

If the format is constant, then this should work (python):
import re
s = """bellpepper = "http://images.com/bellpepper.jpg" (...) """
re.findall("\"(http://.+?)\"", s)
Note: this is not "find an image in a file" regexp, just an answer to the question :)
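To get from the list that findall returns to what the question asks for (one URL per line), the matches can be joined with newlines. This is a minimal sketch, assuming the `name = "url"` format shown in the question. The function name `extract_urls` and the sample constant are made up for illustration, and the `https?` part is an addition so that secure links would be picked up too.

```python
import re

# Sample input in the format from the question.
DOCUMENT = '''bellpepper = "http://images.com/bellpepper.jpg"
cabbage = "http://images.com/cabbage.jpg"
lettuce = "http://images.com/lettuce.jpg"
pumpkin = "http://images.com/pumpkin.jpg"
'''


def extract_urls(text):
    """Return every double-quoted http(s) URL in text, without the quotes."""
    return re.findall(r'"(https?://[^"\s]+)"', text)


urls = extract_urls(DOCUMENT)
# One URL per line, ready to be written to a new file.
output = "\n".join(urls) + "\n"
print(output, end="")
```

Writing `output` to a file with the usual `open(..., "w")` call would then give the list of URLs the question describes.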

Do you mean that your document uses that format and you just want the http part? You can simply split on the "=" delimiter, without a regex:
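$f = fopen("file", "r");
if ($f) {
    // fgets() returns false at end of file, so test its result directly
    while (($line = fgets($f, 4096)) !== false) {
        $s = explode(" = ", $line);
        if (count($s) < 2) continue; // skip lines without " = "
        // remove the double quotes around the URL
        print str_replace("\"", "", $s[1]);
    }
    fclose($f);
}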
On the command line:
#php5 myscript.php > newfile.ext
If you are using a language other than PHP, there are similar string-splitting methods you can use, e.g. split() in Python or Perl. Please check your language's documentation.

You may try this, if your tool supports positive lookbehind:
/(?<=")[^"\n]+/

Related

How to replace the periods of just URL(s) and/or email address(s) buried in text

I am using the great answer provided by D Greenberg in the stackoverflow q&a Python split text on sentences to split text into sentences. I would like help augmenting one part of it.
The overall code uses a bunch of regular expressions to recognize abbreviations, acronyms, websites, prefixes (Mr., Mrs., etc.) and other non-sentence endings and changes u'.' into u'<prd>'. All the u'.' that aren't changed must be periods that end sentences.
The re that recognizes websites only works for URLs of the form text.(com|org|gov...). It doesn't work for text1.text2.text3.(com|org|gov...). May I have some help in making this work?
I have edited the original code to just the relevant section:
import re
def split_into_sentences(text):
    prefixes = u"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = u"(Inc|Ltd|Jr|Sr|Co)"
    websites = u"[.](com|net|org|io|gov)"
    digits = u"([0-9])"
    text = text.replace(u"\n",u" ")
    text = re.sub(prefixes,u"\\1<prd>",text)
    text = re.sub(websites,u"<prd>\\1",text)
    text = re.sub(digits + u"[.]" + digits,u"\\1<prd>\\2",text)
    if u"Ph.D" in text: text = text.replace(u"Ph.D.",u"Ph<prd>D<prd>")
    text = text.replace(u".",u".<stop>")
    text = text.replace(u"?",u"?<stop>")
    text = text.replace(u"!",u"!<stop>")
    text = text.replace(u"<prd>",u".")
    sentences = text.split(u"<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
I believe the following re will find a full URL or email address (I know there are more domains possible and I will augment if needed)
websites = ur"([\w#-]+[.])+(com|net|org|io|gov)"
What I can't figure out how to do is change the text = re.sub(websites,u"<prd>\\1",text) to accomplish what I want: in the portions of text that match the website pattern, change all of the u'.' into u'<prd>'
You may use your pattern to match all those substrings in question and perform a custom search and replace on each match using a lambda expression used as the second argument to re.sub:
result = re.sub(websites, lambda x: x.group().replace(u".", u"<prd>"),text)
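As a quick check of that idea, here is a small self-contained sketch. It uses the pattern from the question written as a Python 3 raw string (the `ur"..."` prefix only exists in Python 2); the function name `protect_dots` and the sample sentence are made up for illustration.

```python
import re

# Pattern from the question, as a Python 3 raw string.
WEBSITES = r"([\w#-]+[.])+(com|net|org|io|gov)"


def protect_dots(text):
    """Replace every '.' inside a website-like match with '<prd>'."""
    # The callable receives each match object; only the matched text is
    # rewritten, so periods outside URLs are left alone.
    return re.sub(WEBSITES, lambda m: m.group().replace(".", "<prd>"), text)


sample = "See www.example.com for details. Thanks."
protected = protect_dots(sample)
print(protected)
```

The sentence-ending periods stay as they are, so the later `.` to `.<stop>` replacement should still split after "details." and "Thanks.".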

Extract string of numbers from URL using regex PIG

I'm using PIG to generate a list of URLs that have been recently visited. In each of the URLs, there is a string of numbers that represents the product page visited. I'm trying to use a regex_extract_all() function to extract just the string of numbers, which vary in length from 6-8. The string of digits can be found directly after jobs2/view/ and usually ends with +&cd but sometimes they may end with ).
Here are a few example URLs:
(http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca)
(http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca)
(http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk)
(http://a.com/search?q=cache:aNspmG11AJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk)
(http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=cl k&gl=hk)
Here is the current regex I am using:
J = FOREACH jpage GENERATE FLATTEN(REGEX_EXTRACT_ALL(TEXTCOLUMN, '\/view\/(\d+)\+\&')) as (output:chararray)
I have also tried other forms such as:
'[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]', 'view.([0-9]+)', 'view\/([\d]+)\+',
'[0-9][0-9][0-9]+', and
'[0-9][0-9][0-9]*'; none of which work.
Can anybody assist here or have another way of going about it?
Much appreciated,
MM
The reason for the "Unexpected character 'D'" error is that you need a double backslash instead of a single backslash, e.g. just replace [\d+] with [\\d+].
Here is a solution; please validate it against all of your input strings.
input.txt
http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca
http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk
http://a.com/search?q=cache:aNspmG11AJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clk&gl=hk
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928)=2&hl=zh-TW&ct=clk&gl=hk
http://webcache.googleusercontent.com/search?q=cache:http://my.linkedin.com/jobs2/view/9919248
Updated Pigscript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE REGEX_EXTRACT(line,'.*/view/(\\d+)([+|&|cd|)?]+)?',1);
dump B;
(17069404)
(5977065)
(16988928)
(16988928)
(16988928)
(16988928)
I'm not familiar with PIG, but this regex will match your target:
(?<=/jobs2/view/)\d+
By using a (non-consuming) look behind, the entire match (not just a group of the match) is your number.
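The look-behind pattern can be tried outside of Pig. The sketch below only checks it with Python's `re` module against a few of the sample URLs from this thread; it says nothing about how Pig's own functions would treat a pattern without a capture group, and the variable names are made up for illustration.

```python
import re

# Fixed-width look-behind: the match itself is only the digits.
JOB_ID = re.compile(r"(?<=/jobs2/view/)\d+")

urls = [
    "http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca",
    "http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca",
    "http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928)=2&hl=zh-TW&ct=clk&gl=hk",
]

job_ids = []
for url in urls:
    m = JOB_ID.search(url)
    # None marks a URL that has no /jobs2/view/<digits> part.
    job_ids.append(m.group() if m else None)

print(job_ids)
```

Because the digits are followed by `+` in some URLs and `)` in others, stopping at the first non-digit avoids having to list the possible terminators.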

Python script to extract data from text file

I have a text file that contains a list of website links, like this:
test.txt:
http://www.site1.com/
http://site232546ee.com/
https://www.site3eiue213.org/
http://site4.biz/
I want to write a simple Python script that extracts only the site names, cut to at most 8 characters (no name longer than 8 characters). The output should look like this:
output.txt:
site1
site2325
site3eiu
site4
I have written some code:
txt1 = open("test.txt").read()
txt2 = txt1.split("http://www.")
f = open('output.txt', 'w')
for us in txt2:
    f.write(us)
print './done'
But I don't know how to split() on more than one pattern in one line. I also tried the re module, but I couldn't work out how to write the code for it.
Can someone please help me write this script? :(
You can achieve this with a regular expression, as below.
import re
no = 8  # maximum length of the site name
# match the scheme plus an optional "www." (dots escaped so they match literally)
regex = r"\bhttps?://(?:www\.)?"
text = "http://site232546ee.com/"
match = re.search(regex, text)
start = match.end(0)
end = start + no
string1 = text[start:end]
# cut at the first dot if the name is shorter than the limit
end = string1.find('.')
if end > 0:
    final = string1[0:end]
else:
    final = string1
print(final)
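The single-URL example above can be extended to a whole list of lines. This is only a sketch under the assumption that every line is an http or https URL with an optional `www.` prefix; `short_name` and `MAX_LEN` are illustrative names, not part of the answer.

```python
import re

MAX_LEN = 8
# Scheme plus an optional "www." at the start of the URL.
PREFIX = re.compile(r"https?://(?:www\.)?")


def short_name(url, max_len=MAX_LEN):
    """Strip the scheme and 'www.', cut at the first dot, keep max_len chars."""
    url = url.strip()
    m = PREFIX.match(url)
    rest = url[m.end():] if m else url
    name = rest.split(".", 1)[0]
    return name[:max_len]


lines = [
    "http://www.site1.com/",
    "http://site232546ee.com/",
    "https://www.site3eiue213.org/",
    "http://site4.biz/",
]
names = [short_name(line) for line in lines]
print("\n".join(names))
```

In a real script, `lines` would come from reading test.txt and the joined result would be written to output.txt.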
You said you want to extract site names with 8 characters, but the output.txt example shows bits of domain names. If you want to keep only the domain names that have eight or fewer characters, here is a solution.
Step 1: Get all the domain names.
import tldextract
import pandas as pd
text_s=''
list_u=('http://www.site1.com/','http://site232546ee.com/','https://www.site3eiue213.org/','http://site4.biz/')
#http:\//www.(\w+).*\/?
for l in list_u:
    extracted = tldextract.extract(l)
    text_s+= extracted.domain + ' '
print (text_s) #gives a string of domain names delimited by whitespace
Step 2: Keep only the domain names with 8 or fewer characters.
word= text_s.split()
lent= [len(x) for x in text_s.split()]
word_len_list = pd.DataFrame(
{'words': word,
'char_length': lent,
})
word_len_list[(word_len_list.char_length <= 8)]
Output looks like this:
   words  char_length
0  site1            5
3  site4            5
Disclaimer: I am new to Python. Please ignore any unnecessary and/or stupid steps I may have written
Have you tried printing txt2 before doing anything with it? You will see that it did not do what (I expect) you wanted it to do, since there's only one "http://www." available in the text. Try to split at a newline \n. That way you get a list of all the urls.
Then, for each url you'll want to strip the front and back, which you can do with regular expression but which can be quite hard, depending on what you want to be able to strip off. See here.
When you have found a regular expression that works for you, simply check the domain for its length and write those domains to a file that satisfy your conditions using an if statement (if len(domain) <= 8: f.write(domain))
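The steps described above (split on newlines, strip the front and back of each URL, keep names of at most 8 characters) might look like the sketch below. Note that it swaps the regular expression for the standard library's `urllib.parse.urlsplit`, which is a different technique from the one suggested; `domain_label` is an illustrative name.

```python
from urllib.parse import urlsplit


def domain_label(url):
    """Return the first label of the host name, ignoring a leading 'www.'."""
    host = urlsplit(url.strip()).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host.split(".", 1)[0]


text = (
    "http://www.site1.com/\n"
    "http://site232546ee.com/\n"
    "https://www.site3eiue213.org/\n"
    "http://site4.biz/\n"
)

# Split at newlines, skip empty lines, keep only short enough names.
labels = [domain_label(line) for line in text.split("\n") if line]
kept = [label for label in labels if len(label) <= 8]
print(kept)
```

This filters names out rather than truncating them, so it answers the "no name more than 8 characters" reading of the question, not the truncated output shown in output.txt.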

Excluding a file extension while parsing a CSV file

So I'm new to Perl and writing a script that would read through rows in a CSV file, and rename a directory of files associated with a certain column in that CSV file.
my $filename_formatted = "$row->[3]"."_"."$row->[4]"."_"."$row->[2]\n";
my $resume_id = $row->[1];
if (-e $resume_id){
rename($resume_id, $filename_formatted);
}
Basically, how could I format $resume_id to accept only the contents up to the file extension? The $row->[1] variable contains something like "resume_1231.pdf" or "resume_1231.doc". I basically want everything up to the .
I understand I would probably need a regex, but, I've never utilized it in Perl.
$formatted_resume_id = /($row->[1])?!\..*$/
I don't know.
I suppose you would want everything up to the final dot in the file name (so you would get the full name even if the filename contained dots).
Something like this should do it:
if ( $row->[1] =~ /(.*)\./ ) {
    $formatted_resume_id = $1;
}
The $row->[1] variable contains something like "resume_1231.pdf" or "resume_1231.doc".
I basically want everything up to the .
Try a capturing group:
^([^.]*)
Or use a lazy quantifier:
^(.*?)\.
Sample code:
$mystring = "resume_1231.pdf";
if($mystring =~ m/^([^.]*)/) {
    print "The file name is $1";
}
So the answer was apparently this,
my $resume_file = "bogus_filename.doc";
my ($name) = $resume_file =~ /(.+?)(\.[^.]*$|$)/;
my($ext) = $resume_file =~ /(\.[^.]+)$/;
This would account for any extra periods, as it only accepts up to the very last period.
I'm still a bit unsure as to how this works, so if anyone can break down the first regex, that would be great. I understand (.+?) but I'm lost as to how the second part of that regex means to not include the extension.
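One way to see what the first regex does is to run the same pattern in another engine and look at both groups. As far as I can tell, the lazy `(.+?)` takes as few characters as possible, while the second group only succeeds on either a dot followed by dot-free characters up to the end of the string, or on the end of the string itself; so the first group keeps growing until what is left is the last extension (or nothing). The sketch below uses Python's `re` with the same pattern; `split_name` is an illustrative name.

```python
import re

# Same pattern as the Perl answer: lazy name, then last extension or end.
SPLIT_EXT = re.compile(r"(.+?)(\.[^.]*$|$)")


def split_name(filename):
    """Return (name, extension); extension is '' when there is none."""
    m = SPLIT_EXT.match(filename)
    if m is None:  # only happens for an empty string
        return "", ""
    return m.group(1), m.group(2)


print(split_name("bogus_filename.doc"))
print(split_name("my.resume.v2.pdf"))
print(split_name("README"))
```

The second call shows the point about extra periods: only the text after the very last dot ends up in the extension group.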

RegEx : Replace parts of dynamic strings

I have a string
IsNull(VSK1_DVal.RuntimeSUM,0),
I need to remove IsNull part, so the result would be
VSK1_DVal.RuntimeSUM,
I'm completely new to RegEx. That wouldn't be a problem except for one thing:
VSK1 is a dynamic part; it can be any combination of A-Z and 0-9, of any length. How can I replace such strings with RegEx? I use MSSQL 2k5, i think it uses general set of RegEx rules.
EDIT: I forgot to say that I'm doing the replacement in the Replace box (^H) of the SSMS Query window, not building a RegEx query.
br
marius
here's a regex that should work:
[^(]+\(([^,]+),[^)]\)
Then use $1 capture group to extract the part that you need.
I did a sanity check in ruby:
orig = "IsNull(VSK1_DVal.RuntimeSUM,0),"
regex = /[^(]*\(([^,]+),[^)]\)/
result = orig.sub(regex){$1} # result => VSK1_DVal.RuntimeSUM,
It gets trickier if you have a prefix that you want to retain. Like if you have this:
"somestuff = IsNull(VSK1_DVal.RuntimeSUM,0),"
In this case, you need someway to identify the start of the pattern. Maybe you can use '=' to identify the start of the pattern? If so, this should work:
orig = "somestuff = IsNull(VSK1_DVal.RuntimeSUM,0),"
regex = /=\s*\w+\(([^,]+),[^)]\)/
result = orig.sub(regex){$1} # result => somestuff = VSK1_DVal.RuntimeSUM,
But then the case where you don't have an equals sign will fail. Maybe you can use 'IsNull' to identify the start of the pattern? If so, try this (note the '/i' representing case insensitive matching):
orig = "somestuff = isnull(VSK1_DVal.RuntimeSUM,0),"
regex = /IsNull\(([^,]+),[^)]\)/i
result = orig.sub(regex){$1} # result => somestuff = VSK1_DVal.RuntimeSUM,
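For comparison, the last Ruby variant could be written in Python roughly as follows. This is a sketch, not something that runs inside the SSMS Replace box, which has its own syntax. It also makes one change that is my assumption rather than part of the answer: `[^)]*` instead of `[^)]`, so that a default value longer than one character is accepted. `unwrap_isnull` is an illustrative name.

```python
import re

# Group 1 is the first argument; the second argument is discarded.
ISNULL = re.compile(r"IsNull\(([^,]+),[^)]*\)", re.IGNORECASE)


def unwrap_isnull(sql):
    """Replace IsNull(expr, default) with just expr."""
    return ISNULL.sub(r"\1", sql)


print(unwrap_isnull("IsNull(VSK1_DVal.RuntimeSUM,0),"))
print(unwrap_isnull("somestuff = isnull(VSK1_DVal.RuntimeSUM,0),"))
print(unwrap_isnull("IsNull(AB12_DVal.RuntimeSUM,123465),"))
```

The trailing comma after the closing parenthesis is outside the match, so it is left in place.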
/IsNULL\(([A-Z0-9]+_DVal\.RuntimeSUM),0\)/i
Then pick group match number 1.
Here's a very useful site: http://www.regexlib.com/RETester.aspx
They have a tester and a cheatsheet that are very useful for quick testing of this sort.
I tested the solution by Dave and it works fine except it also removes the trailing comma you wanted retained. Minor thing to fix.
Try this:
IsNULL\((.*,)0\)
You say in your question
I use MSSQL 2k5, i think it uses
general set of RegEx rules.
This is not true unless you enable CLR and compile and install an assembly. You can use its native pattern matching syntax and LIKE for this as below.
WITH T(C) AS
(
SELECT 'IsNull(VSK1_DVal.RuntimeSUM,0),' UNION ALL
SELECT 'IsNull(VSK1_DVal.RuntimeSUM,123465),' UNION ALL
SELECT 'No Match'
)
SELECT SUBSTRING(C,8,1+LEN(C)-8-CHARINDEX(',',REVERSE(C),2))
FROM T
WHERE C LIKE 'IsNull(%,_%),'