Get URL from a string in Groovy - regex

I am working on a Grails app. I need to extract only the part of the URL up to .com (or .gov, .edu, .mil, .org, .net, etc.) from a string.
For example:
Input: https://stackoverflow.com/questions?=34354#es4 Output: https://stackoverflow.com/
Input: https://code.google.com/p/crawler4j/issues/detail?id=174 Output: https://code.google.com/
Can anyone suggest how this can be done? Also, I need to change https to http in the resulting string. Thanks.
Edit: I apologize to the downvoters for not including what I had tried. This is what I tried:
URL url = new URL(website);
String webUrl = url.getprotocol()+"://"+url.getAuthority()
But I got the following error: MissingPropertyException occurred when processing request: [POST] /mypackage/resource/crawl

Something like this satisfies the two examples given:
def url = new URL('http://stackoverflow.com/questions?=34354#es4')
def result = 'http://' + url.host +'/'
assert result == 'http://stackoverflow.com/'
def url2 = new URL('https://code.google.com/p/crawler4j/issues/detail?id=174')
def result2 = 'http://' + url2.host +'/'
assert result2 == 'http://code.google.com/'
EDIT:
Of course you can abbreviate the concatenation with something like this:
def url = new URL('http://stackoverflow.com/questions?=34354#es4')
def result = "http://${url.host}/"
assert result == 'http://stackoverflow.com/'
def url2 = new URL('https://code.google.com/p/crawler4j/issues/detail?id=174')
def result2 = "http://${url2.host}/"
assert result2 == 'http://code.google.com/'

I found the error in my code as well. I had mistyped getProtocol as getprotocol, and it escaped my notice again and again. It should have been:
URL url = new URL(website);
String webUrl = url.getProtocol()+"://"+url.getAuthority()
Thanks everyone for helping.

You can try
String text = 'http://stackoverflow.com/questions?=34354#es4'
def parts = text.split(/\.com/)  // split takes a regex, so escape the dot
return parts[0] + '.com'
This should solve your problem, though note it only handles .com hosts.

Related

Scrapy webcrawler gets caught in infinite loop, despite initially working.

Alright, so I'm working on a Scrapy-based web crawler with some simple functionality. The bot is supposed to go from page to page, parsing and then downloading. I've gotten the parser to work, and I've gotten the downloading to work; I can't get the crawling to work. I've read the documentation on the Spider class and on how parse is supposed to work. I've tried returning vs. yielding, and I'm still nowhere. I have no idea where my code is going wrong.
From a debug script I wrote, what seems to happen is the following: the code will run, it will grab page one just fine, it'll get the link to page two, it'll go to page two, and then it will happily stay on page two, not grabbing page three at all. I don't know where the mistake in my code is, or how to alter it to fix it, so any help would be appreciated. I'm sure the mistake is basic, but I can't figure out what's going on.
import scrapy

class ParadiseSpider(scrapy.Spider):
    name = "testcrawl2"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]

    def __init__(self):
        self.found = 0
        self.goto = "no"

    def parse(self, response):
        # Grab the "Next page" anchor and reassemble its href by hand
        urlthing = response.xpath("//a[@title='Next page']").extract()
        urlthing = urlthing.pop()
        newurl = urlthing.split()
        print newurl
        url = newurl[1]
        url = url.replace("href=", "")
        url = url.replace('"', "")
        url = "http://forums.somethingawful.com/" + url
        print url
        self.goto = url
        return scrapy.Request(self.goto, callback=self.parse_save, dont_filter=True)

    def parse_save(self, response):
        nfound = str(self.found)
        print "Testing" + nfound
        self.found = self.found + 1
        return scrapy.Request(self.goto, callback=self.parse, dont_filter=True)
Use Scrapy's rule engine so that you don't need to write the next-page crawling code in the parse function. Just pass the XPath for the next-page link to restrict_xpaths, and the callback will get the response of each crawled page:
rules = (Rule(LinkExtractor(restrict_xpaths=['//a[contains(text(), "Next")]']), follow=True),)
def parse(self, response):
    response.url
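Note that rules only take effect on a CrawlSpider subclass, and CrawlSpider reserves the parse method for its own link-following logic, so the callback must have a different name. A minimal sketch of that approach (assuming Scrapy 1.x; the callback name parse_page is illustrative, not from the original answer):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ParadiseSpider(CrawlSpider):
    name = "testcrawl2"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]
    # Follow every "Next"-style link; each fetched page is handed to parse_page
    rules = (
        Rule(LinkExtractor(restrict_xpaths=['//a[contains(text(), "Next")]']),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        print response.url  # Python 2 print, matching the question's code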

How to pass a non-Latin string as a URL parameter in Django?

Using Django 1.7 and Python 2.7, in a view I have:
page = 0
sex = [u'\u0632\u0646'] #sex = زن
url = "/result/%s/%d" % (sex, page)
return HttpResponseRedirect(url)
Which needs to return:
/result/زن/0
However the resulting url turns out to be:
/result/[u'\u0632\u0646']/0
Which is not what I envisaged in the pattern:
url(r'^result/(?P<sex>\w+)/(?P<page>\d+)','userprofile.views.profile_search_result'),
I also tried
return HttpResponseRedirect(iri_to_uri(url))
but it does not solve the problem.
I'm really confused and would appreciate your help fixing this.
Since sex is a list, you simply need to use the actual element you want:
url = "/result/%s/%d" % (sex[0], page)
Although note that to construct URLs in Django, you should really use the reverse function:
from django.core.urlresolvers import reverse
...
url = reverse('userprofile.views.profile_search_result', kwargs={'sex': sex[0], 'page': page})
url should also be a unicode string for that to work:
page = 0
sex = u'\u0632\u0646' #sex=زن
url = u"/result/%s/%d" % (sex, page)
return HttpResponseRedirect(url)
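Putting the two answers together, a minimal sketch of the fixed view (assuming Django 1.7 / Python 2.7; the view name is illustrative, and iri_to_uri from django.utils.encoding percent-encodes the non-ASCII characters so the redirect header is a valid URI):
from django.core.urlresolvers import reverse
from django.http import HttpResponseRedirect
from django.utils.encoding import iri_to_uri

def profile_search_result_redirect(request):
    page = 0
    sex = [u'\u0632\u0646']  # sex = زن
    # Use the list element, not the list itself, and let reverse() build the URL
    url = reverse('userprofile.views.profile_search_result',
                  kwargs={'sex': sex[0], 'page': page})
    return HttpResponseRedirect(iri_to_uri(url))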

Word Crawler script not fetching the target words - Python 2.7

I am a newbie to programming, learning from Udacity. In Unit 2, I studied the following code to fetch links from a particular URL:
import urllib2

def get_page(url):
    return urllib2.urlopen(url).read()

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def print_all_links(page):
    while True:
        url, endpos = get_next_target(page)
        if url:
            print url
            page = page[endpos:]
        else:
            break

print_all_links(get_page('http://en.wikipedia.org'))
It worked perfectly. Today I wanted to modify this code so the script could crawl for a particular word in a webpage rather than URLs. Here is what I came up with:
import urllib2

def get_web(url):
    return urllib2.urlopen(url).read()

def get_links_from(page):
    start_at = page.find('america')
    if start_at == -1:
        return None, 0
    start_word = page.find('a', start_at)
    end_word = page.find('a', start_word + 1)
    word = page[start_word + 1:end_word]
    return word, end_word

def print_words_from(page):
    while True:
        word, endlet = get_links_from(page)
        if word:
            print word
            page = page[endlet:]
        else:
            break

print_words_from(get_web('http://en.wikipedia.org/wiki/America'))
When I run the above, I get no errors, but nothing prints out either. So I added the print keyword -
print print_words_from(get_web('http://en.wikipedia.org/wiki/America'))
When I run it, I get None as the result. I am unable to understand where I am going wrong. My code is probably messed up, but because no error comes up, I can't figure out where.
Seeking help.
I understand this as: you're trying to print the word America for every instance of the word on the Wikipedia page.
You are searching for "america", but the word is written "America"; "a" is not equal to "A", which is why you find no results.
Also, start_word was searching for 'a', so I adjusted it to search for 'A' instead.
At that point it was printing 'meric' over and over, so I changed 'word' to begin at 'start_word' rather than 'start_word + 1', and adjusted 'end_word' to 'end_word + 1' so that the last letter is printed.
It is now working on my machine. Let me know if you need any clarification.
import urllib2

def get_web(url):
    return urllib2.urlopen(url).read()

def get_links_from(page):
    start_at = page.find('America')
    if start_at == -1:
        return None, 0
    start_word = page.find('A', start_at)
    end_word = page.find('a', start_word + 1)
    word = page[start_word:end_word + 1]
    return word, end_word

def print_words_from(page):
    while True:
        word, endlet = get_links_from(page)
        if word:
            print word
            page = page[endlet:]
        else:
            break

print_words_from(get_web('http://en.wikipedia.org/wiki/America'))
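As a side note, if the goal is simply to find or count every occurrence of a fixed word, the standard library can do the scanning directly. A small sketch (not part of the original answer), using the same Python 2 / urllib2 setup:
import re
import urllib2

page = urllib2.urlopen('http://en.wikipedia.org/wiki/America').read()

# finditer yields one match object per occurrence of the word
for match in re.finditer(r'America', page):
    print match.group()

# or just count the occurrences
print page.count('America')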

Django no text in response

Somewhere in my views.py, I have:
def loadFcs(request):
    r = requests.get('a url')
    res = json.loads(r.text)
    # Do other stuff
    return HttpResponse('some response')
Now when I call this from my JavaScript, loadFcs gets called, and requests.get presumably runs asynchronously. I end up seeing 'TypeError at /loadFcs: expected string or buffer', and the trace points to the line with
res = json.loads(r.text)
I also modified my code to check what the problem is:
def loadFcs(request):
    r = requests.get('a url')
    if r == None:
        print 'r is none'
    if r.text == None:
        print 'text is None'
    res = json.loads(r.text)
    # Do other stuff
    return HttpResponse('some response')
and noticed that 'text is None' prints. So I think I need to adjust the code so that requests.get is synchronous; it seems the method execution continues and the return statement is hit before r.text has a value.
Suggestions?
Okay, so I tried the same thing from the Python command line and it worked, BUT not with the same code on my server.
So what was the problem?
Apparently response.text was in an encoding (UTF-8) that my server was not set up to handle, so it was just thrown away, hence None.
Solution: use response.content (which is the raw binary).
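A minimal sketch of the view with that fix applied ('a url' is the question's placeholder; json.loads accepts the raw bytes as long as they are UTF-8):
import json
import requests
from django.http import HttpResponse

def loadFcs(request):
    r = requests.get('a url')
    # r.content is the raw body bytes; unlike r.text it does not depend
    # on the encoding requests guessed for the response
    res = json.loads(r.content)
    # Do other stuff with res
    return HttpResponse('some response')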

Problem with Django url

Going to http://127.0.0.1:8300/projects/cprshelp/edit_file/?filename=manage.py results in a 404 error
urls.py
(r'^projects/(?P<project_name>[\w ,-<>]+)/', include('projects.urls')),
projects/urls.py
(r'edit_file/$', views.edit_file),
What do I need to change to my url files to make this particular url work?
List of valid urls:
^admin/doc/
^admin/
^projects/(?P<project_name>[\w ,-<>]+)/ edit_file/$
^media/(?P<path>.*)$
edit_file function:
def edit_file(request, project_name):
    print '**** project name ' + project_name
    #project = Project.objects.get(name=project_name)
    filename = request.GET['filename']
    #content_of_file = open(project.file_location + filename, 'r')
    #content_of_file = '\n'.join(content_of_file.readlines())
    context = RequestContext(request, {
        #"project": project,
        #"files": get_files_and_directories(project.file_location),
        "filename": filename,
        #"content_of_file": content_of_file,
    })
    return render_to_response("edit_file.html", context)
You need to backslash the - in your regex; left unescaped inside the character class it creates the range ,-<, which includes /, so the pattern swallows too much:
>>> p = re.compile(r'projects/[\w ,-<>]+/')
>>> p.search('http://127.0.0.1:8300/projects/cprshelp/edit_file/?filename=manage.py').group()
'projects/cprshelp/edit_file/'
>>> p = re.compile(r'projects/[\w ,\-<>]+/')
>>> p.search('http://127.0.0.1:8300/projects/cprshelp/edit_file/?filename=manage.py').group()
'projects/cprshelp/'
>>>
Also consider using .get, so a missing parameter doesn't raise an exception:
filename = request.GET.get('filename', '')
Try changing your projects/urls.py to
(r'^edit_file/$', views.edit_file),
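For completeness, a sketch combining both suggestions (the escaped hyphen in the character class, plus the anchored pattern in the included urlconf):
# urls.py
(r'^projects/(?P<project_name>[\w ,\-<>]+)/', include('projects.urls')),

# projects/urls.py
(r'^edit_file/$', views.edit_file),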