Why does icrawler pause when downloading dozens of images?

I want to download images from Google using icrawler. I set the maximum number of downloads to 1000, but I only get 92 images before it stops. Moreover, the result is different every time I run it, and it is always fewer than 100 images.
from icrawler.builtin import GoogleImageCrawler
import os

for var in ['car front bumper damage']:
    var_folder = var.replace(" ", "_")
    image_folder = '/content/drive/MyDrive/DataStor/Crawler-datasets/'
    path = image_folder + var_folder
    try:
        os.makedirs(path)
    except FileExistsError:
        print("File already exists")
    print(f'Collecting images for {var}......')
    google_Crawler = GoogleImageCrawler(downloader_threads=4, storage={'root_dir': path})
    google_Crawler.crawl(keyword=var, max_num=1000)
    print(google_Crawler.feeder.in_queue.qsize())
I don't know whether I have set the parameters incorrectly.

This is because when you crawl Google Images, only the first results page is processed, so you cannot get all 1000 images. A solution is to crawl multiple times over different date ranges:
google_Crawler.crawl(keyword = var, max_num=350, date_min=date(2019, 1, 1), date_max=date(2019, 12, 31))
google_Crawler.crawl(keyword = var, max_num=350, date_min=date(2020,1,1), date_max=date(2020, 12, 31), file_idx_offset='auto')
google_Crawler.crawl(keyword = var, max_num=350, date_min=date(2021,1,1), date_max=date(2021, 12, 31), file_idx_offset='auto')
You can crawl more; you just have to specify a different date range for each call if you don't want duplicate images. Don't forget to import date first:
from datetime import date
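Putting those pieces together, a minimal sketch (reusing var and path from the question's snippet) could loop over the years. The date_min/date_max keyword arguments follow the answer above; newer icrawler releases move date filtering into a filters dict, so check the version you have installed:

from datetime import date
from icrawler.builtin import GoogleImageCrawler

google_Crawler = GoogleImageCrawler(downloader_threads=4, storage={'root_dir': path})

# one crawl per year; file_idx_offset='auto' continues the file numbering
# from the previous crawl instead of overwriting earlier downloads
for year in (2019, 2020, 2021):
    google_Crawler.crawl(keyword=var,
                         max_num=350,
                         date_min=date(year, 1, 1),
                         date_max=date(year, 12, 31),
                         file_idx_offset='auto')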


AttributeError: 'NoneType' object has no attribute 'shape' using OpenCV

I am reading images from Google Drive mounted to Google Colab. I have two folders, one with positive COVID-19 chest X-rays and another with normal chest X-rays. I am trying to show these images side by side for comparison. Here is the code:
Cimages = ('/content/drive/My Drive/Data/Covid')
Nimages = ('/content/drive/My Drive/Data/Normal')
import skimage
from skimage.transform import resize

def plot(i):
    normal = cv2.imread(dataset + 'Normal//' + Nimages[i])
    normal = skimage.transform.resize(normal, (150, 150, 3))
    covid = cv2.imread(dataset + 'Covid//' + Cimages[i])
    covid = skimage.transform.resize(normal, (150, 150, 3), mode = reflect)
    pair = np.concatenate((normal, covid), axis = 1)
    print('Normal vs. Covid')
    plt.figure(figsize=(10, 5))
    plt.imshow(pair)
    plt.show()

for i in range(0, 3):
    plot(i)
This gives me an error:
AttributeError                            Traceback (most recent call last)
<ipython-input-52-237aff042641> in <module>()
      1 for i in range(0,3):
----> 2     plot(i)

<ipython-input-50-85bb2e03725c> in plot(i)
      3 def plot(i):
      4     normal = cv2.imread(dataset +'Normal//' + Nimages[i])
----> 5     normal = skimage.transform.resize(normal, (150,150,3))
      6     covid = cv2.imread(dataset +'Covid//' + Cimages[i])
      7     covid = skimage.transform.resize(normal, (150,150,3), mode = reflect)

/usr/local/lib/python3.6/dist-packages/skimage/transform/_warps.py in resize(image, output_shape, order, mode, cval, clip, preserve_range, anti_aliasing, anti_aliasing_sigma)
     89     output_shape = tuple(output_shape)
     90     output_ndim = len(output_shape)
---> 91     input_shape = image.shape
     92     if output_ndim > image.ndim:
     93         # append dimensions to input_shape

AttributeError: 'NoneType' object has no attribute 'shape'
So the error seems to occur on the skimage.transform.resize line. Please help.
The issue is not with the function
skimage.transform.resize
but with the reading of the image:
normal = cv2.imread(dataset + 'Normal//' + Nimages[i])
I'm not sure what you are trying to do there, but Nimages[i] won't give you the i-th file in a folder; it gives you the i-th character of the string, in your case '/'. So you end up passing dataset + 'Normal//' + '/', which is essentially 'Normal///', and then try to read an image from that path. There is no image there, in which case OpenCV returns None. You then try to resize None with skimage, which fails.
A better option would be to read the images directly, or in a loop that could look something like this:
from os import listdir
from os.path import isfile, join

onlyfiles = [f for f in listdir(Nimages) if isfile(join(Nimages, f))]
for image_path in onlyfiles:
    normal = cv2.imread(join(Nimages, image_path))
    normal = skimage.transform.resize(normal, (150, 150, 3))
This assumes that there are only images in the directory you mentioned.
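As an extra safeguard (not part of the original answer), you can also check cv2.imread's return value before resizing, since it returns None for a path it cannot read. A rough sketch, assuming Nimages still holds the folder path from the question:

import cv2
import skimage.transform
from os import listdir
from os.path import isfile, join

def load_and_resize(folder, filename, size=(150, 150, 3)):
    """Read one image and resize it, failing loudly if the path is wrong."""
    full_path = join(folder, filename)
    img = cv2.imread(full_path)
    if img is None:  # cv2.imread returns None instead of raising on a bad path
        raise FileNotFoundError('Could not read image: ' + full_path)
    return skimage.transform.resize(img, size)

normal_files = [f for f in listdir(Nimages) if isfile(join(Nimages, f))]
first_normal = load_and_resize(Nimages, normal_files[0])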

How to fix 'NoneType' object is not subscriptable error in a while loop

Windows 10
Python 3.7
Anaconda 1.9.7
Spyder 3.3.3
PsychoPy for Python 2.7
I am coding an experiment that needs to present images in a random order for the participant to respond to. I am able to get the images into an array, but to present them one at a time I am using a while loop with a variable that increases by 1 each time through the loop. It is not recognizing the variable as a number, so the array call fails.
I've tried skipping the randomization to see if that is the issue, but it still seems that my variable i is not being read as a number.
# import packages
import random, os
from psychopy import core, visual, event
from PIL import Image

# set up screen with specs and draw
win = visual.Window([400, 300], monitor="testMonitor")
message = visual.TextStim(win, text="")
message.draw()
win.flip()
core.wait(3.0)

# set image size and populate array with images
stim_size = (0.8, 0.8)
image = [i for i in os.listdir('C:/Users/*/psychopy-tests')
         if i.endswith('.bmp')]

# randomize image order
images = random.shuffle(image)
This is where my issue seems to be:
i = 0
while i != 29:  # there are only 28 images
    new = images[i]  # this is where the issue is
    image_stim = Image.open(new)
    stim = visual.ImageStim(win, image_stim, size=(stim_size))
    stim.draw()
    win.update()
    output = []
    if event.getKeys(keyList=['space']):
        output[i] = 1
    if event.getKeys(['escape']):
        win.close()
        core.quit()
    if event.getKeys(keyList=None):
        output[i] = 0
    core.wait(5.0)
    i = i + 1
random.shuffle shuffles the list in place and doesn't return anything, i.e., it returns None.
Therefore images is None and is not subscriptable.
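In other words, the fix is to keep using the original list after shuffling it in place, or to take a shuffled copy with random.sample. A minimal sketch with hypothetical filenames:

import random

image = ['img01.bmp', 'img02.bmp', 'img03.bmp']  # hypothetical filenames

random.shuffle(image)   # shuffles in place and returns None
images = image          # keep a reference to the (now shuffled) list

# or, for a shuffled copy that leaves the original order untouched:
images_copy = random.sample(image, k=len(image))

print(images[0])        # indexing works again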

For Loop trying to scrape TripAdvisor Restaurant data

I am trying to scrape a list of all the restaurants in Hong Kong and their corresponding URLs. Currently, with the code below, I am able to scrape the first and second pages, but I want the for loop towards the bottom to be more dynamic and keep scraping until it hits the number of entries I specified in range().
I am still a novice at this, so any help would be awesome.
# import libraries
import requests
from bs4 import BeautifulSoup
import csv

# scrape the first page, because this URL is different from the ones used when moving to later pages
url0 = 'https://www.tripadvisor.com/Restaurants-g294217-Hong_Kong.html#EATERY_LIST_CONTENTS'
r = requests.get(url0)
data = r.text
soup = BeautifulSoup(r.text, "html.parser")

for link in soup.findAll('a', {'property_title'}):
    print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
    print link.string

# loop to move to the next pages; entries are in increments of 30 per page
for i in range(0, 120, 30):
    entries = str(30)
    # the URL offsets the restaurants in increments of 30 after the "oa"; hence entries as a variable
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + entries + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
        print link.string
    break
I ended up adding a while loop that got it to iterate the way I wanted. Hope this helps people in the future.
for i in range(30, 120, 30):
    while i <= range:
        i = str(i)
        # the URL offsets the restaurants in increments of 30 after the "oa"
        url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + i + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
        r1 = requests.get(url1)
        data1 = r1.text
        soup1 = BeautifulSoup(data1, "html.parser")
        for link in soup1.findAll('a', {'property_title'}):
            print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
            print link.string
        break
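For readers on Python 3, a cleaner sketch of the same idea loops over the offsets directly and stops when a page returns no listings. The 'property_title' class and URL pattern are taken from the question, and TripAdvisor's markup may well have changed since:

import requests
from bs4 import BeautifulSoup

# the first results page uses the un-offset URL shown in the question;
# later pages add an 'oa' offset that grows in steps of 30
base = ('https://www.tripadvisor.com/Restaurants-g294217-oa{}'
        '-Hong_Kong.html#EATERY_LIST_CONTENTS')

for offset in range(30, 120, 30):
    resp = requests.get(base.format(offset))
    soup = BeautifulSoup(resp.text, 'html.parser')
    links = soup.find_all('a', {'class': 'property_title'})
    if not links:  # no more restaurants on this page, so stop early
        break
    for link in links:
        print('https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href'))
        print(link.string)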

Rewriting some functions for xlsxwriter box borders from Python 2 to Python 3

I am having some problems getting xlsxwriter to create box borders around a number of cells when creating an Excel sheet. After some searching, I found a thread here with an example of how to do this in Python 2.
The link to the thread is:
python XlsxWriter set border around multiple cells
The answer I am trying to use is the one given by aubaub.
I am using Python 3 and am trying to get this to work, but I am running into some problems.
The first thing I did was change xrange to range in
def box(workbook, sheet_name, row_start, col_start, row_stop, col_stop),
and then I changed dict.iteritems() to dict.items() in
def add_to_format(existing_format, dict_of_properties, workbook):
since both of these changed between Python 2 and 3.
The next part is where I am struggling and have little idea what to do, and that is the
return(workbook.add_format(dict(new_dict.items() + dict_of_properties.items())))
line. I tried to combine the two dictionaries in another way, replacing the return with:
dest = dict(list(new_dict.items()) + list(dict_of_properties.items()))
return(workbook.add_format(dest))
But this is not working. I have not used dictionaries much before and am fairly blank on how to get this working, or whether there have been other changes to xlsxwriter or other factors that prevent it from working. Does anyone have any good ideas for how to solve this?
Here is a full example of the code and the problem.
import pandas as pd
import xlsxwriter
import numpy as np
from xlsxwriter.utility import xl_range

# Adding the functions from aubaub, copied from the question on Stack Overflow
# https://stackoverflow.com/questions/21599809/python-xlsxwriter-set-border-around-multiple-cells/37907013#37907013
# together with the changes I thought would make it work.
def add_to_format(existing_format, dict_of_properties, workbook):
    """Give a format you want to extend and a dict of the properties you want to
    extend it with, and you get them returned in a single format"""
    new_dict = {}
    for key, value in existing_format.__dict__.items():
        if (value != 0) and (value != {}) and (value != None):
            new_dict[key] = value
    del new_dict['escapes']
    dest = dict(list(new_dict.items()) + list(dict_of_properties.items()))
    return(workbook.add_format(dest))

def box(workbook, sheet_name, row_start, col_start, row_stop, col_stop):
    """Makes an RxC box. Use integers, not the 'A1' format"""
    rows = row_stop - row_start + 1
    cols = col_stop - col_start + 1
    for x in range((rows) * (cols)):  # Total number of cells in the rectangle
        box_form = workbook.add_format()  # The format resets each loop
        row = row_start + (x // cols)
        column = col_start + (x % cols)
        if x < (cols):  # If it's on the top row
            box_form = add_to_format(box_form, {'top': 1}, workbook)
        if x >= ((rows * cols) - cols):  # If it's on the bottom row
            box_form = add_to_format(box_form, {'bottom': 1}, workbook)
        if x % cols == 0:  # If it's on the left column
            box_form = add_to_format(box_form, {'left': 1}, workbook)
        if x % cols == (cols - 1):  # If it's on the right column
            box_form = add_to_format(box_form, {'right': 1}, workbook)
        sheet_name.write(row, column, "", box_form)

# Add a dataframe with some data
frame1 = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')

# Add the frame to the Excel sheet
frame1.to_excel(writer, sheet_name='Sheet1', startcol=1, startrow=2)

# Get the xlsxwriter workbook and worksheet objects.
workbook = writer.book
worksheet = writer.sheets['Sheet1']

# Add some formatting to the table
format00 = workbook.add_format()
format00.set_bold()
format00.set_font_size(14)
format00.set_bg_color('#F2F2F2')
format00.set_align('center')
worksheet.conditional_format(xl_range(2, 1, 2, 5),
                             {'type': 'no_blanks',
                              'format': format00})

box(workbook, 'Sheet1', 3, 1, 12, 5)
writer.save()
I stumbled on this when trying to see if anyone else had posted a better way to deal with formats. Don't use my old way; whether you could make it work with Python 3 or not, it's pretty crappy. Instead, grab what I just put here: https://github.com/Yoyoyoyoyoyoyo/XlsxFormatter.
If you use sheet.cell_writer() instead of sheet.write(), then it will keep a memory of the formats you ask for on a cell-by-cell basis, so writing something new in a cell (or adding a border around it) won't delete the cell's old format, but adds to it instead.
A simple example of your code:
from format_classes import Book
book = Book(where_to_save)
sheet = book.add_book_sheet('Sheet1')
sheet.box(3, 1, 12, 5)
# add data to the box with sheet.cell_writer(...)
book.close()
Look at the code & the README to see how to do other things, like format the box's borders or backgrounds, write data, apply a format to an entire worksheet, etc.
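If you would rather not add an external dependency, here is a rough Python 3 sketch of the original box idea using only xlsxwriter's public API; it passes the worksheet object (not its name) into the helper, builds each edge format from a plain dict, and the file name is just an example. Note that it writes blank cells, so it should be applied to an empty range, much like the original approach:

import xlsxwriter

def draw_box(workbook, worksheet, row_start, col_start, row_stop, col_stop):
    """Write blank cells with border formats so the range appears boxed."""
    for row in range(row_start, row_stop + 1):
        for col in range(col_start, col_stop + 1):
            props = {}
            if row == row_start:
                props['top'] = 1
            if row == row_stop:
                props['bottom'] = 1
            if col == col_start:
                props['left'] = 1
            if col == col_stop:
                props['right'] = 1
            if props:  # only the edge cells need a border format
                worksheet.write_blank(row, col, None, workbook.add_format(props))

workbook = xlsxwriter.Workbook('box_demo.xlsx')  # example file name
worksheet = workbook.add_worksheet('Sheet1')
draw_box(workbook, worksheet, 3, 1, 12, 5)
workbook.close()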

Parsing HTML Tables with BS4

I've been trying different methods of scraping data from this site (http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=WR&college=) and can't seem to get any of them to work. I've tried playing with the indices given, but can't seem to make it work. I think I've tried too many things at this point, so if someone could point me in the right direction I would really appreciate it.
I would like to pull all of the information and export it to a .csv file, but at this point I'm just trying to get the name and position to print, to get started.
Here's my code:
import urllib2
from bs4 import BeautifulSoup
import re

url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')

for row in table.findAll('tr')[0:]:
    col = row.findAll('tr')
    name = col[1].string
    position = col[3].string
    player = (name, position)
    print "|".join(player)
Here's the error I'm getting:
line 14, in name = col[1].string
IndexError: list index out of range.
--UPDATE--
Ok, I've made a little progress. It now allows me to go from start to finish, but it requires knowing how many rows are in the table. How would I get it to just go through them until the end?
Updated Code:
import urllib2
from bs4 import BeautifulSoup
import re

url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')

for row in table.findAll('tr')[1:250]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    player = (name, position)
    print "|".join(player)
I figured it out after only 8 hours or so. Learning is fun. Thanks for the help, Kevin!
The code now also outputs the scraped data to a CSV file. Next up is taking that data and filtering for certain positions.
Here's my code:
import urllib2
from bs4 import BeautifulSoup
import csv

url = ('http://nflcombineresults.com/nflcombinedata.php?year=2000&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')

f = csv.writer(open("2000scrape.csv", "w"))
f.writerow(["Name", "Position", "Height", "Weight", "40-yd", "Bench", "Vertical", "Broad", "Shuttle", "3-Cone"])

# variable to check length of rows
x = (len(table.findAll('tr')) - 1)

# set to run through x
for row in table.findAll('tr')[1:x]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    height = col[4].getText()
    weight = col[5].getText()
    forty = col[7].getText()
    bench = col[8].getText()
    vertical = col[9].getText()
    broad = col[10].getText()
    shuttle = col[11].getText()
    threecone = col[12].getText()
    player = (name, position, height, weight, forty, bench, vertical, broad, shuttle, threecone, )
    f.writerow(player)
I can't run your script due to firewall permissions, but I believe the problem is on this line:
col = row.findAll('tr')
row is a tr tag, and you're asking BeautifulSoup to find all tr tags within that tr tag. You probably meant to do:
col = row.findAll('td')
Furthermore, since the actual text isn't directly inside the td elements but is hidden within nested div and a tags, it may be useful to use the getText method instead of .string:
name = col[1].getText()
position = col[3].getText()
A simple way to parse the table column-wise:
def table_to_list(table):
    data = []
    all_th = table.find_all('th')
    all_heads = [th.get_text() for th in all_th]
    for tr in table.find_all('tr'):
        all_th = tr.find_all('th')
        if all_th:
            continue
        all_td = tr.find_all('td')
        data.append([td.get_text() for td in all_td])
    return list(zip(all_heads, *data))

r = requests.get(url, headers=headers)
bs = BeautifulSoup(r.text)
all_tables = bs.find_all('table')
table_to_list(all_tables[0])
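To take this the rest of the way to the asker's original goal of a .csv file, a small Python 3 sketch built on the same BeautifulSoup approach could look like this; the requests call, the User-Agent header, and the output file name are assumptions:

import csv
import requests
from bs4 import BeautifulSoup

url = 'http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college='
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.text, 'html.parser')

table = soup.find('table')
headers = [th.get_text(strip=True) for th in table.find_all('th')]
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr') if tr.find_all('td')]

with open('1999_combine.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow(headers)   # column names from the table's th cells
    writer.writerows(rows)     # one row per player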