Hello, I want to extract all images from Zillow HTML.
https://regex101.com/r/ifKDEa/1
I'm trying to capture the first 7 images:
https://photos.zillowstatic.com/fp/4e900eea9506449f780cf1ffe718ff0e-cc_ft_960.jpg
https://photos.zillowstatic.com/fp/209c61592b999d83ca05b9b1e76edb5c-cc_ft_960.jpg
https://photos.zillowstatic.com/fp/1c1fe680263f8ac0ce477a5e82c589c7-cc_ft_960.jpg
https://photos.zillowstatic.com/fp/f6cbaed9371bacf8e259dbb83336f3a2-cc_ft_960.jpg
https://photos.zillowstatic.com/fp/19369ad2675cbb63a580dc204697ab07-cc_ft_960.jpg
https://photos.zillowstatic.com/fp/a59b1aba32afef92744c3a5b17123275-cc_ft_960.jpg
https://photos.zillowstatic.com/fp/abd1fea3dcdef3deddaa4a0ebbe84696-cc_ft_960.jpg
I want to paste this HTML into Google Sheets and extract the image links there.
Any idea how to do this?
When I check the HTML from your provided URL, it seems that your expected values are inserted by JavaScript. Unfortunately, this means they cannot be retrieved directly by Google Apps Script and Spreadsheet. As a workaround for indirectly achieving your goal: if the HTML data is manually retrieved using a browser and saved as a file to Google Drive, your expected values can then be extracted using Google Apps Script. In this answer, I would like to propose this workaround.
Sample script:
Please copy and paste the following script into the script editor of the Spreadsheet.
Before you use this script, save the HTML data as a file in your Google Drive and set its name in the filename variable. When you run the script, the URLs are retrieved and put on the active sheet.
function myFunction() {
  const filename = "samplefilename.html";
  // Read the saved HTML file from Google Drive.
  const html = DriveApp.getFilesByName(filename).next().getBlob().getDataAsString();
  // Extract every photo URL; \w already covers digits and underscores.
  const urls = [...html.matchAll(/https:\/\/photos\.zillowstatic\.com\/fp\/[\w-]+?\.jpg/g)].map(m => [m[0]]);
  if (urls.length === 0) return;
  // console.log(urls); // When you use this, the URLs can be seen in the log.
  // Put the URLs on the active sheet, one per row.
  const sheet = SpreadsheetApp.getActiveSheet();
  sheet.getRange(1, 1, urls.length).setValues(urls);
}
Reference:
matchAll()
I have a URL for a SharePoint directory (intranet) and need an API to return the list of files in that directory given the URL. How can I do that using Python?
Posting in case anyone else comes across this issue of getting files from a SharePoint folder given just the folder path.
This link really helped me do this: https://github.com/vgrem/Office365-REST-Python-Client/issues/98. I found a lot of info about doing this over raw HTTP but not in Python, so hopefully this helps anyone else who needs a Python reference.
I am assuming you are all set up with a client_id and client_secret for the SharePoint API. If not, you can use this for reference: https://learn.microsoft.com/en-us/sharepoint/dev/solution-guidance/security-apponly-azureacs
I basically wanted to grab the names/relative URLs of the files within a folder, then get the most recent file in the folder and load it into a dataframe.
I'm sure this isn't the "Pythonic" way to do this, but it works, which is good enough for me.
!pip install Office365-REST-Python-Client
from office365.runtime.auth.client_credential import ClientCredential
from office365.runtime.client_request_exception import ClientRequestException
from office365.sharepoint.client_context import ClientContext
from office365.sharepoint.files.file import File
import io
import datetime
import pandas as pd
sp_site = 'https://<org>.sharepoint.com/sites/<my_site>/'
relative_url = "/sites/<my_site>/Shared Documents/<folder>/<sub_folder>"
# credentials is assumed to be a dict holding the client_id and client_secret from the setup above
client_credentials = ClientCredential(credentials['client_id'], credentials['client_secret'])
ctx = ClientContext(sp_site).with_credentials(client_credentials)
libraryRoot = ctx.web.get_folder_by_server_relative_path(relative_url)
ctx.load(libraryRoot)
ctx.execute_query()
#if you want to get the folders within <sub_folder>
folders = libraryRoot.folders
ctx.load(folders)
ctx.execute_query()
for myfolder in folders:
print("Folder name: {0}".format(myfolder.properties["ServerRelativeUrl"]))
#if you want to get the files in the folder
files = libraryRoot.files
ctx.load(files)
ctx.execute_query()
#create a dataframe of the important file properties for each file in the folder
#(DataFrame.append was removed in pandas 2.0, so collect rows in a list of dicts and build the frame once)
file_rows = []
for myfile in files:
    #use mod_time to get the timestamp in a sortable datetime format
    mod_time = datetime.datetime.strptime(myfile.properties['TimeLastModified'], '%Y-%m-%dT%H:%M:%SZ')
    #collect the info for this file into a dict and add it to the row list
    file_rows.append({'Name': myfile.properties['Name'], 'ServerRelativeUrl': myfile.properties['ServerRelativeUrl'], 'TimeLastModified': myfile.properties['TimeLastModified'], 'ModTime': mod_time})
df_files = pd.DataFrame(file_rows, columns=['Name', 'ServerRelativeUrl', 'TimeLastModified', 'ModTime'])
#print statements if needed
# print("File name: {0}".format(myfile.properties["Name"]))
# print("File link: {0}".format(myfile.properties["ServerRelativeUrl"]))
# print("File last modified: {0}".format(myfile.properties["TimeLastModified"]))
#get index of the most recently modified file and the ServerRelativeUrl associated with that index
newest_index = df_files['ModTime'].idxmax()
newest_file_url = df_files.iloc[newest_index]['ServerRelativeUrl']
# Get Excel File by newest_file_url identified above
response= File.open_binary(ctx, newest_file_url)
# save data to BytesIO stream
bytes_file_obj = io.BytesIO()
bytes_file_obj.write(response.content)
bytes_file_obj.seek(0) # set file object to start
# load Excel file from BytesIO stream
df = pd.read_excel(bytes_file_obj, sheet_name='Sheet1', header= 0)
Here is another helpful link listing the file properties you can view: https://learn.microsoft.com/en-us/previous-versions/office/developer/sharepoint-rest-reference/dn450841(v=office.15). Scroll down to the file properties section.
Hopefully this is helpful to someone. Again, I am not a pro, and most of the time I need things to be a bit more explicit and written out. Maybe others feel that way too.
You need to do 2 things here.
1. Get a list of files (which can be directories or simple files) in the directory of your interest.
2. Loop over each item in this list of files and check if the item is a file or a directory. For each directory, do the same as steps 1 and 2.
You can find more documentation at https://learn.microsoft.com/en-us/sharepoint/dev/sp-add-ins/working-with-folders-and-files-with-rest#working-with-files-attached-to-list-items-by-using-rest
# Returns the list of items in the given directory.
def getFilesList(directoryName):
    ...  # call the SharePoint API here and collect the items
    return filesList

# This will tell you if the item is a file or a directory.
def isDirectory(item):
    ...  # inspect the item's metadata here
    return True  # or False
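For reference, here is a minimal sketch of those two steps combined into one recursive function, reusing the Office365-REST-Python-Client patterns from the answer above; the site URL, credentials, and starting folder path are placeholders you must fill in.
from office365.runtime.auth.client_credential import ClientCredential
from office365.sharepoint.client_context import ClientContext

# Placeholders: substitute your own site URL and app credentials.
sp_site = 'https://<org>.sharepoint.com/sites/<my_site>/'
ctx = ClientContext(sp_site).with_credentials(ClientCredential('<client_id>', '<client_secret>'))

def list_files_recursive(folder_url):
    # Step 1: get the items (files and sub-folders) in the directory.
    folder = ctx.web.get_folder_by_server_relative_path(folder_url)
    files = folder.files
    subfolders = folder.folders
    ctx.load(files)
    ctx.load(subfolders)
    ctx.execute_query()
    # Step 2: print each file, then repeat steps 1 and 2 for each sub-folder.
    for f in files:
        print(f.properties["ServerRelativeUrl"])
    for sub in subfolders:
        list_files_recursive(sub.properties["ServerRelativeUrl"])

list_files_recursive("/sites/<my_site>/Shared Documents/<folder>")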
Hope this helps.
I have a url for sharepoint directory
Assuming you are asking about a library, you can use SharePoint's REST API and make a web service call to:
https://yourServer/sites/yourSite/_api/web/lists/getbytitle('Documents')/items?$select=Title
This will return a list of documents at: https://yourServer/sites/yourSite/Documents
See: https://msdn.microsoft.com/en-us/library/office/dn531433.aspx
You will of course need the appropriate permissions / credentials to access that library.
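If you are making that call from Python, a minimal sketch with the requests library might look like the following; the Accept header asks SharePoint for JSON instead of the default XML, and the basic-auth credentials are only a placeholder for whatever authentication your server actually accepts.
import requests

url = "https://yourServer/sites/yourSite/_api/web/lists/getbytitle('Documents')/items?$select=Title"
headers = {"Accept": "application/json;odata=verbose"}  # request JSON rather than XML
# Placeholder auth: on-premises servers often require NTLM/Kerberos instead of basic auth.
response = requests.get(url, headers=headers, auth=("user", "password"))
response.raise_for_status()
# In verbose OData JSON the items live under d.results.
for item in response.json()["d"]["results"]:
    print(item["Title"])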
You cannot use "server name/sites/Folder name/Subfolder name/_api/web/lists/getbytitle('Documents')/items?$select=Title" as the URL in the SharePoint REST API.
The URL structure should be as below, where WebSiteURL is the URL of the site/subsite containing the document library from which you are trying to get files, and Documents is the display name of the document library:
WebSiteURL/_api/web/lists/getbytitle('Documents')/items?$select=Title
If you want to list metadata field values, add the field names, separated by commas, to $select.
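For example, to also return a couple of metadata fields (Modified and FileLeafRef are standard SharePoint fields; substitute the ones you need):
WebSiteURL/_api/web/lists/getbytitle('Documents')/items?$select=Title,Modified,FileLeafRef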
Quick tip: if you are not sure about the REST API URL formation, try pasting the URL into the Chrome browser (you must be logged in to the SharePoint site with appropriate permissions) and see whether you get a proper result as XML. If you do, update the REST URL in your code and run it. This way you will save the time of repeatedly running your Python code.
I have a problem scraping the AliExpress site.
https://www.aliexpress.com/item/Free-gift-100-Factory-Original-Unlocked-Apple-iphone-4G-8GB-16GB-32GB-Cell-phone-3-5/32691056589.html
This is one URL.
This is what I want to get:
import requests
import lxml.html as html
from bs4 import BeautifulSoup

r = requests.get('https://www.aliexpress.com/item/Free-gift-100-Factory-Original-Unlocked-Apple-iphone-4G-8GB-16GB-32GB-Cell-phone-3-5/32691056589.html')

# BeautifulSoup:
soup = BeautifulSoup(r.content, 'lxml')
content = soup.find('div', {'id': 'j-product-tabbed-pane'})

# lxml parsing:
root = html.fromstring(r.content)
results = root.xpath('//img[@alt="aeProduct.getSubject()"]')
f = open('result.html', 'wb')  # tostring() returns bytes, so open in binary mode
f.write(html.tostring(results[0]))
f.close()
This is my code, but it gives me the wrong result.
Inspecting in the browser shows those elements,
but the code above doesn't give me anything.
I think requests.get doesn't give me the correct contents. But why, and how can I solve this problem? Do they detect me as a bot? Can anyone help me?
Thank you, everyone.
Try this:
1. Use a user agent (see the sketch below).
2. Use a proxy (also shown below).
3. Disable JavaScript for this site and refresh it, then see whether the site still has the element or whether it is loaded by JavaScript. If it is loaded by JavaScript, you should find a way to render JS.
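For points 1 and 2, a minimal sketch with requests might look like this; the user-agent string and the proxy address are placeholders. If the element is loaded by JavaScript, requests alone will never see it, so a browser-driven tool such as Selenium would be needed to render the page first.
import requests

headers = {
    # Any realistic desktop browser user-agent string will do.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
}
# Placeholder proxy: substitute one you actually have access to.
proxies = {
    "http": "http://user:pass@proxyhost:8080",
    "https": "http://user:pass@proxyhost:8080",
}
r = requests.get(
    "https://www.aliexpress.com/item/Free-gift-100-Factory-Original-Unlocked-Apple-iphone-4G-8GB-16GB-32GB-Cell-phone-3-5/32691056589.html",
    headers=headers,
    proxies=proxies,
)
print(r.status_code)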
I am doing a migration from HTML to Drupal using the Migrate module.
In our custom migration script I need to match all .html files in every folder except the images folder.
I pass the regex to $list_files = new MigrateListFiles([], [], $regex).
Below is the format of the HTML files:
/magazines/sample.html
/test/index.html
/test/format_ss1.html
/test/folder/newstyle_1.html
/images/two.html
I need to get only the first 2 HTML files, i.e., we are excluding files that end with '_[0-9]' or '_ss[0-9]', as well as .html files in the images folder.
I have successfully excluded files 3 and 4, but I am not able to exclude the .html files in the images folder.
$regex = '/[a-zA-Z0-9\-][^_ss\d][^_\d]+\.html/'; //this will do for files 3 and 4
but I need to exclude the images folder.
I have tried:
$regex = '/[^images\/][a-zA-Z0-9\-][^_ss\d][^_\d]+\.html/'; // not working
whereas in a plain PHP script this works:
$regex = '~^(?!/images/)[a-zA-Z0-9/-]+(?!_ss\d|\d)\.html$~'; //works in a PHP script
Can someone help me out with this?
Try
/((?!images)[0-9a-zA-Z])+/[^_]*[^\d]+\.html
Matches:
/magazines/sample.html
/test/index.html
/test/folder/newstyle.html
/test/format_ss.html
Does not match:
/test/format_ss1.html
/test/folder/newstyle_1.html
/images/two.html
/images/1.html
/test/folder/newstyle1.html
/test/folder/newstyle_12.html
Is this acceptable?
It's a Drupal/Migrate-specific issue: the regex applies only to the filename (not the directory), as it eventually gets passed to https://api.drupal.org/api/drupal/includes%21file.inc/function/file_scan_directory/7
file_scan_directory($dir, $mask, $options = array(), $depth = 0)
$mask: The preg_match() regular expression of the files to find.
I think the only way to exclude certain directories is to return FALSE from the prepareRow() function if the row has a path you don't require.
function prepareRow($row)
The prepareRow() method is called by the source class next() method, after loading the data row. The argument $row is a stdClass object containing the raw data as provided by the source. There are two primary reasons to implement prepareRow():
To modify the data row before it passes through any further methods and handlers: for example, fetching related data, splitting out source fields, combining or creating new source fields based on some logic.
To conditionally skip a row (by returning FALSE).
https://www.drupal.org/node/1132582
I need to pull all the links from a page that resides on an intranet; however, I am unsure how best to do it. The structure of the site is as follows:
List of topics
Topic 1
Topic 2
Topic 3
etc
Now the links reside in each of the topic pages. I want to avoid going through in excess of 500 topic pages manually to extract the URI.
Each of the topic pages has the following structure
http://alias/filename.php?cat=6&number=1
The cat parameter refers to the category and the number parameter refers to the topic.
Once in the topic page, the URI I need to extract again follows a particular format:
http://alias/value?id=somevalue
Caveats
I don't have access to the database, so trawling through it is not an option
There is only ever a single URI in each topic page
I need to extract the list to a file that simply lists each URI on a new line
I would like to execute some sort of script I can run from the terminal via BASH that will trawl through the topic URIs and then the URI in each of the topics.
In a nutshell
How can I write a script, runnable from BASH, that recursively goes through the list of topics, extracts the URI from each topic page, and spits out a text file with each extracted URI on a new line?
I would implement this with Perl, using the HTML::TokeParser and WWW::Mechanize modules:
use strict;
use warnings;
use HTML::TokeParser;
use WWW::Mechanize;

my $site = WWW::Mechanize->new(autocheck => 1);
my $topicmax = 500; # Note: adjust this to the number of topic pages you have

# loop through each topic page
foreach (1..$topicmax) {
    my $topicurl = "http://alias/filename.php?cat=6&number=$_";
    # get the page
    $site->get($topicurl);
    my $content = $site->content;
    my $p = HTML::TokeParser->new(\$content);
    # parse the page and extract the links
    while (my $token = $p->get_tag("a")) {
        my $url = $token->[1]{href};
        # use a regex to test for the link format we want
        if (defined $url && $url =~ /^http:\/\/alias\/value\?id=/) {
            print "$url\n";
        }
    }
}
The script prints to stdout, so you just need to redirect it to a file.
I am very new to programming and I am stumped by this problem.
I want to create an autocomplete textbox.
From what I see, I would need to use JSON. However, for the source of the JSON I need a URL to a file/script, and I do not quite get this part.
This is an example from http://jqueryui.com/demos/autocomplete/#option-source
$( "#birds" ).autocomplete({
source: "search.php",
minLength: 2,
select: function( event, ui ) {
log( ui.item ?
"Selected: " + ui.item.value + " aka " + ui.item.id :
"Nothing selected, input was " + this.value );
}
});
Does this mean that whenever I type something in the autocomplete textbox, it accesses the file at the URL, and the file's script output changes dynamically according to my input?
Also, I can only find examples where the URL points to a PHP file. Can it be done in Django, such as by specifying a URL as the source and linking that URL to a view that outputs the data?
Whenever you type something in the autocomplete textbox, it accesses the URL to retrieve the array of data. (Use Firebug or Chrome developer tools while testing the demo to see the HTTP requests sent as you type.)
From the documentation you linked:
"When a String is used, the Autocomplete plugin expects that string to
point to a URL resource that will return JSON data."
So yes, you can use Django as long as the URL returns JSON data.
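As an illustration, here is a minimal sketch of a Django view that could serve as the source; the view name, the data list, and the URL pattern are all hypothetical placeholders. jQuery UI's autocomplete sends the typed text as the term GET parameter.
from django.http import JsonResponse

def bird_search(request):
    term = request.GET.get('term', '')  # the text typed so far
    # Hypothetical in-memory data; in practice you would filter a model queryset here.
    birds = ["blackbird", "bluebird", "bobolink"]
    results = [b for b in birds if term.lower() in b.lower()]
    return JsonResponse(results, safe=False)  # safe=False allows a top-level JSON list
Hook it up in urls.py (for example, path('search/', bird_search)) and point the widget at it with source: "/search/" in the autocomplete options.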