scrape with lxml and python from web si

scrape with lxml and python from web si - python-2.7

I would like to extract certain data from websites.
Originally, I was converting to .txt file and then writing some routines in Python to filter/read out the data, which worked for 95% of the data, which is not sufficient. I found that there is a way with lxml, which I tried, but could not succeed. With XPATH I think I mark the correct position, however, I get only empty brackets as result []. If anybody knew how to correct it, would be highly appreciated.
Thanks.
Peter
from lxml import html
import requests
page=requests.get('http://www.finanzen.net/analyse/ING_Group_NV_overweight-JP_Morgan_Chase__Co__529284')
tree=html.fromstring(page.text)
unternehmen=tree.xpath('/html/body/div[1]/div[8]/div[2]/div[3]/div[4]/div[1]/div/div[2]/div[4]/div[2]/table/tbody/tr[1]/td[1]/br')
#This should fetch the information about unternehmen
print unternehmen

That page source code is a mess. Very little id attributes and many anonymous <div>. You could get the nearest element with an id attribute, go up to its parent, root of the table that keeps the element you want to extract. Following xpath expression worked for me. Give it a try:
response.xpath('//a[#id="commentLink"]/ancestor::div[4]/div[2]/div[4]/div[2]/table/tr/td[1]/text()')[0]
It yields:
u'ING Group NV'
The last [0] is because it's hard to parse non-closed <br> tags and td column retrieves three text elements, so I get only the first one. You will have to adapt it to your needs.

Related

Include line breaks in Google Sheets cells when getting form responses from Google Forms

I am wondering if anyone can help with this. I am creating a merged document using HTTP Post via Infusionsoft from a Google Form Response. The HTTP Post automatically pulls the data from Infusionsoft and posts it to the Google Form which then adds it to a Google sheet. I am then using Autocrat to automatically create a letter.
I have managed to make all this work, however, the one issue that I am having is that some of the form entries are paragraph text (so, for instance, the body of the letter, which has several paragraphs). When I pull this data into the sheet once the HTTP post has fired, the text within that cell separates the paragraphs with <br><br>. So, for example, it would be:
"Paragraph one.<br><br>
Paragraph two.<br><br>" etc.
This then merges into the letter with the <br><br> rather than line breaks.
I want it to appear within the merged letter as:
"Paragraph one.
Paragraph two."
within the cell. Ie with line breaks.
Is this possible? Have found another post with this function, but this is the opposite of what I want to achieve.
function lineBreakTest() {
var cellWithLineBreaks = SpreadsheetApp.getActiveSheet().getRange("q3").getValue();
Logger.log(cellWithLineBreaks);
cellWithLineBreaks = cellWithLineBreaks.replace(/\n/g, '<br>');
Logger.log(cellWithLineBreaks);
// Post to your Google Site here. Logger.log is just used to demonstrate.
}
I would also want it to apply to the whole column, so whenever autocrat runs and a new row is added, it would apply the same function.
Would this all happen automatically also?
Any help would be amazing.

see this solution if it fits you:
=ARRAYFORMULA(REGEXREPLACE(A:B, "<br><br>", CHAR(10)&CHAR(10)))
...then you can just hide A:B columns

Having trouble scraping a specfic site with scrapy properly

I went over the tutorial for Scrapy, and I was able to understand how to scrap the site included in the tutorial. But I'm having a little trouble with some of the more complicated sites (at least to me).
I'm attempting to scrape the rows and columns of the insider transactions from this webpage:
http://finviz.com/insidertrading.ashx
I'm using command prompt commands with scrapy to test out if I'm able to scrape the necessary information, so the following commands are what I've have written in the command prompt.
scrapy shell "http://finviz.com/insidertrading.ashx"
I then used firebug from firefox to look at the html code of the page.
I'm able to get some of the information (Stock Name, Name of the Insider and Date) into a list via this code:
response.css('td a.tab-link::text').extract()
However, the rest of the info is missing.
I'm able to get some (maybe most)of the missing info (Cost, Shares, Value etc) via this code
response.css(td::text).extract()
I can't figure out how to cleanly get all info together in one scrape.
Thanks.
EDIT: The other option would be to collect the data iteratively, one row at a time, so I can separate it as I like. I'm brooding over this as well.

Since the data is tabular, the position of table rows and columns is predictable and stable. You can simply extract all text in the row and unpack it into variables:
for row in response.xpath("//tr[#class='insider-option-row']"):
items = row.xpath('td/a/text() | td/text()').extract()
ticker, owner, relationship, date, transaction, cost, shares, value, shares_total, sec_form_4 = items

Writing value on the current textbox - Selenium

Well I'm facing an issue here:
I use Selenium to automate test on my application, and I use the following code to fill a textbox:
driver.find_element_by_id("form_widget_serial_number").send_keys("example", u'\u0009')
All the textboxes have the same id (form_widget_serial_number) so it's impossible to select a specific textbox by using selection by id (the code above selects the first textbox and u'\u0009' changes the cursor position to the next textbox).
Anyway, how can I send values to the "current" textbox without using find_element_by_id or another selector, given that the cursor is in the current textbox?

There are several selectors available to you - Xpath and css in particular are very powerful.
Xpath can even select by the text of an element and then access its second sibling1:
elem = driver.find_element(:xpath, "//div[contains(text(), 'match this text')]/following-sibling::*[2]")
If you need to find the current element that has focus:
focused_elem = driver.switch_to.active_element
And then to find the next one:
elem_sibling = focused_elem.find_element(:xpath, "following-sibling::*")
Using this example, you could easily traverse any sequence of sequential elements, which would appear to solve your problem.
Take a close look at the structure of the page and then see how you would navigate to the element you need from a reference you can get to. There is probably an Xpath or css selector that can make the traversal across the document.
1 FYI, I'm using an old version of Selenium in Ruby, in case your version's syntax is different

My approach would be:
Get a list of all contemplable elements by the known id
Loop over them until you find a focused element
Use send_keys to write into that element
To get the focused element something like
get_element_index('dom=document.activeElement')
might work. See this question for more information.
Edit: If the list you get from 1. is accidently sorted properly the index of an element in this list might even represent the number of u'\u0009' you have to send to reach that specific element.

Use ActionChains utility to do that. Use following code:
from selenium.webdriver.common.action_chains import ActionChains
ActionChains(driver).send_keys("Hello").perform()

How to determine if a given word or phrase from a list is within an anchor tag?

We have a ColdFusion based site that involves a large number of 'document authors' that have little or no knowledge of HTML. The 'documents' they create are comprised of HTML stored in a table in the database. They use a CKEDITOR interface. The content that they create is output into specific area of the page. The document frequently has tons of technical terms that readers may not be familiar with that we would like to have tooltips automatically show up for.
I and the other programmer want to have some code insert 'tooltip' code into the page based on a list of words in a table on our SQL server. The 'dictionary' table in our database has a unique ID, the word/phrase we will look for and a corresponding definition that would be displayed in the tooltip.
For instance, one of the word/phrases we will be looking for is 'Scrum Master'. If it occurs in the document area, we need to insert code around the words to create a tooltip. To do that, we need to see if certain conditions exist. Are the words within an anchor tag? If yes, is there already a title value for the tag (title is used to contain the info to be displayed in a tooltip)? If a title tag exists, don't do anything. If the words are not in an anchor tag, then we would put anchor tags around the words along with the title that will contain the definition.
The tooltip code we use is via jQuery (http://jqueryui.com/tooltip/). It is quick and simple to use. We just need to figure out how to use it dynamically based on our dictionary table.
Do you have any suggestions of how to go about this?
I was hoping that jSoup might have a function that I could use, but that doesn't seem to be the right technology for what I want to do, but I could be wrong and I am happy to be corrected!
We have a large number of these documents and so manually inserting and maintaining the tooltip code is just not an option.

Update you content with something like:
strOut = ReplaceList(strIn, ValueList(qryTT.find), ValueList(qryTT.replace));
Since words are delimited by spaces, the qryTT.find needs to have spaces. The replace column is going to need to include some of the original content. You are going to have to be careful with words followed by a comma or a period too.
I would cache the results because I would expect it to be memory intensive.

SQLite like with Regular Expressions

I have a column with HTML content. I want to search for words in that column, but only the text, not the HTML code.
For example:
(1) <p class="last">First time I went there...</p>
(2) This is a <em>very</em> subtle colour.
(1) Searching for last doesn't find it, because it's a class name, not content.
(2) Searching for very subtle will find it, ignoring HTML
Is this possible with SQLite directly?
Note: I cannot define functions.

Don't do it with SQLite.
Do it with your programming language, your framework that is using SQLite.
In the table, where you have the column with the html code, add additional columns for data about the html. You will have to gather the data for the extra columns, while you analyze the html with your framework.
Track data about the structure the html format does have and save in an extra column the textual content of the html data.
You can get all tags by simple REGEX:
/<?[^<>]+>?/
Checkout how you receive data by scanning the html data for tags with the regexp above and write an iterated evaluation for tag-content (i.e. if a string in the results-array starts with a "<" it´s a tag, by scanning it with /<\s*\/\s*[^>]+>/ you will see if it is a ending tag and by scanning it with /<\s*[^\/>]+\s*\/\s*>/ you will see if it is a single closed tag. If none of the differentiated states does apply, it is textual content.

There isn't a good way to do that in SQLite directly (you'd need to build a SQLite extension that would parse the HTML and let you search through it like MSSQL's XML field type).
Your best bet is going to be to parse the HTML in your code and write out all the text into a separate column to search on as #Kevin suggests in the comments.
E.g.
ID | HTML | Text
---------------------------------------------------------------------------
1 | <p class="last">First time ...</p> | First time ...
2 | This is a <em>very</em> subtle colour. | This is a very subtle colour.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js