I am trying to scrape a Russian website, but I am stuck on converting a Russian Cyrillic date string to a datetime object.
Let's take this html piece for example:
<div class="medium-events-list_datetime">22 января весь день</div>
I am able to fetch the content of this div using lxml, i.e.:
date = root.xpath('/html/body/div[1]/div/div[2]/text()')[0].strip()
So the relevant part of this string is 22 января, which is the day and the month.
To get this part I am using the .split() method.
Now here lies the problem: I am trying to convert this into a datetime.
I tried to use dateparser (https://dateparser.readthedocs.org/en/latest/), which is supposed to support Russian.
However, it returns None when I pass this string to dateparser.parse().
Did anyone run into a similar issue? I am banging my head against the wall on this one. Any help appreciated :)
Try running this example:
#coding=utf-8
import dateparser
s = u"22 января"
print dateparser.parse(s)
It should output 2016-01-22 00:00:00
Important: Make sure that you're actually using UTF-8 strings. More info: https://www.python.org/dev/peps/pep-0263/
Otherwise your parsing/splitting might be wrong, so have a look at the results after the split().
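For the splitting step, a minimal sketch (the sample text is taken from the question; the assumption that the day and month are always the first two tokens is mine, so check it against your pages):

```python
# Keep only the first two whitespace-separated tokens: day and month name
raw = u"22 января весь день"
date_part = u" ".join(raw.split()[:2])
print(date_part)  # 22 января
```

Passing date_part rather than the full string (with весь день still attached) to dateparser.parse() may also be more reliable.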
I am trying to use the REGEXP_EXTRACT custom field to pull a portion of my URL using the page dimension in Google Data Studio and cannot figure it out. The page URL structure is similar to this:
website.forum.com/webforms/great_practiceinfo_part2.aspx?function=greatcoverage
I'd like to extract only the middle section, "great_practiceinfo_part2". I've tried many different formulas, but nothing seems to work. Does the page dimension work in this scenario? Any help would be much appreciated.
Thanks
It seemed to work fine in Google Sheets when I used =REGEXEXTRACT(A3, B3) with your string, website.forum.com/webforms/great_practiceinfo_part2.aspx?function=greatcoverage, in A3 and the regex \/([^\/]*?)\.aspx\? in B3. I'm guessing you just need to refine how you build your regex pattern.
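The same pattern can be sanity-checked outside Data Studio; a quick sketch in Python using the URL from the question:

```python
import re

url = "website.forum.com/webforms/great_practiceinfo_part2.aspx?function=greatcoverage"
# Capture the non-slash run that sits between the last "/" and ".aspx?"
match = re.search(r"/([^/]*?)\.aspx\?", url)
print(match.group(1))  # great_practiceinfo_part2
```

Data Studio's REGEXP_EXTRACT uses RE2, so the same pattern should carry over (escaping the slashes as in the Sheets version is harmless either way).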
Thanks to everyone in advance.
I encountered a problem when using Scrapy on Python 2.7.
The webpage I tried to crawl is a discussion board for the Chinese stock market.
When I tried to get the first number, "42177", just under the banner of the page (the number you see on the webpage may not match the picture shown here, because it represents the number of times this article has been read and is updated in real time), I always got empty content. I am aware that this might be a dynamic content issue, but I don't have a clue how to crawl it properly.
The code I used is:
item["read"] = info.xpath("div[@id='zwmbti']/div[@id='zwmbtilr']/span[@class='tc1']/text()").extract()
I think the XPath is set correctly, and I have checked the return value of this response; it indeed told me that there is nothing under this node. Results shown here: 'read': [u'<div id="zwmbtilr"></div>']
If it had something, there would be something between <div id="zwmbtilr"> and </div>.
I'd really appreciate it if you guys shared any thoughts on this!
I just opened your link in Firefox with NoScript enabled. There is nothing inside the <div id='zwmbtilr'></div>. If I enable JavaScript, I can see the content you want. So, as you already knew, it is a dynamic content issue.
Your first option is to try to identify the request generated by the JavaScript. If you can do that, you can send the same request from Scrapy. If you can't, the next option is usually to use some package with JavaScript/browser emulation, something like ScrapyJS or Scrapy + Selenium.
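Whichever route you take, a quick way to sanity-check the XPath itself is to run it with lxml against a static snippet (the markup below is a hypothetical rendering of what the div looks like after the JavaScript has filled it in):

```python
from lxml import html

# Hypothetical markup once the read-count has been populated by JavaScript
snippet = '<div id="zwmbti"><div id="zwmbtilr"><span class="tc1">42177</span></div></div>'
info = html.fromstring(snippet)
# Relative XPath from the outer div, as in the question
print(info.xpath("div[@id='zwmbtilr']/span[@class='tc1']/text()"))  # ['42177']
```

If the expression works against the rendered markup but not against the live response, that confirms the content is injected client-side rather than a problem with the XPath.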
I am trying to get a fully qualified URL; here is the code:
string path = string.Format("/sitecore/shell/Applications/Content%20Manager/default.aspx?id={0}&la={1}&fo={0}", contentItem.ID, contentItem.Language);
string fullPath = Sitecore.Web.WebUtil.GetFullUrl(path);
text = text.Replace("$itemUrl$", fullPath);
This returns something like this http://cp.localsite/sitecore/shell/Applications/Content%20Manager/default.aspx?id={DC6B4AE0-929D-4F19-97F4-825796A30781}&la=en&fo={DC6B4AE0-929D-4F19-97F4-825796A30781}
This is generated as a link up to ?id=; from the id onwards it looks like plain text. How can I resolve this? I want a clickable URL for the content. I really appreciate any help.
Thanks.
Could you please give a bit more context about what you are trying to achieve? The described behavior is correct, but it is obviously not what you were hoping for.
UPDATED:
Looks like you are on the right track. The only thing I noticed is that you could really get away with just using the fo querystring to get to the right item; you could also use the other ones mentioned here: Custom email notification with link - sitecore to get more specific.
The reason the URL is being cut off at id= is that the text parser probably does not like the curly brackets; try encoding the URL and see if that works:
text = HttpUtility.UrlEncode(text);
I've successfully managed to use win32 COM to grab details about the page numbers of a Word document. However, when I try to use mydoc.ActiveWindow.Selection.Information(wdActiveEndPageNumber), I get a "wdActiveEndPageNumber is not defined" error. I know that the file has been read into memory properly because mydoc.Content.Text prints out all the content. Does anyone know why this is happening and how to fix it? And is there any Python documentation, or am I stuck looking at VB and C# on MSDN?
import win32com.client
word = win32com.client.Dispatch("Word.Application")
mydoc = word.Documents.Open("path:\\to\\file")
mydoc.ActiveWindow.Selection.Information(wdActiveEndPageNumber)
That's because wdActiveEndPageNumber is a constant that is not defined by win32com until you generate the COM type library for the application. Try this:
from win32com.client.gencache import EnsureDispatch
from win32com.client import constants
word = EnsureDispatch("Word.Application")
mydoc = word.Documents.Open("path:\\to\\file")
mydoc.ActiveWindow.Selection.Information(constants.wdActiveEndPageNumber)
You could use the enumerated number. You can find this using the object browser in Word: go into the VBA editor, press F2, then enter wdActiveEndPageNumber as the search term. When you select it in the results, it will show you its integer value. Then put that in your code.
I'm trying to extract some text from a poorly designed web page for a project. After long research and learning Python I came close to making it happen, but I can't find the right regular expression to do it.
So here is what I've accomplished: http://coj.uci.cu/24h/status.xhtml?username=Diego1149&abb=1006. From the source code of this web page I want to get the whole line of the first instance of an accepted problem. So I thought of this:
exprespatFinderTitle = re.compile('<table id="submission" class="volume">.*(<tr class=.*>.*<label class="AC">.*Accepted.*</label>.*</tr>).*</table>')
but what this does is clip everything up to the last <tr> of the table. Can someone help me figure this out?
I'm using Python 2.7 with BeautifulSoup and urllib.
Stick to BeautifulSoup alone; regular expressions are not the tool for HTML parsing:
from bs4 import BeautifulSoup
import urllib2

soup = BeautifulSoup(urllib2.urlopen('http://coj.uci.cu/24h/status.xhtml?username=Diego1149&abb=1006').read())
table = soup.find('table', id='submission')
accepted = table.tbody.find('label', class_='AC')
if accepted:
    row = accepted.parent.parent  # row with the accepted column
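A self-contained version of the same traversal against a minimal stand-in table (the markup here is my guess at the real page's structure, just to show how .parent.parent climbs from the label back to its row):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real submissions table
page = """
<table id="submission" class="volume"><tbody>
<tr><td>1006</td><td><label class="WA">Wrong Answer</label></td></tr>
<tr><td>1006</td><td><label class="AC">Accepted</label></td></tr>
</tbody></table>
"""
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table", id="submission")
accepted = table.tbody.find("label", class_="AC")
row = accepted.parent.parent  # label -> td -> tr: the whole accepted row
print(row.get_text(" ", strip=True))  # 1006 Accepted
```

Because find() returns the first match in document order, this picks up the first accepted entry, which is exactly the "whole line" the question asks for.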