Sublime Text regex to re-order Facebook message export - regex

I've exported my Facebook message history and found that messages are displayed as below. Some are in blocks, some blocks are out of order, some are together, all in reverse.
John Doe Sunday, 24 August 2014, 01:18
Hello!
Jane Doe Sunday, 24 August 2014, 01:17
Hi!
What I'm trying to do is to use a regex, [a-z]* and $1, etc., to search and replace in Sublime Text to re-layout the data such that the above becomes sortable in Excel (to get everything in correct order) as below (or any date-sortable manner):
2014.08.24 01:17 Jane Doe Hi
2014.08.24 01:18 John Doe Hello!
Is this possible? I've managed to select the name, date and time, but I can't consistently capture the variable-length messages, nor re-order/move the date/time or message to match the example below. Does this make sense, or am I wasting my time?
Would it be best in a tab/comma separated way, too?

You'd be much better off using your favorite language to parse the HTML in html/messages.htm. Open the file in Sublime, use a code formatter like HTML-CSS-JS Prettify (note: requires Node.js) to format it, then look at the structure. Basically, each conversation is in reverse order, oldest first, and within the conversation each message is sorted most recent first (yay consistency!). Here's an example:
<div class="thread">Fred Smith, Joe Blow
<div class="message">
<div class="message_header">
<span class="user">Fred Smith</span>
<span class="meta">Monday, June 13, 2011 at 3:42pm EDT</span>
</div>
</div>
<p>Not much Joe. How are you?</p>
<div class="message">
<div class="message_header">
<span class="user">Joe Blow</span>
<span class="meta">Monday, June 13, 2011 at 11:00am EDT</span>
</div>
</div>
<p>Hey there Fred, what's up?</p>
</div>
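As a rough illustration of the parsing approach, here is a minimal Python sketch that uses only the standard library. It assumes the markup looks exactly like the sample above (a span with class "user", a span with class "meta", then a <p> holding the message) and that the date strings are in English. The names MessageParser and sortable_lines are made up for this example, and the time-zone abbreviation is simply dropped rather than converted.

```python
from datetime import datetime
from html.parser import HTMLParser


class MessageParser(HTMLParser):
    """Collect (timestamp, user, text) tuples from the export markup."""

    def __init__(self):
        super().__init__()
        self.messages = []   # finished (datetime, user, text) tuples
        self._field = None   # which piece of a message the current text belongs to
        self._user = None
        self._when = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("user", "meta"):
            self._field = css_class
        elif tag == "p":
            self._field = "text"

    def handle_endtag(self, tag):
        if tag in ("span", "p"):
            self._field = None

    def handle_data(self, data):
        if self._field == "user":
            self._user = data.strip()
        elif self._field == "meta":
            # Drop the trailing zone abbreviation ("EDT"); times are kept as written.
            stamp = data.strip().rsplit(" ", 1)[0]
            self._when = datetime.strptime(stamp, "%A, %B %d, %Y at %I:%M%p")
        elif self._field == "text":
            # The <p> follows its header, so user and timestamp are already set.
            self.messages.append((self._when, self._user, data.strip()))


def sortable_lines(html):
    """Return tab-separated 'date time, user, message' lines, oldest first."""
    parser = MessageParser()
    parser.feed(html)
    parser.close()
    return [f"{when:%Y.%m.%d %H:%M}\t{user}\t{text}"
            for when, user, text in sorted(parser.messages)]


sample = """<div class="thread">Fred Smith, Joe Blow
<div class="message">
<div class="message_header">
<span class="user">Fred Smith</span>
<span class="meta">Monday, June 13, 2011 at 3:42pm EDT</span>
</div>
</div>
<p>Not much Joe. How are you?</p>
<div class="message">
<div class="message_header">
<span class="user">Joe Blow</span>
<span class="meta">Monday, June 13, 2011 at 11:00am EDT</span>
</div>
</div>
<p>Hey there Fred, what's up?</p>
</div>"""

for line in sortable_lines(sample):
    print(line)
```

The tab-separated output can be pasted straight into Excel, and because the date is written year first, a plain text sort also puts the messages in chronological order.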

In ZURB Foundation 6, is there a way to have columns flow around a hidden column?

Using ZURB Foundation 6, is there a way to have columns flow around a hidden column?
So for this code:
<ul class="row large-up-3">
<li class="column column-block large-4" style="display:none;">1</li>
<li class="column column-block large-4">2</li>
<li class="column column-block large-4">3</li>
<li class="column column-block large-4">4</li>
<li class="column column-block large-4">5</li>
<li class="column column-block large-4">6</li>
</ul>
It currently displays like this:
2 3
4 5 6
But what I'm hoping to achieve is this:
2 3 4
5 6
Is this feasible?
I don't need to use "display: none" if there's another way to do this. That's just placeholder CSS code.
If not, is there a way to do this in Bootstrap or writing my own grid CSS? I can bypass Foundation for this particular content if needed.
Thank you for any help!
Figured it out. In Foundation 6, the following CSS code forces each row to start anew:
.large-up-3 > .column:nth-of-type(3n+1), .large-up-3 > .columns:nth-of-type(3n+1) {
  clear: both;
}
Note: I'm using a three-column row, so your code will be different, depending on how many columns you have per row.
So if you change clear:both to clear:none, you can then hide columns using display:none, and the columns will automatically fill any gaps.
Why does ZURB do this?
Well, if you change this CSS code, you'll find that your columns don't line up vertically anymore, due to varying column heights.
So this CSS code forces everything to look nice.
If you remove the clear:both code and then run into alignment problems, you might look into ZURB's "equalizer" code to set all your columns to an equal height.
https://get.foundation/sites/docs/equalizer.html
There are various ways to prettify things further if equal column heights end up looking wrong in your theme, such as using CSS overflow options, but that's getting outside the scope of this question.
I hope this helps someone!

Adding a space between paragraphs when extracting text with BeautifulSoup

I need to extract useful text from news articles. I do it with BeautifulSoup, but the output runs some paragraphs together, which prevents me from analysing the text further.
My code:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.bbc.co.uk/news/uk-england-39607452")
soup = BeautifulSoup(r.content, "lxml")
# delete unwanted tags:
for s in soup(['figure', 'script', 'style']):
    s.decompose()
article_soup = [s.get_text() for s in soup.find_all(
    'div', {'class': 'story-body__inner'})]
article = ''.join(article_soup)
print(article)
The output looks like this (just first 5 sentences):
The family of British student Hannah Bladon, who was stabbed to death in Jerusalem, have said they are "devastated" by the "senseless
and tragic attack".Ms Bladon, 20, was attacked on a tram in Jerusalem
on Good Friday.She was studying at the Hebrew University of Jerusalem
at the time of her death and had been taking part in an archaeological
dig that morning.Ms Bladon was stabbed several times in the chest and
died in hospital. She was attacked by a man who pulled a knife from
his bag and repeatedly stabbed her on the tram travelling near Old
City, which was busy as Christians marked Good Friday and Jews
celebrated Passover.
I tried adding a space after certain punctuation marks such as ".", "?", and "!".
article = article.replace(".", ". ")
It works with paragraphs (although I believe there should be a smarter way of doing this) but not with the subtitles of the article's sections, which don't have any punctuation at the end. They are structured like this:
</p>
<h2 class="story-body__crosshead">
Subtitle text
</h2>
<p>
I will be grateful for your advice.
PS: adding a space when I 'join' the article_soup doesn't help.
You can pass a separator to get_text. It fetches all the strings in the current element and joins them with the given separator; strip=True also trims the whitespace around each string.
article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': 'story-body__inner'})]
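To see the effect of the separator without fetching the live page, here is a small sketch that runs on a made-up, shortened stand-in for the article markup. It uses Python's built-in "html.parser" backend instead of lxml so that BeautifulSoup is the only dependency; the sample HTML is an assumption, not the real BBC page.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the article markup.
html = """
<div class="story-body__inner">
<p>First paragraph.</p>
<h2 class="story-body__crosshead">
Subtitle text
</h2>
<p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
body = soup.find("div", {"class": "story-body__inner"})

# Each text node is stripped, empty ones are skipped, and the rest are joined
# with the separator, so headings and paragraphs no longer run together.
spaced = body.get_text(separator=" ", strip=True)
print(spaced)
```

Use "\n" as the separator instead of " " if the later analysis needs to know where one paragraph ends and the next begins.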

Regular Expressions | Delete words on multiple lines before a given word

I scraped several articles from a website and now I am trying to make the corpus more readable by deleting the first part from the text scraped.
The part to delete lies between the <p>Advertisement tag and the final </time> tag before the article starts. As you can see, the regular expression has to delete several words across multiple lines. I tried the DOTALL flag, but without success.
This is my first attempt:
import re
text='''
<p>Advertisement</p>, <p class="byline-dateline"><span class="byline"itemprop="author creator" itemscope="" itemtype="http://schema.org/Person">By <span class="byline-author"
data-byline-name="MILAN SCHREUER" itemprop="name">MILAN SCHREUER</span> and </span><span class="byline"
itemid="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html"
itemprop="author creator" itemscope="" itemtype="http://schema.org/Person"><a href="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html"
title="More Articles by ALISSA J. RUBIN"><span class="byline-author" data-byline-name="ALISSA J. RUBIN" data-twitter-handle="Alissanyt" itemprop="name">ALISSA J. RUBIN</span></a></span><time class="dateline" content="2016-10-06T01:02:19-04:00"
datetime="2016-10-06T01:02:19-04:00" itemprop="dateModified">OCT. 5, 2016</time>
</p>, <p class="story-body-text story-content" data-para-count="163" data-total-count="163">BRUSSELS — A man wounded two police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.”</p>, <p class="story-body-text story-content"
data-para-count="231" data-total-count="394">The two officers were attacked on the Boulevard Lambermont in the Schaerbeek district, just north of the city center. A third police officer, who came to their aid, was also injured. None of the three had life-threatening injuries.</p>
'''
my_pattern=("(.*)</time>")
results= re.sub(my_pattern," ", text)
print(results)
Try this:
my_pattern = r"[\s\S]+</time>"
If you also want to delete the following </p> tag, the comma, and the space, you can use this:
my_pattern = r"[\s\S]+</time>[\s\S]</p>,\s"
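For comparison, the DOTALL route the question mentions also works once the flag is actually passed to the regex functions. The sketch below runs on a shortened, made-up version of the scraped text and anchors the match on the "<p>Advertisement" marker; treat it as one possible variant, not the only correct pattern.

```python
import re

# Shortened, hypothetical stand-in for the scraped markup.
text = '''
<p>Advertisement</p>, <p class="byline-dateline"><span class="byline">By
<span class="byline-author">MILAN SCHREUER</span></span><time class="dateline"
datetime="2016-10-06T01:02:19-04:00">OCT. 5, 2016</time>
</p>, <p class="story-body-text">The two officers were attacked on the Boulevard Lambermont.</p>
'''

# re.DOTALL lets "." match newlines, so ".*" can span the multi-line byline.
# The greedy ".*" runs up to the LAST </time>, which is what the question asks for.
pattern = re.compile(r"<p>Advertisement.*</time>\s*</p>,\s*", flags=re.DOTALL)

cleaned = pattern.sub("", text).strip()
print(cleaned)
```

Without the flag, "." stops at every line break, which is why "(.*)</time>" only removed the part of the last line before </time>.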

Append values in regular expressions

I'm using XPath and regular expressions to obtain data from a web page.
I'm using the following XPath to get the portion I'm interested in.
response.xpath('//*[@id="business-detail"]/div/p').extract()
EDIT:
Which provides the following:
[u'<p><span class="business-phone" itemprop="telephone">(415) 287-4225</span><span class="business-address" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress"><span itemprop="streetAddress">2180 Bryant St. STE 203, </span><span itemprop="addressLocality">San Francisco</span>,\xa0<span itemprop="addressRegion">CA</span>\xa0<span itemprop="postalCode">94110</span></span><span class="business-link">www.klopfarchitecture.com</span> <br><br></p>']
I'm interested in
<span itemprop="streetAddress">2180 Bryant St. STE 203, </span>
<span itemprop="addressLocality">San Francisco</span>
<span itemprop="addressRegion">CA</span>
<span itemprop="postalCode">94110</span>
So I'm using this regex to extract the data
reg = r'"streetAddress">[0-9]+[^<]*'
reg = r'"addressLocality"[^<]*'
reg = r'"addressRegion"[^<]*'
reg = r'"postalCode"[^<]*'
The problem is that there are four of them, so I get four variables. I need to join the data so that the full address is in one variable that I can assign to an Item. What would be an efficient way to accomplish this?
EDIT2:
You're right Roshan Jossey, I can use response.xpath('//*[@itemprop="streetAddress"]').extract()
But there are still four labels: streetAddress, addressLocality, addressRegion and postalCode. How do I merge the results?
I'm looking for this result:
2180 Bryant St. STE 203, San Francisco, CA 94110
And I'm getting this format for each of the four parts
<span itemprop="streetAddress">2180 Bryant St. STE 203, </span>
I'd recommend using just XPath to solve this problem.
response.xpath('//*[@id="business-detail"]/div/p//span[@itemprop="streetAddress"]/text()').extract()[0]
will give you the street address. You can extract all the other elements in a similar fashion. Then it's just a matter of concatenating them.
Regular expressions look like overkill when such a simple XPath solution exists.
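The concatenation step can be sketched as follows. To keep the example runnable without Scrapy, it uses the standard library's xml.etree.ElementTree, which supports the same [@itemprop='...'] predicate, on a simplified, well-formed copy of the markup from the question. The fragment and the helper name field are assumptions for illustration; in a spider, the strings would come from the response.xpath(...) calls instead.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the <p> element shown in the question.
fragment = (
    '<p><span class="business-address">'
    '<span itemprop="streetAddress">2180 Bryant St. STE 203, </span>'
    '<span itemprop="addressLocality">San Francisco</span>, '
    '<span itemprop="addressRegion">CA</span> '
    '<span itemprop="postalCode">94110</span>'
    '</span></p>'
)
root = ET.fromstring(fragment)


def field(name):
    """Return the text of the span with the given itemprop, without stray commas."""
    node = root.find(f".//span[@itemprop='{name}']")
    if node is None or not node.text:
        return ""
    return node.text.strip().rstrip(",")


street = field("streetAddress")
city = field("addressLocality")
region = field("addressRegion")
postal = field("postalCode")

# Build the single address string the question asks for.
address = f"{street}, {city}, {region} {postal}"
print(address)
```

Stripping the trailing comma from the street part before joining avoids a doubled ", ," in the result.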

Regular Expression for Date format: D, dd M yy "Tue, 25 Oct 2011"

I am looking for a regular expression to validate a date selected from a datepicker with the format D, dd M yy
$('#expirydate').datepicker({
constrainInput: true,
minDate: 0,
dateFormat: 'D, dd M yy'
});
And the equivalent format in View is like this:
<div class="editor-label">
Expiry Date
</div>
<div class="editor-field">
@Html.TextBox("ExpiryDate", String.Format("{0:ddd, d MMM yyyy}", DateTime.Now), new { id = "expirydate" })
@Html.ValidationMessageFor(model => model.ExpiryDate)
</div>
I could not find one in the regex library. Can anyone help?
I'd appreciate any feedback. Thanks.
If it's coming from a datepicker, and you trust that it won't produce impossible dates, then to validate that format you could use:
(Mon|Tue|Wed|Thu|Fri|Sat|Sun), (0?[1-9]|[12][0-9]|3[01]) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}
Well, I wrote a Regex for all the months, started to validate the days, and realized that you have to validate 28-31 depending on the month and year anyway.
Given that, you should skip the regex and start looking at the MSDN documentation on parsing date and time strings.
If you really want to pre-check it with a regex, (Mon|Tue|Wed|Thu|Fri|Sat|Sun), ([0-3]\d) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{4}) is the basic setup.
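The "regex pre-check, then real date parsing" idea can be sketched like this. The question is about jQuery and ASP.NET, so this Python version is only an illustration of the technique; the function name is_valid_expiry is made up, and the month and weekday names assume an English locale.

```python
import re
from datetime import datetime

# Shape check only: weekday, day (with or without leading zero), month, 4-digit year.
DATE_RE = re.compile(
    r"(Mon|Tue|Wed|Thu|Fri|Sat|Sun), (0?[1-9]|[12]\d|3[01]) "
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}"
)


def is_valid_expiry(text):
    """Return True if text looks like 'Tue, 25 Oct 2011' and is a real date."""
    if not DATE_RE.fullmatch(text):
        return False
    try:
        # The parser rejects impossible dates such as 31 Feb.
        parsed = datetime.strptime(text, "%a, %d %b %Y")
    except ValueError:
        return False
    # strptime does not cross-check the weekday name, so compare it explicitly.
    return parsed.strftime("%a") == text[:3]


print(is_valid_expiry("Tue, 25 Oct 2011"))
print(is_valid_expiry("Tue, 31 Feb 2011"))
```

The regex alone cannot know that February has no 31st or that 25 Oct 2011 was a Tuesday, which is why the parsing step does the real validation.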