Condensing a string of time blocks; displaying date ranges - regex

A user selects a number of hours (as DateTime objects) when making a booking for a rental space.
List<DateTime> dateTimeList = getDateTimeList();
I convert that list to a presentable string like so:
List<String> hourList = List<String>();
for (DateTime dateTime in dateTimeList) {
String hour = getHour(dateTime, context); // getHour returns e.g. 14:00 or 2pm
String nextHour = getHour(dateTime.add(Duration(hours: 1)), context);
hourList.add(hour + " - " + nextHour);
}
hourList.sort();
return hourList.join(", ");
Eventually I have the following list:
10:00 - 11:00
11:00 - 12:00
15:00 - 16:00
16:00 - 17:00
20:00 - 21:00
Q: How can I condense it, so consecutive blocks are merged? Like so:
10:00 - 12:00
15:00 - 17:00
20:00 - 21:00
I've thought of regex replace and various for loops, but they get too complicated before I finish... and this is not very elegant:
string = string.replaceAll("- 01:00, 01:00", "");
string = string.replaceAll("- 02:00, 02:00", "");
string = string.replaceAll("- 03:00, 03:00", "");
etc

A RegExp doesn't understand the meaning of the text, just the structure, so you are usually better off parsing the text and handling it with code that understands what is going on.
In this particular case, your textual structure is actually so simple that a RegExp can handle it, because you are looking for - xy:zw, xy:zw for any digits x, y, z and w.
A RegExp matching that is:
var repeatedTimeRE = RegExp(r"- (\d\d:\d\d), \1 ");
You can then do string = string.replaceAll(repeatedTimeRE, ""); to join adjacent time intervals where the second starts at exactly the same time as the first ends.
If your format is not precisely as written, say one o'clock can be written both as "1:00" and "01:00", then textual matching becomes much harder.
If you can have overlapping intervals, then a RegExp also can't catch it, say:
01:00 - 02:00, 01:59 - 02:35.
A semantic merge function could recognize the overlap and merge anyway, textual matching only really works for strictly equal texts.
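As a rough illustration of such a semantic merge, here is a minimal sketch in Python (not Dart; the same logic should port directly). It assumes you still have the selected start times as datetime values and that every block lasts one hour. The names merge_hour_blocks and format_ranges are made up for this example.

```python
from datetime import datetime, timedelta

def merge_hour_blocks(starts, block=timedelta(hours=1)):
    """Merge blocks that touch or overlap into (start, end) ranges."""
    ranges = []
    for start in sorted(starts):
        end = start + block
        if ranges and start <= ranges[-1][1]:
            # Block begins before or exactly when the previous range ends: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges

def format_ranges(ranges):
    # Format only at the very end, once the merging is done on real time values.
    return ", ".join(f"{s:%H:%M} - {e:%H:%M}" for s, e in ranges)

day = datetime(2021, 1, 1)  # arbitrary example date
selected = [day.replace(hour=h) for h in (15, 10, 20, 11, 16)]
print(format_ranges(merge_hour_blocks(selected)))
# prints: 10:00 - 12:00, 15:00 - 17:00, 20:00 - 21:00
```

Because the comparison is done on time values rather than text, this also copes with overlapping blocks and with any display format getHour might produce.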

Related

For Loop and If Statement not performing as expected

Here's the code:
# Scrape table data
alltable = driver.find_elements_by_id("song-table")
date = date.today()
simple_year_list = []
complex_year_list = []
dateformat1 = re.compile(r"\d\d\d\d")
dateformat2 = re.compile(r"\d\d\d\d-\d\d-\d\d")
for term in alltable:
    simple_year = dateformat1.findall(term.text)
    for year in simple_year:
        if 1800 < int(year) < date.year: # Year can't be above what the current year is or below 1800,
            simple_year_list.append(simple_year) # Might have to be changed if you have a song from before 1800
        else:
            continue
    complex_year = dateformat2.findall(term.text)
    complex_year_list.append(complex_year)
The code uses regular expressions to find four consecutive digits. Since there are multiple 4 digit numbers, I want to narrow it down to between 1800 and 2021 since that's a reasonable time frame. simple_year_list, however, prints out numbers that don't follow the conditions.
You aren't saving the right value here:
simple_year_list.append(simple_year)
You should be saving the year:
simple_year_list.append(year)
I would need more information to help further though. Maybe give us a sample of the data you're working through, and the output you're seeing?
You can do it all in regex.
Add start ^ and end $ anchors, and range restriction via pattern:
dateformat1 = re.compile(r"^(1[89]\d\d|20([01]\d|2[01]))$")
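One caveat with the anchored pattern: without re.MULTILINE, ^ and $ only match at the start and end of the whole string, so findall on a full table's text would find a year only if the text is nothing but that year. The capturing groups also make findall return tuples. A possible variant, assuming the years sit inside longer text, uses word boundaries and non-capturing groups and leaves the upper bound to plain Python. plausible_years and the sample string are invented for this sketch.

```python
import re
from datetime import date

# \b word boundaries stop the pattern matching inside longer digit runs;
# (?:...) keeps findall returning whole matches rather than group tuples.
YEAR_RE = re.compile(r"\b(?:1[89]\d\d|20\d\d)\b")

def plausible_years(text, first=1800, last=None):
    """Return the four-digit years in text that lie within [first, last]."""
    last = date.today().year if last is None else last
    years = (int(y) for y in YEAR_RE.findall(text))
    return [y for y in years if first <= y <= last]

sample = "Recorded 1969, remastered 2009, catalogue no. 4711, ID 19690815"
print(plausible_years(sample, last=2021))  # prints: [1969, 2009]
```

Checking the upper bound in code means the pattern does not have to be edited every year.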

Matching diverse dates in Openrefine

I am trying to use the value.match command in OpenRefine 2.6 to split the information present in a column into (at least) 2 columns.
The data are, however, quite messed up.
I have sometimes full dates:
May 30, 1949
Sometimes full dates are combined with other dates and attributes:
May 30, 1949, published 1979
May 30, 1949 and 1951, published 1979
May 30, 1949, printed 1980
May 30, 1949, print executed 1988
May 30, 1949, prints executed 1988
published 1940
Sometimes you have timespan:
1905-05 OR 1905-1906
Sometimes only the year
1905
Sometimes year with attributes
August or September 1908
It doesn't seem to follow any specific schema or order.
I would like to extract (at least) a start and end year, in order to have two columns:
-----------------------
|start_date | end_date|
|1905 | 1906 |
-----------------------
without the rest of the attributes.
I can find the last date using
value.match(/.*(\d{4}).*?/)[0]
and the first one with
value.match(/.*^(\d{4}).*?/)[0]
but I got some trouble with the two formulas.
The latter cannot match anything in case of:
May 30, 1949 and 1951, published 1979
while in the case of:
Paris, winter 1911-12
The latter formula cannot match anything and the former formula matches 1911.
Does anyone know how I can resolve the problem?
I need a solution that takes the first date as start_date and the final date as end_date, or better (I don't know if it is possible) the earliest date as start_date and the latest date as end_date.
Moreover, I would be glad to have some clue about how to extract other information, such as
if published or printed or executed is present in the text -> copy date to a new column name “execution”.
should be something like create a new column
if(value.match("string1|string2|string3" + (\d{4}), "perform the operation", do nothing)
value.match() is a very useful but sometimes tricky function. To extract a pattern from a text, I prefer to use Python/Jython's regular expressions :
import re
pattern = re.compile(r"\d{4}")
return pattern.findall(value)
From there, you can create a string with all the years concatenated:
return ",".join(pattern.findall(value))
Or select only the first:
return pattern.findall(value)[0]
Or the last:
return pattern.findall(value)[-1]
etc.
Same thing for your sub-question:
import re
pattern = re.compile(r"(published|printed|executed)\s+(\d+)")
return pattern.findall(value)[0][1]
Or :
import re
pattern = re.compile(r"(published|printed|executed)\s+(\d+)")
m = re.search(pattern, value)
return m.group(2)
Example:
Here is a regex which will extract start_date and end_date in named groups.
If there is only one date, it is treated as the start_date:
((?<start_date>\d{4}).*?)?(?<end_date>\d{4}|(?<=-)\d{2})?$
Demo
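For the "earliest date as start_date, latest date as end_date" variant, the findall approach can be extended along these lines. This is only a sketch: year_range is a made-up helper, and it assumes a two-digit number after a hyphen (as in 1911-12) is an abbreviated year in the same century. Values such as 1905-05, where the two digits could equally be a month, remain ambiguous.

```python
import re

# A four-digit year, optionally followed by a hyphen and a 2- or 4-digit end year.
YEAR_RE = re.compile(r"\b(\d{4})\b(?:-(\d{4}|\d{2})\b)?")

def year_range(text):
    """Return (earliest, latest) year found in text, or None if there is none."""
    years = []
    for first, second in YEAR_RE.findall(text):
        years.append(int(first))
        if len(second) == 2:
            # "1911-12" -> 1912: borrow the century from the first year.
            expanded = int(first[:2] + second)
            # Ignore results that would go backwards (probably a month, not a year).
            if expanded >= int(first):
                years.append(expanded)
        elif second:
            years.append(int(second))
    return (min(years), max(years)) if years else None

print(year_range("May 30, 1949 and 1951, published 1979"))  # prints: (1949, 1979)
print(year_range("Paris, winter 1911-12"))                  # prints: (1911, 1912)
```

In an OpenRefine Jython expression you would presumably define the function and then end with something like return year_range(value)[0] for one column and [1] for the other, after handling the None case.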

Regular expression and csv | Output more readable

I have a text which contains different news articles about terrorist attacks. Each article starts with an html tag (<p>Advertisement) and I would like to extract from each article a specific information: the number of people wounded in the terrorist attacks.
This is a sample of the text file and how the articles are separated:
[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded 2 police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.” , The two officers were attacked on the Boulevard Lambermont.....]
[<p>Advertisement ,, By KAREEM FAHIM and MOHAMAD FAHIM ABED JUNE 30, 2016
, At least 33 people were killed and 25 were injured when the Taliban bombed buses carrying police cadets on the outskirts of Kabul, Afghanistan, on Thursday. , KABUL, Afghanistan — Taliban insurgents bombed a convoy of buses carrying police cadets on the outskirts of Kabul, the Afghan capital, on Thursday, killing at least 33 people, including four civilians, according to government officials and the United Nations. , During a year...]
This is my code so far:
import re
text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
splitted = text_read.split("<p>")
pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"
for article in splitted:
    result = re.findall(pattern, article)
The output that I get is:
[]
[]
[]
[('', '40', '')]
[('', '150', '')]
[('94', '', '')]
And I would like to make the output more readable and then save it as csv file:
article_1,0
article_2,0
article_3,40
article_3,150
article_3,94
Any suggestion in how to make it more readable?
I rewrote your loop like this and merged with csv write since you requested it:
import csv
with open("wounded.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    for i, article in enumerate(splitted):
        result = re.findall(pattern, article)
        nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
        row = ["article_{}".format(i+1), nb_casualties]
        writer.writerow(row)
get the index of each article using enumerate
sum the number of victims (in case more than one group matches) by converting each non-empty group to an integer in a generator expression passed to sum; the ternary expression yields 0 when nothing matched
create the row
write it as one row per iteration with the csv.writer object.
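Note that the loop above only looks at the first match in each article. If you would rather have one row per figure found, as in the desired output from the question, something like the following might do. It is a self-contained sketch: wounded_counts, write_rows and the three sample articles are invented, and it writes to an in-memory buffer instead of a file so it can be run as-is.

```python
import csv
import io
import re

# One capture group per phrasing; a non-matching alternative yields "".
PATTERN = re.compile(r"wounded (\d+)|(\d+) were wounded|(\d+) were injured")

def wounded_counts(article):
    """Return every wounded/injured figure in the article as a list of ints."""
    return [int(n) for groups in PATTERN.findall(article) for n in groups if n]

def write_rows(articles, out):
    writer = csv.writer(out)
    for i, article in enumerate(articles, start=1):
        # Emit a single 0 row when nothing matched.
        for count in wounded_counts(article) or [0]:
            writer.writerow(["article_{}".format(i), count])

articles = [
    "A man wounded 2 police officers with a knife.",
    "No figures are mentioned here.",
    "At least 33 people were killed and 25 were injured.",
]
buffer = io.StringIO()
write_rows(articles, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with newline="" instead of the StringIO buffer.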

Python sorting timestamp

I am struggling with something that should be relatively straightforward, but I am getting nowhere.
I have a bunch of data that has a timestamp in the format of hh:mm:ss. The data ranges from 00:00:00 all 24 hours of the day through 23:59:59.
I do not know how to go about pulling out the hh part of the data, so that I can just look at data between specific hours of the day.
I read the data in from a CSV file using:
with open(filename) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        time = row['Time']
This gives me time in the hh:mm:ss format, but now I do not know how to do what I want, which is look at the data from 6am until 6pm. 06:00:00 to 18:00:00.
With the times in 24 hour format, this is actually very simple:
'06:00:00' <= row['Time'] <= '18:00:00'
Assuming that you only have valid timestamps, this is true for all times between 6 AM and 6 PM inclusive.
If you want to get a list of all rows that meet this, you can put this into a list comprehension:
relevant_rows = [row for row in reader if '06:00:00' <= row['Time'] <= '18:00:00']
Update:
For handling times with no leading zero (0:00:00, 3:00:00, 15:00:00, etc), use split to get just the part before the first colon:
> row_time = '0:00:00'
> row_time.split(':')
['0', '00', '00']
> int(row_time.split(':')[0])
0
You can then check if the value is at least 6 and less than 18. If you want to include entries that are at 6 PM, then you have to check the minutes and seconds to make sure it is not after 6 PM.
However, you don't even really need to try anything like regex or even a simple split. You have two cases to deal with - either the hour is one digit, or it is two digits. If it is one digit, it needs to be at least six. If it is two digits, it needs to be less than 18. In code:
if row_time[1] == ':': # 1-digit hour
    if row_time > '6': # 6 AM or later
        pass # This is an entry you want
else:
    if row_time < '18:00:00': # Use <= if you want 6 PM to be included
        pass # This is an entry you want
or, compacted to a single line:
if (row_time[1] == ':' and row_time > '6') or (row_time[1] != ':' and row_time < '18:00:00'):
    pass # The 2-digit check is required: '0:00:00' < '18:00:00' is True as a string comparison
as a list comprehension:
relevant_rows = [row for row in reader if (row['Time'][1] == ':' and row['Time'] > '6') or (row['Time'][1] != ':' and row['Time'] < '18:00:00')]
You can use Python's slicing syntax to pull characters from the string.
For example:
time = '06:05:22'
timestamp_hour = time[0:2] # characters at index 0 and 1 (the end index 2 is excluded)
print(timestamp_hour)
>>> 06
This produces the first two digits: 06. Then you can call the int() function to convert them to an integer:
hour = int(timestamp_hour)
print(hour)
>>> 6
Now you have an integer variable that can be checked to see if it is between, say, 6 and 18.
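If you prefer not to depend on string ordering or slicing at all, another option is to parse each value into a datetime.time and compare those. A minimal sketch, assuming the column always holds h:mm:ss or hh:mm:ss; parse_time, in_window and the sample rows are made up for illustration:

```python
from datetime import datetime, time

def parse_time(text):
    # %H accepts hours with or without a leading zero ("6:05:22" or "06:05:22").
    return datetime.strptime(text.strip(), "%H:%M:%S").time()

def in_window(text, start=time(6, 0), end=time(18, 0)):
    """True if the timestamp falls within [start, end], both ends inclusive."""
    return start <= parse_time(text) <= end

rows = [{"Time": "0:00:00"}, {"Time": "6:00:00"}, {"Time": "17:59:59"},
        {"Time": "18:00:00"}, {"Time": "18:00:01"}]
kept = [row["Time"] for row in rows if in_window(row["Time"])]
print(kept)  # prints: ['6:00:00', '17:59:59', '18:00:00']
```

strptime raises ValueError on malformed input, which also gives you a validity check.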

Check if string is of SortableDateTimePattern format

Is there any way I can easily check if a string conforms to the SortableDateTimePattern ("s"), or do I need to write a regular expression?
I've got a form where users can input a copyright date (as a string), and these are the allowed formats:
Year: YYYY (eg 1997)
Year and month: YYYY-MM (eg 1997-07)
Complete date: YYYY-MM-DD (eg 1997-07-16)
Complete date plus hours and minutes: YYYY-MM-DDThh:mmTZD (eg 1997-07-16T19:20+01:00)
Complete date plus hours, minutes and seconds: YYYY-MM-DDThh:mm:ssTZD (eg 1997-07-16T19:20:30+01:00)
Complete date plus hours, minutes, seconds and a decimal fraction of a second
YYYY-MM-DDThh:mm:ss.sTZD (eg 1997-07-16T19:20:30.45+01:00)
I don't have much experience of writing regular expressions so if there's an easier way of doing it I'd be very grateful!
Not thoroughly tested and hence not foolproof, but the following seems to work:
var regex:RegExp = /(?<=\s|^)\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d{2})?)?\+\d{2}:\d{2})?)?)?(?=\s|$)/g;
var test:String = "23 1997 1998-07 1995-07s 1937-04-16 " +
"1970-0716 1993-07-16T19:20+01:01 1979-07-16T19:20+0100 " +
"2997-07-16T19:20:30+01:08 3997-07-16T19:20:30.45+01:00";
var result:Object
while(result = regex.exec(test))
trace(result[0]);
Traced output:
1997
1998-07
1937-04-16
1993-07-16T19:20+01:01
2997-07-16T19:20:30+01:08
3997-07-16T19:20:30.45+01:00
I am using ActionScript here, but the regex should work in most flavors. When implementing it in your language, note that the first and last / are delimiters and the last g stands for global.
I'd split the input field into many (one for year, month, day etc.).
You can use JavaScript to advance from one field to the next once full (i.e. once four characters are in the year box, move focus to month) for smoother entry.
You can then validate each field independently and finally construct the complete date string.
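If you do stay with a single input field, the list of allowed formats maps fairly directly onto one nested optional pattern. Below is a sketch in Python (not ActionScript or .NET) that checks the whole string rather than searching inside it. It assumes TZD means either Z or a signed hh:mm offset and that the fraction may have any number of digits; is_valid_format is a made-up name.

```python
import re

# Each optional group nests inside the previous one, mirroring the format list.
W3C_DATETIME_RE = re.compile(
    r"\d{4}"                                   # YYYY
    r"(?:-\d{2}"                               # -MM
    r"(?:-\d{2}"                               # -DD
    r"(?:T\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?"    # Thh:mm, optional :ss and .s
    r"(?:Z|[+-]\d{2}:\d{2})"                   # TZD (assumed: Z or +hh:mm / -hh:mm)
    r")?)?)?"
)

def is_valid_format(text):
    """Check the shape only; it does not reject e.g. month 13 or day 32."""
    return W3C_DATETIME_RE.fullmatch(text) is not None

for candidate in ("1997", "1997-07-16T19:20+01:00", "1970-0716", "1997-07-16T19:20"):
    print(candidate, is_valid_format(candidate))
```

A time without a zone designator is rejected here because every format in the list that includes a time also includes TZD.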