Is there a way to extract, in one call, all the matched subgroups of a string according to a regular expression?
I have a date like this:
Thu, 07 Apr 2022 15:03:32 GMT
And I created the following regexp to extract all the parts of this date:
let re =
  Str.regexp
    {|\([a-zA-Z]+\), \([0-9]+\) \([a-zA-Z]+\) \([0-9]+\) \([0-9]+\):\([0-9]+\):\([0-9]+\).*|}
And to extract each part I use it like this:
let parse_date date =
  let re =
    Str.regexp
      {|\([a-zA-Z]+\), \([0-9]+\) \([a-zA-Z]+\) \([0-9]+\) \([0-9]+\):\([0-9]+\):\([0-9]+\).*|}
  in
  let wday = Str.replace_first re {|\1|} date in
  let day = Str.replace_first re {|\2|} date in
  let mon = Str.replace_first re {|\3|} date in
  let year = Str.replace_first re {|\4|} date in
  let hour = Str.replace_first re {|\5|} date in
  let min = Str.replace_first re {|\6|} date in
  let sec = Str.replace_first re {|\7|} date in
  Format.eprintf "RE DATE: %s %s %s %s %s %s %s@." wday day mon year hour min
    sec
If the parts were stored in an array I could easily use it like this:
let parse_date date =
  let re =
    Str.regexp
      {|\([a-zA-Z]+\), \([0-9]+\) \([a-zA-Z]+\) \([0-9]+\) \([0-9]+\):\([0-9]+\):\([0-9]+\).*|}
  in
  let parts = Str.match_groups re date in (* this function doesn't exist *)
  let wday = parts.(1) in
  let day = parts.(2) in
  let mon = parts.(3) in
  let year = parts.(4) in
  let hour = parts.(5) in
  let min = parts.(6) in
  let sec = parts.(7) in
  Format.eprintf "RE DATE: %s %s %s %s %s %s %s@." wday day mon year hour min
    sec
but this doesn't appear to exist. Is there another way to do it or is my solution the only one available?
To avoid an XY problem: my real goal is to extract each part of a date, so if there's a solution other than Str, I'll be happy to use it.
You can use Str.matched_group to return a particular capture group's match:
let parse_date date =
  let re = Str.regexp
    {|\([a-zA-Z]+\), \([0-9]+\) \([a-zA-Z]+\) \([0-9]+\) \([0-9]+\):\([0-9]+\):\([0-9]+\).*|} in
  if Str.string_match re date 0 then
    let wday = Str.matched_group 1 date in
    let day = Str.matched_group 2 date in
    let mon = Str.matched_group 3 date in
    let year = Str.matched_group 4 date in
    let hour = Str.matched_group 5 date in
    let min = Str.matched_group 6 date in
    let sec = Str.matched_group 7 date in
    Format.sprintf "RE DATE: %s %s %s %s %s %s %s@." wday day mon year hour min sec
  else
    "RE DATE: Not matched"
let _ = parse_date "Thu, 07 Apr 2022 15:03:32 GMT" |> print_endline
The Str package is pretty primitive, though. I'd suggest using a different library for regular expressions, like PCRE-Ocaml. It does have a way to get an array of matched groups:
let parse_date2 date =
  let rex = Pcre.regexp
    {|([a-zA-Z]+), ([0-9]+) ([a-zA-Z]+) ([0-9]+) ([0-9]+):([0-9]+):([0-9]+).*|} in
  try
    let parts = Pcre.exec ~rex date |> Pcre.get_substrings in
    let wday = parts.(1) in
    let day = parts.(2) in
    let mon = parts.(3) in
    let year = parts.(4) in
    let hour = parts.(5) in
    let min = parts.(6) in
    let sec = parts.(7) in
    Format.sprintf "RE DATE: %s %s %s %s %s %s %s@." wday day mon year hour min sec
  with Not_found -> "RE DATE: Not matched"
let _ = parse_date2 "Thu, 07 Apr 2022 15:03:32 GMT" |> print_endline
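For comparison, here is a minimal sketch of the same one-call extraction in Python rather than OCaml (a swapped-in language, not part of the answer above). It assumes the standard re module, whose match objects return every captured group at once through groups(). The names DATE_RE and parse_date are chosen for illustration only.

```python
import re

# Same pattern as the PCRE version above, written for Python's re module.
DATE_RE = re.compile(
    r"([a-zA-Z]+), ([0-9]+) ([a-zA-Z]+) ([0-9]+) ([0-9]+):([0-9]+):([0-9]+).*"
)


def parse_date(date):
    """Return the seven captured parts as a tuple, or None when nothing matches."""
    m = DATE_RE.match(date)
    # groups() hands back all capture groups in one call, starting at group 1.
    return m.groups() if m else None


print(parse_date("Thu, 07 Apr 2022 15:03:32 GMT"))
```

Unlike the array returned by Pcre.get_substrings, the tuple here does not include the whole match at index 0, so the weekday sits at index 0 rather than 1.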
For a simple format with a fixed number of fields and separators, Scanf might be enough:
let date s = Scanf.sscanf s "%s@, %02d %s %d %d:%d:%d %s"
  (fun day_name day month year h m s timezone ->
    day_name,day,month,year,h,m,s,timezone
  )
let x = date "Thu, 07 Apr 2022 15:03:32 GMT"
I wrote the following regex to match date strings looking like:
2019/01/02 08:20:19
the regex is val reg = "([\\d]{4})/([\\d]{2})/([\\d]{2}) ([\\d]{2}).*.r"
The Scala function is:
val dateExtraction: String => Map[String, String] = {
  string: String => {
    string match {
      case reg(year, month, day, hour) =>
        Map(YEAR -> year, MONTH -> month, DAY -> day, HOUR -> hour)
      case _ => Map(YEAR -> "", MONTH -> "", DAY -> "", HOUR -> "")
    }
  }
}
val YEAR = "YEAR"
val MONTH = "MONTH"
val DAY = "DAY"
val HOUR = "HOUR"
I want to get the year, month, day and hour from the regex.
But the date above is not parsed as expected and I get a null result. Any idea how to fix this?
I would use java.time for such a problem, like:
import java.time.{LocalDateTime, ZoneId}
import java.time.format.DateTimeFormatter
val input = "2019/01/02 08:20:19"
val formatter = DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss")
val dt = LocalDateTime.from(formatter.parse(input)).atZone(ZoneId.systemDefault())
dt.getYear() // 2019
dt.getMonthValue() // 1
dt.getDayOfMonth() // 2
dt.getHour() // 8
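The same idea, letting a date library do the parsing instead of a hand-written regex, can be sketched in Python (swapped in here for Scala). This assumes the standard datetime module. The function name extract_parts and the dictionary keys are illustrative, chosen to mirror the Map in the question.

```python
from datetime import datetime


def extract_parts(text):
    """Parse a 'yyyy/MM/dd HH:mm:ss' string and return the requested fields."""
    # strptime raises ValueError if the text does not follow the format.
    dt = datetime.strptime(text, "%Y/%m/%d %H:%M:%S")
    return {"YEAR": dt.year, "MONTH": dt.month, "DAY": dt.day, "HOUR": dt.hour}


print(extract_parts("2019/01/02 08:20:19"))
```

The fields come back as integers, so malformed input fails loudly at parse time instead of silently producing empty strings.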
I wrote the following code:
val reg = "([\\d]{4})-([\\d]{2})-([\\d]{2})(T)([\\d]{2}):([\\d]{2})".r
val dataExtraction: String => Map[String, String] = {
  string: String => {
    string match {
      case reg(year, month, day, symbol, hour, minutes) =>
        Map(YEAR -> year, MONTH -> month, DAY -> day, HOUR -> hour)
      case _ => Map(YEAR -> "", MONTH -> "", DAY -> "", HOUR -> "")
    }
  }
}
val YEAR = "YEAR"
val MONTH = "MONTH"
val DAY = "DAY"
val HOUR = "HOUR"
This function is supposed to be applied to strings having the following format: 2018-08-22T19:10:53.094Z
When I call the function:
dataExtraction("2018-08-22T19:10:53.094Z")
Your pattern, for all its deficiencies, does work. You just have to unanchor it.
val reg = "([\\d]{4})-([\\d]{2})-([\\d]{2})(T)([\\d]{2}):([\\d]{2})".r.unanchored
. . .
dataExtraction("2018-08-22T19:10:53.094Z")
//res0: Map[String,String] = Map(YEAR -> 2018, MONTH -> 08, DAY -> 22, HOUR -> 19)
But the comment from @CAustin is correct: you could just let the Java LocalDateTime API handle all the heavy lifting.
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter._
val dt = LocalDateTime.parse("2018-08-22T19:10:53.094Z", ISO_DATE_TIME)
Now you have access to all the data without actually saving it to a Map.
dt.getYear //res0: Int = 2018
dt.getMonthValue //res1: Int = 8
dt.getDayOfMonth //res2: Int = 22
dt.getHour //res3: Int = 19
dt.getMinute //res4: Int = 10
dt.getSecond //res5: Int = 53
Your pattern matches only strings that look exactly like yyyy-mm-ddThh:mm, while the one you are testing against has milliseconds and a Z at the end.
You can append .* at the end of your pattern to cover strings that have additional characters at the end.
In addition, let me show you a more idiomatic way of writing your code:
// Create a type for the data instead of using a map.
case class Timestamp(year: Int, month: Int, day: Int, hour: Int, minutes: Int)
// Use triple quotes to avoid extra escaping.
// Don't capture parts that you will not use.
// Add .* at the end to account for milliseconds and timezone.
val reg = """(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}).*""".r
// Instead of empty strings, use Option to represent a value that can be missing.
// Convert to Int after parsing.
def dataExtraction(str: String): Option[Timestamp] = str match {
  case reg(y, m, d, h, min) => Some(Timestamp(y.toInt, m.toInt, d.toInt, h.toInt, min.toInt))
  case _ => None
}
// It works!
dataExtraction("2018-08-22T19:10:53.094Z") // => Some(Timestamp(2018,8,22,19,10))
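The difference between a whole-string match and an unanchored one, which both answers above rely on, can be seen in a short Python sketch (swapped in for Scala). It assumes Python's re module, where fullmatch plays roughly the role of a Scala Regex used in a pattern match and search plays the role of .unanchored. The variable names are illustrative.

```python
import re

PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2})")
TEXT = "2018-08-22T19:10:53.094Z"

# fullmatch requires the pattern to cover the entire string, so the
# trailing ":53.094Z" makes it fail and return None.
anchored = PATTERN.fullmatch(TEXT)

# search accepts a match anywhere in the string, so the trailing
# seconds, milliseconds and "Z" are simply ignored.
unanchored = PATTERN.search(TEXT)

print(anchored)
print(unanchored.groups())
```

Appending .* to the pattern, as suggested above, makes the whole-string match succeed as well.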
I have two variables holding times taken from a datePicker, and I need to store the difference between them in a variable.
let timeFormatter = DateFormatter()
timeFormatter.dateFormat = "HHmm"
time2 = timeFormatter.date(from: timeFormatter.string(from: datePicker.date))!
I have tried getting timeIntervalSince1970 from both of them and then subtracting them to get the difference in milliseconds, which I would turn back into hours and minutes, but I get a very big number which doesn't correspond to the actual time.
let dateTest = time2.timeIntervalSince1970 - time1.timeIntervalSince1970
Then I tried using time2.timeIntervalSince(date: time1), but again the resulting milliseconds are far more than the actual time.
How can I get the correct time difference between two times and have the result as hours and minutes in the format "0823" for 8 hours and 23 minutes?
The recommended way to do any date math is Calendar and DateComponents:
let difference = Calendar.current.dateComponents([.hour, .minute], from: time1, to: time2)
let formattedString = String(format: "%02ld%02ld", difference.hour!, difference.minute!)
print(formattedString)
The format %02ld adds the padding zero.
If you need a standard format with a colon between hours and minutes, DateComponentsFormatter() could be a more convenient way:
let formatter = DateComponentsFormatter()
formatter.allowedUnits = [.hour, .minute]
print(formatter.string(from: time1, to: time2)!)
TimeInterval measures seconds, not milliseconds:
let date1 = Date()
let date2 = Date(timeIntervalSinceNow: 12600) // 3:30
let diff = Int(date2.timeIntervalSince1970 - date1.timeIntervalSince1970)
let hours = diff / 3600
let minutes = (diff - hours * 3600) / 60
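The arithmetic above, together with the zero-padded "0823" format asked for in the question, can be sketched in Python (swapped in for Swift). The helper name format_hhmm is hypothetical, and the sketch assumes a non-negative difference given in seconds.

```python
def format_hhmm(seconds):
    """Format a non-negative duration in seconds as zero-padded hours and minutes."""
    # divmod splits the total into whole hours and the leftover seconds.
    hours, remainder = divmod(int(seconds), 3600)
    minutes = remainder // 60
    return f"{hours:02d}{minutes:02d}"


print(format_hhmm(12600))               # 3 hours 30 minutes
print(format_hhmm(8 * 3600 + 23 * 60))  # 8 hours 23 minutes
```

Treating the value as milliseconds instead of seconds would inflate the result a thousandfold, which matches the "very big number" described in the question.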
To get the duration in seconds between two timestamps, this can be used:
let time1 = Date(timeIntervalSince1970: startTime)
let time2 = Date(timeIntervalSince1970: endTime)
let difference = Calendar.current.dateComponents([.second], from: time1, to: time2)
let duration = difference.second
In Swift 5, you can do it this way:
func getDateDiff(start: Date, end: Date) -> Int {
    let calendar = Calendar.current
    let dateComponents = calendar.dateComponents([Calendar.Component.second], from: start, to: end)
    let seconds = dateComponents.second
    return Int(seconds!)
}
The date format that I'm passing my DateFormatter is not working.
Why is the year appearing as the first component of the date when I specify month in the dateFormat? Why is my am/pm marker in the dateFormat being ignored? Finally, why or how do I correct for time zone? I didn't specify, yet a different time zone has been used.
Thanks!
if let publishDateString = post["publishDate"] as? String {
    print("publishDateString is \(publishDateString)") // 2/17/2016 2:49:00 PM
    let myDateFormatter = DateFormatter()
    myDateFormatter.dateFormat = "M/d/yyyy h:mm:ss a"
    let dateFromString = myDateFormatter.date(from: publishDateString)!
    print("My date from string is \(dateFromString)") // 2016-02-17 19:49:00 +0000
}
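The printed value is consistent with the formatter reading the string in the device's local time zone and print then showing the resulting Date in UTC, using Date's own fixed description rather than the dateFormat. Here is a minimal sketch of that arithmetic in Python (swapped in for Swift). It assumes a local offset of UTC-5, which is inferred from the 2:49 PM to 19:49 shift in the output above and is not stated in the question.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the device's zone was UTC-5 (inferred from the output, not stated).
LOCAL = timezone(timedelta(hours=-5))

# Parse the wall-clock text, then mark it as being in the assumed local zone.
parsed = datetime.strptime("2/17/2016 2:49:00 PM", "%m/%d/%Y %I:%M:%S %p")
local_time = parsed.replace(tzinfo=LOCAL)

# The same instant expressed in UTC, which is how the Date was printed.
as_utc = local_time.astimezone(timezone.utc)
print(as_utc.strftime("%Y-%m-%d %H:%M:%S %z"))  # 2016-02-17 19:49:00 +0000
```

Under that assumption nothing was lost in parsing: both lines describe the same instant, and only the display differs.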
I am trying to get a proper structured output into a csv.
Input:
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d9064bc,1073260801,1073260803,819251,440006
00022dba8f51,1073260801,1073260803,819251,440006
00022de1c6c1,1073260801,1073260803,819251,440006
003065f30f37,1073260801,1073260803,819251,440006
00904b48a3b6,1073260801,1073260803,819251,440006
00904b83a0ea,1073260803,1073260810,819213,439954
00904b85d3cf,1073260803,1073261920,817526,439458
00904b14b494,1073260804,1073265410,817558,439525
00904b99499c,1073260804,1073262625,817558,439525
00904bb96e83,1073260804,1073265163,817558,439525
00904bf91b75,1073260804,1073263786,817558,439525
Code:
import pandas as pd
from datetime import datetime,time
import numpy as np
fn = r'00_Dart.csv'
cols = ['UserID','StartTime','StopTime', 'gps1', 'gps2']
df = pd.read_csv(fn, header=None, names=cols)
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)
# building reporting DF: `r`
freq = '1H' # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.datetime(1970,1,1)).total_seconds().astype(np.int64)
# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0
for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # i've slightly simplified the calculations of m and d
    # by getting rid of division by 2,
    # because it can be done eliminating common terms
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    r.ix[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]
r['Day'] = pd.to_datetime(r.start, unit='s').dt.weekday_name.str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time
#df.to_csv((r[r.LogCount > 0])'example.csv')
#print(r[r.LogCount > 0]) -- This gives the correct count and unique count but I want to write the output in a structure.
print (r['StartTime'], ['EndTime'], ['Day'], ['LogCount'], ['UniqueIDCount'])
Output: This is the output that I am getting which is not what I am looking for.
(2004-01-05 00:00:00 00:00:00
2004-01-05 01:00:00 01:00:00
2004-01-05 02:00:00 02:00:00
2004-01-05 03:00:00 03:00:00
2004-01-05 04:00:00 04:00:00
2004-01-05 05:00:00 05:00:00
2004-01-05 06:00:00 06:00:00
2004-01-05 07:00:00 07:00:00
2004-01-05 08:00:00 08:00:00
2004-01-05 09:00:00 09:00:00
And the Expected output headers are
StartTime, EndTime, Day, Count, UniqueIDCount
How do I structure the write statement in my code so that the output CSV has the columns mentioned above?
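Try this. Indexing r with a list of column names selects just those columns, in that order, and the result can be written straight to CSV:
rout = r[['StartTime', 'EndTime', 'Day', 'LogCount', 'UniqueIDCount']]
print(rout)
rout.to_csv('results.csv', index=False)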
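The overlap test used inside the question's loop can be checked in isolation. Here is a minimal sketch in plain Python, assuming the same definitions as the question (m = StopTime + StartTime, d = StopTime - StartTime, and an hourly bucket of length interval). The function names overlaps and overlaps_direct are hypothetical.

```python
INTERVAL = 60 * 60 - 1  # one hour minus one second, as in the question


def overlaps(start, stop, bucket_start, interval=INTERVAL):
    """Midpoint form of the test, with the division by 2 cancelled out."""
    m = stop + start
    d = stop - start
    return abs(m - 2 * bucket_start - interval) < d + interval


def overlaps_direct(start, stop, bucket_start, interval=INTERVAL):
    """Equivalent endpoint comparison, easier to read but the same condition."""
    return start < bucket_start + interval and bucket_start < stop


# First input row against the bucket starting at 1073260800 and the next one.
print(overlaps(1073260801, 1073260803, 1073260800))
print(overlaps(1073260801, 1073260803, 1073260800 + 3600))
```

Because the inequality is strict, a log entry that only touches a bucket boundary is not counted in that bucket.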