regular expression in R with nth occurrence - regex

I am reading a text file in R and want to replace every 3rd occurrence of '|' with '\n'. Here is my code and input data.
**Input Data**
======================
'Monday, November 2, 2015|10:21:27|17:58:12|Tuesday, November 3, 2015|10:13:09|18:52:44|Wednesday, November 4, 2015|10:11:52|18:40:36|Thursday, November 5, 2015|10:31:42|18:16:57|Friday, November 6, 2015|10:13:13|--|Saturday, November 7, 2015|--|--|Sunday, November 8, 2015|--|--|Monday, November 9, 2015|--|--|Tuesday, November 10, 2015|10:03:20|18:07:52|Wednesday, November 11, 2015|09:40:20|18:42:20|Thursday, November 12, 2015|10:38:56|18:37:20|Friday, November 13, 2015|10:45:26|18:09:54|Saturday, November 14, 2015|--|--|Sunday, November 15, 2015|--|--|Monday, November 16, 2015|--|--|Tuesday, November 17, 2015|10:11:43|18:36:15|Wednesday, November 18, 2015|--|--|Thursday, November 19, 2015|--|--|Friday, November 20, 2015|12:14:25|20:25:08|Saturday, November 21, 2015|--|--|Sunday, November 22, 2015|--|--|Monday, November 23, 2015|10:08:08|17:57:35|Tuesday, November 24, 2015|14:30:32|--|'
**My R-Code**
====================
emp <- readChar(FileDir, (file.info(FileDir)$size-172))
emp <- gsub("\r\n","|",emp)
empTMP <- gsub('([^|]*|[^|]*|[^|]*)|',"\1\n",emp)
**output**
====================
"\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n|\001\n"
**Required output**
====================
Monday, November 2, 2015|10:21:27|17:58:12
Tuesday, November 3, 2015|10:13:09|18:52:44
Wednesday, November 4, 2015|10:11:52|18:40:36
Thursday, November 5, 2015|10:31:42|18:16:57
Friday, November 6, 2015|10:13:13|--
Saturday, November 7, 2015|--|--
Sunday, November 8, 2015|--|--
Monday, November 9, 2015|--|--
Tuesday, November 10, 2015|10:03:20|18:07:52
Wednesday, November 11, 2015|09:40:20|18:42:20
Thursday, November 12, 2015|10:38:56|18:37:20
Friday, November 13, 2015|10:45:26|18:09:54
Saturday, November 14, 2015|--|--
Sunday, November 15, 2015|--|--
Monday, November 16, 2015|--|--
Tuesday, November 17, 2015|10:11:43|18:36:15
Wednesday, November 18, 2015|--|--
Thursday, November 19, 2015|--|--
Friday, November 20, 2015|12:14:25|20:25:08
Saturday, November 21, 2015|--|--
Sunday, November 22, 2015|--|--
Monday, November 23, 2015|10:08:08|17:57:35
Tuesday, November 24, 2015|14:30:32|--
Kindly help me find what I am doing wrong. I checked the above regular expression in a text editor and it works perfectly fine; however, in R it does not produce the correct result.

The following works:
#input <- #your input string
x <- strsplit(input, split = "|", fixed = TRUE)[[1L]]
idx <- seq(3L, length(x), by = 3L)
x[idx] <- paste0(x[idx], "\n")
x[-idx] <- paste0(x[-idx], "|")
paste(x, collapse = "")
Or in one command:
paste(paste0(x <- strsplit(input, split = "|", fixed = TRUE)[[1L]],
rep_len(c("|", "|", "\n"), length(x))), collapse = "")
And if you wanted to stick with gsub, this works as well:
gsub("([^|]*\\|[^|]*\\|[^|]*)\\|", "\\1\n", input)
Broken down (regex101 colored version):
gsub(paste0("(", #start capturing group 1
"[^|]*", #Matching anything but | 0 or more times
"\\|", #Match | (must escape because it's reserved for OR)
"[^|]*\\|", #again
"[^|]*", #again matching anything but |
")", #end captured group
"\\|"), #captured group is followed by a third |
"\\1\n",input) #replace match with captured group followed by \n
# (instead of |)
(Just noticed your original attempt is very close; you only forgot to escape things properly: the replacement needs "\\1", not "\1", and "|" is reserved, so it has to be escaped as well. Also #CAFEBABE is right that this seems better suited to awk...)
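For comparison, the same capture-group idea can be sketched in Python. This is only an illustration of the pattern from the answer above, not part of the R solution; the function name `break_every_third_pipe` and the shortened sample string are made up for the example.

```python
import re

def break_every_third_pipe(text):
    """Replace every third '|' with a newline (hypothetical helper)."""
    # Group 1 captures "field|field|field"; the fourth piece of the
    # pattern is the third pipe, which is dropped in favour of "\n".
    pattern = r"([^|]*\|[^|]*\|[^|]*)\|"
    return re.sub(pattern, r"\1\n", text)

# Shortened sample in the same shape as the input data above.
sample = "Mon|10:21:27|17:58:12|Tue|10:13:09|18:52:44|"
result = break_every_third_pipe(sample)
print(result)
```

As in R, the pipe has to be escaped to be matched literally; a raw string keeps the backslashes readable.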

empTMP <- gsub('([^|]*|[^|]*|[^|]*)|',"\1\n",emp)
This is the line which, IMHO, causes the trouble.
It should be
empTMP <- gsub('([^|]*\\|[^|]*\\|[^|]*)\\|',"\\1\n",emp)
(note the \\1, and the \\| for each literal pipe, since a bare | means alternation)
On the side: why do you want to use R for this task? Looks like something for standard shell scripting.
On the side 2: why teradata? On which TD box do you use R?

Related

Display all dates of November in a template

I'm trying to learn datetime and I'm currently trying to display all dates of November in an html template, in views I have:
year = today.year
month= today.month
num_days = calendar.monthrange(year, month)[1]
days = [datetime.date(year, month, day) for day in range(1, num_days+1)]
for days in days:
    days_str = days.strftime('%A, %B, %d, %Y')
    print(days_str)
context = {'': }
return render(request, 'template.html', context)
The output of the above is:
Monday, November, 01, 2021
Tuesday, November, 02, 2021
Wednesday, November, 03, 2021
Thursday, November, 04, 2021
Friday, November, 05, 2021
Saturday, November, 06, 2021
Sunday, November, 07, 2021
Monday, November, 08, 2021
Tuesday, November, 09, 2021
Wednesday, November, 10, 2021
Thursday, November, 11, 2021
Friday, November, 12, 2021
Saturday, November, 13, 2021
Sunday, November, 14, 2021
Monday, November, 15, 2021
Tuesday, November, 16, 2021
Wednesday, November, 17, 2021
Thursday, November, 18, 2021
Friday, November, 19, 2021
Saturday, November, 20, 2021
Sunday, November, 21, 2021
Monday, November, 22, 2021
Tuesday, November, 23, 2021
Wednesday, November, 24, 2021
Thursday, November, 25, 2021
Friday, November, 26, 2021
Saturday, November, 27, 2021
Sunday, November, 28, 2021
Monday, November, 29, 2021
Tuesday, November, 30, 2021
How to display above dates in a template?
Simply place all dates in a list and pass the list to the context of the template.
year = today.year
month= today.month
num_days = calendar.monthrange(year, month)[1]
days = [datetime.date(year, month, day) for day in range(1, num_days+1)]
days_list = []
for day in days:
    days_str = day.strftime('%A, %B, %d, %Y')
    days_list.append(days_str)
context = {'days_list':days_list}
return render(request, 'template.html', context)
Then in your template
{% for day in days_list %}
{{day}}
{% endfor %}
Alternatively, you can pass the dates as datetime objects to the template and format them there using the Django built-in date filter.
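A minimal sketch of that alternative, written without Django so it runs on its own: build the list of date objects in a helper and leave the formatting to whoever consumes it. The helper name `month_days` is an assumption for illustration; in a view, the returned list would simply go into the context.

```python
import calendar
import datetime

def month_days(year, month):
    """Return every date in the given month as datetime.date objects."""
    num_days = calendar.monthrange(year, month)[1]
    return [datetime.date(year, month, day) for day in range(1, num_days + 1)]

days = month_days(2021, 11)

# Formatting in Python, equivalent to what the view above does.
labels = [day.strftime('%A, %B, %d, %Y') for day in days]
print(labels[0])
```

If the date objects themselves are passed to the template, something like `{{ day|date:"l, F d, Y" }}` should give a similar result with the built-in date filter.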

FILTER non-working days from range/list of dates

I'm trying to create a list of dates between a start date and an end date (done). But now, I want to FILTER weekends out of that list.
The start date is defined, but the end date is based on a number of working days after the start date. The problem is, when I create the list using the following formula, all dates in between are included and I've made numerous attempts to FILTER said dates using WORKDAY.INTL and REGEXMATCH without success. Is it possible to modify this particular formula or do I need to start over with something different?
=ArrayFormula(TO_DATE(row(indirect("A"&A2):indirect("A"&B2))))
Here is an example of what I've done.
This is what I'm getting:
Friday, October 4, 2019
Saturday, October 5, 2019
Sunday, October 6, 2019
Monday, October 7, 2019
Tuesday, October 8, 2019
Wednesday, October 9, 2019
Thursday, October 10, 2019
Friday, October 11, 2019
Saturday, October 12, 2019
Sunday, October 13, 2019
This is what I'm after:
Friday, October 4, 2019
Monday, October 7, 2019
Tuesday, October 8, 2019
Wednesday, October 9, 2019
Thursday, October 10, 2019
Friday, October 11, 2019
Monday, October 14, 2019
Tuesday, October 15, 2019
Wednesday, October 16, 2019
Thursday, October 17, 2019
See if this works
=query(ArrayFormula(TO_DATE(row(indirect("A"&A2):indirect("A"&B2)))), "where dayOfWeek(Col1) <> 7 and dayOfWeek(Col1) <> 1")
you can do it like this:
=ARRAYFORMULA(FILTER(ROW(INDIRECT("A"&A2&":A"&B2)),
REGEXMATCH(TEXT(ROW(INDIRECT("A"&A2&":A"&B2)), "ddd"), "[^(Sat|Sun)]")))
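The filtering logic itself (keep a date only if it is not a Saturday or Sunday) can be sketched outside of Sheets as well. The Python below is only an illustration of that logic, not a replacement for the formulas; the function name `weekdays_between` is hypothetical, and the start and end dates are taken from the example in the question.

```python
import datetime

def weekdays_between(start, end):
    """Return all dates from start to end (inclusive) that fall on Mon-Fri."""
    days = []
    current = start
    while current <= end:
        # weekday(): Monday is 0, ..., Saturday is 5, Sunday is 6
        if current.weekday() < 5:
            days.append(current)
        current += datetime.timedelta(days=1)
    return days

result = weekdays_between(datetime.date(2019, 10, 4), datetime.date(2019, 10, 17))
for day in result:
    print(day.strftime('%A, %B %d, %Y'))
```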

Django "Latest" filter with multiple values

I have a table that looks like this:
Date Value
Oct. 23, 2018 -400
Oct. 23, 2018 -1100
Oct. 23, 2018 -200
Oct. 22, 2018 -400
Oct. 22, 2018 -1100
Oct. 21, 2018 -400
I would like to return the latest value for the date, but with multiple results.
filter().latest() only returns one object. I'd need three in this case.
Thanks!
You can give a try:
filter('some_filter_conditions').order_by('-date')[:3]
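Note that slicing with [:3] only works when you already know how many rows share the latest date. A more general idea is to find the latest date first and then keep every row that has it. Here is a plain-Python sketch of that logic using the table from the question; the `rows` list and the `latest_rows` helper are illustrative assumptions, not Django API.

```python
import datetime

# (date, value) pairs mirroring the table in the question
rows = [
    (datetime.date(2018, 10, 23), -400),
    (datetime.date(2018, 10, 23), -1100),
    (datetime.date(2018, 10, 23), -200),
    (datetime.date(2018, 10, 22), -400),
    (datetime.date(2018, 10, 22), -1100),
    (datetime.date(2018, 10, 21), -400),
]

def latest_rows(rows):
    """Return every row whose date equals the most recent date."""
    if not rows:
        return []
    latest = max(date for date, _ in rows)
    return [row for row in rows if row[0] == latest]

values = [value for _, value in latest_rows(rows)]
print(values)
```

With the ORM, the same two steps could probably be expressed as a `latest('date')` lookup followed by a `filter(date=...)` on that value, assuming the field is called `date`.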

display each day's date python from today

I was able to display the week that starts every Saturday by:
today = now().date()
sat_offset = (today.weekday() - 5) % 7
week_start = today - datetime.timedelta(days=sat_offset)
This displays the week starting from last Saturday, but how would I show the date of each following day as well? So if the week of Oct. 27, 2018 is displayed, it should say:
Saturday : Oct. 27, 2018
Sunday: Oct. 28, 2018
Monday: Oct. 29, 2018
Tuesday: Oct. 30, 2018
Wednesday: Oct. 31, 2018
Thursday: Nov. 01, 2018
Friday: Nov. 02, 2018
Thank you for your help.
You can iterate through the days of the week using range and timedelta like so:
for i in range(7):
    day = week_start + datetime.timedelta(days=i)
    print(day.strftime("%A: %b. %d, %Y"))
With week_start at Oct. 27, 2018, this produces dates like:
Saturday: Oct. 27, 2018
Sunday: Oct. 28, 2018
Monday: Oct. 29, 2018
Tuesday: Oct. 30, 2018
Wednesday: Oct. 31, 2018
Thursday: Nov. 01, 2018
Friday: Nov. 02, 2018
You can format the string however you want using the strftime format codes.

Finding unique file names from an html file

$ cat downloaded_file.html
1373 STDMON11202010_company.txt<br> Monday, November 22, 2010 1:31 AM
How do I search an html file from my shell script and select the unique filenames that start with STDMON and end with _company.txt?
If you have only digits between STDMON and _company.txt you can do:
grep -o 'STDMON[0-9]*_company\.txt' input.txt | sort -u
And if there can be anything between them, you can do:
grep -oP 'STDMON.*?_company\.txt' input.txt | sort -u
awk -F'>|<' '$3 ~ /STDMON[0-9]+_company.txt/ && !a[$0=$3]++' downloaded_file.html
Input
$ cat downloaded_file.html
1373 STDMON11202010_company.txt<br> Monday, November 22, 2010 1:31 AM
1373 STDMON11202010_company.txt<br> Monday, November 22, 2010 1:31 AM
1373 STDMON14959440_company.txt<br> Monday, November 22, 2010 1:31 AM
1373 STDMON11202010_company.txt<br> Monday, November 22, 2010 1:31 AM
1373 STDMON14959440_company.txt<br> Monday, November 22, 2010 1:31 AM
1373 STDMON11202010_company.txt<br> Monday, November 22, 2010 1:31 AM
1373 STDMON12342440_company.txt<br> Monday, November 22, 2010 1:31 AM
Output
$ awk -F'>|<' '$3 ~ /STDMON[0-9]+_company.txt/ && !a[$0=$3]++'
STDMON11202010_company.txt
STDMON14959440_company.txt
STDMON12342440_company.txt
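The same "extract, then de-duplicate" idea as the grep and sort -u pipeline can be sketched in Python, in case the shell script ends up calling out to it. This assumes only digits sit between STDMON and _company.txt; the function name `unique_filenames` and the inline sample text are made up for the example.

```python
import re

def unique_filenames(html_text):
    """Return matching filenames in first-seen order, without duplicates."""
    names = re.findall(r"STDMON[0-9]+_company\.txt", html_text)
    # dict.fromkeys keeps insertion order while dropping repeats
    return list(dict.fromkeys(names))

sample = """1373 STDMON11202010_company.txt<br> Monday, November 22, 2010 1:31 AM
1373 STDMON11202010_company.txt<br> Monday, November 22, 2010 1:31 AM
1373 STDMON14959440_company.txt<br> Monday, November 22, 2010 1:31 AM
1373 STDMON12342440_company.txt<br> Monday, November 22, 2010 1:31 AM
"""

found = unique_filenames(sample)
print("\n".join(found))
```

Unlike sort -u, this keeps the names in the order they first appear rather than sorting them.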