Extracting only dates from a text file and ignoring large numbers - regex

I have a text file and I want to extract all dates from it, but my code also extracts other values, such as the number in Procedure #: 10075453.
Below is a small sample of that file:
Patient Name: Mills, John Procedure #: 10075453
October 7, 2017
Med Rec #: 747901 Visit ID: 110408731
Patient Location: OUTPATIENT Patient Type: OUTPATIENT
DOB:07/09/1943 Gender: F Age: 73Y Phone: (321)8344-0456
Can I get an idea how I could approach this problem?
import pandas as pd

doc = []
with open('Clean.txt', encoding="utf8") as file:
    for line in file:
        doc.append(line)
df = pd.Series(doc)

def date_extract():
    one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})')
    two = df.str.extract(r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))')
    three = df.str.extract(r'((?:\d{1,2}(?:-|\/))?\d{4})')
    dates = pd.to_datetime(one.fillna(two).fillna(three).replace('Decemeber','December',regex=True).replace('Janaury','January',regex=True))
    return pd.Series(dates.sort_values())
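One way to keep the patterns from matching inside long numbers such as 10075453 is to wrap them in digit lookarounds and to accept a bare year only when it looks like 19xx or 20xx. The following is a minimal sketch using only the standard re module. The names DATE_RE and find_dates are invented here, and the 19xx/20xx restriction is an assumption about the data, not something stated in the question.

```python
import re

# Month names, abbreviated or spelled out. The trailing [a-z]* also lets
# misspellings such as "Decemeber" through.
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"

DATE_RE = re.compile(
    r"(?<!\d)(?:"                               # never start inside a number
    r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"            # 07/09/1943
    r"|(?:\d{1,2}\s)?\b" + MONTH +              # optional day, then month name
    r"[\s.,-]+(?:\d{1,2}[a-z]*[\s,-]+)?\d{4}"   # optional day, then 4-digit year
    r"|(?:\d{1,2}[/-])?(?:19|20)\d{2}"          # 6/2008 or a bare year (assumed 19xx/20xx)
    r")(?!\d)"                                  # never stop inside a number
)


def find_dates(text):
    """Return every date-like substring found in text, left to right."""
    return [match.group(0) for match in DATE_RE.finditer(text)]


if __name__ == "__main__":
    sample = [
        "Patient Name: Mills, John Procedure #: 10075453",
        "October 7, 2017",
        "Med Rec #: 747901 Visit ID: 110408731",
        "DOB:07/09/1943 Gender: F Age: 73Y Phone: (321)8344-0456",
    ]
    for line in sample:
        print(line, "->", find_dates(line))
```

The extracted strings could then be passed to pd.to_datetime as in the original code. The lookbehind (?<!\d) and lookahead (?!\d) are what reject 10075453: no alternative is allowed to begin or end in the middle of a run of digits.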

References not sorted properly by author-year

I'm working on a markdown Rmd document with references in several .bib BibTeX databases. The yaml header includes:
---
title: "title"
author: "me"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  bookdown::word_document2:
    reference_docx: StylesTemplate.docx
    number_sections: false
bibliography:
  - "`r system('kpsewhich graphics.bib', intern=TRUE)`"
  - "`r system('kpsewhich statistics.bib', intern=TRUE)`"
  - "`r system('kpsewhich timeref.bib', intern=TRUE)`"
  - "`r system('kpsewhich Rpackages.bib', intern=TRUE)`"
csl: apa.csl
---
I am stymied in how to get the following references to sort in the correct author-year order. The first two are out of order.
I am aware that pandoc-citeproc with a .csl file attempts to disambiguate authors when there are different spellings, but I checked my .bib files and all of these Tukey publications have one of:
author = {John W. Tukey}
author = {Tukey, John W.}
so they should be considered the same.
The first 4 references in my BibTeX files are:
@InProceedings{Tukey:1975:picturing,
author = {John W. Tukey},
booktitle = {Proceedings of the International Congress of Mathematicians, Vancouver},
title = {Mathematics and the picturing of data},
year = {1975},
pages = {523--531},
volume = {2},
}
@Techreport{Tukey:1993:TR,
author = "John W. Tukey",
title = "Exploratory Data Analysis: Past, Present, and Future",
institution = "Department of Statistics, Princeton University",
year = "1993",
number = "No. 302",
month = apr,
url = "https://apps.dtic.mil/dtic/tr/fulltext/u2/a266775.pdf",
}
@Article{Tukey:59,
author = {John W. Tukey},
journal = {Technometrics},
title = {A Quick, Compact, Two Sample Test to {Duckworth's} Specifications},
year = {1959},
pages = {31--48},
volume = {1},
doi = {10.2307/1266308},
url = {https://www.jstor.org/stable/1266308},
}
@article{Tukey:1962,
Author = {John W. Tukey},
Journal = {The Annals of Mathematical Statistics},
Number = {1},
Pages = {1--67},
Publisher = {Institute of Mathematical Statistics},
Title = {The Future of Data Analysis},
Url = {http://www.jstor.org/stable/2237638},
Volume = {33},
Year = {1962},
}
I see minor differences in formatting, but these should not affect pandoc-citeproc sorting.
Is this perhaps a bug in pandoc-citeproc or is there something I can do in my .bib files to avoid this?
I'm running R 4.1.3 under RStudio 2022.02.1, with pandoc 2.17.1.1.
Update
I re-ran this using the chicago-author-date.csl style. All the references now sort correctly, so there must be something peculiar about the apa.csl style. I'd still prefer apa.csl, so I would like to understand why the two styles differ.

How do I sort through lines of a text document to find a phrase based on a date?

I am writing a program that logs jobs into a file and then sorts and organises the jobs by date. The entries are lists that are just appended to the end of a text file. They appear in the file like so:
2017-01-31,2016-05-24,test1
2016-05-15,2016-05-24,test2
2016-06-15,2016-05-24,test3
2016-07-16,2016-05-24,test4
They follow this format: due date, date entered, job title. I would like to be able to print the jobs from the text file to the Python shell in order of date, with the job with the closest due date first. I was thinking of turning each line into an item in a list, doing something with the due date characters, and sorting that way. I can't figure out how to keep everything together if I do it that way, though. Any thoughts?
Use datetime.datetime.strptime to parse the date strings into datetime objects. Then just sort the list of jobs by date and output them.
from datetime import datetime

date_str_format = '%Y-%m-%d'
jobs = []
with open('jobs.txt', 'r') as f:
    for line in f:
        # Each line is: due date, date entered, job title
        date_due, date_entered, title = line.split(',')
        jobs.append((datetime.strptime(date_due, date_str_format),
                     datetime.strptime(date_entered, date_str_format),
                     title.strip()))
# Tuples compare element by element, so this orders the jobs by due date
jobs.sort()
for date_due, _, title in jobs:
    print('{} (due {})'.format(title, date_due))
Here are the contents of jobs.txt:
2017-01-31,2016-05-24,test1
2016-05-15,2016-05-24,test2
2016-06-15,2016-05-24,test3
2016-07-16,2016-05-24,test4
And the output...
test2 (due 2016-05-15 00:00:00)
test3 (due 2016-06-15 00:00:00)
test4 (due 2016-07-16 00:00:00)
test1 (due 2017-01-31 00:00:00)
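I think this does what you want. Because your dates are in YYYY-MM-DD format, sorting the strings also sorts them chronologically. Note that reverse=True lists the latest due date first; drop it to list the earliest first.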
lines = """
2017-01-31,2016-05-24,test1
2016-05-15,2016-05-24,test2
2016-06-15,2016-05-24,test3
2016-07-16,2016-05-24,test4
"""
sorted([entry.split(",") for entry in lines.split("\n") if any(entry)], reverse=True)
In Python 2.7 shell:
>>> lines = """
... 2017-01-31,2016-05-24,test1
... 2016-05-15,2016-05-24,test2
... 2016-06-15,2016-05-24,test3
... 2016-07-16,2016-05-24,test4
... """
>>>
>>> lines_sorted = sorted([entry.split(",") for entry in lines.split("\n") if any(entry)], reverse=True)
>>> for line in lines_sorted:
...     print(line)
...
['2017-01-31', '2016-05-24', 'test1']
['2016-07-16', '2016-05-24', 'test4']
['2016-06-15', '2016-05-24', 'test3']
['2016-05-15', '2016-05-24', 'test2']
>>>
Using string formatting and unpacking using *:
output_str = "Due date: {0}\nDate entered: {1}\nJob title: {2}\n"
entries_sorted = sorted([entry.split(",") for entry in lines.split("\n") if any(entry)], reverse=True)
for entry in entries_sorted:
    print(output_str.format(*entry))
Output:
Due date: 2017-01-31
Date entered: 2016-05-24
Job title: test1
Due date: 2016-07-16
Date entered: 2016-05-24
Job title: test4
Due date: 2016-06-15
Date entered: 2016-05-24
Job title: test3
Due date: 2016-05-15
Date entered: 2016-05-24
Job title: test2
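To keep each job's fields together while sorting, it can help to parse every line into one tuple and sort the tuples. The following is a minimal Python 3.7+ sketch (it relies on date.fromisoformat). The function names load_jobs and by_closest_due are invented for illustration, and reading "closest date" as "nearest to a given reference day" is an assumption about what the question means.

```python
import csv
import io
from datetime import date


def load_jobs(lines):
    """Parse 'due,entered,title' rows into (due, entered, title) tuples."""
    jobs = []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        due, entered, title = (field.strip() for field in row)
        jobs.append((date.fromisoformat(due), date.fromisoformat(entered), title))
    return jobs


def by_closest_due(jobs, today):
    """Order jobs by the distance between their due date and today, nearest first."""
    return sorted(jobs, key=lambda job: abs(job[0] - today))


# Sample data copied from the question; a real program would pass an open file.
SAMPLE = io.StringIO(
    "2017-01-31,2016-05-24,test1\n"
    "2016-05-15,2016-05-24,test2\n"
    "2016-06-15,2016-05-24,test3\n"
    "2016-07-16,2016-05-24,test4\n"
)

jobs = load_jobs(SAMPLE)
for due, _entered, title in sorted(jobs):  # earliest due date first
    print(f"{title} (due {due})")
```

Because each job is a single tuple, sorting moves the due date, entry date, and title together. by_closest_due(jobs, date.today()) would instead put the job whose due date is nearest to today first.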

PIG - Create a tuple line from flat file

I'm trying to create a tuple in Pig, but the format of the file is not very friendly:
File Format:
Name: Zach
LastName: Red
Address: 34 Store Av
Age: 34
Name: Brian
LastName: Curts
Address: 123 Street Av
Age: 23
I need to create a tuple:
Name: Zach LastName: Red Address: 34 Store Av Age: 34
Name: Brian LastName: Curts Address: 123 Street Av Age: 23
You can write your own UDF in Java/Python/... to load this data. Check doc:
http://pig.apache.org/docs/r0.15.0/udf.html#load-store-functions
Crazy idea, but it might work; I ASSUME all your elements have 4 rows. Otherwise - it won't work.
Load the file using PigStorage
Use the RANK operator to generate a RANK field for each row. First row would get 1, 2nd row would get 2, etc.
For each row generate another number, between 1-4, based on its type: 1 for name, 2 for LastName, 3 for Address, 4 for Age. Let's call it 'RecordType'
Add another field, FLOOR((RANK-1)/4), and name it 'PersonID'. For the 1st person it would be 0, for the 2nd one it would be 1, etc.
Now you can group by PersonID to get all records for the same person 'together'.
Now, for each person you would get the PersonID, and a bag containing all the records. We need to get them sorted. For that purpose you can use
output = foreach Person {
    sorted = order PersonRows by RecordType;
    generate PersonID,sorted;
}
Flatten the Bag into a Tuple using the BagToTuple function
and you're done.
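The grouping idea in the steps above can be illustrated outside Pig. The following is a rough Python sketch of the same logic, not Pig code: enumerate() stands in for RANK (0-based here, so no "-1" is needed), and integer division by 4 gives the PersonID. The name rows_to_records is invented, and the sketch shares the answer's assumption that every person has exactly 4 rows.

```python
from itertools import groupby

FIELDS_PER_PERSON = 4  # assumption from the answer: exactly 4 rows per person


def rows_to_records(lines):
    """Join each run of FIELDS_PER_PERSON consecutive rows into one record."""
    rows = [line.strip() for line in lines if line.strip()]
    # enumerate() plays the role of RANK; rank // 4 is the PersonID.
    ranked = enumerate(rows)
    records = []
    for _person_id, group in groupby(
        ranked, key=lambda pair: pair[0] // FIELDS_PER_PERSON
    ):
        # Rows are already in file order, so no extra sort by RecordType is needed.
        records.append(" ".join(row for _rank, row in group))
    return records


if __name__ == "__main__":
    sample = """Name: Zach
LastName: Red
Address: 34 Store Av
Age: 34
Name: Brian
LastName: Curts
Address: 123 Street Av
Age: 23"""
    for record in rows_to_records(sample.splitlines()):
        print(record)
```

In Pig the explicit sort by RecordType is still needed, because the rows inside a grouped bag are not guaranteed to keep file order.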

Using Perl to extract text from a text file

I have a question related to using regex to pull out data from a text file. I have a text file in the following format:
REPORTING-OWNER:
OWNER DATA:
COMPANY CONFORMED NAME: DOE JOHN
CENTRAL INDEX KEY: 99999999999
FILING VALUES:
FORM TYPE: 4
SEC ACT: 1934 Act
SEC FILE NUMBER: 811-00248
FILM NUMBER: 11530052
MAIL ADDRESS:
STREET 1: 7 ST PAUL STREET
STREET 2: STE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
ISSUER:
COMPANY DATA:
COMPANY CONFORMED NAME: ACME INC
CENTRAL INDEX KEY: 0000002230
IRS NUMBER: 134912740
STATE OF INCORPORATION: MD
FISCAL YEAR END: 1231
BUSINESS ADDRESS:
STREET 1: SEVEN ST PAUL ST STE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
BUSINESS PHONE: 4107525900
MAIL ADDRESS:
STREET 1: 7 ST PAUL STREET SUITE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
I want to save the owner's name (John Doe) and identifier (99999999999) and the company's name (ACME Inc) and identifier (0000002230) as separate variables. However, as you can see, the variable names (CENTRAL INDEX KEY and COMPANY CONFORMED NAME) are exactly the same for both pieces of information.
I've used the following code to extract the owner's information, but I can't figure out how to extract the data for the company. (Note: I read the entire text file into $data).
if($data=~m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/m){$cik=$1;}
if($data=~m/^\s*COMPANY\s*CONFORMED\s*NAME:\s*(.*$)/m){$name=$1;}
Any idea as to how I can extract the information for both the owner and the company?
Thanks!
There is a big difference between doing it quick and dirty with regexes (maintenance nightmare), or doing it right.
As it happens, the file you gave looks very much like YAML.
use YAML;
my $data = Load(...);
say $data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"};
say $data->{"ISSUER"}->{"COMPANY DATA"}->{"COMPANY CONFORMED NAME"};
Prints:
DOE JOHN
ACME INC
Isn't that cool? All in a few lines of safe and maintainable code ☺
my ($ownname, $ownkey, $comname, $comkey) = $data =~ /\bOWNER DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+).*\bCOMPANY DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+)/ms;
If you're reading this file on a UNIX operating system but it was generated on Windows, then line endings will be indicated by the character pair \r\n instead of just \n, and in this case you should do
$data =~ tr/\r//d;
first to get rid of these \r characters and prevent them from finding their way into $ownname and $comname.
Select both bits of information at the same time so that you know that you're getting the CENTRAL INDEX KEY associated with either the owner or the company.
($name, $cik) = $data =~ /COMPANY\s+CONFORMED\s+NAME:\s+(.+)$\s+CENTRAL\s+INDEX\s+KEY:\s+(.*)$/m;
Instead of trying to match elements in the string, split it into lines and parse it into a data structure that makes such lookups easy, like:
$data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"}
That should be relatively easy to do.
Search for the OWNER DATA: header, read one more line, split it on : and take the last field. Do the same for the COMPANY DATA: header, and so on.
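The header-tracking approach described above can be sketched as follows. This is a Python illustration rather than Perl, and parse_sections is an invented name. It assumes that any line with nothing after the colon (such as OWNER DATA:) is a section header and that every other line is a key: value pair belonging to the most recent header.

```python
def parse_sections(text):
    """Map each header line to a dict of the 'key: value' lines below it."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and not value:
            # A line such as "OWNER DATA:" has no value, so treat it as a header.
            current = sections.setdefault(key.strip(), {})
        elif sep and current is not None:
            # Keep the first occurrence of a key within a section.
            current.setdefault(key.strip(), value)
    return sections


if __name__ == "__main__":
    sample = """REPORTING-OWNER:
OWNER DATA:
COMPANY CONFORMED NAME: DOE JOHN
CENTRAL INDEX KEY: 99999999999
ISSUER:
COMPANY DATA:
COMPANY CONFORMED NAME: ACME INC
CENTRAL INDEX KEY: 0000002230
"""
    data = parse_sections(sample)
    print(data["OWNER DATA"]["COMPANY CONFORMED NAME"])
    print(data["COMPANY DATA"]["CENTRAL INDEX KEY"])
```

One limitation of this sketch: headers that repeat, such as MAIL ADDRESS, are merged into one dict in which the first value wins. Nesting the sections under REPORTING-OWNER and ISSUER would avoid that, but it requires the file's indentation, which is not shown here.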

Search then Extract

I have a text file with multiple records. I want to search by name and date; for example, if I type JULIUS CESAR as the name, then all of the data about JULIUS should be extracted. What if I only want to extract the Information field?
Record number: 1
Date: 08-Oct-08
Time: 23:45:01
Name: JULIUS CESAR
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
A Computer Engineering student
An OJT at TIPI.
23 years old.
Record number: 2
Date: 09-Oct-08
Time: 23:45:01
Name: JOHN Castro
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
A Electronics Comm. Engineering Student at SLU.
An OJT at TIPI.
My Hobby is Programming.
Record number: 3
Date: 08-Oct-08
Time: 23:45:01
Name: CESAR JOSE
Address: BAGUIO CITY, Philippines
Information:
Hi,,
I lived Manila City
A Computer Engineering student
Working at TIPI.
If it is one line per entry, you could use a regular expression such as:
$name = "JULIUS CESAR";
Then use:
/$name/i
to test if each line is about "JULIUS CESAR." Then you simply have to use the following regex to extract the information (once you find the line):
/Record number: (\d+) Date: (\d+)-(\w+)-(\d+) Time: (\d+):(\d+):(\d+) Name: $name Address: ([\w\s]+), ([\w\s]+?) Information: (.+?)$/i
$1 = record number
$2-$4 = date
$5-$7 = time
$8-$9 = address (city, country)
$10 = information
I would write a code example, but my perl is rusty. I hope this helps :)
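Since the sample file spreads each record over several lines, one option is to split the text into records first and then look up the field you need. The following is a minimal Python 3.7+ sketch (re.split on an empty match needs 3.7 or later), not Perl. The names parse_records and find_information are invented, and it assumes every record begins with a line starting with "Record number:" and that everything after "Information:" belongs to that record.

```python
import re


def parse_records(text):
    """Split the file into records and return one dict per record."""
    records = []
    # Split just before every line that begins with "Record number:".
    for chunk in re.split(r"(?m)^(?=Record number:)", text):
        if not chunk.strip():
            continue
        head, _, info = chunk.partition("Information:")
        record = {}
        for line in head.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                record[key.strip()] = value.strip()
        # Store the free-text part as a list of non-empty lines.
        record["Information"] = [l.strip() for l in info.splitlines() if l.strip()]
        records.append(record)
    return records


def find_information(records, name, date=None):
    """Return the Information lines of records matching name (and date, if given)."""
    return [
        r["Information"]
        for r in records
        if r.get("Name", "").lower() == name.lower()
        and (date is None or r.get("Date") == date)
    ]
```

For example, find_information(parse_records(text), "JULIUS CESAR", "08-Oct-08") would return only the Information lines of the matching record, leaving out the other fields. The name comparison is case-insensitive, like the /i flag in the answer above.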
In PHP, you can run a SQL select statement like:
"SELECT * WHERE name LIKE 'JULIUS%';"
There are native aspects of PHP where you can get all of your results in an associative array. I'm pretty sure it's ordered by row order. Then you can just do something like this:
echo implode(" ", $whole_row);
Hope this is what you're looking for!