Extracting each unique data from a string [closed]


I have been trying to set up a regex extraction process for the following, to no avail.
I have a set of date values in the formats to follow. I need to be able to extract these as unique individual dates.
If there is a single value, it is a standard simple format of mm/dd/yyyy. That one is easy.
If there is more than one date value, it can be in a format like the following:
Feb 5, 12, 19, 26, Mar 4, 11 2016
I need to turn these into 02/05/2016, 02/12/2016, etc.
Eventually I will be inserting these dates into a database.
Am I going about this in the wrong way? Thanks for advice.

This will be complete spaghetti if you try to do it with one regex:
You will have to hardcode the names of the months and the corresponding numbers somewhere.
The year does not follow each month's list of days. It appears only once, after all the month names and day lists that belong to that year.
However, with a little help from a normal programming language you can still get a short and regex-centric solution. Here is a small Ruby snippet to show the general idea:
# this is the input
dates = "Feb 5, 12, 19, 26, Mar 4, 11 2016, Jul 5, 7, 19, 26, May 4, 11 2017"
# a hash with month name => month number
MONTHS = {
  'Jan' => '01',
  'Feb' => '02',
  'Mar' => '03',
  'Apr' => '04',
  'May' => '05',
  'Jun' => '06',
  'Jul' => '07',
  'Aug' => '08',
  'Sep' => '09',
  'Oct' => '10',
  'Nov' => '11',
  'Dec' => '12',
}
# match and extract three things:
#   month - the first found month name (three letters)
#   days  - list of days separated by commas and spaces for this month,
#           for example "5, 12, 19, 26, "
#   year  - the first found year (four digits)
# ,? is because we don't have , after the last day of the year
while dates =~ /(\w{3}) ((?:\d\d?,? )+).*?(\d{4})/
  month, days, year = $1, $2, $3
  # to each day collate a date in the wanted format
  # MONTHS[month] gets the month number from the hash above
  # sprintf simply makes sure that one digit days will have a leading 0
  dates_this_month = days.split(/,? /).map do |day|
    "#{MONTHS[month]}/#{sprintf('%02d', day)}/#{year}"
  end.join ', '
  # substitute the dates for this month with the new format
  dates.sub! "#{month} #{days}", "#{dates_this_month}, "
end
# remove leftover years
dates.gsub!(/, \d{4}/, '')
Now dates is in the desired format.
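For comparison, the same idea can be sketched in Python by tokenizing the string and holding each day back until its year turns up. This is only an illustration under the assumption that the input contains nothing but three-letter English month abbreviations, one- or two-digit days, and four-digit years; `expand_dates` and `MONTHS` are hypothetical names, not part of the original answer.

```python
import re

# Month abbreviation -> zero-padded month number (assumes English abbreviations).
MONTHS = {name: f"{i:02d}" for i, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def expand_dates(text):
    """Expand 'Feb 5, 12, Mar 4, 11 2016' into a list of mm/dd/yyyy strings."""
    results = []
    pending = []   # (month, day) pairs that are still waiting for their year
    month = None
    # \d{4} is tried before \d{1,2} so that a year is never split into days.
    for token in re.findall(r"[A-Za-z]{3}|\d{4}|\d{1,2}", text):
        if token in MONTHS:
            month = MONTHS[token]
        elif not token.isdigit():
            continue                      # some other three-letter word
        elif len(token) == 4:
            # A year closes every month/day pair collected so far.
            results.extend(f"{m}/{d}/{token}" for m, d in pending)
            pending = []
        elif month is not None:
            pending.append((month, f"{int(token):02d}"))
    return results


print(", ".join(expand_dates("Feb 5, 12, 19, 26, Mar 4, 11 2016")))
```

Returning a list rather than a joined string keeps each date separate, which is convenient if the values are going to be inserted into a database one row at a time.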

Assuming that there are no deviations or anomalies in the data that you're RegExing, the following RegEx can be applied with case-sensitivity set and allow you to access the information you need. With RegExs, it's important to "know your data" because this variable can greatly alter the construction of a RegEx -- the balance between specificity and clarity is important since RegExs can easily become unwieldy and cryptic.
Save the months as: ([A-Z][a-z][a-z]) // this can be your $1 variable (useful later)
Save the day values as: \s*(?:([0-9]?[0-9]),\s)* // $2 variable should work for access to this list of values
Save the year values as: ([0-9]{4,4}) // $3 variable should work for accessing these values NOTE: this only works for #### formatted dates by design although it can be altered to handle different formats; I'm just going off of the example you provided
Stringing it all together you get: (?:([A-Z][a-z][a-z])\s*(?:([0-9]?[0-9]),\s)*)+([0-9]{4,4})
You can then construct objects with these values so that you don't end up with a bunch of chaotic data. Let me know if I addressed your problem properly. If there's something that I missed or some additional functionality that you forgot to mention, I will be happy to assist.

Related

For Loop and If Statement not performing as expected

Here's the code:
import re
from datetime import date

# Scrape table data
alltable = driver.find_elements_by_id("song-table")
date = date.today()
simple_year_list = []
complex_year_list = []
dateformat1 = re.compile(r"\d\d\d\d")
dateformat2 = re.compile(r"\d\d\d\d-\d\d-\d\d")
for term in alltable:
    simple_year = dateformat1.findall(term.text)
    for year in simple_year:
        if 1800 < int(year) < date.year:  # Year can't be above what the current year is or below 1800,
            simple_year_list.append(simple_year)  # Might have to be changed if you have a song from before 1800
        else:
            continue
    complex_year = dateformat2.findall(term.text)
    complex_year_list.append(complex_year)
The code uses regular expressions to find four consecutive digits. Since there are multiple 4 digit numbers, I want to narrow it down to between 1800 and 2021 since that's a reasonable time frame. simple_year_list, however, prints out numbers that don't follow the conditions.

You aren't saving the right value here:
simple_year_list.append(simple_year)
You should be saving the year:
simple_year_list.append(year)
I would need more information to help further though. Maybe give us a sample of the data you're working through, and the output you're seeing?

You can do it all in regex.
Add start ^ and end $ anchors, and range restriction via pattern:
dateformat1 = re.compile(r"^(1[89]\d\d|20([01]\d|2[01]))$")
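One caveat with the anchored pattern: ^ and $ only match when the string being tested is a bare year, and with findall the capturing groups cause tuples of groups to be returned rather than the year itself. If the years have to be pulled out of longer text, a variant with word boundaries and non-capturing groups may be closer to what is needed. The sketch below assumes the same 1800 to 2021 range; `YEAR` and `plausible_years` are illustrative names.

```python
import re

# \b keeps the match from starting or ending inside a longer run of digits;
# (?:...) groups do not capture, so findall returns the whole year string.
YEAR = re.compile(r"\b(?:1[89]\d\d|20(?:[01]\d|2[01]))\b")


def plausible_years(text):
    """Return every four-digit number in text that lies in 1800..2021."""
    return YEAR.findall(text)


print(plausible_years("Released 1999, catalogue 4521, remastered 2015"))
```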

Matching diverse dates in Openrefine

I am trying to use the value.match command in OpenRefine 2.6 to split the information present in a column into (at least) 2 columns.
The data are, however, quite messed up.
I have sometimes full dates:
May 30, 1949
Sometimes full dates are combined with other dates and attributes:
May 30, 1949, published 1979
May 30, 1949 and 1951, published 1979
May 30, 1949, printed 1980
May 30, 1949, print executed 1988
May 30, 1949, prints executed 1988
published 1940
Sometimes you have timespan:
1905-05 OR 1905-1906
Sometimes only the year
1905
Sometimes year with attributes
August or September 1908
The data doesn't seem to follow any specific schema or order.
I would like to extract (at least) a start and an end year, in order to have two columns:
-----------------------
|start_date | end_date|
|1905 | 1906 |
-----------------------
without the rest of the attributes.
I can find the last date using
value.match(/.*(\d{4}).*?/)[0]
and the first one with
value.match(/.*^(\d{4}).*?/)[0]
but I got some trouble with the two formulas.
The latter cannot match anything in case of:
May 30, 1949 and 1951, published 1979
while in the case of:
Paris, winter 1911-12
The latter formula cannot match anything and the former formula matches 1911.
Does anyone know how I can resolve the problem?
I need a solution that takes the first date as start_date and the final date as end_date, or better (I don't know if it is possible) the earliest date as start_date and the latest date as end_date.
Moreover, I would be glad to have some clue about how to extract other information, such as
if published or printed or executed is present in the text -> copy date to a new column name “execution”.
should be something like create a new column
if(value.match("string1|string2|string3" + (\d{4}), "perform the operation", do nothing)

value.match() is a very useful but sometimes tricky function. To extract a pattern from a text, I prefer to use Python/Jython's regular expressions :
import re
pattern = re.compile(r"\d{4}")
return pattern.findall(value)
From there, you can create a string with all the years concatenated:
return ",".join(pattern.findall(value))
Or select only the first:
return pattern.findall(value)[0]
Or the last:
return pattern.findall(value)[-1]
etc.
Same thing for your sub-question:
import re
pattern = re.compile(r"(published|printed|executed)\s+(\d+)")
return pattern.findall(value)[0][1]
Or :
import re
pattern = re.compile(r"(published|printed|executed)\s+(\d+)")
m = re.search(pattern, value)
return m.group(2)
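For the "earliest date as start_date, latest date as end_date" variant, one possible sketch in plain Python follows (inside OpenRefine the body would be adapted to a Jython expression that returns a value). It assumes that a two-digit suffix such as the "12" in "1911-12" abbreviates a year in the same century, which is a guess about the data rather than something the question states; `year_span` is a hypothetical helper name.

```python
import re

# A four-digit year, optionally followed by "-NN" (two digits standing alone).
YEAR_OR_RANGE = re.compile(r"\b(\d{4})\b(?:-(\d{2})\b)?")


def year_span(text):
    """Return (earliest_year, latest_year) found in text, or None."""
    years = []
    for full, short in YEAR_OR_RANGE.findall(text):
        years.append(int(full))
        if short:
            # Assumption: "1911-12" means 1911 to 1912.
            years.append(int(full[:2] + short))
    if not years:
        return None
    return min(years), max(years)


print(year_span("May 30, 1949 and 1951, published 1979"))
print(year_span("Paris, winter 1911-12"))
```

Taking min and max rather than the first and last match means the order in which the years appear in the text does not matter.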

Here is a regex which will extract start_date and end_date in named groups.
If there is only one date, it is treated as the start_date:
((?<start_date>\d{4}).*?)?(?<end_date>\d{4}|(?<=-)\d{2})?$

Regular expression and csv | Output more readable

I have a text which contains different news articles about terrorist attacks. Each article starts with an html tag (<p>Advertisement) and I would like to extract from each article a specific information: the number of people wounded in the terrorist attacks.
This is a sample of the text file and how the articles are separated:
[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded 2 police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.” , The two officers were attacked on the Boulevard Lambermont.....]
[<p>Advertisement ,, By KAREEM FAHIM and MOHAMAD FAHIM ABED JUNE 30, 2016
, At least 33 people were killed and 25 were injured when the Taliban bombed buses carrying police cadets on the outskirts of Kabul, Afghanistan, on Thursday. , KABUL, Afghanistan — Taliban insurgents bombed a convoy of buses carrying police cadets on the outskirts of Kabul, the Afghan capital, on Thursday, killing at least 33 people, including four civilians, according to government officials and the United Nations. , During a year...]
This is my code so far:
import re

text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
splitted = text_read.split("<p>")
pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"
for article in splitted:
    result = re.findall(pattern, article)
    print(result)
The output that I get is:
[]
[]
[]
[('', '40', '')]
[('', '150', '')]
[('94', '', '')]
And I would like to make the output more readable and then save it as csv file:
article_1,0
article_2,0
article_3,40
article_3,150
article_3,94
Any suggestions on how to make it more readable?

I rewrote your loop like this and merged with csv write since you requested it:
import csv
with open("wounded.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    for i, article in enumerate(splitted):
        result = re.findall(pattern, article)
        nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
        row = ["article_{}".format(i+1), nb_casualties]
        writer.writerow(row)
get index of the article using enumerate
sum the number of victims (in case more than 1 group matches) using a generator comprehension to convert to integer and pass it to sum, that only if something matched (ternary expression checks that)
create the row
print it, or optionally write it as row (one row per iteration) of a csv.writer object.
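A self-contained variant of the same loop is sketched below, writing to any file-like object so it can be tried without the original data. Unlike the loop above, it adds up every match in an article rather than only the first one; whether that is the desired behaviour depends on the data, so treat it as an assumption. `count_wounded` and `write_counts` are illustrative names.

```python
import csv
import io
import re

PATTERN = re.compile(r"wounded (\d+)|(\d+) were wounded|(\d+) were injured")


def count_wounded(article):
    """Sum the numbers captured by any alternative, over all matches."""
    return sum(int(n) for groups in PATTERN.findall(article) for n in groups if n)


def write_counts(articles, out):
    """Write one 'article_N,count' row per article to the file-like object out."""
    writer = csv.writer(out)
    for i, article in enumerate(articles, start=1):
        writer.writerow(["article_{}".format(i), count_wounded(article)])


sample = [
    "A man wounded 2 police officers with a knife in Brussels.",
    "At least 33 people were killed and 25 were injured near Kabul.",
    "No casualty figures were reported.",
]
buffer = io.StringIO()
write_counts(sample, buffer)
print(buffer.getvalue())
```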

Python sorting timestamp

I am struggling with something that should be relatively straightforward, but I am getting nowhere.
I have a bunch of data that has a timestamp in the format of hh:mm:ss. The data ranges from 00:00:00 all 24 hours of the day through 23:59:59.
I do not know how to go about pulling out the hh part of the data, so that I can just look at data between specific hours of the day.
I read the data in from a CSV file using:
with open(filename) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        time = row['Time']
This gives me time in the hh:mm:ss format, but now I do not know how to do what I want, which is look at the data from 6am until 6pm. 06:00:00 to 18:00:00.

With the times in 24 hour format, this is actually very simple:
'06:00:00' <= row['Time'] <= '18:00:00'
Assuming that you only have valid timestamps, this is true for all times between 6 AM and 6 PM inclusive.
If you want to get a list of all rows that meet this, you can put this into a list comprehension:
relevant_rows = [row for row in reader if '06:00:00' <= row['Time'] <= '18:00:00']
Update:
For handling times with no leading zero (0:00:00, 3:00:00, 15:00:00, etc), use split to get just the part before the first colon:
> row_time = '0:00:00'
> row_time.split(':')
['0', '00', '00']
> int(row_time.split(':')[0])
0
You can then check if the value is at least 6 and less than 18. If you want to include entries that are at 6 PM, then you have to check the minutes and seconds to make sure it is not after 6 PM.
However, you don't even really need to try anything like regex or even a simple split. You have two cases to deal with - either the hour is one digit, or it is two digits. If it is one digit, it needs to be at least six. If it is two digits, it needs to be less than 18. In code:
if row_time[1] == ':':  # 1-digit hour
    if row_time > '6':  # 6 AM or later
        pass  # This is an entry you want
else:
    if row_time < '18:00:00':  # Use <= if you want 6 PM to be included
        pass  # This is an entry you want
or, compacted to a single line (the two-digit test has to be guarded as well, otherwise '0:00:00' compares as less than '18:00:00' and slips through):
if (row_time[1] == ':' and row_time > '6') or (row_time[1] != ':' and row_time < '18:00:00'):
    pass  # Parentheses are not actually needed, but help make it clearer
as a list comprehension:
relevant_rows = [row for row in reader if (row['Time'][1] == ':' and row['Time'] > '6') or (row['Time'][1] != ':' and row['Time'] < '18:00:00')]
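An alternative that sidesteps string comparison altogether is to parse each timestamp into a `datetime.time` and compare those objects. The sketch below assumes every value is a valid h:mm:ss or hh:mm:ss string; `in_window` is a hypothetical helper, and the 6 AM to 6 PM bounds are inclusive here.

```python
from datetime import datetime, time


def in_window(stamp, start=time(6, 0), end=time(18, 0)):
    """True if stamp ('h:mm:ss' or 'hh:mm:ss') lies in [start, end]."""
    # %H accepts hours with or without a leading zero.
    parsed = datetime.strptime(stamp, "%H:%M:%S").time()
    return start <= parsed <= end


for stamp in ["0:00:00", "5:59:59", "06:00:00", "9:30:00", "18:00:00", "18:00:01"]:
    print(stamp, in_window(stamp))
```

Parsing costs a little more than comparing strings, but it gives the same answer whether or not the hours are zero-padded, and it raises an error on malformed values instead of silently misclassifying them.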

You can use Python's slicing syntax to pull characters from the string.
For example:
time = '06:05:22'
timestamp_hour = time[0:2]  # characters at index 0 and 1 (the end index 2 is excluded)
print(timestamp_hour)
# prints: 06
This should produce the first two digits: '06'. Then you can call int() to convert them to an integer:
hour = int(timestamp_hour)
print(hour)
# prints: 6
Now you have an integer variable that can be checked to see if it is between, say, 6 and 18.
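Putting the hour extraction together with the CSV reading from the question might look like the sketch below. It uses split(':') instead of a fixed slice so that hours without a leading zero also work, and it keeps rows from 6 AM up to, but not including, 6 PM. The function name `rows_between` and the sample column name 'Value' are made up for illustration.

```python
import csv
import io


def rows_between(csvfile, first_hour=6, last_hour=18):
    """Return the rows whose 'Time' hour satisfies first_hour <= hour < last_hour."""
    reader = csv.DictReader(csvfile)
    return [row for row in reader
            if first_hour <= int(row["Time"].split(":")[0]) < last_hour]


# In-memory stand-in for the real file.
sample = io.StringIO("Time,Value\n05:59:59,a\n06:00:00,b\n9:15:00,c\n18:00:00,d\n")
for row in rows_between(sample):
    print(row["Time"], row["Value"])
```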

What is the best way to populate a load file for a date lookup dimension table?

Informix 11.70.TC4:
I have an SQL dimension table which is used for looking up a date (pk_date) and returning another date (plus1, plus2 or plus3_months) to the client, depending on whether the user selects a "1","2" or a "3".
The table schema is as follows:
TABLE date_lookup
(
pk_date DATE,
plus1_months DATE,
plus2_months DATE,
plus3_months DATE
);
UNIQUE INDEX on date_lookup(pk_date);
I have a load file (pipe delimited) containing dates from 01-28-2012 to 03-31-2014.
The following is an example of the load file:
01-28-2012|02-28-2012|03-28-2012|04-28-2012|
01-29-2012|02-29-2012|03-29-2012|04-29-2012|
01-30-2012|02-29-2012|03-30-2012|04-30-2012|
01-31-2012|02-29-2012|03-31-2012|04-30-2012|
...
03-31-2014|04-30-2014|05-31-2014|06-30-2014|
........................................................................................
EDIT: Sir Jonathan's SQL statement using DATE(pk_date + n UNITS MONTH) on 11.70.TC5 worked!
I generated a load file with pk_date's from 01-28-2012 to 12-31-2020, and plus1, plus2 & plus3_months NULL. Loaded this into date_lookup table, then executed the update statement below:
UPDATE date_lookup
SET plus1_months = DATE(pk_date + 1 UNITS MONTH),
plus2_months = DATE(pk_date + 2 UNITS MONTH),
plus3_months = DATE(pk_date + 3 UNITS MONTH);
Apparently, DATE() was able to convert pk_date to DATETIME, do the math with TC5's new algorithm, and return the result in DATE format!
.........................................................................................
The rules for this dimension table are:
If pk_date has 31 days in its month and plus1, plus2 or plus3_months only have 28, 29, or 30 days, then let plus1, plus2 or plus3 equal the last day of that month.
If pk_date has 30 days in its month and plus1, plus2 or plus3 has 28 or 29 days in its month, let them equal the last valid date of those months, and so on.
All other dates fall on the same day of the following month.
My question is: What is the best way to automatically generate pk_dates past 03-31-2014 following the above rules? Can I accomplish this with an SQL script, "sed", C program?
EDIT: I mentioned sed because I already have more than two years' worth of data and could perhaps model the rest after this data, or perhaps a tool like awk is better?

The best technique would be to upgrade to 11.70.TC5 (on 32-bit Windows; generally to 11.70.xC5 or later) and use an expression such as:
SELECT DATE(given_date + n UNITS MONTH)
FROM Wherever
...
The DATETIME code was modified between 11.70.xC4 and 11.70.xC5 to generate dates according to the rules you outline when the dates are as described and you use the + n UNITS MONTH or equivalent notation.
This obviates the need for a table at all. Clearly, though, all your clients would also have to be on 11.70.xC5 or later.
Maybe you can update your development machine to 11.70.xC5 and then use this property to generate the data for the table on your development machine, and distribute the data to your clients.
If upgrading at least someone to 11.70.xC5 is not an option, then consider the Perl script suggestion.

Can it be done with SQL? Probably, but it would be excruciating. Ditto for C, and I think 'no' is the answer for sed.
However, a couple of dozen lines of perl seems to produce what you need:
#!/usr/bin/perl
use strict;
use warnings;
use DateTime;
my @dates;
# parse arguments
while (my $datep = shift){
    my ($m,$d,$y) = split('-', $datep);
    push(@dates, DateTime->new(year => $y, month => $m, day => $d))
        || die "Cannot parse date $!\n";
}
open(STDOUT, ">", "output.unl") || die "Unable to create output file.";
my ($date, $end) = @dates;
while( $date < $end ){
    my @row = ($date->mdy('-')); # start with pk_date
    for my $mth ( qw[ 1 2 3 ] ){
        my $fut_d = $date->clone->add(months => $mth);
        until (
            ($fut_d->month == $date->month + $mth
                && $fut_d->year == $date->year) ||
            ($fut_d->month == $date->month + $mth - 12
                && $fut_d->year > $date->year)
        ){
            $fut_d->subtract(days => 1); # step back until criteria met
        }
        push(@row, $fut_d->mdy('-'));
    }
    print STDOUT join("|", @row, "\n");
    $date->add(days => 1);
}
Save that as futuredates.pl, chmod +x it and execute like this:
$ futuredates.pl 04-01-2014 12-31-2020
That seems to do the trick for me.
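The clamping rule from the question (same day of the target month, or that month's last day if the day does not exist) can also be sketched with only the Python standard library. This is an illustration rather than a drop-in replacement for the script above: `add_months` and `load_rows` are hypothetical names, and the end date is treated as inclusive here.

```python
import calendar
from datetime import date, timedelta


def add_months(d, n):
    """Same day n months later, clamped to the last day of the target month."""
    month_index = d.month - 1 + n
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]  # number of days in that month
    return date(year, month, min(d.day, last_day))


def load_rows(start, end):
    """Yield pipe-delimited rows for every pk_date from start to end inclusive."""
    current = start
    while current <= end:
        cells = [current] + [add_months(current, n) for n in (1, 2, 3)]
        yield "|".join(c.strftime("%m-%d-%Y") for c in cells) + "|"
        current += timedelta(days=1)


for line in load_rows(date(2012, 1, 28), date(2012, 1, 31)):
    print(line)
```

Run on 01-28-2012 through 01-31-2012, this reproduces the first four sample rows of the load file shown in the question.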
