Separating a string in Excel VBA - regex

I have a series (thousands and thousands) of call records that I'm trying to turn into a spreadsheet. They're all in a text file. The format looks like this:
12/ 13/ 05 Syracuse, NY 10: 22 AM 111- 111- 1111 2 $ - $ - $ -
12/ 13/ 05 New York, NY 10: 28 AM 111- 111- 1111 (F) 2 $ - $ - $ -
12/ 13/ 05 Orlando, FL 10: 48 AM 111- 111- 1111 (F) 4 $ - $ - $ -
3/ 9/ 09 Internal 4: 51 PM 111- 111- 1111 (E) 23 $ - $ - $ -
10/ 14/ 11 Colorado Site 8: 12 AM 111- 111- 1111 14 $ - $ - $ -
1/ 3/ 12 Dept 27 3: 16 PM 111- 111- 1111 (F) 93 $ - $ - $ -
11/ 12/ 12 Internal 3: 13 PM 18765 (E) 16 $ - $ - $ -
11/ 14/ 12 Internal 11: 43 AM John Doe 3 $ - $ - $ -
Month/ day/ year/ city called, STATE HH: MM APM 123- 456 7890 OptionalCode $Charge $Tax $Total
This is, minus details, directly from the file. No quotes around strings, no tabs. I tried to use Text to Columns, but some city names contain spaces and others don't.
Anyone want to point me in the right direction? RegEx maybe (Which I've heard of but never used)? Something else?
Update:
Thanks for the early feedback. The lines are actual data from my file, though I stripped city and phone numbers. I've updated with the city information to show variance there. As far as I can see, none of the city names contain a comma, but I'm dealing with close to 120,000 lines total and, obviously, haven't checked them all.
The city won't always have a space: Syracuse above doesn't, while New York does. The month and day aren't always 2 digits either, which also throws off checks for length. I can read to the first, then the second forward slash, though, since those always follow the month and day values.
And the bracketed code doesn't always appear... sometimes it's there, sometimes not, though they do appear to only ever be one letter when they arrive.
I hope this clears a few things up. This would have been far easier if it was stored correctly in the first place. Sigh.
Updates 2, 3 & 4: Added a few lines from the call log, per Robin's request.

I know you asked for a VBA solution, but I do my call record parsing purely in a spreadsheet with formulae.
I have uploaded a workbook solution here (version 3).
Once you have the workbook open, copy and paste the contents of your text file into cell A2. Then fill down the range B2:X2 as far as necessary.
The formulae will work with any variation in length of month, day, year, city, state, time, code, charge, tax and total.
Let me know if any lines break. You can easily check for these by using the AutoFilter dropdown in the headers to select for errors/extraneous values. Append any offending lines to your question.
Updates:
Version 2 takes care of the situation where the City field contains a location name, and the State field is blank.
Version 3 takes care of the situation where the Phone Number field contains an extension number or name.

Something like this might work if there are no commas in the city name.
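Sub foo()
    ' Fixed-width parse of one record. Assumes a two-digit month and day
    ' and a comma between city and state.
    thisLine = "12/ 13/ 05 City Name, ST 10: 28 AM 111- 111- 1111 (F) 2 $ - $ - $ -"
    ' The date occupies the first 10 characters
    thisDate = Mid(thisLine, 1, 10)
    thisLine = Mid(thisLine, 12)
    ' The city is everything before the first comma
    firstComma = InStr(1, thisLine, ",")
    City = Mid(thisLine, 1, firstComma - 1)
    thisLine = Mid(thisLine, firstComma + 2)
    ' Two-letter state, then a 9-character time such as "10: 28 AM"
    State = Left(thisLine, 2)
    thisLine = Mid(thisLine, 4)
    thisTime = Left(thisLine, 9)
    thisLine = Mid(thisLine, 11)
    ' 14-character phone number such as "111- 111- 1111"
    thisPhone = Left(thisLine, 14)
    thisLine = Mid(thisLine, 16)
    ' What remains is the optional code (plus the number after it)
    ' followed by three "$" amounts
    tempArray = Split(thisLine, "$")
    If UBound(tempArray) = 3 Then
        optionalCode = tempArray(0)
        charge = "$" & tempArray(1)
        tax = "$" & tempArray(2)
        Total = "$" & tempArray(3)
    Else
        ' throw an error something went wrong
    End If
End Sub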
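Since the question asks about regular expressions, here is a minimal sketch of a pattern-based parse, written in Python rather than VBA. It is only checked against the sample lines in the question, so the 120,000-line file may need adjustments. The field names (place, callee, units) are my own labels, not anything from the file, and "units" is a guess at what the number before the charges means.

```python
import re

# One pattern per record. "place" and "callee" are matched lazily so that
# variable-width values ("New York, NY", "Dept 27", "John Doe") still work.
CALL_RE = re.compile(
    r"^(?P<month>\d{1,2})/ (?P<day>\d{1,2})/ (?P<year>\d{2}) "
    r"(?P<place>.+?) "
    r"(?P<hour>\d{1,2}): (?P<minute>\d{2}) (?P<ampm>[AP]M) "
    r"(?P<callee>.+?)"
    r"(?: \((?P<code>[A-Z])\))? "      # optional one-letter code, e.g. (F)
    r"(?P<units>\d+) "                 # number before the charges (meaning assumed)
    r"\$ (?P<charge>\S+) \$ (?P<tax>\S+) \$ (?P<total>\S+)$"
)


def parse_call(line):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    match = CALL_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Split "City, ST" on the last comma; locations such as "Internal"
    # have no state, so the state is left empty.
    city, sep, state = record.pop("place").rpartition(", ")
    if sep:
        record["city"], record["state"] = city, state
    else:
        record["city"], record["state"] = state, ""
    return record
```

Lines for which parse_call returns None can be collected and inspected by hand, which is probably the quickest way to find formats the pattern does not cover yet.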


Splitsample in Stata 16: How to create samples based on varying proportions saved in a variable?

Data structure: I use panel data in which an observation represents a certain individual in a given year (2015-2021). Only observations of individuals between 15 and 25 years old are included. There are 2857 observations of 1373 individuals in total.
Goal: The goal is to investigate the effect of a policy change in 2018. To do so, I set up a quasi-experimental design with two control groups and a treatment group defined in terms of age:
Controlgroup A: individuals 15-17 years old
Treatmentgroup: individuals 18-22 years old
Controlgroup B: individuals 23-25 years old
Dividing individuals into treatment and controlgroups based on varying chance:
due to methodological reasons, individuals selected in a controlgroup may not become part of the treatment group (due to aging over time) and vice versa. Therefore I am confronted with the question how to select the right individuals (given their age and the year) for the treatment and controlgroups.
To ensure that every year has observations of individuals in all ages, I came up with the following design (see picture).
There are 17 theoretically possible individuals in my data (vertical as in the picture) who age over 7 years (2015-2021). I would like to sample the individuals into the treatment and controlgroups based on the chances mentioned in the table beneath to ensure all ages are represented in all years.
Programming
I constructed a variable (1-17) indicating what number an individual represents (like the vertical numbers in the table above)
gen individualnumber=(age-year)+2007
I constructed three variables indicating the chances of being in controlgroup A, B or treatment in the following way:
gen Chanceofbeingcontrol_1517=0
replace Chanceofbeingcontrol_1517=1 if individualnumber==1 | individualnumber==2 | individualnumber==3
replace Chanceofbeingcontrol_1517=0.75 if individualnumber==4
replace Chanceofbeingcontrol_1517=0.60 if individualnumber==5
replace Chanceofbeingcontrol_1517=0.50 if individualnumber==6
replace Chanceofbeingcontrol_1517=0.43 if individualnumber==7
replace Chanceofbeingcontrol_1517=0.29 if individualnumber==8
replace Chanceofbeingcontrol_1517=0.14 if individualnumber==9
gen Chanceofbeingcontrol_2325=0
replace Chanceofbeingcontrol_2325=1 if individualnumber==15 | individualnumber==16 | individualnumber==17
replace Chanceofbeingcontrol_2325=0.75 if individualnumber==14
replace Chanceofbeingcontrol_2325=0.60 if individualnumber==13
replace Chanceofbeingcontrol_2325=0.50 if individualnumber==12
replace Chanceofbeingcontrol_2325=0.43 if individualnumber==11
replace Chanceofbeingcontrol_2325=0.29 if individualnumber==10
replace Chanceofbeingcontrol_2325=0.14 if individualnumber==9
gen Chanceofbeingtreated=1-(Chanceofbeingcontrol_1517+Chanceofbeingcontrol_2325)
After that I wanted to construct the samples...
splitsample, generate(treatedornot) split(Chanceofbeingcontrol_1517 Chanceofbeingtreated Chanceofbeingcontrol_2325) cluster(individualnumber) rround show
...but I received an error since only a numlist might be used in the split(numlist) subcommand.
Question: How to construct the samples or overcome this error in an efficient way?
Example: An individual (number 7 in the table) who is 15 years old in 2015 (control group A age) will be 18 years old in 2018 (which is the treatment age). But this individual may not be part of both the treatment and control group and should therefore be a member of only one of the two. Therefore I want to draw three random samples among all number 7 individuals.
Let's say there are 100 individuals like individual 7 in the table.
Sample 1 is controlgroup A and individual 7 will occur 43 times in this sample
Sample 2 is the treatment group so individual 7 occurs 57 times in this sample
While individual 7 will not occur in sample 3 since this person is never older than 22 during 2015-2021.
What all people who were 9 in 2015, 10 in 2016 and 11 in 2017 have in common is that they were born in 2006. And all who were 10 in 2015 were born in 2005. So instead of a variable individualnumber that can be hard to understand for someone who reads your code, why don't you create a variable called birthyear? That will make it easier to explain your design to your peers.
Regardless of what you call the variable or what the value it contains represent, I would solve it something like this. You will probably need to tweak this code. Provide a replicable subset of your data (see the command dataex) if you want a replicable answer.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id int year double age
1 2017 15
1 2017 15
2 2017 15
2 2017 15
3 2017 15
3 2017 15
4 2017 15
4 2017 15
5 2015 12
5 2015 12
end
* Create the var that will hold each individual's birth year
gen birthyear = year-age
preserve
* Collapse year-person level data to person level so
* that each individual only get one treatment status.
* You must have an individual id number for this
* Get standard deviation to test that data is good and the birthyear
* is identical for each individual across the panel data set
collapse (mean) birthyear (sd) bysd=birthyear, by(id)
* Test that birthyear is the same across all rows of each individual - this is
* not needed, but it is a good data quality test. Then drop the var as it is not needed
assert bysd == 0
drop bysd
* Set seed to make replicable. Replace this seed when you have tested this
* script using a new random seed. For example from here:
* https://www.random.org/integers/?num=1&min=100000&max=999999&col=5&base=10&format=html&rnd=new
set seed 123456
*Generate a random number based on the seed
gen random_draw = runiform()
* For each birthyear, get the rank of the random number divided by the number
* of individuals in each birthyear
sort birthyear random_draw
by birthyear : gen percent_rank = _n/_N
*Initiate treatment variable
gen tmt_status = .
label define tmt_status 0 "Treated" 1 "ControlA" 2 "ControlB"
label values tmt_status tmt_status
*Assign birthyear 2006-2004 that are all the same
replace tmt_status = 1 if birthyear == 2006
replace tmt_status = 1 if birthyear == 2005
replace tmt_status = 1 if birthyear == 2004
*Assign birthyear 2003
replace tmt_status = 0 if birthyear == 2003 & percent_rank <= .25
replace tmt_status = 1 if birthyear == 2003 & percent_rank > .25
*Assign birthyear 2002
replace tmt_status = 0 if birthyear == 2002 & percent_rank <= .40
replace tmt_status = 1 if birthyear == 2002 & percent_rank > .40
*Fill in birthyear 2001-1999
*Assign year 1998
replace tmt_status = 0 if birthyear == 1998 & percent_rank <= .72
replace tmt_status = 1 if birthyear == 1998 & percent_rank > .72 & percent_rank <= .86
replace tmt_status = 2 if birthyear == 1998 & percent_rank > .86
*Fill in birthyear 1997-1990
* Do some tabulates etc to convince yourself the randomization is as expected
* Save tempfile of data to be merged to later
* (Consider saving this as a master data set https://worldbank.github.io/dime-data-handbook/measurement.html#constructing-master-data-sets)
tempfile assignment_results
save `assignment_results'
restore
merge m:1 id using `assignment_results'
This code can be made more concise using loops, but random assignment is so important that I personally always go for clarity over conciseness when doing this.
This does not answer the splitsample question specifically, but it addresses what you are trying to do. You will have to decide how to handle groups whose size cannot be split into the exact ratio you prefer.
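The same idea (shuffle the individuals within each birth year, then cut the shuffled list at the cumulative shares) can be sketched outside Stata as well. Below is a rough Python illustration, not a replacement for the do-file above. The function name, the dictionary layout and the seed are all made up for the example, and rounding the cut points is just one possible answer to the question of groups that do not split exactly.

```python
import random
from collections import defaultdict


def assign_by_share(birthyear_by_id, shares_by_birthyear, seed=123456):
    """Give each id exactly one status, drawn within its birth-year cohort.

    birthyear_by_id:     {person_id: birthyear}
    shares_by_birthyear: {birthyear: {status_label: share}}, shares sum to 1
    """
    rng = random.Random(seed)  # fixed seed so the assignment is replicable
    cohorts = defaultdict(list)
    for person_id, birthyear in sorted(birthyear_by_id.items()):
        cohorts[birthyear].append(person_id)

    status = {}
    for birthyear, ids in cohorts.items():
        rng.shuffle(ids)  # random order plays the role of percent_rank
        labels = list(shares_by_birthyear[birthyear])
        start, cumulative = 0, 0.0
        for position, label in enumerate(labels):
            cumulative += shares_by_birthyear[birthyear][label]
            # Cut points are rounded; the last label takes whatever is left
            is_last = position == len(labels) - 1
            end = len(ids) if is_last else round(cumulative * len(ids))
            for person_id in ids[start:end]:
                status[person_id] = label
            start = end
    return status
```

With 100 individuals born in 2000 (number 7 in the question's table) and shares of 0.43, 0.57 and 0, this puts 43 people in control group A and 57 in the treatment group, matching the example in the question.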

Using dplyr mutate to find position of character in string

I have a data frame with a column of strings, each consisting of a numeric id followed by "-" and then a month and year. I am trying to parse the string to get the month and year. As a very first step, I used dplyr::mutate() and regexpr(), with the call
regexpr("-",yearid)[1]
to create a new column that shows the position of this "-" character. But it seems that regexpr() behaves very differently inside mutate() than when used on its own. It does not seem to update depending on the string, but carries over the string position from previous rows. In the example below I expect the position of the "-" character to be 4, 4, and 5 for the respective yearid values. But I get 4, 4, and 4, so the last 4 is not correct. When I run regexpr() separately I don't see this issue.
I am wondering if I am missing something, and how I can get the position of "-" dynamically for each value of yearid. Maybe there is an easier way to get January and 1997.
yearid <- c("50 - January 1995","51 - January 1996","100 - January 1997")
data.df <- data.frame(yearid)
data.df <- mutate(data.df, trimpos = regexpr("-",str_trim(yearid))[1],
pos = regexpr("-",yearid)[1])
> data.df
yearid trimpos pos
1 50 - January 1995 4 4
2 51 - January 1996 4 4
3 100 - January 1997 4 4
On the other hand using regexpr as such I get the output as expected:
> regexpr("-",yearid[1])[1]
[1] 4
> regexpr("-",yearid[2])[1]
[1] 4
> regexpr("-",yearid[3])[1]
[1] 5
Finally, I have my sessionInfo() below
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.0.0 dplyr_0.4.1 readr_0.2.2.9000
loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1 knitr_1.10.5 lazyeval_0.1.10.9000 magrittr_1.5 parallel_3.1.1
[7] Rcpp_0.11.6 stringi_0.4-1 tools_3.1.1
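The regexpr function (which is part of base R, not stringr) is vectorized: it returns an integer vector with one position per input string, plus attributes such as match.length and useBytes. In the question, the trailing [1] keeps only the first element of that vector, i.e. the position for the first row, and mutate recycles that single value across all rows, which is why every row shows 4. As mentioned in the comments, the full vector can be assigned directly to the data frame. This can be done using the mutate function or without.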
library(dplyr)
library(stringr)
id_month_year <- c(
"50 - January 1995",
"51 - January 1996",
"100 - January 1997"
)
data <- data.frame(id_month_year, another_column = 1)
## create new column using mutate
data <- data %>% mutate(pos1 = regexpr("-", id_month_year))
## create new column without mutate
data$pos2 <- regexpr("-", data$id_month_year)
print(data)
Here are the new columns:
id_month_year another_column pos1 pos2
1 50 - January 1995 1 4 4
2 51 - January 1996 1 4 4
3 100 - January 1997 1 5 5
I would suggest using the separate function from the tidyr library. Here's an example code snippet:
library(dplyr)
library(tidyr)
id_month_year <- c(
"50 - January 1995",
"51 - January 1996",
"100 - January 1997"
)
data <- tbl_df(data.frame(id_month_year, another_column = 1))
clean <- data %>%
separate(
id_month_year,
into = c("id", "month", "year"),
sep = "[- ]+",
convert = TRUE
)
print(clean)
And here's the resulting clean data frame:
Source: local data frame [3 x 4]
id month year another_column
(int) (chr) (int) (dbl)
1 50 January 1995 1
2 51 January 1996 1
3 100 January 1997 1
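For comparison, the two steps discussed above (one position per string, then splitting on the same separator pattern) look like this in Python. This is only an illustrative sketch of the logic, not R code. It reuses the yearid values from the question, and the helper name is made up.

```python
import re

yearid = ["50 - January 1995", "51 - January 1996", "100 - January 1997"]

# One position per element, 1-based as in R. The bug in the question came
# from keeping only the first of these values and reusing it for every row.
positions = [s.find("-") + 1 for s in yearid]


def split_id_month_year(s):
    """Split on runs of spaces and hyphens, like sep = "[- ]+" in separate()."""
    id_, month, year = re.split(r"[- ]+", s.strip())
    return int(id_), month, int(year)


rows = [split_id_month_year(s) for s in yearid]
```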

How to join two txt files, record by record, with fuzzy key? (approximate matching)

I have multiple text files, containing records with multiple fields delimited by tabs, all files have a "key" that is fuzzy (as a name can be, with spelling issues and typos).
File 1: Format: Title \t field1 \t ... fieldn \n
Title               Original title  Year
21                  21              2008
21 Jump Street      21 Jump Street  2012
22 Jump Street      22 Jump Street  2014
27 volte in bianco  27 Dresses      2008
Clerks - Commessi   Clerks          2006
...
File 2: Format: Title \t field1 \t ... fieldn \n
Title               Director
21                  Rob
21 Jump Street      Lord&Miller
22 Jump Street      Lord,Miller
27 volte in bianco  Fletcher
Clerks: Commessi    Smith
...
File 3: Format: Title \t field1 \t ... fieldn \n
Title               Filename
21                  "21.mkv"
21 Jump Street      "21 Jump St.avi"
27 volte in bianco  "27 Dresses.avi"
Clerks - Commessi   "Clerks.avi"
File n: Format: Title \t field1 \t ... fieldn \n
Title               Descripted in
21                  "21.mht"
21 Jump Street      "21JS.mht"
22 Jump Street      "22.mht"
27 volte in bianco  "27dres.mht"
Clerks - Commessi   "Clerks.mht"
I would like to create an output that joins, in order, all records (including incomplete and unmatched ones) of all files, using Title as the key but allowing small differences between keys (see how Clerks uses : instead of - in file 2), ideally giving a warning when the match is not exact (char by char):
Output: Format field1 \t field2 \t ... fieldn \n
Warning  Title               Original title  Year  Director     Filename        Description
No       21                  21              2008  Rob          "21.mkv"        "21.mht"
No       21 Jump Street      21 Jump Street  2012  Lord&Miller  "21 JS.avi"     "21JS.mht"
No       22 Jump Street      22 Jump Street  2014  Lord,Miller                  "22.mht"
No       27 volte in bianco  27 Dresses      2008  Fletcher     "27 Dress.avi"  "27dress.mht"
Yes      Clerks - Commessi   Clerks          2006  Smith        "clerks.avi"    "Clerks.mht"
How can a fuzzy match be done and marked with warning=Yes (as with the similar but not equal key for Clerks in file 2)? And how can missing records be handled? Note that the 3rd record (22 Jump Street) does not have a record in file 3, so all missing fields must be replaced by tabs only (\t) in the output file to maintain the correct column order in all records.
The most difficult parts are these two, at least for me.
Any suggestions? Is there a tool suited to the job?
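One possible approach, sketched in Python with the standard library's difflib module: match titles exactly first, then try an approximate match only among the titles that are still unclaimed, and flag those rows with a warning. The two-pass order matters here, because "21 Jump Street" and "22 Jump Street" are themselves very similar strings. The function names, the dictionary layout and the 0.8 cutoff are assumptions made for this sketch; it also assumes every table is non-empty and titles are unique within a file.

```python
import difflib


def fuzzy_join(master, others, cutoff=0.8):
    """Join tables of the form {title: [field, ...]} onto the master table.

    Returns {title: {"warning": "Yes"/"No", "fields": [...]}}. Missing
    fields become empty strings so every row keeps the same column count.
    """
    rows = {t: {"warning": "No", "fields": list(f)} for t, f in master.items()}
    ncols = len(next(iter(master.values())))
    for table in others:
        width = len(next(iter(table.values())))
        unclaimed = dict(table)
        matched = {}
        # Pass 1: exact matches claim their records first
        for title in rows:
            if title in unclaimed:
                matched[title] = unclaimed.pop(title)
        # Pass 2: approximate matches, only among records nobody claimed
        for title, row in rows.items():
            if title in matched or not unclaimed:
                continue
            close = difflib.get_close_matches(
                title, list(unclaimed), n=1, cutoff=cutoff)
            if close:
                matched[title] = unclaimed.pop(close[0])
                row["warning"] = "Yes"
        # Append this table's fields, or blanks where nothing matched
        for title, row in rows.items():
            row["fields"].extend(matched.get(title, [""] * width))
        # Records found in no earlier file become new, left-padded rows
        for title, fields in unclaimed.items():
            rows[title] = {"warning": "No",
                           "fields": [""] * ncols + list(fields)}
        ncols += width
    return rows


def to_lines(rows):
    """Render the joined rows as tab-delimited lines."""
    return ["\t".join([row["warning"], title, *row["fields"]])
            for title, row in rows.items()]
```

Reading the input files into dictionaries can be done with the csv module using delimiter="\t". The cutoff will probably need tuning on real data, since a value that is too low pairs up different films and one that is too high misses typos.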

space delimited file handling

I have insider transactions of a company in a space delimited file. Sample data looks like the following:
1 Gilliland Michael S January 2,2013 20,000 19
2 Still George J Jr January 2,2013 20,000 19
3 Bishkin S. James February 1,2013 150,000 21
4 Mellin Mark P May 28,2013 238,000 25.26
Col1 is the serial number, which I don't need to print
Col2 is the name of the person who made the trades. This column is not consistent. It has a first name, a second name and a middle initial, and for some insiders titles or suffixes as well (Mr, Dr, Jr etc.)
col3 is the date format Month Day,Year
col4 is the number of shares traded
col5 is the price at which shares were purchased or sold.
I need your help to print each column value separately. Thanks for your help.
Count the total number of fields read; the difference between that and the number of non-name fields gives you the width of the name.
#!/bin/bash
# uses bash features, so needs a /bin/bash shebang, not /bin/sh
# read all fields into an array
while read -r -a fields; do
# calculate name width assuming 5 non-name fields
name_width=$(( ${#fields[@]} - 5 ))
cur_field=0
# read initial serial number
ser_id=${fields[cur_field]}; (( ++cur_field ))
# read name
name=''
for ((i=0; i<name_width; i++)); do
name+=" ${fields[cur_field]}"; (( ++cur_field ))
done
name=${name# } # trim leading space
# date spans two fields due to containing a space
date=${fields[cur_field]}; (( ++cur_field ))
date+=" ${fields[cur_field]}"; (( ++cur_field ))
# final fields are one span each
num_shares=${fields[cur_field]}; (( ++cur_field ))
price=${fields[cur_field]}; (( ++cur_field ))
# print in newline-delimited form
printf '%s\n' "$ser_id" "$name" "$date" "$num_shares" "$price" ""
done
Run as follows (if you saved the script as process):
./process <input.txt >output.txt
It might be a little easier in perl.
perl -lane '
@date = splice @F, -4, 2;
@left = splice @F, -2, 2;
splice @F, 0, 1;
print join "|", "@F", "@date", @left
' file
Gilliland Michael S|January 2,2013|20,000|19
Still George J Jr|January 2,2013|20,000|19
Bishkin S. James|February 1,2013|150,000|21
Mellin Mark P|May 28,2013|238,000|25.26
You can change the delimiter in the join as per your requirement.
Here is the data separated using awk
awk '{c1=$1;c5=$NF;c4=$(NF-1);c3=$(NF-3)FS$(NF-2);$1=$NF=$(NF-1)=$(NF-2)=$(NF-3)="";gsub(/^ | *$/,"");c2=$0;print c1"|"c2"|"c3"|"c4"|"c5}' file
1|Gilliland Michael S|January 2,2013|20,000|19
2|Still George J Jr|January 2,2013|20,000|19
3|Bishkin S. James|February 1,2013|150,000|21
4|Mellin Mark P|May 28,2013|238,000|25.26
You now have your data in variables c1 to c5.
Or better displayed here:
awk '{c1=$1;c5=$NF;c4=$(NF-1);c3=$(NF-3)FS$(NF-2);$1=$NF=$(NF-1)=$(NF-2)=$(NF-3)="";gsub(/^ | *$/,"");c2=$0;print c1"|"c2"|"c3"|"c4"|"c5}' file | column -t -s "|"
1 Gilliland Michael S January 2,2013 20,000 19
2 Still George J Jr January 2,2013 20,000 19
3 Bishkin S. James February 1,2013 150,000 21
4 Mellin Mark P May 28,2013 238,000 25.26
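All of the answers above rely on the same observation: only the name has a variable number of words, so the other fields can be taken by counting from the ends of the line. For anyone who prefers Python, a small sketch of that idea follows. The function name and the dictionary keys are made up for the example.

```python
def split_record(line):
    """Split one record into serial, name, date, shares and price.

    The serial is the first token, and the last four tokens are the date
    (two tokens), the share count and the price. Whatever lies in between
    is the name, however many words it has.
    """
    parts = line.split()
    if len(parts) < 6:
        raise ValueError("expected at least 6 fields: " + repr(line))
    return {
        "serial": parts[0],
        "name": " ".join(parts[1:-4]),
        "date": " ".join(parts[-4:-2]),
        "shares": parts[-2],
        "price": parts[-1],
    }
```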

RegEx String Validator

In my MVC3 application, I am saving the Date of Birth on one of the entities as a string. Why? Because my application allows storing the date of birth of people long dead (e.g. Socrates, Plato, Epicurus), and as far as I know the DateTime class doesn't allow that.
Now obviously we don't know the exact date of birth of Epicurus for example, we just know the year of birth [ 341 BCE ], so what I am thinking of doing is building a custom validator, that will validate the input string for the Date of Birth and make sure that they all match the following format:
12 Feb 1809
Feb 1809
341
341 BCE
Oct 341 BCE
11 Mar 5 BCE
I need to write a regular expression that will match any of the above, and of course not match anything else.
Update
Thank you very much, I wish I was as good as you lot in building RegExes ! Since my application is with ASP.net MVC3, I would like to stick with the .NET RegEx class (for convenience's sake).
luastoned's answer seems to work; I can't seem to break its logic with all the test data I've thrown at it.
One thing though: can I also allow BC? Some people use BC and others use BCE, so would that be possible? And am I right that the regular expression cannot replace BC with BCE, so I have to do that manually in my C# code, since the RegEx would just either match or not? Is that correct?
Update 2
M42's Regular Expression seems to be working better. I've just copied it and used it in my Custom Validator (code in PasteBin link below).
How about :
/^(?:\d+\s)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?(?:\s?\d+)(?:\sBCE)?$/
Here is a perl script with test cases:
#!/usr/local/bin/perl
use strict;
use warnings;
use Test::More;
my $re1 = qr/^(?:\d+\s)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?(?:\s?\d+)(?:\sBCE)?$/;
while(<DATA>) {
chomp;
if (/$re1/) {
ok(1, "input = $_");
} else {
ok(0, "input = $_");
}
}
done_testing;
__DATA__
12 Feb 1809
Feb 1809
341
341 BCE
Oct 341 BCE
11 Mar 5 BCE
12D09
1s909
A3 43 4 BCE
a 1
3F9
abc
BCE
123b456
output:
# Looks like you failed 9 tests of 15.
ok 1 - input = 12 Feb 1809
ok 2 - input = Feb 1809
ok 3 - input = 341
ok 4 - input = 341 BCE
ok 5 - input = Oct 341 BCE
ok 6 - input = 11 Mar 5 BCE
not ok 7 - input = 12D09
not ok 8 - input = 1s909
not ok 9 - input = A3 43 4 BCE
not ok 10 - input = a 1
not ok 11 - input = 3F9
not ok 12 - input =
not ok 13 - input = abc
not ok 14 - input = BCE
not ok 15 - input = 123b456
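On the follow-up about BC: making the final E optional (BCE?) accepts both spellings, and the normalization to BCE is then a separate replace step after validation. Here is a minimal sketch of both steps in Python. The pattern is the one above with only that change; the function name is made up, and whether it behaves identically under the .NET Regex class is an assumption I have not tested.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Same structure as the pattern above: optional day, optional month,
# required year, optional era. "BCE?" accepts both BC and BCE.
DOB_RE = re.compile(
    r"(?:\d+\s)?"
    r"(?:" + MONTHS + r")?"
    r"(?:\s?\d+)"
    r"(?:\sBCE?)?"
)


def normalize_dob(text):
    """Return the date with BC rewritten as BCE, or None if it is invalid."""
    if DOB_RE.fullmatch(text) is None:
        return None
    # Matching only validates; the rewrite is a separate substitution
    return re.sub(r"(\s)BC$", r"\1BCE", text)
```

Note that, like the original pattern, this still accepts a day followed directly by a year (for example "12 1809"), so the validator may want an extra check if that form should be rejected.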
This looks like the weirdest Regex I've ever made:
(\d+\s?)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?\s(\d+\s)?(BCE)?
I have no idea how many false positives would go through though..
You can check the sample on Regexr
Not quite a friend of RegExr (and not knowing the limitations of regexes in MVC3), allow me to present a PHP version with named captures (demo):
(?:(?:(?P<date>\d{1,2})\s)?(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))?(?:(?:^|\s)(?P<year>\d+))?(?:\s(?P<bce>BCE))?
This is based on @luastoned's answer.