Repeating Capture Groups Regex

I have a large chunk of class data that I need to run a regular expression on and get data back from. The problem is that I need a repeating capture group to accomplish that.
Womn St 157A QUEERHISTORY MAKING
CCode Typ Sec Unt Instructor Time Place Max Enr Req Rstr Status
32680 LEC A 4 SHAH, P. TuTh 11:00-12:20p IAB 131 35 37 60 FULL
Womn St 171 SEX/RACE & CONQUEST
CCode Typ Sec Unt Instructor Time Place Max Enr Req Rstr Status
32710 LEC A 4 O'TOOLE, R. TuTh 2:00- 3:20p DBH 1300 52 13/45 24 OPEN
~ Same as 25610 (GlblClt 103B, Lec A); 26350 (History 169, Lec A); and
~ 60320 (Anthro 139, Lec B).
32711 DIS 1 0 MONSON, A. W 9:00- 9:50 HH 105 25 5/23 8 OPEN
O'TOOLE, R.
~ Same as 25612 (GlblClt 103B, Dis 1); 26351 (History 169, Dis 1); and
~ 60321 (Anthro 139, Dis 1).
The result I need would be two matches:
Match
Group1:Womn St 157A
Group2:QUEERHISTORY MAKING
Group3:32680
Group4:LEC
Group5:A
Group6:SHAH, P.
Group7:TuTh 11:00-12:20p
Group8:IAB 131
Match
Group1:Womn St 171
Group2:SEX/RACE & CONQUEST
Group3:32710
Group4:LEC
Group5:A
Group6:O'TOOLE, R.
Group7:TuTh 2:00- 3:20p
Group8:DBH 1300
Group9:25610
Group10:26350
Group11:60320
Group12:32711
Group13:DIS
Group14:1
Group15:MONSON, A.
Group16: W 9:00- 9:50
Group17:HH 105
Group18:25612
Group19:26351
Group20:60321
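Note that this cannot be done with a single pattern in most regex flavors: Python, PCRE, and JavaScript keep only the last repetition of a repeated capture group (.NET is the main exception, via its CaptureCollection). A minimal Python sketch of the usual two-pass workaround, using a simplified "Same as" line (hypothetical helper code, not tied to any one engine):

import re

line = "~ Same as 25610 ... 26350 ... and 60320 ..."

# A repeated capture group keeps only its final repetition:
m = re.search(r'(?:(\d{5})\D+)+', line)
print(m.group(1))                  # -> '60320'; 25610 and 26350 are lost

# Second pass with findall returns every occurrence instead:
print(re.findall(r'\d{5}', line))  # -> ['25610', '26350', '60320']

In practice you would match each course block first, then run findall/finditer over the block to pull out the repeated cross-listing codes and meeting rows.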

Related

How To Interpret Least Square Means and Standard Error

I am trying to understand the results I got for a fake dataset. I have two independent variables, hours and type, and a response, pain.
First question: How was 82.46721 calculated as the lsmeans for the first type?
Second question: Why is the standard error exactly the same (8.24003) for both types?
Third question: Why is the degrees of freedom 3 for both types?
library(lsmeans)

data = data.frame(
  type = c("A", "A", "A", "B", "B", "B"),
  hours = c(60, 72, 61, 54, 68, 66),
  # pain = c(85, 95, 69, 73, 29, 30)
  pain = c(85, 95, 69, 85, 95, 69)
)
model = lm(pain ~ hours + type, data = data)
lsmeans(model, c("type", "hours"))
> data
type hours pain
1 A 60 85
2 A 72 95
3 A 61 69
4 B 54 85
5 B 68 95
6 B 66 69
> lsmeans(model, c("type", "hours"))
type hours lsmean SE df lower.CL upper.CL
A 63.5 82.46721 8.24003 3 56.24376 108.6907
B 63.5 83.53279 8.24003 3 57.30933 109.7562
Try this:
newdat <- data.frame(type = c("A", "B"), hours = c(63.5, 63.5))
predict(model, newdata = newdat)
An important thing to note here is that your model has hours as a continuous predictor, not a factor. The LS means are simply the model's predictions at the mean of hours (63.5) for each type, which is why predict() reproduces them. Both predictions are made at that same hours value, and with three observations per type (and the two group means of hours sitting symmetrically around 63.5) the two standard errors come out identical. The degrees of freedom are the residual degrees of freedom: 6 observations minus 3 estimated coefficients (intercept, hours, typeB) = 3.
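If it helps to see where those numbers come from, here is a rough translation into Python with pandas and statsmodels (my choice of tools, not part of the original question), mirroring the predict() call above:

import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    "type": ["A", "A", "A", "B", "B", "B"],
    "hours": [60, 72, 61, 54, 68, 66],
    "pain": [85, 95, 69, 85, 95, 69],
})
model = smf.ols("pain ~ hours + type", data=data).fit()

# The LS means are just model predictions at the overall mean of hours:
newdat = pd.DataFrame({"type": ["A", "B"], "hours": [63.5, 63.5]})
print(model.get_prediction(newdat).summary_frame())  # mean, mean_se, CIs
print(model.df_resid)  # 3.0 = 6 observations - 3 estimated coefficients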

Error reading last line of file and inputting into LinkList

So I'm working on an assignment for my class in which I have to read in data from a file and create a doubly linked list with it. I have all the difficult stuff done; now I'm running into a problem where my program prints a bunch of random characters and dies on the last line.
Here is the function that reads in the data and inserts it into my linked list. My professor wrote this, so to be frank, I don't understand it very well.
void PropogateTheList(ifstream & f, LinkList & ML)
{
    static PieCake_struct * record;
    record = new PieCake_struct;
    f >> record->id >> record->lname >> record->fname
      >> record->mi >> record->sex >> record->pORc;
    while(!f.eof())
    {
        ML.Insert(record);
        record = new PieCake_struct;
        f >> record->id >> record->lname >> record->fname
          >> record->mi >> record->sex >> record->pORc;
    }
}
Here is the data that is being propagated:
1 Abay Harege N O C
2 Adcock Leand R F P
3 Anderson Logan B M P
5 Bautista Gloria A F P
10 Beckett Dallas B F C
12 Ambrose Bridget C F C
13 Beekmann Marvin D M P
14 Bacaner Tate D M C
16 Bis Daniel F M P
18 Dale Kaleisa G F C
19 DaCosta Ricardo H M P
23 Adeyemo Oluwanifemi I M C
24 Berger Chelsea J F C
38 Daniels Jazmyn K F P
39 Davis Takaiyh L F C
40 DeJesus Gabriel M M P
51 Castro Floriana N F P
52 Chen Justin O M C
53 Clouden Ariel P F P
54 Conroy Cameron Q M C
61 Contreras Dominic R M P
62 Cooley Kyle S M C
63 Creighton Cara T F P
64 Cullen William U M C
66 Blakey Casey V M C
67 Barbosa Anilda W F P
83 Brecher Benjamin X M P
84 Boulos Alexandre Y F C
85 Barrios Joshua Z M C
85 Bonaventura Nash A M P
86 Bohnsack David B M C
87 Blume Jeffrey C M P
90 Burgman Megan D F C
91 Bursic Gregory E M P
92 Calvo Sajoni F F C
93 Cannan Austin G M P
94 Carballo Nicholas H M C
99 AlbarDiaz Matias I F P
Currently, I sort the data alphabetically by last name, so at about the 5th line, when it tries to print out number 99 (AlbarDiaz) it dies. If I sort the list another way, the program always messes up on whatever the last line of data is. Any help would be great!
UPDATE:
So I've tried adding an if(!f.eof()) check before inserting, but unfortunately it doesn't do anything. I then deleted the last line of data, making Carballo the last person. This is what my function prints out:
****** The CheeseCake Survey ******
Id Last Name First Name MI Sex Pie/Cake
-- -------- --------- -- --- --------
2 Adcock
23 Adeyemo
12 Ambrose
3 Anderson
14 Bacaner
67 Barbosa
85 Barrios
5 Bautista
10 Beckett
13 Beekmann
24 Berger
16 Bis
66 Blakey
87 Blume
86 Bohnsack
85 Bonaventura
84 Boulos
83 Brecher
90 Burgman
91 Bursic
92 Calvo
93 Cannan
0 Carballo????NicholasA8?zL`8?zL`A8?zL`8?zL`??
Wouldn't it be better to read from the stream first, and then check whether it is in an eof state before inserting the element? I'm writing the following code without a compiler's help, here in the edit box, so apologies if I've made any mistakes. Of course the question that arises, or something to think about, is what happens if you try to read from your f stream when eof is already true. To read more about that, see:
Why is iostream::eof inside a loop condition (i.e. `while (!stream.eof())`) considered wrong?
void PropogateTheList(ifstream & f, LinkList & ML)
{
    while(!f.eof())
    {
        PieCake_struct * record = new PieCake_struct;
        f >> record->id >> record->lname >> record->fname
          >> record->mi >> record->sex >> record->pORc;
        if(f)                // insert only if the extraction actually succeeded
            ML.Insert(record);
        else
            delete record;   // don't leak the half-filled record
    }
}

Remove regex pattern from string and store in csv

I am trying to clean up a CSV using regex. I have accomplished the first part, which extracts the matched pattern from the address field and writes it to the street_numb field. The part I need help with is removing that same pattern from the street field, so that only the street names (i.e., Steinway St, 31st St, 82nd Rd, and 19th St) are stored there. Hence these values (-78, -45, -35, -54) would be removed from the street field.
b street_numb street address zipcode
1 246 FIFTH AVE 246 FIFTH AVE 11215
2 30 -78 -78 STEINWAY ST 30 -78 STEINWAY ST 11016
3 25 -45 -45 31ST ST 25 -45 31ST ST 11102
4 123 -35 -35 82ND RD 123 -35 82ND RD 11415
5 22 -54 -54 19TH ST 22 -54 19TH ST 11105
Sample Data (above)
import csv
import re
path = '/Users/darchcruise/Desktop/bldg_zip_codes.csv'
with open(path, 'rU') as infile, open(path+'out.csv', 'w') as outfile:
    fieldnames = ['b', 'street_numb', 'street', 'address', 'zipcode']
    readablefile = csv.DictReader(infile)
    writablefile = csv.DictWriter(outfile, fieldnames=fieldnames)
    for row in readablefile:
        add = re.match(r'\d+\s*-\s*\d+', row['address'])
        if add:
            row['street_numb'] = add.group()
            # row['street'] = remove re.string (add.group()) from street field
            writablefile.writerow(row)
        else:
            writablefile.writerow(row)
What code in line 12 (# remove re.string from row['street']) could be used to resolve my issue (removing -78, -45, -35, -54 from the street field)?
You can use a capturing group with findall, like this:
re.findall(r"(\d+\s*(-\s*\d+\s+)?)((\w|\s)+)", row['address'])[0][0]   # gives the street number
re.findall(r"(\d+\s*(-\s*\d+\s+)?)((\w|\s)+)", row['address'])[0][2]   # gives the street
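For the commented-out line 12 itself, re.sub is one option. A small sketch (hypothetical harness using street values from the sample table, not your real file):

import re

for street in ["FIFTH AVE", "-78 STEINWAY ST", "-45 31ST ST", "-35 82ND RD"]:
    # Strip a leading "-78"-style fragment; plain streets pass through.
    print(re.sub(r'^\s*-\s*\d+\s+', '', street))
# -> FIFTH AVE, STEINWAY ST, 31ST ST, 82ND RD

Inside your loop that would be:

row['street'] = re.sub(r'^\s*-\s*\d+\s+', '', row['street'])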

Find sum of the column values based on some other column

I have an input file like this:
j,z,b,bsy,afj,upz,343,13,ruhwd
u,i,a,dvp,ibt,dxv,154,00,adsif
t,a,a,jqj,dtd,yxq,540,49,kxthz
j,z,b,bsy,afj,upz,343,13,ruhwd
u,i,a,dvp,ibt,dxv,154,00,adsif
t,a,a,jqj,dtd,yxq,540,49,kxthz
c,u,g,nfk,ekh,trc,085,83,xppnl
For every unique value of column 1, I need to find the sum of column 7.
Similarly, for every unique value of column 2, I need to find the sum of column 7.
Output for column 1 should look like:
j,686
u,308
t,98
c,83
Output for column 2 should look like:
z,686
i,308
a,98
u,83
I am fairly new to Python. How can I achieve the above?
This could be done using Python's Counter and csv library as follows:
from collections import Counter
import csv

c1 = Counter()
c2 = Counter()

with open('input.csv') as f_input:
    for cols in csv.reader(f_input):
        col7 = int(cols[6])
        c1[cols[0]] += col7
        c2[cols[1]] += col7

print "Column 1"
for value, count in c1.iteritems():
    print '{},{}'.format(value, count)

print "\nColumn 2"
for value, count in c2.iteritems():
    print '{},{}'.format(value, count)
Giving you the following output:
Column 1
c,85
j,686
u,308
t,1080
Column 2
i,308
a,1080
z,686
u,85
A Counter is a type of Python dictionary that is useful for counting items automatically. c1 holds all of the column 1 entries and c2 holds all of the column 2 entries. Note, Python numbers lists starting from 0, so the first entry in a list is [0].
The csv library loads each line of the file into a list, with each entry in the list representing a different column. The code takes column 7 (i.e. cols[6]) and converts it into an integer, as all columns are held as strings. It is then added to the counter using either the column 1 or 2 value as the key. The result is two dictionaries holding the totaled counts for each key.
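For example, a key that has never been seen starts at zero, so the running sums need no initialisation:

from collections import Counter

c = Counter()
c['j'] += 343   # 'j' did not exist yet, so it starts from 0
c['j'] += 343
print(c['j'])   # 686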
You can use pandas:
import pandas as pd

df = pd.read_csv('my_file.csv', header=None)
print(df.groupby(0)[6].sum())
print(df.groupby(1)[6].sum())
Output:
0
c 85
j 686
t 1080
u 308
Name: 6, dtype: int64
1
a 1080
i 308
u 85
z 686
Name: 6, dtype: int64
The data frame should look like this:
print(df.head())
Output:
0 1 2 3 4 5 6 7 8
0 j z b bsy afj upz 343 13 ruhwd
1 u i a dvp ibt dxv 154 0 adsif
2 t a a jqj dtd yxq 540 49 kxthz
3 j z b bsy afj upz 343 13 ruhwd
4 u i a dvp ibt dxv 154 0 adsif
You can also use your own names for the columns, like c1, c2, ..., c9:
df = pd.read_csv('my_file.csv', index_col=False, names=['c' + str(x) for x in range(1, 10)])
print(df)
Output:
c1 c2 c3 c4 c5 c6 c7 c8 c9
0 j z b bsy afj upz 343 13 ruhwd
1 u i a dvp ibt dxv 154 0 adsif
2 t a a jqj dtd yxq 540 49 kxthz
3 j z b bsy afj upz 343 13 ruhwd
4 u i a dvp ibt dxv 154 0 adsif
5 t a a jqj dtd yxq 540 49 kxthz
6 c u g nfk ekh trc 85 83 xppnl
Now, group by column c1 or column c2 and sum up column c7:
print(df.groupby(['c1'])['c7'].sum())
print(df.groupby(['c2'])['c7'].sum())
Output:
c1
c 85
j 686
t 1080
u 308
Name: c7, dtype: int64
c2
a 1080
i 308
u 85
z 686
Name: c7, dtype: int64
SO isn't supposed to be a code-writing service, but I had a few minutes. :) Without pandas, you can do it with the csv module:
import csv

def sum_to(results, key, add_value):
    if key not in results:
        results[key] = 0
    results[key] += int(add_value)

column1_results = {}
column2_results = {}

with open("input.csv", 'rt') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        sum_to(column1_results, row[0], row[6])
        sum_to(column2_results, row[1], row[6])

print column1_results
print column2_results
Results:
{'c': 85, 'j': 686, 'u': 308, 't': 1080}
{'i': 308, 'a': 1080, 'z': 686, 'u': 85}
Your expected results don't seem to match the math that Mike's answer and mine get from your spec; I'd double-check that.

Remove Data from Address Line

I have the following addresses that I pulled from a database. I am trying to clear everything after ST|AVE|BLVD, i.e. to get rid of the trailing 1ST or the random 1.
9999-1000 N CLARK ST 1 1
4567-5678 W BELMONT AVE
1200 N HAMLIN AVE 1ST 1
8220 W CERMAK RD 1ST
1240 W 69TH ST 1ST
7901 W ADDISON ST 1ST
So that it reads:
1. 9999-1000 N CLARK ST
2. 4567-5678 W BELMONT AVE
3. 1200 N HAMLIN AVE
4. 8220 W CERMAK RD
5. 1240 W 69TH ST
6. 7901 W ADDISON ST
You can try the following regex:
^(.*?\s(?:ST|AVE|BLVD|RD))\b.*$
Your cleaned address is in capturing group 1. RD is included so that CERMAK RD survives, and requiring whitespace before the alternation keeps the ST inside 1ST from matching.
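A quick way to check the pattern against the sample lines, as a Python sketch (any regex tester would work just as well):

import re

lines = [
    "9999-1000 N CLARK ST 1 1",
    "4567-5678 W BELMONT AVE",
    "1200 N HAMLIN AVE 1ST 1",
    "8220 W CERMAK RD 1ST",
    "1240 W 69TH ST 1ST",
    "7901 W ADDISON ST 1ST",
]
pattern = re.compile(r'^(.*?\s(?:ST|AVE|BLVD|RD))\b.*$')
for line in lines:
    m = pattern.match(line)
    print(m.group(1) if m else line)  # prints the cleaned address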