How can I pinpoint a pattern with regex?

I am trying to analyze strings in XML files with a regex; the fragments I need to match look like "20, 38,", "20, 24 n2,", "20, 28, 38,", and "851, 859 n3,".
Example text:
<p>Gilmer v Interstate/Johnson Lane Corp. (1991) 500 US 20, 38, 111 S Ct 1647:</p>
<p>Gilmer v Interstate/Johnson Lane Corp. (1991) 500 US 20, 24 n2, 111 S Ct 1647</p>
<p>Gilmer v Interstate/Johnson Lane Corp.</italic> (1991) 500 US 20, 28, 38, 111 S Ct 1647</p>
<p>International Bhd. of Elec. Workers v Hechler (1987) 481 US 851, 859 n3, 107 S Ct 2161:</p>
I want to modify my regex (\([^()]*)|([0-9]+,)\s*[0-9]+,?\s*[0-9]+, because I am replacing the matched text with $1$2.
(https://regex101.com/r/jWt2w1/2)

Use
(\([^()]*)|([0-9]+,)\s*[0-9]+(?:\s+[a-z]+)?,?\s*[0-9]+(?:\s+[a-z]+)?,
See the regex demo.
The (?:\s+[a-z]+)? part optionally matches one or more whitespace characters followed by one or more lowercase letters.
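A minimal sketch of the same replacement in Python (assuming Python 3.5+, where re.sub substitutes an empty string for the alternative that did not participate in the match; in the regex101 flavor the replacement stays $1$2):

import re

# Either keep a parenthesised chunk untouched (group 1), or capture the first
# number-plus-comma (group 2) and drop the page references that follow it.
pattern = re.compile(
    r'(\([^()]*)|([0-9]+,)\s*[0-9]+(?:\s+[a-z]+)?,?\s*[0-9]+(?:\s+[a-z]+)?,')

text = '<p>Gilmer v Interstate/Johnson Lane Corp. (1991) 500 US 20, 38, 111 S Ct 1647:</p>'
print(pattern.sub(r'\1\2', text))
# <p>Gilmer v Interstate/Johnson Lane Corp. (1991) 500 US 20, 111 S Ct 1647:</p>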

Python printing lists with column headers

So I have a nested list containing these values
#[[Mark, 10, Orange],
#[Fred, 15, Red],
#[Gary, 12, Blue],
#[Ned, 21, Yellow]]
You can see that the file is laid out so you have (name, age, favcolour)
I want to make it so I can display each column with its corresponding header
e.g.:
Name|Age|Favourite colour
Mark|10 |Orange
Fred|15 |Red
Gary|12 |Blue
Ned |21 |Yellow
Thank You!
Simple solution using the str.format() function:
l = [['Mark', 10, 'Orange'],['Fred', 15, 'Red'],['Gary', 12, 'Blue'],['Ned', 21, 'Yellow']]
f = '{:<10}|{:<3}|{:<15}'  # format: name, age, favourite colour
# header (the Name column gets extra width since there could be long names, like "Christopher")
print('Name      |Age|Favourite colour')
for i in l:
    print(f.format(*i))
The output:
Name      |Age|Favourite colour
Mark      |10 |Orange
Fred      |15 |Red
Gary      |12 |Blue
Ned       |21 |Yellow
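If hard-coding the widths is a concern (a name like "Christopher" would overflow the 10-character column), one possible extension, not part of the original answer, is to compute each width from the data and the headers:

l = [['Mark', 10, 'Orange'], ['Fred', 15, 'Red'], ['Gary', 12, 'Blue'], ['Ned', 21, 'Yellow']]
headers = ['Name', 'Age', 'Favourite colour']

# each column is as wide as its longest cell, header included
widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *l)]
f = '|'.join('{:<%d}' % w for w in widths)

print(f.format(*headers))  # Name|Age|Favourite colour
for row in l:
    print(f.format(*row))  # Mark|10 |Orange, and so on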

Remove regex pattern from string and store in csv

I am trying to clean up a CSV using regex. I have accomplished the first part, which extracts the regex pattern from the address field and writes it to the street_numb field. The part I need help with is removing that same pattern from the street field, so I only end up with the following (i.e., Steinway St, 31 St, 82nd Rd, and 19th St) stored in the street field. Hence these values (-78, -45, -35, -54) would be removed from the street field.
b  street_numb  street           address             zipcode
1  246          FIFTH AVE        246 FIFTH AVE       11215
2  30 -78       -78 STEINWAY ST  30 -78 STEINWAY ST  11016
3  25 -45       -45 31ST ST      25 -45 31ST ST      11102
4  123 -35      -35 82ND RD      123 -35 82ND RD     11415
5  22 -54       -54 19TH ST      22 -54 19TH ST      11105
Sample Data (above)
import csv
import re
path = '/Users/darchcruise/Desktop/bldg_zip_codes.csv'
with open(path, 'rU') as infile, open(path+'out.csv', 'w') as outfile:
    fieldnames = ['b', 'street_numb', 'street', 'address', 'zipcode']
    readablefile = csv.DictReader(infile)
    writablefile = csv.DictWriter(outfile, fieldnames=fieldnames)
    for row in readablefile:
        add = re.match(r'\d+\s*-\s*\d+', row['address'])
        if add:
            row['street_numb'] = add.group()
            # row['street'] = remove re.string (add.group()) from street field
            writablefile.writerow(row)
        else:
            writablefile.writerow(row)
What code in line 12 (# remove re.string from row['street']) could be used to resolve my issue (removing -78, -45, -35, -54 from the street field)?
You can use a capturing group with findall, like this:
re.findall(r"(\d+\s*(-\s*\d+\s+)?)((\w|\s)+)", row['address'])[0][0]  # gives the street number
re.findall(r"(\d+\s*(-\s*\d+\s+)?)((\w|\s)+)", row['address'])[0][2]  # gives the street name
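For the commented-out line in the question itself, one sketch (assuming the unwanted fragment only ever appears at the start of the street value, as in the sample data) is to strip it with re.sub:

import re

# remove a leading "-NN " fragment (e.g. -78, -45) from the street value;
# rows without the fragment (e.g. "FIFTH AVE") pass through unchanged
row = {'street': '-78 STEINWAY ST'}
row['street'] = re.sub(r'^-\s*\d+\s*', '', row['street'])
print(row['street'])  # STEINWAY ST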

Repeating Capture Groups Regex

I have a large chunk of class data that I need to run a regular expression on and get data back from. The problem is that I need a repeating capture group in order to accomplish that.
Womn St 157A QUEERHISTORY MAKING
CCode Typ Sec Unt Instructor Time Place Max Enr Req Rstr Status
32680 LEC A 4 SHAH, P. TuTh 11:00-12:20p IAB 131 35 37 60 FULL
Womn St 171 SEX/RACE & CONQUEST
CCode Typ Sec Unt Instructor Time Place Max Enr Req Rstr Status
32710 LEC A 4 O'TOOLE, R. TuTh 2:00- 3:20p DBH 1300 52 13/45 24 OPEN
~ Same as 25610 (GlblClt 103B, Lec A); 26350 (History 169, Lec A); and
~ 60320 (Anthro 139, Lec B).
32711 DIS 1 0 MONSON, A. W 9:00- 9:50 HH 105 25 5/23 8 OPEN
O'TOOLE, R.
~ Same as 25612 (GlblClt 103B, Dis 1); 26351 (History 169, Dis 1); and
~ 60321 (Anthro 139, Dis 1).
The result I need is two matches:
Match
Group1:Womn St 157A
Group2:QUEERHISTORY MAKING
Group3:32680
Group4:LEC
Group5:A
Group6:SHAH, P.
Group7:TuTh 11:00-12:20p
Group8:IAB 13
Match
Group1:Womn St 171
Group2:SEX/RACE & CONQUEST
Group3:32710
Group4:LEC
Group5:A
Group6:O'TOOLE, R.
Group7:TuTh 2:00- 3:20p
Group8:DBH 1300
Group9:25610
Group10:26350
Group11:60320
Group12:32711
Group13:DIS
Group14:1
Group15:MONSON, A.
Group16: W 9:00- 9:50
Group17:HH 105
Group18:25612
Group19:26351
Group20:60321
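Most regex engines, including Python's re module, keep only the last repetition of a repeated capture group, so a single pattern cannot return every "~ Same as" code as a separate group. One common workaround is two passes: a first regex splits the text into course blocks, and a second regex collects the repeated codes inside each block. A minimal sketch of that idea (both patterns are my own assumptions about the listing format, not a full parser for the CCode rows):

import re

listing = """Womn St 157A QUEERHISTORY MAKING
CCode Typ Sec Unt Instructor Time Place Max Enr Req Rstr Status
32680 LEC A 4 SHAH, P. TuTh 11:00-12:20p IAB 131 35 37 60 FULL
Womn St 171 SEX/RACE & CONQUEST
CCode Typ Sec Unt Instructor Time Place Max Enr Req Rstr Status
32710 LEC A 4 O'TOOLE, R. TuTh 2:00- 3:20p DBH 1300 52 13/45 24 OPEN
~ Same as 25610 (GlblClt 103B, Lec A); 26350 (History 169, Lec A); and
~ 60320 (Anthro 139, Lec B)."""

# pass 1: one match per course block (header line plus everything up to the
# next header or the end of the text)
block_re = re.compile(r'^(Womn St \S+) ([^\n]+)\n(.*?)(?=^Womn St |\Z)',
                      re.M | re.S)
# pass 2: every five-digit code followed by "(" inside a block
code_re = re.compile(r'(\d{5}) \(')

for course, title, body in block_re.findall(listing):
    print(course, '|', title)
    print('  cross-listed as:', code_re.findall(body))
# Womn St 157A | QUEERHISTORY MAKING
#   cross-listed as: []
# Womn St 171 | SEX/RACE & CONQUEST
#   cross-listed as: ['25610', '26350', '60320']

The same second pass extends to the CCode rows: an ordinary non-repeating pattern run over each block can pull out Group3 through Group8.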

How to detect character components in values and replace those values with -99

My data looks like:
VAR_A: 134, 15M3, 2004, 301ME, 201E, 41, 53, 22
I'd like to change this vector as below:
VAR_A: 134, -99, 2004, -99, -99, 41, 53, 22
If a value contains characters (e.g., M, E), I want to replace it with -99.
How could I do this in R? I've heard that a regular expression would be a possible way, but I'm not good at them.
It seems to me you want to replace the values that are not made up entirely of digits; if that is the case:
x <- c('134', '15M3', '2004', '301ME', '201E', '41', '53', '22')
sub('.*\\D.*', '-99', x)
# [1] "134" "-99" "2004" "-99" "-99" "41" "53" "22"
Or, alternatively, you could do:
x[grepl('\\D', x)] <- -99
as.numeric(x)
# [1] 134 -99 2004 -99 -99 41 53 22

ReshapeError while trying to pivot pandas dataframe

Using pandas 0.11 on Python 2.7.3, I am trying to pivot a simple dataframe with the following values:
StudentID QuestionID Answer DateRecorded
0 1234 bar a 2012/01/21
1 1234 foo c 2012/01/22
2 4321 bop a 2012/01/22
3 5678 bar a 2012/01/24
4 8765 baz b 2012/02/13
5 4321 baz b 2012/02/15
6 8765 bop b 2012/02/16
7 5678 bop c 2012/03/15
8 5678 foo a 2012/04/01
9 1234 baz b 2012/04/11
10 8765 bar a 2012/05/03
11 4321 bar a 2012/05/04
12 5678 baz c 2012/06/01
13 1234 bar b 2012/11/01
I am using the following command:
df.pivot(index='StudentID', columns='QuestionID')
But I am getting the following error:
ReshapeError: Index contains duplicate entries, cannot reshape
Note that for the same dataframe without the last line,
13 1234 bar b 2012/11/01
the pivot succeeds and results in the following:
Answer DateRecorded
QuestionID bar baz bop foo bar baz bop foo
StudentID
1234 a b NaN c 2012/01/21 2012/04/11 NaN 2012/01/22
4321 a b a NaN 2012/05/04 2012/02/15 2012/01/22 NaN
5678 a c c a 2012/01/24 2012/06/01 2012/03/15 2012/04/01
8765 a b b NaN 2012/05/03 2012/02/13 2012/02/16 NaN
I am new to pivoting and would like to know why a duplicate (StudentID, QuestionID) pair causes this problem, and how I can fix it using the df.pivot() function.
Thank you.
What do you expect your pivot table to look like with the duplicate entries? I'm not sure it would make sense to have multiple elements for (1234, bar) in the pivot table. Your data looks like it's naturally indexed by (questionID, studentID, dateRecorded).
If you go with the Hierarchical Index approach (they're really not that complicated!) I'd try:
In [104]: df2 = df.set_index(['StudentID', 'QuestionID', 'DateRecorded'])
In [105]: df2
Out[105]:
Answer
StudentID QuestionID DateRecorded
1234 bar 2012/01/21 a
foo 2012/01/22 c
4321 bop 2012/01/22 a
5678 bar 2012/01/24 a
8765 baz 2012/02/13 b
4321 baz 2012/02/15 b
8765 bop 2012/02/16 b
5678 bop 2012/03/15 c
foo 2012/04/01 a
1234 baz 2012/04/11 b
8765 bar 2012/05/03 a
4321 bar 2012/05/04 a
5678 baz 2012/06/01 c
1234 bar 2012/11/01 b
In [106]: df2.unstack('QuestionID')
Out[106]:
Answer
QuestionID bar baz bop foo
StudentID DateRecorded
1234 2012/01/21 a NaN NaN NaN
2012/01/22 NaN NaN NaN c
2012/04/11 NaN b NaN NaN
2012/11/01 b NaN NaN NaN
4321 2012/01/22 NaN NaN a NaN
2012/02/15 NaN b NaN NaN
2012/05/04 a NaN NaN NaN
5678 2012/01/24 a NaN NaN NaN
2012/03/15 NaN NaN c NaN
2012/04/01 NaN NaN NaN a
2012/06/01 NaN c NaN NaN
8765 2012/02/13 NaN b NaN NaN
2012/02/16 NaN NaN b NaN
2012/05/03 a NaN NaN NaN
Otherwise you can come up with some rule to determine which of the multiple entries to take for the pivot table, and avoid the Hierarchical index.
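For instance, if "keep the most recent answer" is an acceptable rule, pivot_table will aggregate the duplicates instead of raising. A minimal sketch against a recent pandas (pandas 0.11 spelled these arguments rows=/cols= rather than index=/columns=, and sort_values did not exist yet):

import pandas as pd

# a small stand-in for the data, including the duplicate (1234, 'bar')
# pair that makes df.pivot() raise
df = pd.DataFrame({
    'StudentID':    [1234, 1234, 1234],
    'QuestionID':   ['bar', 'foo', 'bar'],
    'Answer':       ['a', 'c', 'b'],
    'DateRecorded': ['2012/01/21', '2012/01/22', '2012/11/01'],
})

# sort by date, then let aggfunc='last' keep the newest answer per cell
table = (df.sort_values('DateRecorded')
           .pivot_table(index='StudentID', columns='QuestionID',
                        values='Answer', aggfunc='last'))
print(table)
# the (1234, 'bar') cell now holds 'b', the answer recorded on 2012/11/01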
Instead of relying on pandas (which is the better option, of course), you could also aggregate your data manually.
from collections import defaultdict

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def heatmap_seaborn():
    # one accuracy measure per configuration, plus a four-character label
    # encoding the configuration each measure came from
    na_lr_measures = [50, 50, 50, 49, 49, 49, 48, 47, 47, 47, 46, 46, 46, 46, 45, 45, 45, 45, 45, 45, 45, 45, 45, 43, 43, 43, 43, 42, 42, 42, 41, 41, 41, 41, 41, 41, 41, 40, 40, 40, 40, 40, 40, 40, 40, 39, 39, 37, 37, 36, 36, 36, 36, 35, 35, 35, 35, 35, 34, 34, 34, 33, 33, 33, 32, 32, 31, 30, 30, 30, 29, 29]
    na_lr_labels = ('bi2e', 'bi21', 'bi22', 'si21', 'si22', 'si2e', 'si11', 'bi11', 'bi1e', 'si1e', 'bx21', 'ti22', 'bx2e', 'si12', 'ti1e', 'sx22', 'ti21', 'bx22', 'sx2e', 'bi12', 'ti11', 'sx21', 'ti2e', 'ti12', 'sx11', 'sx1e', 'bxx2', 'bx1e', 'bx11', 'tx2e', 'tx22', 'tx21', 'sx12', 'six1', 'six2', 'sixe', 'sixx', 'tx11', 'bx12', 'bix2', 'bix1', 'tx1e', 'bixe', 'bixx', 'bxxe', 'sxx2', 'tx12', 'tixe', 'tix1', 'sxxe', 'sxx1', 'si1x', 'tixx', 'bxx1', 'tix2', 'bi2x', 'sxxx', 'si2x', 'txx1', 'bxxx', 'txxe', 'ti2x', 'sx2x', 'bx2x', 'txxx', 'bi1x', 'tx1x', 'sx1x', 'tx2x', 'txx2', 'bx1x', 'ti1x')
    na_lr_labelcategories = ["TF", "IDF", "Normalisation", "Regularisation", "Acc#161"]
    measures = na_lr_measures
    labels = na_lr_labels
    cats = na_lr_labelcategories
    new_measures = defaultdict(list)
    new_labels = []
    #cats = ["TF", "Normalisation", "Acc#161"]
    # manual aggregation: bucket the measures by the 1st and 3rd label characters
    for i, c in enumerate(labels):
        c = c[0] + c[2]
        new_labels.append(c)
        m = measures[i]
        new_measures[c].append(m)
    # one mean measure per bucket
    labels = list(set(new_labels))
    measures = []
    for l in labels:
        m = np.mean(new_measures[l])
        measures.append(m)
    df = pd.DataFrame(
        {cats[0]: pd.Categorical([a[0] for a in labels]),
         #cats[1]: pd.Categorical([a[1] for a in labels]),
         cats[2]: pd.Categorical([a[1] for a in labels]),
         #cats[3]: pd.Categorical([a[3] for a in labels]),
         cats[4]: measures})
    print df  # Python 2 print statement, matching the question's environment
    # pivot the aggregated frame and draw the heatmap
    df = df.pivot(cats[0], cats[2], cats[4])
    sns.set_context("paper", font_scale=2.7)
    fig, ax = plt.subplots()
    ax = sns.heatmap(df)
    plt.show()
As you can see in the example, a pandas DataFrame is built from some arrays, and then the table is aggregated manually. I did this because I didn't have the time to learn more pandas.