Regex to find specific pattern in R - regex

I have a dataset like below:
dput(d1)
structure(list(FNUM = structure(1L, .Label = "20140824-0227", class = "factor"),
DESCRIPTION = "From : J LTo : feedback#lsd.goe.sfcc : Bcc : Sent On : Mon Apr 13 08:59:18 S 2015Subject : RE:Re: Suspect illegally modified vehiclesBody : Our Ref: BS-CT-1408-0665Date : 2-Apr-2015Our Ref: 2015/Jan/3224Date : 2-Apr-2015Thank you very much! Please conduct a thorough check on the vehicle other than the exhaust system. Warm regards,J L--------------------------------------------On Mon, 4/13/15, feedback#lsd.goe.sf <feedback#lsd.goe.sf> wrote: Subject: RE:Re: Suspect illegally modified vehicles To: jl1229#yahoo.ca Received: Monday, April 13, 2015, 8:56 AM Our Ref: GCE/VS/VS/VE/F20.000.000/38104 Date : 8-Apr-2015 Tel : 1800 2255 582 Fax : 6553 5329 -------------------------------------------- On Mon, 4/6/15, feedback#lsd.goe.sf <feedback#lsd.goe.sf> wrote: Subject: Suspect illegally modified vehicles To: joa#dccs.ca Received: Monday, April 6, 2015, 11:06 AM Our Ref: GCE/VS/VS/VE/F20.000.000/37661 Date : 2-Apr-2015 Tel : 1812 2235 582 Fax : 6553 5329 Dear Ms L Our records show that the vehicle bearing registration"), .Names = c("FNUM",
"DESCRIPTION"), row.names = "1", class = "data.frame")
I use the regex below to identify the Our Ref: values:
> gsub(" *(Our Ref|Date) *:? *","",regmatches(d1[1,2],gregexpr("Our Ref *:[^:]+",d1[1,2]))[[1]])
[1] "BS-CT-1408-0665" "2015/Jan/3224"
[3] "GCE/VS/VS/VE/F20.000.000/38104" "GCE/VS/VS/VE/F20.000.000/37661"
But I only want the Our Ref: values that start with GCE. How do I limit my output to those values beginning with GCE?
Desired Result:
[1] "GCE/VS/VS/VE/F20.000.000/38104" "GCE/VS/VS/VE/F20.000.000/37661"
Update for the second part of the problem:
dput(d1)
structure(list(FNUM = structure(1L, .Label = "20140824-0227", class = "factor"),
DESCRIPTION = "From : J LTo : feedback#lsd.goe.sfcc : Bcc : Sent On : Mon Apr 13 08:59:18 S 2015Subject : RE:Re: Suspect illegally modified vehiclesBody : Our Ref: BS-CT-1408-0665Date : 2-Apr-2015Our Ref: 2015/Jan/3224Date : 2-Apr-2015Thank you very much! Please conduct a thorough check on the vehicle other than the exhaust system. Warm regards,J L--------------------------------------------On Mon, 4/13/15, feedback#lsd.goe.sf <feedback#lsd.goe.sf> wrote: Subject: RE:Re: Suspect illegally modified vehicles To: jl1229#yahoo.ca Received: Monday, April 13, 2015, 8:56 AM Our Ref: GCE/VS/VS/VE/F20.000.000/38104 Date : 8-Apr-2015 Tel : 1800 2255 582 Fax : 6553 5329 -------------------------------------------- On Mon, 4/6/15, feedback#lsd.goe.sf <feedback#lsd.goe.sf> wrote: Subject: Suspect illegally modified vehicles To: joa#dccs.ca Received: Monday, April 6, 2015, 11:06 AM Our Ref: GCE/QSMO/SQSS/SQ/F20.000.000/503533/lc Date : 2-Apr-2015 Tel : 1812 2235 582 Fax : 6553 5329 Our Ref: GCE/CC/PCF/FB/F20.000.000/233546/SK/PW Date : 2-Apr-2015 Dear Ms L Our records show that the vehicle bearing registration "), .Names = c("FNUM",
"DESCRIPTION"), row.names = "1", class = "data.frame")
> gsub(" *(Our Ref|Date) *:? *","",regmatches(d1[1,2],gregexpr("Our Ref *:\\s+GCE[^:]+",d1[1,2]))[[1]])
[1] "GCE/VS/VS/VE/F20.000.000/38104" "GCE/QSMO/SQSS/SQ/F20.000.000/503533/lc"
[3] "GCE/CC/PCF/FB/F20.000.000/233546/SK/PW"
However, I want to limit my result to
[1] "GCE/VS/VS/VE/F20.000.000/38104" "GCE/QSMO/SQSS/SQ/F20.000.000/503533"
[3] "GCE/CC/PCF/FB/F20.000.000/233546"
That is, I want only the first six slash-separated components (v1/v2/v3/v4/v5/v6); anything after the sixth component should be removed, so each reference ends with the number after the fifth slash. GCE/QSMO/SQSS/SQ/F20.000.000/503533/lc should change to GCE/QSMO/SQSS/SQ/F20.000.000/503533, and GCE/CC/PCF/FB/F20.000.000/233546/SK/PW should change to GCE/CC/PCF/FB/F20.000.000/233546.

You can add a requirement that "GCE" (with whitespace before it) occurs before your [^:]:
regmatches(d1[1,2],gregexpr("Our Ref *:\\s+GCE[^:]+",d1[1,2]))
EDIT: try this; you can match a group exactly n times with {n} (note the character class also excludes spaces, so the match stops at the end of the reference):
gsub(" *(Our Ref|Date) *:? *", "",
     regmatches(d1[1,2],
                gregexpr("Our Ref *:\\s+GCE(/[^/ -]+){5}",
                         d1[1,2], perl=T))[[1]])
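The same {n} repetition trick carries over to other regex engines. As a language-neutral check, here is a Python sketch (the sample text below is an abbreviated, hypothetical stand-in for the question's DESCRIPTION field):

```python
import re

# Hypothetical sample text mirroring the question's DESCRIPTION field.
text = ("Our Ref: BS-CT-1408-0665 Date : 2-Apr-2015 "
        "Our Ref: GCE/QSMO/SQSS/SQ/F20.000.000/503533/lc Date : 2-Apr-2015 "
        "Our Ref: GCE/CC/PCF/FB/F20.000.000/233546/SK/PW Date : 2-Apr-2015")

# Match "GCE" followed by exactly five "/segment" groups; the {5} quantifier
# repeats the group, so anything after the sixth component is dropped.
refs = re.findall(r"Our Ref *: *(GCE(?:/[^/ ]+){5})", text)
print(refs)
```

Because the quantified group is anchored to "GCE", the non-GCE reference is excluded and the trailing /lc and /SK/PW tails fall outside the five repetitions.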

Here is a different approach using strsplit() to split on one or more non-digit characters (\\D+) followed by a space:
splts <- strsplit(d1$DESCRIPTION, "\\D+ ")[[1]]
splts[grep("GCE", splts)]
# [1] "GCE/VS/VS/VE/F20.000.000/38104" "GCE/QSMO/SQSS/SQ/F20.000.000/503533"
# [3] "GCE/CC/PCF/FB/F20.000.000/233546"
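For comparison, the split-then-filter idea translates directly to other engines; a Python sketch with a shortened, hypothetical sample of the DESCRIPTION text:

```python
import re

# Hypothetical abbreviated sample of the DESCRIPTION field.
text = ("Body : Our Ref: BS-CT-1408-0665 Date : 2-Apr-2015 "
        "Our Ref: GCE/VS/VS/VE/F20.000.000/38104 Date : 8-Apr-2015 Tel : 1800")

# Split on one-or-more non-digits followed by a space; each reference ends in
# digits, so the refs survive as whole tokens, then keep those containing GCE.
parts = re.split(r"\D+ ", text)
refs = [p for p in parts if "GCE" in p]
print(refs)
```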


How to delete words from a dataframe column that are present in dictionary in Pandas

An extension to :
Removing list of words from a string
I have the following dataframe, and I want to delete frequently occurring words from the df.name column:
df :
name
Bill Hayden
Rock Clinton
Bill Gates
Vishal James
James Cameroon
Micky James
Michael Clark
Tony Waugh
Tom Clark
Tom Bill
Avinash Clinton
Shreyas Clinton
Ramesh Clinton
Adam Clark
I'm creating a new dataframe with the words and their frequencies using the following code:
df = pd.DataFrame(data.name.str.split(expand=True).stack().value_counts())
df.reset_index(level=0, inplace=True)
df.columns = ['word', 'freq']
df = df[df['freq'] >= 3]
which will result in
df2 :
word freq
Clinton 4
Bill 3
James 3
Clark 3
Then I'm converting it into a dictionary with the following code snippet:
d = dict(zip(df['word'], df['freq']))
Now, to remove the words from df.name that appear in d (the dictionary of word : freq), I'm using the following code snippet:
def check_thresh_word(merc, d):
    m = merc.split(' ')
    for i in range(len(m)):
        if m[i] in d.keys():
            return False
        else:
            return True

def rm_freq_occurences(merc, d):
    if check_thresh_word(merc, d) == False:
        nwords = merc.split(' ')
        rwords = [word for word in nwords if word not in d.keys()]
        m = ' '.join(rwords)
    else:
        m = merc
    return m

data['new_name'] = data['name'].apply(lambda x: rm_freq_occurences(x, d))
But my actual dataframe contains nearly 240k rows, and I have to use a threshold greater than 100 (instead of thresh=3 as in the sample above).
So the code above takes a long time to run because of the repeated lookups.
Is there an efficient way to make it faster?
Following is a desired output :
name
Hayden
Rock
Gates
Vishal
Cameroon
Micky
Michael
Tony Waugh
Tom
Tom
Avinash
Shreyas
Ramesh
Adam
Thanks in advance!
Use replace with a regex created by joining all values of column word, then strip trailing whitespace:
data.name = data.name.replace('|'.join(df['word']), '', regex=True).str.strip()
Another solution is to add \s* to match zero or more whitespace characters:
pat = '|'.join(['\s*{}\s*'.format(x) for x in df['word']])
print (pat)
\s*Clinton\s*|\s*James\s*|\s*Bill\s*|\s*Clark\s*
data.name = data.name.replace(pat, '', regex=True)
print (data)
name
0 Hayden
1 Rock
2 Gates
3 Vishal
4 Cameroon
5 Micky
6 Michael
7 Tony Waugh
8 Tom
9 Tom
10 Avinash
11 Shreyas
12 Ramesh
13 Adam
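One caveat with a bare alternation is that it also hits substrings (e.g. "Bill" inside "Billy"). A hedged variant of the same approach, with word boundaries and re.escape added and a small made-up sample (not part of the original answer):

```python
import re

import pandas as pd

# Hypothetical sample frame and frequent-word list.
data = pd.DataFrame({"name": ["Bill Hayden", "Tom Bill", "Shreyas Clinton"]})
words = ["Clinton", "Bill", "James", "Clark"]

# Join the frequent words into one alternation; \b keeps e.g. "Billy" intact,
# and re.escape guards against regex metacharacters in the words themselves.
pat = "|".join(r"\b{}\b".format(re.escape(w)) for w in words)
data["name"] = data["name"].str.replace(pat, "", regex=True).str.strip()
print(data["name"].tolist())
```

This keeps the single vectorized replace, so it stays fast on a 240k-row frame.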

Accessing list of dictionaries within a dictionary in python

So let's see how badly I messed this up.
This is what it is supposed to look like:
Students
1 - MICHAEL JORDAN - 13
2 - JOHN ROSALES - 11
3 - MARK GUILLEN - 11
4 - KB TONEL - 7
Instructors
1 - MICHAEL CHOI - 11
2 - MARTIN PURYEAR - 13
However, the two extra lists are throwing me for a loop. I wrote the for loop for the first level, giving me Students and Instructors, then tried to get at the keys and values. Could someone please look at this and point me in the right direction to fix this mess?
users = {
    'Students': [
        {'first_name': 'Michael', 'last_name': 'Jordan'},
        {'first_name': 'John', 'last_name': 'Rosales'},
        {'first_name': 'Mark', 'last_name': 'Guillen'},
        {'first_name': 'KB', 'last_name': 'Tonel'}
    ],
    'Instructors': [
        {'first_name': 'Michael', 'last_name': 'Choi'},
        {'first_name': 'Martin', 'last_name': 'Puryear'}
    ]
}
for i in users:
    for i in Students:
        print ([i['first_name'], i['last_name']] , + len([i['first_name'], i['last_name']
    for i in Instructors:
        print ([i['first_name'], i['last_name']] , + len([i['first_name'], i['last_name'0]
Basically, you are overriding the value of i with your nested loops, which you should avoid. Also, when you write for i in users: you iterate only over the keys of the dict users; use dict.items() (or dict.iteritems() in Python 2) to access each key/value pair.
And here is your updated code, just in case.
for usr, info in users.iteritems():
    print "{}".format(usr)
    for i, _info in enumerate(info, 1):
        name = "{first_name} {last_name}".format(**_info).upper()
        print "{} - {} - {}".format(i, name, len(name)-1)
I will leave it to you to understand the code and the builtin functions being used in the code.
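For readers on Python 3, where iteritems() and the print statement are gone, an equivalent sketch of the same approach might look like this (sample data trimmed to a few entries):

```python
# Trimmed sample of the question's data structure.
users = {
    'Students': [
        {'first_name': 'Michael', 'last_name': 'Jordan'},
        {'first_name': 'John', 'last_name': 'Rosales'},
    ],
    'Instructors': [
        {'first_name': 'Michael', 'last_name': 'Choi'},
    ],
}

lines = []
for group, members in users.items():
    lines.append(group)
    for i, person in enumerate(members, 1):
        name = "{first_name} {last_name}".format(**person).upper()
        # len(name) - 1 excludes the separating space, matching the counts shown.
        lines.append("{} - {} - {}".format(i, name, len(name) - 1))
print("\n".join(lines))
```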

What is wrong with my regular expression in R? [closed]

I am trying to extract the label, name, address, city, zip, and distance from the following text:
A
Carl's Jr.
308 WESTWOOD PLAZA
LOS ANGELES, CA 90095-8355
0.0 mi.
B
Carl's Jr.
2727 SANTA MONICA
SANTA MONICA, CA 90404-2407
4.8 mi.
...
...
Here is my regular expression pattern and code, but I get a matrix of NA values.
p <- "(^[AZ]\\n)^(\\w+.\\w+\\s\\w+.\\s*\\w*)\\n^(\\d+\\w+\\s*\\w*\\s*\\w*)\\n^(\\w+\\s*\\w*),\\s(CA)\\s(\\d+-*\\d*)\\n^(\\d+.\\d*)\\smi."
matches <- str_match(cj, p)
Do I have a syntax error in my pattern?
Maybe try strsplit() instead. See regex101 for an explanation of the regex used below. Afterwards, we can figure out how many rows there will be by counting the single-character elements.
s <- strsplit(x, "\n+|, | (?=[0-9]+)", perl = TRUE)[[1]]
as.data.frame(matrix(s, sum(nchar(s) == 1), byrow = TRUE))
# V1 V2 V3 V4 V5 V6 V7
# 1 A Carl's Jr. 308 WESTWOOD PLAZA LOS ANGELES CA 90095-8355 0.0 mi.
# 2 B Carl's Jr. 2727 SANTA MONICA SANTA MONICA CA 90404-2407 4.8 mi.
Data:
x <- "A\n\nCarl's Jr.\n\n308 WESTWOOD PLAZA\n\nLOS ANGELES, CA 90095-8355\n\n0.0 mi.\n\nB\n\nCarl's Jr.\n\n2727 SANTA MONICA\n\nSANTA MONICA, CA 90404-2407\n\n4.8 mi."
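The same split pattern works verbatim in other PCRE-style engines; for instance, a Python sketch using one record of the sample data:

```python
import re

# One record of the sample data.
x = ("A\n\nCarl's Jr.\n\n308 WESTWOOD PLAZA\n\n"
     "LOS ANGELES, CA 90095-8355\n\n0.0 mi.")

# Split on runs of newlines, on the ", " between city and state, or on a
# space that is followed by a digit (lookahead), separating "CA" from the zip.
fields = re.split(r"\n+|, | (?=[0-9]+)", x)
print(fields)
```

The lookahead is what keeps "308 WESTWOOD PLAZA" and "0.0 mi." intact while still splitting "CA 90095-8355" into state and zip.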
Here's a way to do it without regular expressions
library(dplyr)
library(tidyr)
text =
"A
Carl's Jr.
308 WESTWOOD PLAZA
LOS ANGELES, CA 90095-8355
0.0 mi.
B
Carl's Jr.
2727 SANTA MONICA
SANTA MONICA, CA 90404-2407
4.8 mi." %>% textConnection %>% readLines
result =
    data_frame(text = text) %>%
    filter(text != "") %>%
    mutate(type = c("ID", "name", "street_address", "city_state_zip", "distance") %>%
               rep_len(n()),
           index = ceiling((1:n())/5)) %>%
    spread(type, text) %>%
    separate(city_state_zip, c("city", "state_zip"), sep = ", ") %>%
    separate(state_zip, c("state", "zip"), sep = " ") %>%
    separate(distance, c("distance", "unit"), sep = " ") %>%
    mutate(distance = as.numeric(distance))

How to get DateCompare() to behave in ColdFusion 10?

I'm using CF10 with the latest update level, on Windows, in Pacific Standard Time. I need a dateCompare() combination that returns 0, but I cannot get it to behave ever since Adobe changed the behavior of DateConvert() and DateCompare().
<cfset filePath = getBaseTemplatePath()>
<cfset fileinfo = getFileInfo(filePath)>
<cfset lastModified = fileinfo.lastModified>
<cfset lastModifiedUTC = dateConvert("local2utc", lastModified)>
<cfset lastModifiedUTC2 = dateAdd("s", getTimezoneInfo().UtcTotalOffset, lastModified)>
<cfset lastModifiedHttpTime = getHttpTimeString(lastModified)>
<cfset parseLastModifiedHttpTimeSTD = parseDateTime(lastModifiedHttpTime)>
<cfset parseLastModifiedHttpTimePOP = parseDateTime(lastModifiedHttpTime, "pop")>
<cfoutput>
<pre>
lastModified (local) : #datetimeformat(lastModified, 'long')#
lastModifiedUTC : #datetimeformat(lastModifiedUTC, 'long')#
lastModifiedUTC2 : #datetimeformat(lastModifiedUTC2, 'long')#
datecompareLmUTC : #dateCompare(lastModifiedUTC, lastModifiedUTC2)# //wtf
lastModifiedHttpTime : #lastModifiedHttpTime#
parseLastModifiedHttpTimeSTD : #datetimeformat(parseLastModifiedHttpTimeSTD, 'long')#
parseLastModifiedHttpTimePOP : #datetimeformat(parseLastModifiedHttpTimePOP, 'long')#
I need a datecompare() combination that returns 0
------------------------------------------------
DateCompare(lastModifiedUTC, parseLastModifiedHttpTimePOP) : #DateCompare(lastModifiedUTC, parseLastModifiedHttpTimePOP)#
DateCompare(lastModifiedUTC2, parseLastModifiedHttpTimePOP) : #DateCompare(lastModifiedUTC2, parseLastModifiedHttpTimePOP)#
CF Version : #server.coldfusion.productVersion#, update level: #server.coldfusion.updatelevel#
</pre>
</cfoutput>
OUTPUT:
lastModified (local) : September 11, 2015 7:10:23 PM PDT
lastModifiedUTC : September 12, 2015 2:10:23 AM UTC
lastModifiedUTC2 : September 15, 2015 4:58:22 PM PDT
datecompareLmUTC : -1 //wtf
lastModifiedHttpTime : Sat, 12 Sep 2015 02:10:23 GMT
parseLastModifiedHttpTimeSTD : September 12, 2015 2:10:23 AM PDT
parseLastModifiedHttpTimePOP : September 12, 2015 2:10:23 AM UTC
I need a datecompare() combination that returns 0
------------------------------------------------
DateCompare(lastModifiedUTC, parseLastModifiedHttpTimePOP) : 1
DateCompare(lastModifiedUTC2, parseLastModifiedHttpTimePOP) : 1
CF Version : 10,0,17,295085, update level: 17
I'm pulling my hair out.
(Too long for comments)
I did some digging with CF11, based on the blog comments. From what I could tell, the reason the initial comparison fails is that although the first two dates look similar:
// code
lastModifiedUTC : #DateTimeFormat(lastModifiedUTC, "yyyy-mm-dd HH:nn:ss.L zzz")#
lastModifiedUTC2 : #DateTimeFormat(lastModifiedUTC2, "yyyy-mm-dd HH:nn:ss.L zzz")#
// output
lastModifiedUTC : 2015-09-13 19:51:46.219 UTC
lastModifiedUTC2 : 2015-09-13 19:51:46.219 PDT
... due to time zone differences, internally the objects represent a different point in time. That is why dateCompare() fails to return 0. (The third comparison fails for the same reason.)
// code
lastModifiedUTC : #lastModifiedUTC.getTime()#
lastModifiedUTC2 : #lastModifiedUTC2.getTime()#
// output
lastModifiedUTC : 1442173906219
lastModifiedUTC2 : 1442199106219
Notice if you compare lastModifiedUTC to the original (local) date, it works as expected? Despite the different time zones, both objects still represent the same point in time internally:
// code
dateCompare : #dateCompare(lastModifiedUTC, lastModified)#
lastModifiedUTC : #lastModifiedUTC.getTime()#
lastModified : #lastModified.getTime()#
lastModifiedUTC : #DateTimeFormat(lastModifiedUTC, "yyyy-mm-dd HH:nn:ss.L zzz")#
lastModified : #DateTimeFormat(lastModified, "yyyy-mm-dd HH:nn:ss.L zzz")#
// output
dateCompare : 0
lastModifiedUTC : 1442173906219
lastModified : 1442173906219
lastModifiedUTC : 2015-09-13 19:51:46.219 UTC
lastModified : 2015-09-13 12:51:46.219 PDT
Curiously, the second comparison also fails to return 0, even though both objects seem to have the same time and time zone. Looking at the internal time values, though, the milliseconds differ. Unlike the file date, the POP date was created by parsing a string that does not contain milliseconds, so that date part is always zero (DatePart reports the same). Since dateCompare() performs a full comparison, including milliseconds, the two dates are not equal.
// code
lastModifiedUTC : #DateTimeFormat(lastModifiedUTC, "yyyy-mm-dd HH:nn:ss.L zzz")#
parseLastModifiedHttpTimePOP : #DateTimeFormat(parseLastModifiedHttpTimePOP, "yyyy-mm-dd HH:nn:ss.L zzz")#
lastModifiedUTC : #lastModifiedUTC.getTime()#
parseLastModifiedHttpTimePOP : #parseLastModifiedHttpTimePOP.getTime()#
datePart(lastModifiedUTC) : #datePart("l", lastModifiedUTC)#
datePart(parseLastModifiedHttpTimePOP) : #datePart("l", parseLastModifiedHttpTimePOP)#
// output
lastModifiedUTC : 2015-09-13 19:51:46.219 UTC
parseLastModifiedHttpTimePOP : 2015-09-13 19:51:46.0 UTC
lastModifiedUTC : 1442173906219
parseLastModifiedHttpTimePOP : 1442173906000
datePart(lastModifiedUTC) : 219
datePart(parseLastModifiedHttpTimePOP) : 0
However, on a good note, that means the comparison works if you skip the milliseconds and only compare down to the second, i.e. dateCompare(date1, date2, "s"):
// code
DateCompare(lastModifiedUTC, parseLastModifiedHttpTimePOP, "s") : #DateCompare(lastModifiedUTC, parseLastModifiedHttpTimePOP, "s")#
// output
DateCompare(lastModifiedUTC, parseLastModifiedHttpTimePOP, "s") : 0
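The millisecond mismatch is easy to reproduce outside ColdFusion. A Python sketch of the same idea (the timestamps are illustrative):

```python
from datetime import datetime

# A file timestamp carries milliseconds; a date parsed from an HTTP string
# (whole seconds only) does not, so an exact comparison fails.
file_time = datetime(2015, 9, 13, 19, 51, 46, 219000)   # .219 seconds
parsed    = datetime(2015, 9, 13, 19, 51, 46, 0)        # parsed from a string

print(file_time == parsed)                # exact compare fails
# Truncating both to whole seconds, the equivalent of dateCompare(..., "s"):
print(file_time.replace(microsecond=0) == parsed.replace(microsecond=0))
```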
As an aside, I am not sure why Adobe chose to change the behavior of something as critical as UTC dates. Unfortunately, I do not know that there is much you can do about it other than the options mentioned in the blog comments: a) change the JVM time zone, or b) write your own version of dateConvert() and use that instead.
Boy what a mess....

Regular Expressions in R - exclude keyword

There are two variables in my data set with similar names: "JE.Description" and "Field.Description". How can I target the column index of the "JE.Description" column so as to exclude the word "Field" from the regex search? In other words, I would like to modify the command below to return only the column index of "JE.Description".
The data set is frequently updated, and sometimes the "JE.Description" string is shown just as "Description". That is why I am seeking a solution that explicitly excludes the keyword "Field".
r1 <- c(1:5)
r2 <- c(1:5)
df <- data.frame(r1,r2)
names(df)[1] <- "JE.Description"
names(df)[2] <- "Field.Description"
y <- grep("!^Field^Description",perl = TRUE, colnames(df))
RETURNS: integer(0)
Thanks,
To match every string containing "Description" except for those in which it's immediately preceded by the "Field.", use a negative lookbehind assertion:
## The regex pattern
pat <- "(?<!Field\\.)Description"
## Try it out
x <- c("Description", "Field.Description", "FieldDescription", "xyz Description")
grep(pat, x, perl=TRUE) # Note: lookahead & lookbehind assertions need perl=TRUE
# [1] 1 3 4
Alternatively, if the substring "field" might occur in some other position relative to "Description" (and perhaps in upper- or lower-case), it might be simpler to just call grepl() twice and combine the results with Boolean operators:
x <- c("Description", "fieldDescription", "Field-of-Description",
"Description field")
which(grepl("Description", x) & !grepl("field", x, ignore.case=TRUE))
[1] 1
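The negative-lookbehind technique is engine-portable; here is the same pattern exercised in Python, with the sample strings taken from the example above:

```python
import re

# Negative lookbehind: match "Description" unless immediately preceded by "Field."
pat = r"(?<!Field\.)Description"

cols = ["Description", "Field.Description", "FieldDescription", "xyz Description"]
hits = [c for c in cols if re.search(pat, c)]
print(hits)
```

"FieldDescription" still matches because the lookbehind only rejects the literal "Field." (with the dot) directly before "Description", mirroring the R result.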
mydata<-structure(list(Description = c(21, 21, 22.8, 21.4, 18.7, 18.1,
14.3, 24.4, 22.8, 19.2), Field.Description = c(6, 6, 4, 6, 8,
6, 8, 4, 4, 6)), .Names = c("Description", "Field.Description"
), row.names = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
"Hornet 4 Drive", "Hornet Sportabout", "Valiant", "Duster 360",
"Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")
mydata[grep("^Description",names(mydata))]
Description
Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
Hornet 4 Drive 21.4
Hornet Sportabout 18.7
Valiant 18.1
Duster 360 14.3
Merc 240D 24.4
Merc 230 22.8
Merc 280 19.2