Gsub Data frame Replace only exact Cell value matches, no substring - regex

The problem I am facing is that I have a dataframe called uniqindex which looks like the following.
S5 1 Below 25
S5 2 25-30
S5 3 31-35
S5 4 36-40
S5 5 41-45
S5 6 46-50
A sample line of the file where I intend to replace the numeric codes with the age ranges looks like -
S5 4 3 5 3 7 4 3 4 4 7
Following is the code that I run
range<-c('S1','S2a','S2b','S4','S5','S5a','S6','S8','S9','Q8')
FinalOut<-NULL
AddColName<-NULL
for (y in range)
{
df<-copytrans1[copytrans1[,1]==as.character(y),]
uniqindex<-index1[index1[,1]==y,]
looptime<-nrow(uniqindex)
for (k in 1:looptime)
{
df <- as.data.frame(lapply(df, FUN = function(x) gsub(uniqindex[k,2],uniqindex[k,3], x)))
}
FinalOut<-rbind(FinalOut,df)
AddColName<-rbind(AddColName,cbind(as.data.frame(y),df))
}
The problem that I face is that as the substitutions run sequentially, this is the output that I get
S5a S5a ageage41_501_40 ageage41_501_40 age41_50 ageage41_501_40 age41_50 ageage41_501_40 ageage41_501_40 ageage41_501_40 ageage41_501_40 age41_50 age41_50
I want to know how I can change my code to replace only exact matches. Currently, 1 is changed to 25-30, and in the second iteration the 2 in 25-30 is changed as well, giving 25-305-30.

To match the one-digit index only as an isolated word rather than within a two-digit age, you can put the symbols \< and \> at the beginning and end of the pattern:
gsub(paste('\\<', uniqindex[k,2], '\\>', sep=''), uniqindex[k,3], x)
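Another way to avoid the cascading substitutions is to do the replacement in a single pass with a lookup table, so that replaced text is never scanned again. This is a minimal sketch in Python rather than R. The codes dictionary copies the S5 rows from the question, and recode_cells is a hypothetical helper name.

```python
# Lookup table copied from the S5 rows of the question's index data frame.
codes = {
    "1": "Below 25", "2": "25-30", "3": "31-35",
    "4": "36-40", "5": "41-45", "6": "46-50",
}

def recode_cells(cells, codes):
    """Replace a cell only when its whole value is a key in codes."""
    # dict.get(cell, cell) returns the original cell when there is no exact match.
    return [codes.get(cell, cell) for cell in cells]

# Sample row from the question, without the leading question id.
row = "4 3 5 3 7 4 3 4 4 7".split()
print(recode_cells(row, codes))
```

Because each cell is looked up exactly once, a label such as 25-30 is never re-read as containing the code 2, and the value 7, which has no entry in the table, is left unchanged.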

Related

Stata Regex for 'standalone' numbers in string

I am trying to remove a specific pattern of numbers from a string using the regexr function in Stata. I want to remove any pattern of numbers that are not bounded by a character (other than whitespace), or a letter. For example, if the string contained t370 or 6-test I would want those to remain. It's only when I have numbers next to each other.
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
I would like to end up with:
ID string
1 7-test
2 67-tty
3 j3782 3hty
I've tried different regex statements to find when numbers are wrapped in a word boundary: regexr(string, "\b[0-9]+\b", ""); in addition to manually adding the white space " [0-9]+" which will only replace if the pattern occurs in the middle, not at the start of a string. If it's easier to do this without regex expressions that's fine, I was just trying to become more familiar.
Following up on the loop suggestion from the comments, you could do something like the following:
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
gen N_words = wordcount(string) // # words in each string
qui sum N_words
global max_words = r(max) // max # words in all strings
split string, gen(part) parse(" ") // split string at space (p.s. space is the default)
gen string2 = ""
forval i = 1/$max_words {
* add in parts that contain at least one letter
replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}
drop part* N_words
where the result would be
. list
+----------------------------------------+
| id string string2 |
|----------------------------------------|
1. | 1 9884 7-test 58 - 489 7-test |
2. | 2 67-tty 783 444 67-tty |
3. | 3 j3782 3hty j3782 3hty |
+----------------------------------------+
Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.
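For comparison, the same rule (keep every whitespace-separated word that contains at least one letter) can be sketched in Python. This is an illustration of the logic only, not a Stata equivalent, and keep_words_with_letters is a hypothetical name.

```python
import re

def keep_words_with_letters(text):
    """Keep only whitespace-separated words that contain at least one letter."""
    words = [w for w in text.split() if re.search(r"[A-Za-z]", w)]
    return " ".join(words)

# The three example strings from the question.
for s in ["9884 7-test 58 - 489", "67-tty 783 444", "j3782 3hty"]:
    print(keep_words_with_letters(s))
```

Filtering whole words sidesteps the word-boundary question entirely: a lone "-" or "58" has no letter and is dropped, while "7-test" and "3hty" survive intact.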

Removing special characters while retaining alpha numeric words

I'm in the middle of cleaning a data set that has this:
[IN]
my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
my_Series.str.replace("[^a-zA-Z]+", " ")
[OUT]
0
1 ASD
2 AUG M G
3 Air G G
4 Karsh
[IDEAL OUT]
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
My goal is to remove special characters and numbers, but if there's a word that mixes letters and digits, it should stay. Can anyone help?
Try with apply to achieve your ideal output.
>>> my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
>>> my_Series.apply(lambda x: " ".join(['' if word.isdigit() else word for word in x.replace('-', ' ').split()]))
Output:
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
Explanation:
I replace each - with a space and split the string on whitespace, then check whether each word consists only of digits.
If it does, the word is replaced with an empty string; otherwise the word itself is kept.
Finally, the list is joined back into a single string.
Edit 1:
regex solution :-
>>> my_Series.str.replace(r"((\d+)(?=.*\d))|([^a-zA-Z0-9 ])", " ", regex=True)
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
Explanation:
Using a lookahead.
((\d+)(?=.*\d))|([^a-zA-Z0-9 ])
(a run of digits that is followed, anywhere later in the string, by another digit) OR (any character that is not a letter, a digit, or a space)
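The lookahead pattern depends on where the digits happen to sit in the string, so it may be worth comparing it against a word-boundary variant. The sketch below uses Python's standard re module instead of pandas. BOUNDARY is an assumed alternative pattern, not part of the answer above, and clean is a hypothetical helper.

```python
import re

# Pattern from the answer above.
LOOKAHEAD = re.compile(r"((\d+)(?=.*\d))|([^a-zA-Z0-9 ])")
# Assumed alternative: a digit run that forms a whole word, or a special character.
BOUNDARY = re.compile(r"\b\d+\b|[^a-zA-Z0-9 ]")

def clean(text, pattern):
    """Replace every match with a space, then collapse repeated whitespace."""
    return " ".join(pattern.sub(" ", text).split())

for s in ["-", "ASD", "711-AUG-M4G", "Air G2G", "Karsh", "M4G 711"]:
    print(repr(s), "->", repr(clean(s, LOOKAHEAD)), "|", repr(clean(s, BOUNDARY)))
```

Both patterns agree on the five strings from the question. On the extra string "M4G 711" they differ: the lookahead version removes the 4 inside M4G because more digits follow it, while the word-boundary version only removes the standalone 711. Note that \b treats the underscore as a word character, so the boundary variant would need adjusting for data that contains underscores.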

generate all combinations of strings based on template

How to generate all combinations of strings based on template?
For example:
- Template string of
"{I|We} want {|2|3|4} {apples|pears}"
The curly braces "{...}" identify a group of words, each word separated by "|".
The class should generate strings with every combination of words within each word group.
I know this relates to finite automata and to regex. How can I generate the combinations efficiently?
For example
G[0][j] [want] G[1][j] G[2][j]
G[0] = {I, We}
G[1] = {2, 3, 4}
G[2] = {apples, pears}
First, generate all possible index combinations c = [0..1][0..2][0..1]:
000
001
010
011
020
021
100
101
110
111
120
121
and then for each c replace G[i][j] by G[i][c[i]]
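The index enumeration above is a Cartesian product, which Python's itertools.product computes directly. The following is a minimal sketch, assuming the template uses only non-nested {a|b|c} groups; expand_template is a hypothetical name.

```python
import itertools
import re

def expand_template(template):
    """Return every string generated by a template with {a|b|c} word groups."""
    # re.split with a capturing group keeps the group contents in the result:
    # even indices hold literal text, odd indices hold the "a|b|c" contents.
    parts = re.split(r"\{([^{}]*)\}", template)
    choices = [p.split("|") if i % 2 else [p] for i, p in enumerate(parts)]
    # Join each combination, then collapse the double space an empty option leaves.
    return [" ".join("".join(combo).split()) for combo in itertools.product(*choices)]

results = expand_template("{I|We} want {|2|3|4} {apples|pears}")
print(len(results))
for line in results:
    print(line)
```

Unlike the three-element G[1] in the listing above, this sketch keeps the empty option in {|2|3|4}, so the template yields 2 x 4 x 2 = 16 strings.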
Shell brace expansion
$ for q in {I,We}\ want\ {2,3,4}\ {apples,pears}; do echo "$q" ; done
I want 2 apples
I want 2 pears
I want 3 apples
I want 3 pears
I want 4 apples
I want 4 pears
We want 2 apples
We want 2 pears
We want 3 apples
We want 3 pears
We want 4 apples
We want 4 pears
The most functional solution to this problem I found so far is the Python module sre_yield.
The goal of sre_yield is to efficiently generate all values that can
match a given regular expression, or count possible matches
efficiently.
Emphasis added by me.
To apply it to your stated problem: Formulate your template as regex pattern and use it in sre_yield to get all possible combinations or count possible matches like this:
import sre_yield
result = set(sre_yield.AllStrings("(I|We) want (|2|3|4) (apples|pears)"))
len(result)
result
Output:
16
{'I want apples',
'I want pears',
'I want 2 apples',
'I want 2 pears',
'I want 3 apples',
'I want 3 pears',
'I want 4 apples',
'I want 4 pears',
'We want apples',
'We want pears',
'We want 2 apples',
'We want 2 pears',
'We want 3 apples',
'We want 3 pears',
'We want 4 apples',
'We want 4 pears'}
PS: Instead of a list as shown on the project page I use a set to avoid duplicates. If this is not what you want go with a list.
The principle is:
Regex -> NFA
NFA -> minimal DFA
DFS-walk through the DFA (collecting all characters)
This principle is implemented, e.g. in RexLex:
DeterministicAutomaton dfa = Pattern.compileGenericAutomaton("(I|We) want (2|3|4)? (apples|pears)")
.toAutomaton(new FromGenericAutomaton.ToMinimalDeterministicAutomaton());
if (dfa.getProperty().isAcyclic()) {
for (String s : dfa.getSamples(1000)) {
System.out.println(s);
}
}
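The last step of the principle above, the walk over an acyclic DFA, can be sketched in Python. The automaton here is hand-built for the toy pattern (a|b)c?d and stored as a plain dictionary. That representation and the name enumerate_strings are assumptions for illustration, not RexLex's API, and the regex-to-DFA conversion is not shown.

```python
def enumerate_strings(transitions, start, accepting):
    """Yield every string accepted by an acyclic DFA, using a depth-first walk."""
    stack = [(start, "")]
    while stack:
        state, prefix = stack.pop()
        if state in accepting:
            yield prefix
        # Follow each outgoing edge, extending the prefix by the edge's symbol.
        for symbol, next_state in transitions.get(state, {}).items():
            stack.append((next_state, prefix + symbol))

# Hand-built DFA for (a|b)c?d, as state -> {symbol: next state}.
transitions = {0: {"a": 1, "b": 1}, 1: {"c": 2, "d": 3}, 2: {"d": 3}}
print(sorted(enumerate_strings(transitions, start=0, accepting={3})))
```

The walk terminates only because the automaton has no cycles, which is why the RexLex snippet checks isAcyclic() before sampling.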
Convert each set of strings {...} into a string array, so you have n arrays.
So for "{I|We} want {|2|3|4} {apples|pears}" we would have 4 arrays (the literal word "want" becomes a one-element array).
Place each of those arrays into another array. In my example I will call that array of arrays wordSet.
This is Java code, but it's simple enough that you should be able to convert it to any language. I didn't test it, but it should work.
import java.util.ArrayList;
void makeStrings(String[][] wordSet, ArrayList<String> collection) {
    makeStrings(wordSet, collection, "", 0, 0);
}
void makeStrings(String[][] wordSet, ArrayList<String> collection, String currString, int x_pos, int y_pos) {
    // No more word sets in the whole set: one combination is complete, so store it
    // (trim() drops the leading space added before the first word)
    if (x_pos >= wordSet.length) {
        collection.add(currString.trim());
        return;
    }
    // y_pos is out of bounds: no more words within the current set {...}
    else if (y_pos >= wordSet[x_pos].length) {
        return;
    }
    else {
        // Branch twice: once "vertically" and once "horizontally"
        // Accept the current word at (x_pos, y_pos), then move to the next word set
        String combo_x = currString + " " + wordSet[x_pos][y_pos];
        makeStrings(wordSet, collection, combo_x, x_pos + 1, 0);
        // Keep the string unchanged and move to the next word within the same set
        String combo_y = currString;
        makeStrings(wordSet, collection, combo_y, x_pos, y_pos + 1);
    }
}

Regular Expression: Find repeated patterns

Having this string s=";123;;123;;456;;124;;123;;567;" in R, which shows some Ids separated by ";", I want to find the repeated IDs, so in this case ";123;" is repeated. I used the following command in R:
gregexpr("(;[1-9]+;).*\1", s)
but it doesn't find the repeated patterns. Any idea what is wrong?
One example of a long string:
1760381;;1774536;;1774614;;1774617;;1774705;;1774723;;1775013;;1902321;;1928678;;2105486;;2105514;;2105544;;2105575;;2105585;;2279115;;2379236;;290927;;542280;;555749;;641540;;683822;;694934;;713228;;713248;;713249;;726949;;727204;;731434;;754522;;7693856;;100095;;1003838;;1045582;;1079057;;1108697;;1231229;;124087;;1249672;;1328126;;1412065;;1419930;;1441743;;1470580;;1476585;;1502106;;1556149;;1637775;;1643922;;1655644;;1755547;;1759001;;1760295;;1760296;;1760320;;1760326;;1760338;;1760348;;1760349;;1760350;;1760353;;1760375;;1760376;;1760377;;1760378;;1760388;;1760401;;1760402;;1760403;;1760410;;1760421;;1760425;;1760426;;1760642;;1760654;;1770463;;1774365;;1774366;;1774394;;1774449;;1774453;;1774454;;1774455;;1774456;;1774457;;1774458;;1774461;;1774462;;1774463;;1774464;;1774466;;1774469;;1774504;;1774505;;1774506;;1774519;;1774520;;1774525;;1774527;;1774529;;1774532;;1774533;;1774539;;1774542;;1774593;;1774595;;1774604;;1774610;;1774616;;1774617;;1774641;;1774660;;1774671;;1774674;;1774684;;1774687;;1774694;;1774704;;1774706;;1774713;;1774717;;1774722;;1774723;;1774726;;1774733;;1774745;;1774750;;1774753;;1774754;;1774766;;1774784;;1774786;;1774795;;1774799;;1774800;;1774803;;1774809;;1774813;;1774835;;1774849;;1774852;;1774853;;1774854;;1774857;;1774858;;1774861;;1774862;;1774867;;1774868;;1774869;;1774870;;1774877;;1774878;;1774880;;1774884;;1774885;;1774886;;1774902;;1774905;;1774934;;1774935;;1774937;;1774939;;1774946;;1774949;;1774950;;1774958;;1774959;;1774960;;1774961;;1774962;;1774964;;1774965;;1774966;;1774967;;1774969;;1774971;;1774972;;1774973;;1774975;;1774977;;1774978;;1774999;;1775000;;1775003;;1775005;;1775006;;1775009;;1775013;;1775014;;1775017;;1775024;;1775026;;1775033;;1775038;;1775040;;1775041;;1775044;;1775087;;1785544;;1811645;;1837210;;1864356;;1928674;;1928678;;1932882;;1954203;;2066856;;2076876;;2105349;;2105351;;2105458;;2105464;;2105476;;2105480;;2105482;;2105484;;2105489;;2105496;;2105500;;2105510;;2105514;;2105518;;2105532;;2105545;
;2105550;;2172257;;2172762;;218438;;2228198;;2229827;;2247909;;2262250;;2263135;;2287260;;2335872;;2335873;;2335874;;2335877;;2338682;;2352560;;2420902;;263946;;265370;;303060;;330571;;338764;;387492;;387750;;388362;;431807;;436056;;436442;;444058;;458026;;491696;;504783;;513098;;529228;;539799;;549649;;559957;;562574;;563116;;576418;;582851;;592273;;599952;;614463;;626416;;645122;;652363;;665854;;668048;;682877;;683822;;688317;;709795;;710684;;723114;;724447;;724526;;725177;;731389;;731434;;876958;;879962;;947924;;987322;;987446;;61326;;1025952;;1095970;;1338018;;1349990;;1373122;;1419930;;1760310;;1760320;;1774705;;1774706;;1774708;;1774712;;1774952;;1774954;;1774963;;1774972;;1774977;;1775077;;1901075;;2022080;;2117779;;2143723;;441554;;450517;;549649;;1010402;;113311;;1148258;;1374348;;1419930;;1606449;;1606515;;1606608;;1606610;;1760320;;1760338;;1760618;;1760642;;1774504;;1774520;;1774595;;1774705;;1774909;;1774977;;1775011;;1775043;;179542;;1928678;;2105598;;2105721;;2188303;;2335873;;340762;;387759;;436442;;504783;;588336;;646185;;682877;;715644;;725080;;741661;;760924
m<-gregexpr("[0-9]+",s)
n<-regmatches(s,m)
[[1]]
[1] "123" "123" "456" "124" "123" "567"
data.frame(table(unlist(n)))
Var1 Freq
1 123 3
2 124 1
3 456 1
4 567 1
The code works for your long form string too: Here is the head and tail of the output:
head(data.frame(table(unlist(n))),10)
Var1 Freq
1 100095 1
2 1003838 1
3 1010402 1
4 1025952 1
5 1045582 1
6 1079057 1
7 1095970 1
8 1108697 1
9 113311 1
10 1148258 1
tail(data.frame(table(unlist(n))),10)
Var1 Freq
316 731434 2
317 741661 1
318 754522 1
319 760924 1
320 7693856 1
321 876958 1
322 879962 1
323 947924 1
324 987322 1
325 987446 1
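1) In the examples the ids are all the same length, so we assume that is a general feature. Note that inside an R string the backreference must be written \\1; a bare \1 is read as an escape for the character with code 1, which is why the pattern in the question finds nothing. Try this pattern, where (?=...) is a zero-width lookahead expression (see ?regex). It uses [0-9] rather than the question's [1-9] so that ids containing a 0 also match:
pat <- ";([0-9]+);(?=.*\\1)"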
gregexpr(pat, s, perl = TRUE)
or this:
library(gsubfn)
strapply(s, pat, perl = TRUE)[[1]]
## [1] "123" "123"
This lists each id one time fewer than it occurs in s (zero times for ids that are not duplicated), so to list each duplicated id once, use unique(st), where st is the result of the last line of code above.
Note: In the second example in the question, i.e. the long string, there is no ; at the end of the string so the last id can never be matched by the expression unless we first paste a ; onto the end.
2) Instead of matching the contents we could match the delimiters:
st <- strsplit(s, ";+")[[1]]
st <- st[st != ""]
Here st is just a vector of all the ids (the second line drops the empty string produced by a leading ;), so unique(st[duplicated(st)]) lists each duplicated id once and involves no backreferences or lookahead.
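The counting approach (extract every id, then tabulate) carries over to other languages. This is a minimal Python sketch using collections.Counter; duplicated_ids is a hypothetical name, and s is the short example string from the question.

```python
import re
from collections import Counter

def duplicated_ids(s):
    """Return the ids that occur more than once, in order of first appearance."""
    counts = Counter(re.findall(r"[0-9]+", s))
    return [id_ for id_, n in counts.items() if n > 1]

s = ";123;;123;;456;;124;;123;;567;"
print(Counter(re.findall(r"[0-9]+", s)))
print(duplicated_ids(s))
```

Because the ids are extracted as whole digit runs, this does not depend on the ids having equal length or on the string ending with a ;.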

R + reshape: using colsplit w/regex

I am trying to use colsplit to break up a vector in a dataframe. The fact that we have regular expression as an arg to colsplit makes me think it can be flexible, but I am having trouble (it might just be that I'm not understanding regex in R).
Here's the problem:
let's create a vector...
> library(reshape)
> my_var_1 <- factor(c("x00_aaa_123","x00_bbb_123","x00_ccc_123","x01_aaa_123","x01_bbb_123","x01_ccc_123","x02_aaa_123","x02_bbb_123","x02_ccc_123"))
I would like to split it into two columns upon the first underscore.
In other words, I want my end result to be this...
x whatever
1 x00 aaa_123
2 x00 bbb_123
3 x00 ccc_123
4 x01 aaa_123
5 x01 bbb_123
6 x01 ccc_123
7 x02 aaa_123
8 x02 bbb_123
9 x02 ccc_123
I am trying to find the right regex inside of colsplit that will do it, but no luck. Here's the closest I can get...
> colsplit(my_var_1, split="_", c("x","whatever"))
x whatever NA.
1 x00 aaa 123
2 x00 bbb 123
3 x00 ccc 123
4 x01 aaa 123
5 x01 bbb 123
6 x01 ccc 123
7 x02 aaa 123
8 x02 bbb 123
9 x02 ccc 123
That uses the split regex as a simple delimiter and it gives me three columns. I would like to not split the second underscore (to make it worse, in my real data I have an arbitrary number of underscores not just two).
Is there an expression I can use for "split" that will give what I want?
I had hoped that the regex in colsplit would allow me to match on groups and the group matches would be the content of splits but that does not appear to be the case.
* edit (thanks to @Joshuaulrich) colsplit works "as intended" when using the newer reshape2 !!!
Your code throws an error for me:
> colsplit(my_var_1, split="_", c("x","whatever"))
Error in colsplit(my_var_1, split = "_", c("x", "whatever")) :
unused argument(s) (split = "_")
split isn't an argument to colsplit. The argument you want is pattern, or you can just rely on positional matching:
> colsplit(my_var_1, "_", c("x","whatever"))
x whatever
1 x00 aaa_123
2 x00 bbb_123
3 x00 ccc_123
4 x01 aaa_123
5 x01 bbb_123
6 x01 ccc_123
7 x02 aaa_123
8 x02 bbb_123
9 x02 ccc_123
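The underlying operation, splitting only at the first delimiter however many follow, can be sketched in Python with str.partition. This illustrates the idea only and says nothing about how colsplit is implemented; split_first is a hypothetical name.

```python
def split_first(value, sep="_"):
    """Split value at the first occurrence of sep only."""
    head, _, tail = value.partition(sep)
    return head, tail

values = ["x00_aaa_123", "x01_bbb_123", "x02_ccc_123"]
for v in values:
    print(split_first(v))
```

A value with no separator at all comes back as the whole string plus an empty second part, which may or may not be what you want for real data.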