get a particular row in R to a new table - regex

i am new to R so please guide me with this.
Below shown is a simple table called Order.
Col1 Col2 Col3
hey hi july 12,2013
hey hi june 12,2013
hey hi April 12,2013
hey hi April 14,2012
If i want to write a query such that i get this as result in a new table ie. i need to use regular expression to match for a part of string in Col3 and then count.
july june April
1 1 2
please help me if anyone knows how to do it.

You can use sub to extract the months' names and table to count the frequencies:
dat <- read.table(text = "Col1 Col2 Col3
hey hi 'july 12,2013'
hey hi 'june 12,2013'
hey hi 'April 12,2013'
hey hi 'April 14,2012'", header = TRUE)
table(sub("^(\\w+) .*", "\\1", dat$Col3))
# April july june
# 2 1 1
How does sub("^(\\w+) .*", "\\1", dat$Col3) work?
The function sub performs replacements in strings. The strings inside quotes are regular expressions. ^ is the beginning of the string, \\w is a word character, + means one or multiple. is a literal space. .* means any number of any character. The parentheses are used to create a group. The first (and only) group (\\w+) matches word characters at the beginning of the string. The second argument in the sub function, "\\1" is used to replace the whole string with the substring representing the first group. In short: the whole string is replaced by the first word.

Data:
data <- read.table(text = "Col1 Col2 Col3
hey hi 'july 12,2013'
hey hi 'june 12,2013'
hey hi 'April 12,2013'
hey hi 'April 14,2012'", header = TRUE)
An answer using dates:
#tranform data in POSIXlt
data$Col3 <- as.POSIXlt(data$Col3, format="%B %d, %Y")
## group using table with POSIXlt numbers (0 is january)
table(data$Col3$mon)
3 5 6
2 1 1
## group using table with normal month numbers
table(month(data$Col3))
4 6 7
2 1 1
## group using aggregate with POSIXlt numbers (0 is january)
aggregate(data$Col1, by=list(data[,"Col3"]$mon), length)
#result
Group.1 x
1 3 2
2 5 1
3 6 1
## group using aggregate with normal month numbers
aggregate(data$Col1, by=list(month(data$Col3)), length)
#result
Group.1 x
1 4 2
2 6 1
3 7 1
PS: whe you get data$Col3$mon in POSIXlt january is 0, so april is 3 and not 4 as you would expect. To get "normal" month numbers you should use month(data$Col3) - just realised that reading Ananda's comment.
If you want a prettier version (by Ananda Mahto):
Col3 <- as.POSIXlt(data$Col3, format="%B %d, %Y"); table(month.name[month(Col3)])
April July June
2 1 1

Related

Simpler regular expression for oracle regexp_replace

I have a | separated string with 20 |s like 123|1|42|13||94123|2983191|2|98863|...|211| upto 20 |. This is a oracle db column. The string is just 20 numbers followed by |.
I am trying to get a string out from it where I remove the numbers at position 4,6,8,9,11,12 and 13. Also, need to move the number at position 16 to position 4. Till now, I have got a regex like
select regexp_replace(col1, '^((\d*\|){4})(\d*\|)(\d*\|)(\d*\|)(\d*\|)((\d*\|){2})(\d*\|)((\d*\|){3})((\d*\|){2})(\d*\|)(.*)$', '\1|\4|\6||\9||||||||') as cc from table
This is where I get stuck since oracle only supports backreference upto 9 groups. Is there any way to make this regex simpler so it has lesser groups and can be fit into the replace? Any alternative solutions/suggestions are also welcome.
Note - The position counter begins at 0, so 123 in above string is the 0th number.
Edit: Example -
Source string
|||14444|10107|227931|10115||10118||11361|11485||10110||11512|16666|||
Expected result
|||16666|10107||10115||||||11512||||
You can get the result you want by removing capture groups for the numbers you are removing from the string anyway, and writing (for example) ((\d*\|){2}) as (\d*\|\d*\|). This reduces the number of capture groups to 7, allowing your code to work as is:
select regexp_replace(col1,
'^(\d*\|\d*\|\d*\|\d*\|)\d*\|(\d*\|)\d*\|(\d*\|)\d*\|\d*\|(\d*\|)\d*\|\d*\|\d*\|(\d*\|\d*\|)(\d*\|)(.*)$',
'\1\6\2|\3||\4|||\5|\7') as cc
from table
Output (for your test data and also #Littlefoot good column example):
CC
|||14444|16666|227931|||||11361|||||11512|||||
0|1|2|3|16|5||7|||10||||14|15||17|18|19|
Demo on dbfiddle
As there's unique column (ID, as you said), see if this helps:
split every column into rows
compose them back (using listagg) which uses 2 CASEs:
one to remove values you don't want
another to properly sort them ("put value at position 16 to position 4")
Note that my result differs from yours; if I counted it correctly, 16666 isn't at position 16 but 17 so - 11512 has to be moved to position 4.
I also added another dummy row which is here to confirm whether I counted positions correctly, and to show why you have to use lines #10-12 (because of duplicates).
OK, here you are:
SQL> with test (id, col) as
2 (
3 select 1, '|||14444|10107|227931|10115||10118||11361|11485||10110||11512|16666|||' from dual union all
4 select 2, '1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20' from dual
5 ),
6 temp as
7 (select replace(regexp_substr(col, '(.*?)(\||$)', 1, column_value), '|', '') val,
8 column_value lvl,
9 id
10 from test cross join table(cast(multiset(select level from dual
11 connect by level <= regexp_count(col, '\|') + 1
12 ) as sys.odcinumberlist))
13 )
14 select id,
15 listagg(case when lvl in (4, 6, 8, 9, 11, 12, 13) then '|'
16 else val || case when lvl = 20 then '' else '|' end
17 end, '')
18 within group (order by case when lvl = 16 then 4
19 when lvl = 4 then 16
20 else lvl
21 end) result
22 from temp
23 group by id;
ID RESULT
---------- ------------------------------------------------------------
1 |||11512|10107||10115|||||||10110|||16666|||
2 1|2|3|16|5||7|||10||||14|15||17|18|19|20
SQL>

Separating column using separate (tidyr) via dplyr on a first encountered digit

I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:
set.seed(1)
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
Desired results
Desired results should look like that:
indicator period values
1 someindicator 2001 0.2655087
2 someindicator 2011 0.3721239
3 some text 20022008 0.5728534
4 another indicator 2003 0.9082078
Characteristics
Indicator descriptions are in one column
Numeric values (counting from first digit with the first digit are in the second column)
Code
require(dplyr); require(tidyr); require(magrittr)
dta %<>%
separate(col = indicator, into = c("indicator", "period"),
sep = "^[^\\d]*(2+)", remove = TRUE)
Naturally this does not work:
> head(dta, 2)
indicator period values
1 001 0.2655087
2 011 0.3721239
Other attempts
I have also tried the default separation method sep = "[^[:alnum:]]" but it breaks down the column into too many columns as it appears to be matching all of the available digits.
The sep = "2*" also doesn't work as there are too many 2s at times (example: 20032006).
What I'm trying to do boils down to:
Identifying the first digit in the string
Separating on that charter. As a matter of fact, I would be happy to preserve that particular character as well.
I think this might do it.
library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
# indicator period values
# 1 someindicator 2001 0.2655087
# 2 someindicator 2011 0.3721239
# 3 some text 20022008 0.5728534
# 4 another indicator 2003 0.9082078
The following is an explanation of the regular expression, brought to you by regex101.
(?<=[a-z]) is a positive lookbehind - it asserts that [a-z] (match a single character present in the range between a and z (case sensitive)) can be matched
? matches the space character in front of it literally, between zero and one time, as many times as possible, giving back as needed
(?=[0-9]) is a positive lookahead - it asserts that [0-9] (match a single character present in the range between 0 and 9) can be matched
You could also use unglue::unnest() :
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_unnest(dta, indicator, "{indicator}{=\\s*}{period=\\d*}")
#> values indicator period
#> 1 0.43234262 someindicator 2001
#> 2 0.65890900 someindicator 2011
#> 3 0.93576805 some text 20022008
#> 4 0.01934736 another indicator 2003
Created on 2019-09-14 by the reprex package (v0.3.0)

Replace String B with String C if it contains (but not exactly matches) String A

I have a data frame match_df which shows "matching rules": the column old should be replaced with the colum new in the dataframes it is applied on.
old <- c("10000","20000","300ZZ","40000")
new <- c("Name1","Name2","Name3","Name4")
match_df <- data.frame(old,new)
old new
1 10000 Name1
2 20000 Name2
3 300ZZ Name3 # watch the letters
4 40000 Name4
I want to apply the matching rules above on a data frame working_df
id <- c(1,2,3,4)
value <- c("xyz-10000","20000","300ZZ-230002112","40")
working_df <- data.frame(id,value)
id value
1 1 xyz-10000
2 2 20000
3 3 300ZZ-230002112
4 4 40
My desired result is
# result
id value
1 1 Name1
2 2 Name2
3 3 Name3
4 4 40
This means that I am not looking for an exact match. I'd rather like to replace the whole string working_df$value as soon as it includes any part of the string in match_df$old.
I like the solution posted in R: replace characters using gsub, how to create a function?, but it works only for exact matches. I experimented with gsub, str_replace_all from stringr but I couldn't find a solution that works for me. There are many solutions for exact matches on SOF, but I couldn't find a comprehensible one for this problem.
Any help is highly appreciated.
I'm not sure this is the most elegant/efficient way of doing it but you could try something like this:
working_df$value <- sapply(working_df$value,function(y){
idx<-which(sapply(match_df$old,function(x){grepl(x,y)}))[1]
if(is.na(idx)) idx<-0
ifelse(idx>0,as.character(match_df$new[idx]),as.character(y))
})
It uses grepl to find, for each value of working_df, if there is a row of match_df that is partially matching and get the index of that row. If there is more than one, it takes the first one.
You need the grep function. This will return the indices of a vector that match a pattern (any pattern, not necessarily a full string match). For instance, this will tell you which of your "old" values match the "10000" pattern:
grep(match_df[1,1], working_df$value)
Once you have that information, you can look up the corresponding "new" value for that pattern, and replace it on the matching rows.
Here are 2 approaches using Map + <<- and a for loop:
working_df[["value2"]] <- as.character(working_df[["value"]])
Map(function(x, y){working_df[["value2"]][grepl(x, working_df[["value2"]])] <<- y}, old, new)
working_df
## id value value2
## 1 1 xyz-10000 Name1
## 2 2 20000 Name2
## 3 3 300ZZ-230002112 Name3
## 4 4 40 40
## or...
working_df[["value2"]] <- as.character(working_df[["value"]])
for (i in seq_along(working_df[["value2"]])) {
working_df[["value2"]][grepl(old[i], working_df[["value2"]])] <- new[i]
}

how to find consecutive repetitive characters in oracle column

Is there any way to find consecutive repetitive characters like 1414, 200200 in a varchar column of an oracle table.
how can we achieve it with regexp ?
Im failing to achieve it with regexp
im my example i can get a consecutive repetition of a number but not a pattern
select regexp_substr('4120066' ,'([[:alnum:]])\1', 7,1,'i') from dual; -- getting output as expected
select regexp_substr('6360360' ,'([[:alnum:]])\1', 7,1,'i') from dual; -- i want to select this also as i have 360 followed by 360
You should be able to use something like this:
[...] WHERE REGEXP_LIKE(field, '(\d+?)\1')
If you're looking for any repetition of characters, or:
[...] WHERE REGEXP_LIKE(field, '^(\d+?)\1$')
If you want to check the whole string in the field.
\d+? will match digits.
( ... ) will store those digits.
\1 refers to the captured digits.
Note: Change to \d to . if you are not checking digits only.
Try this :
SQL> WITH t AS
2 (SELECT '1414,200200,11,12,33,33,1234,1234' test_string
3 FROM DUAL)
4 SELECT LTRIM (SYS_CONNECT_BY_PATH (test_string, ','), ',') names
5 FROM (SELECT ROW_NUMBER () OVER (ORDER BY test_string) rno, test_string
6 FROM (SELECT DISTINCT REGEXP_SUBSTR (test_string,
7 '[^,]+',
8 1,
9 LEVEL
10 ) test_string
11 FROM t
12 CONNECT BY LEVEL <=
13 LENGTH (test_string)
14 - LENGTH (REPLACE (test_string,
15 ',',
16 NULL
17 )
18 )
19 + 1))
20 WHERE CONNECT_BY_ISLEAF = 1 AND ROWNUM = 1
21 CONNECT BY rno = PRIOR rno + 1;
NAMES
--------------------------------------------------------------------------------
11,12,1234,1414,200200,33
None of the delimited string repeat !

Use regexp_instr to get the last number in a string

If I used the following expression, the result should be 1.
regexp_instr('500 Oracle Parkway, Redwood Shores, CA','[[:digit:]]')
Is there a way to make this look for the last number in the string? If I were to look for the last number in the above example, it should return 3.
If you were using 11g, you could use regexp_count to determine the number of times that a pattern exists in the string and feed that into the regexp_instr
regexp_instr( str,
'[[:digit:]]',
1,
regexp_count( str, '[[:digit:]]')
)
Since you're on 10g, however, the simplest option is probably to reverse the string and subtract the position that is found from the length of the string
length(str) - regexp_instr(reverse(str),'[[:digit:]]') + 1
Both approaches should work in 11g
SQL> ed
Wrote file afiedt.buf
1 with x as (
2 select '500 Oracle Parkway, Redwood Shores, CA' str
3 from dual
4 )
5 select length(str) - regexp_instr(reverse(str),'[[:digit:]]') + 1,
6 regexp_instr( str,
7 '[[:digit:]]',
8 1,
9 regexp_count( str, '[[:digit:]]')
10 )
11* from x
SQL> /
LENGTH(STR)-REGEXP_INSTR(REVERSE(STR),'[[:DIGIT:]]')+1
------------------------------------------------------
REGEXP_INSTR(STR,'[[:DIGIT:]]',1,REGEXP_COUNT(STR,'[[:DIGIT:]]'))
-----------------------------------------------------------------
3
3
Another solution with less effort is
SELECT regexp_instr('500 Oracle Parkway, Redwood Shores, CA','[^[:digit:]]*$')-1
FROM dual;
this can be read as.. find the non-digits at the end of the string. and subtract 1. which will give the position of the last digit in the string..
REGEXP_INSTR('500ORACLEPARKWAY,REDWOODSHORES,CA','[^[:DIGIT:]]*$')-1
--------------------------------------------------------------------
3
which i think is what you want.
(tested on 11g)