delete the words with length greater than X in R - regex

In R, after removing punctuation, numbers, and non-ASCII characters, I am left with many very long words:
ques1<-gsub("[[:digit:]]"," ", ques1,perl=TRUE)
ques1<-gsub("[[:punct:]]"," ", ques1,perl=TRUE)
ques1<-iconv(ques1, "latin1", "ASCII", sub=" ")
ques1<-rm_white(ques1)
ques1
I checked that the longest word is 35 characters long using
max(nchar(strsplit(ques1, " ")[[1]]))
[1] 35
Now I want to remove the words that have more than 10 characters, since I don't want them, such as
wwwhotmailcomlearnbyexample
Please help me out !!!

Use the following gsub:
ques1 = "A long sentence with long wwwhotmailcomlearnbyexample"
gsub("\\b[[:alpha:]]{11,}\\b", "", ques1, perl=TRUE)
The \\b[[:alpha:]]{11,}\\b regex will match words with length of 11 or more (\\b is a word boundary and [:alpha:] stands for any letter).
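One thing to watch for: deleting a word leaves its neighbouring spaces behind. A minimal sketch, assuming a made-up sample sentence and using base R's trimws to tidy the result (neither is part of the original answer):

```r
# Hypothetical sample text; the long token stands in for the unwanted words.
ques1 <- "A long sentence with wwwhotmailcomlearnbyexample and short words"

# Drop every run of 11 or more letters delimited by word boundaries.
cleaned <- gsub("\\b[[:alpha:]]{11,}\\b", "", ques1, perl = TRUE)

# Collapse the doubled space left behind and trim both ends.
cleaned <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)
# Length of the longest remaining word.
print(max(nchar(strsplit(cleaned, " ")[[1]])))
```

Collapsing whitespace afterwards is a design choice; replacing the match together with one adjacent space would be another option.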

Related

How to Trim a Leading and Trailing char in regular expressions?

I have a requirement to trim a leading and trailing character of a fixed length column.
Example: the column IdNumber has a fixed length of, say, 11, with the values below:
X3343438594
7743438534X
I want to trim the leading and trailing X, and result should look like this.
3343438594
7743438534
Try this:
Search: ^X(?=\d{10}$)|(?<=^\d{10})X$
Replace: <blank>
Regex breakdown:
^X means "start of input then X"
(?=\d{10}$) means "followed by 10 digits then end"
| means "logical OR"
(?<=^\d{10}) means "preceded by start then 10 digits"
X$ means "X then end of input"
You want to delete all matches, so replace them with nothing.
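In R, the same idea might look like the sketch below. It assumes the sample values from the question (one X plus 10 digits, 11 characters in total) and a hypothetical vector name ids; perl = TRUE is needed because the pattern uses lookarounds.

```r
# Sample values from the question; `ids` is an illustrative name.
ids <- c("X3343438594", "7743438534X")

# Remove a leading X followed by exactly 10 digits,
# or a trailing X preceded by exactly 10 digits.
trimmed <- gsub("^X(?=\\d{10}$)|(?<=^\\d{10})X$", "", ids, perl = TRUE)

print(trimmed)
```

Values that do not fit either shape are returned unchanged, since gsub only touches what the pattern matches.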
var re = /(?=^X|X$)(([A-Z])(\d{10})(\s)(\d{10})([A-Z]))/;
var str = 'X3343438594 7743438534X';
var subst = '$3$4$5';
var result = str.replace(re, subst);
alert(result);
The lookahead (?=^X|X$) first asserts that an X sits at the current position, either at the very start or at the very end of the string; this check does not depend on the length of your data (not necessarily 11 characters). The rest of the pattern then matches one letter followed by 10 digits (11 characters in total), then a space, then 10 digits followed by one letter (another 11 characters). The replacement keeps only groups 3, 4, and 5, that is, the two digit runs and the space between them.

replace every other space with new line

I have strings like this:
a <- "this string has an even number of words"
b <- "this string doesn't have an even number of words"
I want to replace every other space with a new line. So the output would look like this...
myfunc(a)
# "this string\nhas an\neven number\nof words"
myfunc(b)
# "this string\ndoesn't have\nan even\nnumber of\nwords"
I've accomplished this by doing a strsplit, paste-ing a newline on even numbered words, then paste(a, collapse=" ") them back together into one string. Is there a regular expression to use with gsub that can accomplish this?
@Jota suggested a simple and concise way:
myfunc = function(x) gsub("( \\S+) ", "\\1\n", x) # Jota's
myfunc2 = function(x) gsub("([^ ]+ [^ ]+) ", "\\1\n", x) # my idea
lapply(list(a,b), myfunc)
[[1]]
[1] "this string\nhas an\neven number\nof words"
[[2]]
[1] "this string\ndoesn't have\nan even\nnumber of\nwords"
How it works. The idea of "([^ ]+ [^ ]+) " regex is (1) "find two sequences of words/nonspaces with a space between them and a space after them" and (2) "replace the trailing space with a newline".
@Jota's "( \\S+) " is trickier: it finds any word with a space before and after it and then replaces the trailing space with a newline. This works because the first word caught by this is the second word of the string, and the next word caught by it is not the third (since we have already "consumed" the space in front of the third word when handling the second word), but rather the fourth, and so on.
Oh, and some basic regex stuff.
[^xyz] means any single char except the chars x, y, and z.
\\s is a whitespace character, while \\S is any non-whitespace character
x+ means x one or more times
(x) "captures" x, allowing for reference in the replacement, like \\1
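To see the two patterns side by side, here is a small check using the strings from the question. It is only a sketch; writeLines is used so that the newlines are actually rendered rather than shown as \n.

```r
a <- "this string has an even number of words"
b <- "this string doesn't have an even number of words"

# Jota's pattern: a word with a space on both sides.
myfunc  <- function(x) gsub("( \\S+) ", "\\1\n", x)
# Two space-separated words followed by a space.
myfunc2 <- function(x) gsub("([^ ]+ [^ ]+) ", "\\1\n", x)

# Compare the two approaches on both inputs.
same_a <- identical(myfunc(a), myfunc2(a))
same_b <- identical(myfunc(b), myfunc2(b))
print(c(same_a, same_b))

# Show the result for the odd-length string with real line breaks.
writeLines(myfunc(b))
```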

R regular expression issue

I have a data frame column containing page paths:
pagePath
/text/other_text/123-some_other_txet-4571/text.html
/text/other_text/another_txet/15-some_other_txet.html
/text/other_text/25189-some_other_txet/45112-text.html
/text/other_text/text/text/5418874-some_other_txet.html
/text/other_text/text/text/some_other_txet-4157/text.html
What I want to do is extract the first number that follows a / in each row, for example 123 in the first row.
To solve this problem, I tried the following:
num = gsub("\\D", " ", mydata$pagePath)  # replace every non-digit character with a space
num1 = gsub("\\s+", " ", num)  # collapse runs of spaces into a single space
num2 = gsub("^\\s", "", num1)  # delete the leading space
my_number = gsub(" .*$", "", num2)  # keep only the first number in the string
I thought that was what I wanted, but I ran into trouble, especially with rows like the last one in the example: /text/other_text/text/text/some_other_txet-4157/text.html
So, what I really want is to extract the first number after a /.
Any help would be very welcome.
You can use the following regex with gsub:
"^(?:.*?/(\\d+))?.*$"
And replace with "\\1". See the regex demo.
Code:
> s <- c("/text/other_text/123-some_other_txet-4571/text.html", "/text/other_text/another_txet/15-some_other_txet.html", "/text/other_text/25189-some_other_txet/45112-text.html", "/text/other_text/text/text/5418874-some_other_txet.html", "/text/other_text/text/text/some_other_txet-4157/text.html")
> gsub("^(?:.*?/(\\d+))?.*$", "\\1", s, perl=T)
[1] "123" "15" "25189" "5418874" ""
The regex will match optionally (with a (?:.*?/(\\d+))? subpattern) a part of string from the beginning till the first / (with .*?/) followed with 1 or more digits (capturing the digits into Group 1, with (\\d+)) and then the rest of the string up to its end (with .*$).
NOTE that perl=T is required.
With stringr's str_extract, your code and pattern can be shortened to:
> library(stringr)
> str_extract(s, "(?<=/)\\d+")
[1] "123" "15" "25189" "5418874" NA
The str_extract will extract the first 1 or more digits if they are preceded with a / (the / itself is not returned as part of the match since it is a lookbehind subpattern, a zero width assertion, that does not put the matched text into the result).
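If adding a stringr dependency is not an option, the same lookbehind can be used with base R. This is a sketch that repeats the vector s from above so the block runs on its own; first_num is a hypothetical name.

```r
s <- c("/text/other_text/123-some_other_txet-4571/text.html",
       "/text/other_text/another_txet/15-some_other_txet.html",
       "/text/other_text/25189-some_other_txet/45112-text.html",
       "/text/other_text/text/text/5418874-some_other_txet.html",
       "/text/other_text/text/text/some_other_txet-4157/text.html")

# Locate the first run of digits directly preceded by a slash
# (regexpr returns -1 for elements without a match).
m <- regexpr("(?<=/)\\d+", s, perl = TRUE)

# regmatches() returns only the matched elements, so fill a vector of NAs.
first_num <- rep(NA_character_, length(s))
first_num[m != -1] <- regmatches(s, m)

print(first_num)
```

Pre-filling with NA keeps the result aligned with the input rows, which matters when the values go back into a data frame column.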
Try this
\/(\d+).*
Demo
Output:
MATCH 1
1. [26-29] `123`
MATCH 2
1. [91-93] `15`
MATCH 3
1. [132-137] `25189`
MATCH 4
1. [197-204] `5418874`

R Regex number followed by punctuation followed by space

Suppose I had a string like so:
x <- "i2: 32390. 2093.32: "
How would I return a vector that would give me the positions of where a number is followed by a : or a . followed by a space?
So for this string it would be
"2: ","0. ","2: "
The regex you need is just '\\d[\\.:]\\s'. Using stringr's str_extract_all to quickly extract matches:
library(stringr)
str_extract_all("i2: 32390. 2093.32: ", '\\d[\\.:]\\s')
produces
[[1]]
[1] "2: " "0. " "2: "
You can use it with R's built-in functions, and it should work fine, as well.
What it matches:
\\d matches a single digit
[ ... ] defines a character class, which matches any one of the characters listed inside it
\\. matches a literal period
: matches a colon
\\s matches a whitespace character
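Since the question also asks for positions, here is a base R sketch that returns both the start positions and the matched text. The variable names are illustrative, and the class is written as [.:] because a period needs no escaping inside brackets.

```r
x <- "i2: 32390. 2093.32: "

# gregexpr() finds every match and reports its 1-based start position.
m <- gregexpr("\\d[.:]\\s", x)

# Start positions of the matches (attributes dropped).
positions <- as.integer(m[[1]])
# The matched substrings themselves.
matches <- regmatches(x, m)[[1]]

print(positions)
print(matches)
```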

Extract 2nd to last word in string

I know how to do it in Python, but can't get it to work in R
> string <- "this is a sentence"
> pattern <- "\b([\w]+)[\s]+([\w]+)[\W]*?$"
Error: '\w' is an unrecognized escape in character string starting "\b([\w"
> match <- regexec(pattern, string)
> words <- regmatches(string, match)
> words
[[1]]
character(0)
sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', string)
#[1] "a"
This reads as follows: lazily skip over anything until reaching the sequence of some word characters, some non-word characters, some word characters, optional non-word characters, and the end of the string; then keep only the first group of word characters from that sequence.
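A quick check of the pattern on a few inputs, as a sketch. The extra test strings and the helper name second_to_last are made up for illustration, and perl = TRUE is an addition to the original call so that the lazy quantifiers are handled by PCRE. When the input has fewer than two words nothing matches, and sub returns the input unchanged.

```r
# Hypothetical helper wrapping the sub() call from the answer.
second_to_last <- function(x) {
  sub(".*?(\\w+)\\W+\\w+\\W*?$", "\\1", x, perl = TRUE)
}

r1 <- second_to_last("this is a sentence")   # plain case
r2 <- second_to_last("this is a sentence.")  # trailing punctuation
r3 <- second_to_last("word")                 # no match: input comes back as is

print(c(r1, r2, r3))
```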
Non-regex solution:
string <- "this is a sentence"
split <- strsplit(string, " ")[[1]]
split[length(split)-1]
Python non-regex version:
t = "this is a sentence"
spl = t.split(" ")
if len(spl) >= 2:  # need at least two words
    s = spl[-2]  # second-to-last word