remove characters from a string in a data frame

remove characters from a string in a data frame - regex

I have a data frame where column "ID" has values like these:
1234567_GSM00298873
1238416_GSM90473673
98377829
In other words, some rows have 7 digits followed by "_" followed by letters and digits; other rows have only digits.
I want to remove the digits and the underscore preceding the letters, without affecting the rows that have only digits. I tried
dataframe$ID <- gsub("*_", "", dataframe$ID)
but that only removes the underscore. So I learned that * means zero or more.
Is there a wildcard, and a repetition operator such that I can tell it to find the pattern "anything-seven-times-followed-by-_"?
Thanks!

Your regular expression syntax is incorrect. You have nothing preceding your repetition operator.
dataframe$ID <- gsub('[0-9]+_', '', dataframe$ID)
This matches one or more digits (0 to 9) that are followed by an underscore.

Something like this?:
dataframe$ID <- gsub("[0-9]+_", "", dataframe$ID)

The link http://marvin.cs.uidaho.edu/Handouts/regex.html may help you.
"[0-9]*_" will match zero or more digits followed by '_' (so it also matches a bare '_')
"[0-9]{7}_" will match exactly 7 digits followed by '_'
".{7}_" will match any 7 characters followed by '_'
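To see how these patterns differ in practice, here is a minimal Python sketch. The thread itself uses R's gsub; Python's re.sub is used here only for illustration, and the helper name remove_pattern is made up for this example.

```python
import re

ids = ["1234567_GSM00298873", "1238416_GSM90473673", "98377829"]

def remove_pattern(pattern, values):
    """Delete every match of `pattern` from each string (hypothetical helper)."""
    return [re.sub(pattern, "", v) for v in values]

# On the sample IDs all of the patterns give the same result;
# rows without an underscore are left untouched.
print(remove_pattern(r"[0-9]+_", ids))

# The patterns differ on edge cases:
print(remove_pattern(r"[0-9]*_", ["_GSM1"]))        # '*' allows zero digits, so the bare '_' is removed
print(remove_pattern(r"[0-9]+_", ["_GSM1"]))        # '+' needs at least one digit, so nothing changes
print(remove_pattern(r".{7}_", ["ABCDEFG_x"]))      # '.' accepts any character, letters included
print(remove_pattern(r"[0-9]{7}_", ["ABCDEFG_x"]))  # digits only, so nothing changes
```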

A different method. If a string has an underscore, return everything after the underscore; if not, return the string unchanged. Note that the start position 9 assumes the underscore is always the 8th character.
ID <- c("1234567_GSM00298873", "1238416_GSM90473673", "98377829")
ifelse(grepl("_", ID), substr(x = ID, 9, nchar(ID)), ID)
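The same non-regex idea can be sketched without hard-coding the position of the underscore. This is a Python illustration rather than R, and the function name after_underscore is an assumption made for the example.

```python
def after_underscore(value):
    """Return the text after the first underscore, or the value itself if there is none."""
    head, sep, tail = value.partition("_")
    # `sep` is an empty string when no underscore was found
    return tail if sep else value

ids = ["1234567_GSM00298873", "1238416_GSM90473673", "98377829"]
print([after_underscore(v) for v in ids])
```

Because partition splits at the first underscore wherever it occurs, this variant also works when the numeric prefix is not exactly 7 digits long.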

Related

Regex for a string with alpha numeric containing a '.' character

I have not been able to find a proper regex that matches a string only when it neither starts nor ends with a '.'.
This matches
AS.E
23.5
3.45
This doesn't match
.263
321.
.ASD
The string may contain alphanumeric characters and an optional '.' character, and its length must be in the range 2-4 (minimum 2 characters, maximum 4 characters).
I was able to create this one:
^[^\.][A-Z|0-9|\.]{2,4}$
but with it I could not exclude a '.' character at the end of the string.
Thanks.

Maybe not the most optimized but a working one. Created step by step:
The first character should be alphanumeric
^[a-zA-Z0-9]
Then 0, 1 or 2 characters that are alphanumeric or '.', not yet reaching the end of the string
[a-zA-Z0-9\.]{0,2}
an alphanumeric character matching end of string
[a-zA-Z0-9]$
Concatenate all of this to obtain your regex
^[a-zA-Z0-9][a-zA-Z0-9\.]{0,2}[a-zA-Z0-9]$
Edit: This regex allows multiple dots (up to 2)
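As a quick check, the concatenated regex can be run against the examples from the question. This is a minimal Python sketch; the function name is_valid is an assumption for illustration.

```python
import re

# First char alphanumeric, 0-2 middle chars (alphanumeric or '.'), last char alphanumeric
PATTERN = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9.]{0,2}[a-zA-Z0-9]$")

def is_valid(text):
    return PATTERN.match(text) is not None

for sample in ["AS.E", "23.5", "3.45", ".263", "321.", ".ASD", "A..B"]:
    print(sample, is_valid(sample))
```

The last sample, "A..B", is accepted, which illustrates the note about multiple dots being allowed.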

If I guessed correctly, you want to match all words that are
Between 2 and 4 characters long ...
... and start and end with a character from [A-Z0-9] ...
... and have characters from [A-Z0-9.] in the middle ...
... and are not preceded or followed by a '.'.
Try this regex to match all these substrings in a text:
(?<=^|[^.])[A-Z0-9][A-Z0-9.]{0,2}[A-Z0-9](?=$|[^.])
However, note that this will match the AA in '.AAAA.'. If you don't want this match, then please give more details on your requirements.
When you are only interested in the number of matches, but not the matched strings, then you could use
(^|[^.])[A-Z0-9][A-Z0-9.]{0,2}[A-Z0-9]($|[^.])
If you have one string, and want to know whether that string completely matches or not, then use
^[A-Z0-9][A-Z0-9.]{0,2}[A-Z0-9]$
If there may be at most one . inside the match, replace the part [A-Z0-9.]{0,2} with ([A-Z0-9]?[A-Z0-9.]?|[A-Z0-9.]?[A-Z0-9]?).
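The whole-string variant with at most one dot can be sketched in Python as follows. The names ONE_DOT and matches_one_dot are assumptions for this example; the only change to the pattern is that the alternation is wrapped in a non-capturing group.

```python
import re

# Whole-string check: 2-4 characters, alphanumeric at both ends,
# and at most one '.' among the (up to two) middle characters.
ONE_DOT = re.compile(
    r"^[A-Z0-9](?:[A-Z0-9]?[A-Z0-9.]?|[A-Z0-9.]?[A-Z0-9]?)[A-Z0-9]$"
)

def matches_one_dot(text):
    return ONE_DOT.match(text) is not None

for sample in ["AB", "A.B", "A.BC", "AB.C", "ABCD", "A..B", ".ABC", "ABC."]:
    print(sample, matches_one_dot(sample))
```

The two alternatives cover the dot being in either middle position, so "A..B" is rejected while "A.BC" and "AB.C" are accepted.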

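You can use this pattern, which requires that neither the first nor the last character is a '.'. Note that it matches 4 to 6 characters in total, and the first and last characters can be any character other than '.':
^[^\.][a-zA-Z0-9\.]{2,4}[^\.]$
Check the result here: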
https://regex101.com/r/8BNdDg/3

regex that allows 5-10 characters but can have spaces in-between not counting

Problem
Build a regex statement that allows the following:
minimum 5 characters
maximum 10 characters
can contain whitespace but whitespace does not increment character count
any non-whitespace characters increment character count
Test Cases:
expected_to_pass = ['testa', ' test a', 12342, 1.234, 'test a']
expected_to_fail = [' test', 'test ', ' test ', ' ', 1234, 0.1, ' ','12345678901']
Example regex statements and their purpose
Allow 5-10 non-whitespace characters:
[\S]{5,10}$
Allow 5-10 characters regardless of whitespace:
[\s\S]{5,10}$
I've been farting around with this for a few hours and cannot think of the best way to handle this.

How's this?
\s*(?:[\w\.]\s*){5,10}+$
Or:
\s*(?:[\w\.]\s*){5,10}$
Also, if ANY non-whitespace character goes:
\s*(?:\S\s*){5,10}$

There is a wrong assumption in your question: \w doesn't match all non-space-characters, it matches word characters - this means letters, digits and the underscore. Depending on language and flags set, this might include or exclude unicode letters and digits. There are a lot more non-space-characters, e.g. . and |. To match space-characters one usually uses \s, thus \S matches non-space-characters.
You can use ^\s*(?:\S\s*){5,10}$ to check your requirements. You might be able to drop the anchors, if you use some kind of full match functionality (e.g. Java .matches() or Python re.fullmatch).
Depending on the language you use, you might not want to use a regex, but iterate over the string and check character for character. This should usually be faster than regex.
Pseudocode:
number of chars = 0
for first character of string to last character of string
    if character is not space
        inc number of chars by 1
return true if number of chars between 5 and 10
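Both approaches from this answer can be sketched in Python. The function names valid_by_regex and valid_by_loop are made up for illustration, and values are converted with str() because the test cases in the question mix strings and numbers.

```python
import re

# 5 to 10 non-whitespace characters, each optionally followed by whitespace
COUNT_RE = re.compile(r"\s*(?:\S\s*){5,10}")

def valid_by_regex(value):
    # fullmatch makes the ^ and $ anchors unnecessary
    return COUNT_RE.fullmatch(str(value)) is not None

def valid_by_loop(value):
    # Count only the characters that are not whitespace
    count = sum(1 for ch in str(value) if not ch.isspace())
    return 5 <= count <= 10

for sample in ["testa", "test a", 12342, 1.234, "test ", " ", 1234, 0.1, "12345678901"]:
    print(repr(sample), valid_by_regex(sample), valid_by_loop(sample))
```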

Check this out:
(\s*?\w\s*?){5,10}$
It won't match 1.234 because . is not in the \w set.
If you need it to be included, use:
(\s*?[\w\.]\s*?){5,10}$
Do not write [\w|\.]: inside a character class, | is a literal character, so that version would also accept |.
Cheers

Need to understand Regex advanced substring, replace and split to make work with PowerShell

In forums, I have seen a signature like this:
[string](0..9 | %{[char][int](32+("......................."
).substring(($_*2), 2))}) -replace "\s{1}\b"
It gives an email address as the result.
I need to understand the substring and replace syntax here and, to be precise, how the complete expression is evaluated.
I understand that [string] is a data type. Then, for every digit from 1 to 9, the result can be an integer or a character, and the ASCII value is found, followed by the substring at the position multiplied by 2 (I don't understand why), taking only 2 digits of the result (the result will already be 2 digits, so why?). Then comes the replace, which replaces with white space, and the \b at the end signifies the word-end boundary.
Also, as mentioned in http://ss64.com/ps/syntax-regex.html, how
PS> 'ABCD' -replace "([AC])(.)",'$2-$1'
B-AD-C
results in B-AD-C.
What is the significance of the period here? I can't find an explanation that tells me how to use it. I have tried removing it, and it results in
PS> 'ABCD' -replace "([AC])",'$2-$1'
$2-AB$2-CD
Why is the period significant with capture groups in regex?
I have been at this for almost a week now but have not been able to find the exact meaning.
Any help would be appreciated on this.
Regards
Merry X'Mas

I think you are supposed to replace the dots with digits.
For each digit n from 0 to 9, pick the n'th pair of digits in the string, add 32, and convert to that Unicode character. Then remove every space that comes before a word character.
32 is the code for a space (" "), and is followed by 94 printable characters. See List of Unicode characters, Basic Latin (Wikipedia) for a list.
[string], [char], and [int] convert a value to the specified type. The % { } syntax is short for ForEach-Object { }.
The regex at the end matches one whitespace character followed by a word boundary. That is, it matches a single whitespace character, but only when it is immediately followed by a word character (A-Z, a-z, 0-9 and "_").
The full syntax is: string -replace regex, replacement. Since no replacement is specified, it defaults to the empty string. Effectively, this removes any space before a letter or digit.
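The decoding steps described above can be sketched in Python. The digit string "7269767679" is a made-up example (the real signature string is elided in the question), and decode_signature is a hypothetical name. The join with a single space mimics how PowerShell's [string] cast joins array elements by default.

```python
import re

def decode_signature(digits):
    """Decode pairs of digits as (character code - 32), then drop the joining spaces."""
    # Take two digits at a time, add 32, and convert to a character
    chars = [chr(32 + int(digits[i:i + 2])) for i in range(0, len(digits), 2)]
    # Casting an array to [string] in PowerShell joins the elements with a space by default
    joined = " ".join(chars)
    # Remove each whitespace character that is immediately followed by a word character
    return re.sub(r"\s{1}\b", "", joined)

print(decode_signature("7269767679"))  # made-up input that encodes "hello"
```

This also shows why the index is multiplied by 2: each character occupies two digits, so the n'th pair starts at position n*2.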
For the regex ([AC])(.), it will match any A or C followed by another character. It will capture the A or C into group one, and the following character into group 2.
Simple ( ) creates a capture group, and allows you to refer to the matched substring elsewhere. . is a wildcard, and matches any character except newline (U+000A).
ABCD -> {
    Match 1 = {
        Text = "AB"
        Capture 1 = { Text = "A" }
        Capture 2 = { Text = "B" }
    }
    Match 2 = {
        Text = "CD"
        Capture 1 = { Text = "C" }
        Capture 2 = { Text = "D" }
    }
}
The replacement string $2-$1 says to put the text captured into group 2 first, followed by a dash, followed by the text captured into group 1.
$1 refers to the text captured into group 1, and $2 refers to the text captured into group 2. Match 1 will be replaced by "B-A", and Match 2 by "D-C", resulting in "B-AD-C".
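The same substitution can be reproduced in Python as a minimal sketch. Note that Python's re module writes group references in the replacement as \1 and \2 instead of PowerShell's $1 and $2.

```python
import re

pattern = r"([AC])(.)"

# List the two capture groups of every match
pairs = [m.groups() for m in re.finditer(pattern, "ABCD")]
print(pairs)  # [('A', 'B'), ('C', 'D')]

# Swap the groups and put a dash between them
swapped = re.sub(pattern, r"\2-\1", "ABCD")
print(swapped)  # B-AD-C
```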
A good source for the syntax and mechanics of regex is http://www.regular-expressions.info/

Regex for {!Customobject_relateobject.name}

I don't know regex. Can you please help me write a regex for
{!Customobject_relateobject.name}
The string "Customobject_relateobject.name" can contain "_" and "." only in the middle of a word, not as the first or last character.
"{!" and "}" are mandatory.
Thanks in advance.

You can use the following regex:
\{![a-zA-Z0-9_.]*}
The regex means:
\{! - matches {! literally
[a-zA-Z0-9_.]* - 0 or more (due to *) characters that are lower- or uppercase Latin letters, digits from 0 to 9, underscore or dot
} - literal }.

^\{![a-zA-Z0-9](?:[a-zA-Z0-9._]*[a-zA-Z0-9])?\}$ if empty strings like {!} are not allowed and only Latin letters, digits, '.' and '_' should appear inside the braces, with a letter or digit first and last

I guess the word can't end with '.' or '_' or have any digit in it. So this regex will give you what you want:
\{!(([a-zA-Z]+(_|\.)?)+[a-zA-Z]+)\}
If you want to allow digits, use this regex:
\{!(([a-zA-Z0-9]+(_|\.)?)+[a-zA-Z0-9]+)\}
Don't use '\w', because it also matches '_', so the name could then start or end with an underscore or contain two in a row.
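A way to test this requirement is sketched below in Python. The pattern is a variant written for this example, not the one from the answer: it avoids the nested quantifiers and, unlike the answer's regex, also accepts a single-character name. The names NAME_RE and is_valid_reference are assumptions.

```python
import re

# '{!' + alphanumeric runs separated by single '.' or '_' characters + '}'
NAME_RE = re.compile(r"\{![a-zA-Z0-9]+(?:[._][a-zA-Z0-9]+)*\}")

def is_valid_reference(text):
    return NAME_RE.fullmatch(text) is not None

samples = [
    "{!Customobject_relateobject.name}",  # separators only in the middle
    "{!_name}",                           # starts with '_'
    "{!name.}",                           # ends with '.'
    "{!a__b}",                            # two separators in a row
    "{!}",                                # empty name
]
for sample in samples:
    print(sample, is_valid_reference(sample))
```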

Match comma after 5 characters

I have a requirement where I need to validate an input field.
It should contain alphanumeric characters, and after 5 characters there should be a comma (,).
Example:
K9,d3,dk,33,kd
I tried:
[a-zA-Z0-9]{5}[,]
but after K9, it reports that the regex pattern is not matched.

Because your regex says that the "," comma has to be after exactly 5 characters.
But you want a comma after every second char.
Try this:
[a-zA-Z0-9,]{5}
And if it is always two characters followed by one comma, try this:
([a-zA-Z0-9]{2},)+
This means two characters followed by a comma, repeated one or more times.
And last but not least, a shorter form (note that \w also matches the underscore):
(\w{2},)+
Good reference explaining regex in java: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
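To validate the whole field rather than find a matching fragment, one option is to require pairs separated by commas with no trailing comma. The sketch below uses Python instead of Java for illustration; the pattern and the name is_valid_field are assumptions based on the example input K9,d3,dk,33,kd, not part of the answer above.

```python
import re

# Two alphanumeric characters, then zero or more ",XX" groups; no trailing comma
FIELD_RE = re.compile(r"[a-zA-Z0-9]{2}(?:,[a-zA-Z0-9]{2})*")

def is_valid_field(text):
    # fullmatch requires the entire input to fit the pattern
    return FIELD_RE.fullmatch(text) is not None

for sample in ["K9,d3,dk,33,kd", "K9", "K9,", "K9d3", "K9,d"]:
    print(sample, is_valid_field(sample))
```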
