R - regexp in each string of a table of char - regex

I would like to apply a regex operation to each string of an array.
For instance, take the characters of each string that come before a '-'. The results will be stored in another array.
('Hello-1','Hi-2','Hola-3')
will give
('Hello','Hi','Hola')
Is there a way to do it in R without a loop?
Thanks!

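Based on the updated question, we can match the character '-' followed by zero or more characters up to the end of the string and replace the match with ''. Since sub is vectorised over a character vector (here assumed to be named test), no loop is needed.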
sub('-.*$', '', test)

Split string on un-escaped character in D

What is the best way to split a string on an un-escaped character?
Eg. split this (raw) string
`example string\! it is!split in two parts`
on '!', so that it produces this array:
["example string! it is", "split in two parts"]
std.regex.split seems to almost be the right thing. There is a problem though, this code matches the correct split character, but also consumes the last character on the left part.
auto text = `example string\! it is!split in two parts`;
return text.split(regex(`[^\\]!`)).map!`a.replace("\\!", "!")`.array;
The whole regex match is removed on split, so this array is the result:
["example string! it i", "split in two parts"]
What is the best way to get to the first array without iterating the string myself?
Use a negative lookbehind:
(?<!\\)\!
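For comparison, the same lookbehind idea can be sketched in Python rather than D. The helper name split_unescaped is made up for illustration. The lookbehind only checks that the '!' is not preceded by a backslash; it does not consume that character, so nothing is lost from the left part. This sketch assumes backslashes themselves are never escaped (an input like `\\!` is not handled).

```python
import re

def split_unescaped(text, sep="!"):
    """Split on sep unless it is preceded by a backslash, then unescape."""
    # (?<!\\) is a negative lookbehind: it inspects the previous character
    # without making it part of the match, so split() does not remove it.
    pattern = r"(?<!\\)" + re.escape(sep)
    parts = re.split(pattern, text)
    # Turn the escaped separator back into a plain one in each part.
    return [part.replace("\\" + sep, sep) for part in parts]

print(split_unescaped(r"example string\! it is!split in two parts"))
# ['example string! it is', 'split in two parts']
```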

split text into words and exclude hyphens

I want to split a text into its single words using regular expressions. The obvious solution would be to use the regex \\b, but unfortunately it also splits words on the hyphen.
So I am looking for an expression that does exactly the same as \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String[] b = s.split("\\b+");
for (int i = 0; i < b.length; i++) {
    System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
@Matmarbon's solution is already quite close, but not a 100% fit. It gives me:
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
For anyone who needs this for another purpose (e.g. inserting something), the following is closer to an equivalent of the \b solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
You can use this:
(?<!-)\\b(?!-)
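Both suggestions above can be tried out quickly. The following is a rough Python sketch (the question uses Java, and the function names are invented for this example). The first function splits on runs of characters that are neither word characters nor hyphens; the second splits at word boundaries that do not touch a hyphen, which assumes Python 3.7 or later, where re.split accepts patterns that can match the empty string.

```python
import re

text = ("This is my text! It uses some odd words like user-generated "
        "and need therefore a special regex.")

def split_words(s):
    # Split on runs of characters that are neither word characters nor hyphens.
    # A leading or trailing separator yields an empty string, so filter those out.
    return [w for w in re.split(r"[^\w\-]+", s) if w]

def split_at_boundaries(s):
    # Split at word boundaries that have no hyphen directly before or after.
    # Separators such as spaces and punctuation are kept as their own tokens.
    return [t for t in re.split(r"(?<!-)\b(?!-)", s) if t]

words = split_words(text)
tokens = split_at_boundaries(text)
print(words)
print(tokens)
```

The first variant drops punctuation and whitespace entirely, while the second keeps them as tokens, which is closer to what splitting on \b does.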

Use regex to find which date separator a user has inputted

I want to write a regex that will return characters in a string not equal to d, M or y.
For example:
in dd.MM.yyyy, I should get a ' . '
in dd/MM/yyyy, I should get a ' / '
Is this possible?
If you are trying to parse an input date, find the first non-numeric character:
[0-9]+([^0-9]).*
If you are trying to find the separator in a mask or template, take the first character that is not in the set d, M, y:
[dMy]+([^dMy]).*
Assuming you will always get a string in that format and casing, you could use dd(.)MM(.)yyyy. This will match the two strings above and put each separating character in a group, which you can access later.
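As a small illustration of the "first character not in the set" idea, here is a Python sketch. The question does not name a language, and find_separator is a hypothetical helper; it assumes the mask starts with one of d, M or y.

```python
import re

def find_separator(mask):
    """Return the first character of mask that is not d, M or y, or None."""
    # [dMy]+ skips the leading format letters; the group captures what follows.
    match = re.match(r"[dMy]+([^dMy])", mask)
    return match.group(1) if match else None

print(find_separator("dd.MM.yyyy"))  # .
print(find_separator("dd/MM/yyyy"))  # /
```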

How do you find all text up to the first character x on a line?

Sorry, this is probably really easy. But if you have a delimiter character on each line and you want to find all of the text before the delimiter on each line, what regular expression would do that? I don't know if the delimiter matters but the delimiter I have is the % character.
Your text will be in group 1.
/^(.*?)%/
Note: This will capture everything up to the first percent sign. If you want to limit what you capture, replace the . with the escape sequence of your choice.
In python, you can use:
def GetStuffBeforeDelimeter(str, delim):
    # Note: find() returns -1 when delim is absent, which would drop the last character.
    return str[:str.find(delim)]
In Java:
public String getStuffBeforeDelimiter(String str, String delim) {
    return str.substring(0, str.indexOf(delim));
}
In C++ (untested):
#include <string>
using namespace std;
string GetStuffBeforeDelimiter(const string& str, const string& delim) {
    return str.substr(0, str.find(delim));
}
In all the above examples you will want to handle corner cases, such as your string not containing the delimiter.
Basically I would use substring operations for something this simple because you can avoid scanning the entire string. Regex is overkill, and "exploding" or splitting on the delimiter is also unnecessary because it looks at the whole string.
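One way to cover the missing-delimiter corner case, sketched in Python: str.partition stops at the first occurrence and returns the whole string as the head when the delimiter is absent. Returning the full string in that case is an assumption made for this sketch, and before_delimiter is an invented name; you may prefer to raise an error or return an empty string.

```python
def before_delimiter(text, delim):
    """Return the part of text before the first delim, or all of text if delim is absent."""
    # partition() returns (head, separator, tail); when delim is not found,
    # head is the whole string and the other two items are empty.
    head, _sep, _tail = text.partition(delim)
    return head

print(before_delimiter("sometext%some_other_text", "%"))  # sometext
print(before_delimiter("no delimiter here", "%"))         # no delimiter here
```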
You don't say what flavor of regex, so I'll use Perl notation.
/^[^%]*/m
The first ^ is a start anchor: normally it matches only the beginning of the whole string, but this regex is in multiline mode thanks to the 'm' modifier at the end. [^%] is an inverted character class: it matches any one character except a '%'. The * is a quantifier that means to match the previous thing ([^%] in this case) zero or more times.
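The multiline pattern can be tried in Python as a rough sketch, with re.MULTILINE playing the role of Perl's 'm' modifier. One thing to watch: a negated class such as [^%] also matches newlines, so a line without any '%' would run on into the next line. The variant below adds \n to the class to keep each match on its own line; that addition is a change to the pattern above, not part of the original answer, and the sample text is made up.

```python
import re

text = "alpha%1\nbeta%2\ngamma"

# ^ anchors at the start of every line in MULTILINE mode;
# [^%\n]* then takes everything up to the first '%' or the end of the line.
prefixes = re.findall(r"^[^%\n]*", text, flags=re.MULTILINE)
print(prefixes)  # ['alpha', 'beta', 'gamma']
```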
You don't have to use regex if you don't want to. Depending on the language you are using, there will be some sort of string function such as split().
$str = "sometext%some_other_text";
$s = explode("%",$str,2);
print $s[0];
This is in PHP: it splits on % and then gets the first element of the returned array. Other languages with splitting methods work similarly.

Capturing a repeated group

I am attempting to parse a string like the following using a .NET regular expression:
H3Y5NC8E-TGA5B6SB-2NVAQ4E0
and return the following using Split:
H3Y5NC8E
TGA5B6SB
2NVAQ4E0
I validate each character against a specific character set (note that the letters 'I', 'O', 'U' & 'W' are absent), so using string.Split is not an option. The number of characters in each group can vary and the number of groups can also vary. I am using the following expression:
([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8}-?){3}
This will match exactly 3 groups of 8 characters each. Any more or less will fail the match.
This works insofar as it correctly matches the input. However, when I use the Split method to extract each character group, I just get the final group. RegexBuddy complains that I have repeated the capturing group itself and that I should put a capture group around the repeated group. However, none of my attempts to do this achieve the desired result. I have been trying expressions like this:
(([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?){4}
But this does not work.
Since I generate the regex in code, I could just expand it out by the number of groups, but I was hoping for a more elegant solution.
Please note that the character set does not include the entire alphabet. It is part of a product activation system. As such, any characters that can be accidentally interpreted as numbers or other characters are removed. e.g. The letters 'I', 'O', 'U' & 'W' are not in the character set.
The hyphens are optional since a user does not need to type them in, but they can be there if the user has done a copy & paste.
BTW, you can replace the [ABCDEFGHJKLMNPQRSTVXYZ0123456789] character class with a more readable character class subtraction (.NET syntax):
[A-Z\d-[IOUW]]
If you just want to match 3 groups like that, why not repeat this pattern 3 times in your regex and use captured groups 1, 2 and 3 to form the new string?
([A-Z\d-[IOUW]]{8})-?([A-Z\d-[IOUW]]{8})-?([A-Z\d-[IOUW]]{8})
In PHP I would return (I don't know .NET)
return "$1 $2 $3";
I have discovered the answer I was after. Here is my working code:
static void Main(string[] args)
{
    string pattern = @"^\s*((?<group>[ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?){3}\s*$";
    string input = "H3Y5NC8E-TGA5B6SB-2NVAQ4E0";
    Regex re = new Regex(pattern);
    Match m = re.Match(input);
    if (m.Success)
        foreach (Capture c in m.Groups["group"].Captures)
            Console.WriteLine(c.Value);
}
After reviewing your question and the answers given, I came up with this:
RegexOptions options = RegexOptions.None;
Regex regex = new Regex(@"([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})", options);
string input = @"H3Y5NC8E-TGA5B6SB-2NVAQ4E0";
MatchCollection matches = regex.Matches(input);
for (int i = 0; i != matches.Count; ++i)
{
    string match = matches[i].Value;
}
Since the "-" is optional, you don't need to include it. I am not sure what you were using the {4} at the end for. This will find the matches you want; then, using the MatchCollection, you can access each match to rebuild the string.
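In regex engines that keep only the last capture of a repeated group (Python's re module is one), the two ideas above can be combined: validate the whole key first, then collect every block with a second pass. This is only a sketch in Python, not .NET; split_key and its groups parameter are invented names, and the character set is copied from the question.

```python
import re

CHARSET = "ABCDEFGHJKLMNPQRSTVXYZ0123456789"  # taken from the question
BLOCK = f"[{CHARSET}]{{8}}"                   # one block of 8 allowed characters

def split_key(key, groups=3):
    """Validate key as a whole, then return its 8-character blocks (or None)."""
    # Step 1: the whole string must be `groups` blocks, each optionally
    # followed by a hyphen, with optional surrounding whitespace.
    whole = re.compile(rf"\s*(?:{BLOCK}-?){{{groups}}}\s*")
    if not whole.fullmatch(key):
        return None
    # Step 2: findall returns every non-overlapping block, not just the last.
    return re.findall(BLOCK, key)

print(split_key("H3Y5NC8E-TGA5B6SB-2NVAQ4E0"))
# ['H3Y5NC8E', 'TGA5B6SB', '2NVAQ4E0']
```

Running the validation first matters: findall on its own would happily pick blocks out of a string that is otherwise malformed.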
Why use Regex? If the groups are always split by a -, can't you use Split()?
Sorry if this isn't what you intended, but if your string always has the hyphen separating the groups, couldn't you use the String.Split() method instead of a regex?
Dim stringArray As Array = someString.Split("-")
What are the defining characteristics of a valid block? We'd need to know that in order to really be helpful.
My generic suggestion: validate the charset in a first step, then split and parse in a separate method based on what you expect. If this is in a web site/app, you can use the ASP Regex validation on the front end and then break it up on the back end.
If you're just checking the value of the group, with group(i).value, then you will only get the last one. However, if you want to enumerate over all the times that group was captured, use group(2).captures(i).value, as shown below.
system.text.RegularExpressions.Regex.Match("H3Y5NC8E-TGA5B6SB-2NVAQ4E0","(([ABCDEFGHJKLMNPQRSTVXYZ0123456789]+)-?)*").Groups(2).Captures(i).Value
Mike,
You can use character set of your choice inside character group. All you need is to add "+" modifier to capture all groups. See my previous answer, just change [A-Z0-9] to whatever you need (i.e. [ABCDEFGHJKLMNPQRSTVXYZ0123456789])
You can use this pattern:
Regex.Split("H3Y5NC8E-TGA5B6SB-2NVAQ4E0", "([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?")
But you will need to filter out empty strings from resulting array.
Citation from MSDN:
If multiple matches are adjacent to one another, an empty string is inserted into the array.