Match a group of words using regex - regex

I have a long string in the format:
' Random Key : Random Value\n Random Long Key : Random Long Value\n...'
and so on.
I am trying to change it to
Random Value:Random Key, Random Long Value:Random Long Key,...
by using regex. I can match a single word by
\w+
but in order to match more than one word I am doing
\w+(\s\w+)*
but that's not giving me the wanted result.

You might use this piece of code to find the key-value pairs:
using System;
using System.Text.RegularExpressions;
public class Program
{
    public static void Main()
    {
        var regex = new Regex(@"\s*(?<key>(\w+\s?\w+)*)\s*:\s*(?<val>(\w+\s?\w+))\s*");
        var input = @" Key : Value\n Long Key : Long Value\n...";
        Console.WriteLine(regex.Replace(input, "${key}:${val}").Replace("\\n", ", "));
    }
}
The trick is to match "any number of (at least one word character, an optional space, and at least one more word character)", which lets keys contain spaces. Note that the shortest key without a space that this matches is two characters long.
I admit that the escaped line breaks are replaced with a plain string replacement rather than by the regex, but this way both the expression and the code stay fairly simple.

If your \n really is part of the string you can match and replace it like this:
/(?:\s*([^\\]+?)\s*:\s*([^\\]+?)\s*)+\\n/g
and substitute it with
$1:$2,
If you have it line-by-line then it's easier, because you can use multiline matching:
/^\s*(.+?)\s*:\s*(.+?)\s*$/mg
and also substitute it with
$1:$2,

None of the other answers worked fully for me, as they kept matching the space after the final word. The regex that worked as intended is
(\w+(?:\s\w+)*)
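For anyone who wants to see the whole key/value swap end to end, here is a minimal Python sketch built on the pattern above. It assumes the input contains real line breaks, and it narrows the separator between words to a space or tab (my own change) so that a word on the next line is never absorbed into the value. The name swap_pairs is made up for illustration.

```python
import re

# One or more words separated by single spaces/tabs (never a line break).
WORDS = r"\w+(?:[ \t]\w+)*"
PAIR = re.compile(rf"({WORDS})\s*:\s*({WORDS})")

def swap_pairs(text):
    """Turn 'Key : Value' pairs into 'Value:Key', joined by ', '."""
    return ", ".join(f"{value}:{key}" for key, value in PAIR.findall(text))

sample = " Random Key : Random Value\n Random Long Key : Random Long Value\n"
print(swap_pairs(sample))
# Random Value:Random Key, Random Long Value:Random Long Key
```

Because the final word is matched by \w+ rather than followed by an optional \s, the captured groups never end in a trailing space.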

Related

End a regular expression pattern with a string

I have spent some time learning regular expressions, but there is a problem I cannot solve properly.
Let's assume the following 'string' (html-extract):
"{'2018-05-02', '2018-01-05', r, '2018-07-01', '2017-07-02', '2016-07-31' random_text XYCCC Letters and 55565798 ]}"
My intention is to extract all values from '2018-05-02' up to (and excluding) random_text. I tried to achieve this by choosing the "anything but" structure [^a] (not a):
\'[^random]*
The above does not do the job, because [^random] is a character class (any character other than r, a, n, d, o or m), not the string "random", so the lone 'r' in the string cuts my extracted value short.
If there is no r in the text before the word random_text, this would work fine:
\'[^r]*
Is there any way to specify a whole string as the end of my sequence? E.g.
start: \'
repeated characters unlike string: [^{my_string}]*
Appreciate any insight :)
This regex will do the job:
'.+'(?= random)
Just replace random with the string you want to exclude at the end.
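As a rough illustration of the lookahead approach in Python: the sketch below first grabs everything from the first quote up to the last quote that is followed by the end marker, then pulls the individual quoted values out of that span. The function name values_before and the use of \s* instead of a single space before the marker are my own assumptions, not part of the answer above.

```python
import re

def values_before(text, end_marker):
    """Return the quoted values that appear before end_marker."""
    # Greedy '.+' backtracks to the last quote that is followed by the marker;
    # the lookahead keeps the marker itself out of the match.
    span = re.search(r"'.+'(?=\s*" + re.escape(end_marker) + r")", text)
    if span is None:
        return []
    return re.findall(r"'([^']*)'", span.group(0))

sample = ("{'2018-05-02', '2018-01-05', r, '2018-07-01', '2017-07-02', "
          "'2016-07-31' random_text XYCCC Letters and 55565798 ]}")
print(values_before(sample, "random_text"))
```

The stray r between the dates is skipped because only quoted values are extracted in the second step.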

Using regex to match certain text

I have been looking for an answer for a while with no luck (sorry if I can't describe it well). I am still a newbie with regex. I am trying to match a string that contains only numbers and a certain delimiter. For example, the pattern would be 8/16/32/64/...: the numbers are separated by '/', and there can be an arbitrary amount of them. I couldn't find a way to match them.
My attempt is \d+/\d+? but couldn't get it to work.
You could remove the '/' delimiter and then test whether only digits remain.
Here is some C# as an example:
using System;
using System.Text.RegularExpressions;
class Program
{
    static void Main(string[] args)
    {
        string text = "8/16/32/64/";
        Console.WriteLine(text);
        TestForNum(text);
        text = "8/16/32/64/b";
        Console.WriteLine(text);
        TestForNum(text);
        Console.ReadKey();
    }
    private static void TestForNum(string text)
    {
        // Strip the delimiter, then require that only digits remain.
        string tmp = Regex.Replace(text, @"/", "");
        Match m = Regex.Match(tmp, @"^\d+$");
        if (m.Success)
        {
            Console.WriteLine("\t" + m.Groups[0]);
        }
        else Console.WriteLine("\tno match");
    }
}
A naive approach would be
[\d/]+
However, this does match //// as well as just 12345. To match only "proper" strings:
\d+(/\d+)+
Read this as: digits, followed by (delimiter + digits) repeated at least once. If trailing/leading delimiters are allowed, then
/?(\d+/)+\d*
If you're using a flavor that uses slashes to quote the regex (like javascript), you'll need to escape them:
/\d+(\/\d+)+/
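To check the two patterns above against a few inputs, here is a small Python sketch. It uses fullmatch so the whole string has to fit the pattern, and non-capturing groups since nothing is extracted. The helper name is_delimited and its allow_outer_slashes flag are made up for this example.

```python
import re

# Digits, then (slash + digits) at least once: no leading/trailing slash.
STRICT = re.compile(r"\d+(?:/\d+)+")
# Variant that tolerates one leading and one trailing slash.
LOOSE = re.compile(r"/?(?:\d+/)+\d*")

def is_delimited(text, allow_outer_slashes=False):
    """True if text is a list of numbers separated by '/'."""
    pattern = LOOSE if allow_outer_slashes else STRICT
    return pattern.fullmatch(text) is not None

for candidate in ["8/16/32/64", "8/16/32/64/", "////", "12345", "8/16/32/64/b"]:
    print(candidate, is_delimited(candidate), is_delimited(candidate, True))
```

Python does not use slashes to delimit a regex, so no escaping of / is needed here.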
You can do:
(\d+)(\D|$)
That will match each run of digits delimited by any non-digit, so every number in 1?2!3.4 would match.
If you want a specific delimiter, such as /:
(\d+)(?:/|$)
As simple as possible:
(\d+\/?)+
One or more digits followed by an optional slash, repeated as many times as possible. You may use the g flag to get all matches.

Regex help - match words besides MD5 hashes

I can't figure out a regex that will grab every word besides MD5 hashes. I'm using [a-zA-Z0-9]+ to match every word. How do I augment that so that it ignores anything matching something like [a-fA-F0-9]{32}, which would match the MD5 hashes? My question regards Regex.
8e85d8b3be426bc8d370facdb0ad3ad0
string
stringString
63994b32affec18c2a428cdfcb0e2823
stringSTRINGSTING333
34563994b32dddddddaffec18c2a
stringSTRINGSTINGsrting
Thanks for any help. :)
This kind of thing is usually done with a negative lookahead:
/\b(?![0-9a-fA-F]{32}\b)[A-Za-z0-9]+\b/
At the beginning of each word, (?![0-9a-fA-F]{32}\b) tries to match exactly 32 hexadecimal digits followed by a word boundary. If it succeeds, the match attempt at that word fails and the word is skipped.
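Here is a short Python sketch of the negative lookahead applied to the sample data from the question. The function name non_md5_words is invented for the example; the pattern is the one above.

```python
import re

# A word that is NOT exactly 32 hex digits.
NOT_MD5 = re.compile(r"\b(?![0-9a-fA-F]{32}\b)[A-Za-z0-9]+\b")

def non_md5_words(text):
    """Return every alphanumeric word except 32-digit hex strings."""
    return NOT_MD5.findall(text)

sample = """8e85d8b3be426bc8d370facdb0ad3ad0
string
stringString
63994b32affec18c2a428cdfcb0e2823
stringSTRINGSTING333
34563994b32dddddddaffec18c2a
stringSTRINGSTINGsrting"""
print(non_md5_words(sample))
```

The 28-character hex string is kept, because the lookahead only rejects words of exactly 32 hex digits.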
The following works fine for me:
/^[a-f0-9]{8}(-)[a-f0-9]{4}(-)[a-f0-9]{4}(-)[a-f0-9]{4}(-)[a-f0-9]{12}$/i
As already said, just grab all words which do not match the MD5 hash pattern.
(First, you have to split the string.)
var s = "8e85d8b3be426bc8d370facdb0ad3ad0\nstring\nstringString\n63994b32affec18c2a428cdfcb0e2823\nstringSTRINGSTING333\n34563994b32dddddddaffec18c2a\nstringSTRINGSTINGsrting";
var words = [];
var words_all = s.split(/\s+/);
for (var i = 0; i < words_all.length; i++) {
    var word = words_all[i];
    // Keep every word that is not exactly 32 hex digits.
    if (!/^[a-fA-F0-9]{32}$/.test(word)) { words.push(word); }
}
// words = ["string", "stringString", "stringSTRINGSTING333", "34563994b32dddddddaffec18c2a", "stringSTRINGSTINGsrting"]
(assuming, according to your original code, you want to use javascript)

Match at every second occurrence

Is there a way to specify a regular expression to match every 2nd occurrence of a pattern in a string?
Examples
searching for a against string abcdabcd should find one occurrence at position 5
searching for ab against string abcdabcd should find one occurrence at position 5
searching for dab against string abcdabcd should find no occurrences
searching for a against string aaaa should find two occurrences at positions 2 and 4
Use capturing groups.
foo.*?(foo)
Use a regex like this to match all occurrences in a string. Every returned match will contain a second occurrence as its first captured group.
Here's an example that matches every second occurrence of \d+ in Python using findall:
import re
text = '10 is less than 20, 5 is less than 10'
second_occurrences = re.findall(r'\d+.*?(\d+)', text)
print(second_occurrences)
Output:
['20', '10']
Suppose the pattern you want is abc+d. You want to match the second occurrence of this pattern in a string.
You would construct the following regex:
abc+d.*?(abc+d)
This would match strings of the form <your-pattern>...<your-pattern>. Since we're using the reluctant quantifier *?, there cannot be another match of the pattern between the two. Using capturing groups, which pretty much all regex implementations provide, you would then retrieve the string in the parenthesized group, which is what you want.
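To tie this back to the examples in the question, here is a Python sketch that reports the 1-based position of every second occurrence. The function name second_occurrence_positions and the named group "second" are my own choices; the sketch assumes the pattern passed in does not itself define a group with that name.

```python
import re

def second_occurrence_positions(pattern, text):
    """1-based start positions of every 2nd, 4th, ... match of pattern."""
    # First occurrence is non-capturing; the second is captured by name so
    # that groups inside `pattern` do not shift the index.
    paired = re.compile(rf"(?:{pattern}).*?(?P<second>{pattern})", re.DOTALL)
    return [m.start("second") + 1 for m in paired.finditer(text)]

print(second_occurrence_positions("a", "abcdabcd"))    # [5]
print(second_occurrence_positions("ab", "abcdabcd"))   # [5]
print(second_occurrence_positions("dab", "abcdabcd"))  # []
print(second_occurrence_positions("a", "aaaa"))        # [2, 4]
```

Each match consumes one pair of occurrences, so the next search starts after the second one, which is what yields "every second" rather than "every occurrence after the first".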
If you're using C#, you can get all the matches at once (i.e. use Regex.Matches(), which returns a MatchCollection) and check the index of each item: index % 2 != 0 picks every second match.
If you want to find the occurrence in order to replace it, use one of the overloads of Regex.Replace() that takes a MatchEvaluator (e.g. Regex.Replace(String, String, MatchEvaluator)). Here's the code:
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = "abcdabcd";
            // Replace every second "a" with "m"
            string replacedString = Regex.Replace(
                input,
                "a",
                new SecondOccuranceFinder("m").MatchEvaluator);
            Console.WriteLine(replacedString);
            Console.Read();
        }
        class SecondOccuranceFinder
        {
            public SecondOccuranceFinder(string replaceWith)
            {
                _replaceWith = replaceWith;
                _matchEvaluator = new MatchEvaluator(IsSecondOccurance);
            }
            private string _replaceWith;
            private MatchEvaluator _matchEvaluator;
            public MatchEvaluator MatchEvaluator
            {
                get
                {
                    return _matchEvaluator;
                }
            }
            // Counts matches seen so far; every even-numbered match is replaced.
            private int _matchIndex;
            public string IsSecondOccurance(Match m)
            {
                _matchIndex++;
                if (_matchIndex % 2 == 0)
                    return _replaceWith;
                else
                    return m.Value;
            }
        }
    }
}
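For comparison, the same idea can be sketched in Python, where re.sub accepts a function in place of the replacement string, much like a MatchEvaluator. The name replace_every_second is made up for this example.

```python
import re
from itertools import count

def replace_every_second(pattern, replacement, text):
    """Replace the 2nd, 4th, ... match of pattern; leave the others as-is."""
    match_number = count(1)

    def pick(match):
        # Called once per match, in order from left to right.
        if next(match_number) % 2 == 0:
            return replacement
        return match.group(0)

    return re.sub(pattern, pick, text)

print(replace_every_second("a", "m", "abcdabcd"))  # abcdmbcd
print(replace_every_second("a", "m", "aaaa"))      # amam
```

The counter lives inside the function call, so each call to replace_every_second starts counting from the first match again.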
Would something like
(pattern.*?(pattern))*
work for you?
Edit:
The problem with this is that it uses the non-greedy operator *?, which can require an awful lot of backtracking along the string instead of just looking at each letter once. What this means for you is that this could be slow for large gaps.
Back references can find interesting solutions here. This regex:
([a-z]+).*(\1)
will find a repeated sequence of letters (the greedy + prefers the longest one at the leftmost position where a match is possible).
This one will find a sequence of 3 letters that is repeated:
([a-z]{3}).*(\1)
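A quick Python sketch of the backreference idea, in case it helps. The function name first_repeated and its length parameter are hypothetical; the pattern is the fixed-length one above, without the second capturing group since only the repeated text is returned.

```python
import re

def first_repeated(text, length=3):
    """Return the first run of `length` lowercase letters that occurs again later."""
    # \1 must match exactly the same text that group 1 captured.
    m = re.search(rf"([a-z]{{{length}}}).*\1", text)
    return m.group(1) if m else None

print(first_repeated("abcxyzabc"))   # abc
print(first_repeated("xxabcyyabc"))  # abc
print(first_repeated("abcdef"))      # None
```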
There's no "direct" way of doing so, but you can specify the pattern twice, as in a[^a]*a, which matches up to the second "a".
The alternative is to use your programming language (Perl? C#? ...) to match the first occurrence and then the second one.
EDIT: I've seen that others have responded using the "non-greedy" operators, which might be a good way to go, assuming you have them in your regex library!

Capturing a repeated group

I am attempting to parse a string like the following using a .NET regular expression:
H3Y5NC8E-TGA5B6SB-2NVAQ4E0
and return the following using Split:
H3Y5NC8E
TGA5B6SB
2NVAQ4E0
I validate each character against a specific character set (note that the letters 'I', 'O', 'U' & 'W' are absent), so using string.Split is not an option. The number of characters in each group can vary and the number of groups can also vary. I am using the following expression:
([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8}-?){3}
This will match exactly 3 groups of 8 characters each. Any more or less will fail the match.
This works insofar as it correctly matches the input. However, when I use the Split method to extract each character group, I just get the final group. RegexBuddy complains that I have repeated the capturing group itself and that I should put a capture group around the repeated group. However, none of my attempts to do this achieve the desired result. I have been trying expressions like this:
(([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?){4}
But this does not work.
Since I generate the regex in code, I could just expand it out by the number of groups, but I was hoping for a more elegant solution.
Please note that the character set does not include the entire alphabet. It is part of a product activation system. As such, any characters that can be accidentally interpreted as numbers or other characters are removed. e.g. The letters 'I', 'O', 'U' & 'W' are not in the character set.
The hyphens are optional since a user does not need to type them in, but they can be there if the user has done a copy & paste.
BTW, in .NET you can replace the [ABCDEFGHJKLMNPQRSTVXYZ0123456789] character class with a more readable character class subtraction:
[A-Z\d-[IOUW]]
If you just want to match 3 groups like that, why not repeat the pattern 3 times in your regex and use the captured groups 1, 2 and 3 to form the new string?
([A-Z\d-[IOUW]]{8})-([A-Z\d-[IOUW]]{8})-([A-Z\d-[IOUW]]{8})
In PHP I would return (I don't know .NET)
return "$1 $2 $3";
I have discovered the answer I was after. Here is my working code:
static void Main(string[] args)
{
    string pattern = @"^\s*((?<group>[ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?){3}\s*$";
    string input = "H3Y5NC8E-TGA5B6SB-2NVAQ4E0";
    Regex re = new Regex(pattern);
    Match m = re.Match(input);
    if (m.Success)
        foreach (Capture c in m.Groups["group"].Captures)
            Console.WriteLine(c.Value);
}
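A note for readers on other platforms: Python's built-in re module keeps only the last capture of a repeated group and has no counterpart to .NET's Captures collection. A rough equivalent, sketched below, validates the whole key first and then extracts the groups in a second pass. This is a swapped-in two-step technique, not the .NET approach above, and the names ALPHABET, WHOLE_KEY and split_key are made up for illustration.

```python
import re

# Allowed characters: no I, O, U or W, as in the question.
ALPHABET = "ABCDEFGHJKLMNPQRSTVXYZ0123456789"
GROUP = f"[{ALPHABET}]{{8}}"
# Exactly three 8-character groups, each optionally followed by a hyphen.
WHOLE_KEY = re.compile(rf"\s*(?:{GROUP}-?){{3}}\s*")

def split_key(text):
    """Return the three groups of a valid key, or None if it is invalid."""
    if WHOLE_KEY.fullmatch(text) is None:
        return None
    return re.findall(GROUP, text)

print(split_key("H3Y5NC8E-TGA5B6SB-2NVAQ4E0"))
print(split_key("H3Y5NC8ETGA5B6SB2NVAQ4E0"))    # hyphens are optional
print(split_key("H3Y5NC8I-TGA5B6SB-2NVAQ4E0"))  # 'I' is not allowed
```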
After reviewing your question and the answers given, I came up with this:
RegexOptions options = RegexOptions.None;
Regex regex = new Regex(@"([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})", options);
string input = @"H3Y5NC8E-TGA5B6SB-2NVAQ4E0";
MatchCollection matches = regex.Matches(input);
for (int i = 0; i != matches.Count; ++i)
{
    string match = matches[i].Value;
}
Since the "-" is optional, you don't need to include it. I am not sure what you were using the {4} at the end for. This will find the matches based on what you want; then, using the MatchCollection, you can access each match to rebuild the string.
Why use Regex? If the groups are always split by a -, can't you use Split()?
Sorry if this isn't what you intended, but if your string always has the hyphen separating the groups, couldn't you use the String.Split() method instead of a regex?
Dim stringArray As Array = someString.Split("-")
What are the defining characteristics of a valid block? We'd need to know that in order to really be helpful.
My generic suggestion: validate the charset in a first step, then split and parse in a separate method based on what you expect. If this is in a web site/app, then you can use the ASP Regex validation on the front end and break it up on the back end.
If you're just checking the value of the group, with group(i).value, then you will only get the last one. However, if you want to enumerate over all the times that group was captured, use group(2).captures(i).value, as shown below.
system.text.RegularExpressions.Regex.Match("H3Y5NC8E-TGA5B6SB-2NVAQ4E0","(([ABCDEFGHJKLMNPQRSTVXYZ0123456789]+)-?)*").Groups(2).Captures(i).Value
Mike,
You can use character set of your choice inside character group. All you need is to add "+" modifier to capture all groups. See my previous answer, just change [A-Z0-9] to whatever you need (i.e. [ABCDEFGHJKLMNPQRSTVXYZ0123456789])
You can use this pattern:
Regex.Split("H3Y5NC8E-TGA5B6SB-2NVAQ4E0", "([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?")
But you will need to filter out empty strings from resulting array.
Citation from MSDN:
If multiple matches are adjacent to one another, an empty string is inserted into the array.