How to create "blocks" with Regex - regex

For a project of mine, I want to create 'blocks' with Regex.
\xyz\yzx //wrong format
x\12 //wrong format
12\x //wrong format
\x12\x13\x14\x00\xff\xff //correct format
When using Regex101 to test my regular expressions, I came to this result:
([\\x(0-9A-Fa-f)])/gm
This leads to an incorrect output, because
12\x
Still gets detected as a correct string, though the order is wrong, it needs to be in the order specified below, and in no other order.
backslash x 0-9A-Fa-f 0-9A-Fa-f
Can anyone explain how that works and why it works in that way? Thanks in advance!

To match the \, folloed with x, followed with 2 hex chars, anywhere in the string, you need to use
\\x[0-9A-Fa-f]{2}
See the regex demo
To force it match all non-overlapping occurrences, use the specific modifiers (like /g in JavaScript/Perl) or specific functions in your programming language (Regex.Matches in .NET, or preg_match_all in PHP, etc.).
The ^(?:\\x[0-9A-Fa-f]{2})+$ regex validates a whole string that consists of the patterns like above. It happens due to the ^ (start of string) and $ (end of string) anchors. Note the (?:...)+ is a non-capturing group that can repeat in the string 1 or more times (due to + quantifier).
Some Java demo:
String s = "\\x12\\x13\\x14\\x00\\xff\\xff";
// Extract valid blocks
Pattern pattern = Pattern.compile("\\\\x[0-9A-Fa-f]{2}");
Matcher matcher = pattern.matcher(s);
List<String> res = new ArrayList<>();
while (matcher.find()){
res.add(matcher.group(0));
}
System.out.println(res); // => [\x12, \x13, \x14, \x00, \xff, \xff]
// Check if a string consists of valid "blocks" only
boolean isValid = s.matches("(?i)(?:\\\\x[a-f0-9]{2})+");
System.out.println(isValid); // => true
Note that we may shorten [a-zA-Z] to [a-z] if we add a case insensitive modifier (?i) to the start of the pattern, or just use \p{Alnum} that matches any alphanumeric char in a Java regex.
The String#matches method always anchors the regex by default, we do not need the leading ^ and trailing $ anchors when using the pattern inside it.

Related

Regex: Only matching at the end of String not anywhere in elastic [duplicate]

The following should be matched:
AAA123
ABCDEFGH123
XXXX123
can I do: ".*123" ?
Yes, you can. That should work.
. = any char except newline
\. = the actual dot character
.? = .{0,1} = match any char except newline zero or one times
.* = .{0,} = match any char except newline zero or more times
.+ = .{1,} = match any char except newline one or more times
Yes that will work, though note that . will not match newlines unless you pass the DOTALL flag when compiling the expression:
Pattern pattern = Pattern.compile(".*123", Pattern.DOTALL);
Matcher matcher = pattern.matcher(inputStr);
boolean matchFound = matcher.matches();
Use the pattern . to match any character once, .* to match any character zero or more times, .+ to match any character one or more times.
The most common way I have seen to encode this is with a character class whose members form a partition of the set of all possible characters.
Usually people write that as [\s\S] (whitespace or non-whitespace), though [\w\W], [\d\D], etc. would all work.
.* and .+ are for any chars except for new lines.
Double Escaping
Just in case, you would wanted to include new lines, the following expressions might also work for those languages that double escaping is required such as Java or C++:
[\\s\\S]*
[\\d\\D]*
[\\w\\W]*
for zero or more times, or
[\\s\\S]+
[\\d\\D]+
[\\w\\W]+
for one or more times.
Single Escaping:
Double escaping is not required for some languages such as, C#, PHP, Ruby, PERL, Python, JavaScript:
[\s\S]*
[\d\D]*
[\w\W]*
[\s\S]+
[\d\D]+
[\w\W]+
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegularExpression{
public static void main(String[] args){
final String regex_1 = "[\\s\\S]*";
final String regex_2 = "[\\d\\D]*";
final String regex_3 = "[\\w\\W]*";
final String string = "AAA123\n\t"
+ "ABCDEFGH123\n\t"
+ "XXXX123\n\t";
final Pattern pattern_1 = Pattern.compile(regex_1);
final Pattern pattern_2 = Pattern.compile(regex_2);
final Pattern pattern_3 = Pattern.compile(regex_3);
final Matcher matcher_1 = pattern_1.matcher(string);
final Matcher matcher_2 = pattern_2.matcher(string);
final Matcher matcher_3 = pattern_3.matcher(string);
if (matcher_1.find()) {
System.out.println("Full Match for Expression 1: " + matcher_1.group(0));
}
if (matcher_2.find()) {
System.out.println("Full Match for Expression 2: " + matcher_2.group(0));
}
if (matcher_3.find()) {
System.out.println("Full Match for Expression 3: " + matcher_3.group(0));
}
}
}
Output
Full Match for Expression 1: AAA123
ABCDEFGH123
XXXX123
Full Match for Expression 2: AAA123
ABCDEFGH123
XXXX123
Full Match for Expression 3: AAA123
ABCDEFGH123
XXXX123
If you wish to explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:
There are lots of sophisticated regex testing and development tools, but if you just want a simple test harness in Java, here's one for you to play with:
String[] tests = {
"AAA123",
"ABCDEFGH123",
"XXXX123",
"XYZ123ABC",
"123123",
"X123",
"123",
};
for (String test : tests) {
System.out.println(test + " " +test.matches(".+123"));
}
Now you can easily add new testcases and try new patterns. Have fun exploring regex.
See also
regular-expressions.info/Tutorial
No, * will match zero-or-more characters. You should use +, which matches one-or-more instead.
This expression might work better for you: [A-Z]+123
Specific Solution to the example problem:-
Try [A-Z]*123$ will match 123, AAA123, ASDFRRF123. In case you need at least a character before 123 use [A-Z]+123$.
General Solution to the question (How to match "any character" in the regular expression):
If you are looking for anything including whitespace you can try [\w|\W]{min_char_to_match,}.
If you are trying to match anything except whitespace you can try [\S]{min_char_to_match,}.
Try the regex .{3,}. This will match all characters except a new line.
[^] should match any character, including newline. [^CHARS] matches all characters except for those in CHARS. If CHARS is empty, it matches all characters.
JavaScript example:
/a[^]*Z/.test("abcxyz \0\r\n\t012789ABCXYZ") // Returns ‘true’.
I like the following:
[!-~]
This matches all char codes including special characters and the normal A-Z, a-z, 0-9
https://www.w3schools.com/charsets/ref_html_ascii.asp
E.g. faker.internet.password(20, false, /[!-~]/)
Will generate a password like this: 0+>8*nZ\\*-mB7Ybbx,b>
I work this Not always dot is means any char. Exception when single line mode. \p{all} should be
String value = "|°¬<>!\"#$%&/()=?'\\¡¿/*-+_#[]^^{}";
String expression = "[a-zA-Z0-9\\p{all}]{0,50}";
if(value.matches(expression)){
System.out.println("true");
} else {
System.out.println("false");
}

Getting words Starting with symbol in dart

I'm trying to parse in Dart long strings containing hashtags, so far I tried various combinations with regexp but I cannot find the right use.
My code is
String mytestString = "#one #two, #three#FOur,#five";
RegExp regExp = new RegExp(r"/(^|\s)#\w+/g");
print(regExp.allMatches(mytestString).toString());
The desidered output would be a list of hahstags
#one #two #three #FOur #five
Thankyou in advance
You should not use a regex literal inside a string literal, or backslashes and flags will become part of the regex pattern. Also, omit the left-hand boundary pattern (that matches start of string or whitespace) if you need to match # followed with 1+ word chars in any context.
Use
String mytestString = "#one #two, #three#FOur,#five";
final regExp = new RegExp(r"#\w+");
Iterable<String> matches = regExp.allMatches(mytestString).map((m) => m[0]);
print(matches);
Output: (#one, #two, #three, #FOur, #five)
String mytestString = "#one #two, #three#FOur,#five";
RegExp regExp = new RegExp(r"/(#\w+)/g");
print(regExp.allMatches(mytestString).toString());
This should match all of the hashtags, placing them into capture groups for you to later use.

Regex: Separate a string of characters with a non-consistent pattern (Oracle) (POSIX ERE)

EDIT: This question pertains to Oracle implementation of regex (POSIX ERE) which does not support 'lookaheads'
I need to separate a string of characters with a comma, however, the pattern is not consistent and I am not sure if this can be accomplished with Regex.
Corpus: 1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25
The pattern is basically 4 digits, followed by 4 characters, followed by a dot, followed by 1,2, or 3 digits! To make the string above clear, this is how it looks like separated by a space 1710ABCD.13 1711ABCD.43 1711ABCD.4 1711ABCD.404 1711ABCD.25
So the output of a replace operation should look like this:
1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
I was able to match the pattern using this regex:
(\d{4}\w{4}\.\d{1,3})
It does insert a comma but after the third digit beyond the dot (wrong, should have been after the second digit), but I cannot get it to do it in the right position and globally.
Here is a link to a fiddle
https://regex101.com/r/qQ2dE4/329
All you need is a lookahead at the end of the regular expression, so that the greedy \d{1,3} backtracks until it's followed by 4 digits (indicating the start of the next substring):
(\d{4}\w{4}\.\d{1,3})(?=\d{4})
^^^^^^^^^
https://regex101.com/r/qQ2dE4/330
To expand on #CertainPerformance's answer, if you want to be able to match the last token, you can use an alternative match of $:
(\d{4}\w{4}\.\d{1,3})(?=\d{4}|$)
Demo: https://regex101.com/r/qQ2dE4/331
EDIT: Since you now mentioned in the comment that you're using Oracle's implementation, you can simply do:
regexp_replace(corpus, '(\d{1,3})(\d{4})', '\1,\2')
to get your desired output:
1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
Demo: https://regex101.com/r/qQ2dE4/333
In order to continue finding matches after the first one you must use the global flag /g. The pattern is very tricky but it's feasible if you reverse the string.
Demo
var str = `1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25`;
// Reverse String
var rts = str.split("").reverse().join("");
// Do a reverse version of RegEx
/*In order to continue searching after the first match,
use the `g`lobal flag*/
var rgx = /(\d{1,3}\.\w{4}\d{4})/g;
// Replace on reversed String with a reversed substitution
var res = rts.replace(rgx, ` ,$1`);
// Revert the result back to normal direction
var ser = res.split("").reverse().join("");
console.log(ser);

Parse string using regex

I need to come up with a regular expression to parse my input string. My input string is of the format:
[alphanumeric].[alpha][numeric].[alpha][alpha][alpha].[julian date: yyyyddd]
eg:
A.A2.ABC.2014071
3.M1.MMB.2014071
I need to substring it from the 3rd position and was wondering what would be the easiest way to do it.
Desired result:
A2.ABC.2014071
M1.MMB.2014071
(?i) will be considered as case insensitive.
(?i)^[a-z\d]\.[a-z]\d\.[a-z]{3}\.\d{7}$
Here a-z means any alphabet from a to z, and \d means any digit from 0 to 9.
Now, if you want to remove the first section before dot, then use this regex and replace it with $1 (or may be \1)
(?i)^[a-z\d]\.([a-z]\d\.[a-z]{3}\.\d{7})$
Another option is replace below with empty:
(?i)^[a-z\d]\.
If the input string is just the long form, then you want everything except the first two characters. You could arrange to substitute them with nothing:
s/^..//
Or you could arrange to capture everything except the first two characters:
/^..(.*)/
If the expression is part of a larger string, then the breakdown of the alphanumeric components becomes more important.
The details vary depending on the language that is hosting the regex. The notations written above could be Perl or PCRE (Perl Compatible Regular Expressions). Many other languages would accept these regexes too, but other languages would require tweaks.
Use this regex:
\w.[A-Z]\d.[A-Z]{3}.\d{7}
Use the above regex like this:
String[] in = {
"A.A2.ABC.2014071", "3.M1.MMB.2014071"
};
Pattern p = Pattern.compile("\\w.[A-Z]\\d.[A-Z]{3}.\\d{7}");
for (String s: in ) {
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println("Result: " + m.group().substring(2));
}
}
Live demo: http://ideone.com/tns9iY

Regex to remove characters up to a certain point in a string

How do I use regex to convert
11111aA$xx1111xxdj$%%`
to
aA$xx1111xxdj$%%
So, in other words, I want to remove (or match) the FIRST grouping of 1's.
Depending on the language, you should have a way to replace a string by regex. In Java, you can do it like this:
String s = "11111aA$xx1111xxdj$%%";
String res = s.replaceAll("^1+", "");
The ^ "anchor" indicates that the beginning of the input must be matched. The 1+ means a sequence of one or more 1 characters.
Here is a link to ideone with this running program.
The same program in C#:
var rx = new Regex("^1+");
var s = "11111aA$xx1111xxdj$%%";
var res = rx.Replace(s, "");
Console.WriteLine(res);
(link to ideone)
In general, if you would like to make a match of anything only at the beginning of a string, add a ^ prefix to your expression; similarly, adding a $ at the end makes the match accept only strings at the end of your input.
If this is the beginning, you can use this:
^[1]*
As far as replacing, it depends on the language. In powershell, I would do this:
[regex]::Replace("11111aA$xx1111xxdj$%%","^[1]*","")
This will return:
aA$xx1111xxdj$%%
If you only want to replace consecutive "1"s at the beginning of the string, replace the following with an empty string:
^1+
If the consecutive "1"s won't necessarily be the first characters in the string (but you still only want to replace one group), replace the following with the contents of the first capture group (usually \1 or $1):
1+(.*)
Note that this is only necessary if you only have a "replace all" capability available to you, but most regex implementations also provide a way to replace only one instance of a match, in which case you could just replace 1+ with an empty string.
I'm not sure but you can try this
[^1](\w*\d*\W)* - match all as a single group except starting "1"(n) symbols
In Javascript
var str = '11111aA$xx1111xxdj$%%';
var patt = /^1+/g;
str = str.replace(patt,"");