What is the regular expression to accept only text, number and backslash. It should not accept space and should start with text only. For example domain\username. Thanks in advance...
This is a regex for domain\name, with the restriction that 'domain' must start and end with a letter. You can easily adapt the regex to your needs:
/^[a-zA-Z][a-zA-Z0-9-]{1,61}[a-zA-Z]\\[a-zA-Z]{2,}$/
Domain - Beginning:
[a-zA-Z] Text
Domain - Text:
1-61 times of [a-zA-Z0-9-] Text, Numbers, '-'
Domain - End:
1 time [a-zA-Z] = Text
Backslash:
1 time \\ (an escaped backslash)
User - Text:
2-infinity times [a-zA-Z] = Text
Edit: as bgh pointed out in the comments, you could allow more valid characters:
/^[a-zA-Z][a-zA-Z0-9\-\.]{0,61}[a-zA-Z]\\\w[\w\.\- ]*$/
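As a quick check, the edited pattern can be exercised from Python. This is only a sketch: the helper name is_domain_user is made up for illustration, re.ASCII is passed so that \w behaves like it does in JavaScript, and the pattern keeps the edit's choice of allowing spaces inside the user part.

```python
import re

# Pattern from the edit above. re.ASCII restricts \w to [a-zA-Z0-9_],
# which mirrors how \w behaves in a JavaScript regex.
DOMAIN_USER_RE = re.compile(
    r"^[a-zA-Z][a-zA-Z0-9\-.]{0,61}[a-zA-Z]\\\w[\w.\- ]*$", re.ASCII
)


def is_domain_user(value: str) -> bool:
    """Hypothetical helper: True if value is a domain, a backslash and a user name."""
    # fullmatch also rejects a trailing newline, which a bare $ would tolerate.
    return DOMAIN_USER_RE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["domain\\username", "domain username", "1domain\\user"]:
        print(repr(candidate), is_domain_user(candidate))
```

Note that the domain part needs at least two characters with this pattern, because it asks for one leading and one trailing letter.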
The following regex uses named groups; it can be pasted into LINQPad and run. Note that Active Directory actually allows a lot of characters in user names: any Unicode character except for some special characters (some of which are used in LDAP searches, among other things).
Also, the English alphabet ends in Z, but Norwegian has three extra vowels: Æ, Ø and Å.
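void Main()
{
    string user = "someaddomain\\someuser99";
    // Æ, Ø and Å are listed explicitly: a range such as a-æ goes by code point,
    // so it would miss ø while letting in punctuation characters.
    var matches = Regex.Match(user, @"^(?<domain>[a-zæøåA-ZÆØÅ0-9-]+)\\(?<username>[a-zæøåA-ZÆØÅ0-9-]+)$").Dump();
    string[] comps = user.Split('\\');
    comps.Dump();
    matches.Groups["domain"].Value.Dump();
    matches.Groups["username"].Value.Dump();
}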
For programmers who have not used this development tool yet, LINQPad is available for download at https://www.linqpad.net
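For readers without LINQPad, the same named-group idea can be sketched in Python, where the syntax is (?P<name>...) instead of (?<name>...). The function name split_account is an assumption made for this example, and the Norwegian letters are listed explicitly in the character class.

```python
import re

# Two named groups separated by a single literal backslash.
ACCOUNT_RE = re.compile(
    r"(?P<domain>[a-zæøåA-ZÆØÅ0-9-]+)\\(?P<username>[a-zæøåA-ZÆØÅ0-9-]+)"
)


def split_account(account: str):
    """Hypothetical helper: return (domain, username), or None if the input does not match."""
    match = ACCOUNT_RE.fullmatch(account)
    if match is None:
        return None
    return match.group("domain"), match.group("username")


if __name__ == "__main__":
    print(split_account("someaddomain\\someuser99"))
```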
I want to have accurate form field validation for NEAR protocol account addresses.
I see at https://docs.near.org/docs/concepts/account#account-id-rules that the minimum length is 2, maximum length is 64, and the string must either be a 64-character hex representation of a public key (in the case of an implicit account) or must consist of "Account ID parts" separated by . and ending in .near, where an "Account ID part" consists of lowercase alphanumeric symbols separated by either _ or -.
Here are some examples.
The final 4 cases here should be marked as invalid (and there might be more cases that I don't know about):
example.near
sub.ex.near
something.near
98793cd91a3f870fb126f66285808c7e094afcfc4eda8a970f6648cdf0dbd6de
wrong.near.suffix (INVALID)
shouldnotendwithperiod.near. (INVALID)
space should fail.near (INVALID)
touchingDotsShouldfail..near (INVALID)
I'm wondering if there is a well-tested regex that I should be using in my validation.
Thanks.
P.S. Originally my question pointed to what I was starting with at https://regex101.com/r/jZHtDA/1 but starting from scratch like that feels unwise given that there must already be official validation rules somewhere that I could copy.
I have looked at code that I would have expected to use some kind of validation, such as these links, but I haven't found it yet:
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/utils/account.js#L8
https://github.com/near/near-wallet/blob/40512df4d14366e1b8e05152fbf5a898812ebd2b/packages/frontend/src/components/accounts/AccountFormAccountId.js#L95
https://github.com/near/near-cli/blob/cdc571b1625a26bcc39b3d8db68a2f82b91f06ea/commands/create-account.js#L75
The pre-release (v0.6.0-0) version of the JS SDK comes with a built-in accountId validation function:
const ACCOUNT_ID_REGEX =
  /^(([a-z\d]+[-_])*[a-z\d]+\.)*([a-z\d]+[-_])*[a-z\d]+$/;

/**
 * Validates the Account ID according to the NEAR protocol
 * [Account ID rules](https://nomicon.io/DataStructures/Account#account-id-rules).
 *
 * @param accountId - The Account ID string you want to validate.
 */
export function validateAccountId(accountId: string): boolean {
  return (
    accountId.length >= 2 &&
    accountId.length <= 64 &&
    ACCOUNT_ID_REGEX.test(accountId)
  );
}
https://github.com/near/near-sdk-js/blob/dc6f07bd30064da96efb7f90a6ecd8c4d9cc9b06/lib/utils.js#L113
Feel free to implement this in your program too.
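If your validation runs in Python rather than JavaScript, a port of that function could look roughly like the sketch below. The name validate_account_id is my own, and the digits are spelled out as 0-9 so that only ASCII digits are accepted. Keep in mind that this checks the syntax only: it does not require a .near suffix, so a name like wrong.near.suffix passes.

```python
import re

# Same structure as ACCOUNT_ID_REGEX: parts made of lowercase alphanumerics
# joined by single "-" or "_", and parts joined by single dots.
ACCOUNT_ID_RE = re.compile(
    r"(?:(?:[a-z0-9]+[-_])*[a-z0-9]+\.)*(?:[a-z0-9]+[-_])*[a-z0-9]+"
)


def validate_account_id(account_id: str) -> bool:
    """Hypothetical Python port of validateAccountId (syntax check only)."""
    return (
        2 <= len(account_id) <= 64
        and ACCOUNT_ID_RE.fullmatch(account_id) is not None
    )


if __name__ == "__main__":
    for name in ["example.near", "space should fail.near", "wrong.near.suffix"]:
        print(name, validate_account_id(name))
```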
Something like this should do: /^(\w|(?<!\.)\.)+(?<!\.)\.(testnet|near)$/gm
Breakdown
^ # start of line
(
\w # match alphanumeric characters
| # OR
(?<!\.)\. # dots can't be preceded by dots
)+
(?<!\.) # "." should not precede:
\. # "."
(testnet|near) # match "testnet" or "near"
$ # end of line
Try the Regex out: https://regex101.com/r/vctRlo/1
If you want to match word characters only, separated by a dot:
^\w+(?:\.\w+)*\.(?:testnet|near)$
Explanation
^ Start of string
\w+ Match 1+ word characters
(?:\.\w+)* Optionally repeat . and 1+ word characters
\. Match .
(?:testnet|near) Match either testnet or near
$ End of string
Regex demo
A slightly broader variant that matches any non-whitespace characters except the dot:
^[^\s.]+(?:\.[^\s.]+)*\.(?:testnet|near)$
Regex demo
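To try the suffix-based pattern against the examples from the question, here is a small Python sketch. It makes two assumptions that are not part of the answer above: implicit accounts are handled by a separate 64-character hex pattern, and the 2 to 64 length limit from the question is checked up front. The function name is made up.

```python
import re

# Word characters separated by single dots, ending in .testnet or .near.
# re.ASCII keeps \w limited to [a-zA-Z0-9_].
NAMED_RE = re.compile(r"\w+(?:\.\w+)*\.(?:testnet|near)", re.ASCII)
# Assumption: an implicit account is exactly 64 lowercase hex characters.
IMPLICIT_RE = re.compile(r"[0-9a-f]{64}")


def looks_like_near_address(value: str) -> bool:
    """Hypothetical helper combining the suffix pattern with an implicit-account check."""
    if not 2 <= len(value) <= 64:
        return False
    return bool(NAMED_RE.fullmatch(value) or IMPLICIT_RE.fullmatch(value))


if __name__ == "__main__":
    for name in ["sub.ex.near", "wrong.near.suffix", "touchingDotsShouldfail..near"]:
        print(name, looks_like_near_address(name))
```

Unlike a strict reading of the account rules, \w also lets uppercase letters through, so tighten the character class if that matters for your form.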
I have always used stackoverflow for solving many of my problems by searching the threads. Today I would like some guidance on creating a regex pattern for my text files. My files have headings that are varied in nature and do not follow the same naming pattern. The pattern they do follow somewhat is like this:
2.0 DESCRIPTION
3.0 PLACE OF PERFORMANCE
5.0 SERVICES RETAINED
6.0 STRUCTURE AND ROLES
etc....
It always follows a number and then capital letters or number and then spaces and then capital letters. The output I need is a list :
output = ['2.0 DESCRIPTION','3.0 PLACE OF PERFORMANCE','5.0 SERVICES RETAINED','6.0 STRUCTURE AND ROLES']
I am extremely new to python and regex. I tried the following but it did not give me the output desired:
import re
text = f'''2.0 DESCRIPTION
some text here
3.0 SERVICES
som text
5.0 SERVICES RETAINED
some text
6.0 STRUCTURE AND ROLES
sometext'''
pattern = r"\d\s[A-Z][A-Z]+"
matches = re.findall(pattern,text)
But it returned:
['0 DESCRIPTION', '0 SERVICES', '0 SERVICES']
Not the output that I was looking for. Your guidance in finding a pattern will be really appreciated.
Cheers,
Abhishek
You may use
matches = re.findall(r'^\d+(?:\.\d+)* *[A-Z][A-Z ]*$',text, re.M)
See the regex demo.
Here,
^ - start of a line (re.M redefines ^ behavior to include these positions, too)
\d+(?:\.\d+)* - 1+ digits and then 0+ sequences of a . and 1+ digits
" *" (a space followed by *) - zero or more spaces
[A-Z][A-Z ]* - an uppercase letter and then 0 or more uppercase letters or spaces
$ - end of a line.
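Put together with the sample text from the question, the pattern can be run as below. This is a minimal sketch: the wrapper find_headings is just a convenience name, not something from the answer.

```python
import re

text = """2.0 DESCRIPTION
some text here
3.0 SERVICES
some text
5.0 SERVICES RETAINED
some text
6.0 STRUCTURE AND ROLES
sometext"""

# re.M lets ^ and $ match at the start and end of every line.
HEADING_RE = re.compile(r"^\d+(?:\.\d+)* *[A-Z][A-Z ]*$", re.M)


def find_headings(document: str) -> list:
    """Return every line that is a section number followed by an all-caps title."""
    return HEADING_RE.findall(document)


if __name__ == "__main__":
    print(find_headings(text))
```

Because the pattern has no capturing groups, findall returns the whole matched lines.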
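import re

import pdfplumber

pdfToString = ""
with pdfplumber.open(r"sample.pdf") as pdf:
    for page in pdf.pages:
        # Extract each page once; fall back to "" in case a page yields no text
        page_text = page.extract_text() or ""
        print(page_text)
        pdfToString += page_text

# Each match is a numbered heading plus every following line up to the next numbered heading
matches = re.findall(r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*', pdfToString, re.M)
for i in matches:
    # Keep only the sections whose first 50 characters contain the wanted heading text
    if "word_to_extract" in i[:50]:
        print(i)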
This solution extracts all headings that have the same format as the headings in the question, then prints the required heading together with the paragraphs that follow it.
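The section-matching regex can also be tried without a PDF at hand. The sketch below runs it on a plain string; the sample text and the helper name find_sections are invented for the example, and the keyword is looked up in the heading line only.

```python
import re

# A numbered heading line, then every following line that does not
# itself start with a section number and a space.
SECTION_RE = re.compile(
    r"^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*", re.M
)


def find_sections(document: str, keyword: str) -> list:
    """Hypothetical helper: return the sections whose heading line contains keyword."""
    sections = SECTION_RE.findall(document)
    return [s for s in sections if keyword in s.splitlines()[0]]


sample = "1.0 INTRO\nintro body\nmore intro\n2.0 SCOPE\nscope body"

if __name__ == "__main__":
    for section in find_sections(sample, "SCOPE"):
        print(section)
```

One caveat of this approach: a body line that happens to start with a number and a space is treated as a new heading.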
I am trying to extract just the emails from a text column in OpenRefine. Some cells have just the email, but others have the name and email in john doe <john@doe.com> format. I have been using the following GREL/regex, but it does not return the entire email address. For the above example I'm getting ["n@doe.com"]
value.match(
/.*([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+).*/
)
Any help is much appreciated.
The n is captured because you are using .* before the capturing group: since it greedily matches any 0+ chars other than line break chars, the only char that can land in Group 1 during backtracking is the char right before @.
If partial matches are acceptable, get rid of the .* and use
/[^<\s]+@[^\s>]+/
See the regex demo
Details
[^<\s]+ - 1 or more chars other than < and whitespace
@ - a @ char
[^\s>]+ - 1 or more chars other than whitespace and >.
Python/Jython implementation:
import re
res = ''
m = re.search(r'[^<\s]+@[^\s>]+', value)
if m:
    res = m.group(0)
return res
There are other ways to match these strings. In case you need a full string match, use .*<([^<]+@[^>]+)>.* where the leading .* cannot swallow part of the address, since the group only starts right after an obligatory <.
If some cells contain just the email, it's probably better to use @wiktor-stribiżew's partial match. In the development version of OpenRefine, there is now a value.find() function that can do this, but it will only be officially implemented in the next version (2.9). In the meantime, you can reproduce it using Python/Jython instead of GREL:
import re
return re.findall(r"[^<\s]+@[^\s>]+", value)[0]
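Outside OpenRefine, the partial-match approach can be tested as ordinary Python. The sketch below wraps it in a function (the name extract_email is an assumption) and returns an empty string for cells without an address, so that indexing an empty result list is never an issue.

```python
import re

# Anything without "<" or whitespace, an "@", then anything without whitespace or ">".
EMAIL_RE = re.compile(r"[^<\s]+@[^\s>]+")


def extract_email(cell: str) -> str:
    """Hypothetical helper: first email-like token in the cell, or "" if there is none."""
    match = EMAIL_RE.search(cell)
    return match.group(0) if match else ""


if __name__ == "__main__":
    for cell in ["john doe <john@doe.com>", "jane@doe.org", "no address here"]:
        print(repr(extract_email(cell)))
```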
need some help from a regex jedi master:
If I have a string of mb chars (specifically, Japanese, Korean or Chinese) with English words sprinkled throughout, I would like to count:
asian characters as 1 per single char
english "words" (no dictionary check needed - just a string of consecutive english letters) as a single char.
English only is fine - don't worry about special spanish, swedish, etc. chars.
I am searching for a regex pattern I can use to count these strings, that will function in php and js.
Example:
これは猫です、けどKittyも大丈夫。
should count as 13 chars.
thanks for your help!
jeff
Whatever you are trying to achieve, this will help you:
To count only Hiragana+Katakana+Kanji (Japanese) Chars (excluding punctuation marks):
var x = "これは猫です、けどKittyも大丈夫。";
x.match(/[ぁ-ゖァ-ヺー一-龯々]/g).length; //Result: 12 : これは猫ですけども大丈夫
Updated:
To count only words in Alphabet:
x.match(/\w+/g).length; //Result: 1 : "Kitty"
All in one line (as function):
function myCount(str) {
    return str.match(/[ぁ-ゖァ-ヺー一-龯々]|\w+/g).length;
}
alert(myCount("これは猫です、けどKittyも大丈夫。")); //13
alert(myCount("これは犬です。DogとPuppyもOKですね!")); //14
These are the arrays returned by match:
["こ", "れ", "は", "猫", "で", "す", "け", "ど", "Kitty", "も", "大", "丈", "夫"]
["こ", "れ", "は", "犬", "で", "す", "Dog", "と", "Puppy", "も", "OK", "で", "す", "ね"]
Updated (JAP, KOR, CH):
function myCount(str) {
    return str.match(/[ぁ-ㆌㇰ-䶵一-鿃々가-힣-豈ヲ-ン]|\w+/g).length;
}
These will cover around 99% of the Japanese, Chinese and Korean. You may need to manually add extra characters that are not included such as "〶".
A very good reference is:
http://www.tamasoft.co.jp/en/general-info/unicode.html
This should solve your question.
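The question asks for PHP and JS, but the counting logic is easy to sanity-check in Python as well. The sketch below is a port of the Japanese-only version. One deliberate difference: Python's \w also matches Japanese characters, so English words are matched with an explicit [A-Za-z]+ instead. The function name is made up.

```python
import re

# One Japanese character (hiragana, katakana, long vowel mark, kanji, 々),
# or one run of consecutive English letters.
TOKEN_RE = re.compile(r"[ぁ-ゖァ-ヺー一-龯々]|[A-Za-z]+")


def count_units(text: str) -> int:
    """Count each Japanese character as 1 and each English word as 1."""
    return len(TOKEN_RE.findall(text))


if __name__ == "__main__":
    print(count_units("これは猫です、けどKittyも大丈夫。"))
    print(count_units("これは犬です。DogとPuppyもOKですね!"))
```

Using findall and taking its length also avoids the error that str.match(...).length raises in JS when there is no match at all.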
OK, so I would do two runs: First count the occurrences of the English words and then of the Asian ones. This is a JS example, it might be different in PHP. In JS, only ASCII chars match \w.
string = "これは猫です、けどKittyも大丈夫";
var m = string.match(/\w+/gm);
var e_count = m.length; // is 1
Next count the Asian chars.
m = string.match(/([^\w\s\d])/gm); // any non-whitespace, non-word, non-digit chars
var a_count = m.length; // is 13 (this also counts the punctuation mark 、)
You might have to tweak it a bit. But in JS, you can add up e_count and a_count, and you should be good to go.
Also check out Rubular: http://www.rubular.com
Johannes
Something like /[[:ascii:]]+|./ will match one non-ASCII character or one or more ASCII characters. The problem is that it'll give 15. So it seems that you want to ignore punctuation. So possibly: /[A-Za-z]+|[^[:punct:]]/
$ perl -E 'use utf8; $f = "これは猫です、けどKittyも大丈夫。"; ++$c while $f =~ /[A-Za-z]+|[^[:punct:]]/g; say $c'
13
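So, that works in Perl at least. It should carry over to PHP as well, provided its [[:punct:]] understands Unicode. JavaScript regexes have no POSIX classes such as [[:punct:]], so there you would need a Unicode property escape like \p{P} together with the u flag.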
The alternative approach is to filter out stuff instead.
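One way to read "filter out stuff" is sketched below in Python: collapse every English word into a single placeholder character, drop whitespace and anything Unicode classifies as punctuation, and count what is left. This is my interpretation of the idea, not code from the answer, and the function name is invented.

```python
import re
import unicodedata


def count_after_filtering(text: str) -> int:
    """Hypothetical helper: count characters after collapsing words and dropping punctuation."""
    # Each run of English letters becomes one placeholder character.
    collapsed = re.sub(r"[A-Za-z]+", "W", text)
    kept = [
        ch
        for ch in collapsed
        # Unicode general categories starting with "P" are punctuation.
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    ]
    return len(kept)


if __name__ == "__main__":
    print(count_after_filtering("これは猫です、けどKittyも大丈夫。"))
```

Digits and symbols are still counted one by one here, so adjust the filter if they should be ignored too.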
I am attempting to parse a string like the following using a .NET regular expression:
H3Y5NC8E-TGA5B6SB-2NVAQ4E0
and return the following using Split:
H3Y5NC8E
TGA5B6SB
2NVAQ4E0
I validate each character against a specific character set (note that the letters 'I', 'O', 'U' & 'W' are absent), so using string.Split is not an option. The number of characters in each group can vary and the number of groups can also vary. I am using the following expression:
([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8}-?){3}
This will match exactly 3 groups of 8 characters each. Any more or less will fail the match.
This works insofar as it correctly matches the input. However, when I use the Split method to extract each character group, I just get the final group. RegexBuddy complains that I have repeated the capturing group itself and that I should put a capture group around the repeated group. However, none of my attempts to do this achieve the desired result. I have been trying expressions like this:
(([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?){4}
But this does not work.
Since I generate the regex in code, I could just expand it out by the number of groups, but I was hoping for a more elegant solution.
Please note that the character set does not include the entire alphabet. It is part of a product activation system. As such, any characters that can be accidentally interpreted as numbers or other characters are removed. e.g. The letters 'I', 'O', 'U' & 'W' are not in the character set.
The hyphens are optional since a user does not need to type them in, but they can be there if the user has done a copy & paste.
BTW, in .NET you can replace the [ABCDEFGHJKLMNPQRSTVXYZ0123456789] character class with a more readable character class subtraction:
[A-Z\d-[IOUW]]
If you just want to match 3 groups like that, why don't you use this pattern 3 times in your regex and just use the captured subgroups 1, 2 and 3 to form the new string? Note that the {8} goes inside each group, so that the group captures all 8 characters rather than only the last one.
([A-Z\d-[IOUW]]{8})-([A-Z\d-[IOUW]]{8})-([A-Z\d-[IOUW]]{8})
In PHP I would return (I don't know .NET)
return "$1 $2 $3";
I have discovered the answer I was after. Here is my working code:
static void Main(string[] args)
{
    string pattern = @"^\s*((?<group>[ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?){3}\s*$";
    string input = "H3Y5NC8E-TGA5B6SB-2NVAQ4E0";
    Regex re = new Regex(pattern);
    Match m = re.Match(input);
    if (m.Success)
        foreach (Capture c in m.Groups["group"].Captures)
            Console.WriteLine(c.Value);
}
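For comparison, Python's re module does not keep every capture of a repeated group the way .NET's Captures collection does; only the last one survives. A rough Python equivalent therefore validates the whole key first and then pulls out the groups with findall. The names below are assumptions made for this sketch.

```python
import re

CHARSET_CLASS = "[ABCDEFGHJKLMNPQRSTVXYZ0123456789]"
# Whole key: exactly 3 groups of 8 allowed characters, each optionally followed by "-".
KEY_RE = re.compile(r"\s*(?:" + CHARSET_CLASS + r"{8}-?){3}\s*")
# A single group of 8 allowed characters.
GROUP_RE = re.compile(CHARSET_CLASS + r"{8}")


def split_key(key: str):
    """Hypothetical helper: return the list of groups, or None if the key is invalid."""
    if KEY_RE.fullmatch(key) is None:
        return None
    return GROUP_RE.findall(key)


if __name__ == "__main__":
    print(split_key("H3Y5NC8E-TGA5B6SB-2NVAQ4E0"))
```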
After reviewing your question and the answers given, I came up with this:
RegexOptions options = RegexOptions.None;
Regex regex = new Regex(@"([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})", options);
string input = @"H3Y5NC8E-TGA5B6SB-2NVAQ4E0";
MatchCollection matches = regex.Matches(input);
for (int i = 0; i != matches.Count; ++i)
{
    string match = matches[i].Value;
}
Since the "-" is optional, you don't need to include it. I am not sure what you were using the {4} at the end for. This will find the matches based on what you want; then, using the MatchCollection, you can access each match to rebuild the string.
Why use Regex? If the groups are always split by a -, can't you use Split()?
Sorry if this isn't what you intended, but if your string always has the hyphen separating the groups, couldn't you use the String.Split() method instead of a regex?
Dim stringArray As String() = someString.Split("-"c)
What are the defining characteristics of a valid block? We'd need to know that in order to really be helpful.
My generic suggestion: validate the charset in a first step, then split and parse in a separate method based on what you expect. If this is in a web site/app, then you can use the ASP Regex validation on the front end and break it up on the back end.
If you're just checking the value of the group, with group(i).value, then you will only get the last one. However, if you want to enumerate over all the times that group was captured, use group(2).captures(i).value, as shown below.
system.text.RegularExpressions.Regex.Match("H3Y5NC8E-TGA5B6SB-2NVAQ4E0","(([ABCDEFGHJKLMNPQRSTVXYZ0123456789]+)-?)*").Groups(2).Captures(i).Value
Mike,
You can use a character set of your choice inside the character class. All you need is to add the "+" quantifier after it so that each group is captured whole. See my previous answer, just change [A-Z0-9] to whatever you need (i.e. [ABCDEFGHJKLMNPQRSTVXYZ0123456789])
You can use this pattern:
Regex.Split("H3Y5NC8E-TGA5B6SB-2NVAQ4E0", "([ABCDEFGHJKLMNPQRSTVXYZ0123456789]+)-?")
But you will need to filter out empty strings from resulting array.
Citation from MSDN:
If multiple matches are adjacent to one another, an empty string is inserted into the array.
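Python's re.split behaves in a similar way when the pattern contains a capturing group: the captured text is kept in the result, and empty strings appear between adjacent matches. The sketch below shows the filtering step; the helper name is made up, and unlike a full match this approach does not validate the key.

```python
import re

# The capturing group makes split() keep each 8-character group in its output.
KEY_GROUP_RE = re.compile(r"([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?")


def split_groups(key: str) -> list:
    """Hypothetical helper: split the key on its groups and drop the empty strings."""
    return [part for part in KEY_GROUP_RE.split(key) if part]


if __name__ == "__main__":
    print(KEY_GROUP_RE.split("H3Y5NC8E-TGA5B6SB-2NVAQ4E0"))
    print(split_groups("H3Y5NC8E-TGA5B6SB-2NVAQ4E0"))
```

Any text that is not part of a valid group stays in the result as a leftover piece, which is one way to spot invalid input.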