regex for hashtags - regex

I have found a lot of regex examples for retrieving hashtags from text. Unfortunately, none of the examples does what I need.
This is almost what I need but...
function hashtags(text) {
return text.replace(/(^|\s)#(\w*[a-zA-Z]+\w{2,50})/g,
"$1<a href='/h/$2' target='_blank'>#$2</a>");
}
Hashtags cannot start with a number, to avoid situations where, for example, Section #12 gets hashtagged.
The example above checks for that, but it does not allow characters like ÁÉÍÚ, it does not check the hashtag length correctly, and it does not allow the '-' character.
So, I need the following:
1. A hashtag may start with any letter (A, z, B, Ñ, ó, Ú, etc.), but not with a number and not with a special sign such as & % $ or - _
2. The total length of a hashtag must be 3-50 characters. The regex must accept only full words as hashtags and must not cut them off after the first 50 characters. So words that start with # but contain more than 50 characters must be ignored, rather than having their first 50 characters converted into a hashtag link. In my example, {2,50} does not work correctly.
3. The rest of a hashtag (once it is established that it does not start with a number or a special sign) may contain numbers, any letters, and the _ and - signs. \w allows _ but not -
Is it possible?

For 1 - you need a character class. You can define these with square brackets. PCRE defines \w but that includes numbers too.
For 2 - You can either have a word followed by 'some whitespace' (PCRE: \s) or use a lookaround pattern such as (?![A-Z0-9]), meaning 'not followed by this'.
And for 3 - non-whitespace may be what you want - \S in PCRE definitions.
/(?<!\w)#[A-Z]\S{1,49}(?!\w)/i
Edit: Given that this may be JavaScript-specific and you can't use lookbehind, the above may not work for you. If you are tying your regex query to a particular language, it is useful to specify that constraint in the question.
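Where lookbehind is not available, the (?<!\w) check can be approximated by consuming the preceding character instead. The following is only a sketch of that idea in JavaScript; the helper name extractHashtags is illustrative and not part of the answer. (Engines that implement ES2018 or later also accept the lookbehind form directly.)

```javascript
// Same idea as the pattern above, but the lookbehind (?<!\w) is replaced by
// a consumed prefix: start of string or one non-word character.
const TAG = /(?:^|\W)#([A-Z]\S{1,49})(?!\w)/gi;

// Hypothetical helper: returns the tag names without the leading '#'.
function extractHashtags(text) {
  return Array.from(text.matchAll(TAG), (m) => m[1]);
}

// "a#bar" is skipped because the '#' is preceded by a word character.
console.log(extractHashtags("intro #Foo and a#bar then #x1"));
```

Because the prefix is consumed, it becomes part of the overall match, so the tag is read from capture group 1 rather than from the whole match.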

Try this one:
/(^|\s)#([^\d&%$_-]\S{2,49})\b/g
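Explanation:
(^|\s) # start of the string or a whitespace character
#([^\d&%$_-] # first character: none of the characters you mentioned
\S{2,49}) # 2 to 49 more non-whitespace characters, since the first character was already matched
\b # a word boundary, so a longer word is not cut off at 50 characters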
Hope it helps.
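Note that [^\d&%$_-] still admits other non-letters in the first position. Where Unicode property escapes are available (JavaScript supports them with the u flag since ES2018), the three requirements from the question can be spelled out more directly. This is a sketch under assumptions: linkHashtags is an illustrative name, the /h/ link format is taken from the question, and the 3-50 limit is assumed to count the characters after the #.

```javascript
// (^|\s)                start of text or whitespace, kept via $1
// \p{L}                 first character: any Unicode letter
// [\p{L}\p{N}_-]{2,49}  2 to 49 more letters, digits, '_' or '-'
// (?![\p{L}\p{N}_-])    the next character must not continue the tag,
//                       so over-long tags are skipped, not truncated
const HASHTAG = /(^|\s)#(\p{L}[\p{L}\p{N}_-]{2,49})(?![\p{L}\p{N}_-])/gu;

function linkHashtags(text) {
  return text.replace(HASHTAG, "$1<a href='/h/$2' target='_blank'>#$2</a>");
}

console.log(linkHashtags("Section #12 mentions #Ñandú-2024"));
```

The negative lookahead plays the role of \b here: \b only knows about ASCII word characters, so it would treat an accented letter or a hyphen as a boundary.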

Related

Regex: Exclude numbers when matching hashtags

I have the following regex /#(\w+)/g which I am using to identify hashtags in a video description. This works, but it also picks up numbered list references such as #2. How can I exclude these while still detecting hashtags?
Here is a more detailed example of what I want to include and exclude:
https://regex101.com/r/PGsAfh/5
You can use this regex:
#\w*[a-zA-Z]\w*
It basically means that after the #, you can have any word character you like \w*, but there's gotta be a letter [a-zA-Z] somewhere. I have used * to allow the letter to appear at the start and end of the hashtag, and I have put \w* on both sides to allow numbers to be at the start and end of the hashtag.
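A quick way to check the behaviour in JavaScript; the helper name findHashtags and the sample text are made up for illustration:

```javascript
// At least one ASCII letter is required somewhere after the '#'.
const HASHTAG = /#\w*[a-zA-Z]\w*/g;

// Hypothetical helper: returns every hashtag found, or an empty array.
function findHashtags(description) {
  return description.match(HASHTAG) || [];
}

// "#2" has no letter and is skipped; "#3d" and "#2024goals" are kept.
console.log(findHashtags("Step #2: see #3d #travel #2024goals"));
```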

Regex that excludes spaces and requires 2 capital letters or more

I'm trying to create a regular expression that matches strings with:
19 to 90 characters
symbols
at least 2 uppercase alphabetical characters
lowercase alphabetical characters
no spaces
I already know that for the size and space exclusion the regex would be:
^[^ ]{19,90}$
And I know that this one will match any string with at least 2 uppercase characters:
^(.*?[A-Z]){2,}.*$
What I don't know is how to combine them. There is no context for the strings.
Edit: I forgot to say that it is better if the regex excludes strings that end with .com, .jpeg, .png, or any .something (that "something" being 2-5 characters long).
This regex should do what you want.
^(?=(?:\w*\W+)+\w*$)(?=(?:\S*?[A-Z]){2,}\S*?$)(?=(?:\S*?[a-z])+\S*?$)(?!.*?\.\w{2,5}$).{19,90}$
Basically it uses three positive lookaheads and a negative lookahead to guarantee the conditions that you specified:
(?=(?:\w*\W+)+\w*$)
ensures that there is at least one non-word (symbol) character
(?=(?:\S*?[A-Z]){2,}\S*?$)
ensures that there are at least two uppercase characters, and also excludes a match if there are any spaces in the string
(?=(?:\S*?[a-z])+\S*?$)
ensures that there is at least one lowercase character in the string. The negative lookahead
(?!.*?\.\w{2,5}$)
ensures that strings that end with a . and 2-5 characters are excluded
Finally,
.{19,90}
performs the actual match and ensures that there are between 19 and 90 characters.
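If a single pattern becomes hard to maintain, the same conditions can also be checked one at a time. The sketch below is one possible JavaScript version, not part of the answer above; the function name is illustrative, and length is counted in UTF-16 code units, which only differs from the regex for characters outside the Basic Multilingual Plane.

```javascript
// Hypothetical helper: one small test per requirement from the question.
function isAcceptable(s) {
  const uppercaseCount = (s.match(/[A-Z]/g) || []).length;
  return (
    s.length >= 19 && s.length <= 90 && // 19 to 90 characters
    !/\s/.test(s) &&                    // no whitespace
    /\W/.test(s) &&                     // at least one symbol (non-word character)
    uppercaseCount >= 2 &&              // at least two uppercase letters
    /[a-z]/.test(s) &&                  // at least one lowercase letter
    !/\.\w{2,5}$/.test(s)               // does not end like a file extension
  );
}

console.log(isAcceptable("Abcdefghij-Klmnopqrstu"));
console.log(isAcceptable("Abcdefghij-Klmnopqr.png"));
```

Separate checks also make it easy to report which rule failed, which a single combined regex cannot do.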
Following your requirements, I suggest using the following pattern:
^(?=.*[a-z])(?=.*[A-Z].*[A-Z])(?=.*[^\s]).{19,90}$
Instead of just excluding spaces, I used \s, since you probably don't want to allow tabs, newlines, etc. either. However, it is still unclear which symbols you want to allow, e.g. [a-zA-Z!"§$%&\/()=?+]
^(?=.*[a-z])(?=.*[A-Z].*[A-Z])(?=.*[^\s])(?=[a-zA-Z!"§$%&\/()=?+]).{19,90}$
To match your additional requirement not to match file-like extensions at the end of the string, add a negative look-ahead: (?!.*\.\w{2,5}$)
^(?=.*[a-z])(?=.*[A-Z].*[A-Z])(?=.*[^\s])(?=[a-zA-Z!"§$%&\/()=?+])(?!.*\.\w{2,5}$).{19,90}$
You can use backreferences as described here: https://www.ocpsoft.org/tutorials/regular-expressions/and-in-regex/
Another reference with examples here: https://www.regular-expressions.info/refcapture.html

Regex - special characters and numbers - PHP and Javascript

As I have hard time creating regex that would match letters only including accented characters (ie. Czech characters), I would like to go the other way around for my name validation - detect special characters and numbers.
What would be regex that matches special characters and numbers?
To expand on @anubhava's answer: \w stands for [a-zA-Z0-9_], and capitalizing it (\W) negates the character class. If you want to match _ too, you'll have to make your own character class like [^a-zA-Z0-9] (everything but alphanumerics). This can be shortened to [^a-z\d] if you use the i modifier. Note that this would also match accented characters, since they are not in a-zA-Z0-9.
However, I always advise against trying to use a "regular" expression to match a name (since names are not regular). See this blog post.
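Where Unicode property escapes are available (JavaScript with the u flag, or PCRE in PHP with the u modifier), "anything that is not a letter" can be written without listing accented characters. A minimal JavaScript sketch follows; the function name is illustrative, and treating space, apostrophe, and hyphen as acceptable in a name is an assumption, not something the question states.

```javascript
// Matches one character that is neither a letter (\p{L}), a combining
// mark (\p{M}), nor one of the assumed name separators: space, ' and -.
const DISALLOWED = /[^\p{L}\p{M} '-]/u;

// Hypothetical helper: true if the name contains a digit or special character.
function hasDigitOrSpecialChar(name) {
  return DISALLOWED.test(name);
}

console.log(hasDigitOrSpecialChar("Jiří Dvořák")); // accented letters are fine
console.log(hasDigitOrSpecialChar("J0hn"));        // contains a digit
```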

Username cannot contain repeating underscore or period

I have always struggled with these darn things. I recall a lecturer telling us all once that if you have a problem which requires you use regular expressions to solve it, you in fact now have 2 problems.
Well, I certainly agree with this. Regex is something we don't use very often, but when we do it's like reading some alien language (well, for me anyway)... I think I will resolve to get the book and read further.
The challenge I have is this, I need to validate a username based on the following criteria:
1. can contain letters, upper and lower
2. can contain numbers
3. can contain periods (.) and underscores (_)
4. periods and underscores cannot be consecutive, i.e. __ and .. are not allowed, but ._._ would be valid
5. a maximum of 20 characters in total
So far I have the following: ^[a-zA-Z_.]{0,20}$ but of course it allows repeated underscores and periods.
Now, I am probably doing this all wrong by starting out with the set of valid characters and the maximum length. I have been trying (unsuccessfully) to create some lookahead or lookbehind to search for invalid repetitions of the period (.) and underscore (_). I am not sure what approach or methodology to use to break this requirement down into a regex solution.
Can anyone assist with a recommendation / alternative approach or point me in the right direction?
This one is the one you need:
^(?:[a-zA-Z0-9]|([._])(?!\1)){5,20}$
"Either an alphanum char ([a-zA-Z0-9]), or (|) a dot or an underscore ([._]), but that isn't followed by itself ((?!\1)), and that from 5 to 20 times ({5,20})."
1. (?:X) is simply a non-capturing group, i.e. you can't refer to it afterwards using the \1, $1 or ?1 syntaxes.
2. (?!X) is called a negative lookahead, i.e. literally "which is not followed by X".
3. \1 refers to the first capturing group. Since the first group (?:...){5,20} has been set as non-capturing (see #1), the first capturing group is ([._]).
4. {X,Y} means from X to Y times; you may change it as you need.
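The pattern can be tried out in JavaScript as follows; the sample usernames are made up for illustration:

```javascript
// A dot or underscore may not be followed by the same character again.
const USERNAME = /^(?:[a-zA-Z0-9]|([._])(?!\1)){5,20}$/;

const samples = ["john_doe", "john__doe", "a._.b", "ab..cd", "abcd"];
for (const name of samples) {
  // "abcd" fails only because of the 5-character minimum in {5,20}.
  console.log(name, USERNAME.test(name));
}
```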
Don't try to shove this into a single regex. Your single regex works fine for all criteria except #4. To do #4, just do a regex that matches invalid usernames and reject the username if it matches. For example (in pseudocode):
if username.matches("^[a-zA-Z_.]{0,20}$") and !username.matches("__|\\.\\.") {
/* accept username */
}
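One way the pseudocode above could look in JavaScript. The function name is hypothetical, and 0-9 has been added to the character class on the assumption that criterion 2 (numbers are allowed) should be honoured; the asker's original class did not include digits.

```javascript
// Hypothetical helper: a "shape" check plus a separate check for
// forbidden repetitions, as suggested in the pseudocode.
function isValidUsername(username) {
  const allowedShape = /^[a-zA-Z0-9_.]{0,20}$/; // assumption: digits added
  const forbiddenRepeat = /__|\.\./;            // two underscores or two dots
  return allowedShape.test(username) && !forbiddenRepeat.test(username);
}

console.log(isValidUsername("a_b.c"));
console.log(isValidUsername("a__b"));
```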
You can use two negative lookahead assertions for this:
^(?!.*__)(?!.*\.\.)[0-9a-zA-Z_.]{0,20}$
Explanation:
(?! # Assert that it's impossible to match the following regex here:
.* # Any number of characters
__ # followed by two underscores in a row
) # End of lookahead
Depending on your requirements and on your regex engine, you may replace [0-9A-Za-z_.] with [\w.].
@sp00n raised a good point: You can combine the lookahead assertions into one:
^(?!.*(?:__|\.\.))[0-9a-zA-Z_.]{0,20}$
which might be a bit more efficient, but is a little harder to read.
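A small JavaScript check that the two forms agree; the sample strings are made up, and this is only a spot check, not a proof of equivalence:

```javascript
const TWO_LOOKAHEADS = /^(?!.*__)(?!.*\.\.)[0-9a-zA-Z_.]{0,20}$/;
const ONE_LOOKAHEAD = /^(?!.*(?:__|\.\.))[0-9a-zA-Z_.]{0,20}$/;

// Note that the empty string is accepted by both, because of {0,20}.
const samples = ["user.name_1", "a__b", "a..b", "._._", "", "x".repeat(21)];
for (const s of samples) {
  console.log(JSON.stringify(s), TWO_LOOKAHEADS.test(s), ONE_LOOKAHEAD.test(s));
}
```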
Regarding your answer above:
I have tried to do what it says for my account, but the site still says:
The account name shall be a combination of letter, number or underscore
I tried that as well, but the app still rejects the account.
Could you write me a sample of correct registration data? The name I want to register is PACIFIC CONCORD INTERNATIONAL.
Please put the signs and underscores in this name correctly so that the site accepts it.
Thank you

Two Regular Expression Problems

1- I'm planning to validate user first and last name inputs using this regex:
/^[a-zA-ZàáâäãåèéêëìíîïòóôöõøùúûüÿýñçčšžÀÁÂÄÃÅÈÉÊËÌÍÎÏÒÓÔÖÕØÙÚÛÜŸÝÑßÇŒÆČŠŽ∂ð ,.'-]+$/u
However, I don't want to allow the underscore "_" or input consisting only of spaces (it cannot be left blank), and I need at least 2 characters. How can I apply these rules to the regex above?
2- For my strong password validation, the input must be at least 8 characters long
and must contain at least one letter and one non-letter (e.g. qsgtest123, qsgtest!##)
I would be grateful if you could help me with these two regexes.
Have a try with:
/^[\p{L},.'-]+[\p{L} ,.'-]*[\p{L},.'-]+$/u
/^((?!_)[a-zA-ZàáâäãåèéêëìíîïòóôöõøùúûüÿýñçčšžÀÁÂÄÃÅÈÉÊËÌÍÎÏÒÓÔÖÕØÙÚÛÜŸÝÑßÇŒÆČŠŽ∂ð ,.'-])+$/u
The above should apply to your first question.
This for the name
/^(?! +$)[a-zA-ZàáâäãåèéêëìíîïòóôöõøùúûüÿýñçčšžÀÁÂÄÃÅÈÉÊËÌÍÎÏÒÓÔÖÕØÙÚÛÜŸÝÑßÇŒÆČŠŽ∂ð ,.'-]{2,}$/u
The only difference is the "at least 2 characters" at the end and (?! +$) that means "fail if there are only spaces and end of the string".
Tester: http://gskinner.com/RegExr/?2uv74
And this one for the password:
/^(?=.*[a-zA-ZàáâäãåèéêëìíîïòóôöõøùúûüÿýñçčšžÀÁÂÄÃÅÈÉÊËÌÍÎÏÒÓÔÖÕØÙÚÛÜŸÝÑßÇŒÆČŠŽ∂ð])(?=.*[^a-zA-ZàáâäãåèéêëìíîïòóôöõøùúûüÿýñçčšžÀÁÂÄÃÅÈÉÊËÌÍÎÏÒÓÔÖÕØÙÚÛÜŸÝÑßÇŒÆČŠŽ∂ð]).{8,}$/u
(I'm using your definition of "letter" :-) ). It means:
look ahead for any character, any number of times, followed by a "letter"
look ahead for any character, any number of times, followed by a "non-letter"
(these two lookaheads don't "move" the regex cursor, which is still at the first character)
match any character 8 or more times
I see you are using the /u at the end of the regex. You are probably using Perl. To match any letter you should use \p{L} (and to match any non-letter you should use \P{L}) instead of writing long lists of characters. So the first one would become:
/^(?! +$)[\p{L} ,.'-]{2,}$/u
and the password one:
/^(?=.*\p{L})(?=.*\P{L}).{8,}$/u
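For comparison, JavaScript also accepts \p{L} and \P{L} when the u flag is set (ES2018 and later). A minimal sketch of both patterns there; the constant names and sample inputs are made up for illustration:

```javascript
const NAME = /^(?! +$)[\p{L} ,.'-]{2,}$/u;
const PASSWORD = /^(?=.*\p{L})(?=.*\P{L}).{8,}$/u;

console.log(NAME.test("Žofie Nováková")); // letters and a space
console.log(NAME.test("John_Doe"));       // underscore is not in the class
console.log(PASSWORD.test("qsgtest123")); // letters plus non-letters
console.log(PASSWORD.test("qsgtestabc")); // letters only
```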
And we will ignore the composable diacritics of Unicode :-)
Unless you'd prefer to include them... Then
/^(?! +$)(?=.{2,})(\p{L}\p{M}*|[ ,.'-])*$/u
(we pre-check the absence of all-spaces and the minimum length, and then we check that all the string is composed of letters (each one with an optional zero or more combining mark) or the other symbols in the [])
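To see what the combining-mark version changes, the sketch below feeds both name patterns a decomposed string in JavaScript. String.prototype.normalize("NFD") is used only to produce a letter followed by a separate combining accent; the constant names are illustrative.

```javascript
const LETTERS_ONLY = /^(?! +$)[\p{L} ,.'-]{2,}$/u;
const WITH_MARKS = /^(?! +$)(?=.{2,})(\p{L}\p{M}*|[ ,.'-])*$/u;

// "José" in decomposed form: 'e' followed by U+0301 (combining acute accent).
const decomposed = "Jos\u00e9".normalize("NFD");

console.log(LETTERS_ONLY.test(decomposed)); // the combining mark is not \p{L}
console.log(WITH_MARKS.test(decomposed));   // \p{M}* absorbs the combining mark
```

An alternative, if the input can be modified, is to normalize it to NFC before validating, though not every letter-plus-mark combination has a precomposed form.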