Extracting movie name and year from string where year is optional - regex

I'm missing a really obvious thing here, but I'm new to regex so be kind ;-)
I have a number of films in an arbitrary format that may or may not have the year attached.
My Movie Name 2010
Some.Other.Super.Cool.Movie
The~Third|Movie.2010
Now, using (.+)\W(\d{4}) I can extract the two movies with dates into two groups one containing the name and the other the year, but the middle one gets ignored? I'm just a little unsure on how to actually make the year segment optional.
Ideally, ;-), I could use a single expression to return the names with \W converted into spaces, but that's a different conversation.
Thanks in advance

Using a ? after a group will make it optional, so in your case put it after the (\d{4}):
(.+)\W(\d{4})?
That is because (.+) is greedy and \W also matches a newline character. Strip trailing whitespace from your string, and if that doesn't work make (.+) lazy with a ? of its own: (.+?). Also consider that \W may be the wrong delimiter for this problem.
Adding $ to the end may also help, as that requires the digits to end the string if they can; try lazy matching together with $.
(.+?)\W(\d{4})?$

? makes it optional; here the separator is made optional as well:
(.+?)\W?(\d{4})?$
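As a rough sketch of how this last pattern behaves on the three sample titles, here is a small Python check. The helper name parse_title is made up for illustration, and both the leading ^ and the \W+ to space substitution are additions that do not come from the answers above.

```python
import re

# Lazy name, optional single separator, optional 4-digit year at the end.
# The leading ^ is an addition so the match always starts at the first character.
TITLE_RE = re.compile(r"^(.+?)\W?(\d{4})?$")


def parse_title(raw):
    """Return (name, year); year is None when absent. Hypothetical helper."""
    match = TITLE_RE.match(raw.strip())
    if match is None:
        return None
    name, year = match.groups()
    # Turn runs of non-word characters into single spaces.
    return re.sub(r"\W+", " ", name).strip(), year


for raw in ["My Movie Name 2010", "Some.Other.Super.Cool.Movie", "The~Third|Movie.2010"]:
    print(parse_title(raw))
```

Because the name group is lazy and the year is anchored to the end, the year is only split off when the string really ends in four digits.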

Related

regex match combined with something before a string if exists

I am trying to extract sub-strings from strings like these.
Test strings:
cat_zoo_New_York_US
dog_zoo_South_Carolina
dolphin_zoo_Montreal_Canada
pokemon_home_d_K2-155
Sub-strings that should be returned:
cat, New_York
dog, South_Carolina
dolphin, Montreal
pokemon, d
the Regex pattern I have tried is
([\w]+)(?:(_zoo_|_home_))(((?!(_US|_Canada|_K2-155))\w)+)
which I don't think is very concise and it returns other sub-strings besides what I need. Do you have any other suggestions?
Thanks!
Some updates
after @The fourth bird's answer, 03/15/2018.
First of all, I like the idea of utilizing both ([^_]+) and (?:) for different parts of the sample strings. But let me extend the sample strings a little more.
cat_zoo_New_York_US
dog_zoo_South_Carolina
yellow_dolphin_zoo_Montreal_Canada
pokemon_home_d_K2-155
pokemon_home_zoo_d_K2-155
I actually want to use anchor strings such as 'zoo', 'home' or 'home_zoo' to separate the characters before and after them, together with matching (and discarding) the last part, the country (or whatever place ID is specified). This makes the question a bit less general (I like the idea of using _, but let me make it trickier so I learn more).
Two questions here:
1. What is the function of (?=) and .* in (?=(?:_US|_Canada|_K2-155|$)).*$? It seems that if I use (?:_US|_Canada|_K2-155|$) instead, it still works...
2. Since I extended the anchor strings a little to allow _ inside them, I used:
(.*?)(?:_*)(?:home_zoo|zoo|home)(?:_*)(.*?)(?:_*)(?:US|Canada|K2-155|$)
It seems OK, but if I use:
(.*?)(?:_*)(?:home|zoo|home_zoo)(?:_*)(.*?)(?:_*)(?:US|Canada|K2-155|$)
it matches home first for the last sample string. Is there a greedy way to handle this without having to specify the order of the alternatives?
Well, again, I don't like making a long list of anchor strings, but I have no other idea how to make it more general without doing so.
Thanks again!
You could try it like this:
^([^_]+)_[^_]+_(.*?)(?=(?:_US|_Canada|_K2-155|$)).*$
This will capture 2 groups. You could for example use this in a replacement with group1, group2.
First capture the first part, up to the first underscore, in group 1 (like cat). Then match the second part together with its surrounding underscores, like _zoo_ or _home_.
From that point, capture in a group until a lookahead (?= finds one of your values or the end of the string.
That would match:
^ Begin of the string
([^_]+) Match in a capturing group not an _ one or more times (group 1)
_[^_]+_ match _ then not an _ one or more times followed by _
(.*?) Capture in a group any character zero or more times non greedy (group 2)
(?= Positive lookahead that asserts what is on the right side is
(?: Non capturing group
_US|_Canada|_K2-155|$ your values or end of the string
) Close group
) Close group
.*$ Match any character zero or more times till the end of the string
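To make the breakdown above concrete, here is a minimal Python sketch that runs the pattern over the four original test strings. The function name extract is hypothetical. The lookahead only checks for a suffix without consuming it, and the trailing .*$ then consumes the rest of the line, which matters mainly when the whole line is replaced.

```python
import re

# Pattern from the answer: group 1 is the first token, group 2 the place name.
PLACE_RE = re.compile(r"^([^_]+)_[^_]+_(.*?)(?=(?:_US|_Canada|_K2-155|$)).*$")


def extract(line):
    """Return (group 1, group 2) or None when the line does not match."""
    match = PLACE_RE.match(line)
    return match.groups() if match else None


for line in [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
]:
    print(extract(line))
```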
Edit: After the updated question, perhaps this will suit your requirements:
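^(.*?)_(?:home_zoo|zoo|home)_(.*?)(?=(?:_US|_Canada|_K2-155|$))
This will match any character zero or more times non greedy (.*?), then an underscore and a non capturing group (?:home_zoo|zoo|home) followed by another underscore to separate the characters before and after. The alternatives are tried from left to right, so home_zoo has to be listed before home; otherwise home matches first and zoo ends up in group 2.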
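On the question of not having to order the anchor strings by hand: one possible approach, sketched here in Python, is to sort them longest first before building the alternation. The helper build_pattern is hypothetical, and the pattern in the sketch has an underscore on both sides of the anchor group so that group 2 does not begin with one.

```python
import re

SUFFIX = r"(?=(?:_US|_Canada|_K2-155|$))"


def build_pattern(anchors):
    """Build the regex with the longest anchor strings tried first (hypothetical helper)."""
    ordered = sorted(anchors, key=len, reverse=True)
    alternation = "|".join(re.escape(anchor) for anchor in ordered)
    return re.compile(r"^(.*?)_(?:" + alternation + r")_(.*?)" + SUFFIX)


# Alternatives are tried left to right, so a hand-written short-first order misfires.
short_first = re.compile(r"^(.*?)_(?:home|zoo|home_zoo)_(.*?)" + SUFFIX)
sorted_pattern = build_pattern(["home", "zoo", "home_zoo"])

line = "pokemon_home_zoo_d_K2-155"
print(short_first.match(line).groups())     # zoo leaks into group 2
print(sorted_pattern.match(line).groups())  # home_zoo is consumed as the anchor
```

Sorting by length is only a heuristic for anchors where one is a prefix of another; it does not remove the need to list the anchors themselves.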
Well, I tried a more straightforward approach. If your data is more complex than the sample that you gave above, this may fail. Otherwise, for the above text, it works fine.
Here is the expression that I used:
^([^_]*)_[^_]*_(.*)_.*$
 1      23    45   67
Basically what I did was:
1. Group the first character stream, which does not contain _, starting at the beginning of the line.
2. Then there is an _ following the above group.
3. Then follows an arbitrary-length string which does not have any _ in it.
4. Then comes an _.
5. Group the next arbitrary-length string.
6. Then comes an _.
7. The rest of the string.
Replace it with \1, \2 (first group, second group).
You can find a fiddle here
If you are using vim, you can also achieve the same thing in vim with the following command:
:%s/^\([^_]*\)_[^_]*_\(.*\)_.*$/\1, \2/g
UPDATE
^([^_]*)_[^_]*_(((?:South_)|(?:New_))*[^_]*)((?:_US)|(?:_Canada)|(?:_K2-155))*$
You can find the new fiddle here: https://regex101.com/r/qQ2dE4/273
What is the difference between this one and the previous one?
Now I cheat a little: I look for adjectives that modify the state name, like South_ or New_. You can add more here, like East_, West_, Old_ or whatever, if such cases occur in your data.
There are cases where the country is missing from the data. Also, the last token on the very last line does not seem to follow a pattern. So I explicitly listed those options in the expression (US, Canada and so on). You may need to add more exceptional cases here as well.
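For anyone who wants to try the updated expression outside the fiddle, here is a small Python sketch that applies it as a replacement with \1, \2. The function name summarize is made up, and the sketch only covers the four original sample lines.

```python
import re

# Updated expression from this answer: group 1 is the animal, group 2 the place.
UPDATED_RE = re.compile(
    r"^([^_]*)_[^_]*_(((?:South_)|(?:New_))*[^_]*)((?:_US)|(?:_Canada)|(?:_K2-155))*$"
)


def summarize(line):
    """Replace a matching line with 'group1, group2'; other lines are returned unchanged."""
    return UPDATED_RE.sub(r"\1, \2", line)


for line in [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
]:
    print(summarize(line))
```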

Username cannot contain repeating underscore or period

I have always struggled with these darn things. I recall a lecturer telling us all once that if you have a problem which requires you to use regular expressions to solve it, you in fact now have 2 problems.
Well, I certainly agree with this. Regex is something we don't use very often, but when we do it's like reading some alien language (well, for me anyway)... I think I will resolve to get the book and read further.
The challenge I have is this, I need to validate a username based on the following criteria:
can contain letters, upper and lower
can contain numbers
can contain periods (.) and underscores (_)
periods and underscores cannot be consecutive i.e. __ .. are not allowed but ._._ would be valid.
a maximum of 20 characters in total
So far I have the following : ^[a-zA-Z_.]{0,20}$ but of course it allows repeat underscores and periods.
Now, I am probably doing this all wrong by starting out with the set of valid characters and the max length. I have been trying (unsuccessfully) to create some look-around or look-behind or whatever to search for invalid repetitions of period (.) and underscore (_). I'm not sure what the approach or methodology is for breaking this requirement down into a regex solution.
Can anyone assist with a recommendation / alternative approach or point me in the right direction?
This one is the one you need:
^(?:[a-zA-Z0-9]|([._])(?!\1)){5,20}$
You can have a demo of what it matches here.
"Either an alphanum char ([a-zA-Z0-9]), or (|) a dot or an underscore ([._]), but that isn't followed by itself ((?!\1)), and that from 5 to 20 times ({5,20})."
1. (?:X) is simply a non-capturing group, i.e. you can't refer to it afterwards using the \1, $1 or ?1 syntaxes.
2. (?!X) is called a negative lookahead, i.e. literally "which is not followed by X".
3. \1 refers to the first capturing group. Since the first group (?:...){5,20} has been set as non-capturing (see #1), the first capturing group is ([._]).
4. {X,Y} means from X to Y times; you may change it as you need.
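A quick way to see the backreference at work is to run the pattern against a few names. This is only a sketch in Python; the function name is_valid_backref is invented here. Note that the {5,20} quantifier also enforces a minimum of 5 characters, which the question did not ask for.

```python
import re

# Pattern from this answer: a dot or underscore may not be followed by itself.
BACKREF_RE = re.compile(r"^(?:[a-zA-Z0-9]|([._])(?!\1)){5,20}$")


def is_valid_backref(name):
    """True when the whole name matches the backreference pattern (hypothetical helper)."""
    return BACKREF_RE.fullmatch(name) is not None


for name in ["user.name", "user..name", "a._.b", "u__ser", "abc"]:
    print(name, is_valid_backref(name))
```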
Don't try to shove this into a single regex. Your single regex works fine for all criteria except #4. To do #4, just do a regex that matches invalid usernames and reject the username if it matches. For example (in pseudocode):
if username.matches("^[a-zA-Z_.]{0,20}$") and !username.matches("__|\\.\\.") {
/* accept username */
}
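The pseudocode above could look roughly like this in Python. The function name is_valid_two_step is hypothetical, and 0-9 has been added to the character class because the question says numbers are allowed; the asker's original class did not include digits.

```python
import re

# Step 1: only allowed characters, at most 20 of them (digits added as an assumption).
ALLOWED_RE = re.compile(r"[a-zA-Z0-9_.]{0,20}")
# Step 2: anything that makes a name invalid, here a doubled underscore or period.
REPEATED_RE = re.compile(r"__|\.\.")


def is_valid_two_step(name):
    """Accept when the name passes step 1 and step 2 finds nothing (hypothetical helper)."""
    return ALLOWED_RE.fullmatch(name) is not None and REPEATED_RE.search(name) is None


for name in ["john_doe.99", "john__doe", "john..doe", "._._"]:
    print(name, is_valid_two_step(name))
```

Because the quantifier is {0,20}, an empty name is also accepted; raise the lower bound if that is not wanted.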
You can use two negative lookahead assertions for this:
^(?!.*__)(?!.*\.\.)[0-9a-zA-Z_.]{0,20}$
Explanation:
(?! # Assert that it's impossible to match the following regex here:
.* # Any number of characters
__ # followed by two underscores in a row
) # End of lookahead
Depending on your requirements and on your regex engine, you may replace [0-9A-Za-z_.] with [\w.].
@sp00n raised a good point: You can combine the lookahead assertions into one:
^(?!.*(?:__|\.\.))[0-9a-zA-Z_.]{0,20}$
which might be a bit more efficient, but is a little harder to read.
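Here is a short Python sketch of the combined lookahead version, mostly to show that it accepts and rejects the same kinds of names as the two-lookahead form. The function name is_valid_lookahead is made up for this example.

```python
import re

# Combined negative lookahead: no "__" and no ".." anywhere in the name.
LOOKAHEAD_RE = re.compile(r"^(?!.*(?:__|\.\.))[0-9a-zA-Z_.]{0,20}$")


def is_valid_lookahead(name):
    """True when the whole name matches the lookahead pattern (hypothetical helper)."""
    return LOOKAHEAD_RE.fullmatch(name) is not None


for name in ["john_doe.99", "john__doe", "john..doe", "._._"]:
    print(name, is_valid_lookahead(name))
```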
Regarding your answer above:
I tried to do what it says for the account, but it still says:
The account name shall be a combination of letter, number or underscore
When I try that, the app still rejects the account.
Could you write me a sample of correct registration data for the name I want to register, PACIFIC CONCORD INTERNATIONAL,
with the signs and underscores placed correctly so that the site accepts it?
Thank you

REGEX - nesting quantifiers in combined statements

A last name in Hebrew can be either in an English format, which is just a regular combination of letters, like "Smith", "Camp", "Jack" etc., or a combination of two words with a space in the middle, like "Ben David", "Bar Yohay", "Yom Tov". I tried to create a regexp that allows either the first format (a last name that is at least two letters long) or the second one (a last name composed of two words, each two or more letters long, with a space in the middle). Here is what I came up with:
(^[a-z]{2,}$)|((^[a-z]{2,}$)(^[ ]$)(^[a-z]{2,}$))
(I know it does not allow capital letters)
For some reason it allows names of the first format, like Smith and Jerry, but does not allow names of the second one. Is there a problem with the formatting of the space in the middle? This should be an easy one for regexp professionals. Thanks in advance :)
You can simplify your regex to
^[a-z]{2,}(?: [a-z]{2,})?$
You are misusing anchors (^ and $). These match the beginning and end of the string, respectively, so a pattern that puts $ before the space and ^ after it can never match. What you actually want is:
(^[a-z]{2,}$)|(^([a-z]{2,})([ ])([a-z]{2,})$)
Further, you can simplify your expression to:
^[a-z]{2,}$|^[a-z]{2,} [a-z]{2,}$
unless you specifically need to capture groups.
Or (so you only need one pair of anchors):
^(?:[a-z]{2,}|[a-z]{2,} [a-z]{2,})$
(?:...) is a non-capturing group, necessary to restrict the scope of the alternation.
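To check the difference between the original attempt and the simplified pattern, a small Python sketch such as the following may help. The function name is_last_name is hypothetical, and like the patterns above it only accepts lowercase letters.

```python
import re

# The pattern from the question: the inner ^ and $ make the two-word branch unmatchable.
ORIGINAL_RE = re.compile(r"(^[a-z]{2,}$)|((^[a-z]{2,}$)(^[ ]$)(^[a-z]{2,}$))")
# Simplified pattern: one word, optionally followed by a space and a second word.
SIMPLIFIED_RE = re.compile(r"^[a-z]{2,}(?: [a-z]{2,})?$")


def is_last_name(name):
    """True when the whole name matches the simplified pattern (hypothetical helper)."""
    return SIMPLIFIED_RE.fullmatch(name) is not None


print(ORIGINAL_RE.search("ben david"))  # None: the second branch never matches
for name in ["smith", "ben david", "b david", "ben david cohen"]:
    print(name, is_last_name(name))
```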

regex to grep all numbers after the second-last underscore

I want to get all the characters after the second-to-last underscore in a string. Any ideas how this could be accomplished?
Input Output
PART1_PART2_PART3_G2010 PART3_G2010
Any idea what the regex should look like?
.*_([^_]*_[^_]*)$
Isn't bound to a specific total count of parts between the underscores, like the regex of Andrea Spadaccini is.
edit
The first two symbols .* match every character, because . matches one arbitrary character and * is a quantifier for "as many as possible". Then an underscore should appear.
The expression in parentheses should capture the two parts around the last underscore. First, we capture all (again the *) non-underscore characters:
This is done using square brackets and saying we want any character except (^) the underscore => [^_]. The very last symbol $ marks the end of the input string. I think it is possible to leave out either this OR the .* at the beginning...
Andrea Spadaccini's answer works if you know that the input has three underscores. If the question was meant more generally, referring to everything after the second-to-last underscore regardless of how many underscores come before it, the regex needs to search from the end ($) like this:
_([^_]*_[^_]*)$
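A minimal Python sketch of this end-anchored search, assuming re.search is used so the pattern may start anywhere in the string. The function name tail_after_second_last_underscore is made up for the example.

```python
import re

# Anchored at the end: an underscore, then two underscore-free parts joined by one underscore.
TAIL_RE = re.compile(r"_([^_]*_[^_]*)$")


def tail_after_second_last_underscore(text):
    """Return everything after the second-to-last underscore, or None if there are fewer than two."""
    match = TAIL_RE.search(text)
    return match.group(1) if match else None


for text in ["PART1_PART2_PART3_G2010", "A_B_C_D_E", "A_B"]:
    print(text, tail_after_second_last_underscore(text))
```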
First N non-underscores, then an underscore. Repeat. Group the last characters.
[^_]*_[^_]*_(.*)

How to match everything up to the second occurrence of a character?

So my string looks like this:
Basic information, advanced information, super information, no information
I would like to capture everything up to second comma so I get:
Basic information, advanced information
What would be the regex for that?
I tried: (.*,.*), but I get
Basic information, advanced information, super information,
This will capture up to but not including the second comma:
[^,]*,[^,]*
English translation:
[^,]* = as many non-comma characters as possible
, = a comma
[^,]* = as many non-comma characters as possible
[...] is a character class. [abc] means "a or b or c", and [^abc] means anything but a or b or c.
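As a small illustration in Python, here is the negated character class next to the greedy attempt from the question, using the sample string from the question. The variable names are chosen just for this sketch.

```python
import re

text = "Basic information, advanced information, super information, no information"

# Negated character class: each [^,]* stops at the next comma by construction.
up_to_second = re.match(r"[^,]*,[^,]*", text).group(0)

# Greedy attempt from the question: .* runs as far as it can before giving commas back.
greedy = re.match(r"(.*,.*),", text).group(1)

print(up_to_second)
print(greedy)
```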
You could try ^(.*?,.*?),
The problem is that .* is greedy and matches maximum amount of characters. The ? behind * changes the behaviour to non-greedy.
You could also put parentheses around each .*? segment to capture the strings separately if you want.
I would take a DRY approach, like this:
^([^,]*,){1}[^,]*
This way you can match everything until the nth occurrence of a character without repeating yourself, except for the last pattern.
Although in the case of the original poster the group and its repetition are useless, I think this will help others who need to match the pattern more than 2 times.
Explanation:
^ From the start of the line
([^,]*,) Create a group matching everything except the comma character until it meet a comma.
{1} Repeat the above group (the number of occurrences you need) - 1 times. So if you need 2 put 1, if you need 20 put 19.
[^,]* Repeat the pattern one last time without the trailing comma.
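The n - 1 rule can be wrapped in a small function. This is a sketch in Python with a hypothetical helper name, up_to_nth_comma; it uses a non-capturing group, since the captured text is not needed.

```python
import re


def up_to_nth_comma(text, n):
    """Return the text before the nth comma, or None if there are fewer than n - 1 commas."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # (n - 1) comma-terminated chunks, then one last chunk without its comma.
    pattern = r"^(?:[^,]*,){%d}[^,]*" % (n - 1)
    match = re.match(pattern, text)
    return match.group(0) if match else None


text = "Basic information, advanced information, super information, no information"
print(up_to_nth_comma(text, 2))
```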
Try this approach:
(.*?,.*?),.*