I have a specific case where I can't find anything that suits my need. I've always struggled when parentheses get involved, and this case is a bit painful. I'm trying to extract as much as possible from a free-text field so that it fits into a more controlled database, and there are a few tricks I'm fumbling with.
There is ONE thing that is always the case for every row entered:
series of characters + ( + text + )
Basically, here's what it could look like:
1111111E (CARRIER), 2222222, 33333 (CARRIER2) 44444 (CARRIER 3)
My goal is to get:
1111111E (CARRIER)
2222222, 33333 (CARRIER2)
44444 (CARRIER 3)
And if I could also split on the commas and spaces to separate entries like the middle one, that would be just amazing.
I'm working through a few regex tester websites as I write this, starting from scratch over and over again.
If any regex gurus are around, a helping hand is welcome!
If it has to be a regex, you could split at
(?<=\))[, ]*
Note that since you don't want to remove the ")", you must not match it. The pattern therefore uses a lookbehind, which is not supported by all regex engines.
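As a rough sketch of the split approach in Python (the function name `split_rows` is made up for illustration, and Python's `re` module is assumed as the engine, which does support lookbehind):

```python
import re

# Split right after every ")" and swallow any commas/spaces that follow it.
# The lookbehind (?<=\)) keeps the ")" itself out of the match.
SPLIT_AFTER_PAREN = re.compile(r"(?<=\))[, ]*")

def split_rows(text):
    # The pattern can also match the empty string after the final ")",
    # which may leave an empty trailing piece, so empty pieces are dropped.
    return [part for part in SPLIT_AFTER_PAREN.split(text) if part]

row = "1111111E (CARRIER), 2222222, 33333 (CARRIER2) 44444 (CARRIER 3)"
for part in split_rows(row):
    print(part)
```

Filtering out empty pieces is a choice made here so that the result does not depend on how the engine treats an empty match at the end of the string.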
[^\s|\,].*?\s\(.*?\)
Running this with Match All gives the expected result. I doubt it's the most optimal regex I could write, but it seems to work fine.
I could try to handle the second case (splitting on the commas) in the regex as well, but I think I'll take care of those cases in my code.
Leaving the answer up for anybody who could be looking into something similar.
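For anyone wanting to see the "match all, then finish in code" idea end to end, here is a minimal Python sketch. The pattern is the one above, copied verbatim; the function name `parse_row` and the tuple layout are assumptions made for this example only.

```python
import re

# Pattern from the answer, unchanged. Note that inside a character class
# "|" is a literal pipe, so the class excludes whitespace, "|" and ",".
CHUNK = re.compile(r"[^\s|\,].*?\s\(.*?\)")

def parse_row(row):
    """Return a list of (ids, carrier) tuples for one text field."""
    parsed = []
    for chunk in CHUNK.findall(row):
        # Split "2222222, 33333 (CARRIER2)" at the last " (".
        ids, _, carrier = chunk.rpartition(" (")
        # Handle the comma-separated ids in plain code, as suggested above.
        parsed.append((re.split(r",\s*", ids), carrier.rstrip(")")))
    return parsed

row = "1111111E (CARRIER), 2222222, 33333 (CARRIER2) 44444 (CARRIER 3)"
for ids, carrier in parse_row(row):
    print(ids, carrier)
```

This assumes the carrier text itself never contains " (", which holds for the sample row but may not hold for every real one.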
I appreciate what I'm asking may be very simple to more experienced folks. I've spent several hours trying to get my head around RegEx and have gotten close to what I need, but as this is something I'm trying to achieve for a hobby project (RegEx is not something I require in my day job) I'm hoping some of you may be able to help me out.
In short, I have a very large file with tens of thousands of lines of code that I am converting to be readable by another program. All I need to accomplish this is to change some formatting.
I need to find every instance where the tag "{#graphic examplename}" is used, and change it so that only "examplename" remains in square [[ ]] brackets.
Examples of how the tags currently appear (example names can be either single words or multiple):
"{#graphic example1}",
"{#graphic example2}",
"{#graphic example3 with multiple words}"
What I want them to look like when done, replacing the { with [[, removing #graphic, and replacing } with ]].
"[[example1]]",
"[[example2]]",
"[[example3 with multiple words]]"
It's easy enough to do a simple find-and-replace to replace "{#graphic " with "[[", as the #graphic tag is something I want to remove universally. However, the issue I'm running into is that I can't do the same with the "}" at the end, because I can't find a way to specify that I only want to replace a "}" that comes after an instance of "{#graphic " while leaving the words in between (the examplename) intact.
Any assistance gratefully received - if the above needs any elaboration please don't hesitate to ask, I understand I may be putting this in amateurish terms.
Regards,
K
Many programs let you capture groups and reference them later in the replacement, often with $
so find {#graphic ([^}]+)}
replace [[$1]]
This captures what is inside the () and makes it available in the replacement as $1, i.e. the first "capture group".
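If the conversion is scripted rather than done in an editor, the same find/replace could look roughly like this in Python. This is only a sketch: `convert_tags` is a made-up name, the braces are escaped to be safe, and Python writes the group reference as `\1` rather than `$1`.

```python
import re

# Literal "{#graphic ", then capture everything up to the next "}".
GRAPHIC_TAG = re.compile(r"\{#graphic ([^}]+)\}")

def convert_tags(text):
    # \1 is the first capture group, i.e. the example name.
    return GRAPHIC_TAG.sub(r"[[\1]]", text)

sample = '"{#graphic example1}",\n"{#graphic example3 with multiple words}"'
print(convert_tags(sample))
```

Any other "}" in the file is left alone, because the pattern only matches a "}" that closes a "{#graphic " tag.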
Regex 101 is an excellent resource for trying these things out:
https://regex101.com/r/r5OX8I/1
I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to interpret the whole regex (which could be quite a long operation) and then change its internal content to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See the solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, only you don't need a lookbehind, but rather a lookahead assertion, and a parenthesis was misplaced. The right thing is: Put the original pattern in parentheses, and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
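Since the original regex arrives as an argument in Python, the lookahead can be prepended mechanically. A minimal sketch, with the helper name `prefilter` invented here; it wraps the original in a non-capturing group `(?:...)` instead of plain parentheses so that group numbers inside the original pattern are not shifted (a design choice of this example, not something the answer requires):

```python
import re

def prefilter(original, prefix):
    """Build a regex that matches what `original` matches,
    but only where the match starts with the literal `prefix`."""
    # The lookahead consumes nothing; it only checks the upcoming text.
    return re.compile(f"(?={re.escape(prefix)})(?:{original})")

words = ["cat", "car", "bat"]
pattern = prefilter("cat|car|bat", "ca")
print([w for w in words if pattern.fullmatch(w)])
```

Because the lookahead has zero width, a pattern such as `[a-z]{4}` still matches exactly four characters.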
Ok, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than of the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.
I'm not a regex expert, so please be nice :-)
I created this regex to verify whether a user submitted a day of the week (in Italian):
/((lun|mart|giov)e|mercol(e?)|vener)d(ì|i('?)|í)|sabato|domenica/
This regex perfectly works and it matches the following:
lunedi
lunedì
lunedí
lunedi'
martedi
martedì
martedí
martedi'
mercoledi
mercoledì
mercoledí
mercoledi'
mercoldi
mercoldì
mercoldí
mercoldi'
giovedi
giovedì
giovedí
giovedi'
venerdi
venerdì
venerdí
venerdi'
sabato
domenica
Now consider the first part of the regex and focus on venerdì: as you can see, I added an OR (|) just to handle venerdì, because of the presence of that "r".
Everything works just fine, but I'm here to ask whether there is any way to start the regex this way:
(lun|mar|giov|ven)e
and then manage that “r” some way.
I read about backreferences and conditionals, but I'm not sure they can be of any help.
My idea is something like: "if the first group captured 'ven', then add an 'r' after the 'e' that follows the group".
Is this possible?
Don't "golf" your regex. If you want to improve it at all, make it more readable. While it is certainly worthwhile to handle the different "i" variants as separate cases, everything else should IMHO be kept as simple as possible.
How about something like this?
(lune|marte|mercole?|giove|vener)d(ì|i'?|í)|sabato|domenica
Don't use backreferences and other advanced features if you don't need them, just to make your regex a few chars shorter. Even if you would still understand what it means, think about your fellow co-developers -- or just yourself two months from now.
I just removed a few redundant (...) and the "shared e" part. Note how (besides the (...)) it is the same length, whether you use (lun|mart|giov)e or lune|marte|giove, but the latter is arguably more readable. Similarly, a backreference or some conditional would likely make your regex longer instead of shorter -- and considerably more complicated.
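One way to gain confidence that a simplification like this did not change behaviour is to run both patterns over the full list of accepted spellings. A small Python sketch follows; the word list is generated from the stems and endings in the question, and `is_day` is a hypothetical helper name.

```python
import re

ORIGINAL = re.compile(r"((lun|mart|giov)e|mercol(e?)|vener)d(ì|i('?)|í)|sabato|domenica")
SIMPLIFIED = re.compile(r"(lune|marte|mercole?|giove|vener)d(ì|i'?|í)|sabato|domenica")

# Every spelling listed in the question: 6 stems x 4 endings, plus 2 fixed days.
STEMS = ["lune", "marte", "mercole", "mercol", "giove", "vener"]
ENDINGS = ["di", "dì", "dí", "di'"]
DAYS = [stem + ending for stem in STEMS for ending in ENDINGS] + ["sabato", "domenica"]

def is_day(word, pattern=SIMPLIFIED):
    # fullmatch avoids accepting input that merely contains a day name.
    return pattern.fullmatch(word) is not None

for word in DAYS:
    assert is_day(word, ORIGINAL) and is_day(word, SIMPLIFIED), word
print(len(DAYS), "spellings accepted by both patterns")
```

Using `fullmatch` is an assumption about how the check should behave; the unanchored pattern in the question would also accept longer strings that contain a day name.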
I run a small forum that has an issue with people using parentheses to bracket statements. They do it to signify they are talking about Jews. I guess it is called echoes or something. So they will put a name like (((Prominent Person))) like that in the middle of a conversation.
I have recently been trying to combat this without just banning people that can't behave. I have a decent word filter but it doesn't block that. I recently installed something that allows me to use regex to strip things out but I am having trouble finding the proper string that doesn't break everything else.
"/\W{3}(.*)\W{3}/","$1"
The first is the matching string and the comma separates what is left. This string works, it strips the parentheses out and leaves everything else alone. The problem is that the string is too broad. It also strips out any [ brackets as well which breaks all of the bbcode in a post. Any post that has any number of at least 3 brackets will be broken after that.
I have been playing with different strings on regex101 but not finding the best solution. I need any time that ((( or ))) is seen to strip out those and replace it with nothing, like it never happened. It has to be exactly three and only ((( and not the other brackets it could trigger on.
Does anyone have a good solution?
\({3}(.*)\){3}
https://regex101.com/r/wD5TMb/1
So in your format probably: "/\({3}(.*)\){3}/","$1"
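For testing outside the forum software, here is a rough Python equivalent. One deliberate difference, labeled as such: this sketch uses the lazy `(.*?)` instead of `(.*)`, on the assumption that a post may contain more than one bracketed name, in which case the greedy version would pair the first ((( with the last ))). The name `strip_echoes` is made up for the example.

```python
import re

# Exactly the characters "(((" and ")))" are targeted, so [ ] and single
# parentheses are left alone. The lazy (.*?) stops at the nearest ")))".
ECHO = re.compile(r"\({3}(.*?)\){3}")

def strip_echoes(post):
    # \1 keeps the text that was between the triple parentheses.
    return ECHO.sub(r"\1", post)

print(strip_echoes("[b]hi[/b] (((Some Name))) and (((Other Name))) (aside)"))
```

Note that this does not special-case runs of four or more parentheses; whether those need different handling depends on what shows up in real posts.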
I'm a Regex noob and am pretty sure I'm not going about this in the most efficient way - wanted to get some advice.
I have a regex, ((\w+\b.*?){100}){1}, which selects the first 100 words of my string, the length of which varies.
What I want is to select the entire string except for the first 100 words.
Is there syntax I can add to my current expression to do this, or am I better off trying to directly select the rest of the text instead?
Also, if anyone has any good resources for improving my regex knowledge, I'd be very appreciative. Thus far I've found http://gskinner.com/RegExr/ to be very helpful.
Thanks in advance!
If you use the pattern below, you can refer to everything after the first 100 words as group 3, written as $3.
It treats hyphenated words as one word.
(\w+(-\w+|\b).*?){100}(.*)
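To try the group-3 idea on short inputs, the word count can be made a parameter. This is a sketch in Python, not part of the original answer: `rest_after_words` and its parameter `n` are names invented here, and `re.DOTALL` is added on the assumption that the text may span several lines.

```python
import re

def rest_after_words(text, n=100):
    """Return everything after the first n words, or None if there are fewer."""
    # Same shape as the pattern above, with {100} replaced by {n}.
    pattern = re.compile(rf"(\w+(-\w+|\b).*?){{{n}}}(.*)", re.DOTALL)
    match = pattern.match(text)
    # Group 3 is the trailing (.*): the remainder of the string.
    return match.group(3) if match else None

print(repr(rest_after_words("one two three four five", 3)))
print(repr(rest_after_words("well-known fact is here now", 3)))
```

The remainder keeps its leading whitespace, so a `.lstrip()` on the result may be wanted depending on the use.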
Regex training Here