Regex help: Matching paths (using django) - regex

Hate coming up with titles. I need something that'll actually capture the following:
site.com/500/ (a number as the first param)
site.com/500/ABC/ (a number and a 3 letter code)
site.com/500/ABC/DEF/ (a number and 2x 3 letter codes)
What I have been messing with:
^(\d+/)?(\w{3}/)?(\w{3}/)?$
That sort of works, but it includes the slashes in the arguments (so I end up with "500/"). Moving the slashes outside of the brackets won't match /500/ABC/, since the ? then only applies to the slash.
Obviously I could split it into multiple patterns, but I'm sure there's a way to do it in one go.
Also, I only want the actual arguments. As I said, the pattern above can work, but it ends up adding slashes to them, which isn't too good.
Thanks for any help.

how about ..
((\d+/)|(\d+/\w{3}/)|(\d+/\w{3}/\w{3}/))$
the result will be ..
site.com/500/ABC/DEF/ => 500/ABC/DEF/
site.com/500/ABC/ => 500/ABC/
site.com/500/ => 500/
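For what it's worth, the slashes can be kept out of the captured values by putting each slash inside a non-capturing group, so only the inner group is captured. Here is a minimal sketch using Python's re module rather than a Django URLconf. The pattern and the parse_path helper are my own illustration, not part of the answer above, and unlike the pattern in the question it assumes the number is always present:

```python
import re

# Slashes live inside non-capturing groups (?:...), so only the values are captured.
# The second code is nested inside the first, so it can only appear after it.
PATH_RE = re.compile(r"^(\d+)/(?:(\w{3})/(?:(\w{3})/)?)?$")


def parse_path(path):
    """Return (number, code1, code2), with None for missing parts, or None if no match."""
    match = PATH_RE.match(path)
    return match.groups() if match else None


print(parse_path("500/"))
print(parse_path("500/ABC/"))
print(parse_path("500/ABC/DEF/"))
```

Note that \w also matches digits and underscores, so a stricter character class such as [A-Z]{3} may be closer to "3 letter code".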

Related

How to programmatically learn regexes?

My question is a continuation of this one. Basically, I have a table of words like so:
HAT18178_890909.098070313.1
HAT18178_890909.098070313.2
HAT18178_890909.143412462.1
HAT18178_890909.143412462.2
For my purposes, I do not need the terminal .1 or .2 for this set of names. I can manually write the following regex (using Python syntax):
r = re.compile(r'(.*\.\d+)\.\d+')
However, I cannot guarantee that my next set of names will have a similar structure where the final 2 characters will be discardable - it could be 3 characters (e.g. .12) and the separator could change as well (e.g. . to _).
What is the appropriate way to either explicitly learn a regex or to determine which characters are unnecessary?
It's an interesting problem.
X y
HAT18178_890909.098070313.1 HAT18178_890909.098070313
HAT18178_890909.098070313.2 HAT18178_890909.098070313
HAT18178_890909.143412462.1 HAT18178_890909.143412462
HAT18178_890909.143412462.2 HAT18178_890909.143412462
The problem is that there is not a single solution but many.
Even for a human it is not clear what the regex should be that you want.
Based on this data, I would think the possibilities to learn are:
Just match a fixed width of 25: .{25}
Fixed first part: HAT18178_890909.
Then:
There's only 2 varying numbers on each single spot (as you show 2 cases).
So e.g. [01] (either 0 or 1), [94] the next spot and so on would be a good solution.
The obvious one would be \d+
But it could also be \d{9}
You see, there are multiple correct answers.
These regexes would still work if the second dot were an underscore instead.
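To make the "multiple correct answers" point concrete, here is a small sketch that checks a few of the candidates against the four rows shown above. The exact way I combined the fixed prefix with \d+ and \d{9} is my own reading of the list, so treat the patterns as assumptions:

```python
import re

# (X, y) pairs from the table above.
rows = [
    ("HAT18178_890909.098070313.1", "HAT18178_890909.098070313"),
    ("HAT18178_890909.098070313.2", "HAT18178_890909.098070313"),
    ("HAT18178_890909.143412462.1", "HAT18178_890909.143412462"),
    ("HAT18178_890909.143412462.2", "HAT18178_890909.143412462"),
]

# Candidate patterns; group 1 is the part to keep.
candidates = [
    r"(.{25})",                    # fixed width of 25
    r"(HAT18178_890909\.\d+)",     # fixed prefix, then any run of digits
    r"(HAT18178_890909\.\d{9})",   # fixed prefix, then exactly 9 digits
]


def fits(pattern):
    """True if the pattern reproduces y from X for every row."""
    return all(re.match(pattern, x).group(1) == y for x, y in rows)


results = {pattern: fits(pattern) for pattern in candidates}
print(results)
```

All three agree on this data, which is exactly why the data alone cannot tell you which one was intended.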
My conclusion:
The problem is that it is much more work to prepare the data for machine learning than it is to create a regex. If you want to be sure you cover everything, you need to have complete data, so then a regex is probably less effort.
You could split on non-alphanumeric characters;
[^a-zA-Z0-9']+
That would get you, in this case, a few strings like this:
HAT18178
890909
098070313
1
From there on you can simply discard the last one if it's never needed, and continue processing the remaining fields.
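A rough sketch of that idea in Python, using the same character class. The helper names split_name and strip_last_field are hypothetical, and the second one assumes you want the original separators kept in what remains:

```python
import re

# One or more characters that are not alphanumeric (or an apostrophe).
SEPARATORS = re.compile(r"[^a-zA-Z0-9']+")


def split_name(name):
    """Split a name into its alphanumeric fields."""
    return SEPARATORS.split(name)


def strip_last_field(name):
    """Drop the last separator and everything after it, keeping earlier separators."""
    matches = list(SEPARATORS.finditer(name))
    if not matches:
        return name
    return name[:matches[-1].start()]


print(split_name("HAT18178_890909.098070313.1"))
print(strip_last_field("HAT18178_890909.098070313.1"))
print(strip_last_field("HAT18178_890909_098070313_12"))
```

This works regardless of whether the last field is 1 or 2 digits, or whether the separator is . or _, but it still assumes the last field is always the discardable one.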

Find/Replace string that doesn't contain quotes

I have inherited a rather large/ugly php codebase (language is unimportant, this is a generic vim question) , where nothing is quoted properly (old php doesn't mind, but new php versions throw warnings).
I'd like to turn $something[somekey] into $something['somekey'], but only if the key is not already quoted and does not contain the character $.
I was trying to build a regular expression to quote the keys, but I just can't seem to get it to cooperate.
This is what I have so far. It doesn't work, but maybe it will help explain my question better, and show that I have actually tried.
:%s/\v\$(.{-})\[(['"$]#<!.{-})\]/$\1['\2']/
My goal is to have something like this:
$something[somekey] = $something['somekey']
$somethingelse[someotherthing] = $somethingelse['someotherthing']
$another['key'] = $another['key'] (is ignored)
$yetanother["keykey"] = $yetanother["keykey"] (is ignored)
$derp[$herp] = $derp[$herp] (is ignored)
$array[3] = $array[3] (is ignored)
These can appear anywhere in text, even multiple on the same line, and even touching each other like $something[key]$something[key2], which i would like to be replaced with $something['key']$something['key2']
Another problem: there seem to be random JavaScript arrays in some files, which also use [] square brackets. So the regex needs to check that there is a $ and some text before the brackets.
I'm probably asking for the impossible, but any help on this would be great before I go insane editing each file one by one manually.
EDIT: forgot that keys can be numeric, and shouldn't be quoted.
I tried the following, which processed everything from your question correctly:
:%s/\[\(\I\i*\)\]/['\1']/g
Or, with optional white space inside the brackets:
:%s/\[\s*\(\I\i*\)\s*\]/['\1']/g
And also checking for $identifier before the brackets:
:%s/\(\$\i\+\)\[\s*\(\I\i*\)\s*\]/\1['\2']/g
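If it helps to see the logic outside Vim, here is a rough Python equivalent of that last command, checked against the cases from the question. It is only a sketch: Vim's \i and \I depend on the 'isident' option, and I am assuming [A-Za-z_] and \w are close enough for PHP keys. The quote_keys name is made up:

```python
import re

# $identifier, then [ key ] where the key starts with a letter or underscore.
# Keys starting with a digit, a quote, or $ do not match and are left alone.
KEY_RE = re.compile(r"(\$\w+)\[\s*([A-Za-z_]\w*)\s*\]")


def quote_keys(text):
    """Quote bare identifier keys in PHP-style array accesses."""
    return KEY_RE.sub(r"\1['\2']", text)


print(quote_keys("$something[somekey]"))
print(quote_keys("$something[key]$something[key2]"))
print(quote_keys("$array[3] jsarray[foo] $derp[$herp]"))
```

Because the substitution is global, several accesses on one line, even touching each other, are all rewritten.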

Replace a pattern based off of the integer in the pattern in Vim

I'm trying to convert a bunch of .textile files into their equivalent .markdown files.
I would like a vim search/replace command to replace all h1., h2., h3., etc. patterns with the associated number of # characters. So, h1. would become #, h2. would become ## and so forth.
I think what I want to use is the \=repeat command, but I'm a bit lost as to what arguments to pass it.
Here is what I have so far. It replaces the correct matches, but it just deletes them and gives me errors:
:1,$s/h\d./\=repeat('#',submatch(0))
What are the proper arguments to pass to the \=repeat command?
this line may help you:
%s/\vh(\d)\./\=repeat('#',submatch(1))
You used submatch(0), which is the whole matched string: the h, the number, and any one character. (That is a second problem: the period should be escaped.) So it won't do what you were expecting, whereas submatch(1) is just the captured digit.
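The same idea, capturing only the digit and repeating # that many times, can be sketched in Python for comparison. The ^ anchor with re.MULTILINE is my own addition so that only headings at the start of a line are touched, and textile_headings is a made-up name:

```python
import re

# "h", one captured digit, and a literal (escaped) period at the start of a line.
HEADING_RE = re.compile(r"^h(\d)\.", re.MULTILINE)


def textile_headings(text):
    """Replace h1., h2., ... with the matching number of # characters."""
    return HEADING_RE.sub(lambda m: "#" * int(m.group(1)), text)


print(textile_headings("h1. Title\nh2. Section\nplain text"))
```

Here m.group(1) plays the role of submatch(1), and m.group(0) would be the whole "h1." match, just like submatch(0).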

using ^ (caret) inside the states in lex/flex

I'll put up my lex code first(lex body only).
%%
ps {BEGIN STATE1;}
. ;
<STATE1>^[0-9] { printf("number after ps."); }
With this code I'm trying to match a number right after the letters "ps". That's why I used the ^ character.
But the code doesn't match any of the strings it should, such as ps3, ps4fd, ps554 etc.
When I removed the ^ it worked, but it also matches strings like pserd7, psfh45, psfhdjh4er etc.
I know that I can solve the problem without using states (ps[0-9].*). But I have to do this with states. How can I fix this? Thanks.
"With this code I'm trying to match a number right after the letters "ps". That's why I used the ^ character."
But ^ doesn't mean that. It means 'beginning of line'.
"I know that I can solve the problem without using states (ps[0-9].*). But I have to do this with states."
Why? Very strange requirement.
You need to add more rules to cover the other possibilities. For example:
<STATE1>. { BEGIN INITIAL; }
But this depends on what else if anything is legal after 'ps'.
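To illustrate why the extra rule matters, here is a small Python simulation of the state logic (not lex itself). It assumes that any character other than a digit right after "ps" drops back to the initial state, and that the state is also reset after a digit has been reported. The function name is hypothetical:

```python
def digits_right_after_ps(text):
    """Return the indexes of digits that immediately follow "ps"."""
    hits = []
    state = "INITIAL"
    i = 0
    while i < len(text):
        if text.startswith("ps", i):
            # Rule: ps { BEGIN STATE1; }
            state = "STATE1"
            i += 2
            continue
        if state == "STATE1" and text[i].isdigit():
            # Rule: <STATE1>[0-9] (no ^ needed)
            hits.append(i)
        # Fallback rule: after any single character, return to INITIAL.
        state = "INITIAL"
        i += 1
    return hits


print(digits_right_after_ps("ps3 and ps4fd"))
print(digits_right_after_ps("pserd7 psfh45"))
```

Without the fallback that resets the state, the scanner would stay in STATE1 and report the 7 in pserd7, which is the behaviour described in the question.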

Perl replace every occurrence differently

In a perl script, I need to replace several strings. At the moment, I use:
$fasta =~ s/\>[^_]+_([^\/]+)[^\n]+/\>$1/g;
The aim is to format in a FASTA file every sequence name. It works well in my case so I don't need to touch this part. However, it happens that a sequence name appears several times in the file. I must not have at the end twice - or more - the same sequence name. I thus need to have for instance:
seqName1
seqName2
etc.
(instead of seqName, seqName, etc.)
Is it possible to somehow process every occurrence differently, automatically? I don't know how many sequences there are, whether there are similar names, etc. One idea would be to concatenate a random string at every occurrence, for instance, hence my question.
Many thanks.
John perfectly solved it and chepner helped with the smart idea to avoid conflicts, here is the final result:
$fasta =~ s/\>[^_]+_([^\/]+)[^\n]+/
sub {
return '>'.$1.$i++;
}->();
/eg;
Many many thanks.
I was actually trying to do something like this the other day, here's what I came up with
$fasta =~ s/\>[^_]+_([^\/]+)[^\n]+/
sub {
# return random string
}->();
/eg;
The /e modifier makes Perl evaluate the replacement as code, not text. I use an anonymous code ref so that I can return at any point.
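For comparison, the same "replacement computed by code" idea looks like this in Python, where re.sub accepts a function instead of a string. This is only a sketch: the sample headers are invented, number_headers is a made-up name, and the counter starts at 1 here:

```python
import re
from itertools import count

# Same pattern as the Perl substitution: keep what lies between "_" and "/".
HEADER_RE = re.compile(r">[^_]+_([^/]+)[^\n]+")


def number_headers(fasta):
    """Rewrite each header as >name followed by a running counter."""
    counter = count(1)
    return HEADER_RE.sub(lambda m: ">" + m.group(1) + str(next(counter)), fasta)


# Hypothetical input, made up for illustration.
sample = ">sp_seqName/1-10 first\nACGT\n>tr_seqName/5-20 second\nTTGA\n"
print(number_headers(sample))
```

The counter is global across all headers, like $i++ in the final Perl version, so names stay unique even when the same name repeats.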